Glutton: large-scale integration of non-model organism transcriptome data for comparative analysis

Mapping Intimacies ◽

10.1101/077511 ◽

2016 ◽

Cited By ~ 2

Author(s):

Alan Medlar ◽

Laura Laakso ◽

Andreia Miraldo ◽

Ari Löytynoja

Keyword(s):

Comparative Analysis ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Model Organism ◽

Model Organisms ◽

Rna Seq ◽

Reference Species ◽

Wide Range ◽

The Impact

Abstract High-throughput RNA-seq data has become ubiquitous in the study of non-model organisms, but its use in comparative analysis remains a challenge. Without a reference genome for mapping, sequence data has to be de novo assembled, producing large numbers of short, highly redundant contigs. Preparing these assemblies for comparative analyses requires the removal of redundant isoforms, assignment of orthologs and converting fragmented transcripts into gene alignments. In this article we present Glutton, a novel tool to process transcriptome assemblies for downstream evolutionary analyses. Glutton takes as input a set of fragmented, possibly erroneous transcriptome assemblies. Utilising phylogeny-aware alignment and reference data from a closely related species, it reconstructs one transcript per gene, finds orthologous sequences and produces accurate multiple alignments of coding sequences. We present a comprehensive analysis of Glutton’s performance across a wide range of divergence times between study and reference species. We demonstrate the impact choice of assembler has on both the number of alignments and the correctness of ortholog assignment and show substantial improvements over heuristic methods, without sacrificing correctness. Finally, using inference of Darwinian selection as an example of downstream analysis, we show that Glutton-processed RNA-seq data give results comparable to those obtained from full length gene sequences even with distantly related reference species. Glutton is available from http://wasabiapp.org/software/glutton/ and is licensed under the GPLv3.


PIC-Me: paralogs and isoforms classifier based on machine-learning approaches

BMC Bioinformatics ◽

10.1186/s12859-021-04229-x ◽

2021 ◽

Vol 22 (S11) ◽

Author(s):

Jooseong Oh ◽

Sung-Gwon Lee ◽

Chungoo Park

Keyword(s):

Machine Learning ◽

Large Scale ◽

Gene Annotation ◽

Sequence Similarity ◽

Global Analysis ◽

Model Organism ◽

Model Organisms ◽

Support Vector ◽

Learning Approaches ◽

Rna Seq

Abstract Background Paralogs formed through gene duplication and isoforms formed through alternative splicing have been important processes for increasing protein diversity and maintaining cellular homeostasis. Despite their recognized importance and the advent of large-scale genomic and transcriptomic analyses, paradoxically, accurate annotations of all gene loci to allow the identification of paralogs and isoforms remain surprisingly incomplete. In particular, the global analysis of the transcriptome of a non-model organism for which there is no reference genome is especially challenging. Results To reliably discriminate between the paralogs and isoforms in RNA-seq data, we redefined the pre-existing sequence features (sequence similarity, inverse count of consecutive identical or non-identical blocks, and match-mismatch fraction) previously derived from full-length cDNAs and EST sequences and described newly discovered genomic and transcriptomic features (twilight zone of protein sequence alignment and expression level difference). In addition, the effectiveness and relevance of the proposed features were verified with two widely used support vector machine (SVM) and random forest (RF) models. From nine RNA-seq datasets, all AUC (area under the curve) scores of ROC (receiver operating characteristic) curves were over 0.9 in the RF model and significantly higher than those in the SVM model. Conclusions In this study, using an RF model with five proposed RNA-seq features, we implemented our method called Paralogs and Isoforms Classifier based on Machine-learning approaches (PIC-Me) and showed that it outperformed an existing method. Finally, we envision that our tool will be a valuable computational resource for the genomics community to help with gene annotation and will aid in comparative transcriptomics and evolutionary genomics studies, especially those on non-model organisms.
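The AUC figures quoted in the abstract above can be read as the probability that a classifier scores a randomly chosen positive example higher than a randomly chosen negative one. The following is a minimal sketch of that rank-based definition, not the PIC-Me implementation; the function name `roc_auc` and the toy labels and scores are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: three of the four positive/negative pairs
# are ranked correctly, so the AUC is 0.75.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported scores above 0.9 should be read.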


AlbaTraDIS: Comparative analysis of large datasets from parallel transposon mutagenesis experiments

10.1101/593624 ◽

2019 ◽

Cited By ~ 3

Author(s):

Andrew J. Page ◽

Sarah Bastkowski ◽

Muhammad Yasir ◽

A. Keith Turner ◽

Thanh Le Viet ◽

...

Keyword(s):

Comparative Analysis ◽

Stress Responses ◽

Cell Biology ◽

Large Scale ◽

Transposon Mutagenesis ◽

Data Sets ◽

Wide Range ◽

Multiple Data Sets ◽

Triclosan Resistance ◽

The Impact

Abstract Background Bacteria have evolved over billions of years to survive in a wide range of environments. Currently, there is an incomplete understanding of the genetic basis for mechanisms underpinning survival in stressful conditions, such as the presence of anti-microbials. Transposon mutagenesis has been proven to be a powerful tool to identify genes and networks which are involved in survival and fitness under a given condition by simultaneously assaying the fitness of millions of mutants, thereby relating genotype to phenotype and contributing to an understanding of bacterial cell biology. A recent refinement of this approach allows the roles of essential genes in conditional stress survival to be inferred by altering their expression. These advancements, combined with the rapidly falling costs of sequencing, now allow comparisons between multiple experiments to identify commonalities in stress responses to different conditions. This capacity, however, poses a new challenge for analysis of multiple data sets in conjunction. Results To address this analysis need, we have developed ‘AlbaTraDIS’, a software application for rapid large-scale comparative analysis of TraDIS experiments that predicts the impact of transposon insertions on nearby genes. AlbaTraDIS can identify genes which are up or down regulated, or inactivated, between multiple conditions, producing a filtered list of genes for further experimental validation as well as several accompanying data visualisations. We demonstrate the utility of our new approach by applying it to identify genes used by Escherichia coli to survive in a wide range of different concentrations of the biocide Triclosan. AlbaTraDIS automatically identified all well characterised Triclosan resistance genes, including the primary target, fabI. A number of new loci were also implicated in Triclosan resistance and the predicted phenotypes for a selection of these were validated experimentally and results showed high consistency with predictions. Conclusions AlbaTraDIS provides a simple and rapid method to analyse multiple transposon mutagenesis data sets allowing this technology to be used at large scale. To our knowledge this is the only tool currently available that can perform these tasks. AlbaTraDIS is written in Python 3 and is available under the open source licence GNU GPL 3 from https://github.com/quadram-institute-bioscience/albatradis.


RNA-Seq data analysis for Planarian with tensor decomposition-based unsupervised feature extraction

10.1101/2021.06.15.448531 ◽

2021 ◽

Author(s):

Makoto Kashima ◽

Nobuyoshi Kumagai ◽

Hiromi Hirata ◽

Y-h. Taguchi

Keyword(s):

Feature Extraction ◽

Data Analysis ◽

De Novo ◽

Model Organism ◽

Tensor Decomposition ◽

Time Development ◽

Model Organisms ◽

Rna Seq ◽

Experimental Conditions ◽

Unsupervised Feature Extraction

RNA-Seq data analysis of non-model organisms is often difficult because of the lack of a well-annotated genome. In model organisms, after short reads are mapped to the genome, it is possible to focus the analysis on well-annotated regions. In non-model organisms, however, contigs must be generated by de novo assembly. This can result in a large number of transcripts, making it difficult to remove redundancy. A large number of transcripts can also make it difficult to recognize differentially expressed transcripts (DETs) between more than two experimental conditions, because P-values must be corrected for multiple comparisons, and the effect of this correction grows as the number of transcripts increases. After heavy correction, even sufficiently small raw P-values often fail to be judged significant. In this study, we applied a recently proposed tensor decomposition (TD)-based unsupervised feature extraction (FE) to the RNA-seq data obtained for a non-model organism, Planarian; we successfully obtained a limited number of transcripts whose expression was altered between normal and defective samples as well as during time development. TD-based unsupervised FE is expected to be an effective tool that can identify a limited number of DETs, even when only a poorly annotated genome is available.
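The point about multiple-comparison correction becoming harsher as the transcript count grows can be illustrated with a small numeric sketch. It assumes a Bonferroni correction purely for simplicity; the abstract does not say which correction the authors used, and the significance level, transcript counts and P-value below are hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def is_significant(p_value: float, alpha: float, n_tests: int) -> bool:
    """True if a raw P-value survives Bonferroni correction for n_tests tests."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Hypothetical values: the same raw P-value is tested against
# a small and a large number of assembled transcripts.
raw_p = 1e-5
alpha = 0.05
for n_transcripts in (1_000, 500_000):
    threshold = bonferroni_threshold(alpha, n_transcripts)
    verdict = is_significant(raw_p, alpha, n_transcripts)
    print(f"{n_transcripts} transcripts: threshold={threshold:.1e}, significant={verdict}")
```

With 1,000 transcripts the per-test threshold is 5e-5, so a raw P-value of 1e-5 passes; with 500,000 transcripts the threshold drops to 1e-7 and the same P-value no longer passes. This is the motivation for reducing the number of candidate transcripts before testing.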


mtDNAcombine: tools to combine sequences from multiple studies

BMC Bioinformatics ◽

10.1186/s12859-021-04048-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eleanor F. Miller ◽

Andrea Manica

Keyword(s):

Sequence Data ◽

Data Extraction ◽

Bayesian Skyline Plot ◽

Model Organisms ◽

Data Sets ◽

Data Handling ◽

Online Database ◽

Genetic Studies ◽

Wide Range ◽

Existing Data

Abstract Background Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial; there are many complexities associated with data extraction, data quality and data handling. Results Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions There is now more genetic information available than ever before and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.


In Search of Species-Specific SNPs in a Non-Model Animal (European Bison (Bison bonasus))—Comparison of De Novo and Reference-Based Integrated Pipeline of STACKS Using Genotyping-by-Sequencing (GBS) Data

Animals ◽

10.3390/ani11082226 ◽

2021 ◽

Vol 11 (8) ◽

pp. 2226

Author(s):

Sazia Kunvar ◽

Sylwia Czarnomska ◽

Cino Pertoldi ◽

Małgorzata Tokarska

Keyword(s):

Reference Genome ◽

De Novo ◽

Bos Taurus ◽

Model Organism ◽

Genotyping By Sequencing ◽

Model Organisms ◽

European Bison ◽

Model Animal ◽

Pcr Duplicates ◽

Species Specific

The European bison is a non-model organism; thus, most of its genetic and genomic analyses have been performed using cattle-specific resources, such as BovineSNP50 BeadChip or Illumina Bovine 800 K HD Bead Chip. The problem with non-specific tools is the potential loss of evolutionarily diversified information (ascertainment bias) and species-specific markers. Here, we have used a genotyping-by-sequencing (GBS) approach for genotyping 256 samples from the European bison population in Bialowieza Forest (Poland) and performed an analysis using two integrated pipelines of the STACKS software: one is de novo (without reference genome) and the other is a reference pipeline (with reference genome). Moreover, we used a reference pipeline with two different genomes, i.e., Bos taurus and European bison. Genotyping by sequencing (GBS) is a useful tool for SNP genotyping in non-model organisms due to its cost effectiveness. Our results support GBS with a reference pipeline without PCR duplicates as a powerful approach for studying the population structure and genotyping data of non-model organisms. We found more polymorphic markers in the reference pipeline in comparison to the de novo pipeline. The decreased number of SNPs from the de novo pipeline could be due to the extremely low level of heterozygosity in European bison. It has been confirmed that all the SNPs obtained from the de novo/Bos taurus and Bos taurus reference pipelines were unique and not included in the 800 K BovineHD BeadChip.


A practical guide to build de-novo assemblies for single tissues of non-model organisms: the example of a Neotropical frog

PeerJ ◽

10.7717/peerj.3702 ◽

2017 ◽

Vol 5 ◽

pp. e3702 ◽

Cited By ~ 5

Author(s):

Santiago Montero-Mendieta ◽

Manfred Grabherr ◽

Henrik Lantz ◽

Ignacio De la Riva ◽

Jennifer A. Leonard ◽

...

Keyword(s):

Defense Mechanisms ◽

De Novo ◽

Transcriptome Assembly ◽

Cost Effective ◽

Model Organisms ◽

Rna Seq ◽

Assembly Pipeline ◽

Wide Variability ◽

History Of ◽

Inexperienced User

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.


Deconstructing Gastrulation at the Single Cell Level

10.1101/2021.09.16.460711 ◽

2021 ◽

Author(s):

Tomer Stern ◽

Sebastian J Streichan ◽

Stanislav Y Shvartsman ◽

Eric F Wieschaus

Keyword(s):

Large Scale ◽

Drosophila Embryo ◽

Cell Segmentation ◽

Model Organisms ◽

Specific Cell ◽

Cell Behaviors ◽

Animal Development ◽

Wide Range ◽

Cell Groups ◽

Epithelial Sheets

Gastrulation movements in all animal embryos start with regulated deformations of patterned epithelial sheets. Current studies of gastrulation use a wide range of model organisms and emphasize either large-scale tissue processes or dynamics of individual cells and cell groups. Here we take a step towards bridging these complementary strategies and deconstruct early stages of gastrulation in the entire Drosophila embryo, where transcriptional patterns in the blastoderm give rise to region-specific cell behaviors. Our approach relies on an integrated computational framework for cell segmentation and tracking and on efficient algorithms for event detection. Our results reveal how thousands of cell shape changes, divisions, and intercalations drive large-scale deformations of the patterned blastoderm, setting the stage for systems-level dissection of a pivotal step in animal development.


Nonhuman primates’ tissue banks: resources for all model organism research

Mammalian Genome ◽

10.1007/s00335-021-09925-w ◽

2021 ◽

Author(s):

Claire Witham ◽

Sara Wells

Keyword(s):

Nonhuman Primates ◽

Ex Vivo ◽

Model Organism ◽

Translation Studies ◽

Model Organisms ◽

Laboratory Animals ◽

Tissue Banks ◽

Post Mortem ◽

Wide Range ◽

Research Programmes

Abstract Biobanks containing tissue and other biological samples from many model organisms provide easier and faster access to ex vivo resources for a wide range of research programmes. For all laboratory animals, collecting and preserving tissue at post-mortem is an effective way of maximising the benefits of individual animals and potentially reducing the numbers required for experimentation in the future. For primate tissues, biobanks represent the scarcest of these resources but quite possibly those most valuable for preclinical and translation studies.


OncoThreads: Visualization of Large Scale Longitudinal Cancer Molecular Data

10.31219/osf.io/b3n4u ◽

2021 ◽

Author(s):

Theresa A Harbig ◽

Sabrina Nusrat ◽

Tali Mazor ◽

Qianwen Wang ◽

Alexander Thomson ◽

...

Keyword(s):

Longitudinal Data ◽

Large Scale ◽

Cancer Genomics ◽

Temporal Patterns ◽

Molecular Data ◽

Liquid Biopsies ◽

Sequencing Technologies ◽

Molecular Features ◽

Wide Range ◽

The Impact

Molecular profiling of patient tumors and liquid biopsies over time with next-generation sequencing technologies and new immuno-profile assays is becoming part of standard research and clinical practice. With the wealth of new longitudinal data, there is a critical need for visualizations for cancer researchers to explore and interpret temporal patterns not just in a single patient but across cohorts. To address this need we developed OncoThreads, a tool for the visualization of longitudinal clinical and cancer genomics and other molecular data in patient cohorts. The tool visualizes patient cohorts as temporal heatmaps and Sankey diagrams that support the interactive exploration and ranking of a wide range of clinical and molecular features. This allows analysts to discover temporal patterns in longitudinal data, such as the impact of mutations on response to a treatment, e.g. emergence of resistant clones. We demonstrate the functionality of OncoThreads using a cohort of 23 glioma patients sampled at 2-4 timepoints. OncoThreads is freely available at http://oncothreads.gehlenborglab.org and implemented in JavaScript using the cBioPortal web API as a backend.


Rapture-ready darters: choice of reference genome and genotyping method (whole-genome or sequence capture) influence population genomic inference in Etheostoma

10.1101/2020.05.21.108274 ◽

2020 ◽

Author(s):

Brendan N. Reid ◽

Rachel L. Moran ◽

Christopher J. Kopack ◽

Sarah W. Fitzpatrick

Keyword(s):

Reference Genome ◽

Sequence Data ◽

Low Cost ◽

Read Depth ◽

Model Organisms ◽

Whole Genome ◽

Reduced Representation ◽

Sequence Capture ◽

Population Genomic ◽

The Impact

Abstract Researchers studying non-model organisms have an increasing number of methods available for generating genomic data. However, the applicability of different methods across species, as well as the effect of reference genome choice on population genomic inference, are still difficult to predict in many cases. We evaluated the impact of data type (whole-genome vs. reduced representation) and reference genome choice on data quality and on population genomic and phylogenomic inference across several species of darters (subfamily Etheostomatinae), a highly diverse radiation of freshwater fish. We generated a high-quality reference genome and developed a hybrid RADseq/sequence capture (Rapture) protocol for the Arkansas darter (Etheostoma cragini). Rapture data from 1900 individuals spanning four darter species showed recovery of most loci across darter species at high depth and consistent estimates of heterozygosity regardless of reference genome choice. Loci with baits spanning both sides of the restriction enzyme cut site performed especially well across species. For low-coverage whole-genome data, choice of reference genome affected read depth and inferred heterozygosity. For similar amounts of sequence data, Rapture performed better at identifying fine-scale genetic structure compared to whole-genome sequencing. Rapture loci also recovered an accurate phylogeny for the study species and demonstrated high phylogenetic informativeness across the evolutionary history of the genus Etheostoma. Low cost and high cross-species effectiveness regardless of reference genome suggest that Rapture and similar sequence capture methods may be worthwhile choices for studies of diverse species radiations.
