Error, noise and bias in de novo transcriptome assemblies

2019 ◽  
Author(s):  
Adam H. Freedman ◽  
Michele Clamp ◽  
Timothy B. Sackton

ABSTRACT De novo transcriptome assembly is a powerful tool, widely used over the last decade for making evolutionary inferences. However, it relies on two implicit assumptions: that the assembled transcriptome is an unbiased representation of the underlying expressed transcriptome, and that expression estimates from the assembly are good, if noisy, approximations of the relative abundance of expressed transcripts. Using publicly available data for model organisms, we demonstrate that, across assembly algorithms and data sets, these assumptions are consistently violated. Bias exists at the nucleotide level, with genotyping error rates ranging from 30-83%. As a result, diversity is underestimated in transcriptome assemblies, with consistent under-estimation of heterozygosity in all but the most inbred samples. Even at the gene level, expression estimates show wide deviations from map-to-reference estimates, and positive bias at lower expression levels. Standard filtering of transcriptome assemblies improves the robustness of gene expression estimates but leads to the loss of a meaningful number of protein-coding genes, including many that are highly expressed. We demonstrate a computational method, length-rescaled CPM, to partly alleviate noise and bias in expression estimates. Researchers should consider ways to minimize the impact of bias in transcriptome assemblies.

2015 ◽  
Author(s):  
Benjamin J Matthews ◽  
Carolyn S McBride ◽  
Matthew DeGennaro ◽  
Orion Despo ◽  
Leslie B Vosshall

Background A complete genome sequence and the advent of genome editing open up non-traditional model organisms to mechanistic genetic studies. The mosquito Aedes aegypti is an important vector of infectious diseases such as dengue, chikungunya, and yellow fever, and has a large and complex genome, which has slowed annotation efforts. We used comprehensive transcriptomic analysis of adult gene expression to improve the genome annotation and to provide a detailed tissue-specific catalogue of neural gene expression at different adult behavioral states. Results We carried out deep RNA sequencing across all major peripheral male and female sensory tissues, the brain, and (female) ovary. Furthermore, we examined gene expression across three important phases of the female reproductive cycle, a remarkable example of behavioral switching in which a female mosquito alternates between obtaining blood-meals from humans and laying eggs. Using genome-guided alignments and de novo transcriptome assembly, our re-annotation includes 572 new putative protein-coding genes and updates to 13.5% and 50.3% of existing transcripts within coding sequences and untranslated regions, respectively. Using this updated annotation, we detail gene expression in each tissue, identifying large numbers of transcripts regulated by blood-feeding and sexually dimorphic transcripts that may provide clues to the biology of male- and female-specific behaviors, such as mating and blood-feeding, which are areas of intensive study for those interested in vector control. Conclusions This neurotranscriptome forms a strong foundation for the study of genes in the mosquito nervous system and investigation of sensory-driven behaviors and their regulation. Furthermore, understanding the molecular genetic basis of mosquito chemosensory behavior has important implications for vector control.


2020 ◽  
Author(s):  
Yangmei Qin ◽  
Zhe Lin ◽  
Dan Shi ◽  
Mindong Zhong ◽  
Te An ◽  
...  

Abstract It is a long-term challenge to undertake reliable transcriptomic research under different circumstances of genome availability. Here, we developed a genome-free computational method to aid accurate transcriptome assembly, using the amphioxus as the example. By integrating ten next generation sequencing (NGS) transcriptome datasets and one third-generation sequencing (TGS) dataset, we built a sequence library of non-redundant expressed transcripts for the amphioxus. The library consisted of 91,915 distinct transcripts overall, including 51,549 protein-coding transcripts and 16,923 novel extragenic transcripts. This substantially improved the current amphioxus genome annotation by expanding the distinct gene number from 21,954 to 38,777. We confirmed that the library significantly outperformed both the genome and the de novo method in transcriptome assembly in multiple respects. For convenience, we curated the Integrative Transcript Library database of the amphioxus (http://www.bio-add.org/InTrans/). In summary, this work provides a practical solution for most organisms to alleviate the heavy dependence on a good-quality genome in transcriptome research. It also ensures that amphioxus transcriptome research is grounded on reliable data.


2021 ◽  
Vol 14 (1) ◽  
Author(s):  
Daniel Stribling ◽  
Peter L. Chang ◽  
Justin E. Dalton ◽  
Christopher A. Conow ◽  
Malcolm Rosenthal ◽  
...  

Abstract Objectives Arachnids have fascinating and unique biology, particularly for questions on sex differences and behavior, creating the potential for development of powerful emerging models in this group. Recent advances in genomic techniques have paved the way for a significant increase in the breadth of genomic studies in non-model organisms. One growing area of research is comparative transcriptomics. When phylogenetic relationships to model organisms are known, comparative genomic studies provide context for analysis of homologous genes and pathways. The goal of this study was to lay the groundwork for comparative transcriptomics of sex differences in the brain of wolf spiders, a non-model organism of the phylum Euarthropoda, by generating transcriptomes and analyzing gene expression. Data description To examine sex-differential gene expression, short read transcript sequencing and de novo transcriptome assembly were performed. Messenger RNA was isolated from brain tissue of male and female subadult and mature wolf spiders (Schizocosa ocreata). The raw data consist of sequences for the two different life stages in each sex. Computational analyses on these data include de novo transcriptome assembly and differential expression analyses. Sample-specific and combined transcriptomes, gene annotations, and differential expression results are described in this data note and are available from publicly-available databases.


2021 ◽  
Vol 22 (13) ◽  
pp. 6674
Author(s):  
Luisa Albarano ◽  
Valerio Zupo ◽  
Davide Caramiello ◽  
Maria Toscanesi ◽  
Marco Trifuoggi ◽  
...  

Sediment pollution is a major issue in coastal areas, potentially endangering human health and the marine environments. We investigated the short-term sublethal effects of sediments contaminated with polycyclic aromatic hydrocarbons (PAHs) and polychlorinated biphenyls (PCBs) on the sea urchin Paracentrotus lividus for two months. Spiking occurred at concentrations below threshold limit values permitted by the law (TLVPAHs = 900 µg/L, TLVPCBs = 8 µg/L, Legislative Italian Decree 173/2016). A multi-endpoint approach was adopted, considering both adults (mortality, bioaccumulation and gonadal index) and embryos (embryotoxicity, genotoxicity and de novo transcriptome assembly). The slight concentrations of PAHs and PCBs added to the mesocosms were observed to readily compartmentalize in adults, resulting below the detection limits just one week after their addition. Reconstructed sediment and seawater, as negative controls, did not affect sea urchins. PAH- and PCB-spiked mesocosms were observed to impair P. lividus at various endpoints, including bioaccumulation and embryo development (mainly PAHs) and genotoxicity (PAHs and PCBs). In particular, genotoxicity tests revealed that PAHs and PCBs affected the development of P. lividus embryos deriving from exposed adults. Negative effects were also detected by generating a de novo transcriptome assembly and its annotation, as well as by real-time qPCR performed to identify genes differentially expressed in adults exposed to the two contaminants. The effects on sea urchins (both adults and embryos) at background concentrations of PAHs and PCBs below TLV suggest a need for further investigations on the impact of slight concentrations of such contaminants on marine biota.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3702 ◽  
Author(s):  
Santiago Montero-Mendieta ◽  
Manfred Grabherr ◽  
Henrik Lantz ◽  
Ignacio De la Riva ◽  
Jennifer A. Leonard ◽  
...  

Whole genome sequencing (WGS) is a very valuable resource to understand the evolutionary history of poorly known species. However, in organisms with large genomes, such as most amphibians, WGS is still excessively challenging and transcriptome sequencing (RNA-seq) represents a cost-effective tool to explore genome-wide variability. Non-model organisms do not usually have a reference genome and the transcriptome must be assembled de-novo. We used RNA-seq to obtain the transcriptomic profile for Oreobates cruralis, a poorly known South American direct-developing frog. In total, 550,871 transcripts were assembled, corresponding to 422,999 putative genes. Of those, we identified 23,500, 37,349, 38,120 and 45,885 genes present in the Pfam, EggNOG, KEGG and GO databases, respectively. Interestingly, our results suggested that genes related to immune system and defense mechanisms are abundant in the transcriptome of O. cruralis. We also present a pipeline to assist with pre-processing, assembling, evaluating and functionally annotating a de-novo transcriptome from RNA-seq data of non-model organisms. Our pipeline guides the inexperienced user in an intuitive way through all the necessary steps to build de-novo transcriptome assemblies using readily available software and is freely available at: https://github.com/biomendi/TRANSCRIPTOME-ASSEMBLY-PIPELINE/wiki.


2019 ◽  
Author(s):  
Thomas F. Martinez ◽  
Qian Chu ◽  
Cynthia Donaldson ◽  
Dan Tan ◽  
Maxim N. Shokhirev ◽  
...  

Protein-coding small open reading frames (smORFs) are emerging as an important class of genes; however, the coding capacity of smORFs in the human genome is unclear. By integrating de novo transcriptome assembly and Ribo-Seq, we confidently annotate thousands of novel translated smORFs in three human cell lines. We find that smORF translation prediction is noisier than for annotated coding sequences, underscoring the importance of analyzing multiple experiments and footprinting conditions. These smORFs are located within non-coding and antisense transcripts, the UTRs of mRNAs, and unannotated transcripts. Analysis of RNA levels and translation efficiency during cellular stress identifies regulated smORFs, providing an approach to select smORFs for further investigation. Sequence conservation and signatures of positive selection indicate that encoded microproteins are likely functional. Additionally, proteomics data from enriched human leukocyte antigen complexes validates the translation of hundreds of smORFs and positions them as a source of novel antigens. Thus, smORFs represent a significant number of important, yet unexplored human genes.


2012 ◽  
Vol 78 (22) ◽  
pp. 8025-8032 ◽  
Author(s):  
Anika Reinhold ◽  
Martin Westermann ◽  
Jana Seifert ◽  
Martin von Bergen ◽  
Torsten Schubert ◽  
...  

ABSTRACT Corrinoids are essential cofactors of reductive dehalogenases in anaerobic bacteria. Microorganisms mediating reductive dechlorination as part of their energy metabolism are either capable of de novo corrinoid biosynthesis (e.g., Desulfitobacterium spp.) or dependent on exogenous vitamin B12 (e.g., Dehalococcoides spp.). In this study, the impact of exogenous vitamin B12 (cyanocobalamin) and of tetrachloroethene (PCE) on the synthesis and the subcellular localization of the reductive PCE dehalogenase was investigated in the Gram-positive Desulfitobacterium hafniense strain Y51, a bacterium able to synthesize corrinoids de novo. PCE-depleted cells grown for several subcultivation steps on fumarate as an alternative electron acceptor lost the tetrachloroethene-reductive dehalogenase (PceA) activity by the transposition of the pce gene cluster. In the absence of vitamin B12, a gradual decrease of the PceA activity and protein amount was observed; after 5 subcultivation steps with 10% inoculum, more than 90% of the enzyme activity and of the PceA protein was lost. In the presence of vitamin B12, a significant delay in the decrease of the PceA activity with an ∼90% loss after 20 subcultivation steps was observed. This corresponded to the decrease in the pceA gene level, indicating that exogenous vitamin B12 hampered the transposition of the pce gene cluster. In the absence or presence of exogenous vitamin B12, the intracellular corrinoid level decreased in fumarate-grown cells and the PceA precursor formed catalytically inactive, corrinoid-free multiprotein aggregates. The data indicate that exogenous vitamin B12 is not incorporated into the PceA precursor, even though it affects the transposition of the pce gene cluster.


BMC Genomics ◽  
2019 ◽  
Vol 20 (1) ◽  
Author(s):  
Rashmi Jain ◽  
Jerry Jenkins ◽  
Shengqiang Shu ◽  
Mawsheng Chern ◽  
Joel A. Martin ◽  
...  

Abstract Background The availability of thousands of complete rice genome sequences from diverse varieties and accessions has laid the foundation for in-depth exploration of the rice genome. One drawback to these collections is that most of these rice varieties have long life cycles, and/or low transformation efficiencies, which limits their usefulness as model organisms for functional genomics studies. In contrast, the rice variety Kitaake has a rapid life cycle (9 weeks seed to seed) and is easy to transform and propagate. For these reasons, Kitaake has emerged as a model for studies of diverse monocotyledonous species. Results Here, we report the de novo genome sequencing and analysis of Oryza sativa ssp. japonica variety KitaakeX, a Kitaake plant carrying the rice XA21 immune receptor. Our KitaakeX sequence assembly contains 377.6 Mb, consisting of 33 scaffolds (476 contigs) with a contig N50 of 1.4 Mb. Complementing the assembly are detailed gene annotations of 35,594 protein coding genes. We identified 331,335 genomic variations between KitaakeX and Nipponbare (ssp. japonica), and 2,785,991 variations between KitaakeX and Zhenshan97 (ssp. indica). We also compared Kitaake resequencing reads to the KitaakeX assembly and identified 219 small variations. The high-quality genome of the model rice plant KitaakeX will accelerate rice functional genomics. Conclusions The high quality, de novo assembly of the KitaakeX genome will serve as a useful reference genome for rice and will accelerate functional genomics studies of rice and other species.
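The contig N50 quoted in the abstract above is a standard assembly-contiguity statistic: the length L such that contigs of length L or longer together cover at least half of the assembled bases. The following is a minimal sketch of that general definition, not the authors' code; the function name `n50` and the toy contig lengths are assumptions made purely for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Hypothetical helper for illustration: N50 is the length of the
    contig at which the running total, taken from longest to shortest,
    first reaches half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Toy contig lengths (made up, not from the KitaakeX assembly).
toy_contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
toy_n50 = n50(toy_contigs)
```

Because N50 depends only on the sorted length distribution, a few very long contigs can raise it sharply, which is why it is usually reported alongside the contig count and total assembly size, as the abstract does.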


Author(s):  
Nicolas Rodrigue ◽  
Thibault Latrille ◽  
Nicolas Lartillot

Abstract In recent years, codon substitution models based on the mutation–selection principle have been extended for the purpose of detecting signatures of adaptive evolution in protein-coding genes. However, the approaches used to date have either focused on detecting global signals of adaptive regimes—across the entire gene—or on contexts where experimentally derived, site-specific amino acid fitness profiles are available. Here, we present a Bayesian site-heterogeneous mutation–selection framework for site-specific detection of adaptive substitution regimes given a protein-coding DNA alignment. We offer implementations, briefly present simulation results, and apply the approach on a few real data sets. Our analyses suggest that the new approach shows greater sensitivity than traditional methods. However, more study is required to assess the impact of potential model violations on the method, and gain a greater empirical sense of its behavior on a broader range of real data sets. We propose an outline of such a research program.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Eugene J. Gardner ◽  
Elena Prigmore ◽  
Giuseppe Gallone ◽  
Petr Danecek ◽  
Kaitlin E. Samocha ◽  
...  

Abstract Mobile genetic Elements (MEs) are segments of DNA which can copy themselves and other transcribed sequences through the process of retrotransposition (RT). In humans several disorders have been attributed to RT, but the role of RT in severe developmental disorders (DD) has not yet been explored. Here we identify RT-derived events in 9738 exome sequenced trios with DD-affected probands. We ascertain 9 de novo MEs, 4 of which are likely causative of the patient’s symptoms (0.04%), as well as 2 de novo gene retroduplications. Beyond identifying likely diagnostic RT events, we estimate genome-wide germline ME mutation rate and selective constraint and demonstrate that coding RT events have signatures of purifying selection equivalent to those of truncating mutations. Overall, our analysis represents a comprehensive interrogation of the impact of retrotransposition on protein coding genes and a framework for future evolutionary and disease studies.

