Illuminating the dark side of the human transcriptome with TAMA Iso-Seq analysis

2019 ◽  
Author(s):  
Richard I. Kuo ◽  
Yuanyuan Cheng ◽  
Jacqueline Smith ◽  
Alan L. Archibald ◽  
David W. Burt

Abstract The human transcriptome is one of the most well-annotated of the eukaryotic species. However, limitations in technology biased discovery toward protein coding spliced genes. Accurate high throughput long read RNA sequencing now has the potential to investigate genes that were previously undetectable. Using our Transcriptome Annotation by Modular Algorithms (TAMA) tool kit to analyze the Pacific Biosciences Universal Human Reference RNA Sequel II Iso-Seq dataset, we discovered thousands of potential novel genes and identified challenges in both RNA preparation and long read data processing that have major implications for transcriptome annotation.

2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise. Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6K erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we predicted 2,566 putative novel non-coding genes and 1,557 putative novel protein coding gene models. Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and extensive output information of the TAMA software package allows for in depth exploration of eukaryotic transcriptomes.
We have found long read data based evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required for differentiating sequencing noise from real genes in long read RNA sequencing data.


2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background: The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise. Results: We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used pipelines, we found that the convention of using mapping identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6K erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we identified 2,566 putative novel non-coding genes and 1,557 putative novel protein coding gene models. Conclusions: Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and extensive output information of the TAMA software package allows for in depth exploration of eukaryotic transcriptomes.
We have found long read data based evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required for differentiating sequencing noise from real genes in long read RNA sequencing data.


BMC Genomics ◽  
2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Richard I. Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W. S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide additional evidence for rare transcripts and genes such as mono-exonic and non-coding genes that were previously either undetectable or impossible to differentiate from sequencing noise. Results We developed the Transcriptome Annotation by Modular Algorithms (TAMA) software to leverage the power of long read transcript sequencing and address the issues with current data processing pipelines. TAMA achieved high sensitivity and precision for gene and transcript model predictions in both reference guided and unguided approaches in our benchmark tests using simulated Pacific Biosciences (PacBio) and Nanopore sequencing data and real PacBio datasets. By analyzing PacBio Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using TAMA and other commonly used tools, we found that the convention of using alignment identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction can cause major changes to read mapping, resulting in potentially over 6 K erroneous gene model predictions in the Iso-Seq based human genome annotation. Using TAMA’s genome assembly based error correction and gene feature evidence, we predicted 2566 putative novel non-coding genes and 1557 putative novel protein coding gene models. Conclusions Long read transcript sequencing data has the power to identify novel genes within the highly annotated human genome. The use of parameter tuning and extensive output information of the TAMA software package allows for in depth exploration of eukaryotic transcriptomes. 
We have found long read data based evidence for thousands of unannotated genes within the human genome. More development in sequencing library preparation and data processing is required for differentiating sequencing noise from real genes in long read RNA sequencing data.


2021 ◽  
Author(s):  
Gábor Torma ◽  
Dóra Tombácz ◽  
Norbert Moldován ◽  
Ádám Fülöp ◽  
István Prazsák ◽  
...  

Abstract In this study, we used two long-read sequencing (LRS) techniques, Sequel from Pacific Biosciences and MinION from Oxford Nanopore Technologies, for the transcriptional characterization of a prototype baculovirus, Autographa californica multiple nucleopolyhedrovirus. LRS is able to read full-length RNA molecules, and thereby to distinguish between transcript isoforms, mono- and polycistronic RNAs, and overlapping transcripts. Altogether, we detected 875 transcripts, of which 759 are novel and 116 have been annotated previously. These RNA molecules include 41 novel putative protein coding transcripts (each containing 5’-truncated in-frame ORFs), 14 monocistronic transcripts, 99 multicistronic RNAs, 101 non-coding RNAs, and 504 length isoforms. We also detected RNA methylation in 12 viral genes and RNA hyper-editing in the longer 5’-UTR transcript isoform of the ORF 19 gene.


GigaScience ◽  
2019 ◽  
Vol 8 (7) ◽  
Author(s):  
Chang-Ming Bai ◽  
Lu-Sheng Xin ◽  
Umberto Rosani ◽  
Biao Wu ◽  
Qing-Chen Wang ◽  
...  

Abstract Background The blood clam, Scapharca (Anadara) broughtonii, is an economically and ecologically important marine bivalve of the family Arcidae. Efforts to study their population genetics, breeding, cultivation, and stock enrichment have been somewhat hindered by the lack of a reference genome. Herein, we report the complete genome sequence of S. broughtonii, a first reference genome of the family Arcidae. Findings A total of 75.79 Gb clean data were generated with the Pacific Biosciences and Oxford Nanopore platforms, which represented approximately 86× coverage of the S. broughtonii genome. De novo assembly of these long reads resulted in an 884.5-Mb genome, with a contig N50 of 1.80 Mb and scaffold N50 of 45.00 Mb. Genome Hi-C scaffolding resulted in 19 chromosomes containing 99.35% of bases in the assembled genome. Genome annotation revealed that nearly half of the genome (46.1%) is composed of repeated sequences, while 24,045 protein-coding genes were predicted and 84.7% of them were annotated. Conclusions We report here a chromosomal-level assembly of the S. broughtonii genome based on long-read sequencing and Hi-C scaffolding. The genomic data can serve as a reference for the family Arcidae and will provide a valuable resource for the scientific community and aquaculture sector.


2020 ◽  
Author(s):  
Richard Kuo ◽  
Yuanyuan Cheng ◽  
Runxuan Zhang ◽  
John W.S. Brown ◽  
Jacqueline Smith ◽  
...  

Abstract Background The human transcriptome annotation is regarded as one of the most complete of any eukaryotic species. However, limitations in sequencing technologies have biased the annotation toward multi-exonic protein coding genes. Accurate high-throughput long read transcript sequencing can now provide stronger evidence for genes that were previously either undetectable or impossible to differentiate from sequencing noise such as rare transcripts, mono-exonic, and non-coding genes. Results We analyzed Sequel II Iso-Seq sequencing data of the Universal Human Reference RNA (UHRR) using the Transcriptome Annotation by Modular Algorithms (TAMA) software. We found that the convention of using mapping identity to measure error correction performance does not reflect actual gain in accuracy of predicted transcript models. In addition, inter-read error correction leads to thousands of erroneous gene models. Using genome assembly based error correction and gene feature evidence, we identified thousands of potentially functional novel genes. Conclusions The standard of using inter-read error correction for long read RNA sequencing data could be responsible for genome annotations with thousands of biologically inaccurate gene models. More than half of all real genes in the human genome may still be missing in current public annotations. We require better methods for differentiating sequencing noise from real genes in long read RNA sequencing data.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Evgenii Baiakhmetov ◽  
Cervin Guyomar ◽  
Ekaterina Shelest ◽  
Marcin Nobis ◽  
Polina D. Gudkova

Abstract The Eurasian plant Stipa capillata is the most widespread species within feather grasses. Many taxa of the genus are dominants in steppe plant communities and can be used for their classification and in studies related to climate change. Moreover, some species are of economic importance mainly as fodder plants and can be used for soil remediation processes. Although large-scale molecular data has begun to appear, there is still no complete or draft genome for any Stipa species. Thus, here we present a single-molecule long-read sequencing dataset generated using the Pacific Biosciences Sequel System. A draft genome of about 1004 Mb was obtained with a contig N50 length of 351 kb. Importantly, here we report 81,224 annotated protein-coding genes, present 77,614 perfect and 58 unique imperfect SSRs, reveal the putative allopolyploid nature of S. capillata, investigate the evolutionary history of the genus, demonstrate structural heteroplasmy of the chloroplast genome and announce for the first time the mitochondrial genome in Stipa. The assembled nuclear, mitochondrial and chloroplast genomes provide a significant source of genetic data for further works on phylogeny, hybridisation and population studies within Stipa and the grass family Poaceae.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Robin-Lee Troskie ◽  
Yohaann Jafrani ◽  
Tim R. Mercer ◽  
Adam D. Ewing ◽  
Geoffrey J. Faulkner ◽  
...  

Abstract Pseudogenes are gene copies presumed to mainly be functionless relics of evolution due to acquired deleterious mutations or transcriptional silencing. Using deep full-length PacBio cDNA sequencing of normal human tissues and cancer cell lines, we identify here hundreds of novel transcribed pseudogenes expressed in tissue-specific patterns. Some pseudogene transcripts have intact open reading frames and are translated in cultured cells, representing unannotated protein-coding genes. To assess the biological impact of noncoding pseudogenes, we CRISPR-Cas9 delete the nucleus-enriched pseudogene PDCL3P4 and observe hundreds of perturbed genes. This study highlights pseudogenes as a complex and dynamic component of the human transcriptional landscape.


2021 ◽  
Vol 18 (1) ◽  
Author(s):  
Ahmed Al Qaffas ◽  
Salvatore Camiolo ◽  
Mai Vo ◽  
Alexis Aguiar ◽  
Amine Ourahmane ◽  
...  

Abstract The advent of whole genome sequencing has revealed that common laboratory strains of human cytomegalovirus (HCMV) have major genetic deficiencies resulting from serial passage in fibroblasts. In particular, tropism for epithelial and endothelial cells is lost due to mutations disrupting genes UL128, UL130, or UL131A, which encode subunits of a virion-associated pentameric complex (PC) important for viral entry into these cells but not for entry into fibroblasts. The endothelial cell-adapted strain TB40/E has a relatively intact genome and has emerged as a laboratory strain that closely resembles wild-type virus. However, several heterogeneous TB40/E stocks and cloned variants exist that display a range of sequence and tropism properties. Here, we report the use of PacBio sequencing to elucidate the genetic changes that occurred, both at the consensus level and within subpopulations, upon passaging a TB40/E stock on ARPE-19 epithelial cells. The long-read data also facilitated examination of the linkage between mutations. Consistent with inefficient ARPE-19 cell entry, at least 83% of viral genomes present before adaptation contained changes impacting PC subunits. In contrast, and consistent with the importance of the PC for entry into endothelial and epithelial cells, genomes after adaptation lacked these or additional mutations impacting PC subunits. The sequence data also revealed six single noncoding substitutions in the inverted repeat regions, single nonsynonymous substitutions in genes UL26, UL69, US28, and UL122, and a frameshift truncating gene UL141. Among the changes affecting protein-coding regions, only the one in UL122 was strongly selected. This change, resulting in a D390H substitution in the encoded protein IE2, has been previously implicated in rendering another viral protein, UL84, essential for viral replication in fibroblasts.
This finding suggests that IE2, and perhaps its interactions with UL84, have important functions unique to HCMV replication in epithelial cells.

