TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform

Danze Chen; Qianqian Zhao; Leiming Jiang; Shuaiyuan Liao; Zhigang Meng; Jianzhen Xu

doi:10.3390/genes10070519

TGStools: A Bioinformatics Suit to Facilitate Transcriptome Analysis of Long Reads from Third Generation Sequencing Platform

Genes ◽

10.3390/genes10070519 ◽

2019 ◽

Vol 10 (7) ◽

pp. 519

Author(s):

Danze Chen ◽

Qianqian Zhao ◽

Leiming Jiang ◽

Shuaiyuan Liao ◽

Zhigang Meng ◽

...

Keyword(s):

Transcriptome Analysis ◽

De Novo ◽

Noncoding RNAs ◽

Long Noncoding RNAs ◽

Third Generation ◽

Sequencing Platform ◽

Third Generation Sequencing ◽

Long Reads ◽

Routine Tasks ◽

Generation Sequencing

Recent analyses show that transcriptome sequencing can be used as a diagnostic tool for rare Mendelian diseases. Third-generation sequencing detects long reads of thousands of base pairs de novo, greatly expanding isoform discovery and the identification of novel long noncoding RNAs (lncRNAs). In this study, we developed TGStools, a bioinformatics suite that facilitates routine tasks in transcriptome analysis, such as characterizing full-length transcripts, detecting shifted types of alternative splicing, and identifying lncRNAs. It also prioritizes transcripts with a visualization framework that automatically integrates rich annotation with known genomic features. TGStools is a Python package freely available on GitHub.

Third-Generation Sequencing: The Spearhead towards the Radical Transformation of Modern Genomics

Life ◽

10.3390/life12010030 ◽

2021 ◽

Vol 12 (1) ◽

pp. 30

Author(s):

Konstantina Athanasopoulou ◽

Michaela A. Boti ◽

Panagiotis G. Adamopoulos ◽

Paraskevi C. Skourou ◽

Andreas Scorilas

Keyword(s):

De Novo ◽

Direct Detection ◽

Transcriptional Profiling ◽

Third Generation ◽

De Novo Genome Assembly ◽

RNA Molecules ◽

Third Generation Sequencing ◽

Long Reads ◽

Long Read ◽

Generation Sequencing

Although next-generation sequencing (NGS) technology revolutionized sequencing, offering tremendous sequencing capacity with groundbreaking depth and accuracy, it continues to show serious limitations. In the early 2010s, the introduction of a novel set of sequencing methodologies, represented by two platforms, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), gave birth to third-generation sequencing (TGS). These long-read technologies make genome sequencing easier to handle by greatly reducing the average time of library construction workflows and, because they generate long reads, by simplifying de novo genome assembly. Long sequencing reads produced by both TGS methodologies have already facilitated transcriptional profiling, since they enable the identification of full-length transcripts without the need for assembly or the use of sophisticated bioinformatics tools. Long-read technologies have also provided new insights into the field of epitranscriptomics by allowing the direct detection of RNA modifications on native RNA molecules. This review highlights the advantageous features of the newly introduced TGS technologies, discusses their limitations, and provides an in-depth comparison of their scientific background and available protocols as well as their potential utility in research and clinical applications.

IsoDetect: Detection of splice isoforms from third generation long reads based on short feature sequences

Current Bioinformatics ◽

10.2174/1574893615666200316101205 ◽

2020 ◽

Vol 15 ◽

Author(s):

Hongdong Li ◽

Wenjing Zhang ◽

Yuwen Luo ◽

Jianxin Wang

Keyword(s):

Sequence Similarity ◽

Detection Methods ◽

Sequence Information ◽

Third Generation ◽

Sequencing Data ◽

Splice Isoforms ◽

Third Generation Sequencing ◽

Long Reads ◽

Feature Sequence ◽

Generation Sequencing

Aims: Accurately detect isoforms from third generation sequencing data. Background: Transcriptome annotation is the basis for the analysis of gene expression and regulation. The transcriptome annotation of many organisms, such as humans, is far from complete, due partly to the challenge of identifying isoforms that are produced from the same gene through alternative splicing. Third generation sequencing (TGS) reads provide an unprecedented opportunity for detecting isoforms because their length exceeds that of most isoforms. One limitation of current TGS read-based isoform detection methods is that they are based exclusively on sequence reads, without incorporating the sequence information of known isoforms. Objective: Develop an efficient method for isoform detection. Method: Based on annotated isoforms, we propose a splice isoform detection method called IsoDetect. First, the sequence at each exon-exon junction is extracted from annotated isoforms as a “short feature sequence”, which is used to distinguish different splice isoforms. Second, these feature sequences are aligned to long reads, and the long reads are divided into groups that contain the same set of feature sequences, thereby avoiding pair-wise comparison among the large number of long reads. Third, clustering and consensus generation are carried out based on sequence similarity. For long reads that do not contain any short feature sequence, clustering analysis based on sequence similarity is performed to identify isoforms. Result: Tested on two datasets from Calypte anna and zebra finch, IsoDetect showed higher speed and compelling accuracy compared with four existing methods. Conclusion: IsoDetect is a promising method for isoform detection. Other: This paper was accepted by the CBC2019 conference.
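
The grouping step described in this abstract can be illustrated with a minimal Python sketch. It is not the IsoDetect implementation: the function names, the `flank` parameter, and the toy exon sequences are hypothetical, and exact substring search stands in for the alignment of feature sequences to reads.

```python
from collections import defaultdict


def junction_features(exons, flank=4):
    """Build one short feature sequence per exon-exon junction.

    `exons` is the ordered list of exon sequences of one annotated isoform.
    Each feature joins the last `flank` bases of an exon to the first
    `flank` bases of the next one (`flank` is an illustrative parameter).
    """
    return [left[-flank:] + right[:flank] for left, right in zip(exons, exons[1:])]


def group_reads_by_features(reads, features):
    """Group reads by the exact set of feature sequences they contain.

    Assumption: exact substring search replaces alignment for brevity;
    error-prone long reads would need approximate matching instead.
    """
    groups = defaultdict(list)
    for name, seq in reads.items():
        key = frozenset(f for f in features if f in seq)
        groups[key].append(name)
    return dict(groups)


# Toy gene with three exons: isoform A keeps all of them, isoform B skips exon 2.
exon1, exon2, exon3 = "ACGTACGTAA", "GGCCGGCCTT", "TTAGTTAGCA"
features = sorted(set(junction_features([exon1, exon2, exon3])
                      + junction_features([exon1, exon3])))
reads = {
    "read1": exon1 + exon2 + exon3,  # isoform A
    "read2": exon1 + exon3,          # isoform B
    "read3": exon1 + exon2 + exon3,  # isoform A
    "read4": "CCCCCCCCCCCC",         # no known junction
}
groups = group_reads_by_features(reads, features)
for key, names in groups.items():
    print(sorted(key), names)
```

Reads that fall into the empty-set group correspond to the case the abstract handles separately, by clustering on sequence similarity.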

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm

Bioinformatics ◽

10.1093/bioinformatics/btaa179 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3669-3679 ◽

Cited By ~ 3

Author(s):

Can Firtina ◽

Jeremie S Kim ◽

Mohammed Alser ◽

Damla Senol Cali ◽

A Ercument Cicek ◽

...

Keyword(s):

Genome Analysis ◽

Supplementary Information ◽

Third Generation ◽

Sequencing Technology ◽

Base Pairs ◽

Sequencing Technologies ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing ◽

Large Genomes

Motivation: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject’s genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by fixing errors in the assembly using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from a certain sequencing technology, or can only polish a small assembly. Such technology dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms to use all available read sets and (ii) split a large genome into small chunks to polish it. Results: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward–Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. Availability and implementation: Source code is available at https://github.com/CMU-SAFARI/Apollo. Supplementary information: Supplementary data are available at Bioinformatics online.
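
To make the decoding step concrete, here is a minimal Python sketch of Viterbi decoding on a plain two-state HMM. It is not Apollo's code and not a profile HMM: the states (`AT_rich`, `GC_rich`) and all probabilities are invented for illustration, and the training step (Forward–Backward) is omitted.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path (computed in log space).

    All probabilities must be non-zero, because the sketch takes their log.
    """
    # best[t][s] = (log-probability of the best path ending in s at time t,
    #               predecessor state on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            column[s] = (score + math.log(emit_p[s][obs]), prev)
        best.append(column)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Toy model (all numbers are illustrative assumptions).
states = ["AT_rich", "GC_rich"]
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9},
}
emit_p = {
    "AT_rich": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "GC_rich": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
path = viterbi("AATTGGCC", states, start_p, trans_p, emit_p)
print(path)
```

In a polishing setting the hidden states would instead be the match, insertion, and deletion states of the pHMM built from the assembly, and the decoded path would spell out the corrected sequence.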

Quality of Third Generation Sequencing

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9630 ◽

2020 ◽

Vol 17 (12) ◽

pp. 5205-5209

Author(s):

Ali Elbialy ◽

M. A. El-Dosuky ◽

Ibrahim M. El-Henawy

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

GC Content ◽

Error Rates ◽

Third Generation ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing

Third generation sequencing (TGS) produces long reads, but with relatively high error rates. The quality of TGS data, and how to deal with its errors, is a hot topic. This paper combines and investigates three quality-related metrics: basecalling accuracy, Phred quality scores, and GC content. For basecalling accuracy, a deep neural network is adopted. The measured loss does not exceed 5.42.
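
Two of the metrics named here have standard definitions that are easy to show in Python. This is a minimal sketch, not the paper's code: the function names are hypothetical, and the quality string is assumed to use the common Phred+33 ASCII encoding of FASTQ files. A Phred score Q corresponds to a base-call error probability of 10^(-Q/10).

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def mean_error_prob(quality_string, offset=33):
    """Mean per-base error probability of a FASTQ quality string.

    Assumption: qualities are ASCII-encoded with the Phred+33 offset.
    """
    probs = [phred_to_error_prob(ord(c) - offset) for c in quality_string]
    return sum(probs) / len(probs)


print(gc_content("GGCCAATT"))    # half of the bases are G or C
print(phred_to_error_prob(20))   # Q20: about 1 error in 100 bases
print(mean_error_prob("5I"))     # '5' encodes Q20, 'I' encodes Q40
```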

de novo repeat detection based on the third generation sequencing reads

2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm47256.2019.8982959 ◽

2019 ◽

Cited By ~ 1

Author(s):

Xingyu Liao ◽

Xiankai Zhang ◽

Fang-Xiang Wu ◽

Jianxin Wang

Keyword(s):

De Novo ◽

Third Generation ◽

The Third ◽

Third Generation Sequencing ◽

Generation Sequencing ◽

Repeat Detection

A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads

Genes ◽

10.3390/genes10010044 ◽

2019 ◽

Vol 10 (1) ◽

pp. 44 ◽

Cited By ~ 1

Author(s):

Wenjing Zhang ◽

Neng Huang ◽

Jiantao Zheng ◽

Xingyu Liao ◽

Jianxin Wang ◽

...

Keyword(s):

Quality Evaluation ◽

Training Data ◽

Third Generation ◽

Contig Assembly ◽

High Quality ◽

Promising Alternative ◽

Third Generation Sequencing ◽

Long Reads ◽

Generation Sequencing

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and thereby improve the results of error correction and assembly. In this study, we proposed a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model was built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising sequence-quality evaluation alternative to alignment-based algorithms.
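
As a rough illustration of scoring reads from their nucleotide combinations, the Python sketch below turns a read into a dinucleotide frequency vector and applies a linear model. It is not the REQUEST implementation: the choice of dinucleotides as the "nucleotide combinations", the function names, and the weights are all assumptions, and the weights are arbitrary placeholders, not fitted regression coefficients.

```python
from itertools import product

# All 16 dinucleotides in a fixed order: AA, AC, AG, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_frequencies(read):
    """Relative frequency of each dinucleotide among overlapping pairs in `read`.

    Pairs containing characters other than A, C, G, T are ignored.
    """
    read = read.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(read) - 1):
        pair = read[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def linear_score(features, weights, intercept=0.0):
    """Score a feature vector with a linear model: intercept + sum(w_i * x_i)."""
    return intercept + sum(w * x for w, x in zip(weights, features))


# Placeholder weights (NOT fitted): penalize same-base pairs, ignore the rest.
weights = [-1.0 if d[0] == d[1] else 0.0 for d in DINUCLEOTIDES]

for read in ("ACGTACGT", "AAAA"):
    print(read, linear_score(dinucleotide_frequencies(read), weights))
```

In a real workflow the weights would come from fitting the regression on the labelled high-quality and low-quality training reads, and reads would then be ranked by score.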

De Novo genome assembly for third generation sequencing data

Photonics Applications in Astronomy, Communications, Industry, and High-Energy Physics Experiments 2018 ◽

10.1117/12.2501543 ◽

2018 ◽

Author(s):

Robert M. Nowak ◽

Mateusz Forc ◽

Wiktor Kuśmirek

Keyword(s):

Genome Assembly ◽

De Novo ◽

Third Generation ◽

Sequencing Data ◽

De Novo Genome Assembly ◽

Third Generation Sequencing ◽

Generation Sequencing

SMOOTH-seq: single-cell genome sequencing of human cells on a third-generation sequencing platform

Genome Biology ◽

10.1186/s13059-021-02406-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Xiaoying Fan ◽

Cheng Yang ◽

Wen Li ◽

Xiuzhen Bai ◽

Xin Zhou ◽

...

Keyword(s):

Single Cell ◽

Genome Sequencing ◽

Single Molecule ◽

Human Cancer ◽

Whole Genome ◽

Third Generation ◽

Sequencing Platform ◽

Human Cancer Cell Lines ◽

Third Generation Sequencing ◽

Generation Sequencing

There is no effective way to detect structural variations (SVs) and extra-chromosomal circular DNAs (ecDNAs) at the single-cell whole-genome level. Here, we develop a novel third-generation sequencing platform-based single-cell whole-genome sequencing (scWGS) method named SMOOTH-seq (single-molecule real-time sequencing of long fragments amplified through transposon insertion). We evaluate the method for detecting CNVs, SVs, and SNVs in human cancer cell lines and a colorectal cancer sample and show that SMOOTH-seq reliably and effectively detects SVs and ecDNAs in individual cells, but shows relatively limited accuracy in the detection of CNVs and SNVs. SMOOTH-seq opens a new chapter in scWGS as it generates high-fidelity reads that are kilobases long.

Third generation indexing for third generation sequencing

10.1101/2020.05.07.082347 ◽

2020 ◽

Author(s):

Abdulqader Jighly

Keyword(s):

DNA Sequences ◽

De Novo ◽

Total Error ◽

Error Rates ◽

Routine Activity ◽

Considerable Proportion ◽

Third Generation ◽

Structural Variations ◽

Third Generation Sequencing ◽

Generation Sequencing

Indexing of DNA sequences is the art of sorting massive genomic data into a user-friendly structure that enables rapid access to, and comparison of, different patterns in the data. Current genome assemblers use general string-indexing algorithms that do not exploit the special structural arrangement of genomes. Here, I propose a new algorithm that indexes only the configuration of microsatellite motifs along reads, assuming that the order of microsatellites will be the same in overlapping sequences. The index size is >1000 times smaller than currently used indices, and it has higher tolerance to the high error rates produced by third generation sequencing platforms. The results showed that the proposed algorithm can rapidly detect overlaps among a considerable proportion of uncorrected long reads (~50% of all simulated base pairs with an average read size of 8.16 kb and total error rates of 14.4%) to build large initial contigs. Unassembled reads can then be mapped to these contigs or assembled with them using currently used algorithms. Thus, the proposed algorithm can be used efficiently as an initial stage to significantly reduce the number of pairwise sequence comparisons among reads and/or references and to improve the performance of different software without replacing it. The algorithm was also useful for comparative genomics, detecting large locally colinear blocks and structural variations among ten Saccharomyces cerevisiae strains. The proposed algorithm has the power to make de novo assembly of individuals a routine activity, which can lead to more accurate variant calling and pan-genomics.
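
The idea of indexing only the order of microsatellite motifs can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's algorithm: the regular expression, the thresholds (motifs of 2-3 bases repeated at least three times), the rule that two reads are overlap candidates when they share two consecutive motifs, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# A motif of 2-3 bases repeated at least three times in tandem
# (thresholds chosen arbitrarily for this sketch).
MICROSATELLITE = re.compile(r"([ACGT]{2,3})\1{2,}")


def canonical(motif):
    """Smallest rotation of a motif, so that 'CA' and 'AC' name the same repeat."""
    return min(motif[i:] + motif[:i] for i in range(len(motif)))


def motif_signature(read):
    """Ordered list of canonical microsatellite motifs found along a read."""
    return [canonical(m.group(1)) for m in MICROSATELLITE.finditer(read)
            if len(set(m.group(1))) > 1]  # skip homopolymer runs such as AAAAAA


def candidate_overlaps(reads):
    """Pairs of reads that share two consecutive motifs in the same order."""
    index = defaultdict(set)
    for name, seq in reads.items():
        sig = motif_signature(seq)
        for pair in zip(sig, sig[1:]):
            index[pair].add(name)
    pairs = set()
    for names in index.values():
        ordered = sorted(names)
        pairs.update((a, b) for i, a in enumerate(ordered) for b in ordered[i + 1:])
    return pairs


# Toy reads: read_a and read_b overlap across two microsatellites; read_c does not.
reads = {
    "read_a": "GGTT" + "ACACACAC" + "TGGCT" + "CAGCAGCAG",
    "read_b": "ACACACAC" + "TGGCT" + "CAGCAGCAG" + "TTCGA" + "TCTCTCTC",
    "read_c": "GATTC" + "TCTCTCTC" + "AAGGT" + "GAGAGAGA",
}
for name, seq in reads.items():
    print(name, motif_signature(seq))
print(candidate_overlaps(reads))
```

Because only short motif signatures are stored and compared, the index stays far smaller than one built over every k-mer, which is the trade-off the abstract describes; candidate pairs would still need verification by a conventional overlap or alignment step.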

CoLoRd: Compressing long reads

10.1101/2021.07.17.452767 ◽

2021 ◽

Author(s):

Marek Kokot ◽

Adam Gudys ◽

Heng Li ◽

Sebastian Deorowicz

Keyword(s):

General Purpose ◽

Third Generation ◽

Sequencing Data ◽

The Third ◽

Third Generation Sequencing ◽

Long Reads ◽

Order Of Magnitude ◽

Generation Sequencing

The costs of maintaining the exabytes of data produced by sequencing experiments every year have become a major issue in today's genomics. In spite of the increasing popularity of third generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over general-purpose gzip. We present CoLoRd, an algorithm able to reduce third generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.
