Efficient and accurate detection of splice junctions from RNAseq with Portcullis

2017 ◽  
Author(s):  
Daniel Mapleson ◽  
Luca Venturini ◽  
Gemy Kaithakottil ◽  
David Swarbreck

Abstract: Next generation sequencing (NGS) technologies enable rapid and cheap genome-wide transcriptome analysis, providing vital information about gene structure, transcript expression and alternative splicing. Key to this is the accurate identification of exon-exon junctions from RNA sequenced (RNA-seq) reads. A number of RNA-seq aligners capable of splitting reads across these splice junctions (SJs) have been developed; however, it has been shown that while they correctly identify most genuine SJs available in a given sample, they also often produce large numbers of incorrect SJs. Herein we describe the extent of this problem using popular RNA-seq mapping tools, and present a new method, called Portcullis, to rapidly filter false SJs from spliced alignments produced by any RNA-seq mapper capable of creating SAM/BAM files. We show that Portcullis distinguishes between genuine and false positive junctions to a high degree of accuracy across different species, samples, expression levels, error profiles and read lengths. Portcullis makes efficient use of memory and threading and, to our knowledge, is currently the only SJ prediction tool that reliably scales for use with large RNA-seq datasets and large, highly fragmented genomes, whilst delivering highly accurate SJs. Availability: Portcullis is available under the GPLv3 license at: http://maplesond.github.io/portcullis/
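To make the notion of a splice junction in a SAM alignment concrete, here is a minimal Python sketch. It relies only on the SAM convention that an `N` operation in the CIGAR string marks a skipped reference region (an intron). The function names and the `min_reads` threshold are illustrative assumptions. The read-count filter is a deliberately naive stand-in and is not Portcullis's filtering method, which the abstract does not detail.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def junctions_from_cigar(pos, cigar):
    """Return (intron_start, intron_end) pairs (1-based, inclusive),
    one for each N operation in the CIGAR string."""
    junctions = []
    ref = pos  # 1-based leftmost reference position of the alignment
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            junctions.append((ref, ref + length - 1))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref += length
    return junctions

def count_junctions(sam_lines):
    """Count supporting alignments per (chrom, start, end) junction."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or cigar == "*":  # unmapped read or missing CIGAR
            continue
        for start, end in junctions_from_cigar(pos, cigar):
            counts[(chrom, start, end)] += 1
    return counts

def naive_filter(counts, min_reads=2):
    """Hypothetical filter: keep junctions with at least min_reads alignments."""
    return {j: n for j, n in counts.items() if n >= min_reads}
```

Reading a real BAM file would require a library such as pysam; plain SAM text lines are used here so the sketch stays self-contained.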

2017 ◽  
Author(s):  
Claire Rioualen ◽  
Lucie Charbonnier-Khamvongsa ◽  
Jacques van Helden

Summary: Next-Generation Sequencing (NGS) is becoming a routine approach for most domains of life sciences, yet there is a crucial need to improve the automation of processing for the huge amounts of data generated and to ensure reproducible results. We present SnakeChunks, a collection of Snakemake rules for composing modular and user-configurable workflows, and show its use in analyses of transcriptome (RNA-seq) and genome-wide location (ChIP-seq) data. Availability: The code is freely available (github.com/SnakeChunks/SnakeChunks) and documented with tutorials and illustrative demos (snakechunks.readthedocs.io). Supplementary information: Supplementary data are available at Bioinformatics online.


2020 ◽  
Author(s):  
Shuhei Noguchi ◽  
Hideya Kawaji ◽  
Takeya Kasukawa

Background: Genome mapping is an essential step in data processing for transcriptome analysis, and many previous studies have evaluated various methods and strategies for mapping RNA-seq data. Cap Analysis of Gene Expression (CAGE) is a sequencing-based protocol particularly designed to capture the 5′-ends of transcripts for quantitatively measuring the expression levels of transcription start sites genome-wide. Because CAGE analysis can also predict the activities of promoters and enhancers, this protocol has been an essential tool in studies of transcriptional regulation. Typically, the same mapping software is used to align both RNA-seq data and CAGE reads to a reference genome, but which mapping software and options are most appropriate for mapping the 5′-end sequence reads obtained through CAGE has not previously been evaluated systematically. Results: Here we assessed various strategies for aligning CAGE reads, particularly ∼50-bp sequences, with the human genome by using the HISAT2, LAST, and STAR programs both with and without a reference transcriptome. One of the major inconsistencies among the tested strategies involves alignments to pseudogenes and parent genes: some of the strategies prioritized alignments with pseudogenes even when the read could be aligned with coding genes with fewer mismatches. Another inconsistency concerned the detection of exon-exon junctions. These preferences depended on the program applied and whether a reference transcriptome was included. Overall, the choice of strategy yielded different mapping results for approximately 2% of all promoters. Conclusions: Although the various alignment strategies produced very similar results overall, we noted several important and measurable differences. In particular, using the reference transcriptome in STAR yielded alignments with the fewest mismatches. In addition, the inconsistencies among the strategies were especially noticeable regarding alignments to pseudogenes and novel splice junctions. Our results indicate that the choice of alignment strategy is important because it might affect the biological interpretation of the data.


2017 ◽  
Author(s):  
Claire Marchal ◽  
Takayo Sasaki ◽  
Daniel Vera ◽  
Korey Wilson ◽  
Jiao Sima ◽  
...  

Abstract: Cycling cells duplicate their DNA content during S phase, following a defined program called replication timing (RT). Early and late replicating regions differ in terms of mutation rates, transcriptional activity, chromatin marks and sub-nuclear position. Moreover, RT is regulated during development and is altered in disease. Exploring mechanisms linking RT to other cellular processes in normal and diseased cells will be facilitated by rapid and robust methods with which to measure RT genome-wide. Here, we describe a rapid, robust and relatively inexpensive protocol to analyze genome-wide RT by next-generation sequencing (NGS). This protocol yields highly reproducible results across laboratories and platforms. We also provide computational pipelines for analysis, parsing phased genomes using single nucleotide polymorphisms (SNP) for analyzing RT allelic asynchrony, and for direct comparison to Repli-chip data obtained by analyzing nascent DNA by microarrays.


Genes ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 92 ◽  
Author(s):  
Shannon J. McKie ◽  
Anthony Maxwell ◽  
Keir C. Neuman

Next-generation sequencing (NGS) platforms have been adapted to generate genome-wide maps and sequence context of binding and cleavage of DNA topoisomerases (topos). Continuous refinements of these techniques have resulted in the acquisition of data with unprecedented depth and resolution, which has shed new light on in vivo topo behavior. Topos regulate DNA topology through the formation of reversible single- or double-stranded DNA breaks. Topo activity is critical for DNA metabolism in general, and in particular to support transcription and replication. However, the binding and activity of topos over the genome in vivo was difficult to study until the advent of NGS. Over and above traditional chromatin immunoprecipitation (ChIP)-seq approaches that probe protein binding, the unique formation of covalent protein–DNA linkages associated with DNA cleavage by topos affords the ability to probe cleavage and, by extension, activity over the genome. NGS platforms have facilitated genome-wide studies mapping the behavior of topos in vivo, how the behavior varies among species and how inhibitors affect cleavage. Many NGS approaches achieve nucleotide resolution of topo binding and cleavage sites, imparting an extent of information not previously attainable. We review the development of NGS approaches to probe topo interactions over the genome in vivo and highlight general conclusions and quandaries that have arisen from this rapidly advancing field of topoisomerase research.


2014 ◽  
Vol 42 (S1) ◽  
pp. 22-41 ◽  
Author(s):  
Patricia A. Deverka ◽  
Jennifer C. Dreyfus

Clinical next generation sequencing (NGS) is a term that refers to a variety of technologies that permit rapid sequencing of large numbers of DNA segments, up to and including entire genomes. As an approach that is playing an increasingly important role in obtaining genetic information from patients, it may be viewed by public and private payers either positively, as an enabler of the promised benefits of personalized medicine, or as “the perfect storm” resulting from the confluence of high market demand, an unproven technology, and an unprepared delivery system. A number of recent studies have noted that coverage and reimbursement will be critical for clinical integration of NGS, yet the evidentiary pathway for payer decision-making is unclear. Although there are multiple reasons for this uncertain reimbursement environment, the situation stems in large part from a long-standing lack of alignment between the information needs of regulators and post-regulatory decision-makers such as payers.


2010 ◽  
Vol 2010 ◽  
pp. 1-19 ◽  
Author(s):  
Valerio Costa ◽  
Claudia Angelini ◽  
Italia De Feis ◽  
Alfredo Ciccodicola

In recent years, the introduction of massively parallel sequencing platforms for Next Generation Sequencing (NGS) protocols, able to sequence hundreds of thousands of DNA fragments simultaneously, has dramatically changed the landscape of genetics studies. RNA-Seq for transcriptome studies, ChIP-Seq for DNA-protein interactions, and CNV-Seq for large genome nucleotide variations are only some of the intriguing new applications supported by these innovative platforms. Among them, RNA-Seq is perhaps the most complex NGS application. Expression levels of specific genes, differential splicing, and allele-specific expression of transcripts can be accurately determined by RNA-Seq experiments to address many biological issues. These attributes are not readily achievable with the previously widespread hybridization-based or tag sequence-based approaches. However, the unprecedented level of sensitivity and the large amount of data produced by NGS platforms bring clear advantages as well as new challenges and issues. While this technology has the power to enable new biological observations and discoveries, it also requires considerable effort in the development of new bioinformatics tools to deal with these massive data files. The paper aims to give a survey of the RNA-Seq methodology, focusing particularly on the challenges that this application presents from both a biological and a bioinformatics point of view.


2019 ◽  
Author(s):  
Tim O. Nieuwenhuis ◽  
Stephanie Yang ◽  
Rohan X. Verma ◽  
Vamsee Pillalamarri ◽  
Dan E. Arking ◽  
...  

Abstract: One of the challenges of next generation sequencing (NGS) is read contamination. We used the Genotype-Tissue Expression (GTEx) project, a large, diverse, and robustly generated dataset, to understand the factors that contribute to contamination. We obtained GTEx datasets and technical metadata, as well as RNA-Seq data from other studies for validation. Of 48 analyzed tissues in GTEx, 26 had variant co-expression clusters of four known highly expressed and pancreas-enriched genes (PRSS1, PNLIP, CLPS, and/or CELA3A). Fourteen additional highly expressed genes from other tissues also indicated contamination. Sample contamination by non-native genes was associated with a sample being sequenced on the same day as a tissue that natively expressed those genes. This was highly significant for pancreas and esophagus genes (linear model, p=9.5e-237 and p=5e-260 respectively). Nine SNPs in four genes shown to contaminate non-native tissues demonstrated allelic differences between DNA-based genotypes and contaminated sample RNA-based genotypes, validating the contamination. Low-level contamination affected 4,497 (39.6%) samples (defined as ≥10 PRSS1 TPM). It also led to eQTL assignments in inappropriate tissues among these 18 genes. We note this type of contamination occurs widely, impacting bulk and single cell data set analysis. In conclusion, highly expressed, tissue-enriched genes basally contaminate GTEx and other datasets, impacting analyses. Awareness of this process is necessary to avoid assigning inaccurate importance to low-level gene expression in inappropriate tissues and cells.
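The contamination criterion above is stated in transcripts per million (TPM). The following Python sketch shows the standard TPM calculation and a threshold check on a marker gene. The function names are hypothetical, and the check assumes that the 10 PRSS1 TPM cut-off quoted in the abstract is an inclusive lower bound. This illustrates the unit and the threshold only, and is not the authors' analysis pipeline.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts: dict of gene -> read count for one sample
    lengths_kb: dict of gene -> gene length in kilobases
    """
    # Length-normalised rate for each gene (reads per kilobase).
    rates = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    total = sum(rates.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    # Scale so that the values in one sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}

def has_low_level_contamination(sample_tpm, marker="PRSS1", threshold=10.0):
    """Assumed rule: flag a sample whose marker gene TPM reaches the threshold."""
    return sample_tpm.get(marker, 0.0) >= threshold
```

In practice such a check would be applied only to samples from tissues that do not natively express the marker gene, since high PRSS1 expression is expected in pancreas.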


Author(s):  
Naiyar Iqbal ◽  
Pradeep Kumar

Disease classification based on biological data is an important area in bioinformatics and biomedical research. It helps doctors and medical practitioners detect disease early and supports them as a computer-aided diagnostic tool for accurate diagnosis, prognosis, and treatment of disease. Microarray gene expression data were previously widely applied to disease classification, but Next-generation sequencing (NGS) has now replaced Microarray technology. Because RNA sequence (RNA-Seq) data have come into wide use for transcriptomic analysis only in the last few years, RNA-Seq-based classification of disease is still in its infancy. In this article, we present a general framework for the classification of disease based on RNA-Seq data. This framework will guide researchers in processing RNA-Seq data, extracting relevant features and applying an appropriate classifier to classify any kind of disease.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 2851 ◽  
Author(s):  
Panu Artimo ◽  
Séverine Duvaud ◽  
Mikhail Pachkov ◽  
Vassilios Ioannidis ◽  
Erik van Nimwegen ◽  
...  

ISMARA (ismara.unibas.ch) automatically infers the key regulators and regulatory interactions from high-throughput gene expression or chromatin state data. However, given the large sizes of current next generation sequencing (NGS) datasets, data uploading times are a major bottleneck. Additionally, for proprietary data, users may be uncomfortable with uploading entire raw datasets to an external server. Both these problems could be alleviated by providing a means by which users could pre-process their raw data locally, transferring only a small summary file to the ISMARA server. We developed a stand-alone client application that pre-processes large input files (RNA-seq or ChIP-seq data) on the user's computer for performing ISMARA analysis in a completely automated manner, including uploading of small processed summary files to the ISMARA server. This reduces file sizes by up to a factor of 1000, and upload times from many hours to mere seconds. The client application is available from ismara.unibas.ch/ISMARA/client.


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 888 ◽  
Author(s):  
Elizabeth Baskin ◽  
Peter DeFord ◽  
Allison F. Dennis ◽  
Ian Misner ◽  
Frederick J. Tan ◽  
...  

The rapid rise of high-throughput, data-intensive experimental techniques has thrust many biologists into the role of data analyst – a role many biologists feel ill-equipped to fill. Novices often struggle to find the resources and expertise they need to analyze their experimental results in a wet-lab environment. To fill this need, we developed an educational resource as part of a National Center for Biotechnology Information (NCBI) hackathon. Using RNA-seq as a model, our tutorial guides new users through the steps of data analysis, while placing an emphasis on understanding the motivation behind choices made in the process. To advance the goal of providing a deeper understanding of the analysis process, we developed a new tool, bamDiff. bamDiff allows users to compare the performance of multiple RNA-seq aligners and so select the most appropriate aligner for the data in question and the experimental end-goal. Our tutorial is accessible via a GitHub wiki, with associated data and software provided on an Amazon Machine Image (AMI), and can be completed at no cost to the user through the Amazon Educate Program. Following the hackathon, our tutorial was integrated into the October 2015 offering of NCBI NOW (Next Generation Sequencing (NGS) Online Workshop), a free online experience targeting individuals new to NGS analysis.

