Elysium: RNA-seq Alignment in the Cloud

Mapping Intimacies ◽

10.1101/382937 ◽

2018 ◽

Cited By ~ 2

Author(s):

Alexander Lachmann ◽

Zhuorui Xie ◽

Avi Ma’ayan

Keyword(s):

Job Scheduling ◽

Rna Seq ◽

Skill Levels ◽

Transcript Quantification ◽

Link Type ◽

Genome Wide ◽

Gene Level ◽

Programming Knowledge ◽

Computational Resources ◽

Alignment Step

Motivation: RNA-sequencing (RNA-seq) is currently the leading technology for genome-wide transcript quantification. Mapping the raw reads to transcript- and gene-level counts can be achieved by a variety of aligners and pipelines. The diversity of processing options reduces interoperability. In addition, the alignment step requires significant computational resources and basic programming knowledge. Elysium enables users of all skill levels to perform a uniform and free RNA-seq alignment in the cloud. Results: The Elysium infrastructure comprises three components: a file upload API that enables storage of FASTQ files on Amazon S3 without Amazon credentials; an API to handle the cloud alignment job scheduling for uploaded files; and a graphical user interface (GUI) to provide intuitive access to users who do not have command-line skills. Availability: The Elysium source code is available under the Apache Licence 2.0 on GitHub at https://github.com/maayanlab/elysium. The cloud-based RNA-seq alignment service is freely accessible through the Elysium GUI at http://elysium.cloud

Read trimming is not required for mapping and quantification of RNA-seq reads at the gene level

NAR Genomics and Bioinformatics ◽

10.1093/nargab/lqaa068 ◽

2020 ◽

Vol 2 (3) ◽

Author(s):

Yang Liao ◽

Wei Shi

Keyword(s):

Data Analysis ◽

Pearson Correlation ◽

Rna Seq ◽

Genome Wide ◽

Gene Level ◽

Sequencing Quality ◽

Total Data ◽

Order Of Magnitude ◽

Gene Expression Quantification ◽

The Impact

Abstract: RNA sequencing (RNA-seq) is currently the standard method for genome-wide expression profiling. RNA-seq reads often need to be mapped to a reference genome before read counts can be produced for genes. Read trimming methods have been developed to assist read mapping by removing adapter sequences and low-sequencing-quality bases. It is, however, unclear what impact read trimming has on the quantification of RNA-seq data, an important task in RNA-seq data analysis. In this study, we used a benchmark RNA-seq dataset and simulation data to assess the impact of read trimming on mapping and quantification of RNA-seq reads. We found that adapter sequences can be effectively removed by the read aligner via ‘soft-clipping’ and that many low-sequencing-quality bases, which would be removed by read trimming tools, were rescued by the aligner. Accuracy of gene expression quantification using untrimmed reads was found to be comparable to or slightly better than that using trimmed reads, based on Pearson correlation with reverse transcriptase-polymerase chain reaction data and simulation truth. Total data analysis time was reduced by up to an order of magnitude when read trimming was not performed. Our study suggests that read trimming is a redundant process in the quantification of RNA-seq expression data.

OPTIMIR, a novel algorithm for integrating available genome-wide genotype data into miRNA sequence alignment analysis

10.1101/479097 ◽

2018 ◽

Author(s):

Florian Thibord ◽

Claire Perret ◽

Maguelonne Roux ◽

Pierre Suchon ◽

Marine Germain ◽

...

Keyword(s):

Mirna Sequence ◽

Biological Knowledge ◽

Genotype Data ◽

Link Type ◽

Genome Wide ◽

Heterozygous Carriers ◽

Alignment Analysis ◽

Mirna Editing ◽

Novel Algorithm ◽

Alignment Step

Abstract: Next-generation sequencing is an increasingly popular and efficient approach to characterize the full set of microRNAs (miRNAs) present in human biosamples. Detection and quantification of miRNAs remain a challenge, as miRNAs can undergo different post-transcriptional modifications and might harbor genetic variations (polymiRs) that may affect the alignment step. We present a novel algorithm, OPTIMIR, that incorporates biological knowledge on miRNA editing and genome-wide genotype data available in the processed samples to improve alignment accuracy. OPTIMIR was applied to 391 human plasma samples that had been typed with genome-wide genotyping arrays. OPTIMIR was able to detect genotyping errors, suggested the existence of novel miRNAs and highlighted the allelic imbalance expression of polymiRs in heterozygous carriers. OPTIMIR is written in Python and is freely available on the GENMED website (http://www.genmed.fr/index.php/fr/) and on GitHub (github.com/FlorianThibord/OptimiR).

A CLEAR pipeline for direct comparison of circular and linear RNA expression

10.1101/668657 ◽

2019 ◽

Author(s):

Xu-Kai Ma ◽

Meng-Ran Wang ◽

Chu-Xiao Liu ◽

Rui Dong ◽

Gordon G. Carmichael ◽

...

Keyword(s):

Ribosomal Rna ◽

Circular Rnas ◽

Rna Expression ◽

Computational Pipeline ◽

Rna Seq ◽

Link Type ◽

Genome Wide ◽

A Genome

Abstract: Sequences of circular RNAs (circRNAs) produced from back-splicing of exon(s) completely overlap with sequences from cognate linear RNAs transcribed from the same gene loci, with the exception of their back-splicing junction (BSJ) sites. Examination of global circRNA expression from RNA-seq datasets generally relies on the detection of RNA-seq fragments spanning BSJ sites, but a direct, genome-wide comparison of circular and linear RNA expression from the same gene loci has remained challenging. This is because quantification of BSJ fragments differs from that of linear RNA expression, which uses normalized RNA-seq fragments mapped to whole gene bodies. Here, we have developed a computational pipeline for circular and linear RNA expression analysis from ribosomal-RNA depleted RNA-seq (CLEAR, https://github.com/YangLab/CLEAR). A new quantitation parameter, FPB (fragments per billion mapped bases), is applied to evaluate circular and linear RNA expression individually by fragments mapped to circRNA-specific BSJ sites or to linear RNA-specific splicing junction (SJ) sites. Then, circular and linear RNA expression are directly compared by dividing FPBcirc by FPBlinear to generate a CIRCscore, which indicates the relative circRNA expression using linear RNA expression as the background. Highly expressed circRNAs with low cognate linear RNA expression background can be identified for further investigation.
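The FPB and CIRCscore quantities described in this abstract can be illustrated with a minimal Python sketch. This is not the CLEAR implementation: the function names, the way fragments are counted per junction, and the handling of a zero linear background are all assumptions made for illustration. It only shows the arithmetic the abstract states, namely fragments scaled per billion mapped bases, then FPBcirc divided by FPBlinear.

```python
def fpb(junction_fragments, total_mapped_bases):
    """Fragments per billion mapped bases (hypothetical helper, not CLEAR's API)."""
    if total_mapped_bases <= 0:
        raise ValueError("total_mapped_bases must be positive")
    return junction_fragments * 1e9 / total_mapped_bases


def circ_score(bsj_fragments, sj_fragments, total_mapped_bases):
    """Ratio of FPBcirc to FPBlinear for one gene locus.

    Returns None when no linear junction fragments were observed, since the
    ratio is undefined in that case (an assumption of this sketch).
    """
    fpb_circ = fpb(bsj_fragments, total_mapped_bases)
    fpb_linear = fpb(sj_fragments, total_mapped_bases)
    if fpb_linear == 0:
        return None
    return fpb_circ / fpb_linear


# Example with made-up counts: 50 BSJ fragments, 200 SJ fragments,
# 5 billion mapped bases in the library.
score = circ_score(50, 200, 5e9)
```

Because both FPB values come from the same library, the mapped-base total cancels in the ratio; computing FPB explicitly is still useful when circular and linear expression are reported separately.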

The value of genotype-specific reference for transcriptome analyses

10.1101/2021.09.14.460213 ◽

2021 ◽

Author(s):

Wenbin Guo ◽

Max Coulter ◽

Robbie Waugh ◽

Runxuan Zhang

Keyword(s):

Alternative Splicing ◽

Reference Genome ◽

Transcriptome Assembly ◽

Specific Reference ◽

Rna Seq ◽

High Quality ◽

Common Reference ◽

Transcript Quantification ◽

Gene Level ◽

Reference Transcript

High-quality transcriptome assembly using short reads from RNA-seq data still relies heavily upon reference-based approaches, of which the primary step is to align RNA-seq reads to a single reference genome of haploid sequence. However, it is increasingly apparent that while different genotypes within a species share core genes, they also contain variable numbers of specific genes that are only present in a subset of individuals. Using a common reference may thus lead to a loss of genotype-specific information in the assembled transcript dataset and the generation of erroneous, incomplete or misleading transcriptomics analysis results. With the recent development of pan-genome information in many species, it is important that we understand the limitations of single-genotype references for transcriptomics analysis. In this study, we quantitatively evaluated the advantages of using genotype-specific reference genomes for transcriptome assembly and analysis using cultivated barley as a model. We mapped barley cultivar Barke RNA-seq reads to the Barke genome and to the cultivar Morex genome (common barley genome reference) to construct a genotype-specific Reference Transcript Dataset (sRTD) and a common Reference Transcript Dataset (cRTD), respectively. We compared the two RTDs according to their transcript diversity, transcript sequence and structure similarity and the accuracy they provided for transcript quantification and differential expression analysis. Our evaluation shows that the sRTD has a significantly higher diversity of transcripts and alternative splicing events. Despite the use of a high-quality reference genome for assembly, the cRTD misses ca. 40% of the transcripts present in the sRTD, and only ca. 70% of its assemblies are true. We found that the sRTD is more accurate for transcript quantification as well as differential expression and differential alternative splicing analysis. However, gene-level quantification and comparative expression analysis are less affected by the source RTD, which indicates that analysing transcriptomic data at the gene level may be a reasonable compromise when a high-quality genotype-specific reference is not available.

Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-Seq datasets.

10.1101/2021.05.20.444982 ◽

2021 ◽

Author(s):

Sebastien Riquier ◽

Chloe Bessiere ◽

Benoit Guibert ◽

Anne-Laure Bouge ◽

Anthony Boureux ◽

...

Keyword(s):

Gene Expression ◽

Large Datasets ◽

Rna Seq ◽

Transcript Quantification ◽

Human Genes ◽

Novel Transcripts ◽

New Biomarkers ◽

Computational Resources ◽

User Friendly ◽

Health Applications

The huge body of publicly available RNA-seq libraries is a treasure trove of functional information that makes it possible to quantify the expression of known or novel transcripts in tissues. However, transcript quantification commonly relies on alignment methods that require substantial computational resources and processing time and do not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify gene expression accurately in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets and quickly visualize the characteristics of large datasets. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measurement of gene expression with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses k-mers specific to a set of predictor genes to infer metadata, including library protocol, sample features or contaminations, from RNA-seq datasets. KmerExploR results are visualised through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.
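The idea of a specific k-mer signature can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Kmerator works internally: the function names are hypothetical, sequences are held in memory as plain strings, reverse complements are ignored, and the background is a short list rather than a whole genome or transcriptome index.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def specific_kmers(target, background, k):
    """k-mers found in the target sequence but in none of the background sequences."""
    seen_elsewhere = set()
    for other in background:
        seen_elsewhere |= kmers(other, k)
    return kmers(target, k) - seen_elsewhere


def count_hits(reads, signature, k):
    """Count k-mer occurrences in reads that belong to the signature."""
    hits = 0
    for read in reads:
        for i in range(len(read) - k + 1):
            if read[i:i + k] in signature:
                hits += 1
    return hits


# Toy example with k = 3 (real signatures use much longer k-mers).
signature = specific_kmers("ACGTAC", ["CGTAGG"], 3)
hits = count_hits(["TTACGA"], signature, 3)
```

Counting signature hits directly in raw reads is what allows alignment to be skipped; normalizing the hit count by library size would be the next step before comparing samples.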

RiboPlotR: a visualization tool for periodic Ribo-seq reads

Plant Methods ◽

10.1186/s13007-021-00824-4 ◽

2021 ◽

Vol 17 (1) ◽

Author(s):

Hsin-Yen Larry Wu ◽

Polly Yingshan Hsu

Keyword(s):

Mrna Translation ◽

Untranslated Regions ◽

Rna Seq ◽

Color Coding ◽

Link Type ◽

Genome Wide ◽

Non Coding Rnas ◽

Gene Structures ◽

Small Orfs ◽

Reading Frames

Abstract. Background: Ribo-seq has revolutionized the study of genome-wide mRNA translation. High-quality Ribo-seq data display strong 3-nucleotide (nt) periodicity, which corresponds to translating ribosomes deciphering three nts at a time. While 3-nt periodicity has been widely used to study novel translation events such as upstream ORFs in 5′ untranslated regions and small ORFs in presumed non-coding RNAs, tools that allow the visualization of these events remain underdeveloped. Results: RiboPlotR is a visualization package written in R that presents both RNA-seq coverage and Ribo-seq reads in genomic coordinates for all annotated transcript isoforms of a gene. Specifically, for individual isoform models, RiboPlotR plots Ribo-seq data in the context of gene structures, including 5′ and 3′ untranslated regions and introns, and it presents the reads for all three reading frames in three different colors. The inclusion of gene structures and color-coding the reading frames facilitate observing new translation events and identifying potential regulatory mechanisms. Conclusions: RiboPlotR is freely available (https://github.com/hsinyenwu/RiboPlotR and https://sourceforge.net/projects/riboplotr/) and allows the visualization of translated features identified in Ribo-seq data.
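The 3-nt periodicity mentioned in this abstract comes down to assigning each ribosome footprint to one of three reading frames. The Python sketch below is an illustration only and is unrelated to the RiboPlotR code, which is written in R. It assumes that footprints have already been reduced to P-site positions in spliced transcript coordinates on the coding strand, and the function names are hypothetical.

```python
from collections import Counter


def frame_counts(psite_positions, cds_start):
    """Count P-site positions falling in frames 0, 1 and 2 relative to the CDS start."""
    counts = Counter((pos - cds_start) % 3 for pos in psite_positions)
    return [counts.get(frame, 0) for frame in range(3)]


def in_frame_fraction(psite_positions, cds_start):
    """Fraction of positions in frame 0; values well above 1/3 suggest periodicity."""
    counts = frame_counts(psite_positions, cds_start)
    total = sum(counts)
    return counts[0] / total if total else 0.0


# Made-up positions for a CDS starting at transcript coordinate 100.
counts = frame_counts([100, 103, 106, 101, 109], 100)
fraction = in_frame_fraction([100, 103, 106, 101, 109], 100)
```

Plotting the three per-frame counts in three colors along the transcript is the kind of view the abstract describes; a dominant frame 0 inside an annotated ORF is the usual sign of active translation.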

A comprehensive online database for exploring ~20,000 public Arabidopsis RNA-Seq libraries

10.1101/844522 ◽

2019 ◽

Author(s):

Hong Zhang ◽

Fei Zhang ◽

Li Feng ◽

Jinbu Jia ◽

Jixian Zhai

Keyword(s):

Transcriptional Regulation ◽

Transcriptome Profiling ◽

Research Community ◽

Online Database ◽

Rna Seq ◽

Huge Amount ◽

Genome Wide ◽

Wide Scale ◽

Computational Resources ◽

User Friendly

Abstract: Application of Next Generation Sequencing (NGS) technology in transcriptome profiling has greatly improved our understanding of transcriptional regulation at genome-wide scale in the last decade, and tens of thousands of RNA-sequencing (RNA-seq) libraries have been produced by the research community. However, accessing such a huge amount of RNA-seq data poses a big challenge for groups that lack dedicated bioinformatic personnel or expensive computational resources. Here, we introduce the Arabidopsis RNA-seq database (ARS), a free, web-accessible, and user-friendly resource for quickly exploring the expression level of any gene in 20,000+ publicly available Arabidopsis RNA-seq libraries.

Ribo-Seq and RNA-Seq of TMA46 (DFRP1) and GIR2 (DFRP2) knockout yeast strains

F1000Research ◽

10.12688/f1000research.74727.1 ◽

2021 ◽

Vol 10 ◽

pp. 1162

Author(s):

Artyom A. Egorov ◽

Desislava S. Makeeva ◽

Nadezhda E. Makarova ◽

Dmitri A. Bykov ◽

Yanislav S. Hrytseniuk ◽

...

Keyword(s):

Mrna Translation ◽

Gene Expression Omnibus ◽

Ribosome Profiling ◽

Rna Seq ◽

General Stress Response ◽

Ncbi Gene Expression Omnibus ◽

Ribosome Stalling ◽

Link Type ◽

Gene Level

In eukaryotes, stalled and collided ribosomes are recognized by several conserved multicomponent systems, which either block protein synthesis in situ and resolve the collision locally, or trigger a general stress response. Yeast ribosome-binding GTPases RBG1 (DRG1 in mammals) and RBG2 (DRG2) form two distinct heterodimers with TMA46 (DFRP1) and GIR2 (DFRP2), respectively, both involved in mRNA translation. Accumulated evidence suggests that the dimers play partially redundant roles in elongation processivity and resolution of ribosome stalling and collision events, as well as in the regulation of GCN1-mediated signaling involved in ribosome-associated quality control (RQC). They also genetically interact with the SLH1 (ASCC3) helicase, a key component of the RQC trigger (RQT) complex that disassembles collided ribosomes. Here, we present RNA-Seq and ribosome profiling (Ribo-Seq) data from S. cerevisiae strains with individual deletions of the TMA46 and GIR2 genes. Raw RNA-Seq and Ribo-Seq data as well as gene-level read counts are available in the NCBI Gene Expression Omnibus (GEO) repository under GEO accessions GSE185458 and GSE185286.

AtRTD2: A Reference Transcript Dataset for accurate quantification of alternative splicing and expression changes in Arabidopsis thaliana RNA-seq data

10.1101/051938 ◽

2016 ◽

Cited By ~ 4

Author(s):

Runxuan Zhang ◽

Cristiane P. G. Calixto ◽

Yamile Marquez ◽

Peter Venhuizen ◽

Nikoleta A. Tzioutziou ◽

...

Keyword(s):

Gene Expression ◽

Experimental Data ◽

Alternative Splicing ◽

Rna Seq ◽

Protein Coding ◽

Transcript Isoforms ◽

Transcript Quantification ◽

Protein Coding Genes ◽

Genome Wide ◽

Reference Transcript

Abstract. Background: Alternative splicing is the major post-transcriptional mechanism by which gene expression is regulated and affects a wide range of processes and responses in most eukaryotic organisms. RNA-sequencing (RNA-seq) can generate genome-wide quantification of individual transcript isoforms to identify changes in expression and alternative splicing. RNA-seq is an essential modern tool but its ability to accurately quantify transcript isoforms depends on the diversity, completeness and quality of the transcript information. Results: We have developed a new Reference Transcript Dataset for Arabidopsis (AtRTD2) for RNA-seq analysis containing over 82k non-redundant transcripts, of which 74,194 originate from 27,667 protein-coding genes. A total of 13,524 protein-coding genes have at least one alternatively spliced transcript in AtRTD2 such that about 60% of the 22,453 protein-coding, intron-containing genes in Arabidopsis undergo alternative splicing. More than 600 putative U12 introns were identified in more than 2,000 transcripts. AtRTD2 was generated from transcript assemblies of ca. 8.5 billion pairs of reads from 285 RNA-seq data sets obtained from 129 RNA-seq libraries and merged along with the previous version, AtRTD, and Araport11 transcript assemblies. AtRTD2 increases the diversity of transcripts and through application of stringent filters represents the most extensive and accurate transcript collection for Arabidopsis to date. We have demonstrated a generally good correlation of alternative splicing ratios from RNA-seq data analysed by Salmon and experimental data from high-resolution RT-PCR. However, we have observed inaccurate quantification of transcript isoforms for genes with multiple transcripts which have variation in the lengths of their UTRs. This variation is not effectively corrected in RNA-seq analysis programmes and will therefore impact RNA-seq analyses generally. To address this, we have tested different genome-wide modifications of AtRTD2 to improve transcript quantification and alternative splicing analysis. As a result, we release AtRTD2-QUASI specifically for use in Quantification of Alternatively Spliced Isoforms and demonstrate that it outperforms other available transcriptomes for RNA-seq analysis. Conclusions: We have generated a new transcriptome resource for RNA-seq analyses in Arabidopsis (AtRTD2) designed to address quantification of different isoforms and alternative splicing in gene expression studies. Experimental validation of alternative splicing changes identified inaccuracies in transcript quantification due to UTR length variation. To solve this problem, we also release a modified reference transcriptome, AtRTD2-QUASI, for quantification of transcript isoforms, which shows high correlation with experimental data.

Compression of quantification uncertainty for scRNA-seq counts

10.1101/2020.07.06.189639 ◽

2020 ◽

Author(s):

Scott Van Buren ◽

Hirak Sarkar ◽

Avi Srivastava ◽

Naim U. Rashid ◽

Rob Patro ◽

...

Keyword(s):

Negative Binomial ◽

General Procedure ◽

Computation Time ◽

Statistical Testing ◽

Rna Seq ◽

Testing Framework ◽

Link Type ◽

Gene Level ◽

Mean And Variance ◽

Improved Performance

Abstract. Motivation: Quantification estimates of gene expression from single-cell RNA-seq (scRNA-seq) data have inherent uncertainty due to reads that map to multiple genes. Many existing scRNA-seq quantification pipelines ignore multi-mapping reads and therefore underestimate expected read counts for many genes. alevin accounts for multi-mapping reads and allows for the generation of “inferential replicates”, which reflect quantification uncertainty. Previous methods have shown improved performance when incorporating these replicates into statistical analyses, but storage and use of these replicates increases computation time and memory requirements. Results: We demonstrate that storing only the mean and variance from a set of inferential replicates (“compression”) is sufficient to capture gene-level quantification uncertainty. Using these values, we generate “pseudo-inferential” replicates from a negative binomial distribution and propose a general procedure for incorporating these replicates into a proposed statistical testing framework. We show reduced false positives when applying this procedure to trajectory-based differential expression analyses. We additionally extend the Swish method to incorporate pseudo-inferential replicates and demonstrate improvements in computation time and memory consumption without any loss in performance. Lastly, we show that the removal of multi-mapping reads can result in significant underestimation of counts for functionally important genes in a real dataset. Availability and implementation: makeInfReps and splitSwish are implemented in the development branch of the R/Bioconductor fishpond package available at http://bioconductor.org/packages/devel/bioc/html/fishpond.html. Sample code to calculate the uncertainty-aware p-values can be found on GitHub at https://github.com/skvanburen/[email protected]
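The compression idea in this abstract, keeping only a mean and a variance per gene and cell and then regenerating replicates from a negative binomial distribution, can be sketched in Python as follows. This is a hedged illustration and not the makeInfReps implementation from fishpond. The method-of-moments parameterization, the Poisson fallback when the variance does not exceed the mean, the simple Poisson sampler, and all function names are assumptions of this sketch.

```python
import math
import random


def nb_params(mean, variance):
    """Method-of-moments negative binomial (size, prob); needs variance > mean > 0."""
    if not (variance > mean > 0):
        raise ValueError("negative binomial requires variance > mean > 0")
    size = mean ** 2 / (variance - mean)
    prob = mean / variance
    return size, prob


def _poisson(lam, rng):
    # Knuth's multiplication method; adequate for the small rates used here.
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def pseudo_replicates(mean, variance, n_reps, seed=0):
    """Draw pseudo-inferential replicate counts from the stored mean and variance."""
    rng = random.Random(seed)
    if variance <= mean:
        # No overdispersion to model: fall back to Poisson (assumption of this sketch).
        return [_poisson(mean, rng) for _ in range(n_reps)]
    size, prob = nb_params(mean, variance)
    scale = (1 - prob) / prob
    # A negative binomial is a Poisson whose rate is gamma distributed.
    return [_poisson(rng.gammavariate(size, scale), rng) for _ in range(n_reps)]


# Made-up summary statistics for one gene in one cell.
reps = pseudo_replicates(mean=10, variance=30, n_reps=2000)
```

Storing two numbers per gene and cell instead of a full set of replicates is the source of the memory savings; the replicates are regenerated only when a downstream test needs them.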
