Magic-BLAST, an accurate DNA and RNA-seq aligner for long and short reads

10.1101/390013 ◽

2018 ◽

Cited By ~ 10

Author(s):

Grzegorz M Boratyn ◽

Jean Thierry-Mieg ◽

Danielle Thierry-Mieg ◽

Ben Busby ◽

Thomas L Madden

Keyword(s):

Data Sets ◽

Rna Seq ◽

Seed Selection ◽

Sequencing Technologies ◽

Dna And Rna ◽

Spliced Alignment ◽

Long Reads ◽

Wide Range ◽

Blast Database ◽

Innovative Techniques

Abstract: Next-generation sequencing technologies can produce tens of millions of reads, often paired-end, from transcripts or genomes. But few programs can align RNA to the genome and accurately discover introns, especially with long reads. We introduce Magic-BLAST, a new aligner based on ideas from the Magic pipeline. It uses innovative techniques that include the optimization of a spliced alignment score and selective masking during seed selection. We evaluate how accurately Magic-BLAST maps short or long sequences and how well it discovers introns on real RNA-seq data sets from PacBio, Roche and Illumina runs, and on six benchmarks, and compare it to other popular aligners. Additionally, we look at alignments of human idealized RefSeq mRNA sequences perfectly matching the genome. We show that Magic-BLAST is the best at intron discovery over a wide range of conditions and the best at mapping reads longer than 250 bases, from any platform. It is versatile and robust to high levels of mismatches or extreme base composition, and reasonably fast. It can align reads to a BLAST database or a FASTA file. It can accept a FASTQ file as input or automatically retrieve an accession from the SRA repository at the NCBI.
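
The abstract does not say how intron discovery was scored. As a minimal sketch, assuming introns are compared as exact (chromosome, start, end, strand) tuples against a truth set, precision and recall could be computed as below. The function name `intron_metrics` and the toy coordinates are hypothetical and are not taken from the paper.

```python
def intron_metrics(predicted, truth):
    """Return (precision, recall) for exact-coordinate intron matches.

    Both arguments are iterables of (chrom, start, end, strand) tuples.
    Exact matching is an assumption made for this illustration.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data: one predicted intron is off by one base at its 3' end.
truth = {("chr1", 100, 200, "+"), ("chr1", 300, 450, "+"), ("chr2", 50, 90, "-")}
predicted = {("chr1", 100, 200, "+"), ("chr1", 300, 451, "+")}

precision, recall = intron_metrics(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact matching is strict: the off-by-one prediction counts as both a false positive and a missed intron, which is why boundary accuracy matters when comparing aligners on this kind of measure.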

MUREN: a robust and multi-reference approach of RNA-seq transcript normalization

BMC Bioinformatics ◽

10.1186/s12859-021-04288-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yance Feng ◽

Lei M. Li

Keyword(s):

Biological Significance ◽

Housekeeping Genes ◽

R Package ◽

Data Sets ◽

Statistical Regression ◽

Rna Seq ◽

Least Trimmed Squares ◽

Standard Data ◽

Wide Range ◽

Multiple References

Abstract: Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable. Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. The assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell data on the cell cycle. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub at github.com/hippo-yf/MUREN and on conda at anaconda.org/hippo-yf/r-muren. Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression derived from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
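
To illustrate why least trimmed squares (LTS) is robust, here is a minimal sketch of the simplest case, a single location parameter, written in Python rather than R. It is not MUREN's implementation: the function `lts_location`, the `keep_fraction` parameter, and the toy log-ratios are assumptions for illustration only. For a location model, the LTS estimate is the mean of the contiguous window of sorted values with the smallest sum of squared deviations.

```python
import math


def lts_location(values, keep_fraction=0.5):
    """Least-trimmed-squares estimate of a single location parameter.

    Keeps the h = ceil(keep_fraction * n) points whose squared residuals
    are smallest. For a location model the optimal subset is always a
    contiguous window of the sorted values, so every window is scanned.
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("values must not be empty")
    h = max(1, math.ceil(keep_fraction * n))
    best_mean, best_ss = None, float("inf")
    for i in range(n - h + 1):
        window = xs[i:i + h]
        mean = sum(window) / h
        ss = sum((x - mean) ** 2 for x in window)
        if ss < best_ss:
            best_mean, best_ss = mean, ss
    return best_mean


# Hypothetical per-gene log-ratios between a sample and a reference:
# most genes share an offset near 0.5; three are differentially expressed.
log_ratios = [0.48, 0.5, 0.52, 0.49, 0.51, 3.0, -2.5, 2.2]
offset = lts_location(log_ratios, keep_fraction=0.6)
print(round(offset, 3))
```

The three outlying genes do not influence the estimated offset, which loosely mirrors the idea that stably expressed genes, rather than differentially expressed ones, should drive the scaling.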

CAMISIM: Simulating metagenomes and microbial communities

10.1101/300970 ◽

2018 ◽

Cited By ~ 4

Author(s):

Adrian Fritz ◽

Peter Hofmann ◽

Stephan Majda ◽

Eik Dahms ◽

Johannes Dröge ◽

...

Keyword(s):

Microbial Communities ◽

De Novo ◽

Real Data ◽

Small Data ◽

Data Sets ◽

Sequencing Data ◽

Taxonomic Profiling ◽

Benchmark Data ◽

Sequencing Technologies ◽

Wide Range

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets is required. Here, we describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series and differential abundance studies, includes real and simulated strain-level diversity, and generates second and third generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with truth standards for method evaluation. All data sets and the software are freely available at: https://github.com/CAMI-challenge/CAMISIM

HISAT: Hierarchical Indexing for Spliced Alignment of Transcripts

10.1101/012591 ◽

2014 ◽

Cited By ~ 2

Author(s):

Daehwan Kim ◽

Ben Langmead ◽

Steven Salzberg

Keyword(s):

Human Genome ◽

Simulated Data ◽

Genomic Region ◽

Data Sets ◽

Rna Seq ◽

Efficient System ◽

Spliced Alignment ◽

Burrows Wheeler Transform ◽

Simulated Data Sets ◽

Free Open Source

HISAT is a new, highly efficient system for alignment of sequences from RNA sequencing experiments that achieves dramatically faster performance than previous methods. HISAT uses a new indexing scheme, hierarchical indexing, which is based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index. Hierarchical indexing employs two types of indexes for alignment: (1) a whole-genome FM index to anchor each alignment, and (2) numerous local FM indexes for very rapid extensions of these alignments. HISAT's hierarchical index for the human genome contains 48,000 local FM indexes, each representing a genomic region of ~64,000 bp. The algorithm includes several customized alignment strategies specifically designed for mapping RNA-seq reads across multiple exons. In tests on a variety of real and simulated data sets, we show that HISAT is the fastest system currently available, approximately 50 times faster than TopHat2 and 12 times faster than GSNAP, with equal or better accuracy than any other method. Despite its very large number of indexes, HISAT requires only 4.3 Gigabytes of memory to align reads to the human genome. HISAT supports genomes of any size, including those larger than 4 billion bases. HISAT is available as free, open-source software from http://www.ccb.jhu.edu/software/hisat.
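
For readers unfamiliar with the FM index, the sketch below shows the textbook construction (Burrows-Wheeler transform plus occurrence counts) and exact-match backward search on a toy string. It is a teaching example only: it builds the suffix array naively, stores uncompressed count tables, and does not attempt HISAT's hierarchical global/local scheme. The names `build_fm_index` and `find` are hypothetical.

```python
def build_fm_index(text):
    """Build a naive FM index: suffix array, first-column offsets, occ table."""
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around at i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first_col_start[c]: number of characters in text strictly smaller than c.
    first_col_start, total = {}, 0
    for c in alphabet:
        first_col_start[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first_col_start, occ


def find(pattern, index):
    """Return sorted start positions of exact matches via backward search."""
    suffix_array, first_col_start, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first_col_start:
            return []
        lo = first_col_start[c] + occ[c][lo]
        hi = first_col_start[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
print(find("ATTA", index))
print(find("CAT", index))
```

Each pattern character costs a constant number of table lookups regardless of text length, which is the property that makes FM-index anchoring fast. Production aligners replace the full `occ` table with sampled checkpoints to save memory.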

The Impact of Normalization Methods on RNA-Seq Data Analysis

BioMed Research International ◽

10.1155/2015/621690 ◽

2015 ◽

Vol 2015 ◽

pp. 1-10 ◽

Cited By ~ 44

Author(s):

J. Zyprych-Walczak ◽

A. Szabelska ◽

L. Handschuh ◽

K. Górczak ◽

K. Klamecka ◽

...

Keyword(s):

High Throughput Sequencing ◽

Data Sets ◽

Complex Data ◽

Rna Seq ◽

Medical Problems ◽

Data Set ◽

Normalization Methods ◽

Wide Range ◽

The Impact ◽

Selection Of

High-throughput sequencing technologies, such as the Illumina Hi-seq, are powerful new tools for investigating a wide range of biological and medical problems. The massive and complex data sets produced by the sequencers create a need for statistical and computational methods that can handle the analysis and management of the data. Data normalization is one of the most crucial steps of data processing and must be considered carefully, as it has a profound effect on the results of the analysis. In this work, we focus on a comprehensive comparison of five normalization methods related to sequencing depth, widely used for transcriptome sequencing (RNA-seq) data, and their impact on the results of gene expression analysis. Based on this study, we suggest a universal workflow that can be applied to select the optimal normalization procedure for any particular data set. The described workflow includes calculation of the bias and variance values for the control genes, the sensitivity and specificity of the methods, and classification errors, as well as generation of diagnostic plots. Combining this information facilitates the selection of the most appropriate normalization method for the studied data sets and determines which methods can be used interchangeably.
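
The abstract does not name the five methods it compares. As a minimal sketch of what a depth-related normalization does, the example below applies total-count scaling (counts per million), one common choice and an assumption here. The function name `counts_per_million` and the toy counts are hypothetical.

```python
def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million.

    `counts` maps gene name -> raw read count for a single sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


# Two toy samples with the same composition but different sequencing depth.
shallow = {"geneA": 500, "geneB": 1500, "geneC": 8000}
deep = {"geneA": 5000, "geneB": 15000, "geneC": 80000}

cpm_shallow = counts_per_million(shallow)
cpm_deep = counts_per_million(deep)
print(cpm_shallow["geneA"], cpm_deep["geneA"])
```

After scaling, the two samples are directly comparable even though one was sequenced ten times deeper. Total-count scaling is sensitive to a few highly expressed genes, which is one reason alternative methods exist and why a comparison workflow is useful.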

Comprehensive genomic resources related to domestication and crop improvement traits in Lima bean

Nature Communications ◽

10.1038/s41467-021-20921-1 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Tatiana Garcia ◽

Jorge Duitama ◽

Stephanie Smolenski Zullo ◽

Juanita Gil ◽

Andrea Ariani ◽

...

Keyword(s):

Climate Change ◽

Crop Improvement ◽

Lima Bean ◽

Phaseolus Lunatus ◽

Rna Seq ◽

Sequencing Technologies ◽

Biparental Population ◽

Wide Range ◽

Genomic Analyses ◽

Intrachromosomal Rearrangements

Abstract: Lima bean (Phaseolus lunatus L.), one of the five domesticated Phaseolus bean crops, shows a wide range of ecological adaptations along its distribution range from Mexico to Argentina. These adaptations make it a promising crop for improving food security under predicted scenarios of climate change in Latin America and elsewhere. In this work, we combine long and short read sequencing technologies with a dense genetic map from a biparental population to obtain the chromosome-level genome assembly for Lima bean. Annotation of 28,326 gene models shows high diversity among 1917 genes with conserved domains related to disease resistance. Structural comparison across 22,180 orthologs with common bean reveals high genome synteny and five large intrachromosomal rearrangements. Population genomic analyses show that wild Lima bean is organized into six clusters with mostly non-overlapping distributions and that Mesoamerican landraces can be further subdivided into three subclusters. RNA-seq data reveal 4275 differentially expressed genes, which can be related to pod dehiscence and seed development. We expect the resources presented here to serve as a solid basis to achieve a comprehensive view of the degree of convergent evolution of Phaseolus species under domestication and provide tools and information for breeding for climate change resiliency.

A reference genome for the critically endangered woylie, Bettongia penicillata ogilbyi

10.1101/2021.12.07.471656 ◽

2021 ◽

Author(s):

Emma Peel ◽

Luke Silver ◽

Parice Brandies ◽

Carolyn J Hogg ◽

Katherine Belov

Keyword(s):

Wildlife Conservation ◽

Reference Genome ◽

Conservation Management ◽

Critically Endangered ◽

Rna Seq ◽

Genome Sequences ◽

Australian Species ◽

Sequencing Technologies ◽

Bettongia Penicillata ◽

Long Reads

Biodiversity is declining globally, and Australia has one of the worst extinction records for mammals. The development of sequencing technologies means that genomic approaches are now available as important tools for wildlife conservation and management. Despite this, genome sequences are available for only 5% of threatened Australian species. Here we report the first reference genome for the woylie (Bettongia penicillata ogilbyi), a critically endangered marsupial from Western Australia, and the first genome within the Potoroidae family. The woylie reference genome was generated using Pacific Biosciences HiFi long-reads, resulting in a 3.39 Gbp assembly with a scaffold N50 of 6.49 Mbp and 86.5% complete mammalian BUSCOs. Assembly of a global transcriptome from pouch skin, tongue, heart and blood RNA-seq reads was used to guide annotation with Fgenesh++, resulting in the annotation of 24,655 genes. The woylie reference genome is a valuable resource for conservation, management and investigations into disease-induced decline of this critically endangered marsupial.
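
The scaffold N50 quoted above is a standard contiguity statistic: the length of the scaffold at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size. A minimal sketch follows; the function name `n50` and the toy lengths are hypothetical and unrelated to the woylie assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy assembly of seven scaffolds totalling 300 units.
scaffold_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(scaffold_lengths))
```

Here the two longest scaffolds (80 + 70) already cover half of the 300-unit assembly, so the N50 is 70.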

Sequencing Enabling Design and Learning in Synthetic Biology

10.20944/preprints202002.0243.v1 ◽

2020 ◽

Author(s):

Pierre-Aurelien Gilliot ◽

Thomas E. Gorochowski

Keyword(s):

Machine Learning ◽

Synthetic Biology ◽

De Novo ◽

Data Sets ◽

New Wave ◽

Design Rules ◽

Biological Design ◽

Sequencing Technologies ◽

Dna And Rna

The ability to read and quantify nucleic acids such as DNA and RNA using sequencing technologies has revolutionized our understanding of life. With the emergence of synthetic biology, these tools are now being put to work in new ways - enabling de novo biological design. Here, we show how sequencing is supporting the creation of a new wave of biological parts and systems, as well as providing the vast data sets needed for the machine learning of design rules for predictive bioengineering. However, we believe this is only the tip of the iceberg and end by providing an outlook on recent advances that will likely broaden the role of sequencing in synthetic biology and its deployment in real-world environments.

TALC: Transcript-level Aware Long Read Correction

10.1101/2020.01.10.901728 ◽

2020 ◽

Cited By ~ 1

Author(s):

Lucile Broseus ◽

Aubin Thomas ◽

Andrew J. Oldfield ◽

Dany Severac ◽

Emeric Dubois ◽

...

Keyword(s):

Transcriptome Sequencing ◽

Transcript Level ◽

De Bruijn Graph ◽

Rna Seq ◽

Sequencing Data ◽

Sequencing Technologies ◽

Long Reads ◽

Long Read ◽

De Bruijn ◽

Rna Transcript

Abstract: Motivation: Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous “hybrid correction” algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited for correcting more complex transcriptome sequencing data. Results: We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction) which models changes in RNA expression and isoform representation in a weighted de Bruijn graph to correct long reads from transcriptome studies. We show that transcription aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long read technology. Availability and Implementation: TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/[email protected]
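
The abstract does not describe how TALC weights its graph. As a rough sketch of the underlying data structure only, the example below builds a de Bruijn graph from short reads and weights each edge by how often its k-mer is observed, which is an assumption made for illustration. The function name `weighted_de_bruijn` and the toy reads are hypothetical, and the real tool is written in C++.

```python
from collections import Counter


def weighted_de_bruijn(reads, k):
    """Build a de Bruijn graph whose edge weights are k-mer counts.

    Nodes are (k-1)-mers; each k-mer contributes one edge from its prefix
    to its suffix. Using the raw count as the weight is an assumption.
    """
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return edges


# Toy short reads: two agree on ...CGTA..., one carries a variant ...CGTT...
reads = ["ACGTAC", "CGTACG", "ACGTTC"]
graph = weighted_de_bruijn(reads, k=4)

for (source, target), weight in sorted(graph.items()):
    print(f"{source} -> {target}  weight={weight}")
```

At the branching node `CGT`, the edge to `GTA` has weight 2 and the edge to `GTT` has weight 1. In transcriptome data such a branch may reflect either a sequencing error or a less abundant isoform, which is why coverage alone is a weaker guide than in genomic data.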

HISAT-3N: a rapid and accurate three-nucleotide sequence aligner

10.1101/2020.12.15.422906 ◽

2020 ◽

Author(s):

Yun Zhang ◽

Chanhee Park ◽

Christopher Bennett ◽

Micah Thornton ◽

Daehwan Kim

Keyword(s):

Nucleotide Sequence ◽

Simulated Data ◽

Alignment Accuracy ◽

Data Sets ◽

Cellular Processes ◽

Sequencing Technologies ◽

Spliced Alignment ◽

Hierarchical Index ◽

Simulated Data Sets ◽

The Ideal

Nucleotide conversion sequencing technologies such as bisulfite-seq and SLAM-seq are powerful tools to explore the intricacies of cellular processes. In this paper, we describe HISAT-3N (hierarchical indexing for spliced alignment of transcripts - 3 nucleotides), which rapidly and accurately aligns sequences containing nucleotide conversions by leveraging the powerful hierarchical index and repeat index algorithms originally developed for the HISAT software. Tests on real and simulated data sets demonstrate that HISAT-3N is over 7 times faster, has greater alignment accuracy, and has smaller memory requirements than other modern systems. Taken together, HISAT-3N is the ideal aligner for use with converted sequence technologies.
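
The general idea behind three-nucleotide alignment can be shown in a few lines: if the converted base is rewritten in both the read and the reference, conversion events no longer appear as mismatches. The sketch below assumes a C-to-T conversion, as in bisulfite sequencing, and compares equal-length strings directly. It illustrates the principle only and does not reflect HISAT-3N's indexing or its handling of strands. The function names are hypothetical.

```python
def to_three_letter(seq, source="C", target="T"):
    """Collapse the alphabet by rewriting every `source` base as `target`."""
    return seq.upper().replace(source, target)


def matches_after_conversion(read, reference, source="C", target="T"):
    """True if read and reference are identical in the reduced alphabet."""
    return to_three_letter(read, source, target) == to_three_letter(
        reference, source, target
    )


reference = "ACGTCCA"
converted_read = "ATGTCTA"  # two C->T conversions relative to the reference
unrelated_read = "ACGACCA"  # a true T->A mismatch at position 3

print(matches_after_conversion(converted_read, reference))
print(matches_after_conversion(unrelated_read, reference))
```

Conversions are hidden by the reduced alphabet while genuine mismatches remain visible. The cost is reduced sequence complexity, so more reads map ambiguously, which is one reason specialized indexing helps for this data type.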

Processing Oxford Nanopore Long Reads Using Amazon Web Services

Biomedical Chemistry Research and Methods ◽

10.18097/bmcrm00131 ◽

2020 ◽

Vol 3 (4) ◽

pp. e00131

Author(s):

V.V. Shapovalova ◽

S.P. Radko ◽

K.G. Ptitsyn ◽

G.S. Krasnov ◽

K.V. Nakhod ◽

...

Keyword(s):

Web Services ◽

Primary Data ◽

Rna Sequences ◽

Computing Power ◽

Dna And Rna ◽

Long Reads ◽

Oxford Nanopore ◽

Wide Range ◽

Amazon Web Services ◽

Selection Of

Studies of genomes and transcriptomes are performed using sequencers that read the sequence of nucleotide residues of genomic DNA, RNA, or complementary DNA (cDNA). The analysis consists of an experimental part (obtaining primary data) and bioinformatic processing of the primary data. The bioinformatic part is performed with different sets of input parameters, and selecting the optimal parameter values usually requires significant computing power. The article describes a protocol for processing transcriptome data on virtual computers provided by the cloud platform Amazon Web Services (AWS), using as an example the recently emerged technology for sequencing long DNA and RNA molecules (Oxford Nanopore Technology). As a result, a virtual machine and instructions for its use have been developed, allowing a wide range of molecular biologists to independently process results obtained with Oxford Nanopore sequencing.
