Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees

10.1101/017087 ◽

2015 ◽

Cited By ~ 6

Author(s):

Brad Solomon ◽

Carleton Kingsford

Keyword(s):

Large Scale ◽

Rna Seq ◽

Large Collection ◽

Short Read ◽

Indexing Scheme ◽

Short Read Sequencing ◽

Sequence Read Archive ◽

Gene Isoform ◽

Expressed Sequence ◽

Novel Isoforms

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. A natural question is which of these experiments contain sequences that indicate the expression of a particular sequence such as a gene isoform, lncRNA, or uORF. At present, however, this is a computationally demanding question at the scale of these databases. We introduce an indexing scheme, the Sequence Bloom Tree (SBT), to support sequence-based querying of terabase-scale collections of thousands of short-read sequencing experiments. We apply SBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues, comprising 5 terabytes of sequence. SBTs of this size can be queried for a 1000 nt sequence in 19 minutes using less than 300 MB of RAM, over 100 times faster than standard usage of SRA-BLAST and 119 times faster than STAR. SBTs allow for fast identification of experiments with expressed novel isoforms, even if these isoforms were unknown at the time the SBT was built. We also provide some theoretical guidance about appropriate parameter selection in SBT and propose a sampling-based scheme for potentially scaling SBT to even larger collections of files. While SBT can handle any set of reads, we demonstrate the effectiveness of SBT by searching a large collection of blood, brain, and breast RNA-seq files for all 214,293 known human transcripts to identify tissue-specific transcripts. The implementation used in the experiments below is in C++ and is available as open source at http://www.cs.cmu.edu/~ckingsf/software/bloomtree.
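The abstract above does not spell out how the index works, so the following is only a rough illustration of the general idea behind a tree of Bloom filters: each leaf summarizes the k-mers of one experiment, each internal node holds the union of its children, and a query descends only into subtrees whose filter contains at least a fraction theta of the query k-mers. It is a minimal Python sketch, not the authors' C++ implementation. The filter size, hash scheme, pairwise tree construction, and all names (`BloomFilter`, `Node`, `leaf`, `build`, `query`, `theta`) are assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


def kmers(seq: str, k: int) -> List[str]:
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit vector."""

    def __init__(self, size: int = 4096, num_hashes: int = 2):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item: str):
        # Derive num_hashes bit positions from seeded SHA-256 digests (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def union(self, other: "BloomFilter") -> "BloomFilter":
        merged = BloomFilter(self.size, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


@dataclass
class Node:
    bloom: BloomFilter
    name: Optional[str] = None      # set only on leaves (one experiment each)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def leaf(name: str, reads: List[str], k: int) -> Node:
    """Build a leaf whose filter holds all k-mers of one experiment's reads."""
    bf = BloomFilter()
    for read in reads:
        for km in kmers(read, k):
            bf.add(km)
    return Node(bf, name=name)


def build(leaves: List[Node]) -> Node:
    """Merge nodes pairwise, level by level (a simplification assumed here)."""
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(Node(a.bloom.union(b.bloom), left=a, right=b))
        if len(level) % 2:
            nxt.append(level[-1])   # odd node out is carried up unchanged
        level = nxt
    return level[0]


def query(node: Node, query_kmers: List[str], theta: float) -> List[str]:
    """Return names of leaves whose filter holds >= theta of the query k-mers."""
    hits = sum(1 for km in query_kmers if km in node.bloom)
    if hits < theta * len(query_kmers):
        return []                   # prune this whole subtree
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)
```

A query such as `query(root, kmers(transcript, 5), 0.9)` can return false positives, because Bloom filters may report absent k-mers as present, but it never misses an experiment that truly contains enough of the query k-mers.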

Improved Search of Large Transcriptomic Sequencing Databases Using Split Sequence Bloom Trees

10.1101/086561 ◽

2016 ◽

Author(s):

Brad Solomon ◽

Carl Kingsford

Keyword(s):

Population Variation ◽

Rna Seq ◽

Specific Expression ◽

Short Read ◽

Short Read Sequencing ◽

Split Sequence ◽

Transcriptomic Sequencing ◽

And Storage ◽

Expressed Sequence ◽

Over Time

Enormous databases of short-read RNA-seq sequencing experiments such as the NIH Sequence Read Archive (SRA) are now available. These databases could answer many questions about condition-specific expression or population variation, and this resource is only going to grow over time. However, these collections remain difficult to use due to the inability to search for a particular expressed sequence. While some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments. We introduce an indexing scheme called Split Sequence Bloom Tree (SSBT) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. SSBT is an improvement over the SBT [1] data structure for the same task. We apply SSBT to the problem of finding conditions under which query transcripts are expressed. Our experiments are conducted on a set of 2,652 publicly available RNA-seq experiments contained in the NIH SRA for the breast, blood, and brain tissues. We demonstrate that this SSBT index can be queried for a 1000 nt sequence in under 4 minutes using a single thread and can be stored in just 39 GB, a five-fold improvement in search and storage costs compared to SBT. We also report that SSBT can be further optimized by pre-loading the entire index to accomplish the same search in 30 seconds.

RNA-Seq Analysis and Gene Discovery of Andrias davidianus Using Illumina Short Read Sequencing

PLoS ONE ◽

10.1371/journal.pone.0123730 ◽

2015 ◽

Vol 10 (4) ◽

pp. e0123730 ◽

Cited By ~ 16

Author(s):

Fenggang Li ◽

Lixin Wang ◽

Qingjing Lan ◽

Hui Yang ◽

Yang Li ◽

...

Keyword(s):

Gene Discovery ◽

Rna Seq ◽

Andrias Davidianus ◽

Short Read ◽

Short Read Sequencing

Characterizing short read sequencing for gene discovery and RNA-Seq analysis in Crassostrea gigas

Comparative Biochemistry and Physiology Part D Genomics and Proteomics ◽

10.1016/j.cbd.2011.12.003 ◽

2012 ◽

Vol 7 (2) ◽

pp. 94-99 ◽

Cited By ~ 16

Author(s):

Mackenzie R. Gavery ◽

Steven B. Roberts

Keyword(s):

Crassostrea Gigas ◽

Gene Discovery ◽

Rna Seq ◽

Short Read ◽

Short Read Sequencing

The Lair: A resource for exploratory analysis of published RNA-Seq data

10.1101/056200 ◽

2016 ◽

Author(s):

Harold Pimentel ◽

Pascal Sturmfels ◽

Nicolas Bray ◽

Páll Melsted ◽

Lior Pachter

Keyword(s):

Large Scale ◽

Exploratory Analysis ◽

Technical Expertise ◽

Rna Seq ◽

Sequencing Data ◽

Short Read ◽

Link Type ◽

Short Read Archive ◽

Published Research

Increased emphasis on reproducibility of published research in the last few years has led to the large-scale archiving of sequencing data. While this data can, in theory, be used to reproduce results in papers, it is typically not easily usable in practice. We introduce a series of tools for processing and analyzing RNA-Seq data in the Short Read Archive that together have allowed us to build an easily extendable resource for analysis of data underlying published papers. Our system makes the exploration of data easily accessible and usable without technical expertise. Our database and associated tools can be accessed at The Lair: http://pachterlab.github.io/lair

Indel variant analysis of short-read sequencing data with Scalpel

10.1101/028050 ◽

2015 ◽

Cited By ~ 1

Author(s):

Han Fang ◽

Ewa A. Grabowska ◽

Kanika Arora ◽

Vladimir Vacic ◽

Michael C. Zody ◽

...

Keyword(s):

Large Scale ◽

De Novo ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Indel Detection ◽

Indel Calling ◽

Assembly Technique ◽

Variant Analysis ◽

Family Based

As the second most common type of variation in the human genome, insertions and deletions (indels) have been linked to many diseases, but indels of more than a few bases are still challenging to discover from short-read sequencing data. Scalpel (http://scalpel.sourceforge.net) is open-source software for reliable indel detection based on the micro-assembly technique. To date, it has been successfully used to discover mutations in novel candidate genes for autism, and is extensively used in other large-scale studies of human diseases. This protocol gives an overview of the algorithm and describes how to use Scalpel to perform highly accurate indel calling from whole genome and exome sequencing data. We provide detailed instructions for an exemplary family-based de novo study, but we also characterize the other two supported modes of operation for single sample and somatic analysis. Indel normalization, visualization, and annotation of the mutations are also illustrated. Using a standard server, indel discovery and characterization in the exonic regions of the example sequencing data can be finished in ~6 hours after read mapping.

Transcriptional and morphological profiling of parvalbumin interneuron subpopulations in the mouse hippocampus

Nature Communications ◽

10.1038/s41467-020-20328-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Lin Que ◽

David Lukacsovich ◽

Wenshu Luo ◽

Csaba Földy

Keyword(s):

Large Scale ◽

Cell Types ◽

Rna Seq ◽

Neuronal Identity ◽

Parvalbumin Interneurons ◽

Different Types ◽

Parvalbumin Interneuron ◽

Cam Profile ◽

Developmental Domains

The diversity reflected by >100 different neural cell types fundamentally contributes to brain function and a central idea is that neuronal identity can be inferred from genetic information. Recent large-scale transcriptomic assays seem to confirm this hypothesis, but a lack of morphological information has limited the identification of several known cell types. In this study, we used single-cell RNA-seq in morphologically identified parvalbumin interneurons (PV-INs), and studied their transcriptomic states in the morphological, physiological, and developmental domains. Overall, we find high transcriptomic similarity among PV-INs, with few genes showing divergent expression between morphologically different types. Furthermore, PV-INs show a uniform synaptic cell adhesion molecule (CAM) profile, suggesting that CAM expression in mature PV cells does not reflect wiring specificity after development. Together, our results suggest that while PV-INs differ in anatomy and in vivo activity, their continuous transcriptomic and homogenous biophysical landscapes are not predictive of these distinct identities.

Biosynthetic potential of uncultured Antarctic soil bacteria revealed through long-read metagenomic sequencing

The ISME Journal ◽

10.1038/s41396-021-01052-3 ◽

2021 ◽

Author(s):

Valentin Waschulin ◽

Chiara Borsetto ◽

Robert James ◽

Kevin K. Newsham ◽

Stefano Donadio ◽

...

Keyword(s):

Genome Mining ◽

Gene Clusters ◽

Biosynthetic Gene Cluster ◽

Full Length ◽

Metagenomic Sequencing ◽

Short Read ◽

Short Read Sequencing ◽

Rich Diversity ◽

Long Read ◽

The Rich

The growing problem of antibiotic resistance has led to the exploration of uncultured bacteria as potential sources of new antimicrobials. PCR amplicon analyses and short-read sequencing studies of samples from different environments have reported evidence of high biosynthetic gene cluster (BGC) diversity in metagenomes, indicating their potential for producing novel and useful compounds. However, recovering full-length BGC sequences from uncultivated bacteria remains a challenge due to the technological restraints of short-read sequencing, thus making assessment of BGC diversity difficult. Here, long-read sequencing and genome mining were used to recover >1400 mostly full-length BGCs that demonstrate the rich diversity of BGCs from uncultivated lineages present in soil from Mars Oasis, Antarctica. A large number of highly divergent BGCs were not only found in the phyla Acidobacteriota, Verrucomicrobiota and Gemmatimonadota but also in the actinobacterial classes Acidimicrobiia and Thermoleophilia and the gammaproteobacterial order UBA7966. The latter furthermore contained a potential novel family of RiPPs. Our findings underline the biosynthetic potential of underexplored phyla as well as unexplored lineages within seemingly well-studied producer phyla. They also showcase long-read metagenomic sequencing as a promising way to access the untapped genetic reservoir of specialised metabolite gene clusters of the uncultured majority of microbes.

Multiple Alu exonization in 3’UTR of a primate specific isoform of CYP20A1 creates a potential miRNA sponge

Genome Biology and Evolution ◽

10.1093/gbe/evaa233 ◽

2020 ◽

Author(s):

Aniket Bhattacharya ◽

Vineet Jha ◽

Khushboo Singhal ◽

Mahar Fatima ◽

Dayanidhi Singh ◽

...

Keyword(s):

Heat Shock ◽

Cortical Neurons ◽

Regulatory Networks ◽

Large Scale ◽

Neuronal Development ◽

Random Sets ◽

Rna Seq ◽

Orphan Gene ◽

Mirna Sponge ◽

Human Neurons

Alu repeats contribute to phylogenetic novelties in conserved regulatory networks in primates. Our study highlights how exonized Alus could nucleate large-scale mRNA-miRNA interactions. Using a functional genomics approach, we characterize a transcript isoform of an orphan gene, CYP20A1 (CYP20A1_Alu-LT) that has exonization of 23 Alus in its 3’UTR. CYP20A1_Alu-LT, confirmed by 3’RACE, is an outlier in length (9 kb 3’UTR) and widely expressed. Using publicly available datasets, we demonstrate its expression in higher primates and presence in single-nucleus RNA-seq of 15928 human cortical neurons. miRanda predicts ∼4700 miRNA recognition elements (MREs) for ∼1000 miRNAs, primarily originating within these 3’UTR-Alus. CYP20A1_Alu-LT could be a potential multi-miRNA sponge as it harbors ≥10 MREs for 140 miRNAs and has cytosolic localization. We further tested whether expression of CYP20A1_Alu-LT correlates with mRNAs harboring similar MRE targets. RNA-seq with conjoint miRNA-seq analysis was done in primary human neurons where we observed CYP20A1_Alu-LT to be downregulated during heat shock response and upregulated in HIV1-Tat treatment. 380 genes were positively correlated with its expression (significantly downregulated in heat shock and upregulated in Tat) and they harbored MREs for nine expressed miRNAs which were also enriched in CYP20A1_Alu-LT. MREs were significantly enriched in these 380 genes compared to random sets of differentially expressed genes (p = 8.134e-12). Gene ontology suggested involvement of these genes in neuronal development and hemostasis pathways, thus proposing a novel component of Alu-miRNA mediated transcriptional modulation that could govern specific physiological outcomes in higher primates.

NASA GeneLab RNA-Seq Consensus Pipeline: Standardized Processing of Short-Read RNA-Seq Data

iScience ◽

10.1016/j.isci.2021.102361 ◽

2021 ◽

pp. 102361

Author(s):

Eliah G. Overbey ◽

Amanda M. Saravia-Butler ◽

Zhe Zhang ◽

Komal S. Rathi ◽

Homer Fogle ◽

...

Keyword(s):

Rna Seq ◽

Short Read

REscan: inferring repeat expansions and structural variation in paired-end short read sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btaa753 ◽

2020 ◽

Author(s):

Russell Lewis McLaughlin

Keyword(s):

Structural Variation ◽

Sequence Data ◽

Neurological Diseases ◽

Repeat Expansion ◽

Sequencing Data ◽

Short Read ◽

Short Read Sequencing ◽

Repeat Expansions ◽

Paired End Sequencing

Motivation: Repeat expansions are an important class of genetic variation in neurological diseases. However, the identification of novel repeat expansions using conventional sequencing methods is a challenge due to their typical lengths relative to short sequence reads and difficulty in producing accurate and unique alignments for repetitive sequence. Yet this latter property can be harnessed in paired-end sequencing data to infer the possible locations of repeat expansions and other structural variation. Results: This article presents REscan, a command-line utility that infers repeat expansion loci from paired-end short read sequencing data by reporting the proportion of reads orientated towards a locus that do not have an adequately mapped mate. A high REscan statistic relative to a population of data suggests a repeat expansion locus for experimental follow-up. This approach is validated using genome sequence data for 259 cases of amyotrophic lateral sclerosis, of which 24 are positive for a large repeat expansion in C9orf72, showing that REscan statistics readily discriminate repeat expansion carriers from non-carriers. Availability and implementation: C source code at https://github.com/rlmcl/rescan (GNU General Public Licence v3).
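The statistic described in the abstract above, the proportion of reads orientated towards a locus that lack an adequately mapped mate, can be illustrated with a small Python sketch. This is not REscan's C implementation. The `Read` record, the `window` size, the `min_mate_mapq` threshold, and the rule for deciding that a read "points toward" a locus are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Read:
    pos: int                   # leftmost mapped coordinate of the read
    is_reverse: bool           # True if the read maps to the reverse strand
    mate_mapq: Optional[int]   # mate's mapping quality; None if the mate is unmapped


def points_toward(read: Read, locus: int, window: int) -> bool:
    """Assumed rule: forward reads upstream and reverse reads downstream face the locus."""
    if read.is_reverse:
        return locus < read.pos <= locus + window
    return locus - window <= read.pos < locus


def orphan_fraction(reads: Iterable[Read], locus: int,
                    window: int = 500, min_mate_mapq: int = 20) -> Optional[float]:
    """Fraction of locus-facing reads whose mate is unmapped or poorly mapped."""
    toward = [r for r in reads if points_toward(r, locus, window)]
    if not toward:
        return None            # no informative reads near this locus
    lacking = sum(1 for r in toward
                  if r.mate_mapq is None or r.mate_mapq < min_mate_mapq)
    return lacking / len(toward)
```

In line with the abstract, a value that is high relative to the same locus in a population of other samples would flag a candidate for follow-up. A single value on its own is not interpreted here.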
