Profiling immunoglobulin repertoires across multiple human tissues by RNA Sequencing

Mapping Intimacies ◽

10.1101/089235 ◽

2016 ◽

Cited By ~ 6

Author(s):

Serghei Mangul ◽

Igor Mandric ◽

Harry Taegyun Yang ◽

Nicolas Strauli ◽

Dennis Montoya ◽

...

Keyword(s):

Rna Sequencing ◽

Adaptive Immune System ◽

Tissue Expression ◽

Computational Method ◽

Rna Seq ◽

Variable Regions ◽

Link Type ◽

Novel Method ◽

Immunoglobulin Repertoire ◽

Secondary Lymphoid Organs

Abstract:

Assay-based approaches provide a detailed view of the adaptive immune system by profiling immunoglobulin (Ig) receptor repertoires. However, these methods carry a high cost and lack the scale of standard RNA sequencing (RNA-Seq). Here we report the development of ImReP, a novel computational method for rapid and accurate profiling of the immunoglobulin repertoire from regular RNA-Seq data. ImReP can also accurately assemble the complementarity determining regions 3 (CDR3s), the most variable regions of Ig receptors. We applied our novel method to 8,555 samples across 53 tissues from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. ImReP is able to efficiently extract Ig-derived reads from RNA-Seq data. Using ImReP, we have created a systematic atlas of 3.6 million Ig sequences across a broad range of tissue types, most of which have not been studied for Ig receptor repertoires. We also compared the GTEx tissues to track the flow of Ig clonotypes across immune-related tissues, including secondary lymphoid organs and organs encompassing mucosal, exocrine, and endocrine sites, and we examined the compositional similarities of clonal populations between these tissues. The Atlas of Immunoglobulin Repertoires (The AIR), freely available at https://github.com/smangul1/TheAIR/wiki, is one of the largest collections of CDR3 sequences and tissue types. We anticipate this resource will enhance future immunology studies and advance the development of therapies for human diseases. ImReP is freely available at https://github.com/mandricigor/imrep/wiki

REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets

10.1101/2020.03.29.014159 ◽

2020 ◽

Cited By ~ 4

Author(s):

Camille Marchet ◽

Zamin Iqbal ◽

Daniel Gautheret ◽

Mikael Salson ◽

Rayan Chikhi

Keyword(s):

Large Datasets ◽

Computational Method ◽

De Bruijn Graph ◽

Rna Seq ◽

Indexing Methods ◽

Link Type ◽

De Bruijn

Abstract:

Motivation: In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets.

Results: We used REINDEER to index the abundances of sequences within 2,585 human RNA-seq experiments in 45 hours using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of 4 billion distinct k-mers across 2,585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then conceptually merges those DBGs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances.

Availability: https://github.com/kamimrcht/[email protected]
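To make the idea of recording k-mer abundances across a collection of datasets concrete, here is a deliberately naive Python sketch. It is not REINDEER's data structure: it ignores reverse complements, de Bruijn graph compaction, and monotigs, and simply stores one abundance vector per k-mer in a dictionary. The function names and the toy reads are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def kmer_counts(reads, k):
    """Count every k-mer occurring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_index(datasets, k):
    """Map each k-mer to a vector holding its abundance in every dataset."""
    index = defaultdict(lambda: [0] * len(datasets))
    for j, reads in enumerate(datasets):
        for kmer, n in kmer_counts(reads, k).items():
            index[kmer][j] = n
    return dict(index)


def query(index, sequence, k, n_datasets):
    """For each k-mer of the query, return its abundance in every dataset.

    A vector of zeros means the k-mer is absent everywhere, so the same
    call also answers presence/absence questions.
    """
    absent = [0] * n_datasets
    return [index.get(sequence[i:i + k], absent)
            for i in range(len(sequence) - k + 1)]


# Two toy "datasets" of reads (hypothetical data).
datasets = [["ACGTAC", "CGTACG"], ["ACGTTT"]]
index = build_index(datasets, k=4)
result = query(index, "ACGTA", 4, len(datasets))
print(result)
```

A dictionary like this grows with the number of distinct k-mers times the number of datasets, which is the cost that grouping k-mers of similar abundance is meant to reduce.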

Slinker: Visualising novel splicing events in RNA-Seq data

F1000Research ◽

10.12688/f1000research.74836.1 ◽

2021 ◽

Vol 10 ◽

pp. 1255

Author(s):

Breon Schmidt ◽

Marek Cmero ◽

Paul Ekert ◽

Nadia Davidson ◽

Alicia Oshlack

Keyword(s):

Rare Disease ◽

Human Genome ◽

Rna Sequencing ◽

Reference Genome ◽

Data Driven ◽

Rna Seq ◽

Bioinformatics Pipeline ◽

Link Type ◽

Muscular Disorders ◽

Data Driven Approach

Visualisation of the transcriptome relative to a reference genome is fraught with sparsity. This is due to RNA sequencing (RNA-Seq) reads being predominantly mapped to exons that account for just under 3% of the human genome. Recently, we have used exon-only references, superTranscripts, to improve visualisation of aligned RNA-Seq data through the omission of supposedly unexpressed regions such as introns. However, variation within these regions can lead to novel splicing events that may drive a pathogenic phenotype. In these cases, the loss of information in only retaining annotated exons presents significant drawbacks. Here we present Slinker, a bioinformatics pipeline written in Python and Bpipe that uses a data-driven approach to assemble sample-specific superTranscripts. At its core, Slinker uses Stringtie2 to assemble transcripts with any sequence across any gene. This assembly is merged with reference transcripts and converted to a superTranscript, from which rich visualisations are made through Plotly with associated annotation and coverage information. Slinker was validated on five novel splicing events of rare disease samples from a cohort of primary muscular disorders. In addition, Slinker was shown to be effective in visualising deletion events within transcriptomes of tumour samples in the important leukemia gene, IKZF1. Slinker offers a succinct visualisation of RNA-Seq alignments across typically sparse regions and is freely available on GitHub.

Deconvolution of Expression for Nascent RNA Sequencing Data (DENR) Highlights Pre-RNA Isoform Diversity in Human Cells

10.1101/2021.03.16.435537 ◽

2021 ◽

Author(s):

Yixin Zhao ◽

Noah Dukler ◽

Gilad Barshad ◽

Shushan Toneyan ◽

Charles G. Danko ◽

...

Keyword(s):

T Cells ◽

Rna Sequencing ◽

Cell Types ◽

Transcription Unit ◽

Human Cells ◽

Computational Method ◽

Rna Seq ◽

Sequencing Data ◽

Isoform Diversity ◽

Nascent Rna

Abstract:

Quantification of mature-RNA isoform abundance from RNA-seq data has been extensively studied, but much less attention has been devoted to quantifying the abundance of distinct precursor RNAs based on nascent RNA sequencing data. Here we address this problem with a new computational method called Deconvolution of Expression for Nascent RNA sequencing data (DENR). DENR models the nascent RNA read counts at each locus as a mixture of user-provided isoforms. The performance of the baseline algorithm is enhanced by the use of machine-learning predictions of transcription start sites (TSSs) and an adjustment for the typical “shape profile” of read counts along a transcription unit. We show using simulated data that DENR clearly outperforms simple read-count-based methods for estimating the abundances of both whole genes and isoforms. By applying DENR to previously published PRO-seq data from K562 and CD4+ T cells, we find that transcription of multiple isoforms per gene is widespread, and the dominant isoform frequently makes use of an internal TSS. We also identify > 200 genes whose dominant isoforms make use of different TSSs in these two cell types. Finally, we apply DENR and StringTie to newly generated PRO-seq and RNA-seq data, respectively, for human CD4+ T cells and CD14+ monocytes, and show that entropy at the pre-RNA level makes a disproportionate contribution to overall isoform diversity, especially across cell types. Altogether, DENR is the first computational tool to enable abundance quantification of pre-RNA isoforms based on nascent RNA sequencing data, and it reveals high levels of pre-RNA isoform diversity in human cells.
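The phrase "read counts at each locus as a mixture of user-provided isoforms" can be illustrated with a small non-negative least-squares fit. The Python sketch below is an assumption-laden toy, not DENR's estimator: the locus is split into hypothetical bins, each isoform is a 0/1 column saying which bins it covers, and abundances are found by projected gradient descent. It omits the TSS predictions and shape-profile adjustment described in the abstract.

```python
def fit_isoform_mixture(design, counts, n_iter=2000):
    """Non-negative least squares by projected gradient descent.

    design[i][k] is 1 if isoform k covers bin i, else 0 (hypothetical layout).
    counts[i] is the observed read count in bin i.
    Returns one non-negative abundance per isoform.
    """
    n_bins, n_iso = len(design), len(design[0])
    # Step size 1 / ||X||_F^2 is never larger than 1 / (largest eigenvalue
    # of X^T X), which keeps the gradient iteration stable.
    step = 1.0 / sum(x * x for row in design for x in row)
    beta = [0.0] * n_iso
    for _ in range(n_iter):
        # Residual of the current mixture against the observed counts.
        residual = [
            sum(design[i][k] * beta[k] for k in range(n_iso)) - counts[i]
            for i in range(n_bins)
        ]
        # Gradient of 0.5 * ||X beta - y||^2 with respect to beta.
        grad = [
            sum(design[i][k] * residual[i] for i in range(n_bins))
            for k in range(n_iso)
        ]
        # Gradient step followed by projection onto beta >= 0.
        beta = [max(0.0, b - step * g) for b, g in zip(beta, grad)]
    return beta


# Isoform 0 spans all four bins; isoform 1 starts at an internal TSS (bin 2).
design = [[1, 0], [1, 0], [1, 1], [1, 1]]
counts = [10, 10, 30, 30]
abundances = fit_isoform_mixture(design, counts)
print([round(a, 3) for a in abundances])
```

In this toy locus the counts are exactly explained by abundances of about 10 for the full-length isoform and 20 for the internal-TSS isoform, so the internal-TSS isoform comes out as dominant.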

Comprehensive analysis of RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues

10.1101/053041 ◽

2016 ◽

Cited By ~ 1

Author(s):

Serghei Mangul ◽

Harry Taegyun Yang ◽

Nicolas Strauli ◽

Franziska Gruhl ◽

Hagit T. Porath ◽

...

Keyword(s):

B Cell ◽

Rna Sequencing ◽

Single Cells ◽

Cell Receptor ◽

Tissue Expression ◽

Read Length ◽

Rna Seq ◽

Disease Etiology ◽

Rna Molecules ◽

Sequencing Technologies

Abstract:

High-throughput RNA sequencing technologies have provided invaluable research opportunities across distinct scientific domains by producing quantitative readouts of the transcriptional activity of both entire cellular populations and single cells. The majority of RNA-Seq analyses begin by mapping each experimentally produced sequence (i.e., read) to a set of annotated reference sequences for the organism of interest. For both biological and technical reasons, a significant fraction of reads remains unmapped. In this work, we develop Read Origin Protocol (ROP) to discover the source of all reads originating from complex RNA molecules, recombinant T and B cell receptors, and microbial communities. We applied ROP to 8,641 samples across 630 individuals from 54 tissues. A fraction of RNA-Seq data (n=86) was obtained in-house; the remaining data was obtained from the Genotype-Tissue Expression (GTEx v6) project. To generalize the reported number of accounted reads, we also performed ROP analysis on thousands of different, randomly selected, and publicly available RNA-Seq samples in the Sequence Read Archive (SRA). Our approach can account for 99.9% of 1 trillion reads of various read lengths across the merged dataset (n=10641). Using in-house RNA-Seq data, we show that immune profiles of asthmatic individuals are significantly different from the profiles of control individuals, with decreased average per sample T and B cell receptor diversity. We also show that immune diversity is inversely correlated with microbial load. Our results demonstrate the potential of ROP to exploit unmapped reads in order to better understand the functional mechanisms underlying connections between the immune system, microbiome, human gene expression, and disease etiology. ROP is freely available at https://github.com/smangul1/rop and currently supports human and mouse RNA-Seq reads.

Genotype-free demultiplexing of pooled single-cell RNA-seq

Genome Biology ◽

10.1186/s13059-019-1852-7 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 6

Author(s):

Jun Xu ◽

Caitlin Falconer ◽

Quan Nguyen ◽

Joanna Crawford ◽

Brett D. McKinnon ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Genetic Differences ◽

Rna Seq ◽

Link Type ◽

Genotype Information ◽

Single Cell Rna Sequencing ◽

Pooled Samples

Abstract:

A variety of methods have been developed to demultiplex pooled samples in a single cell RNA sequencing (scRNA-seq) experiment, which require either hashtag barcodes or sample genotypes prior to pooling. We introduce scSplit, which utilizes genetic differences inferred from scRNA-seq data alone to demultiplex pooled samples. scSplit also enables mapping clusters to original samples. Using simulated, merged, and pooled multi-individual datasets, we show that scSplit prediction is highly concordant with demuxlet predictions and is highly consistent with the known truth in a cell-hashing dataset. scSplit is ideally suited to samples without external genotype information and is available at: https://github.com/jon-xu/scSplit

RsQTL: correlation of expressed SNVs with splicing using RNA-sequencing data

10.1101/840504 ◽

2019 ◽

Cited By ~ 1

Author(s):

Justin Sein ◽

Liam F. Spurr ◽

Pavlos Bousounis ◽

N M Prashant ◽

Hongyu Liu ◽

...

Keyword(s):

Rna Sequencing ◽

Tissue Expression ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Dynamic Nature ◽

Dynamic Variation ◽

Exon Junctions ◽

Variant Allele Fraction ◽

Allele Fraction

Summary:

RsQTL is a tool for identification of splicing quantitative trait loci (sQTLs) from RNA-sequencing (RNA-seq) data by correlating the variant allele fraction at expressed SNV loci in the transcriptome (VAF_RNA) with the proportion of molecules spanning local exon-exon junctions at loci with differential intron excision (percent spliced in, PSI). We exemplify the method on sets of RNA-seq data from human tissues obtained through the Genotype-Tissue Expression Project (GTEx). RsQTL does not require matched DNA and can identify a subset of expressed sQTL loci. Due to the dynamic nature of VAF_RNA, RsQTL is applicable for the assessment of conditional and dynamic variation-splicing relationships.

Availability and implementation: https://github.com/HorvathLab/[email protected] or [email protected]

Supplementary Information: RsQTL_Supplementary_Data.zip
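The correlation described in the summary can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the variant allele fraction is taken as alternative reads over all reads at the SNV, PSI as junction-spanning reads over all reads in a hypothetical junction cluster, and a plain Pearson coefficient stands in for whatever statistical test the tool actually applies. All function names and numbers are invented for the example.

```python
import math


def vaf_rna(alt_reads, ref_reads):
    """Variant allele fraction from RNA-seq read counts at one SNV locus."""
    return alt_reads / (alt_reads + ref_reads)


def psi(junction_reads, cluster_reads):
    """Share of reads in a junction cluster that span the junction of interest."""
    return junction_reads / cluster_reads


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# One (alt, ref) and one (junction, cluster) pair per sample; hypothetical counts.
snv_counts = [(0, 40), (10, 30), (20, 20), (30, 10)]
junction_counts = [(5, 50), (15, 50), (25, 50), (35, 50)]

vafs = [vaf_rna(alt, ref) for alt, ref in snv_counts]
psis = [psi(j, total) for j, total in junction_counts]
r = pearson(vafs, psis)
print(round(r, 3))
```

Because both quantities are computed from the RNA-seq reads themselves, no matched DNA genotype enters the calculation, which is the property the summary highlights.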

Diagnosing Cornelia de Lange syndrome and related neurodevelopmental disorders using RNA-sequencing

10.1101/19008300 ◽

2019 ◽

Author(s):

Stefan Rentas ◽

Komal S. Rathi ◽

Maninder Kaur ◽

Pichai Raman ◽

Ian D. Krantz ◽

...

Keyword(s):

Rna Sequencing ◽

Neurodevelopmental Disorders ◽

Diagnostic Testing ◽

Tissue Expression ◽

Cornelia De Lange Syndrome ◽

Rna Seq ◽

Genetic Syndromes ◽

Gene Testing ◽

Mendelian Gene ◽

Cornelia De Lange

Abstract:

Purpose: Neurodevelopmental phenotypes represent major indications for children undergoing clinical exome sequencing. However, 50% of cases remain undiagnosed even upon exome reanalysis. Here we show RNA sequencing (RNA-seq) on human B lymphoblastoid cell lines (LCL) is highly suitable for neurodevelopmental Mendelian gene testing and demonstrate the utility of this approach in suspected cases of Cornelia de Lange syndrome (CdLS).

Methods: Genotype-Tissue Expression project transcriptome data for LCL, blood, and brain was assessed for neurodevelopmental Mendelian gene expression. Detection of abnormal splicing and pathogenic variants in these genes was performed with a novel RNA-seq diagnostic pipeline and using a validation CdLS-LCL cohort (n=10) and test cohort of patients who carry a clinical diagnosis of CdLS but negative genetic testing (n=5).

Results: LCLs share isoform diversity of brain tissue for a large subset of neurodevelopmental genes and express 1.8-fold more of these genes compared to blood (LCL, n=1706; whole blood, n=917). This enables testing of over 1000 genetic syndromes. The RNA-seq pipeline had 90% sensitivity for detecting pathogenic events and revealed novel diagnoses such as abnormal splice products in NIPBL and pathogenic coding variants in BRD4 and ANKRD11.

Conclusion: The LCL transcriptome enables robust frontline and/or reflexive diagnostic testing for neurodevelopmental disorders.

Rigorous Benchmarking of HLA Callers for RNA Sequencing Data

10.31219/osf.io/t4n72 ◽

2021 ◽

Author(s):

Ram Ayyala ◽

Junghyun Jung ◽

Sergey Knyazev ◽

Serghei Mangul

Keyword(s):

Rna Sequencing ◽

Gold Standard ◽

Large Scale ◽

Human Leukocyte ◽

Tissue Expression ◽

Hla Typing ◽

Read Length ◽

Evaluation Metrics ◽

Rna Seq ◽

Sequencing Data

Although precise identification of human leukocyte antigen (HLA) alleles is crucial for various clinical and research applications, HLA typing remains challenging due to the high polymorphism of the HLA loci. With Next-Generation Sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types from RNA sequencing (RNA-seq) data. However, there is a lack of comprehensive and systematic benchmarking of RNA-seq HLA callers using large-scale and realistic gold standards. To address this limitation, we rigorously compared the performance of 12 HLA callers over 50,000 HLA tasks, including searching 30 pairwise combinations of HLA callers and references in over 1,500 samples. In each case, we measured accuracy as the percentage of correctly predicted alleles (at two- and four-digit resolution) based on six gold standard datasets spanning 650 RNA-seq samples. To determine the influence of read length over the HLA region on the prediction quality of each tool, we considered read lengths in the range 37-126 bp, which were available in our gold standard datasets. Moreover, using the Genotype-Tissue Expression (GTEx) v8 data, we calculated the concordance of the same HLA type across different tissues from the same individual to evaluate how well the HLA callers maintain consistent results across tissues. This study offers crucial information for researchers regarding appropriate choices of methods for HLA analysis.
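The accuracy metric mentioned above, the percentage of correctly predicted alleles at two- and four-digit resolution, can be sketched as follows. This Python example is an assumption about how such a score might be computed, not the benchmark's own code: alleles are truncated to the first one or two colon-separated fields, and the two alleles per sample are compared as an unordered pair. Sample names and allele calls are made up.

```python
from collections import Counter


def truncate(allele, fields):
    """Keep the first `fields` colon-separated fields of an HLA allele name.

    fields=1 gives two-digit resolution (A*02), fields=2 gives four-digit
    resolution (A*02:01).
    """
    gene, _, rest = allele.partition("*")
    return gene + "*" + ":".join(rest.split(":")[:fields])


def accuracy(predicted, truth, fields):
    """Percentage of gold-standard alleles matched by the predictions."""
    correct = total = 0
    for sample, true_alleles in truth.items():
        pred = Counter(truncate(a, fields) for a in predicted.get(sample, []))
        gold = Counter(truncate(a, fields) for a in true_alleles)
        # Multiset intersection: order of the two alleles does not matter,
        # and a homozygous call must be predicted twice to count twice.
        correct += sum((pred & gold).values())
        total += sum(gold.values())
    return 100.0 * correct / total


# Hypothetical gold standard and caller output for one sample at HLA-A.
truth = {"sample1": ["A*02:01", "A*24:02"]}
predicted = {"sample1": ["A*24:03", "A*02:01:01"]}

two_digit = accuracy(predicted, truth, fields=1)
four_digit = accuracy(predicted, truth, fields=2)
print(two_digit, four_digit)
```

In this toy case both alleles agree at two-digit resolution but only one agrees at four-digit resolution, which shows why the two resolutions are reported separately.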

recount3: summaries and queries for large-scale RNA-seq expression and splicing

Genome Biology ◽

10.1186/s13059-021-02533-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Christopher Wilks ◽

Shijie C. Zheng ◽

Feng Yong Chen ◽

Rone Charles ◽

Brad Solomon ◽

...

Keyword(s):

Rna Sequencing ◽

Large Scale ◽

Rna Seq ◽

Analysis Pipeline ◽

Web Resources ◽

Link Type ◽

Private Data ◽

Exon Junctions ◽

Human And Mouse

Abstract:

We present recount3, a resource consisting of over 750,000 publicly available human and mouse RNA sequencing (RNA-seq) samples uniformly processed by our new analysis pipeline. To facilitate access to the data, we provide the and R/Bioconductor packages as well as complementary web resources. Using these tools, data can be downloaded as study-level summaries or queried for specific exon-exon junctions, genes, samples, or other features. can be used to process local and/or private data, allowing results to be directly compared to any study in recount3. Taken together, our tools help biologists maximize the utility of publicly available RNA-seq data, especially to improve their understanding of newly collected data. recount3 is available from http://rna.recount.bio.

Red panda: a novel method for detecting variants in single-cell RNA sequencing

BMC Genomics ◽

10.1186/s12864-020-07224-3 ◽

2020 ◽

Vol 21 (S11) ◽

Author(s):

Adam Cornish ◽

Shrabasti Roychoudhury ◽

Krishna Sarma ◽

Suravi Pramanik ◽

Kishor Bhakat ◽

...

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Articular Chondrocytes ◽

Genetic Diseases ◽

Simulated Data ◽

Single Nucleotide ◽

Single Cell Rna Sequencing ◽

Red Panda ◽

Novel Method ◽

Rare Cells

Abstract:

Background: Single-cell sequencing enables us to better understand genetic diseases, such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software is aimed at identifying single nucleotide variations or micro (1-50 bp) insertions and deletions in single-cell RNA sequencing (scRNA-seq) data. Generating high-quality variant data is vital to the study of the aforementioned diseases, among others.

Results: In this study, we report the design and implementation of Red Panda, a novel method to accurately identify variants in scRNA-seq data. Variants were called on scRNA-seq data from human articular chondrocytes, mouse embryonic fibroblasts (MEFs), and simulated data stemming from the MEF alignments. Red Panda had the highest Positive Predictive Value at 45.0%, while other tools (FreeBayes, GATK HaplotypeCaller, GATK UnifiedGenotyper, Monovar, and Platypus) ranged from 5.8–41.53%. From the simulated data, Red Panda had the highest sensitivity at 72.44%.

Conclusions: We show that our method provides a novel and improved mechanism to identify variants in scRNA-seq as compared to currently existing software. However, methods for identification of genomic variants using scRNA-seq data can still be improved.
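The two metrics quoted in the results, Positive Predictive Value and sensitivity, follow the standard definitions PPV = TP / (TP + FP) and sensitivity = TP / (TP + FN). The short Python sketch below shows how they could be computed from a set of called variants and a truth set; the variant tuples are hypothetical and the function is not taken from Red Panda.

```python
def ppv_and_sensitivity(called, truth):
    """Compare called variants against a truth set.

    Each variant is a hashable key such as (chrom, pos, ref, alt).
    Returns (PPV, sensitivity) as fractions between 0 and 1.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but missed
    ppv = tp / (tp + fp) if called else 0.0
    sensitivity = tp / (tp + fn) if truth else 0.0
    return ppv, sensitivity


# Hypothetical calls and truth set.
called = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")]
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr3", 10, "T", "C"), ("chr4", 75, "G", "C")]

ppv, sensitivity = ppv_and_sensitivity(called, truth)
print(round(ppv, 3), round(sensitivity, 3))
```

A caller can score well on one metric and poorly on the other, which is why the abstract reports PPV on the real data and sensitivity on the simulated data separately.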
