Mosaic deletion patterns of the human antibody heavy chain gene locus as revealed by Bayesian haplotyping

Mapping Intimacies ◽

10.1101/314476 ◽

2018 ◽

Cited By ~ 3

Author(s):

Moriah Gidoni ◽

Omri Snir ◽

Ayelet Peres ◽

Pazit Polak ◽

Ida Lindeman ◽

...

Keyword(s):

High Throughput Sequencing ◽

Haplotype Inference ◽

Copy Number Variations ◽

Chain Gene ◽

Human Antibody ◽

Sequencing Data ◽

Data Set ◽

Repertoire Sequencing ◽

Novel Method ◽

Antibody Heavy Chain

Abstract: Analysis of antibody repertoires by high-throughput sequencing is of major importance in understanding adaptive immune responses. Our knowledge of variations in the genomic loci encoding antibody genes is incomplete, mostly due to technical difficulties in aligning short reads to these highly repetitive loci. This partial knowledge results in conflicting V-D-J gene assignments between different algorithms, and in biased genotype and haplotype inference. Previous studies have shown that haplotypes can be inferred by taking advantage of IGHJ6 heterozygosity, observed in approximately one third of the population. Here, we propose a robust novel method for determining V-D-J haplotypes by adapting a Bayesian framework. Our method extends haplotype inference to IGHD- and IGHV-based analysis, thereby enabling inference of complex genetic events like deletions and copy number variations in the entire population. We generated the largest multi-individual data set, to date, of naïve B-cell repertoires, and tested our method on it. We present evidence for allele usage bias, as well as a mosaic, tiled pattern of deleted and present neighboring IGHD and IGHV genes, across the population. The inferred haplotypes and deletion patterns may have clinical implications for genetic predispositions to diseases. Our findings greatly expand the knowledge that can be extracted from antibody repertoire sequencing data.


Mosaic deletion patterns of the human antibody heavy chain gene locus shown by Bayesian haplotyping

Nature Communications ◽

10.1038/s41467-019-08489-3 ◽

2019 ◽

Vol 10 (1) ◽

Cited By ~ 19

Author(s):

Moriah Gidoni ◽

Omri Snir ◽

Ayelet Peres ◽

Pazit Polak ◽

Ida Lindeman ◽

...

Keyword(s):

Heavy Chain ◽

Gene Locus ◽

Chain Gene ◽

Human Antibody ◽

Heavy Chain Gene ◽

Antibody Heavy Chain


A Novel Method to Detect Bias in Short Read NGS Data

Journal of Integrative Bioinformatics ◽

10.1515/jib-2017-0025 ◽

2017 ◽

Vol 14 (3) ◽

Cited By ~ 1

Author(s):

Jamie Alnasir ◽

Hugh P. Shanahan

Keyword(s):

Biological Significance ◽

Gc Content ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Short Read ◽

Novel Method ◽

Type Data ◽

Ngs Data

Abstract: Detecting sources of bias in transcriptomic data is essential to determine signals of biological significance. We outline a novel method to detect sequence-specific bias in short read Next Generation Sequencing data. The method is based on determining intra-exon correlations between specific motifs, and it requires only the mild assumption that short reads sampled from specific regions of the same exon will be correlated with each other. It has been implemented on Apache Spark and used to analyse two D. melanogaster eye-antennal disc data sets generated at the same laboratory. The wild type Drosophila data set indicates a variation due to motif GC content that is more significant than that found due to exon GC content. The software is available online and could be applied for cross-experiment transcriptome data analysis in eukaryotes.
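The abstract above compares variation attributed to motif GC content with variation attributed to exon GC content. As a minimal illustration of the quantity involved, the sketch below computes the GC fraction of a sequence. The function name `gc_content` is hypothetical and is not taken from the authors' software, which runs on Apache Spark and is not reproduced here.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of bases in seq that are G or C (case-insensitive).

    Hypothetical helper for illustration only; an empty sequence yields 0.0.
    """
    if not seq:
        return 0.0
    seq = seq.upper()
    gc = sum(1 for base in seq if base in "GC")
    return gc / len(seq)


if __name__ == "__main__":
    # The same function applies to a short motif or to a whole exon sequence.
    print(gc_content("ATGC"))
    print(gc_content("GGCC"))
```

Applying the same function to a motif and to the exon that contains it gives the two GC measures that the abstract contrasts.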


An Extensive Meta-Metagenomic Search Identifies SARS-CoV-2-Homologous Sequences in Pangolin Lung Viromes

mSphere ◽

10.1128/msphere.00160-20 ◽

2020 ◽

Vol 5 (3) ◽

Cited By ~ 9

Author(s):

Lamia Wahba ◽

Nimit Jain ◽

Andrew Z. Fire ◽

Massa J. Shoura ◽

Karen L. Artiles ◽

...

Keyword(s):

Nucleic Acid ◽

High Speed ◽

High Throughput Sequencing ◽

Biological Significance ◽

Metagenomic Data ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Link Type ◽

Recent Emergence

ABSTRACT In numerous instances, tracking the biological significance of a nucleic acid sequence can be augmented through the identification of environmental niches in which the sequence of interest is present. Many metagenomic data sets are now available, with deep sequencing of samples from diverse biological niches. While any individual metagenomic data set can be readily queried using web-based tools, meta-searches through all such data sets are less accessible. In this brief communication, we demonstrate such a meta-metagenomic approach, examining close matches to the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in all high-throughput sequencing data sets in the NCBI Sequence Read Archive accessible with the “virome” keyword. In addition to the homology to bat coronaviruses observed in descriptions of the SARS-CoV-2 sequence (F. Wu, S. Zhao, B. Yu, Y. M. Chen, et al., Nature 579:265–269, 2020, https://doi.org/10.1038/s41586-020-2008-3; P. Zhou, X. L. Yang, X. G. Wang, B. Hu, et al., Nature 579:270–273, 2020, https://doi.org/10.1038/s41586-020-2012-7), we note a strong homology to numerous sequence reads in metavirome data sets generated from the lungs of deceased pangolins reported by Liu et al. (P. Liu, W. Chen, and J. P. Chen, Viruses 11:979, 2019, https://doi.org/10.3390/v11110979). While analysis of these reads indicates the presence of a similar viral sequence in pangolin lung, the similarity is not sufficient to either confirm or rule out a role for pangolins as an intermediate host in the recent emergence of SARS-CoV-2. In addition to the implications for SARS-CoV-2 emergence, this study illustrates the utility and limitations of meta-metagenomic search tools in effective and rapid characterization of potentially significant nucleic acid sequences. IMPORTANCE Meta-metagenomic searches allow for high-speed, low-cost identification of potentially significant biological niches for sequences of interest.


Highly comparable metabarcoding results from MGI-Tech and Illumina sequencing platforms

PeerJ ◽

10.7717/peerj.12254 ◽

2021 ◽

Vol 9 ◽

pp. e12254

Author(s):

Sten Anslan ◽

Vladimir Mikryukov ◽

Kęstutis Armolaitis ◽

Jelena Ankuda ◽

Dagnija Lazdina ◽

...

Keyword(s):

High Throughput Sequencing ◽

Coi Gene ◽

Synthesis Methods ◽

Sequence Variant ◽

Sequencing Data ◽

Data Set ◽

Soil Dna ◽

Sequencing Platform ◽

Sequencing By Synthesis ◽

Sequencing Platforms

With the developments in DNA nanoball sequencing technologies and the emergence of new platforms, there is an increasing interest in their performance in comparison with the widely used sequencing-by-synthesis methods. Here, we test the consistency of metabarcoding results from DNBSEQ-G400RS (DNA nanoball sequencing platform by MGI-Tech) and NovaSeq 6000 (sequencing-by-synthesis platform by Illumina) platforms using technical replicates of DNA libraries that consist of COI gene amplicons from 120 soil DNA samples. By subjecting raw sequencing data from both platforms to uniform bioinformatics processing, we found that the proportion of high-quality reads passing through the filtering steps was similar in both datasets. Per-sample operational taxonomic unit (OTU) and amplicon sequence variant (ASV) richness patterns were highly correlated, but sequencing data from DNBSEQ-G400RS harbored a higher number of OTUs. This may be related to the lower dominance of the most common OTUs in the DNBSEQ data set (thus revealing higher richness by detecting rare taxa) and/or to a lower effective read quality leading to generation of spurious OTUs. However, there was no statistical difference in the ASV and post-clustered ASV richness between platforms, suggesting that the additional denoising step in the ASV workflow had effectively removed the ‘noisy’ reads. Both OTU-based and ASV-based composition were strongly correlated between the sequencing platforms, with essentially interchangeable results. Therefore, we conclude that DNBSEQ-G400RS and NovaSeq 6000 are equally efficient high-throughput sequencing platforms for studies applying the metabarcoding approach, but the main benefit of the former is its lower sequencing cost.


A generalized HIV vaccine design strategy for priming of broadly neutralizing antibody responses

Science ◽

10.1126/science.aax4380 ◽

2019 ◽

Vol 366 (6470) ◽

pp. eaax4380 ◽

Cited By ~ 35

Author(s):

Jon M. Steichen ◽

Ying-Cing Lin ◽

Colin Havenar-Daughton ◽

Simone Pecetta ◽

Gabriel Ozorowski ◽

...

Keyword(s):

B Cells ◽

Neutralizing Antibodies ◽

Ex Vivo ◽

Strong Dependence ◽

Neutralizing Antibody ◽

Human Antibody ◽

Sequencing Data ◽

Major Barrier ◽

Precursor B Cells ◽

Antibody Heavy Chain

Vaccine induction of broadly neutralizing antibodies (bnAbs) to HIV remains a major challenge. Germline-targeting immunogens hold promise for initiating the induction of certain bnAb classes; yet for most bnAbs, a strong dependence on antibody heavy chain complementarity-determining region 3 (HCDR3) is a major barrier. Exploiting ultradeep human antibody sequencing data, we identified a diverse set of potential antibody precursors for a bnAb with dominant HCDR3 contacts. We then developed HIV envelope trimer–based immunogens that primed responses from rare bnAb-precursor B cells in a mouse model and bound a range of potential bnAb-precursor human naïve B cells in ex vivo screens. Our repertoire-guided germline-targeting approach provides a framework for priming the induction of many HIV bnAbs and could be applied to most HCDR3-dominant antibodies from other pathogens.


Great differences in performance and outcome of high-throughput sequencing data analysis platforms for fungal metabarcoding

10.7287/peerj.preprints.27019v2 ◽

2018 ◽

Author(s):

Sten Anslan ◽

Henrik Nilsson ◽

Christian Wurzbacher ◽

Petr Baldrian ◽

Leho Tedersoo ◽

...

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Computation Time ◽

Potential Effect ◽

Data Sets ◽

Sequencing Data ◽

Data Set ◽

Operational Taxonomic Units ◽

High Throughput Sequencing Data ◽

Recent Developments

Along with recent developments in high-throughput sequencing (HTS) technologies and thus fast accumulation of HTS data, there has been a growing need for, and interest in, developing tools for HTS data processing and communication. In particular, a number of bioinformatics tools have been designed for analysing metabarcoding data, each with specific features, assumptions and outputs. To evaluate the potential effect of applying different bioinformatics workflows on the results, we compared the performance of different analysis platforms on two contrasting high-throughput sequencing data sets. Our analysis revealed that the computation time, the quality of error filtering and hence the output of a specific bioinformatics process largely depend on the platform used. Our results show that none of the bioinformatics workflows appears to perfectly filter out the accumulated errors and generate Operational Taxonomic Units, although PipeCraft, LotuS and PIPITS perform better than QIIME2 and Galaxy for the tested fungal amplicon data set. We conclude that the output of each platform requires manual validation of the OTUs by examining the taxonomy assignment values.


DiscoMark: Nuclear marker discovery from orthologous sequences using draft genome data

10.1101/047282 ◽

2016 ◽

Author(s):

Sereina Rutschmann ◽

Harald Detering ◽

Sabrina Simon ◽

Jakob Fredslund ◽

Michael T. Monaghan

Keyword(s):

Related Species ◽

High Throughput Sequencing ◽

Nuclear Dna ◽

Draft Genome ◽

Local Alignment ◽

Sequencing Data ◽

Sequence Alignments ◽

Multiple Sequence ◽

Data Set ◽

Haplotype Networks

Abstract: High-throughput sequencing has laid the foundation for fast and cost-effective development of phylogenetic markers. Here we present the program DISCOMARK, which streamlines the development of nuclear DNA (nDNA) markers from whole-genome (or whole-transcriptome) sequencing data, combining local alignment, alignment trimming, reference mapping and primer design based on multiple sequence alignments in order to design primer pairs from input orthologous sequences. To demonstrate the suitability of DISCOMARK we designed markers for two groups of species, one consisting of closely related species and one of distantly related species. For the closely related members of the species complex of Cloeon dipterum s.l. (Insecta, Ephemeroptera), the program discovered a total of 78 markers. Among these, we selected eight markers for amplification and Sanger sequencing. The exon sequence alignments (2,526 base pairs (bp)) were used to reconstruct a well supported phylogeny and to infer clearly structured haplotype networks. For the distantly related species we designed primers for several families in the insect order Ephemeroptera, using available genomic data from four sequenced species. We developed primer pairs for 23 markers that are designed to amplify across several families. The DISCOMARK program will enhance the development of new nDNA markers by providing a streamlined, automated approach to perform genome-scale scans for phylogenetic markers. The program is written in Python, released under a public license (GNU GPL v2), and together with a manual and example data set available at: https://github.com/hdetering/discomark.


PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

BioMed Research International ◽

10.1155/2016/4986707 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Qiang Yu ◽

Hongwei Huo ◽

Dazheng Feng

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

High Throughput Sequencing ◽

Hamming Distance ◽

Simulated Data ◽

Real Data ◽

Identification Accuracy ◽

Data Sets ◽

Sequencing Data ◽

Data Set

Identifying conserved patterns in DNA sequences, namely, motif discovery, is an important and challenging computational task. A high-throughput sequencing data set, which contains hundreds of sequences or more, helps improve the identification accuracy of motif discovery but demands even higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP, but also for other DNA data mining tasks with the same demand. Experimental results on the simulated data show that the proposed algorithm can find motifs successfully and runs faster than the state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
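To make the idea of "pairs of l-mers with relatively small Hamming distance" concrete, here is a brute-force sketch. It is not the rapid extraction method that the abstract describes, whose details are not given here. The function names and the `max_dist` threshold are assumptions chosen for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("l-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lmers(seq: str, l: int):
    """Yield every substring of length l (an l-mer) of seq, left to right."""
    for i in range(len(seq) - l + 1):
        yield seq[i:i + l]


def close_lmer_pairs(sequences, l, max_dist):
    """Naive illustration: compare all l-mers between every pair of sequences.

    Returns tuples (i, a, j, b) where l-mer a from sequence i and l-mer b from
    sequence j differ in at most max_dist positions. The running time is
    quadratic in the input size, so this only suits small examples.
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for a in lmers(s, l):
            for b in lmers(t, l):
                if hamming(a, b) <= max_dist:
                    pairs.append((i, a, j, b))
    return pairs


if __name__ == "__main__":
    for pair in close_lmer_pairs(["ACGTAC", "TTACGA"], l=4, max_dist=1):
        print(pair)
```

Because every l-mer is compared with every other, the cost grows quickly with data set size, which is the scaling problem a dedicated pair-extraction method is meant to avoid.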


SeqCNV: a novel method for identification of copy number variations in targeted next-generation sequencing data

BMC Bioinformatics ◽

10.1186/s12859-017-1566-3 ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 28

Author(s):

Yong Chen ◽

Li Zhao ◽

Yi Wang ◽

Ming Cao ◽

Violet Gelowani ◽

...

Keyword(s):

Next Generation Sequencing ◽

Copy Number ◽

Copy Number Variations ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Targeted Next Generation Sequencing ◽

Novel Method ◽

Generation Sequencing


vcfView: An Extensible Data Visualization and Quality Assurance Platform for Integrated Somatic Variant Analysis

Cancer Informatics ◽

10.1177/1176935120972377 ◽

2020 ◽

Vol 19 ◽

pp. 117693512097237

Author(s):

Brian O’Sullivan ◽

Cathal Seoighe

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Somatic Mutations ◽

Driver Mutations ◽

Sequencing Data ◽

Data Set ◽

Therapeutic Implications ◽

Cancer Driver ◽

Sequencing Errors ◽

The Status

Motivation: Somatic mutations can have critical prognostic and therapeutic implications for cancer patients. Although targeted methods are often used to assay specific cancer driver mutations, high throughput sequencing is frequently applied to discover novel driver mutations and to determine the status of less-frequent driver mutations. The task of recovering somatic mutations from these data is nontrivial as somatic mutations must be distinguished from germline variants, sequencing errors, and other artefacts. Consequently, bioinformatics pipelines for recovery of somatic mutations from high throughput sequencing typically involve a large number of analytical choices in the form of quality filters. Results: We present vcfView, an interactive tool designed to support the evaluation of somatic mutation calls from cancer sequencing data. The tool takes as input a single variant call format (VCF) file and enables researchers to explore the impacts of analytical choices on the mutant allele frequency spectrum, on mutational signatures and on annotated somatic variants in genes of interest. It allows variants that have failed variant caller filters to be re-examined to improve sensitivity or guide the design of future experiments. It is extensible, allowing other algorithms to be incorporated easily. Availability: The shiny application can be downloaded from GitHub ( https://github.com/BrianOSullivanGit/vcfView ). All data processing is performed within R to ensure platform independence. The app has been tested on RStudio, version 1.1.456, with base R 3.6.2 and Shiny 1.4.0. A vignette based on a publicly available data set is also available on GitHub.
