Multi-platform discovery of haplotype-resolved structural variation in human genomes

10.1101/193144 ◽

2017 ◽

Cited By ~ 32

Author(s):

Mark J.P. Chaisson ◽

Ashley D. Sanders ◽

Xuefang Zhao ◽

Ankit Malhotra ◽

David Porubsky ◽

...

Keyword(s):

Genome Sequencing ◽

Large Scale ◽

Structural Variation ◽

High Throughput Sequencing ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Full Spectrum ◽

Variant Discovery ◽

Sequencing Technologies ◽

Sequencing Studies

Abstract: The incomplete identification of structural variants (SVs) from whole-genome sequencing data limits studies of human genetic diversity and disease association. Here, we apply a suite of long-read, short-read, and strand-specific sequencing technologies, optical mapping, and variant discovery algorithms to comprehensively analyze three human parent–child trios to define the full spectrum of human genetic variation in a haplotype-resolved manner. We identify 818,054 indel variants (<50 bp) and 27,622 SVs (≥50 bp) per human genome. We also discover 156 inversions per genome, most of which previously escaped detection. Fifty-eight of the inversions we discovered intersect with the critical regions of recurrent microdeletion and microduplication syndromes. Taken together, our SV callsets represent a sevenfold increase in SV detection compared to most standard high-throughput sequencing studies, including those from the 1000 Genomes Project. The method and the dataset serve as a gold standard for the scientific community and we make specific recommendations for maximizing structural variation sensitivity for future large-scale genome sequencing studies.

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Utilizing the VirIdAl Pipeline to Search for Viruses in the Metagenomic Data of Bat Samples

Viruses ◽

10.3390/v13102006 ◽

2021 ◽

Vol 13 (10) ◽

pp. 2006

Author(s):

Anna Y Budkina ◽

Elena V Korneenko ◽

Ivan A Kotov ◽

Daniil A Kiselev ◽

Ilya V Artyushin ◽

...

Keyword(s):

Large Scale ◽

High Throughput Sequencing ◽

Metagenomic Data ◽

Sequencing Data ◽

Viral Pathogens ◽

Genomic Databases ◽

Bioinformatic Pipeline ◽

Viral Genomes ◽

Sequencing Technologies ◽

Viral Screening

According to various estimates, only a small percentage of existing viruses have been discovered, naturally much less being represented in the genomic databases. High-throughput sequencing technologies develop rapidly, empowering large-scale screening of various biological samples for the presence of pathogen-associated nucleotide sequences, but many organisms are yet to be attributed specific loci for identification. This problem particularly impedes viral screening, due to vast heterogeneity in viral genomes. In this paper, we present a new bioinformatic pipeline, VirIdAl, for detecting and identifying viral pathogens in sequencing data. We also demonstrate the utility of the new software by applying it to viral screening of the feces of bats collected in the Moscow region, which revealed a significant variety of viruses associated with bats, insects, plants, and protozoa. The presence of alpha and beta coronavirus reads, including the MERS-like bat virus, deserves a special mention, as it once again indicates that bats are indeed reservoirs for many viral pathogens. In addition, it was shown that alignment-based methods were unable to identify the taxon for a large proportion of reads, and we additionally applied other approaches, showing that they can further reveal the presence of viral agents in sequencing data. However, the incompleteness of viral databases remains a significant problem in the studies of viral diversity, and therefore necessitates the use of combined approaches, including those based on machine learning methods.

P1-129: Structural Variation (SV) in Heterogenous Whole-Genome Sequencing Data from 111 Families at Risk For Alzheimer's Disease: Alzheimer's Disease Sequencing Project SV Study

Alzheimer's & Dementia ◽

10.1016/j.jalz.2016.06.877 ◽

2016 ◽

Vol 12 ◽

pp. P453-P453

Author(s):

Li Charlie Xia ◽

John Farrell ◽

Nancy Zhang ◽

William Salerno ◽

John Malamon ◽

...

Keyword(s):

Alzheimer's Disease ◽

At Risk ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Structural Variation ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

eSCAN: Scan Regulatory Regions for Aggregate Association Testing using Whole Genome Sequencing Data

10.1101/2020.11.30.405266 ◽

2020 ◽

Author(s):

Yingxi Yang ◽

Yuchen Yang ◽

Le Huang ◽

Jai G. Broome ◽

Adolfo Correa ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

New Technologies ◽

Real Data ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Association Testing ◽

Wide Range ◽

Sequencing Studies

Abstract: With advances in whole genome sequencing (WGS) technology, multiple statistical methods for aggregate association testing have been developed. Many common approaches aggregate variants in a given genomic window of a fixed/varying size and are not reliant on existing knowledge to define appropriate test units, resulting in most identified regions not being clearly linked to genes, limiting biological understanding. Functional information from new technologies (such as Hi-C and its derivatives), which can help link enhancers to the genes they affect, can be leveraged to predefine variant sets for aggregate testing in WGS. Therefore, in this paper we propose the eSCAN (Scan the Enhancers) method for genome-wide assessment of enhancer regions in sequencing studies, combining the advantages of dynamic window selection in SCANG with the advantages of increased incorporation of genomic annotation. eSCAN searches biologically meaningful searching windows, increasing power and aiding biological interpretation, as demonstrated by simulation studies under a wide range of scenarios. We also apply eSCAN for association analysis of blood cell traits using TOPMed WGS data from Women’s Health Initiative (WHI) and Jackson Heart Study (JHS). Results from this real data example show that eSCAN is able to capture more significant signals, and these signals are of shorter length and drive association of larger regions detected by other methods.

Evaluation of Single-Molecule Sequencing Technologies for Structural Variant Detection in Two Swedish Human Genomes

Genes ◽

10.3390/genes11121444 ◽

2020 ◽

Vol 11 (12) ◽

pp. 1444

Author(s):

Nazeefa Fatima ◽

Anna Petri ◽

Ulf Gyllensten ◽

Lars Feuk ◽

Adam Ameur

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Single Molecule ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Structural Variations ◽

Single Molecule Sequencing ◽

Human Samples

Long-read single molecule sequencing is increasingly used in human genomics research, as it allows accurate detection of large-scale DNA rearrangements such as structural variations (SVs) at high resolution. However, few studies have evaluated the performance of different single molecule sequencing platforms for SV detection in human samples. Here we performed Oxford Nanopore Technologies (ONT) whole-genome sequencing of two Swedish human samples (average 32× coverage) and compared the results to previously generated Pacific Biosciences (PacBio) data for the same individuals (average 66× coverage). Our analysis inferred an average of 17k and 23k SVs from the ONT and PacBio data, respectively, with a majority of them overlapping with an available multi-platform SV dataset. When comparing the SV calls in the two Swedish individuals, we find a higher concordance between ONT and PacBio SVs detected in the same individual as compared to SVs detected by the same technology in different individuals. Downsampling of PacBio reads, performed to obtain similar coverage levels for all datasets, resulted in 17k SVs per individual and improved overlap with the ONT SVs. Our results suggest that ONT and PacBio have a similar performance for SV detection in human whole genome sequencing data, and that both technologies are feasible for population-scale studies.

The MOBSTER R package for tumour subclonal deconvolution from bulk DNA whole-genome sequencing data

BMC Bioinformatics ◽

10.1186/s12859-020-03863-1 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Giulio Caravagna ◽

Guido Sanguinetti ◽

Trevor A. Graham ◽

Andrea Sottoriva

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

R Package ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Evolutionary Forces ◽

Evolutionary Trajectories ◽

Cancer Tissues

Abstract

Background: The large-scale availability of whole-genome sequencing profiles from bulk DNA sequencing of cancer tissues is fueling the application of evolutionary theory to cancer. From a bulk biopsy, subclonal deconvolution methods are used to determine the composition of cancer subpopulations in the biopsy sample, a fundamental step to determine clonal expansions and their evolutionary trajectories.

Results: In a recent work we have developed a new model-based approach to carry out subclonal deconvolution from the site frequency spectrum of somatic mutations. This new method integrates, for the first time, an explicit model for neutral evolutionary forces that participate in clonal expansions; in that work we have also shown that our method improves largely over competing data-driven methods. In this Software paper we present mobster, an open source R package built around our new deconvolution approach, which provides several functions to plot data and fit models, assess their confidence and compute further evolutionary analyses that relate to subclonal deconvolution.

Conclusions: We present the mobster package for tumour subclonal deconvolution from bulk sequencing, the first approach to integrate Machine Learning and Population Genetics which can explicitly model co-existing neutral and positive selection in cancer. We showcase the analysis of two datasets, one simulated and one from a breast cancer patient, and overview all package functionalities.

Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing

BMC Genomics ◽

10.1186/1471-2164-14-425 ◽

2013 ◽

Vol 14 (1) ◽

pp. 425 ◽

Cited By ~ 32

Author(s):

Shanrong Zhao ◽

Kurt Prenger ◽

Lance Smith ◽

Thomas Messina ◽

Hongtao Fan ◽

...

Keyword(s):

Cloud Computing ◽

Data Analysis ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Data Analysis

VPsero: Rapid Serotyping of Vibrio parahaemolyticus Using Serogroup-Specific Genes Based on Whole-Genome Sequencing Data

Frontiers in Microbiology ◽

10.3389/fmicb.2021.620224 ◽

2021 ◽

Vol 12 ◽

Author(s):

Shengzhe Bian ◽

Yangyang Jia ◽

Qiuyao Zhan ◽

Nai-Kei Wong ◽

Qinghua Hu ◽

...

Keyword(s):

Vibrio Parahaemolyticus ◽

High Throughput Sequencing ◽

Capsular Polysaccharide ◽

High Specificity ◽

Gene Clusters ◽

Whole Genome Sequencing Data ◽

Sequencing Data ◽

Sequencing Technologies ◽

Marine Habitats ◽

Specificity And Sensitivity

Vibrio parahaemolyticus has emerged as a significant enteropathogen in human and marine habitats worldwide, notably in regions where aquaculture products constitute a major nutritional source. It is a growing cause of diseases including gastroenteritis, wound infections, and septicemia. Serotyping assays use commercially available antisera to identify V. parahaemolyticus strains, but this approach is limited by high costs, complicated procedures, cross-immunoreactivity, and often subjective interpretation. By leveraging high-throughput sequencing technologies, we developed an in silico method based on comparison of gene clusters for lipopolysaccharide (LPSgc) and capsular polysaccharide (CPSgc) by firstly using the unique-gene strategy. The algorithm, VPsero, which exploits serogroup-specific genes as markers, covers 43 K and all 12 O serogroups in serotyping assays. VPsero is capable of predicting serotypes from assembled draft genomes, outputting LPSgc/CPSgc sequences, and recognizing possible novel serogroups or populations. Our tool displays high specificity and sensitivity in prediction toward V. parahaemolyticus strains, with an average sensitivity in serogroup prediction of 0.910 for O and 0.961 for K serogroups and a corresponding average specificity of 0.990 for O and 0.998 for K serogroups.

F1-01-01: Structural Variation (SV) in Heterogenous Whole-Genome Sequencing Data From 111 Families at Risk For Alzheimer Disease: Alzheimer Disease Sequencing Project SV Study

Alzheimer's & Dementia ◽

10.1016/j.jalz.2016.06.271 ◽

2016 ◽

Vol 12 ◽

pp. P162-P162

Author(s):

Li Charlie Xia

Keyword(s):

At Risk ◽

Alzheimer Disease ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Structural Variation ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Identification of putative causal loci in whole-genome sequencing data via knockoff statistics

Nature Communications ◽

10.1038/s41467-021-22889-4 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Zihuai He ◽

Linxi Liu ◽

Chen Wang ◽

Yann Le Guen ◽

Justin Lee ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Rare Variants ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Association Tests ◽

Sequencing Project ◽

Risk Variants ◽

Sequencing Studies

Abstract: The analysis of whole-genome sequencing studies is challenging due to the large number of rare variants in noncoding regions and the lack of natural units for testing. We propose a statistical method to detect and localize rare and common risk variants in whole-genome sequencing studies based on a recently developed knockoff framework. It can (1) prioritize causal variants over associations due to linkage disequilibrium thereby improving interpretability; (2) help distinguish the signal due to rare variants from shadow effects of significant common variants nearby; (3) integrate multiple knockoffs for improved power, stability, and reproducibility; and (4) flexibly incorporate state-of-the-art and future association tests to achieve the benefits proposed here. In applications to whole-genome sequencing data from the Alzheimer’s Disease Sequencing Project (ADSP) and COPDGene samples from NHLBI Trans-Omics for Precision Medicine (TOPMed) Program we show that our method compared with conventional association tests can lead to substantially more discoveries.
