Reveel: large-scale population genotyping using low-coverage sequencing data

10.1101/011882 ◽

2014 ◽

Cited By ~ 1

Author(s):

Lin Huang ◽

Bo Wang ◽

Ruitang Chen ◽

Sivan Bercovici ◽

Serafim Batzoglou

Keyword(s):

Linkage Disequilibrium ◽

Missing Data ◽

Large Scale ◽

Low Frequency ◽

Genomic Variation ◽

Whole Genome ◽

Single Individual ◽

Joint Inference ◽

Low Coverage ◽

Allele Discovery

Population low-coverage whole-genome sequencing is rapidly emerging as a prominent approach for discovering genomic variation and genotyping a cohort. This approach combines substantially lower cost than full-coverage sequencing with whole-genome discovery of low-allele-frequency variants, to an extent that is not possible with array genotyping or exome sequencing. However, a challenging computational problem arises when attempting to discover variants and genotype the entire cohort. Variant discovery and genotyping are relatively straightforward on a single individual that has been sequenced at high coverage, because the inference decomposes into the independent genotyping of each genomic position for which a sufficient number of confidently mapped reads are available. However, in cases where low-coverage population data are given, the joint inference requires leveraging the complex linkage disequilibrium patterns in the cohort to compensate for sparse and missing data in each individual. The potentially massive computation time for such inference, as well as the missing data that confound low-frequency allele discovery, need to be overcome for this approach to become practical. Here, we present Reveel, a novel method for single nucleotide variant calling and genotyping of large cohorts that have been sequenced at low coverage. Reveel introduces a novel technique for leveraging linkage disequilibrium that deviates from previous Markov-based models. We evaluate Reveel's performance through extensive simulations as well as real data from the 1000 Genomes Project, and show that it achieves higher accuracy in low-frequency allele discovery and substantially lower computation cost than previous state-of-the-art methods.
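To illustrate why per-site genotyping is straightforward at high coverage but ambiguous at low coverage, here is a minimal Python sketch of a textbook binomial genotype likelihood for one biallelic site. It is not Reveel's algorithm: the function names, the flat prior, and the 1% error rate are assumptions made only for this example.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Likelihood of the read counts under 0, 1, or 2 alternate-allele copies.

    Assumes independent reads and a symmetric per-read error rate
    (both are simplifying assumptions for illustration).
    """
    n = ref_reads + alt_reads
    likelihoods = []
    for alt_copies in (0, 1, 2):
        frac_alt = alt_copies / 2
        # Probability that a single read shows the alternate allele.
        p_alt = frac_alt * (1 - error_rate) + (1 - frac_alt) * error_rate
        likelihoods.append(
            comb(n, alt_reads) * p_alt**alt_reads * (1 - p_alt)**ref_reads
        )
    return likelihoods

def normalize(values):
    """Turn likelihoods into posteriors under a flat prior."""
    total = sum(values)
    return [v / total for v in values]

# 30 reads split evenly: the heterozygous genotype is essentially certain.
high_coverage = normalize(genotype_likelihoods(ref_reads=15, alt_reads=15))

# A single reference read: homozygous-reference and heterozygous both stay plausible.
low_coverage = normalize(genotype_likelihoods(ref_reads=1, alt_reads=0))

print(high_coverage)
print(low_coverage)
```

With one read the posterior cannot separate the homozygous-reference and heterozygous genotypes. This is the gap that joint inference across the cohort, using linkage disequilibrium, is meant to close.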

ABySS 2.0: Resource-Efficient Assembly of Large Genomes using a Bloom Filter

10.1101/068338 ◽

2016 ◽

Cited By ~ 4

Author(s):

Shaun D Jackman ◽

Benjamin P Vandervalk ◽

Hamid Mohamadi ◽

Justin Chu ◽

Sarah Yeo ◽

...

Keyword(s):

Human Genome ◽

Dna Sequences ◽

Message Passing ◽

Large Scale ◽

De Novo ◽

Bloom Filter ◽

Genomic Variation ◽

De Bruijn Graph ◽

Single Individual ◽

Probabilistic Data Structure

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps towards elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically, and coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50 bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its re-design, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We present assembly benchmarks of human Genome in a Bottle 250 bp Illumina paired-end and 6 kbp mate-pair libraries from a single individual, yielding a NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using less than 35 GB of RAM, a modest memory requirement by today’s standard that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics’ Chromium data to further improve the scaffold contiguity of this assembly to 42 (15) Mbp.
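The idea of representing a de Bruijn graph with a Bloom filter can be sketched in a few lines of Python. This is only an illustration of the data structure, not ABySS 2.0's implementation: the class, the SHA-256-based hashing, and the filter size are assumptions chosen for readability rather than speed.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array with several hash functions; no false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive one bit position per seed from a salted SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )

def kmers(sequence, k):
    """Yield every length-k substring of the sequence."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))

def successors(kmer, bloom):
    """Graph edges are implicit: query the four possible one-base extensions."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

read = "ACGTTGCATG"
k = 4
bloom = BloomFilter(num_bits=1 << 16, num_hashes=3)
for kmer in kmers(read, k):
    bloom.add(kmer)

print(successors("TTGC", bloom))
```

The filter stores only bits, never the k-mers themselves, which is where the memory saving comes from. The cost is occasional false positives, so a real assembler needs extra steps to deal with spurious edges.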

Seave: a comprehensive web platform for storing and interrogating human genomic variation

10.1101/258061 ◽

2018 ◽

Cited By ~ 3

Author(s):

Velimir Gayevskiy ◽

Tony Roscioli ◽

Marcel E Dinger ◽

Mark J Cowley

Keyword(s):

Cloud Computing ◽

Large Scale ◽

Variant Calling ◽

Genomic Variation ◽

Whole Genome ◽

Genome Data ◽

Pathogenicity Prediction ◽

Data Scaling ◽

Human Genomic ◽

Web Platform

Capability for genome sequencing and variant calling has increased dramatically, enabling large-scale genomic interrogation of human disease. However, discovery is hindered by the current limitations in genomic interpretation, which remains a complicated and disjointed process. We introduce Seave, a web platform that enables variants to be easily filtered and annotated with in silico pathogenicity prediction scores and annotations from popular disease databases. Seave stores genomic variation of all types and sizes, and allows filtering for specific inheritance patterns, quality values, allele frequencies and gene lists. Seave is open source and deployable locally, or on a cloud computing provider, and works readily with gene panel, exome and whole genome data, scaling from single labs to multi-institution scale.
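The kinds of filter listed above (inheritance pattern, quality, allele frequency, gene list) can be pictured with a small Python sketch. Everything here is hypothetical: the `Variant` fields, the function names, and the thresholds are assumptions for illustration and do not reflect Seave's actual schema or interface.

```python
from dataclasses import dataclass, field

@dataclass
class Variant:
    # Hypothetical record layout, not Seave's data model.
    gene: str
    quality: float
    allele_frequency: float
    genotypes: dict = field(default_factory=dict)  # sample -> alt allele count

def passes_filters(variant, gene_list, min_quality, max_allele_frequency):
    """Keep rare, well-supported variants in genes of interest."""
    return (
        variant.gene in gene_list
        and variant.quality >= min_quality
        and variant.allele_frequency <= max_allele_frequency
    )

def fits_recessive(variant, affected, parents):
    """Homozygous-recessive pattern: affected child 2 copies, each parent 1."""
    return variant.genotypes.get(affected) == 2 and all(
        variant.genotypes.get(parent) == 1 for parent in parents
    )

variants = [
    Variant("GENE_A", 60.0, 0.001, {"child": 2, "mother": 1, "father": 1}),
    Variant("GENE_A", 10.0, 0.001, {"child": 2, "mother": 1, "father": 1}),
    Variant("GENE_B", 60.0, 0.200, {"child": 1, "mother": 1, "father": 0}),
]

candidates = [
    v for v in variants
    if passes_filters(v, {"GENE_A"}, min_quality=30.0, max_allele_frequency=0.01)
    and fits_recessive(v, "child", ["mother", "father"])
]
print(len(candidates))
```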

Variance in Estimated Pairwise Genetic Distance Under High versus Low Coverage Sequencing: the Contribution of Linkage Disequilibrium

10.1101/108928 ◽

2017 ◽

Author(s):

Max Shpak ◽

Yang Ni ◽

Jie Lu ◽

Peter Müller

Keyword(s):

Linkage Disequilibrium ◽

Genetic Distance ◽

Low Frequency ◽

Allele Frequencies ◽

High Coverage ◽

Pairwise Linkage Disequilibrium ◽

Pairwise Genetic Distance ◽

Linkage Disequilibria ◽

The Mean ◽

Low Coverage

The mean pairwise genetic distance among haplotypes is an estimator of the population mutation rate θ and a standard measure of variation in a population. With the advent of next-generation sequencing (NGS) methods, this and other population parameters can be estimated under different modes of sampling. One approach is to sequence individual genomes with high coverage, and to calculate genetic distance over all sample pairs. The second approach, typically used for microbial samples or for tumor cells, is sequencing a large number of pooled genomes with very low individual coverage. With low coverage, pairwise genetic distances are calculated across independently sampled sites rather than across individual genomes. In this study, we show that the variance in genetic distance estimates is reduced with low coverage sampling if the mean pairwise linkage disequilibrium weighted by allele frequencies is positive. Practically, this means that if on average the most frequent alleles over pairs of loci are in positive linkage disequilibrium, low coverage sequencing results in improved estimates of θ, assuming similar per-site read depths. We show that this result holds under the expected distribution of allele frequencies and linkage disequilibria for an infinite sites model at mutation-drift equilibrium. From simulations, we find that the conditions for reduced variance only fail to hold in cases where variant alleles are few and at very low frequency. These results are applied to haplotype frequencies from a lung cancer tumor to compute the weighted linkage disequilibria and the expected error in estimated genetic distance using high versus low coverage.
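As a small worked illustration of the quantity being estimated, the Python sketch below computes the mean pairwise distance in two ways: over all haplotype pairs, and as a sum of per-site terms built from allele frequencies. The toy haplotypes and function names are assumptions. The sketch shows only that the two calculations agree on complete data, and does not reproduce the paper's low-coverage sampling model or its variance result.

```python
from itertools import combinations

def mean_pairwise_distance(haplotypes):
    """Average number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)

def per_site_distance(haplotypes):
    """Same quantity as a sum over sites, using only allele frequencies.

    Assumes biallelic sites coded as '0' and '1'.
    """
    n = len(haplotypes)
    total = 0.0
    for column in zip(*haplotypes):
        p = column.count("1") / n
        # Expected difference between two distinct haplotypes at this site.
        total += 2 * p * (1 - p) * n / (n - 1)
    return total

haplotypes = ["0010", "0110", "1010"]
by_pairs = mean_pairwise_distance(haplotypes)
by_sites = per_site_distance(haplotypes)
print(by_pairs, by_sites)
```

The per-site form needs only allele frequencies at each site, which is why it remains usable when each site is sampled independently from pooled genomes.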

Loose ends in cancer genome structure

10.1101/2021.05.26.445837 ◽

2021 ◽

Author(s):

Julie M Behr ◽

Xiaotong Yao ◽

Kevin Hadi ◽

Huasong Tian ◽

Aditya Deshpande ◽

...

Keyword(s):

Large Scale ◽

Structural Variation ◽

Genome Structure ◽

Cancer Genome ◽

Genomic Variation ◽

Future Research ◽

Whole Genome ◽

Structural Genomic ◽

Short Read ◽

Pan Cancer

Recent pan-cancer studies have delineated patterns of structural genomic variation across thousands of tumor whole genome sequences. It is not known to what extent the shortcomings of short read (≤ 150 bp) whole genome sequencing (WGS) used for structural variant analysis have limited our understanding of cancer genome structure. To formally address this, we introduce the concept of "loose ends": copy number alterations that cannot be mapped to a rearrangement by WGS but can be indirectly detected through the analysis of junction-balanced genome graphs. Analyzing 2,319 pan-cancer WGS cases across 31 tumor types, we found loose ends were enriched in reference repeats and fusions of the mappable genome to repetitive or foreign sequences. Among these we found genomic footprints of neotelomeres, which were surprisingly enriched in cancers with low telomerase expression and alternate lengthening of telomeres phenotype. Our results also provide a rigorous upper bound on the role of non-allelic homologous recombination (NAHR) in large-scale cancer structural variation, while nominating INO80, FANCA, and ARID1A as positive modulators of somatic NAHR. Taken together, we estimate that short read WGS maps >97% of all large-scale (>10 kbp) cancer structural variation; the rest represent loose ends that require long molecule profiling to unambiguously resolve. Our results have broad relevance for future research and clinical applications of short read WGS and delineate precise directions where long molecule studies might provide transformative insight into cancer genome structure.

Population Genotype Calling from Low-coverage Sequencing Data

10.1101/085936 ◽

2016 ◽

Author(s):

Lin Huang ◽

Petr Danecek ◽

Sivan Bercovici ◽

Serafim Batzoglou

Keyword(s):

Large Scale ◽

Whole Genome ◽

Sequencing Data ◽

Efficient Manner ◽

Entire Cohort ◽

The Public ◽

Wide Range ◽

Scale Population ◽

Cost Efficient ◽

Low Coverage

In recent years, several large-scale whole-genome projects sequencing tens of thousands of individuals were completed, with larger studies underway. These projects aim to provide high-quality genotypes for a large number of whole genomes in a cost-efficient manner, by sequencing each genome at low coverage and subsequently identifying alleles jointly in the entire cohort. Here we present Ref-Reveel, a novel method for large-scale population genotyping. We show that Ref-Reveel provides genotyping at a higher accuracy and higher efficiency in comparison to existing methods by applying our method to one of the largest whole-genome sequencing datasets presently available to the public. We further show that utilizing the resulting genotype panel as references, through the Ref-Reveel framework, greatly improves the ability to call genotypes accurately on newly sequenced genomes. In addition, we present a Ref-Reveel pipeline that is applicable for genotyping of very small datasets. In summary, Ref-Reveel is an accurate, scalable and applicable method for a wide range of genotyping scenarios, and will greatly improve the quality of calling genomic alterations in current and future large-scale sequencing projects.

An upsurge of SARS CoV-2 B.1.1.7 Variant in Pakistan

10.1101/2021.02.26.21252562 ◽

2021 ◽

Author(s):

Massab Umair ◽

Muhammad Salman ◽

Zaira Rehman ◽

Nazish Badar ◽

Abdul Ahad ◽

...

Keyword(s):

Large Scale ◽

Low Frequency ◽

Whole Genome ◽

Initial Screening ◽

Gene Target ◽

Health Concerns ◽

Partial Sequencing ◽

Spike Gene ◽

The United Kingdom ◽

High Prevalence

The emergence of a more transmissible variant of SARS-CoV-2 (B.1.1.7) in the United Kingdom (UK) during late 2020 has raised major public health concerns. Several mutations have been reported in the genome of the B.1.1.7 variant, including N501Y and the 69-70 deletion in the Spike, which have implications for virus transmissibility and diagnostics. Although the B.1.1.7 variant has been reported from several countries, only two cases have been identified through whole-genome sequencing from Pakistan. We used a two-step strategy for the detection of B.1.1.7, with initial screening through the ThermoFisher TaqPath™ SARS-CoV-2 kit followed by partial sequencing of the Spike (S) gene of samples having spike gene target failure (SGTF) on real-time PCR. From January 01, 2021, to February 21, 2021, a total of 2,650 samples were tested for the presence of SARS-CoV-2 using the TaqPath™ kit and 70.4% (n=1,867) showed amplification of all the 3 genes (S, N, and ORF). Notably, 29.6% (n=783) of samples had the spike gene target failure (SGTF). The SGTF cases were detected at a low frequency during the first three weeks of January (n=10, n=13, and n=1 respectively); however, the cases started to increase in the last week. During February, 726 (93%) cases of SGTF were reported, with a peak (n=345) found during the 3rd week. Based on the partial sequencing of the spike gene of SGTF samples (n=15), 93% (n=14) showed the characteristic N501Y, A570D, P681H, and T716I mutations found in the B.1.1.7 variant. Our findings highlight the high prevalence of B.1.1.7 in Pakistan and warrant large-scale genomic surveillance and strengthening of the laboratory network in the country.

Genomic Comparison of Campylobacter spp. and Their Potential for Zoonotic Transmission between Birds, Primates, and Livestock

Applied and Environmental Microbiology ◽

10.1128/aem.01746-16 ◽

2016 ◽

Vol 82 (24) ◽

pp. 7165-7175 ◽

Cited By ~ 39

Author(s):

Allison M. Weis ◽

Dylan B. Storey ◽

Conor C. Taff ◽

Andrea K. Townsend ◽

Bihua C. Huang ◽

...

Keyword(s):

Host Species ◽

Large Scale ◽

Genomic Variation ◽

Zoonotic Transmission ◽

Fluoroquinolone Resistance ◽

Whole Genome ◽

Whole Genome Analysis ◽

Pathogenic Potential ◽

Content Type ◽

New Genes

ABSTRACT Campylobacter is the leading cause of human gastroenteritis worldwide. Wild birds, including American crows, are abundant in urban, suburban, and agricultural settings and are likely zoonotic vectors of Campylobacter. Their proximity to humans and livestock increases the potential spreading of Campylobacter via crows between the environment, livestock, and humans. However, no studies have definitively demonstrated that crows are a vector for pathogenic Campylobacter. We used genomics to evaluate the zoonotic and pathogenic potential of Campylobacter from crows to other animals with 184 isolates obtained from crows, chickens, cows, sheep, goats, humans, and nonhuman primates. Whole-genome analysis uncovered two distinct clades of Campylobacter jejuni genotypes; the first contained genotypes found only in crows, while a second genotype contained “generalist” genomes that were isolated from multiple host species, including isolates implicated in human disease, primate gastroenteritis, and livestock abortion. Two major β-lactamase genes were observed frequently in these genomes (oxa-184, 55%, and oxa-61, 29%), where oxa-184 was associated only with crows and oxa-61 was associated with generalists. Mutations in gyrA, indicative of fluoroquinolone resistance, were observed in 14% of the isolates. Tetracycline resistance (tetO) was present in 22% of the isolates, yet it occurred in 91% of the abortion isolates. Virulence genes were distributed throughout the genomes; however, cdtC alleles recapitulated the crow-only and generalist clades. A specific cdtC allele was associated with abortion in livestock and was concomitant with tetO. These findings indicate that crows harboring a generalist C. jejuni genotype may act as a vector for the zoonotic transmission of Campylobacter.

IMPORTANCE This study examined the link between public health and the genomic variation of Campylobacter in relation to disease in humans, primates, and livestock. Use of large-scale whole-genome sequencing enabled population-level assessment to find new genes that are linked to livestock disease. With 184 Campylobacter genomes, we assessed virulence traits, antibiotic resistance susceptibility, and the potential for zoonotic transfer to observe that there is a “generalist” genotype that may move between host species.

0306 Exploring the feasibility of using copy number variants as genetic markers through large-scale whole genome sequencing experiments

Journal of Animal Science ◽

10.2527/jam2016-0306 ◽

2016 ◽

Vol 94 (suppl_5) ◽

pp. 146-146

Author(s):

D. M. Bickhart ◽

L. Xu ◽

J. L. Hutchison ◽

J. B. Cole ◽

D. J. Null ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genetic Markers ◽

Genome Sequencing ◽

Copy Number ◽

Large Scale ◽

Copy Number Variants ◽

Whole Genome

Plasmids or no plasmids? A comparison between the Agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

Chinese Holstein Cattle effective population size estimated from whole genome linkage disequilibrium

Hereditas (Beijing) ◽

10.3724/sp.j.1005.2012.00050 ◽

2012 ◽

Vol 34 (1) ◽

pp. 50-58 ◽

Cited By ~ 2

Author(s):

Gui-Yan NI ◽

Zhe ZHANG ◽

Li JIANG ◽

Pei-Pei MA ◽

Qin ZHANG ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Population Size ◽

Effective Population Size ◽

Holstein Cattle ◽

Whole Genome ◽

Effective Population ◽

Chinese Holstein ◽

Chinese Holstein Cattle
