Population Genotype Calling from Low-coverage Sequencing Data

Mapping Intimacies ◽

10.1101/085936 ◽

2016 ◽

Author(s):

Lin Huang ◽

Petr Danecek ◽

Sivan Bercovici ◽

Serafim Batzoglou

Keyword(s):

Large Scale ◽

Whole Genome ◽

Sequencing Data ◽

Efficient Manner ◽

Entire Cohort ◽

The Public ◽

Wide Range ◽

Scale Population ◽

Cost Efficient ◽

Low Coverage

In recent years, several large-scale whole-genome projects sequencing tens of thousands of individuals were completed, with larger studies underway. These projects aim to provide high-quality genotypes for a large number of whole genomes in a cost-efficient manner, by sequencing each genome at low coverage and subsequently identifying alleles jointly in the entire cohort. Here we present Ref-Reveel, a novel method for large-scale population genotyping. By applying our method to one of the largest whole-genome sequencing datasets presently available to the public, we show that Ref-Reveel provides genotyping at higher accuracy and higher efficiency than existing methods. We further show that utilizing the resulting genotype panel as a reference, through the Ref-Reveel framework, greatly improves the ability to call genotypes accurately on newly sequenced genomes. In addition, we present a Ref-Reveel pipeline that is applicable to genotyping of very small datasets. In summary, Ref-Reveel is an accurate, scalable and applicable method for a wide range of genotyping scenarios, and will greatly improve the quality of calling genomic alterations in current and future large-scale sequencing projects.

Reveel: large-scale population genotyping using low-coverage sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btv530 ◽

2015 ◽

Vol 32 (11) ◽

pp. 1686-1696 ◽

Cited By ~ 4

Author(s):

Lin Huang ◽

Bo Wang ◽

Ruitang Chen ◽

Sivan Bercovici ◽

Serafim Batzoglou

Keyword(s):

Large Scale ◽

Sequencing Data ◽

Scale Population ◽

Low Coverage

Genotyping by low-coverage whole-genome sequencing in intercross pedigrees from outbred founders: a cost efficient approach

10.1101/421768 ◽

2018 ◽

Author(s):

Yanjun Zan ◽

Thibaut Payen ◽

Mette Lillie ◽

Christa F. Honaker ◽

Paul B. Siegel ◽

...

Keyword(s):

High Resolution ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sequence Data ◽

Genotype Imputation ◽

Whole Genome ◽

Efficient Manner ◽

Founder Line ◽

Cost Efficient ◽

Low Coverage

Abstract. Background: Experimental intercrosses between outbred founder populations are powerful resources for mapping loci contributing to complex traits (Quantitative Trait Loci or QTL). Here, we present an approach and accompanying software for high-resolution genotype imputation in such populations using whole-genome high coverage sequence data on founder individuals (∼30×) and low coverage sequence data on intercross individuals (∼0.4×). The method is illustrated in a large F2 pedigree between lines of chickens that have been divergently selected for 40 generations for the same trait (body weight at 8 weeks of age). Results: Described is how hundreds of individuals were whole-genome sequenced in a cost- and time-efficient manner using a Tn5-based library preparation protocol optimized for this application. In total, 7.6M markers segregated in this pedigree and 10.0 to 13.7% were informative for imputing the founder line genotypes within the F0-F2 families. The genotypes imputed from low coverage sequence data were consistent with the founder line genotypes estimated using SNP and microsatellite markers both at individual imputed sites (92%) and across the genome of individual chickens (93%). The resolution of the recombination breakpoints was high, with 50% being resolved within <10 kb. Conclusions: A method for genotype imputation from low-coverage whole-genome sequencing in outbred intercrosses is described and evaluated. By applying it to an outbred chicken F2 cross it is illustrated that it provides high-quality, high-resolution genotypes in a time- and cost-efficient manner.

Plasmids or no plasmids? A comparison between the agilent TapeStation and whole-genome sequencing data in a large-scale bacterial sequencing project

10.26226/morressier.56d5ba27d462b80296c95fe7 ◽

2016 ◽

Author(s):

Sarah Alexander

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Project

High-precision and cost-efficient sequencing for real-time COVID-19 surveillance

Scientific Reports ◽

10.1038/s41598-021-93145-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung Yong Park ◽

Gina Faraci ◽

Pamela M. Ward ◽

Jane F. Emerson ◽

Ha Youn Lee

Keyword(s):

Los Angeles ◽

Whole Genome Sequencing ◽

Real Time ◽

Genome Sequencing ◽

High Precision ◽

High Throughput Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Public Health Response ◽

Cost Efficient

Abstract. COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ nasopharyngeal/oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who were COVID-19 test positive during April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.

A Phylogenomic Supertree of Birds

Diversity ◽

10.3390/d11070109 ◽

2019 ◽

Vol 11 (7) ◽

pp. 109 ◽

Cited By ~ 17

Author(s):

Rebecca T. Kimball ◽

Carl H. Oliveros ◽

Ning Wang ◽

Noor D. White ◽

F. Keith Barker ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Bird Species ◽

Divide And Conquer ◽

Clear Understanding ◽

Whole Genome ◽

Efficient Manner ◽

Sequence Capture ◽

Branch Lengths ◽

Supertree Methods

It has long been appreciated that analyses of genomic data (e.g., whole genome sequencing or sequence capture) have the potential to reveal the tree of life, but it remains challenging to move from sequence data to a clear understanding of evolutionary history, in part due to the computational challenges of phylogenetic estimation using genome-scale data. Supertree methods solve that challenge because they facilitate a divide-and-conquer approach for large-scale phylogeny inference by integrating smaller subtrees in a computationally efficient manner. Here, we combined information from sequence capture and whole-genome phylogenies using supertree methods. However, the available phylogenomic trees had limited overlap so we used taxon-rich (but not phylogenomic) megaphylogenies to weave them together. This allowed us to construct a phylogenomic supertree, with support values, that included 707 bird species (~7% of avian species diversity). We estimated branch lengths using mitochondrial sequence data and we used these branch lengths to estimate divergence times. Our time-calibrated supertree supports radiation of all three major avian clades (Palaeognathae, Galloanseres, and Neoaves) near the Cretaceous-Paleogene (K-Pg) boundary. The approach we used will permit the continued addition of taxa to this supertree as new phylogenomic data are published, and it could be applied to other taxa as well.

Improving tuberculosis surveillance by detecting international transmission using publicly available whole-genome sequencing data

10.1101/834150 ◽

2019 ◽

Author(s):

Andrea Sanchini ◽

Christine Jandrasits ◽

Julius Tembrockhaus ◽

Thomas Andreas Kohl ◽

Christian Utpatel ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Large Scale ◽

Added Value ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

International Transmission ◽

The Public ◽

Public Dataset ◽

Public Repositories

Abstract. Introduction: Improving the surveillance of tuberculosis (TB) is especially important for multidrug-resistant (MDR) and extensively drug-resistant (XDR)-TB. The large amount of publicly available whole-genome sequencing (WGS) data for TB gives us the chance to re-use data and to perform additional analysis at a large scale. Aim: We assessed the usefulness of raw WGS data of global MDR/XDR-TB isolates available from public repositories to improve TB surveillance. Methods: We extracted raw WGS data and the related metadata of Mycobacterium tuberculosis isolates available from the Sequence Read Archive. We compared this public dataset with WGS data and metadata of 131 MDR- and XDR-TB isolates from Germany in 2012-2013. Results: We aggregated a dataset that includes 1,081 MDR and 250 XDR isolates among which we identified 133 molecular clusters. In 16 clusters, the isolates were from at least two different countries. For example, cluster2 included 56 MDR/XDR isolates from Moldova, Georgia, and Germany. By comparing the WGS data from Germany and the public dataset, we found that 11 clusters contained at least one isolate from Germany and at least one isolate from another country. We could, therefore, connect TB cases despite missing epidemiological information. Conclusion: We demonstrated the added value of using WGS raw data from public repositories to contribute to TB surveillance. By comparing the German and the public dataset, we identified potential international transmission events. Thus, using this approach might support the interpretation of national surveillance results in an international context.

Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources ◽

10.1111/1755-0998.13559 ◽

2021 ◽

Author(s):

Runyang Nicolas Lou ◽

Nina Overgaard Therkildsen

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Batch Effects ◽

Sequencing Data ◽

Population Genomic ◽

Genomic Studies ◽

Low Coverage

A Comparison of Double-Isotope Derivative and Radioimmunological Estimation of Plasma Aldosterone Concentration in Man

Clinical Science ◽

10.1042/cs0450411 ◽

1973 ◽

Vol 45 (3) ◽

pp. 411-415 ◽

Cited By ~ 24

Author(s):

R. Fraser ◽

Sheena Guest ◽

Jessie Young

Keyword(s):

Large Scale ◽

Plasma Aldosterone ◽

Plasma Aldosterone Concentration ◽

Plasma Pool ◽

Wide Range ◽

Significant Difference ◽

Derivative Method ◽

Scale Population ◽

Suitable Technique ◽

Double Isotope

1. Two techniques for estimating plasma aldosterone concentration are compared by means of repeated assays of a plasma pool and also by analysis of a wide range of plasma samples. 2. No significant difference was found in the results obtained by the two methods. Radioimmunoassay required only one tenth of the volume of plasma needed for the double-isotope derivative method. 3. Its rapidity and relative inexpensiveness make radioimmunoassay at present the most suitable technique for large-scale population screening.

Identifying tumor clones in sparse single-cell mutation data

Bioinformatics ◽

10.1093/bioinformatics/btaa449 ◽

2020 ◽

Vol 36 (Supplement_1) ◽

pp. i186-i193

Author(s):

Matthew A Myers ◽

Simone Zaccaria ◽

Benjamin J Raphael

Keyword(s):

Single Cell ◽

Genome Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Single Nucleotide ◽

Sequencing Coverage ◽

Sequencing Technologies ◽

Low Coverage ◽

Clonal Composition ◽

Cancer Studies

Abstract. Motivation: Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5× per cell) which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that are too sparse for current single-cell analysis methods. Results: We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2×. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV solution with sequencing coverage ≈0.03×, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage ≈0.5×, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset. Availability and implementation: SBMClone is available on the GitHub repository https://github.com/raphael-group/SBMClone. Supplementary information: Supplementary data are available at Bioinformatics online.

Real-bogus classification for the Zwicky Transient Facility using deep learning

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2357 ◽

2019 ◽

Vol 489 (3) ◽

pp. 3582-3590 ◽

Cited By ~ 15

Author(s):

Dmitry A Duev ◽

Ashish Mahabal ◽

Frank J Masci ◽

Matthew J Graham ◽

Ben Rusholme ◽

...

Keyword(s):

Deep Learning ◽

False Positive ◽

Large Scale ◽

Moving Objects ◽

False Negative ◽

Efficient Manner ◽

Astronomical Surveys ◽

Comparable Performance ◽

Cost Efficient ◽

Initial Results

Abstract. Efficient automated detection of flux-transient, re-occurring flux-variable, and moving objects is increasingly important for large-scale astronomical surveys. We present braai, a convolutional-neural-network, deep-learning real/bogus classifier designed to separate genuine astrophysical events and objects from false positive, or bogus, detections in the data of the Zwicky Transient Facility (ZTF), a new robotic time-domain survey currently in operation at the Palomar Observatory in California, USA. Braai demonstrates a state-of-the-art performance as quantified by its low false negative and false positive rates. We describe the open-source software tools used internally at Caltech to archive and access ZTF’s alerts and light curves (kowalski), and to label the data (zwickyverse). We also report the initial results of the classifier deployment on the Edge Tensor Processing Units that show comparable performance in terms of accuracy, but in a much more (cost-) efficient manner, which has significant implications for current and future surveys.
