Reproducible integration of multiple sequencing datasets to form high-confidence SNP, indel, and reference calls for five human genome reference materials

Mapping Intimacies ◽

10.1101/281006 ◽

2018 ◽

Cited By ~ 28

Author(s):

Justin M. Zook ◽

Jennifer McDaniel ◽

Hemang Parikh ◽

Haynes Heaton ◽

Sean A. Irvine ◽

...

Keyword(s):

Reference Materials ◽

Performance Metrics ◽

Genome Project ◽

Personal Genome ◽

High Confidence ◽

Global Alliance ◽

Human Genomes ◽

Open Consent ◽

Genome Context ◽

Human Genome Reference

Abstract Benchmark small variant calls from the Genome in a Bottle Consortium (GIAB) for the CEPH/HapMap genome NA12878 (HG001) have been used extensively for developing, optimizing, and demonstrating performance of sequencing and bioinformatics methods. Here, we develop a reproducible, cloud-based pipeline to integrate multiple sequencing datasets and form benchmark calls, enabling application to arbitrary human genomes. We use these reproducible methods to form high-confidence calls with respect to GRCh37 and GRCh38 for HG001 and 4 additional broadly-consented genomes from the Personal Genome Project that are available as NIST Reference Materials. These new genomes’ broad, open consent with few restrictions on availability of samples and data is enabling a uniquely diverse array of applications. Our new methods produce 17% more high-confidence SNPs, 176% more indels, and 12% larger regions than our previously published calls. To demonstrate that these calls can be used for accurate benchmarking, we compare other high-quality callsets to ours (e.g., Illumina Platinum Genomes), and we demonstrate that the majority of discordant calls are errors in the other callsets. We also highlight challenges in interpreting performance metrics when benchmarking against imperfect high-confidence calls. We show that benchmarking tools from the Global Alliance for Genomics and Health can be used with our calls to stratify performance metrics by variant type and genome context and elucidate strengths and weaknesses of a method.

Extensive sequencing of seven human genomes to characterize benchmark reference materials

10.1101/026468 ◽

2015 ◽

Cited By ~ 9

Author(s):

Justin M Zook ◽

David Catoe ◽

Jennifer McDaniel ◽

Lindsay Vang ◽

Noah Spies ◽

...

Keyword(s):

Human Genome ◽

Reference Materials ◽

De Novo ◽

Variant Calling ◽

Genome Project ◽

Genome Comparison ◽

Personal Genome ◽

Sequencing Data ◽

Sequencing Technologies ◽

Human Genomes

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.

Best Practices for Benchmarking Germline Small Variant Calls in Human Genomes

10.1101/270157 ◽

2018 ◽

Cited By ~ 14

Author(s):

Peter Krusche ◽

Len Trigg ◽

Paul C. Boutros ◽

Christopher E. Mason ◽

Francisco M. De La Vega ◽

...

Keyword(s):

Best Practices ◽

Performance Metrics ◽

Variant Calling ◽

Confidence Regions ◽

List Type ◽

High Confidence ◽

Global Alliance ◽

Comparative Performance ◽

Standardized Report ◽

Genome Context

Abstract Assessing accuracy of NGS variant calling is immensely facilitated by a robust benchmarking strategy and tools to carry it out in a standard way. Benchmarking variant calls requires careful attention to definitions of performance metrics, sophisticated comparison approaches, and stratification by variant type and genome context. The Global Alliance for Genomics and Health (GA4GH) Benchmarking Team has developed standardized performance metrics and tools for benchmarking germline small variant calls. This team includes representatives from sequencing technology developers, government agencies, academic bioinformatics researchers, clinical laboratories, and commercial technology and bioinformatics developers for whom benchmarking variant calls is essential to their work. Benchmarking variant calls is a challenging problem for many reasons: (1) Evaluating variant calls requires complex matching algorithms and standardized counting because the same variant may be represented differently in truth and query callsets. (2) Defining and interpreting resulting metrics such as precision (aka positive predictive value = TP/(TP+FP)) and recall (aka sensitivity = TP/(TP+FN)) requires standardization to draw robust conclusions about comparative performance for different variant calling methods. (3) Performance of NGS methods can vary depending on variant types and genome context; and as a result understanding performance requires meaningful stratification. (4) High-confidence variant calls and regions that can be used as “truth” to accurately identify false positives and negatives are difficult to define, and reliable calls for the most challenging regions and variants remain out of reach. We have made significant progress on standardizing comparison methods, metric definitions and reporting, as well as developing and using truth sets. Our methods are publicly available on GitHub (https://github.com/ga4gh/benchmarking-tools) and in a web-based app on precisionFDA, which allow users to compare their variant calls against truth sets and to obtain a standardized report on their variant calling performance. Our methods have been piloted in the precisionFDA variant calling challenges to identify the best-in-class variant calling methods within high-confidence regions. Finally, we recommend a set of best practices for using our tools and critically evaluating the results.
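
The precision and recall definitions quoted in this abstract, and the idea of stratifying them by variant type, can be illustrated with a short sketch. It is not the GA4GH tooling: it assumes each call has already been matched against the truth set and labelled TP, FP, or FN, which is the hard step the abstract describes. The function names and the toy counts are hypothetical.

```python
from collections import defaultdict


def precision_recall(tp, fp, fn):
    """Return (precision, recall) using TP/(TP+FP) and TP/(TP+FN).

    A metric is None when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


def stratified_metrics(records):
    """Compute precision and recall separately for each stratum.

    records: iterable of (stratum, outcome) pairs, where outcome is
    one of "TP", "FP", "FN" (already assigned by a comparison step).
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for stratum, outcome in records:
        counts[stratum][outcome] += 1
    return {
        stratum: precision_recall(c["TP"], c["FP"], c["FN"])
        for stratum, c in counts.items()
    }


# Hypothetical toy counts for illustration only; not data from the paper.
example = (
    [("SNP", "TP")] * 90 + [("SNP", "FP")] * 10 + [("SNP", "FN")] * 10
    + [("indel", "TP")] * 6 + [("indel", "FP")] * 2 + [("indel", "FN")] * 4
)
metrics = stratified_metrics(example)
for stratum, (precision, recall) in metrics.items():
    print(f"{stratum}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting the two strata separately shows a difference that pooled counts would hide, which is the motivation for stratification given in the abstract.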

PGP-UK: a research and citizen science hybrid project in support of personalized medicine

10.1101/288829 ◽

2018 ◽

Cited By ~ 1

Author(s):

Stephan Beck ◽

Alison M Berner ◽

Graham Bignell ◽

Maggie Bond ◽

...

Keyword(s):

Personalized Medicine ◽

Citizen Science ◽

Public Awareness ◽

Open Data ◽

Genome Project ◽

Personal Genome ◽

The Public ◽

Open Consent ◽

Free Open Source ◽

And Personalized Medicine

Abstract Molecular analyses such as whole-genome sequencing have become routine and are expected to be transformational for future healthcare and lifestyle decisions. Population-wide implementation of such analyses is, however, not without challenges, and multiple studies are ongoing to identify what these are and explore how they can be addressed. Defined as a research project, the Personal Genome Project UK (PGP-UK) is part of the global PGP network and focuses on open data sharing and citizen science to advance and accelerate personalized genomics and medicine. Here we report our findings on using an open consent recruitment protocol, active participant involvement, open access release of personal genome, methylome and transcriptome data and associated analyses, including 47 new variants predicted to affect gene function and innovative reports based on the analysis of genetic and epigenetic variants. For this pilot study, we recruited ten participants willing to actively engage as citizen scientists with the project. In addition, we introduce Genome Donation as a novel mechanism for openly sharing previously restricted data and discuss the first three donations received. Lastly, we present GenoME, a free, open-source educational app suitable for the lay public to allow exploration of personal genomes. Our findings demonstrate that citizen science-based approaches like PGP-UK have an important role to play in the public awareness, acceptance and implementation of genomics and personalized medicine.

Significant abundance of cis configurations of mutations in diploid human genomes

10.1101/221085 ◽

2017 ◽

Cited By ~ 2

Author(s):

Margret R. Hoehe ◽

Ralf Herwig ◽

Qing Mao ◽

Brock A. Peters ◽

Radoje Drmanac ◽

...

Keyword(s):

Genetic Variation ◽

Genome Project ◽

Personal Genome ◽

Functional Interpretation ◽

Protein Coding ◽

Specific Distribution ◽

Gene Sets ◽

Human Genomes ◽

Significant Enrichment ◽

Coding Variants

Abstract To fully understand human genetic variation, one must assess the specific distribution of variants between the two chromosomal homologues of genes, and any functional units of interest, as the phase of variants can significantly impact gene function and phenotype. To this end, we have systematically analyzed 18,121 autosomal protein-coding genes in 1,092 statistically phased genomes from the 1000 Genomes Project, and an unprecedented number of 184 experimentally phased genomes from the Personal Genome Project. Here we show that mutations predicted to functionally alter the protein, and coding variants as a whole, are not randomly distributed between the two homologues of a gene, but do occur significantly more frequently in cis- than trans-configurations, with cis/trans ratios of ∼60:40. Significant cis-abundance was observed in virtually all individual genomes in all populations. Nearly all variable genes exhibited either cis or trans configurations of protein-altering mutations in significant excess, allowing distinction of cis- and trans-abundant genes. These common patterns of phase were largely constituted by a shared, global set of phase-sensitive genes. We show significant enrichment of this global set with gene sets indicating its involvement in adaptation and evolution. Moreover, cis- and trans-abundant genes were found functionally distinguishable, and exhibited strikingly different distributional patterns of protein-altering mutations. This work establishes common patterns of phase as key characteristics of diploid human exomes and provides evidence for their potential functional significance. Thus, it highlights the importance of phase for the interpretation of protein-coding genetic variation, challenging the current conceptual and functional interpretation of autosomal genes.

A Diploid Assembly-based Benchmark for Variants in the Major Histocompatibility Complex

10.1101/831792 ◽

2019 ◽

Cited By ~ 4

Author(s):

Chen-Shan Chin ◽

Justin Wagner ◽

Qiandong Zeng ◽

Erik Garrison ◽

Shilpa Garg ◽

...

Keyword(s):

Major Histocompatibility Complex ◽

De Novo ◽

Genome Project ◽

Personal Genome ◽

Major Histocompatibility ◽

Base Level ◽

Histocompatibility Complex ◽

Human Genomes ◽

Long Reads ◽

Complex Variation

Abstract We develop the first human benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle/Personal Genome Project Ashkenazi son (HG002). As a proof-of-principle, we focus on a medically important, highly variable, 5 million base-pair region - the Major Histocompatibility Complex (MHC). Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct base-level accurate, phased de novo assemblies from the reads. We assemble a single haplotig (haplotype-specific contig) for each haplotype, and align reads back to each assembled haplotig to identify two regions of lower confidence. We align the haplotigs to the reference, call phased small and structural variants, and define the first small variant benchmark for the MHC, covering 21496 small variants in 4.58 million base-pairs (92% of the MHC). The assembly-based benchmark is 99.95% concordant with a draft mapping-based benchmark from the same long and linked reads within both benchmark regions, but covers 50% more variants outside the mapping-based benchmark regions. The haplotigs and variant calls are completely concordant with phased clinical HLA types for HG002. This benchmark reliably identifies false positives and false negatives from mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks. These methods demonstrate a path towards future diploid assembly-based benchmarks for other complex regions of the genome.

Use of SNP chips to detect rare pathogenic variants: retrospective, population based diagnostic evaluation

BMJ ◽

10.1136/bmj.n214 ◽

2021 ◽

pp. n214

Author(s):

Weedon MN ◽

Jackson L ◽

Harrison JW ◽

Ruth KS ◽

Tyrrell J ◽

...

Keyword(s):

Positive Predictive Value ◽

Predictive Value ◽

Population Based ◽

Genome Project ◽

Personal Genome ◽

UK Biobank ◽

Sequencing Data ◽

SNP Chip ◽

Pathogenic Variants ◽

The UK

Abstract Objective: To determine whether the sensitivity and specificity of SNP chips are adequate for detecting rare pathogenic variants in a clinically unselected population. Design: Retrospective, population based diagnostic evaluation. Participants: 49 908 people recruited to the UK Biobank with SNP chip and next generation sequencing data, and an additional 21 people who purchased consumer genetic tests and shared their data online via the Personal Genome Project. Main outcome measures: Genotyping (that is, identification of the correct DNA base at a specific genomic location) using SNP chips versus sequencing, with results split by frequency of that genotype in the population. Rare pathogenic variants in the BRCA1 and BRCA2 genes were selected as an exemplar for detailed analysis of clinically actionable variants in the UK Biobank, and BRCA related cancers (breast, ovarian, prostate, and pancreatic) were assessed in participants through use of cancer registry data. Results: Overall, genotyping using SNP chips performed well compared with sequencing; sensitivity, specificity, positive predictive value, and negative predictive value were all above 99% for 108 574 common variants directly genotyped on the SNP chips and sequenced in the UK Biobank. However, the likelihood of a true positive result decreased dramatically with decreasing variant frequency; for variants that are very rare in the population, with a frequency below 0.001% in UK Biobank, the positive predictive value was very low and only 16% of 4757 heterozygous genotypes from the SNP chips were confirmed with sequencing data. Results were similar for SNP chip data from the Personal Genome Project, and 20/21 individuals analysed had at least one false positive rare pathogenic variant that had been incorrectly genotyped. For pathogenic variants in the BRCA1 and BRCA2 genes, which are individually very rare, the overall performance metrics for the SNP chips versus sequencing in the UK Biobank were: sensitivity 34.6%, specificity 98.3%, positive predictive value 4.2%, and negative predictive value 99.9%. Rates of BRCA related cancers in UK Biobank participants with a positive SNP chip result were similar to those for age matched controls (odds ratio 1.31, 95% confidence interval 0.99 to 1.71) because the vast majority of variants were false positives, whereas sequence positive participants had a significantly increased risk (odds ratio 4.05, 2.72 to 6.03). Conclusions: SNP chips are extremely unreliable for genotyping very rare pathogenic variants and should not be used to guide health decisions without validation.
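
The drop in positive predictive value for rare variants follows from how the four metrics relate to variant frequency. The sketch below shows that general relationship and is not the study's analysis: the function name is hypothetical, and the 99% sensitivity and specificity and the three frequencies are illustrative values, not figures from the paper.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence.

    All arguments are fractions in [0, 1]. The expected proportions of
    true/false positives and negatives are derived per unit population.
    """
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Illustrative values only: the same assay quality at three frequencies.
for prevalence in (0.5, 0.01, 0.0001):
    ppv, npv = predictive_values(0.99, 0.99, prevalence)
    print(f"prevalence={prevalence:<7} PPV={ppv:.4f} NPV={npv:.6f}")
```

With sensitivity and specificity held fixed, PPV falls as the genotype becomes rarer because false positives from the large unaffected group outnumber the few true positives, while NPV stays high.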

A study of transposable element-associated structural variations (TASVs) using a de novo-assembled Korean genome

Experimental & Molecular Medicine ◽

10.1038/s12276-021-00586-y ◽

2021 ◽

Author(s):

Seyoung Mun ◽

Songmi Kim ◽

Wooseok Lee ◽

Keunsoo Kang ◽

Thomas J. Meyer ◽

...

Keyword(s):

Genome Sequencing ◽

Genome Assembly ◽

De Novo ◽

Personal Genome ◽

Human Populations ◽

Whole Genome ◽

Structural Variations ◽

Insert Size ◽

Human Genomes ◽

Next Generation Sequencing (NGS)

Abstract Advances in next-generation sequencing (NGS) technology have made personal genome sequencing possible, and indeed, many individual human genomes have now been sequenced. Comparisons of these individual genomes have revealed substantial genomic differences between human populations as well as between individuals from closely related ethnic groups. Transposable elements (TEs) are known to be one of the major sources of these variations and act through various mechanisms, including de novo insertion, insertion-mediated deletion, and TE–TE recombination-mediated deletion. In this study, we carried out de novo whole-genome sequencing of one Korean individual (KPGP9) via multiple insert-size libraries. The de novo whole-genome assembly resulted in 31,305 scaffolds with a scaffold N50 size of 13.23 Mb. Furthermore, through computational data analysis and experimental verification, we revealed that 182 TE-associated structural variation (TASV) insertions and 89 TASV deletions contributed 64,232 bp in sequence gain and 82,772 bp in sequence loss, respectively, in the KPGP9 genome relative to the hg19 reference genome. We also verified structural differences associated with TASVs by comparative analysis with TASVs in recent genomes (AK1 and TCGA genomes) and reported their details. Here, we constructed a new Korean de novo whole-genome assembly and provide the first study, to our knowledge, focused on the identification of TASVs in an individual Korean genome. Our findings again highlight the role of TEs as a major driver of structural variations in human individual genomes.
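
The scaffold N50 reported in this abstract is a standard assembly contiguity statistic: the length L such that scaffolds of length L or longer contain at least half of the assembled bases. A minimal sketch of the usual definition follows. The function name and the toy lengths are hypothetical, and the authors' assembly software may differ in details such as minimum-length filtering.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the
    length of the scaffold at which the running total first reaches
    half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from the paper.
toy_scaffolds = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_scaffolds))
```

In the toy case the two longest scaffolds already hold half of the 32 total bases, so many short scaffolds barely change the N50. That explains how an assembly with 31,305 scaffolds can still report an N50 in the megabase range.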

Performance Comparison of Massively Parallel Sequencing (MPS) Instruments Using Single-Nucleotide Polymorphism (SNP) Panels for Ancestry

SLAS TECHNOLOGY Translating Life Sciences Innovation ◽

10.1177/2472630320954180 ◽

2020 ◽

pp. 247263032095418

Author(s):

Ashley M. Cooley ◽

Kelly A. Meiklejohn ◽

Natalie Damaso ◽

James M. Robertson ◽

Tracey Dawson Cruz

Keyword(s):

Single Nucleotide Polymorphism ◽

Performance Metrics ◽

Massively Parallel Sequencing ◽

Performance Comparison ◽

Massively Parallel ◽

Personal Genome ◽

Nucleotide Polymorphism ◽

Single Nucleotide ◽

Parallel Sequencing ◽

Two Systems

Thermo Fisher Scientific released the Precision ID Ancestry Panel, a 165-single-nucleotide polymorphism (SNP) panel for ancestry prediction that was initially compatible with the manufacturer’s massively parallel sequencer, the Ion Torrent Personal Genome Machine (PGM). The semiautomated workflow using the panel with the PGM involved several time-consuming manual steps across three instruments, including making templating solutions and loading sequencing chips. In 2014, the manufacturer released the Ion Chef robot, followed by the Ion S5 massively parallel sequencer in late 2015. The robot performs the templating with reagent cartridges and loads the chips, thus creating a fully automated workflow across two instruments. The objective of the work reported here is to compare the performance of two massively parallel sequencing systems and ascertain if the change in the workflow produces different ancestry predictions. For performance comparison of the two systems, forensic-type samples (n = 16) were used to make libraries. Libraries were templated either with the Ion OneTouch 2 system (for the PGM) or on the Ion Chef robot (for the S5). Sequencing results indicated that the ion sphere particle performance metrics were similar for the two systems. The total coverages per SNP and SNP quality were both higher for the S5 system. Ancestry predictions were concordant for the mock forensic-type samples sequenced on both massively parallel sequencing systems. The results indicated that automating the workflow with the Ion Chef system reduced the labor involved and increased the sequencing quality.

An ethnically relevant consensus Korean reference genome is a step towards personal reference genomes

Nature Communications ◽

10.1038/ncomms13637 ◽

2016 ◽

Vol 7 (1) ◽

Cited By ~ 26

Author(s):

Yun Sung Cho ◽

Hyunho Kim ◽

Hak-Min Kim ◽

Sungwoong Jho ◽

JeHoon Jun ◽

...

Keyword(s):

Large Scale ◽

New Technologies ◽

Reference Genome ◽

Genomic Structure ◽

Genome Project ◽

Personal Genome ◽

Personal Genomic ◽

Personal Reference ◽

Scale Population ◽

Genome Assemblies

Abstract Human genomes are routinely compared against a universal reference. However, this strategy could miss population-specific and personal genomic variations, which may be detected more efficiently using an ethnically relevant or personal reference. Here we report a hybrid assembly of a Korean reference genome (KOREF) for constructing personal and ethnic references by combining sequencing and mapping methods. We also build its consensus variome reference, providing information on millions of variants from 40 additional ethnically homogeneous genomes from the Korean Personal Genome Project. We find that the ethnically relevant consensus reference can be beneficial for efficient variant detection. Systematic comparison of human assemblies shows the importance of assembly quality, suggesting the necessity of new technologies to comprehensively map ethnic and personal genomic structure variations. In the era of large-scale population genome projects, the leveraging of ethnicity-specific genome assemblies as well as the human reference genome will accelerate mapping all human genome diversity.

Tiling the genome into consistently named subsequences enables precision medicine and machine learning with millions of complex individual data-sets

10.7287/peerj.preprints.1426 ◽

2015 ◽

Author(s):

Sarah Guthrie ◽

Abram Connelly ◽

Peter Amstutz ◽

Adam F. Berrey ◽

Nicolas Cesar ◽

...

Keyword(s):

Machine Learning ◽

Precision Medicine ◽

Blood Type ◽

Whole Genome Sequence ◽

Medical Community ◽

Support Vector ◽

Personal Genome ◽

Whole Genome ◽

Genome Sequences ◽

Global Alliance

The scientific and medical community is reaching an era of inexpensive whole genome sequencing, opening the possibility of precision medicine for millions of individuals. Here we present tiling: a flexible representation of whole genome sequences that supports simple and consistent names, annotation, queries, machine learning, and clinical screening. We partitioned the genome into 10,655,006 tiles: overlapping, variable-length sequences that begin and end with unique 24-base tags. We tiled and annotated 680 public whole genome sequences from the 1000 Genomes Project Consortium (1KG) and Harvard Personal Genome Project (PGP) using ClinVar database information. These genomes cover 14.13 billion tile sequences (4.087 trillion high quality bases and 0.4321 trillion low quality bases) and 251 phenotypes spanning ICD-9 code ranges 140-289, 320-629, and 680-759. We used these data to build a Global Alliance for Genomics and Health Beacon and graph database. We performed principal component analysis (PCA) on the 680 public whole genomes, and by projecting the tiled genomes onto their first two principal components, we replicated the 1KG principal component separation by population ethnicity codes. Interestingly, we found the PGP self-reported ethnicities cluster consistently with 1KG ethnicity codes. We built a set of support-vector ABO blood-type classifiers using 75 PGP participants who had both a whole genome sequence and a self-reported blood type. Our classifier predicts A antigen presence to within 1% of the current state-of-the-art for in silico A antigen prediction. Finally, we found six PGP participants with previously undiscovered pathogenic BRCA variants, and using our tiling, gave them simple, consistent names, which can be easily and independently re-derived. Given the near-future requirements of genomics research and precision medicine, we propose the adoption of tiling and invite all interested individuals and groups to view, rerun, copy, and modify these analyses at https://curover.se/su92l-j7d0g-swtofxa2rct8495
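
The tile definition in this abstract (overlapping, variable-length sequences that begin and end with a tag) can be pictured with a toy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, tag positions are supplied by the caller rather than discovered, tag uniqueness is not checked, and the example uses 3-base tags instead of 24-base tags to stay readable.

```python
def tile_sequence(sequence, tag_starts, tag_len=24):
    """Cut a sequence into tiles bounded by consecutive tags.

    Each tile runs from the start of one tag to the end of the next,
    so neighbouring tiles overlap by exactly one tag. tag_starts must
    be sorted start offsets of tags within the sequence.
    """
    tiles = []
    for start, next_start in zip(tag_starts, tag_starts[1:]):
        tiles.append(sequence[start:next_start + tag_len])
    return tiles


# Toy example with hypothetical 3-base tags at offsets 0, 7, and 15.
toy_sequence = "AAATTTTCCCGGGGGTTT"
toy_tiles = tile_sequence(toy_sequence, [0, 7, 15], tag_len=3)
print(toy_tiles)
```

Because each tile ends with the tag that begins the next one, tiles can be named by position and matched across genomes even when the sequence between tags varies in length.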
