A Diploid Assembly-based Benchmark for Variants in the Major Histocompatibility Complex

2019 ◽  
Author(s):  
Chen-Shan Chin ◽  
Justin Wagner ◽  
Qiandong Zeng ◽  
Erik Garrison ◽  
Shilpa Garg ◽  
...  

Abstract We develop the first human benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle/Personal Genome Project Ashkenazi son (HG002). As a proof-of-principle, we focus on a medically important, highly variable, 5 million base-pair region - the Major Histocompatibility Complex (MHC). Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct base-level accurate, phased de novo assemblies from the reads. We assemble a single haplotig (haplotype-specific contig) for each haplotype, and align reads back to each assembled haplotig to identify two regions of lower confidence. We align the haplotigs to the reference, call phased small and structural variants, and define the first small variant benchmark for the MHC, covering 21496 small variants in 4.58 million base-pairs (92% of the MHC). The assembly-based benchmark is 99.95% concordant with a draft mapping-based benchmark from the same long and linked reads within both benchmark regions, but covers 50% more variants outside the mapping-based benchmark regions. The haplotigs and variant calls are completely concordant with phased clinical HLA types for HG002. This benchmark reliably identifies false positives and false negatives from mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks. These methods demonstrate a path towards future diploid assembly-based benchmarks for other complex regions of the genome.
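As a rough illustration of how a benchmark of this kind is used to identify false positives and false negatives in a callset, the sketch below compares a query callset against a benchmark callset inside a set of benchmark regions. It is a minimal sketch, not the authors' method: the `Variant` type, the `compare_callsets` function, and the example coordinates are all hypothetical, and the comparison is by exact match only. Real benchmarking tools also reconcile different representations of the same variant and take genotype and phasing into account, which this sketch does not.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (not a full VCF record)."""
    chrom: str
    pos: int  # 1-based position
    ref: str
    alt: str


# (chrom, start, end), 1-based and inclusive; an assumption of this sketch.
Region = Tuple[str, int, int]


def in_regions(variant: Variant, regions: List[Region]) -> bool:
    """Return True if the variant position falls inside any region."""
    return any(
        chrom == variant.chrom and start <= variant.pos <= end
        for chrom, start, end in regions
    )


def compare_callsets(
    benchmark: Iterable[Variant],
    query: Iterable[Variant],
    regions: List[Region],
) -> Dict[str, int]:
    """Count exact-match TP, FP and FN calls inside the benchmark regions."""
    truth = {v for v in benchmark if in_regions(v, regions)}
    calls = {v for v in query if in_regions(v, regions)}
    return {
        "tp": len(truth & calls),  # called and present in the benchmark
        "fp": len(calls - truth),  # called but absent from the benchmark
        "fn": len(truth - calls),  # in the benchmark but not called
    }


# Made-up coordinates, for illustration only.
benchmark = [
    Variant("chr6", 100, "A", "G"),
    Variant("chr6", 200, "C", "T"),
    Variant("chr6", 900, "G", "A"),  # outside the benchmark region
]
query = [
    Variant("chr6", 100, "A", "G"),
    Variant("chr6", 300, "T", "C"),
    Variant("chr6", 950, "A", "T"),  # outside the benchmark region
]
regions = [("chr6", 1, 500)]

counts = compare_callsets(benchmark, query, regions)
```

Calls outside the benchmark regions are ignored in both directions, which is why the size and placement of those regions determine how much of a dense region such as the MHC can be assessed.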

2020 ◽  
Vol 11 (1) ◽  
Author(s):  
Chen-Shan Chin ◽  
Justin Wagner ◽  
Qiandong Zeng ◽  
Erik Garrison ◽  
Shilpa Garg ◽  
...  

Abstract Most human genomes are characterized by aligning individual reads to the reference genome, but accurate long reads and linked reads now enable us to construct accurate, phased de novo assemblies. We focus on a medically important, highly variable, 5 million base-pair (bp) region where diploid assembly is particularly useful - the Major Histocompatibility Complex (MHC). Here, we develop a human genome benchmark derived from a diploid assembly for the openly-consented Genome in a Bottle sample HG002. We assemble a single contig for each haplotype, align them to the reference, call phased small and structural variants, and define a small variant benchmark for the MHC, covering 94% of the MHC and 22368 variants smaller than 50 bp, 49% more variants than a mapping-based benchmark. This benchmark reliably identifies errors in mapping-based callsets, and enables performance assessment in regions with much denser, complex variation than regions covered by previous benchmarks.


2010 ◽  
Vol 62 (2) ◽  
pp. 85-100 ◽  
Author(s):  
Elizabeth A. Archie ◽  
Tammy Henry ◽  
Jesus E. Maldonado ◽  
Cynthia J. Moss ◽  
Joyce H. Poole ◽  
...  

2004 ◽  
Vol 133 (5) ◽  
pp. 1117-1137 ◽  
Author(s):  
Terry D. Beacham ◽  
Michael Lapointe ◽  
John R. Candy ◽  
Brenda McIntosh ◽  
Cathy MacConnachie ◽  
...  

2015 ◽  
Author(s):  
Justin M Zook ◽  
David Catoe ◽  
Jennifer McDaniel ◽  
Lindsay Vang ◽  
Noah Spies ◽  
...  

The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST), is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode™ WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.


2016 ◽  
Vol 6 (12) ◽  
pp. 3991-4003 ◽  
Author(s):  
Collin P. Jaeger ◽  
Melvin R. Duvall ◽  
Bradley J. Swanson ◽  
Christopher A. Phillips ◽  
Michael J. Dreslik ◽  
...  
