Rapid Genotype Refinement for Whole-Genome Sequencing Data using Multi-Variate Normal Distributions

2015 ◽  
Author(s):  
Rudy Arthur ◽  
Jared O'Connell ◽  
Ole Schulz-Trieglaff ◽  
Anthony J Cox

Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD) based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed: it is hundreds of times faster than other methods on the same data set, and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low-coverage and high-coverage samples.
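To make the idea of modelling LD with a Gaussian concrete, the following is a minimal sketch of conditioning in a bivariate normal distribution: the dosage at a target site is re-estimated from the dosage at one correlated neighbouring site. It is an illustration only and is not the authors' algorithm. The function name `conditional_dosage`, the two-site restriction, and the toy cohort values are all assumptions introduced here.

```python
from statistics import mean


def conditional_dosage(target, neighbor, observed_neighbor):
    """Conditional mean of the target-site dosage given a neighbour-site dosage.

    Assumes the two sites are jointly normal, with means and covariance
    estimated from the cohort (hypothetical helper, not from the paper):
        E[t | n] = mu_t + cov(t, n) / var(n) * (n - mu_n)
    """
    mu_t, mu_n = mean(target), mean(neighbor)
    n_samples = len(target)
    # Unbiased sample covariance and variance across the cohort.
    cov_tn = sum((t - mu_t) * (x - mu_n) for t, x in zip(target, neighbor)) / (n_samples - 1)
    var_n = sum((x - mu_n) ** 2 for x in neighbor) / (n_samples - 1)
    if var_n == 0:
        # A monomorphic neighbour carries no LD information; fall back to the mean.
        return mu_t
    return mu_t + cov_tn / var_n * (observed_neighbor - mu_n)


# Hypothetical cohort dosages (0, 1 or 2 alternate alleles) at two sites in perfect LD.
target = [0, 1, 2, 0, 1, 2]
neighbor = [0, 1, 2, 0, 1, 2]
refined = conditional_dosage(target, neighbor, 2)
```

With more than one neighbouring site, the same conditioning step uses a covariance matrix in place of the single variance term, which is where the cost of a linear solve would come in.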

2019 ◽  
Author(s):  
Marc Haber ◽  
Riyadh Saif-Ali ◽  
Molham Al-Habori ◽  
Yuan Chen ◽  
Daniel E. Platt ◽  
...  

Abstract We report high-coverage whole-genome sequencing data from 46 Yemeni individuals as well as genome-wide genotyping data from 169 Yemenis from diverse locations. We use this dataset to define the genetic diversity in Yemen and how it relates to people elsewhere in the Near East. Yemen is a vast region with substantial cultural and geographic diversity, but we found little genetic structure correlating with geography among the Yemenis – probably reflecting continuous movement of people between the regions. African ancestry from admixture in the past 800 years is widespread in Yemen and is the main contributor to the country’s limited genetic structure, with some individuals in Hudayda and Hadramout having up to 20% of their genetic ancestry from Africa. In contrast, individuals from Maarib appear to have been genetically isolated from the African gene flow and thus have genomes likely to reflect Yemen’s ancestry before the admixture. This ancestry was comparable to the ancestry present during the Bronze Age in the distant Northern regions of the Near East. After the Bronze Age, the South and North of the Near East therefore followed different genetic trajectories: in the North the Levantines admixed with a Eurasian population carrying steppe ancestry whose impact never reached as far south as the Yemen, where people instead admixed with Africans leading to the genetic structure observed in the Near East today.


2016 ◽  
Vol 32 (15) ◽  
pp. 2306-2312 ◽  
Author(s):  
Rudy Arthur ◽  
Jared O’Connell ◽  
Ole Schulz-Trieglaff ◽  
Anthony J. Cox

2019 ◽  
Vol 57 (6) ◽  
Author(s):  
R. C. Jones ◽  
L. G. Harris ◽  
S. Morgan ◽  
M. C. Ruddy ◽  
M. Perry ◽  
...  

ABSTRACT An inability to standardize the bioinformatic data produced by whole-genome sequencing (WGS) has been a barrier to its widespread use in tuberculosis phylogenetics. The aim of this study was to carry out a phylogenetic analysis of tuberculosis in Wales, United Kingdom, using Ridom SeqSphere software for core genome multilocus sequence typing (cgMLST) analysis of whole-genome sequencing data. The phylogenetics of tuberculosis in Wales have not previously been studied. Sixty-six Mycobacterium tuberculosis isolates (including 42 outbreak-associated isolates) from south Wales were sequenced using an Illumina platform. Isolates were assigned to principal genetic groups, single nucleotide polymorphism (SNP) cluster groups, lineages, and sublineages using SNP-calling protocols. WGS data were submitted to the Ridom SeqSphere software for cgMLST analysis and analyzed alongside 179 previously lineage-defined isolates. The data set was dominated by the Euro-American lineage, with the sublineage composition being dominated by T, X, and Haarlem family strains. The cgMLST analysis successfully assigned 58 isolates to major lineages, and the results were consistent with those obtained by traditional SNP mapping methods. In addition, the cgMLST scheme was used to resolve an outbreak of tuberculosis occurring in the region. This study supports the use of a cgMLST method for standardized phylogenetic assignment of tuberculosis isolates and for outbreak resolution and provides the first insight into Welsh tuberculosis phylogenetics, identifying the presence of the Haarlem sublineage commonly associated with virulent traits.


Author(s):  
Alicia R. Martin ◽  
Elizabeth G. Atkinson ◽  
Sinéad B. Chapman ◽  
Anne Stevenson ◽  
Rocky E. Stroud ◽  
...  

Abstract Background Genetic studies of biomedical phenotypes in underrepresented populations identify disproportionate numbers of novel associations. However, current genomics infrastructure--including most genotyping arrays and sequenced reference panels--best serves populations of European descent. A critical step for facilitating genetic studies in underrepresented populations is to ensure that genetic technologies accurately capture variation in all populations. Here, we quantify the accuracy of low-coverage sequencing in diverse African populations. Results We sequenced the whole genomes of 91 individuals to high-coverage (≥20X) from the Neuropsychiatric Genetics of African Population-Psychosis (NeuroGAP-Psychosis) study, in which participants were recruited from Ethiopia, Kenya, South Africa, and Uganda. We empirically tested two data generation strategies, GWAS arrays versus low-coverage sequencing, by calculating the concordance of imputed variants from these technologies with those from deep whole genome sequencing data. We show that low-coverage sequencing at a depth of ≥4X captures variants of all frequencies more accurately than all commonly used GWAS arrays investigated and at a comparable cost. Lower depths of sequencing (0.5-1X) performed comparably to commonly used low-density GWAS arrays. Low-coverage sequencing is also sensitive to novel variation, with 4X sequencing detecting 45% of singletons and 95% of common variants identified in high-coverage African whole genomes. Conclusion These results indicate that low-coverage sequencing approaches surmount the problems induced by the ascertainment of common genotyping arrays, including those that capture variation most common in Europeans and Africans. Low-coverage sequencing effectively identifies novel variation (particularly in underrepresented populations), and presents opportunities to enhance variant discovery at a similar cost to traditional approaches.
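The concordance comparison described in the abstract above can be illustrated with a short sketch. It computes the fraction of genotype calls that agree between an imputed call set and a high-coverage truth set, skipping sites where either call is missing. The function name `genotype_concordance`, the encoding of genotypes as alternate-allele counts, and the example values are assumptions made here; the study may well use a different metric (for example non-reference concordance or a dosage correlation).

```python
def genotype_concordance(imputed, truth):
    """Fraction of sites where the imputed genotype equals the truth genotype.

    Genotypes are encoded as alternate-allele counts (0, 1, 2); None marks a
    missing call, which is excluded from the comparison (assumed convention).
    """
    compared = [(i, t) for i, t in zip(imputed, truth) if i is not None and t is not None]
    if not compared:
        raise ValueError("no overlapping genotype calls to compare")
    matches = sum(1 for i, t in compared if i == t)
    return matches / len(compared)


# Hypothetical calls for one sample at six sites.
imputed = [0, 1, 2, 1, None, 0]
truth = [0, 1, 1, 1, 2, 0]
concordance = genotype_concordance(imputed, truth)
```

In practice such a score is usually stratified by allele frequency, since rare variants are typically harder to impute than common ones.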


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Johannes Smolander ◽  
Sofia Khan ◽  
Kalaimathy Singaravelu ◽  
Leni Kauko ◽  
Riikka J. Lund ◽  
...  

Abstract Background Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data that is used in various research and clinical applications, such as digital karyotyping and single-cell CNV detection. Result Here, the performance of six popular read-depth based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array- and karyotyping kit-based validation were used as a benchmark in the evaluation. Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function. Our results suggest that while all the methods were able to detect large CNVs, many methods were susceptible to producing false positives when smaller CNVs (< 2 Mbp) were detected. There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was that it had by far the slowest runtime among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be a highly accurate method for the detection of large copy number variations when their length is in millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.

