Rapid Genotype Refinement for Whole-Genome Sequencing Data using Multi-Variate Normal Distributions

Mapping Intimacies ◽

10.1101/031484 ◽

2015 ◽

Author(s):

Rudy Arthur ◽

Jared O'Connell ◽

Ole Schulz-Trieglaff ◽

Anthony J Cox

Keyword(s):

Markov Models ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

High Coverage ◽

Multivariate Gaussian Distribution ◽

Data Set ◽

Normal Distributions ◽

Computationally Expensive ◽

Low Coverage

Whole-genome low-coverage sequencing has been combined with linkage-disequilibrium (LD) based genotype refinement to accurately and cost-effectively infer genotypes in large cohorts of individuals. Most genotype refinement methods are based on hidden Markov models, which are accurate but computationally expensive. We introduce an algorithm that models LD using a simple multivariate Gaussian distribution. The key feature of our algorithm is its speed: it is hundreds of times faster than other methods on the same data set, and its scaling behaviour is linear in the number of samples. We demonstrate the performance of the method on both low-coverage and high-coverage samples.

Batch effects in population genomic studies with low‐coverage whole genome sequencing data: causes, detection, and mitigation

Molecular Ecology Resources ◽

10.1111/1755-0998.13559 ◽

2021 ◽

Author(s):

Runyang Nicolas Lou ◽

Nina Overgaard Therkildsen

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Batch Effects ◽

Sequencing Data ◽

Population Genomic ◽

Genomic Studies ◽

Low Coverage

dpGMM: A Dirichlet Process Gaussian Mixture Model for Copy Number Variation Detection in Low-Coverage Whole-Genome Sequencing Data

IEEE Access ◽

10.1109/access.2020.2971863 ◽

2020 ◽

Vol 8 ◽

pp. 27973-27985

Author(s):

Yaoyao Li ◽

Junying Zhang ◽

Xiguo Yuan ◽

Junping Li

Keyword(s):

Genome Sequencing ◽

Dirichlet Process ◽

Copy Number ◽

Gaussian Mixture ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Number Variation ◽

Low Coverage ◽

Copy Number Variation Detection

Application of risk score analysis to low-coverage whole genome sequencing data for the noninvasive detection of trisomy 21, trisomy 18, and trisomy 13

Prenatal Diagnosis ◽

10.1002/pd.4712 ◽

2015 ◽

Vol 36 (1) ◽

pp. 56-62 ◽

Cited By ~ 8

Author(s):

J. A. Tynan ◽

S. K. Kim ◽

A. R. Mazloom ◽

C. Zhao ◽

G. McLennan ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Trisomy 21 ◽

Trisomy 13 ◽

Noninvasive Detection ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Score Analysis ◽

Low Coverage

Insight into the genomic history of the Near East from whole-genome sequences and genotypes of Yemenis

10.1101/749341 ◽

2019 ◽

Cited By ~ 1

Author(s):

Marc Haber ◽

Riyadh Saif-Ali ◽

Molham Al-Habori ◽

Yuan Chen ◽

Daniel E. Platt ◽

...

Keyword(s):

Genetic Structure ◽

Bronze Age ◽

Near East ◽

African Ancestry ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

High Coverage ◽

The North ◽

The Bronze Age

Abstract: We report high-coverage whole-genome sequencing data from 46 Yemeni individuals as well as genome-wide genotyping data from 169 Yemenis from diverse locations. We use this dataset to define the genetic diversity in Yemen and how it relates to people elsewhere in the Near East. Yemen is a vast region with substantial cultural and geographic diversity, but we found little genetic structure correlating with geography among the Yemenis – probably reflecting continuous movement of people between the regions. African ancestry from admixture in the past 800 years is widespread in Yemen and is the main contributor to the country’s limited genetic structure, with some individuals in Hudayda and Hadramout having up to 20% of their genetic ancestry from Africa. In contrast, individuals from Maarib appear to have been genetically isolated from the African gene flow and thus have genomes likely to reflect Yemen’s ancestry before the admixture. This ancestry was comparable to the ancestry present during the Bronze Age in the distant Northern regions of the Near East. After the Bronze Age, the South and North of the Near East therefore followed different genetic trajectories: in the North the Levantines admixed with a Eurasian population carrying steppe ancestry whose impact never reached as far south as the Yemen, where people instead admixed with Africans leading to the genetic structure observed in the Near East today.

Rapid genotype refinement for whole-genome sequencing data using multi-variate normal distributions

Bioinformatics ◽

10.1093/bioinformatics/btw097 ◽

2016 ◽

Vol 32 (15) ◽

pp. 2306-2312 ◽

Cited By ~ 2

Author(s):

Rudy Arthur ◽

Jared O’Connell ◽

Ole Schulz-Trieglaff ◽

Anthony J. Cox

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Normal Distributions

Phylogenetic Analysis of Mycobacterium tuberculosis Strains in Wales by Use of Core Genome Multilocus Sequence Typing To Analyze Whole-Genome Sequencing Data

Journal of Clinical Microbiology ◽

10.1128/jcm.02025-18 ◽

2019 ◽

Vol 57 (6) ◽

Cited By ~ 4

Author(s):

R. C. Jones ◽

L. G. Harris ◽

S. Morgan ◽

M. C. Ruddy ◽

M. Perry ◽

...

Keyword(s):

Phylogenetic Analysis ◽

Mycobacterium Tuberculosis ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Multilocus Sequence Typing ◽

Core Genome ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Data Set

Abstract: An inability to standardize the bioinformatic data produced by whole-genome sequencing (WGS) has been a barrier to its widespread use in tuberculosis phylogenetics. The aim of this study was to carry out a phylogenetic analysis of tuberculosis in Wales, United Kingdom, using Ridom SeqSphere software for core genome multilocus sequence typing (cgMLST) analysis of whole-genome sequencing data. The phylogenetics of tuberculosis in Wales have not previously been studied. Sixty-six Mycobacterium tuberculosis isolates (including 42 outbreak-associated isolates) from south Wales were sequenced using an Illumina platform. Isolates were assigned to principal genetic groups, single nucleotide polymorphism (SNP) cluster groups, lineages, and sublineages using SNP-calling protocols. WGS data were submitted to the Ridom SeqSphere software for cgMLST analysis and analyzed alongside 179 previously lineage-defined isolates. The data set was dominated by the Euro-American lineage, with the sublineage composition being dominated by T, X, and Haarlem family strains. The cgMLST analysis successfully assigned 58 isolates to major lineages, and the results were consistent with those obtained by traditional SNP mapping methods. In addition, the cgMLST scheme was used to resolve an outbreak of tuberculosis occurring in the region. This study supports the use of a cgMLST method for standardized phylogenetic assignment of tuberculosis isolates and for outbreak resolution and provides the first insight into Welsh tuberculosis phylogenetics, identifying the presence of the Haarlem sublineage commonly associated with virulent traits.

Lep-MAP3: robust linkage mapping even for low-coverage whole genome sequencing data

Bioinformatics ◽

10.1093/bioinformatics/btx494 ◽

2017 ◽

Vol 33 (23) ◽

pp. 3726-3732 ◽

Cited By ~ 85

Author(s):

Pasi Rastas

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Linkage Mapping ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Low Coverage

Low-coverage sequencing cost-effectively detects known and novel variation in underrepresented populations

10.1101/2020.04.27.064832 ◽

2020 ◽

Cited By ~ 2

Author(s):

Alicia R. Martin ◽

Elizabeth G. Atkinson ◽

Sinéad B. Chapman ◽

Anne Stevenson ◽

Rocky E. Stroud ◽

...

Keyword(s):

Whole Genome Sequencing Data ◽

Data Generation ◽

Sequencing Data ◽

Underrepresented Populations ◽

High Coverage ◽

Genetic Studies ◽

Variant Discovery ◽

Whole Genomes ◽

Low Coverage ◽

Novel Variation

Abstract. Background: Genetic studies of biomedical phenotypes in underrepresented populations identify disproportionate numbers of novel associations. However, current genomics infrastructure--including most genotyping arrays and sequenced reference panels--best serves populations of European descent. A critical step for facilitating genetic studies in underrepresented populations is to ensure that genetic technologies accurately capture variation in all populations. Here, we quantify the accuracy of low-coverage sequencing in diverse African populations. Results: We sequenced the whole genomes of 91 individuals to high-coverage (≥20X) from the Neuropsychiatric Genetics of African Population-Psychosis (NeuroGAP-Psychosis) study, in which participants were recruited from Ethiopia, Kenya, South Africa, and Uganda. We empirically tested two data generation strategies, GWAS arrays versus low-coverage sequencing, by calculating the concordance of imputed variants from these technologies with those from deep whole genome sequencing data. We show that low-coverage sequencing at a depth of ≥4X captures variants of all frequencies more accurately than all commonly used GWAS arrays investigated and at a comparable cost. Lower depths of sequencing (0.5-1X) performed comparably to commonly used low-density GWAS arrays. Low-coverage sequencing is also sensitive to novel variation, with 4X sequencing detecting 45% of singletons and 95% of common variants identified in high-coverage African whole genomes. Conclusion: These results indicate that low-coverage sequencing approaches surmount the problems induced by the ascertainment of common genotyping arrays, including those that capture variation most common in Europeans and Africans. Low-coverage sequencing effectively identifies novel variation (particularly in underrepresented populations), and presents opportunities to enhance variant discovery at a similar cost to traditional approaches.

Evaluation of tools for identifying large copy number variations from ultra-low-coverage whole-genome sequencing data

BMC Genomics ◽

10.1186/s12864-021-07686-z ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Johannes Smolander ◽

Sofia Khan ◽

Kalaimathy Singaravelu ◽

Leni Kauko ◽

Riikka J. Lund ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Genome Sequencing ◽

Sex Chromosomes ◽

Copy Number ◽

Copy Number Variations ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Low Coverage ◽

Cnv Detection

Abstract. Background: Detection of copy number variations (CNVs) from high-throughput next-generation whole-genome sequencing (WGS) data has become a widely used research method in recent years. However, little is known about the applicability of the developed algorithms to ultra-low-coverage (0.0005–0.8×) data that is used in various research and clinical applications, such as digital karyotyping and single-cell CNV detection. Results: Here, the performance of six popular read-depth based CNV detection algorithms (BIC-seq2, Canvas, CNVnator, FREEC, HMMcopy, and QDNAseq) was studied using ultra-low-coverage WGS data. Real-world array- and karyotyping kit-based validation were used as a benchmark in the evaluation. Additionally, ultra-low-coverage WGS data was simulated to investigate the ability of the algorithms to identify CNVs in the sex chromosomes and the theoretical minimum coverage at which these tools can accurately function. Our results suggest that while all the methods were able to detect large CNVs, many methods were susceptible to producing false positives when smaller CNVs (< 2 Mbp) were detected. There was also significant variability in their ability to identify CNVs in the sex chromosomes. Overall, BIC-seq2 was found to be the best method in terms of statistical performance. However, its significant drawback was that it had by far the slowest runtime among the methods (> 3 h), compared with FREEC (~ 3 min), which we considered the second-best method. Conclusions: Our comparative analysis demonstrates that CNV detection from ultra-low-coverage WGS data can be a highly accurate method for the detection of large copy number variations when their length is in millions of base pairs. These findings facilitate applications that utilize ultra-low-coverage CNV detection.

Using Mendelian inheritance errors as quality control criteria in whole genome sequencing data set

BMC Proceedings ◽

10.1186/1753-6561-8-s1-s21 ◽

2014 ◽

Vol 8 (S1) ◽

Cited By ~ 9

Author(s):

Valentina V Pilipenko ◽

Hua He ◽

Brad G Kurowski ◽

Eileen S Alexander ◽

Xue Zhang ◽

...

Keyword(s):

Quality Control ◽

Whole Genome Sequencing ◽

Genome Sequencing ◽

Mendelian Inheritance ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Data Set ◽

Quality Control Criteria ◽

Control Criteria
