Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project

Xinmeng Jasmine Mu; Zhi John Lu; Yong Kong; Hugo Y. K. Lam; Mark B. Gerstein

doi:10.1093/nar/gkr342

Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project

Bioinformatics, 2015, Vol 32 (9), pp. 1366-1372

doi:10.1093/bioinformatics/btv752

Cited By ~ 23

Author(s): Dmitry Prokopenko; Julian Hecker; Edwin K. Silverman; Marcello Pagano; Markus M. Nöthen; ...

Keyword(s): Simulation Study; Population Stratification; Jaccard Index; Sequencing Data; 1000 Genomes Project; 1000 Genomes


Legacy Data Confound Genomics Studies

Molecular Biology and Evolution, 2019, Vol 37 (1), pp. 2-10

doi:10.1093/molbev/msz201

Cited By ~ 5

Author(s): Luke Anderson-Trocmé; Rick Farouni; Mathieu Bourgey; Yoichiro Kamatani; Koichiro Higasa; ...

Keyword(s): Population Stratification; Quality Data; Human Populations; Batch Effects; Sequencing Data; 1000 Genomes Project; Mutational Spectra; 1000 Genomes; Legacy Data; Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. Although some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data are used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower quality data from the early phases of the 1kGP thus continue to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Ancestral Spectrum Analysis With Population-Specific Variants

Frontiers in Genetics, 2021, Vol 12

doi:10.3389/fgene.2021.724638

Author(s): Gang Shi; Qingmin Kuang

Keyword(s): Nucleotide Polymorphisms; Sequencing Data; 1000 Genomes Project; Specific Population; High Coverage; Single Nucleotide; Target Populations; 1000 Genomes; Sequencing Studies; Best Linear Unbiased

With the advance of sequencing technology, an increasing number of populations have been sequenced to study the histories of worldwide populations, including their divergence, admixtures, migration, and effective sizes. The variants detected in sequencing studies are largely rare and mostly population specific. Population-specific variants are often recent mutations and are informative for revealing substructures and admixtures in populations; however, computational methods and tools to analyze them are still lacking. In this work, we propose using reference populations and single nucleotide polymorphisms (SNPs) specific to the reference populations. Ancestral information, the best linear unbiased estimator (BLUE) of the ancestral proportion, is proposed, which can be used to infer ancestral proportions in recently admixed target populations and measure the extent to which reference populations serve as good proxies for the admixing sources. Based on the same panel of SNPs, the ancestral information is comparable across samples from different studies and is not affected by genetic outliers, related samples, or the sample sizes of the admixed target populations. In addition, ancestral spectrum is useful for detecting genetic outliers or exploring co-ancestry between study samples and the reference populations. The methods are implemented in a program, Ancestral Spectrum Analyzer (ASA), and are applied in analyzing high-coverage sequencing data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP). In the analyses of American populations from the 1000 Genomes Project, we demonstrate that recent admixtures can be dissected from ancient admixtures by comparing ancestral spectra with and without indigenous Americans being included in the reference populations.


Accurate, scalable cohort variant calls using DeepVariant and GLnexus

2020

doi:10.1101/2020.02.10.942086

Cited By ~ 4

Author(s): Taedong Yun; Helen Li; Pi-Chuan Chang; Michael F. Lin; Andrew Carroll; ...

Keyword(s): Genetic Variation; Best Practices; Open Source; Variant Calling; Cost Savings; Quality Improvements; 1000 Genomes Project; Genetic Analyses; 1000 Genomes; Population Scale

Abstract Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready variants remains challenging. Here we introduce an open-source cohort variant-calling method using the highly-accurate caller DeepVariant and scalable merging tool GLnexus. We optimized callset quality based on benchmark samples and Mendelian consistency across many sample sizes and sequencing specifications, resulting in substantial quality improvements and cost savings over existing best practices. We further evaluated our pipeline in the 1000 Genomes Project (1KGP) samples, showing superior quality metrics and imputation performance. We publicly release the 1KGP callset to foster development of broad studies of genetic variation.


Legacy Data Confounds Genomics Studies

2019

doi:10.1101/624908

Author(s): Luke Anderson-Trocmé; Rick Farouni; Mathieu Bourgey; Yoichiro Kamatani; Koichiro Higasa; ...

Keyword(s): Population Stratification; Quality Data; Human Populations; Batch Effects; Sequencing Data; 1000 Genomes Project; Mutational Spectra; 1000 Genomes; Legacy Data; Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Evaluation of MC1R high-throughput nucleotide sequencing data generated by the 1000 Genomes Project

Genetics and Molecular Biology, 2017, Vol 40 (2), pp. 530-539

doi:10.1590/1678-4685-gmb-2016-0180

Cited By ~ 3

Author(s): Leonardo Arduino Marano; Letícia Marcorin; Erick da Cruz Castelli; Celso Teixeira Mendes-Junior

Keyword(s): High Throughput; Nucleotide Sequencing; Sequencing Data; 1000 Genomes Project; 1000 Genomes


The International Genome Sample Resource (IGSR) collection of open human genomic variation resources

Nucleic Acids Research, 2019, Vol 48 (D1), pp. D941-D947

doi:10.1093/nar/gkz836

Cited By ~ 20

Author(s): Susan Fairley; Ernesto Lowy-Gallego; Emily Perry; Paul Flicek

Keyword(s): Sequence Data; Genomic Variation; 1000 Genomes Project; High Coverage; Web Based; 1000 Genomes; Open Consent; Unified View; Human Genomic; Project Data

Abstract To sustain and develop the largest fully open human genomic resources the International Genome Sample Resource (IGSR) (https://www.internationalgenome.org) was established. It is built on the foundation of the 1000 Genomes Project, which created the largest openly accessible catalogue of human genomic variation developed from samples spanning five continents. IGSR (i) maintains access to 1000 Genomes Project resources, (ii) updates 1000 Genomes Project resources to the GRCh38 human reference assembly, (iii) adds new data generated on 1000 Genomes Project cell lines, (iv) shares data from samples with a similarly open consent to increase the number of samples and populations represented in the resources and (v) provides support to users of these resources. Among recent updates are the release of variation calls from 1000 Genomes Project data calculated directly on GRCh38 and the addition of high coverage sequence data for the 2504 samples in the 1000 Genomes Project phase three panel. The data portal, which facilitates web-based exploration of the IGSR resources, has been updated to include samples which were not part of the 1000 Genomes Project and now presents a unified view of data and samples across almost 5000 samples from multiple studies. All data is fully open and publicly accessible.


Accurate, scalable cohort variant calls using DeepVariant and GLnexus

Bioinformatics, 2021

doi:10.1093/bioinformatics/btaa1081

Author(s): Taedong Yun; Helen Li; Pi-Chuan Chang; Michael F Lin; Andrew Carroll; ...

Keyword(s): Best Practices; Quality Metrics; Supplementary Information; Public Research; Supplementary Data; Quality Improvements; 1000 Genomes Project; Individual Level; 1000 Genomes; Population Scale

Abstract Motivation: Population-scale sequenced cohorts are foundational resources for genetic analyses, but processing raw reads into analysis-ready cohort-level variants remains challenging. Results: We introduce an open-source cohort-calling method that uses the highly-accurate caller DeepVariant and scalable merging tool GLnexus. Using callset quality metrics based on variant recall and precision in benchmark samples and Mendelian consistency in father-mother-child trios, we optimized the method across a range of cohort sizes, sequencing methods, and sequencing depths. The resulting callsets show consistent quality improvements over those generated using existing best practices with reduced cost. We further evaluate our pipeline in the deeply sequenced 1000 Genomes Project (1KGP) samples and show superior callset quality metrics and imputation reference panel performance compared to an independently-generated GATK Best Practices pipeline. Availability and Implementation: We publicly release the 1KGP individual-level variant calls and cohort callset (https://console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP) to foster additional development and evaluation of cohort merging methods as well as broad studies of genetic variation. Both DeepVariant (https://github.com/google/deepvariant) and GLnexus (https://github.com/dnanexus-rnd/GLnexus) are open-sourced, and the optimized GLnexus setup discovered in this study is also integrated into GLnexus public releases v1.2.2 and later. Supplementary information: Supplementary data are available at Bioinformatics online.
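
The abstract above names Mendelian consistency in father-mother-child trios as one of its callset quality metrics. As a rough illustration of what such a check can look like, here is a minimal Python sketch. It is not the authors' implementation: the function names, the tuple-of-allele-indices genotype encoding, and the decision to skip sites with missing calls are all assumptions made for this example, and it only covers unphased diploid genotypes at autosomal sites.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """Return True if the child's unphased diploid genotype can be formed
    by taking one allele from the father and one from the mother.

    Genotypes are assumed (for this sketch) to be 2-tuples of allele
    indices, e.g. (0, 1) for a heterozygous reference/alternate call.
    """
    target = sorted(child)
    for paternal, maternal in product(father, mother):
        if sorted((paternal, maternal)) == target:
            return True
    return False


def mendelian_violation_rate(trio_genotypes):
    """Fraction of fully called sites where the trio is inconsistent.

    trio_genotypes is an iterable of (child, father, mother) genotypes.
    Sites where any member has a missing allele (None) are skipped,
    which is an assumption of this sketch rather than a standard.
    """
    checked = 0
    violations = 0
    for child, father, mother in trio_genotypes:
        if None in child or None in father or None in mother:
            continue  # no-call in at least one sample
        checked += 1
        if not is_mendelian_consistent(child, father, mother):
            violations += 1
    return violations / checked if checked else 0.0
```

A lower violation rate across trios suggests fewer genotyping errors, although a small number of true de novo mutations will also register as violations under this simple rule.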


Inflated type I error rates when using aggregation methods to analyze rare variants in the 1000 Genomes Project exon sequencing data in unrelated individuals: summary results from Group 7 at Genetic Analysis Workshop 17

Genetic Epidemiology, 2011, Vol 35 (S1), pp. S56-S60

doi:10.1002/gepi.20650

Cited By ~ 17

Author(s): Nathan Tintle; Hugues Aschard; Inchi Hu; Nora Nock; Haitian Wang; ...

Keyword(s): Genetic Analysis; Type I Error; Rare Variants; Error Rates; Type I; Sequencing Data; 1000 Genomes Project; Type I Error Rates; 1000 Genomes; Inflated Type


Prioritising positively selected variants in whole-genome sequencing data using FineMAV

BMC Bioinformatics, 2021, Vol 22 (1)

doi:10.1186/s12859-021-04506-9

Author(s): Fadilla Wahyudi; Farhang Aghakhanian; Sadequr Rahman; Yik-Ying Teo; Michał Szpak; ...

Keyword(s): Whole Genome Sequencing; Genome Sequencing; Population Genomics; Software Tool; Human Populations; Whole Genome; Sequencing Data; 1000 Genomes Project; 1000 Genomes; Genome Browsers

Abstract Background: In population genomics, polymorphisms that are highly differentiated between geographically separated populations are often suggestive of Darwinian positive selection. Genomic scans have highlighted several such regions in African and non-African populations, but only a handful of these have functional data that clearly associates candidate variations driving the selection process. Fine-Mapping of Adaptive Variation (FineMAV) was developed to address this in a high-throughput manner using population based whole-genome sequences generated by the 1000 Genomes Project. It pinpoints positively selected genetic variants in sequencing data by prioritizing high frequency, population-specific and functional derived alleles. Results: We developed a stand-alone software that implements the FineMAV statistic. To graphically visualise the FineMAV scores, it outputs the statistics as bigWig files, which is a common file format supported by many genome browsers. It is available as a command-line and graphical user interface. The software was tested by replicating the FineMAV scores obtained using 1000 Genomes Project African, European, East and South Asian populations and subsequently applied to whole-genome sequencing datasets from Singapore and China to highlight population specific variants that can be subsequently modelled. The software tool is publicly available at https://github.com/fadilla-wahyudi/finemav. Conclusions: The software tool described here determines genome-wide FineMAV scores, using low or high-coverage whole-genome sequencing datasets, that can be used to prioritize a list of population specific, highly differentiated candidate variants for in vitro or in vivo functional screens. The tool displays these scores on the human genome browsers for easy visualisation, annotation and comparison between different genomic regions in worldwide human populations.
