Legacy Data Confounds Genomics Studies

Mapping Intimacies, 10.1101/624908, 2019

Author(s): Luke Anderson-Trocmé, Rick Farouni, Mathieu Bourgey, Yoichiro Kamatani, Koichiro Higasa, ...

Keyword(s): Population Stratification, Quality Data, Human Populations, Batch Effects, Sequencing Data, 1000 Genomes Project, Mutational Spectra, 1000 Genomes, Legacy Data, Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Legacy Data Confound Genomics Studies

Molecular Biology and Evolution, 10.1093/molbev/msz201, 2019, Vol 37 (1), pp. 2-10

Cited By ~ 5

Author(s): Luke Anderson-Trocmé, Rick Farouni, Mathieu Bourgey, Yoichiro Kamatani, Koichiro Higasa, ...

Keyword(s): Population Stratification, Quality Data, Human Populations, Batch Effects, Sequencing Data, 1000 Genomes Project, Mutational Spectra, 1000 Genomes, Legacy Data, Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. Although some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data are used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower quality data from the early phases of the 1kGP thus continue to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Utilizing the Jaccard index to reveal population stratification in sequencing data: a simulation study and an application to the 1000 Genomes Project

Bioinformatics, 10.1093/bioinformatics/btv752, 2015, Vol 32 (9), pp. 1366-1372

Cited By ~ 23

Author(s): Dmitry Prokopenko, Julian Hecker, Edwin K. Silverman, Marcello Pagano, Markus M. Nöthen, ...

Keyword(s): Simulation Study, Population Stratification, Jaccard Index, Sequencing Data, 1000 Genomes Project, 1000 Genomes


Prioritising positively selected variants in whole-genome sequencing data using FineMAV

BMC Bioinformatics, 10.1186/s12859-021-04506-9, 2021, Vol 22 (1)

Author(s): Fadilla Wahyudi, Farhang Aghakhanian, Sadequr Rahman, Yik-Ying Teo, Michał Szpak, ...

Keyword(s): Whole Genome Sequencing, Genome Sequencing, Population Genomics, Software Tool, Human Populations, Whole Genome, Sequencing Data, 1000 Genomes Project, 1000 Genomes, Genome Browsers

Abstract

Background: In population genomics, polymorphisms that are highly differentiated between geographically separated populations are often suggestive of Darwinian positive selection. Genomic scans have highlighted several such regions in African and non-African populations, but only a handful of these have functional data that clearly associates candidate variations driving the selection process. Fine-Mapping of Adaptive Variation (FineMAV) was developed to address this in a high-throughput manner using population based whole-genome sequences generated by the 1000 Genomes Project. It pinpoints positively selected genetic variants in sequencing data by prioritizing high frequency, population-specific and functional derived alleles.

Results: We developed a stand-alone software that implements the FineMAV statistic. To graphically visualise the FineMAV scores, it outputs the statistics as bigWig files, which is a common file format supported by many genome browsers. It is available as a command-line and graphical user interface. The software was tested by replicating the FineMAV scores obtained using 1000 Genomes Project African, European, East and South Asian populations and subsequently applied to whole-genome sequencing datasets from Singapore and China to highlight population specific variants that can be subsequently modelled. The software tool is publicly available at https://github.com/fadilla-wahyudi/finemav.

Conclusions: The software tool described here determines genome-wide FineMAV scores, using low or high-coverage whole-genome sequencing datasets, that can be used to prioritize a list of population specific, highly differentiated candidate variants for in vitro or in vivo functional screens. The tool displays these scores on the human genome browsers for easy visualisation, annotation and comparison between different genomic regions in worldwide human populations.


Ancestral Spectrum Analysis With Population-Specific Variants

Frontiers in Genetics, 10.3389/fgene.2021.724638, 2021, Vol 12

Author(s): Gang Shi, Qingmin Kuang

Keyword(s): Nucleotide Polymorphisms, Sequencing Data, 1000 Genomes Project, Specific Population, High Coverage, Single Nucleotide, Target Populations, 1000 Genomes, Sequencing Studies, Best Linear Unbiased

With the advance of sequencing technology, an increasing number of populations have been sequenced to study the histories of worldwide populations, including their divergence, admixtures, migration, and effective sizes. The variants detected in sequencing studies are largely rare and mostly population specific. Population-specific variants are often recent mutations and are informative for revealing substructures and admixtures in populations; however, computational methods and tools to analyze them are still lacking. In this work, we propose using reference populations and single nucleotide polymorphisms (SNPs) specific to the reference populations. Ancestral information, the best linear unbiased estimator (BLUE) of the ancestral proportion, is proposed, which can be used to infer ancestral proportions in recently admixed target populations and measure the extent to which reference populations serve as good proxies for the admixing sources. Based on the same panel of SNPs, the ancestral information is comparable across samples from different studies and is not affected by genetic outliers, related samples, or the sample sizes of the admixed target populations. In addition, ancestral spectrum is useful for detecting genetic outliers or exploring co-ancestry between study samples and the reference populations. The methods are implemented in a program, Ancestral Spectrum Analyzer (ASA), and are applied in analyzing high-coverage sequencing data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP). In the analyses of American populations from the 1000 Genomes Project, we demonstrate that recent admixtures can be dissected from ancient admixtures by comparing ancestral spectra with and without indigenous Americans being included in the reference populations.


Genetic diversity of ‘Very Important Pharmacogenes’ in two South-Asian populations

PeerJ, 10.7717/peerj.12294, 2021, Vol 9, pp. e12294

Author(s): Neeraj Bharti, Ruma Banerjee, Archana Achalere, Sunitha Manjari Kasibhatla, Rajendra Joshi

Keyword(s): Genetic Diversity, Allele Frequency, Population Stratification, Drug Response, Fixation Index, Frequency Variation, 1000 Genomes Project, 1000 Genomes, Link Type, Allele Frequency Variation

Objectives: Reliable identification of population-specific variants is important for building the single nucleotide polymorphism (SNP) profile. In this study, genomic variation using allele frequency differences of pharmacologically important genes for Gujarati Indians in Houston (GIH) and Indian Telugu in the U.K. (ITU) from the 1000 Genomes Project vis-à-vis global population data was studied to understand its role in drug response.

Methods: Joint genotyping approach was used to derive variants of GIH and ITU independently. SNPs of both these populations with significant allele frequency variation (minor allele frequency ≥ 0.05) with super-populations from the 1000 Genomes Project and gnomAD based on Chi-square distribution with p-value of ≤ 0.05 and Bonferroni’s multiple adjustment tests were identified. Population stratification and fixation index analysis was carried out to understand genetic differentiation. Functional annotation of variants was carried out using SnpEff, VEP and CADD score.

Results: Population stratification of VIP genes revealed four clusters viz., single cluster of GIH and ITU, one cluster each of East Asian, European, African populations and Admixed American was found to be admixed. A total of 13 SNPs belonging to ten pharmacogenes were identified to have significant allele frequency variation in both GIH and ITU populations as compared to one or more super-populations. These SNPs belong to VKORC1 (rs17708472, rs2359612, rs8050894) involved in Vitamin K cycle, cytochrome P450 isoforms CYP2C9 (rs1057910), CYP2B6 (rs3211371), CYP2A2 (rs4646425) and CYP2A4 (rs4646440); ATP-binding cassette (ABC) transporter ABCB1 (rs12720067), DPYD1 (rs12119882, rs56160474) involved in pyrimidine metabolism, methyltransferase COMT (rs9332377) and transcriptional factor NR1I2 (rs6785049). SNPs rs1544410 (VDR), rs2725264 (ABCG2), rs5215 and rs5219 (KCNJ11) share high fixation index (≥ 0.5) with either EAS/AFR populations. Missense variants rs1057910 (CYP2C9), rs1801028 (DRD2) and rs1138272 (GSTP1), rs116855232 (NUDT15); intronic variants rs1131341 (NQO1) and rs115349832 (DPYD) are identified to be ‘deleterious’.

Conclusions: Analysis of SNPs pertaining to pharmacogenes in GIH and ITU populations using population structure, fixation index and allele frequency variation provides a premise for understanding the role of genetic diversity in drug response in Asian Indians.
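The screening step described in the abstract above (a minor allele frequency filter, a chi-square test on allele counts, and a Bonferroni adjustment) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the 2x2 allele-count layout, and the choice to apply the MAF filter to the first population only are assumptions made for illustration.

```python
import math


def minor_allele_frequency(alt_count, total_alleles):
    """Return the frequency of the rarer of the two alleles."""
    freq = alt_count / total_alleles
    return min(freq, 1.0 - freq)


def chi2_allele_test(alt_a, total_a, alt_b, total_b):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    alternate/reference allele counts in two populations.
    Returns (chi2 statistic, p-value)."""
    table = [
        [alt_a, total_a - alt_a],
        [alt_b, total_b - alt_b],
    ]
    n = total_a + total_b
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n
            if expected == 0:
                # A monomorphic site carries no information for this test.
                return 0.0, 1.0
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def screen_snps(snps, alpha=0.05, maf_min=0.05):
    """Keep SNPs passing the MAF filter in population A (an assumption),
    then flag those whose p-value clears the Bonferroni threshold.
    `snps` maps a SNP id to (alt_a, total_a, alt_b, total_b)."""
    kept = {
        snp_id: counts
        for snp_id, counts in snps.items()
        if minor_allele_frequency(counts[0], counts[1]) >= maf_min
    }
    if not kept:
        return []
    threshold = alpha / len(kept)  # Bonferroni: divide alpha by the number of tests
    return [
        snp_id
        for snp_id, counts in kept.items()
        if chi2_allele_test(*counts)[1] <= threshold
    ]
```

Calling `screen_snps` with a dictionary of allele counts returns the identifiers that remain significant after correction. Any counts passed in would be the user's own; none are taken from the study.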


Population Stratification and Underrepresentation of Indian Subcontinent Genetic Diversity in the 1000 Genomes Project Dataset

Genome Biology and Evolution, 10.1093/gbe/evw244, 2016, Vol 8 (11), pp. 3460-3470

Cited By ~ 13

Author(s): Dhriti Sengupta, Ananyo Choudhury, Analabha Basu, Michèle Ramsay

Keyword(s): Genetic Diversity, Population Stratification, Indian Subcontinent, 1000 Genomes Project, 1000 Genomes


TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data

10.1101/791665, 2019

Cited By ~ 1

Author(s): Clement Goubert, Jainy Thomas, Lindsay M. Payer, Jeffrey M. Kidd, Julie Feusier, ...

Keyword(s): Population Genomics, Mobile Element, Whole Genome Sequencing Data, Human Populations, Whole Genome, Structural Variants, Sequencing Data, 1000 Genomes, Standard Set, Whole Genome Resequencing

Abstract Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alu are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alu and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline -- TypeTE -- which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a ‘gold standard’ set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.


Evaluation of MC1R high-throughput nucleotide sequencing data generated by the 1000 Genomes Project

Genetics and Molecular Biology, 10.1590/1678-4685-gmb-2016-0180, 2017, Vol 40 (2), pp. 530-539

Cited By ~ 3

Author(s): Leonardo Arduino Marano, Letícia Marcorin, Erick da Cruz Castelli, Celso Teixeira Mendes-Junior

Keyword(s): High Throughput, Nucleotide Sequencing, Sequencing Data, 1000 Genomes Project, 1000 Genomes


TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data

Nucleic Acids Research, 10.1093/nar/gkaa074, 2020, Vol 48 (6), pp. e36-e36

Cited By ~ 4

Author(s): Clément Goubert, Jainy Thomas, Lindsay M Payer, Jeffrey M Kidd, Julie Feusier, ...

Keyword(s): Population Genomics, Mobile Element, Whole Genome Sequencing Data, Human Populations, Whole Genome, Structural Variants, Sequencing Data, 1000 Genomes, Alu Insertions, Whole Genome Resequencing

Abstract Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alus are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alus and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline – TypeTE – which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a high-quality set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.
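The genotype-likelihood step mentioned in the abstract above can be illustrated with a generic textbook model. The sketch below is not TypeTE's implementation: it assumes a simple binomial model over reads supporting the presence or absence allele, and the function names and the `error_rate` parameter are hypothetical. The last function shows how concordance with a reference set of genotypes (such as PCR-based calls) could be measured.

```python
import math


def genotype_likelihoods(presence_reads, absence_reads, error_rate=0.01):
    """Binomial likelihood of the read counts under each genotype, where the
    genotype is the number of copies (0, 1 or 2) of the insertion allele.
    Assumed model: a read supports the presence allele with probability
    error_rate (0 copies), 0.5 (1 copy) or 1 - error_rate (2 copies)."""
    n = presence_reads + absence_reads
    p_presence = {0: error_rate, 1: 0.5, 2: 1.0 - error_rate}
    return {
        genotype: math.comb(n, presence_reads)
        * p ** presence_reads
        * (1.0 - p) ** absence_reads
        for genotype, p in p_presence.items()
    }


def call_genotype(presence_reads, absence_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(presence_reads, absence_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)


def concordance(calls, reference):
    """Fraction of loci where the called genotype matches the reference call."""
    if len(calls) != len(reference) or not calls:
        raise ValueError("calls and reference must be non-empty and equal in length")
    matches = sum(1 for c, r in zip(calls, reference) if c == r)
    return matches / len(calls)
```

Under this model, a locus where reads split evenly between the two alleles is called heterozygous, while a locus with reads for only one allele is called homozygous for that allele.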


Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project

Nucleic Acids Research, 10.1093/nar/gkr342, 2011, Vol 39 (16), pp. 7058-7076

Cited By ~ 49

Author(s): Xinmeng Jasmine Mu, Zhi John Lu, Yong Kong, Hugo Y. K. Lam, Mark B. Gerstein

Keyword(s): Genomic Variation, Sequencing Data, 1000 Genomes Project, 1000 Genomes, Population Scale
