Recombination impacts damaging and disease mutation accumulation in human populations

Archaic mitochondrial DNA inserts in modern day nuclear genomes

BMC Genomics ◽

10.1186/s12864-019-6392-8 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 2

Author(s):

Robert Bücking ◽

Murray P Cox ◽

Georgi Hudjashov ◽

Lauri Saag ◽

Herawati Sudoyo ◽

...

Keyword(s):

Mitochondrial DNA ◽

Nuclear DNA ◽

Modern Human ◽

Sub Saharan Africa ◽

Next Generation Sequencing Data ◽

Human Populations ◽

Sequencing Data ◽

Modern Humans ◽

High Coverage ◽

Nuclear Genomes

Abstract Background: Traces of interbreeding of Neanderthals and Denisovans with modern humans in the form of archaic DNA have been detected in the genomes of present-day human populations outside sub-Saharan Africa. Up to now, only nuclear archaic DNA has been detected in modern humans; we therefore attempted to identify archaic mitochondrial DNA (mtDNA) residing in modern human nuclear genomes as nuclear inserts of mitochondrial DNA (NUMTs). Results: We analysed 221 high-coverage genomes from Oceania and Indonesia using an approach which identifies reads that map both to the nuclear and mitochondrial DNA. We then classified reads according to the source of the mtDNA, and found one NUMT of Denisovan mtDNA origin, present in 15 analysed genomes; analysis of the flanking region suggests that this insertion is more likely to have happened in a Denisovan individual and introgressed into modern humans with the Denisovan nuclear DNA, rather than in a descendant of a Denisovan female and a modern human male. Conclusions: Here we present our pipeline for detecting introgressed NUMTs in next generation sequencing data that can be used on genomes sequenced in the future. Further discovery of such archaic NUMTs in modern humans can be used to detect interbreeding between archaic and modern humans and can reveal new insights into the nature of such interbreeding events.
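The read-selection idea in the Results (keeping reads that map to both nuclear and mitochondrial DNA) can be illustrated with a minimal sketch. This is not the authors' pipeline: the `ReadPair` record, the contig names in `MT_NAMES`, the window size, and the support threshold are all assumptions chosen for illustration, and a real analysis would read alignments from BAM files and then classify the mitochondrial portion by its source.

```python
from collections import Counter
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical, simplified record of where the two mates of a pair aligned."""
    name: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


# Assumed names for the mitochondrial contig in the reference.
MT_NAMES = {"chrM", "MT"}


def numt_candidates(pairs, window=500, min_support=2):
    """Count pairs with one mate on a nuclear contig and one on the mtDNA.

    Returns {(nuclear_chrom, window_index): n_pairs} for nuclear windows
    supported by at least `min_support` such pairs.
    """
    support = Counter()
    for p in pairs:
        mate1_mt = p.chrom1 in MT_NAMES
        mate2_mt = p.chrom2 in MT_NAMES
        if mate1_mt == mate2_mt:
            continue  # both nuclear or both mitochondrial: not informative
        # Record the position of the nuclear mate.
        chrom, pos = (p.chrom2, p.pos2) if mate1_mt else (p.chrom1, p.pos1)
        support[(chrom, pos // window)] += 1
    return {key: n for key, n in support.items() if n >= min_support}


pairs = [
    ReadPair("r1", "chr3", 1020, "chrM", 300),
    ReadPair("r2", "chrM", 310, "chr3", 1100),
    ReadPair("r3", "chr3", 5000, "chr3", 5300),
    ReadPair("r4", "chrM", 100, "chrM", 400),
    ReadPair("r5", "chr7", 42, "chrM", 900),
]
print(numt_candidates(pairs))  # {('chr3', 2): 2}
```

Binning by window is only one simple way to group supporting pairs; the threshold guards against isolated chimeric reads.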


Legacy Data Confound Genomics Studies

Molecular Biology and Evolution ◽

10.1093/molbev/msz201 ◽

2019 ◽

Vol 37 (1) ◽

pp. 2-10 ◽

Cited By ~ 5

Author(s):

Luke Anderson-Trocmé ◽

Rick Farouni ◽

Mathieu Bourgey ◽

Yoichiro Kamatani ◽

Koichiro Higasa ◽

...

Keyword(s):

Population Stratification ◽

Quality Data ◽

Human Populations ◽

Batch Effects ◽

Sequencing Data ◽

1000 Genomes Project ◽

Mutational Spectra ◽

1000 Genomes ◽

Legacy Data ◽

Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. Although some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data are used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower quality data from the early phases of the 1kGP thus continue to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Ancestral Spectrum Analysis With Population-Specific Variants

Frontiers in Genetics ◽

10.3389/fgene.2021.724638 ◽

2021 ◽

Vol 12 ◽

Author(s):

Gang Shi ◽

Qingmin Kuang

Keyword(s):

Nucleotide Polymorphisms ◽

Sequencing Data ◽

1000 Genomes Project ◽

Specific Population ◽

High Coverage ◽

Single Nucleotide ◽

Target Populations ◽

1000 Genomes ◽

Sequencing Studies ◽

Best Linear Unbiased

With the advance of sequencing technology, an increasing number of populations have been sequenced to study the histories of worldwide populations, including their divergence, admixtures, migration, and effective sizes. The variants detected in sequencing studies are largely rare and mostly population specific. Population-specific variants are often recent mutations and are informative for revealing substructures and admixtures in populations; however, computational methods and tools to analyze them are still lacking. In this work, we propose using reference populations and single nucleotide polymorphisms (SNPs) specific to the reference populations. Ancestral information, the best linear unbiased estimator (BLUE) of the ancestral proportion, is proposed, which can be used to infer ancestral proportions in recently admixed target populations and measure the extent to which reference populations serve as good proxies for the admixing sources. Based on the same panel of SNPs, the ancestral information is comparable across samples from different studies and is not affected by genetic outliers, related samples, or the sample sizes of the admixed target populations. In addition, ancestral spectrum is useful for detecting genetic outliers or exploring co-ancestry between study samples and the reference populations. The methods are implemented in a program, Ancestral Spectrum Analyzer (ASA), and are applied in analyzing high-coverage sequencing data from the 1000 Genomes Project and the Human Genome Diversity Project (HGDP). In the analyses of American populations from the 1000 Genomes Project, we demonstrate that recent admixtures can be dissected from ancient admixtures by comparing ancestral spectra with and without indigenous Americans being included in the reference populations.


Archaic mitochondrial DNA inserts in modern day nuclear genomes

10.21203/rs.2.14881/v3 ◽

2019 ◽

Author(s):

Robert Bücking ◽

Murray P Cox ◽

Georgi Hudjashov ◽

Lauri Saag ◽

Herawati Sudoyo ◽

...

Keyword(s):

Mitochondrial DNA ◽

Nuclear DNA ◽

Modern Human ◽

Next Generation Sequencing Data ◽

Human Populations ◽

Sequencing Data ◽

Modern Humans ◽

High Coverage ◽

Nuclear Genomes ◽

Generation Sequencing

Abstract Background: Traces of interbreeding of Neanderthals and Denisovans with modern humans in the form of archaic DNA have been detected in the genomes of present-day human populations outside sub-Saharan Africa. Up to now, only nuclear archaic DNA has been detected in modern humans; we therefore attempted to identify archaic mitochondrial DNA (mtDNA) residing in modern human nuclear genomes as nuclear inserts of mitochondrial DNA (NUMTs). Results: We analysed 221 high-coverage genomes from Oceania and Indonesia using an approach which identifies reads that map both to the nuclear and mitochondrial DNA. We then classified reads according to the source of the mtDNA, and found one NUMT of Denisovan mtDNA origin; analysis of the flanking region suggests that this insertion is more likely to have happened in a Denisovan individual and introgressed into modern humans with the Denisovan nuclear DNA, rather than in a descendant of a Denisovan female and a modern human male. Conclusions: Here we present our pipeline for detecting introgressed NUMTs in next generation sequencing data that can be used on genomes sequenced in the future. Further discovery of such archaic NUMTs in modern humans can be used to detect interbreeding between archaic and modern humans and can reveal new insights into the nature of such interbreeding events.


Archaic mitochondrial DNA inserts in modern day nuclear genomes

10.21203/rs.2.14881/v1 ◽

2019 ◽

Author(s):

Robert Bücking ◽

Murray P Cox ◽

Georgi Hudjashov ◽

Lauri Saag ◽

Herawati Sudoyo ◽

...

Keyword(s):

Mitochondrial DNA ◽

Nuclear DNA ◽

Modern Human ◽

Next Generation Sequencing Data ◽

Human Populations ◽

Sequencing Data ◽

Modern Humans ◽

High Coverage ◽

Nuclear Genomes ◽

Generation Sequencing

Abstract Background: Traces of interbreeding of Neanderthals and Denisovans with modern humans in the form of archaic DNA have been detected in the genomes of present-day human populations outside sub-Saharan Africa. Up to now, only nuclear archaic DNA has been detected in modern humans; we therefore attempted to identify archaic mitochondrial DNA (mtDNA) residing in modern human nuclear genomes as nuclear inserts of mitochondrial DNA (NUMTs). Results: We analysed 221 high-coverage genomes from Oceania and Indonesia using an approach which identifies reads that map both to the nuclear and mitochondrial DNA. We then classified reads according to the source of the mtDNA, and found one NUMT of Denisovan mtDNA origin; analysis of the flanking region suggests that this insertion is more likely to have happened in a Denisovan individual and introgressed into modern humans with the Denisovan nuclear DNA, rather than in a descendant of a Denisovan female and a modern human male. Conclusions: Here we present our pipeline for detecting introgressed NUMTs in next generation sequencing data that can be used on genomes sequenced in the future. Further discovery of such archaic NUMTs in modern humans can be used to detect interbreeding between archaic and modern humans and can reveal new insights into the nature of such interbreeding events.


Archaic mitochondrial DNA inserts in modern day nuclear genomes

10.21203/rs.2.14881/v2 ◽

2019 ◽

Author(s):

Robert Bücking ◽

Murray P Cox ◽

Georgi Hudjashov ◽

Lauri Saag ◽

Herawati Sudoyo ◽

...

Keyword(s):

Mitochondrial DNA ◽

Nuclear DNA ◽

Modern Human ◽

Next Generation Sequencing Data ◽

Human Populations ◽

Sequencing Data ◽

Modern Humans ◽

High Coverage ◽

Nuclear Genomes ◽

Generation Sequencing

Abstract Background: Traces of interbreeding of Neanderthals and Denisovans with modern humans in the form of archaic DNA have been detected in the genomes of present-day human populations outside sub-Saharan Africa. Up to now, only nuclear archaic DNA has been detected in modern humans; we therefore attempted to identify archaic mitochondrial DNA (mtDNA) residing in modern human nuclear genomes as nuclear inserts of mitochondrial DNA (NUMTs). Results: We analysed 221 high-coverage genomes from Oceania and Indonesia using an approach which identifies reads that map both to the nuclear and mitochondrial DNA. We then classified reads according to the source of the mtDNA, and found one NUMT of Denisovan mtDNA origin; analysis of the flanking region suggests that this insertion is more likely to have happened in a Denisovan individual and introgressed into modern humans with the Denisovan nuclear DNA, rather than in a descendant of a Denisovan female and a modern human male. Conclusions: Here we present our pipeline for detecting introgressed NUMTs in next generation sequencing data that can be used on genomes sequenced in the future. Further discovery of such archaic NUMTs in modern humans can be used to detect interbreeding between archaic and modern humans and can reveal new insights into the nature of such interbreeding events.


Local adaptation and archaic introgression shape global diversity at human structural variant loci

10.1101/2021.01.26.428314 ◽

2021 ◽

Author(s):

Stephanie M. Yan ◽

Rachel M. Sherman ◽

Dylan J. Taylor ◽

Divya R. Nair ◽

Andrew N. Bortvin ◽

...

Keyword(s):

Local Adaptation ◽

Human Populations ◽

Sequencing Data ◽

High Coverage ◽

Short Read ◽

Asian Populations ◽

Long Read ◽

Archaic Introgression ◽

Immune Related Genes

Abstract Large genomic insertions, deletions, and inversions are a potent source of functional and fitness-altering variation, but are challenging to resolve with short-read DNA sequencing alone. While recent long-read sequencing technologies have greatly expanded the catalog of structural variants (SVs), their costs have so far precluded their application at population scales. Given these limitations, the role of SVs in human adaptation remains poorly characterized. Here, we used a graph-based approach to genotype 107,866 long-read-discovered SVs in short-read sequencing data from diverse human populations. We then applied an admixture-aware method to scan these SVs for patterns of population-specific frequency differentiation, a signature of local adaptation. We identified 220 SVs exhibiting extreme frequency differentiation, including several SVs that were among the lead variants at their corresponding loci. The top two signatures traced to separate insertion and deletion polymorphisms at the immunoglobulin heavy chain locus, together tagging a 325 Kbp haplotype that swept to high frequency and was subsequently fragmented by recombination. Alleles defining this haplotype are nearly fixed (60-95%) in certain Southeast Asian populations, but are rare or absent from other global populations composing the 1000 Genomes Project. Further investigation revealed that the haplotype closely matches with sequences observed in two of three high-coverage Neanderthal genomes, providing strong evidence of a Neanderthal-introgressed origin. This extraordinary episode of positive selection, which we infer to have occurred between 1700 and 8400 years ago, corroborates the role of immune-related genes as prominent targets of adaptive archaic introgression. Our study demonstrates how combining recent advances in genome sequencing, genotyping algorithms, and population genetic methods can reveal signatures of key evolutionary events that remained hidden within poorly resolved regions of the genome.


TypeTE: a tool to genotype mobile element insertions from whole genome resequencing data

10.1101/791665 ◽

2019 ◽

Cited By ~ 1

Author(s):

Clement Goubert ◽

Jainy Thomas ◽

Lindsay M. Payer ◽

Jeffrey M. Kidd ◽

Julie Feusier ◽

...

Keyword(s):

Population Genomics ◽

Mobile Element ◽

Whole Genome Sequencing Data ◽

Human Populations ◽

Whole Genome ◽

Structural Variants ◽

Sequencing Data ◽

1000 Genomes ◽

Standard Set ◽

Whole Genome Resequencing

Abstract Alu retrotransposons account for more than 10% of the human genome, and insertions of these elements create structural variants segregating in human populations. Such polymorphic Alu are powerful markers to understand population structure, and they represent variants that can greatly impact genome function, including gene expression. Accurate genotyping of Alu and other mobile elements has been challenging. Indeed, we found that Alu genotypes previously called for the 1000 Genomes Project are sometimes erroneous, which poses significant problems for phasing these insertions with other variants that comprise the haplotype. To ameliorate this issue, we introduce a new pipeline -- TypeTE -- which genotypes Alu insertions from whole-genome sequencing data. Starting from a list of polymorphic Alus, TypeTE identifies the hallmarks (poly-A tail and target site duplication) and orientation of Alu insertions using local re-assembly to reconstruct presence and absence alleles. Genotype likelihoods are then computed after re-mapping sequencing reads to the reconstructed alleles. Using a ‘gold standard’ set of PCR-based genotyping of >200 loci, we show that TypeTE improves genotype accuracy from 83% to 92% in the 1000 Genomes dataset. TypeTE can be readily adapted to other retrotransposon families and brings a valuable toolbox addition for population genomics.
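To make the genotype-likelihood step concrete, here is a minimal sketch of how likelihoods could be derived from counts of reads supporting the presence and absence alleles. It assumes a simple binomial model with a fixed per-read error rate; the abstract does not specify TypeTE's actual likelihood model, and the function names and the `error` parameter are hypothetical.

```python
from math import comb


def genotype_likelihoods(n_presence, n_absence, error=0.01):
    """Binomial likelihood of the read counts under each diploid genotype.

    Assumed model: a read supports the presence allele with probability
    `error` for 0/0, 0.5 for 0/1, and 1 - `error` for 1/1.
    """
    n = n_presence + n_absence
    p_presence = {"0/0": error, "0/1": 0.5, "1/1": 1.0 - error}
    return {
        genotype: comb(n, n_presence) * p**n_presence * (1.0 - p) ** n_absence
        for genotype, p in p_presence.items()
    }


def call_genotype(n_presence, n_absence, error=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(n_presence, n_absence, error)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(0, 20))   # 0/0
print(call_genotype(10, 10))  # 0/1
print(call_genotype(20, 0))   # 1/1
```

In this toy model the quality of the call depends entirely on reads being assigned to the correct allele, which is why accurate reconstruction of both alleles before re-mapping matters.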


Legacy Data Confounds Genomics Studies

10.1101/624908 ◽

2019 ◽

Author(s):

Luke Anderson-Trocmé ◽

Rick Farouni ◽

Mathieu Bourgey ◽

Yoichiro Kamatani ◽

Koichiro Higasa ◽

...

Keyword(s):

Population Stratification ◽

Quality Data ◽

Human Populations ◽

Batch Effects ◽

Sequencing Data ◽

1000 Genomes Project ◽

Mutational Spectra ◽

1000 Genomes ◽

Legacy Data ◽

Early Phases

Abstract Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and a small number of suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.


Comparison of sequencing data processing pipelines and application to underrepresented African human populations

BMC Bioinformatics ◽

10.1186/s12859-021-04407-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Gwenna Breton ◽

Anna C. V. Johansson ◽

Per Sjödin ◽

Carina M. Schlebusch ◽

Mattias Jakobsson

Keyword(s):

Best Practices ◽

High Throughput ◽

High Throughput Sequencing ◽

Variant Calling ◽

Human Populations ◽

Sequencing Data ◽

High Coverage ◽

Individual Level ◽

Bioinformatic Tools ◽

The Individual

Abstract Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies and bioinformatic tools, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its “Best Practices” bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification. Results: We surveyed 29 studies using high-throughput sequencing data, and compared their strategies for data pre-processing and variant calling. We found that processing of data is very variable across studies and that the GATK “Best Practices” are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and with a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals. We compared the pipelines in terms of count of called variants and overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals. We noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that the average genome coverage was correlated with the number of variants called.
Conclusions: We conclude that applying the GATK “Best Practices” pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for coverage of >30X if identifying most variants is important, and working with large sample sizes at the variant calling stage, including for underrepresented individuals and populations.
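The comparison "in terms of count of called variants and overlap of the callsets" can be sketched as a set comparison. This is an illustrative example rather than the study's code: representing each variant as a `(chrom, pos, ref, alt)` tuple and summarising overlap with a Jaccard index are assumptions, and the toy variants are invented for the example.

```python
def callset_overlap(callset_a, callset_b):
    """Summarise how two variant callsets overlap.

    Each callset is an iterable of (chrom, pos, ref, alt) tuples, an
    assumed minimal key for identifying a variant.
    """
    a, b = set(callset_a), set(callset_b)
    shared = a & b
    union = a | b
    return {
        "count_a": len(a),
        "count_b": len(b),
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard index: shared variants over all variants seen in either set.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


pipeline_1 = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 77, "G", "A")]
pipeline_2 = [("chr1", 100, "A", "G"), ("chr2", 77, "G", "A"), ("chr2", 900, "T", "C")]
print(callset_overlap(pipeline_1, pipeline_2))
```

Running the same comparison before and after callset filtering would show whether filtering brings the pipelines closer together, as the abstract reports.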
