Harvesting information from ultra-short ancient DNA sequences

Mapping Intimacies ◽

10.1101/319277 ◽

2018 ◽

Cited By ~ 2

Author(s):

Cesare de Filippo ◽

Matthias Meyer ◽

Kay Prüfer

Keyword(s):

Ancient Dna ◽

Dna Sequences ◽

Computational Analysis ◽

Rare Variants ◽

Sequence Data ◽

Minimum Length ◽

Dna Fragments ◽

Laboratory Methods ◽

Sima De Los Huesos ◽

The Impact

Abstract The study of ancient DNA is hampered by degradation, resulting in short DNA fragments. Advances in laboratory methods have made it possible to retrieve short DNA fragments, thereby improving access to DNA preserved in highly degraded, ancient material. However, such material contains large amounts of microbial contamination in addition to DNA fragments from the ancient organism. The resulting mixture of sequences constitutes a challenge for computational analysis, since microbial sequences are hard to distinguish from the ancient sequences of interest, especially when they are short. Here, we develop a method to quantify spurious alignments based on the presence or absence of rare variants. We find that spurious alignments are enriched for mismatches and insertion/deletion differences and lack substitution patterns typical of ancient DNA. The impact of spurious alignments can be reduced by filtering on these features and by imposing a sample-specific minimum length cutoff. We apply this approach to sequences from the ~430,000-year-old Sima de los Huesos hominin remains, which contain particularly short DNA fragments, and increase the amount of usable sequence data by 17-150%. This allows us to place a third specimen from the site on the Neandertal lineage. Our method maximizes the sequence data amenable to genetic analysis from highly degraded ancient material and avoids pitfalls that are associated with the analysis of ultra-short DNA sequences.
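The filtering idea in this abstract (drop alignments with many mismatches or with insertion/deletion differences, and apply a sample-specific minimum length) can be illustrated with a minimal Python sketch. The `Alignment` record, the `keep_alignment` function, and the threshold values are hypothetical and chosen only for illustration; they are not the authors' implementation, and the paper's rare-variant-based estimate of spurious alignments is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one aligned sequence."""
    length: int        # aligned sequence length in bp
    mismatches: int    # number of mismatches to the reference
    has_indel: bool    # True if the alignment contains an insertion/deletion


def keep_alignment(aln, min_length, max_mismatches=1):
    """Return True if the alignment passes all three illustrative filters."""
    return (
        aln.length >= min_length
        and aln.mismatches <= max_mismatches
        and not aln.has_indel
    )


def filter_alignments(alignments, min_length, max_mismatches=1):
    """Keep only alignments that pass the length, mismatch, and indel filters."""
    return [a for a in alignments if keep_alignment(a, min_length, max_mismatches)]
```

In practice the length cutoff would be chosen per sample, as the abstract describes, rather than fixed in advance.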


Systematic benchmark of ancient DNA read mapping

Briefings in Bioinformatics ◽

10.1093/bib/bbab076 ◽

2021 ◽

Author(s):

Adrien Oliva ◽

Raymond Tobler ◽

Alan Cooper ◽

Bastien Llamas ◽

Yassine Souilmi

Keyword(s):

Ancient Dna ◽

Dna Sequences ◽

Population Genetic ◽

Reference Genome ◽

Population Data ◽

Human Populations ◽

Current Standard ◽

Read Mapping ◽

Reference Bias ◽

The Impact

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA ‘reads’) against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four read-mapping programs (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.


Predicting RNA splicing from DNA sequence using Pangolin

10.1101/2021.07.06.451243 ◽

2021 ◽

Author(s):

Tony Zeng ◽

Yang I Li

Keyword(s):

Deep Learning ◽

Dna Sequence ◽

Rna Splicing ◽

Rare Variants ◽

Sequence Data ◽

Learning Approaches ◽

Loss Of Function ◽

Pathogenic Variants ◽

Uncertain Significance ◽

The Impact

Recent progress in deep learning approaches has greatly improved the prediction of RNA splicing from DNA sequence. Here, we present Pangolin, a deep learning model to predict splice site strength in multiple tissues that has been trained on RNA splicing and sequence data from four species. Pangolin outperforms state-of-the-art methods for predicting RNA splicing on a variety of prediction tasks. We use Pangolin to study the impact of genetic variants on RNA splicing, including lineage-specific variants and rare variants of uncertain significance. Pangolin predicts loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense (AUPRC = 0.93), demonstrating remarkable potential for identifying pathogenic variants.


Phylogenetic relationships of Chacodelphys (Marsupialia: Didelphidae: Didelphinae) based on “ancient” DNA sequences

Journal of Mammalogy ◽

10.1093/jmammal/gyv197 ◽

2015 ◽

Vol 97 (2) ◽

pp. 394-404 ◽

Cited By ~ 4

Author(s):

Juan F. Díaz-Nieto ◽

Sharon A. Jansa ◽

Robert S. Voss

Keyword(s):

Soft Tissue ◽

Ancient Dna ◽

Dna Sequences ◽

Phylogenetic Relationships ◽

Morphological Character ◽

Sequence Data ◽

Phylogenetic Analyses ◽

Sister Taxon ◽

Evolutionary Relationships ◽

Fresh Material

Abstract Morphological character data are inadequate to resolve the evolutionary relationships of the didelphid genus Chacodelphys, which previous phylogenetic analyses have alternatively suggested might be the sister taxon of Lestodelphys and Thylamys (tribe Thylamyini) or of Monodelphis (tribe Marmosini) in the subfamily Didelphinae. Because fresh material of Chacodelphys is unavailable, we extracted DNA from microscopic fragments of soft tissue adhering to the 95-year-old holotype skull of C. formosa. Phylogenetic analyses of the resulting sequence data convincingly resolve Chacodelphys as the sister taxon of Cryptonanus, a genus with which it had not previously been thought to be closely related. This novel clade (Chacodelphys + Cryptonanus) belongs to an unnamed thylamyine lineage with Gracilinanus and Lestodelphys + Thylamys, but relationships among these taxa remain to be convincingly resolved.


Association Analysis and Meta-Analysis of Multi-allelic Variants for Large Scale Sequence Data

10.1101/197913 ◽

2017 ◽

Author(s):

Xiaowei Zhan ◽

Sai Chen ◽

Yu Jiang ◽

Mengzhen Liu ◽

William G. Iacono ◽

...

Keyword(s):

Large Scale ◽

Rare Variants ◽

Sequence Data ◽

Meta Analysis ◽

Joint Modeling ◽

Allelic Variants ◽

Association Analyses ◽

Link Type ◽

Gene Level ◽

The Impact

Abstract Motivation: There is great interest in understanding the impact of rare variants in human diseases using large sequence datasets. In deep sequence datasets of >10,000 samples, ∼10% of the variant sites are observed to be multi-allelic. Many of the multi-allelic variants have been shown to be functional and disease relevant. Proper analysis of multi-allelic variants is critical to the success of a sequencing study, but existing methods do not properly handle multi-allelic variants and can produce highly misleading association results. Results: We propose novel methods to encode multi-allelic sites, conduct single variant and gene-level association analyses, and perform meta-analysis for multi-allelic variants. We evaluated these methods through extensive simulations and the study of a large meta-analysis of ∼18,000 samples on the cigarettes-per-day phenotype. We showed that our joint modeling approach provided an unbiased estimate of genetic effects, greatly improved the power of single variant association tests, and enhanced gene-level tests over existing approaches. Availability: Software packages implementing these methods are available at https://github.com/zhanxw/rvtests and http://genome.sph.umich.edu/wiki/RareMETAL. Contact: [email protected]; [email protected]
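One common way to encode a multi-allelic site for joint modeling is to give each alternate allele its own dosage column. The Python sketch below illustrates that idea only; the function name `encode_multiallelic` and the VCF-style genotype strings are assumptions made for this example, and it is not claimed to match the encoding used by the authors' software.

```python
def encode_multiallelic(genotype, n_alt):
    """Convert a VCF-style genotype string into per-alternate-allele dosages.

    genotype: string such as "0/1", "1/2" or "2|2" (0 = reference allele).
    n_alt:    number of alternate alleles at the site.
    Returns a list of length n_alt; entry i counts copies of alternate
    allele i + 1. A missing call (".") yields a list of None values.
    """
    dosages = [0] * n_alt
    # Treat phased ("|") and unphased ("/") separators the same way.
    for allele in genotype.replace("|", "/").split("/"):
        if allele == ".":
            return [None] * n_alt
        index = int(allele)
        if index > 0:
            dosages[index - 1] += 1
    return dosages
```

For a site with two alternate alleles, a `1/2` genotype becomes `[1, 1]`, so both alleles can enter a regression as separate covariates instead of being collapsed into a single "non-reference" count.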


The presence and impact of reference bias on population genomic studies of prehistoric human populations

10.1101/487983 ◽

2018 ◽

Cited By ~ 1

Author(s):

Torsten Günther ◽

Carl Nettelblad

Keyword(s):

Ancient Dna ◽

Sequence Data ◽

Genomic Analysis ◽

Genomic Research ◽

Human Populations ◽

Reference Allele ◽

Population Genomic ◽

Genomic Studies ◽

Reference Bias ◽

The Impact

Abstract High-quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp – reducing the number of accepted mismatches, and increasing the probability of multiple matching sites in the genome. These ancient-DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudohaploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
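A simple way to picture the relationship between reference bias and fragment length is to tabulate, at sites known to be heterozygous, the fraction of reads carrying the reference allele in each length bin. Without bias that fraction should be close to 0.5. The Python sketch below is an illustration under that assumption; the input format (pairs of read length and a reference-allele flag) and the function name are hypothetical, and this is not the authors' measurement procedure.

```python
from collections import defaultdict


def ref_fraction_by_length(reads, bin_size=10):
    """Fraction of reads carrying the reference allele, per length bin.

    reads: iterable of (length_bp, carries_ref) pairs, all taken from
           sites assumed to be truly heterozygous.
    Returns a dict mapping the start of each length bin to the fraction
    of reads in that bin that carry the reference allele.
    """
    counts = defaultdict(lambda: [0, 0])  # bin start -> [ref reads, all reads]
    for length, carries_ref in reads:
        start = (length // bin_size) * bin_size
        counts[start][0] += int(carries_ref)
        counts[start][1] += 1
    return {start: ref / total for start, (ref, total) in sorted(counts.items())}
```

A fraction that rises above 0.5 as the bins get shorter would be consistent with the negative correlation between bias and fragment length reported in the abstract.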


The impact of Mendelian sleep and circadian genetic variants in a population setting

10.1101/2022.01.04.21268199 ◽

2022 ◽

Author(s):

Michael N Weedon ◽

Samuel E Jones ◽

Jacqueline Lane ◽

Jiwon Lee ◽

Hanna M Ollila ◽

...

Keyword(s):

Sleep Duration ◽

Rare Variants ◽

Sequence Data ◽

Large Population ◽

Population Based ◽

Self Report ◽

Uk Biobank ◽

Sleep Phase ◽

Sleep Timing ◽

The Impact

Rare variants in ten genes have been reported to cause Mendelian sleep conditions characterised by extreme sleep duration or timing. These include familial natural short sleep (ADRB1, DEC2/BHLHE41, GRM1 and NPSR1), advanced sleep phase (PER2, PER3, CRY2, CSNK1D and TIMELESS) and delayed sleep phase (CRY1). The association of variants of these genes with extreme sleep conditions was usually based on clinically ascertained families, and their effects when identified in the population are unknown. We aimed to determine the effects of these variants on sleep traits in large population-based cohorts. We performed genetic association analysis of variants previously reported to be causal for Mendelian sleep and circadian conditions. Analyses were performed using 191,929 individuals with data on sleep and whole-exome or genome-sequence data from 4 population-based studies: UK Biobank, FINRISK, Health-2000-2001, and the Multi-Ethnic Study of Atherosclerosis (MESA). We identified sleep disorders from self-report, hospital and primary care data. We estimated sleep duration and timing measures from self-report and accelerometry data. We identified carriers for 10 out of 12 previously reported pathogenic variants for 8 of the 10 genes. They ranged in frequency from 1 individual with the variant in CSNK1D to 1,574 individuals with a reported variant in the PER3 gene in the UK Biobank. We found no association of any of these variants with extreme sleep or circadian phenotypes. Using sleep timing as a proxy measure for sleep phase, only PER3 and CRY1 variants demonstrated association with earlier and later sleep timing, respectively; however, the magnitude of effect was smaller than previously reported (sleep midpoint ~7 mins earlier and ~5 mins later, respectively). We also performed burden tests of protein-truncating variants (PTVs) or rare missense variants for the 10 genes. Only PTVs in PER2 and PER3 were associated with a relevant trait (for example, 64 individuals with a PTV in PER2 had an odds ratio of 4.4 for being "definitely a morning person", P = 4×10⁻⁸; and had a 57-minute earlier midpoint sleep, P = 5×10⁻⁷). Our results indicate that previously reported variants for Mendelian sleep and circadian conditions are often not highly penetrant when ascertained incidentally from the general population.


Association Analysis and Meta-Analysis of Multi-Allelic Variants for Large-Scale Sequence Data

Genes ◽

10.3390/genes11050586 ◽

2020 ◽

Vol 11 (5) ◽

pp. 586

Author(s):

Yu Jiang ◽

Sai Chen ◽

Xingyan Wang ◽

Mengzhen Liu ◽

William G. Iacono ◽

...

Keyword(s):

Large Scale ◽

Rare Variants ◽

Sequence Data ◽

Meta Analysis ◽

Joint Modeling ◽

Allelic Variants ◽

Association Analyses ◽

Software Packages ◽

Gene Level ◽

The Impact

There is great interest in understanding the impact of rare variants in human diseases using large sequence datasets. In deep sequence datasets of >10,000 samples, ~10% of the variant sites are observed to be multi-allelic. Many of the multi-allelic variants have been shown to be functional and disease-relevant. Proper analysis of multi-allelic variants is critical to the success of a sequencing study, but existing methods do not properly handle multi-allelic variants and can produce highly misleading association results. We discuss practical issues and methods to encode multi-allelic sites, conduct single-variant and gene-level association analyses, and perform meta-analysis for multi-allelic variants. We evaluated these methods through extensive simulations and the study of a large meta-analysis of ~18,000 samples on the cigarettes-per-day phenotype. We showed that our joint modeling approach provided an unbiased estimate of genetic effects, greatly improved the power of single-variant association tests among methods that can properly estimate allele effects, and enhanced gene-level tests over existing approaches. Software packages implementing these methods are available online.


Phylogenetic Estimation of Timescales Using Ancient DNA: The Effects of Temporal Sampling Scheme and Uncertainty in Sample Ages

Molecular Biology and Evolution ◽

10.1093/molbev/mss232 ◽

2012 ◽

Vol 30 (2) ◽

pp. 253-262 ◽

Cited By ~ 31

Author(s):

Martyna Molak ◽

Eline D. Lorenzen ◽

Beth Shapiro ◽

Simon Y.W. Ho

Keyword(s):

Ancient Dna ◽

Dna Sequences ◽

Estimation Error ◽

Phylogenetic Analyses ◽

Published Data ◽

Molecular Clocks ◽

Data Sets ◽

Data Set ◽

Temporal Sampling ◽

The Impact

Abstract In recent years, ancient DNA has increasingly been used for estimating molecular timescales, particularly in studies of substitution rates and demographic histories. Molecular clocks can be calibrated using temporal information from ancient DNA sequences. This information comes from the ages of the ancient samples, which can be estimated by radiocarbon dating the source material or by dating the layers in which the material was deposited. Both methods involve sources of uncertainty. The performance of Bayesian phylogenetic inference depends on the information content of the data set, which includes variation in the DNA sequences and the structure of the sample ages. Various sources of estimation error can reduce our ability to estimate rates and timescales accurately and precisely. We investigated the impact of sample-dating uncertainties on the estimation of evolutionary timescale parameters using the software BEAST. Our analyses involved 11 published data sets and focused on estimates of substitution rate and root age. We show that, provided that samples have been accurately dated and have a broad temporal span, it might be unnecessary to account for sample-dating uncertainty in Bayesian phylogenetic analyses of ancient DNA. We also investigated the sample size and temporal span of the ancient DNA sequences needed to estimate phylogenetic timescales reliably. Our results show that the range of sample ages plays a crucial role in determining the quality of the results but that accurate and precise phylogenetic estimates of timescales can be made even with only a few ancient sequences. These findings have important practical consequences for studies of molecular rates, timescales, and population dynamics.


Ancient DNA from a 2,700-year-old goitered gazelle (Gazella subgutturosa) confirms gazelle hunting in Iron Age Central Asia

10.1101/2021.12.08.471591 ◽

2021 ◽

Author(s):

André Elias Rodrigues Soares ◽

Nikolaus Boroffka ◽

Oskar Schröder ◽

Leonid Sverchkov ◽

Norbert Benecke ◽

...

Keyword(s):

Central Asia ◽

Iron Age ◽

Ancient Dna ◽

Dna Sequences ◽

Nuclear Dna ◽

Sequence Data ◽

Genetic Identification ◽

Important Region ◽

Depth Analysis ◽

Gazella Subgutturosa

Central Asia has been an important region connecting the different parts of Eurasia throughout history and prehistory, with large states developing in this region during the Iron Age. Archaeogenomics is a powerful addition to the zooarchaeological toolkit for understanding the relation of these societies to animals. Here, we present the genetic identification of a goitered gazelle specimen (Gazella subgutturosa) at the site Gazimulla-Tepa, in modern-day Uzbekistan, confirming hunting of the species in the region during the Iron Age. The sample was directly radiocarbon dated to 2724-2439 calBP. A phylogenetic analysis of the mitochondrial genome places the individual within the modern variation of G. subgutturosa. Our data represent both the first ancient DNA and the first nuclear DNA sequences of this species. The lack of genomic resources available for this gazelle and related species prevented us from performing a more in-depth analysis of the nuclear sequences generated. Therefore, we are making our sequence data available to the research community to facilitate further research on this now-threatened species, which has been subject to human hunting for several millennia across its entire range on the Asian continent.


A Comparison of Methods for Estimating Substitution Rates from Ancient DNA Sequence Data

10.1101/162529 ◽

2017 ◽

Author(s):

K. Jun Tong ◽

David A. Duchêne ◽

Sebastián Duchêne ◽

Jemma L. Geoghegan ◽

Simon Y.W. Ho

Keyword(s):

Bayesian Inference ◽

Least Squares ◽

Ancient Dna ◽

Dna Sequences ◽

Sequence Data ◽

Evolutionary Rates ◽

High Rate ◽

Rate Variation ◽

Data Sets ◽

Dna Sequence Data

Abstract The estimation of evolutionary rates from ancient DNA sequences can be negatively affected by among-lineage rate variation and non-random sampling. Using a simulation study, we compared the performance of three phylogenetic methods for inferring evolutionary rates from time-structured data sets: root-to-tip regression, least-squares dating, and Bayesian inference. Our results show that these methods produce reliable estimates when the substitution rate is high, rate variation is low, and samples of similar ages are not phylogenetically clustered. The interaction of these factors is particularly important for Bayesian estimation of evolutionary rates. We also inferred rates for time-structured mitogenomic data sets from six vertebrate species. Root-to-tip regression estimated a different rate from least-squares dating and Bayesian inference for mitogenomes from the horse, which has high levels of among-lineage rate variation. We recommend using multiple methods of inference and testing data for temporal signal, among-lineage rate variation, and phylo-temporal clustering.
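Of the three methods compared in this abstract, root-to-tip regression is the simplest to sketch: each sample's genetic distance from the root is regressed on its sampling time, and the slope is taken as the substitution rate. The Python example below is a minimal ordinary-least-squares version under the assumption of a strict clock; the function name and input format are hypothetical, and it does not reproduce the authors' simulation setup or any particular software package.

```python
def root_to_tip_rate(ages, distances):
    """Estimate substitution rate and root age by root-to-tip regression.

    ages:      sample ages in years before present (0 = modern sample).
    distances: root-to-tip distances in substitutions per site.
    Returns (rate, root_age): rate in substitutions/site/year and the
    age at which the fitted line reaches zero distance.
    """
    n = len(ages)
    # Time runs forward, so an older sample has a smaller (more negative) time.
    times = [-age for age in ages]
    mean_t = sum(times) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(times, distances))
    rate = sxy / sxx                    # slope of the regression line
    intercept = mean_d - rate * mean_t  # fitted distance at the present day
    root_age = intercept / rate         # years before present where distance = 0
    return rate, root_age
```

The fit needs at least two distinct sample ages (otherwise `sxx` is zero), which mirrors the abstract's point that time-structured data must carry temporal signal for rate estimates to be meaningful.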
