Harvesting information from ultra-short ancient DNA sequences

2018 ◽  
Author(s):  
Cesare de Filippo ◽  
Matthias Meyer ◽  
Kay Prüfer

Abstract The study of ancient DNA is hampered by degradation, resulting in short DNA fragments. Advances in laboratory methods have made it possible to retrieve short DNA fragments, thereby improving access to DNA preserved in highly degraded, ancient material. However, such material contains large amounts of microbial contamination in addition to DNA fragments from the ancient organism. The resulting mixture of sequences constitutes a challenge for computational analysis, since microbial sequences are hard to distinguish from the ancient sequences of interest, especially when they are short. Here, we develop a method to quantify spurious alignments based on the presence or absence of rare variants. We find that spurious alignments are enriched for mismatches and insertion/deletion differences and lack substitution patterns typical of ancient DNA. The impact of spurious alignments can be reduced by filtering on these features and by imposing a sample-specific minimum length cutoff. We apply this approach to sequences from the ~430,000-year-old Sima de los Huesos hominin remains, which contain particularly short DNA fragments, and increase the amount of usable sequence data by 17-150%. This allows us to place a third specimen from the site on the Neandertal lineage. Our method maximizes the sequence data amenable to genetic analysis from highly degraded ancient material and avoids pitfalls that are associated with the analysis of ultra-short DNA sequences.
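The filtering step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Alignment` record, the function name `filter_alignments`, and every threshold value are hypothetical, and the abstract does not say how the sample-specific cutoff is chosen.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one aligned read (not a real library type)."""
    read_id: str
    length: int        # aligned read length in bp
    mismatches: int    # number of mismatches to the reference
    has_indel: bool    # True if the alignment contains an insertion/deletion


def filter_alignments(
    alignments: Iterable[Alignment],
    min_length: int,
    max_mismatches: int = 1,
    allow_indels: bool = False,
) -> List[Alignment]:
    """Keep alignments that pass a length cutoff and simple quality features.

    min_length stands in for the sample-specific cutoff; the default values
    for max_mismatches and allow_indels are illustrative assumptions only.
    """
    kept = []
    for aln in alignments:
        if aln.length < min_length:
            continue  # too short: more likely to be a spurious alignment
        if aln.mismatches > max_mismatches:
            continue  # spurious alignments are enriched for mismatches
        if aln.has_indel and not allow_indels:
            continue  # spurious alignments are enriched for indels
        kept.append(aln)
    return kept
```

In practice the cutoff would be set separately for each sample, since the abstract reports that the appropriate minimum length is sample-specific.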

Author(s):  
Adrien Oliva ◽  
Raymond Tobler ◽  
Alan Cooper ◽  
Bastien Llamas ◽  
Yassine Souilmi

Abstract The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA ‘reads’) against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have effectively remained unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four different read-mapping software packages (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.


2021 ◽  
Author(s):  
Tony Zeng ◽  
Yang I Li

Recent progress in deep learning approaches has greatly improved the prediction of RNA splicing from DNA sequence. Here, we present Pangolin, a deep learning model to predict splice site strength in multiple tissues that has been trained on RNA splicing and sequence data from four species. Pangolin outperforms state-of-the-art methods for predicting RNA splicing on a variety of prediction tasks. We use Pangolin to study the impact of genetic variants on RNA splicing, including lineage-specific variants and rare variants of uncertain significance. Pangolin predicts loss-of-function mutations with high accuracy and recall, particularly for mutations that are not missense or nonsense (AUPRC = 0.93), demonstrating remarkable potential for identifying pathogenic variants.


2015 ◽  
Vol 97 (2) ◽  
pp. 394-404 ◽  
Author(s):  
Juan F. Díaz-Nieto ◽  
Sharon A. Jansa ◽  
Robert S. Voss

Abstract Morphological character data are inadequate to resolve the evolutionary relationships of the didelphid genus Chacodelphys, which previous phylogenetic analyses have alternatively suggested might be the sister taxon of Lestodelphys and Thylamys (tribe Thylamyini) or of Monodelphis (tribe Marmosini) in the subfamily Didelphinae. Because fresh material of Chacodelphys is unavailable, we extracted DNA from microscopic fragments of soft tissue adhering to the 95-year-old holotype skull of C. formosa. Phylogenetic analyses of the resulting sequence data convincingly resolve Chacodelphys as the sister taxon of Cryptonanus, a genus with which it had not previously been thought to be closely related. This novel clade (Chacodelphys + Cryptonanus) belongs to an unnamed thylamyine lineage with Gracilinanus and Lestodelphys + Thylamys, but relationships among these taxa remain to be convincingly resolved.


2017 ◽  
Author(s):  
Xiaowei Zhan ◽  
Sai Chen ◽  
Yu Jiang ◽  
Mengzhen Liu ◽  
William G. Iacono ◽  
...  

Abstract Motivation: There is great interest in understanding the impact of rare variants in human diseases using large sequence datasets. In deep sequence datasets of >10,000 samples, ∼10% of the variant sites are observed to be multi-allelic. Many of the multi-allelic variants have been shown to be functional and disease relevant. Proper analysis of multi-allelic variants is critical to the success of a sequencing study, but existing methods do not properly handle multi-allelic variants and can produce highly misleading association results. Results: We propose novel methods to encode multi-allelic sites, conduct single variant and gene-level association analyses, and perform meta-analysis for multi-allelic variants. We evaluated these methods through extensive simulations and the study of a large meta-analysis of ∼18,000 samples on the cigarettes-per-day phenotype. We showed that our joint modeling approach provided an unbiased estimate of genetic effects, greatly improved the power of single variant association tests, and enhanced gene-level tests over existing approaches. Availability: Software packages implementing these methods are available at https://github.com/zhanxw/rvtests and http://genome.sph.umich.edu/wiki/RareMETAL. Contact: [email protected]; [email protected]


2018 ◽  
Author(s):  
Torsten Günther ◽  
Carl Nettelblad

Abstract High-quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverage, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the number of accepted mismatches and increasing the probability of multiple matching sites in the genome. These ancient-DNA-specific properties potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudohaploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
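The two ideas in this abstract that lend themselves to a short example are pseudohaploid calling (randomly sampling one read per site) and measuring how often the reference allele is observed. The sketch below is only an illustration: the function names `pseudohaploid_call` and `reference_fraction` are hypothetical, reads are reduced to the single base they carry at the site, and the abstract does not specify how the authors quantify bias.

```python
import random
from typing import Optional, Sequence


def pseudohaploid_call(alleles: Sequence[str], rng: random.Random) -> Optional[str]:
    """Randomly sample the allele of one read covering a site.

    `alleles` holds the base observed in each read at this site.
    Returns None when no read covers the site.
    """
    if not alleles:
        return None
    return rng.choice(alleles)


def reference_fraction(alleles: Sequence[str], ref: str) -> float:
    """Fraction of reads at a site that carry the reference allele.

    At a truly heterozygous site, unbiased mapping should give a value near
    0.5 on average; values consistently above 0.5 across many such sites
    would suggest reference bias (an assumed diagnostic, for illustration).
    """
    if not alleles:
        raise ValueError("no reads cover this site")
    return sum(1 for a in alleles if a == ref) / len(alleles)
```

Because a pseudohaploid call keeps only one read, any excess of reference-carrying reads at a site translates directly into a higher chance of calling the reference allele.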


2022 ◽  
Author(s):  
Michael N Weedon ◽  
Samuel E Jones ◽  
Jacqueline Lane ◽  
Jiwon Lee ◽  
Hanna M Ollila ◽  
...  

Rare variants in ten genes have been reported to cause Mendelian sleep conditions characterised by extreme sleep duration or timing. These include familial natural short sleep (ADRB1, DEC2/BHLHE41, GRM1 and NPSR1), advanced sleep phase (PER2, PER3, CRY2, CSNK1D and TIMELESS) and delayed sleep phase (CRY1). The associations of variants of these genes with extreme sleep conditions were usually based on clinically ascertained families, and their effects when identified in the population are unknown. We aimed to determine the effects of these variants on sleep traits in large population-based cohorts. We performed genetic association analysis of variants previously reported to be causal for Mendelian sleep and circadian conditions. Analyses were performed using 191,929 individuals with data on sleep and whole-exome or genome-sequence data from 4 population-based studies: UK Biobank, FINRISK, Health-2000-2001, and the Multi-Ethnic Study of Atherosclerosis (MESA). We identified sleep disorders from self-report, hospital and primary care data. We estimated sleep duration and timing measures from self-report and accelerometry data. We identified carriers for 10 out of 12 previously reported pathogenic variants for 8 of the 10 genes. They ranged in frequency from 1 individual with the variant in CSNK1D to 1,574 individuals with a reported variant in the PER3 gene in the UK Biobank. We found no association of any of these variants with extreme sleep or circadian phenotypes. Using sleep timing as a proxy measure for sleep phase, only PER3 and CRY1 variants demonstrated association with earlier and later sleep timing, respectively; however, the magnitude of effect was smaller than previously reported (sleep midpoint ~7 mins earlier and ~5 mins later, respectively). We also performed burden tests of protein-truncating variants (PTVs) or rare missense variants for the 10 genes. Only PTVs in PER2 and PER3 were associated with a relevant trait (for example, 64 individuals with a PTV in PER2 had an odds ratio of 4.4 for being "definitely a morning person", P = 4×10⁻⁸; and had a 57-minute earlier midpoint sleep, P = 5×10⁻⁷). Our results indicate that previously reported variants for Mendelian sleep and circadian conditions are often not highly penetrant when ascertained incidentally from the general population.


Genes ◽  
2020 ◽  
Vol 11 (5) ◽  
pp. 586
Author(s):  
Yu Jiang ◽  
Sai Chen ◽  
Xingyan Wang ◽  
Mengzhen Liu ◽  
William G. Iacono ◽  
...  

There is great interest in understanding the impact of rare variants in human diseases using large sequence datasets. In deep sequence datasets of >10,000 samples, ~10% of the variant sites are observed to be multi-allelic. Many of the multi-allelic variants have been shown to be functional and disease-relevant. Proper analysis of multi-allelic variants is critical to the success of a sequencing study, but existing methods do not properly handle multi-allelic variants and can produce highly misleading association results. We discuss practical issues and methods to encode multi-allelic sites, conduct single-variant and gene-level association analyses, and perform meta-analysis for multi-allelic variants. We evaluated these methods through extensive simulations and the study of a large meta-analysis of ~18,000 samples on the cigarettes-per-day phenotype. We showed that our joint modeling approach provided an unbiased estimate of genetic effects, greatly improved the power of single-variant association tests among methods that can properly estimate allele effects, and enhanced gene-level tests over existing approaches. Software packages implementing these methods are available online.
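One common way to encode a multi-allelic site for joint modeling is to give each alternative allele its own dosage column. The sketch below shows that encoding only as an illustration; the abstract does not describe the authors' encoding in detail, so the function name `encode_multiallelic` and the representation of a genotype as a pair of allele indices (0 for the reference allele, 1..n_alt for the alternatives) are assumptions.

```python
from typing import List, Tuple


def encode_multiallelic(genotype: Tuple[int, int], n_alt: int) -> List[int]:
    """Encode a diploid genotype as one dosage value per alternative allele.

    genotype: pair of allele indices, 0 = reference, 1..n_alt = alternatives.
    n_alt:    number of alternative alleles observed at the site.
    Returns a list of length n_alt whose entries are 0, 1 or 2.
    """
    dosages = [0] * n_alt
    for allele in genotype:
        if allele < 0 or allele > n_alt:
            raise ValueError(f"allele index {allele} out of range for {n_alt} alternatives")
        if allele > 0:
            dosages[allele - 1] += 1  # count copies of this alternative allele
    return dosages
```

With this encoding, the dosage columns of all alternative alleles at a site can enter a regression together, which is one way a joint model can estimate a separate effect for each allele.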


2012 ◽  
Vol 30 (2) ◽  
pp. 253-262 ◽  
Author(s):  
Martyna Molak ◽  
Eline D. Lorenzen ◽  
Beth Shapiro ◽  
Simon Y.W. Ho

Abstract In recent years, ancient DNA has increasingly been used for estimating molecular timescales, particularly in studies of substitution rates and demographic histories. Molecular clocks can be calibrated using temporal information from ancient DNA sequences. This information comes from the ages of the ancient samples, which can be estimated by radiocarbon dating the source material or by dating the layers in which the material was deposited. Both methods involve sources of uncertainty. The performance of Bayesian phylogenetic inference depends on the information content of the data set, which includes variation in the DNA sequences and the structure of the sample ages. Various sources of estimation error can reduce our ability to estimate rates and timescales accurately and precisely. We investigated the impact of sample-dating uncertainties on the estimation of evolutionary timescale parameters using the software BEAST. Our analyses involved 11 published data sets and focused on estimates of substitution rate and root age. We show that, provided that samples have been accurately dated and have a broad temporal span, it might be unnecessary to account for sample-dating uncertainty in Bayesian phylogenetic analyses of ancient DNA. We also investigated the sample size and temporal span of the ancient DNA sequences needed to estimate phylogenetic timescales reliably. Our results show that the range of sample ages plays a crucial role in determining the quality of the results but that accurate and precise phylogenetic estimates of timescales can be made even with only a few ancient sequences. These findings have important practical consequences for studies of molecular rates, timescales, and population dynamics.


2021 ◽  
Author(s):  
André Elias Rodrigues Soares ◽  
Nikolaus Boroffka ◽  
Oskar Schröder ◽  
Leonid Sverchkov ◽  
Norbert Benecke ◽  
...  

Central Asia has been an important region connecting the different parts of Eurasia throughout history and prehistory, with large states developing in this region during the Iron Age. Archaeogenomics is a powerful addition to the zooarchaeological toolkit for understanding the relation of these societies to animals. Here, we present the genetic identification of a goitered gazelle specimen (Gazella subgutturosa) at the site Gazimulla-Tepa, in modern-day Uzbekistan, confirming hunting of the species in the region during the Iron Age. The sample was directly radiocarbon dated to 2724-2439 cal BP. A phylogenetic analysis of the mitochondrial genome places the individual within the modern variation of G. subgutturosa. Our data represent both the first ancient DNA and the first nuclear DNA sequences of this species. The lack of genomic resources available for this gazelle and related species prevented us from performing a more in-depth analysis of the nuclear sequences generated. Therefore, we are making our sequence data available to the research community to facilitate further research on this now-threatened species, which has been subject to human hunting for several millennia across its entire range on the Asian continent.


2017 ◽  
Author(s):  
K. Jun Tong ◽  
David A. Duchêne ◽  
Sebastián Duchêne ◽  
Jemma L. Geoghegan ◽  
Simon Y.W. Ho

Abstract The estimation of evolutionary rates from ancient DNA sequences can be negatively affected by among-lineage rate variation and non-random sampling. Using a simulation study, we compared the performance of three phylogenetic methods for inferring evolutionary rates from time-structured data sets: root-to-tip regression, least-squares dating, and Bayesian inference. Our results show that these methods produce reliable estimates when the substitution rate is high, rate variation is low, and samples of similar ages are not phylogenetically clustered. The interaction of these factors is particularly important for Bayesian estimation of evolutionary rates. We also inferred rates for time-structured mitogenomic data sets from six vertebrate species. Root-to-tip regression estimated a different rate from least-squares dating and Bayesian inference for mitogenomes from the horse, which has high levels of among-lineage rate variation. We recommend using multiple methods of inference and testing data for temporal signal, among-lineage rate variation, and phylo-temporal clustering.
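Of the three methods compared in this abstract, root-to-tip regression is the simplest to illustrate: each sample's genetic distance from the root is regressed on its sampling time, and the slope serves as the rate estimate. The following is a minimal ordinary least-squares sketch, not the software used in the study; the function name `root_to_tip_regression` is hypothetical, and the inputs are assumed to be precomputed from a rooted tree, with sampling times increasing toward the present.

```python
from typing import Sequence, Tuple


def root_to_tip_regression(
    dates: Sequence[float], distances: Sequence[float]
) -> Tuple[float, float]:
    """Fit distance = intercept + slope * date by ordinary least squares.

    dates:     sampling time of each tip (larger = more recent).
    distances: root-to-tip genetic distance of each tip (substitutions/site).
    Returns (slope, intercept); the slope estimates the substitution rate
    per unit of time under a strict-clock assumption.
    """
    n = len(dates)
    if n != len(distances) or n < 2:
        raise ValueError("need at least two tips with matching dates and distances")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical; no temporal signal")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    slope = sxy / sxx
    intercept = mean_d - slope * mean_t
    return slope, intercept
```

A slope that is close to zero or negative would indicate weak temporal signal, which is one reason the abstract recommends testing data for temporal signal before interpreting rate estimates.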

