The presence and impact of reference bias on population genomic studies of prehistoric human populations

2018
Author(s):
Torsten Günther
Carl Nettelblad

Abstract
High-quality reference genomes are an important resource in genomic research projects. A consequence of mapping sequencing reads against such a reference is that DNA fragments carrying the reference allele are more likely to map successfully, or to receive higher quality scores. This reference bias can affect downstream population genomic analyses when heterozygous sites are falsely considered homozygous for the reference allele.

In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverage, and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp, reducing the number of accepted mismatches and increasing the probability of multiple matching sites in the genome. These properties specific to ancient DNA potentially exacerbate the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudohaploid data, i.e. they randomly sample only one sequencing read per site.

We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans, with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight the importance of being aware of such technical artifacts, and the need for strategies to mitigate their effects. We therefore suggest several post-mapping filtering strategies to resolve reference bias, which substantially reduce its impact.
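The pseudohaploid sampling problem can be made concrete with a toy model (an illustration, not the paper's analysis): if reads carrying the two alleles of a heterozygous site are produced equally often but map with unequal success rates, randomly drawing a single read per site systematically favours the reference allele.

```python
def pseudohaploid_ref_fraction(map_rate_ref: float, map_rate_alt: float) -> float:
    """Expected fraction of pseudohaploid calls carrying the reference
    allele at a truly heterozygous site, where both alleles are sequenced
    equally often but their reads map with different success rates."""
    return map_rate_ref / (map_rate_ref + map_rate_alt)

# Unbiased mapping recovers the expected 50/50 draw:
balanced = pseudohaploid_ref_fraction(1.0, 1.0)   # 0.5
# If alt-carrying fragments map only 90% as often as ref-carrying ones,
# more than half of all pseudohaploid calls show the reference allele:
biased = pseudohaploid_ref_fraction(1.0, 0.9)     # ~0.526
```

Even a modest mapping-rate asymmetry therefore shifts allele frequency estimates at every polymorphic site, which is why the bias propagates into affinity and ancestry statistics.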

Author(s):
Adrien Oliva
Raymond Tobler
Alan Cooper
Bastien Llamas
Yassine Souilmi

Abstract
The current standard practice for assembling individual genomes involves mapping millions of short DNA sequences (also known as DNA 'reads') against a pre-constructed reference genome. Mapping vast amounts of short reads in a timely manner is a computationally challenging task that inevitably produces artefacts, including biases against alleles not found in the reference genome. This reference bias and other mapping artefacts are expected to be exacerbated in ancient DNA (aDNA) studies, which rely on the analysis of low quantities of damaged and very short DNA fragments (~30–80 bp). Nevertheless, the current gold-standard mapping strategies for aDNA studies have remained effectively unchanged for nearly a decade, during which time new software has emerged. In this study, we used simulated aDNA reads from three different human populations to benchmark the performance of 30 distinct mapping strategies implemented across four read-mapping programs (BWA-aln, BWA-mem, NovoAlign and Bowtie2) and quantified the impact of reference bias in downstream population genetic analyses. We show that specific NovoAlign, BWA-aln and BWA-mem parameterizations achieve high mapping precision with low levels of reference bias, particularly after filtering out reads with low mapping qualities. However, unbiased NovoAlign results required the use of an IUPAC reference genome. While relevant only to aDNA projects where reference population data are available, the benefit of using an IUPAC reference demonstrates the value of incorporating population genetic information into the aDNA mapping process, echoing recent results based on graph genome representations.
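The mapping-quality filter evaluated above can be sketched in a few lines (the read tuples and threshold below are hypothetical, not the paper's data): discarding low-MAPQ reads removes ambiguous placements, which in this toy pileup all happen to support the reference allele.

```python
def ref_allele_fraction(reads, min_mapq=30):
    """reads: iterable of (mapq, allele) tuples, allele "ref" or "alt".
    Returns the fraction of retained reads supporting the reference."""
    kept = [allele for mapq, allele in reads if mapq >= min_mapq]
    if not kept:
        return float("nan")
    return kept.count("ref") / len(kept)

# Hypothetical pileup at a heterozygous site: the two low-MAPQ
# (ambiguously placed) reads both carry the reference allele.
pileup = [(40, "ref"), (40, "alt"), (42, "ref"), (37, "alt"),
          (5, "ref"), (3, "ref")]
unfiltered = ref_allele_fraction(pileup, min_mapq=0)    # 4/6, ref-biased
filtered = ref_allele_fraction(pileup, min_mapq=30)     # 2/4, balanced
```

The same principle underlies the post-mapping filtering recommended in the study: allele balance at heterozygous sites moves back towards 50/50 once low-quality mappings are excluded.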


2019
Author(s):
Rui Martiniano
Erik Garrison
Eppie R. Jones
Andrea Manica
Richard Durbin

Abstract

Background
During the last decade, the analysis of ancient DNA (aDNA) sequence data has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Recently, alternative approaches for read mapping and genetic variation analysis have been developed that replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of the variation graph software vg to avoid reference bias for ancient DNA and compare our approach to existing methods.

Results
We used vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants, and compared these with the same data aligned with bwa to the human linear reference genome. We show that use of vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias, but can have lower sensitivity than vg, particularly for indels.

Conclusions
Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analysing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.
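The intuition behind graph alignment can be sketched with a toy ungapped aligner (a deliberate oversimplification of what vg does; sequences and thresholds are invented): against a linear reference, a read carrying the alternative allele pays a mismatch penalty and may be rejected outright, whereas a graph that also contains the alternative path accepts it at zero cost.

```python
def best_hit(read, haplotypes, max_mismatch=0):
    """Scan every offset of every haplotype path for the placement of
    `read` with the fewest mismatches; return (path_name, mismatches)
    for the best acceptable hit, or None if nothing passes."""
    best = None
    for name, hap in haplotypes.items():
        for i in range(len(hap) - len(read) + 1):
            mm = sum(a != b for a, b in zip(read, hap[i:i + len(read)]))
            if mm <= max_mismatch and (best is None or mm < best[1]):
                best = (name, mm)
    return best

read = "ACGCA"                      # read carrying the alternative allele C
linear = {"ref": "TTACGTATT"}       # linear reference: ref allele is T
graph = {"ref": "TTACGTATT",        # graph: both allele paths at the locus
         "alt": "TTACGCATT"}
```

With a strict mismatch budget the alt-carrying read fails against the linear reference but aligns perfectly to the graph's alternative path, which is the mechanism behind the balanced allelic representation reported above.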


2020
Vol 21 (1)
Author(s):
Rui Martiniano
Erik Garrison
Eppie R. Jones
Andrea Manica
Richard Durbin

Abstract

Background
During the last decade, the analysis of ancient DNA (aDNA) sequence data has become a powerful tool for the study of past human populations. However, the degraded nature of aDNA means that aDNA molecules are short and frequently mutated by post-mortem chemical modifications. These features decrease read mapping accuracy and increase reference bias, in which reads containing non-reference alleles are less likely to be mapped than those containing reference alleles. Alternative approaches have been developed to replace the linear reference with a variation graph which includes known alternative variants at each genetic locus. Here, we evaluate the use of the variation graph software vg to avoid reference bias for aDNA and compare with existing methods.

Results
We used vg to align simulated and real aDNA samples to a variation graph containing 1000 Genomes Project variants and compared with the same data aligned with bwa to the human linear reference genome. Using vg leads to a balanced allelic representation at polymorphic sites, effectively removing reference bias, and more sensitive variant detection in comparison with bwa, especially for insertions and deletions (indels). Alternative approaches that use relaxed bwa parameter settings or filter bwa alignments can also reduce bias but can have lower sensitivity than vg, particularly for indels.

Conclusions
Our findings demonstrate that aligning aDNA sequences to variation graphs effectively mitigates the impact of reference bias when analyzing aDNA, while retaining mapping sensitivity and allowing detection of variation, in particular indel variation, that was previously missed.


Author(s):
S. Rubinacci
D.M. Ribeiro
R. Hofmeister
O. Delaneau

Abstract
Low-coverage whole-genome sequencing followed by imputation has been proposed as a cost-effective genotyping approach for disease and population genetics studies. However, its competitiveness against SNP arrays is undermined because current imputation methods are computationally expensive and unable to leverage large reference panels.

Here, we describe a method, GLIMPSE, for phasing and imputation of low-coverage sequencing datasets from modern reference panels. We demonstrate its remarkable performance across different coverages and human populations. It achieves imputation of a full genome for less than $1, outperforming existing methods by orders of magnitude, with an increase in accuracy of more than 20% at rare variants. We also show that 1x coverage enables effective association studies and is better suited than dense SNP arrays to assess the impact of rare variants. Overall, this study demonstrates the promising potential of low-coverage imputation and suggests a paradigm shift in the design of future genomic studies.
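Low-coverage imputation pipelines of this kind typically consume genotype likelihoods rather than hard calls. A minimal sketch of how such likelihoods arise from read counts (assuming independent reads and a symmetric per-base error rate; this is an illustration, not GLIMPSE's exact model):

```python
import math

def genotype_log10_likelihoods(n_ref, n_alt, err=0.01):
    """Log10 likelihoods of the three diploid genotypes given counts of
    reads supporting the reference and alternative alleles."""
    def loglik(p_alt):  # p_alt: probability a read shows the alt allele
        return n_ref * math.log10(1 - p_alt) + n_alt * math.log10(p_alt)
    return {"RR": loglik(err),      # hom ref: alt reads are errors
            "RA": loglik(0.5),      # het: either allele equally likely
            "AA": loglik(1 - err)}  # hom alt: ref reads are errors

# At ~1x coverage a site often has a single read, so the likelihoods
# barely separate RA from AA; the reference panel resolves the ambiguity.
gl = genotype_log10_likelihoods(n_ref=0, n_alt=1)
```

The near-tie between RA and AA for a single alt read is exactly the uncertainty that haplotype information from a large panel collapses, which is why imputation makes 1x data competitive with arrays.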


2020
Author(s):
Brendan N. Reid
Rachel L. Moran
Christopher J. Kopack
Sarah W. Fitzpatrick

Abstract
Researchers studying non-model organisms have an increasing number of methods available for generating genomic data. However, the applicability of different methods across species, as well as the effect of reference genome choice on population genomic inference, are still difficult to predict in many cases. We evaluated the impact of data type (whole-genome vs. reduced-representation) and reference genome choice on data quality and on population genomic and phylogenomic inference across several species of darters (subfamily Etheostomatinae), a highly diverse radiation of freshwater fish. We generated a high-quality reference genome and developed a hybrid RADseq/sequence-capture (Rapture) protocol for the Arkansas darter (Etheostoma cragini). Rapture data from 1,900 individuals spanning four darter species showed recovery of most loci across species at high depth and consistent estimates of heterozygosity regardless of reference genome choice. Loci with baits spanning both sides of the restriction enzyme cut site performed especially well across species. For low-coverage whole-genome data, choice of reference genome affected read depth and inferred heterozygosity. For similar amounts of sequence data, Rapture performed better than whole-genome sequencing at identifying fine-scale genetic structure. Rapture loci also recovered an accurate phylogeny for the study species and demonstrated high phylogenetic informativeness across the evolutionary history of the genus Etheostoma. Low cost and high cross-species effectiveness regardless of reference genome suggest that Rapture and similar sequence capture methods are worthwhile choices for studies of diverse species radiations.


2019
Author(s):
Polina Yu. Novikova
Ian G. Brennan
William Booker
Michael Mahony
Paul Doughty
...

Polyploidy has played an important role in evolution across the tree of life, but it is still unclear how polyploid lineages may persist after their initial formation. While both common and well studied in plants, polyploidy is rare in animals and generally less well understood. The Australian burrowing frog genus Neobatrachus is comprised of six diploid and three polyploid species and offers a powerful animal polyploid model system. We generated exome-capture sequence data from 87 individuals representing all nine species of Neobatrachus to investigate species-level relationships, the origin and inheritance mode of polyploid species, and the population genomic effects of polyploidy on genus-wide demography. We resolve the phylogenetic relationships among Neobatrachus species and provide further support that the three polyploid species have independent autotetraploid origins. We document higher genetic diversity in tetraploids, resulting from widespread gene flow specifically between the tetraploids, asymmetric inter-ploidy gene flow directed from sympatric diploids to tetraploids, and current isolation of diploid species from each other. We also constructed models of ecologically suitable areas for each species to investigate the impact of climate variation on frogs with differing ploidy levels. These models suggest substantial change in suitable areas compared to past climate, which in turn corresponds to population genomic estimates of demographic histories. We propose that Neobatrachus diploids may be suffering the early genomic impacts of climate-induced habitat loss, while tetraploids appear to be avoiding this fate, possibly due to widespread gene flow into tetraploid lineages specifically. Finally, we demonstrate that Neobatrachus is an attractive model to study the effects of ploidy on the evolution of adaptation in animals.


2017
Author(s):
Tristan Seecharran
Laura Kalin-Mänttäri
Katja A. Koskela
Simo Nikkari
Benjamin Dickins
...

Abstract
Yersinia pseudotuberculosis is a Gram-negative intestinal pathogen of humans and has been responsible for several nationwide gastrointestinal outbreaks. Large-scale population genomic studies have been performed on the other human-pathogenic Yersinia species, Y. pestis and Y. enterocolitica, allowing a high-resolution understanding of the ecology, evolution and dissemination of these pathogens. However, to date no large-scale global population genomic analysis of Y. pseudotuberculosis has been performed. Here we present analyses of the genomes of 134 strains of Y. pseudotuberculosis isolated from around the world, from multiple ecosystems, since the 1960s. Our data reveal a phylogeographic split within the population, with an Asian ancestry and subsequent dispersal of successful clonal lineages into Europe and the rest of the world. These lineages can be differentiated by CRISPR cluster arrays, and we show that they are limited with respect to inter-lineage genetic exchange. This restriction of genetic exchange maintains the discrete lineage structure in the population despite the co-existence of lineages for thousands of years in multiple countries. Our data highlight how CRISPR arrays can be informative of the evolutionary trajectory of bacterial lineages, and merit further study across bacteria.


2018
Author(s):
Roger Ros-Freixedes
Mara Battagin
Martin Johnsson
Gregor Gorjanc
Alan J Mileham
...

Abstract

Background
Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data, because low-coverage data carries scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing, there is a need to understand the impact of these errors and biases on the resulting genotype calls.

Results
We used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by fewer reads than a predefined threshold, a default and desirable step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced the best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points.

Conclusions
We propose a simple pipeline to correct this bias and recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.
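The pruning artefact can be illustrated with a toy best-guess caller (hypothetical code, not the authors' pipeline): reads supporting the alternative allele are discarded when their count falls below a support threshold, while reference support is kept regardless, so a true heterozygote covered by one read per allele collapses to a homozygous-reference call.

```python
def best_guess_call(n_ref, n_alt, min_alt_support=2):
    """Naive genotype call after pruning weakly supported alternative
    alleles, mimicking an error-removal step tuned for high coverage."""
    if n_alt < min_alt_support:
        n_alt = 0                   # alt reads discarded as presumed errors
    if n_ref and n_alt:
        return "het"
    if n_alt:
        return "hom_alt"
    return "hom_ref" if n_ref else "no_call"

# At ~2x coverage a heterozygous site often has one read per allele:
pruned = best_guess_call(1, 1)                      # "hom_ref" (biased)
unpruned = best_guess_call(1, 1, min_alt_support=1) # "het" (correct)
```

At 30x the same threshold is harmless, because a real heterozygote is almost never supported by fewer than two alt reads; the bias only emerges when the tool is reused on 2x data.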


2018
Author(s):
Cesare de Filippo
Matthias Meyer
Kay Prüfer

Abstract
The study of ancient DNA is hampered by degradation, which results in short DNA fragments. Advances in laboratory methods have made it possible to retrieve these short DNA fragments, thereby improving access to DNA preserved in highly degraded, ancient material. However, such material contains large amounts of microbial contamination in addition to DNA fragments from the ancient organism. The resulting mixture of sequences constitutes a challenge for computational analysis, since microbial sequences are hard to distinguish from the ancient sequences of interest, especially when they are short. Here, we develop a method to quantify spurious alignments based on the presence or absence of rare variants. We find that spurious alignments are enriched for mismatches and insertion/deletion differences and lack the substitution patterns typical of ancient DNA. The impact of spurious alignments can be reduced by filtering on these features and by imposing a sample-specific minimum length cutoff. We apply this approach to sequences from the ~430,000-year-old Sima de los Huesos hominin remains, which contain particularly short DNA fragments, and increase the amount of usable sequence data by 17–150%. This allows us to place a third specimen from the site on the Neandertal lineage. Our method maximizes the sequence data amenable to genetic analysis from highly degraded ancient material and avoids pitfalls associated with the analysis of ultra-short DNA sequences.
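A simplified version of such a post-mapping filter might look as follows (the thresholds are illustrative defaults, not the paper's calibrated, sample-specific values): drop alignments that are ultra-short, contain indels, or exceed a mismatch-rate ceiling, since those features are enriched in spurious microbial hits.

```python
def keep_alignment(length, mismatches, indels,
                   min_length=35, max_mismatch_rate=0.06, max_indels=0):
    """Return True for alignments passing a heuristic spurious-hit filter."""
    if length < min_length:
        return False                # sample-specific minimum length cutoff
    if indels > max_indels:
        return False                # spurious alignments are indel-enriched
    return mismatches / length <= max_mismatch_rate  # mismatch-rate ceiling
```

In practice the length cutoff would be tuned per sample, as the abstract notes, since the trade-off between recovered data and contamination depends on each sample's fragment length distribution.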

