scholarly journals Distinguishing multiple-merger from Kingman coalescence using two-site frequency spectra

2018 ◽  
Author(s):  
Daniel P. Rice ◽  
John Novembre ◽  
Michael M. Desai

AbstractDemographic inference methods in population genetics typically assume that the ancestry of a sample can be modeled by the Kingman coalescent. A defining feature of this stochastic process is that it generates genealogies that are binary trees: no more than two ancestral lineages may coalesce at the same time. However, this assumption breaks down under several scenarios. For example, pervasive natural selection and extreme variation in offspring number can both generate genealogies with “multiple-merger” events in which more than two lineages coalesce instantaneously. Therefore, detecting multiple mergers is important both for understanding which forces have shaped the diversity of a population and for avoiding fitting misspecified models to data. Current methods to detect multiple mergers in genomic data rely on the site frequency spectrum (SFS). However, the signatures of multiple mergers in the SFS are also consistent with a Kingman coalescent with a time-varying population size. Here, we present a new method for detecting multiple mergers based on the pointwise mutual information of the two-site frequency spectrum for pairs of linked sites. Unlike the SFS, the pointwise mutual information depends mostly on the topologies of genealogies rather than on their branch lengths and is therefore largely insensitive to population size change. This statistic is global in the sense that it can detect when the genome-wide genetic diversity is inconsistent with the Kingman coalescent, rather than detecting outlier regions, as in selection scan methods. Finally, we demonstrate a graphical model-checking procedure based on the point-wise mutual information using genomic diversity data from Drosophila melanogaster.

2017 ◽  
Author(s):  
Berit Lindum Waltoft ◽  
Asger Hobolth

AbstractThe variability in population size is a key quantity for understanding the evolutionary history of a species. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from the site frequency spectrum. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the variability in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied on data from nine different human populations.


2019 ◽  
Vol 36 (12) ◽  
pp. 2906-2921 ◽  
Author(s):  
Austin H Patton ◽  
Mark J Margres ◽  
Amanda R Stahlke ◽  
Sarah Hendricks ◽  
Kevin Lewallen ◽  
...  

Abstract Reconstructing species’ demographic histories is a central focus of molecular ecology and evolution. Recently, an expanding suite of methods leveraging either the sequentially Markovian coalescent (SMC) or the site-frequency spectrum has been developed to reconstruct population size histories from genomic sequence data. However, few studies have investigated the robustness of these methods to genome assemblies of varying quality. In this study, we first present an improved genome assembly for the Tasmanian devil using the Chicago library method. Compared with the original reference genome, our new assembly reduces the number of scaffolds (from 35,975 to 10,010) and increases the scaffold N90 (from 0.101 to 2.164 Mb). Second, we assess the performance of four contemporary genomic methods for inferring population size history (PSMC, MSMC, SMC++, Stairway Plot), using the two devil genome assemblies as well as simulated, artificially fragmented genomes that approximate the hypothesized demographic history of Tasmanian devils. We demonstrate that each method is robust to assembly quality, producing similar estimates of Ne when simulated genomes were fragmented into up to 5,000 scaffolds. Overall, methods reliant on the SMC are most reliable between ∼300 generations before present (gbp) and 100 kgbp, whereas methods exclusively reliant on the site-frequency spectrum are most reliable between the present and 30 gbp. Our results suggest that when used in concert, genomic methods for reconstructing species’ effective population size histories 1) can be applied to nonmodel organisms without highly contiguous reference genomes, and 2) are capable of detecting independently documented effects of historical geological events.


Author(s):  
Berit Lindum Waltoft ◽  
Asger Hobolth

Abstract Changes in population size is a useful quantity for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied on unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.


2017 ◽  
Author(s):  
Ivana Cvijović ◽  
Benjamin H. Good ◽  
Michael M. Desai

Purifying selection reduces genetic diversity, both at sites under direct selection and at linked neutral sites. This process, known as background selection, is thought to play an important role in shaping genomic diversity in natural populations. Yet despite its importance, the effects of background selection are not fully understood. Previous theoretical analyses of this process have taken a backwards-time approach based on the structured coalescent. While they provide some insight, these methods are either limited to very small samples or are computationally prohibitive. Here, we present a new forward-time analysis of the trajectories of both neutral and deleterious mutations at a nonrecombining locus. We find that strong purifying selection leads to remarkably rich dynamics: neutral mutations can exhibit sweep-like behavior, and deleterious mutations can reach substantial frequencies even when they are guaranteed to eventually go extinct. Our analysis of these dynamics allows us to calculate analytical expressions for the full site frequency spectrum. We find that whenever background selection is strong enough to lead to a reduction in genetic diversity, it also results in substantial distortions to the site frequency spectrum, which can mimic the effects of population expansions or positive selection. Because these distortions are most pronounced in the low and high frequency ends of the spectrum, they become particularly important in larger samples, but may have small effects in smaller samples. We also apply our forward-time framework to calculate other quantities, such as the ultimate fates of polymorphisms or the fitnesses of their ancestral backgrounds.


2015 ◽  
Author(s):  
Jochen Blath ◽  
Mathias C Cronjager ◽  
Bjarki Eldon ◽  
Matthias Hammer

We give recursions for the expected site-frequency spectrum associated with so-calledXi-coalescents, that is exchangeable coalescents which admitsimultaneous multiple mergersof ancestral lineages. Xi-coalescents arise, for example, in association with population models of skewed offspring distributions with diploidy, recurrent advantageous mutations, or strong bottlenecks. In contrast, the simplerLambda-coalescentsadmit multiple mergers of lineages, but at most one such merger each time. Xi-coalescents, as well as Lambda-coalescents, can predict an excess of singletons, compared to the Kingman coalescent. We compare estimates of coalescent parameters when Xi-coalescents are applied to data generated by Lambda-coalescents, and vice versa. In general, Xi-coalescents predict fewer singletons than corresponding Lambda-coalescents, but a higher count of mutations of size larger than singletons. We fit examples of Xi-coalescents to unfolded site-frequency spectra obtained for autosomal loci of the diploid Atlantic cod, and obtain different coalescent parameter estimates than obtained with corresponding Lambda-coalescents. Our results provide new inference tools, and suggest that for autosomal population genetic data from diploid or polyploid highly fecund populations who may have skewed offspring distributions, one should not apply Lambda-coalescents, but Xi-coalescents.


2018 ◽  
Author(s):  
Andrew Melfi ◽  
Divakar Viswanath

AbstractThe diversity in genomes is due to the accumulation of mutations and the site frequency spectrum (SFS) is a popular statistic for summarizing genomic data. The current coalescent algorithm for calculating the SFS for a given demography assumes the μ → 0 limit, where μ is the mutation probability (or rate) per base pair per generation. The algorithm is applicable when μN, N being the haploid population size, is negligible. We derive a coalescent based algorithm for calculating the SFS that allows the mutation rate μ(t) as well as the population size N(t) to vary arbitrarily as a function of time. That algorithm shows that the probability of two mutations in the genealogy becomes noticeable already for μ = 10-8 for samples of n = 105 haploid human genomes and increases rapidly with μ. Our algorithm calculates the SFS under the assumption of a single mutation in the genealogy, and the part of the SFS due to a single mutation depends only mildly on the finiteness of μ. However, the dependence of the SFS on variation in μ can be substantial for even n = 100 samples. In addition, increasing and decreasing mutation rates alter the SFS in different ways and to different extents.


Genetics ◽  
1995 ◽  
Vol 140 (2) ◽  
pp. 783-796 ◽  
Author(s):  
J M Braverman ◽  
R R Hudson ◽  
N L Kaplan ◽  
C H Langley ◽  
W Stephan

Abstract The level of DNA sequence variation is reduced in regions of the Drosophila melanogaster genome where the rate of crossing over per physical distance is also reduced. This observation has been interpreted as support for the simple model of genetic hitchhiking, in which directional selection on rare variants, e.g., newly arising advantageous mutants, sweeps linked neutral alleles to fixation, thus eliminating polymorphisms near the selected site. However, the frequency spectra of segregating sites of several loci from some populations exhibiting reduced levels of nucleotide diversity and reduced numbers of segregating sites did not appear different from what would be expected under a neutral equilibrium model. Specifically, a skew toward an excess of rare sites was not observed in these samples, as measured by Tajima's D. Because this skew was predicted by a simple hitchhiking model, yet it had never been expressed quantitatively and compared directly to DNA polymorphism data, this paper investigates the hitchhiking effect on the site frequency spectrum, as measured by Tajima's D and several other statistics, using a computer simulation model based on the coalescent process and recurrent hitchhiking events. The results presented here demonstrate that under the simple hitchhiking model (1) the expected value of Tajima's D is large and negative (indicating a skew toward rare variants), (2) that Tajima's test has reasonable power to detect a skew in the frequency spectrum for parameters comparable to those from actual data sets, and (3) that the Tajima's Ds observed in several data sets are very unlikely to have been the result of simple hitchhiking. Consequently, the simple hitchhiking model is not a sufficient explanation for the DNA polymorphism at those loci exhibiting a decreased number of segregating sites yet not exhibiting a skew in the frequency spectrum.


2019 ◽  
Vol 286 (1896) ◽  
pp. 20182541 ◽  
Author(s):  
Shigeki Nakagome ◽  
Richard R. Hudson ◽  
Anna Di Rienzo

A fundamental question about adaptation in a population is the time of onset of the selective pressure acting on beneficial alleles. Inferring this time, in turn, depends on the selection model. We develop a framework of approximate Bayesian computation (ABC) that enables the use of the full site frequency spectrum and haplotype structure to test the goodness-of-fit of selection models and estimate the timing of selection under varying population size scenarios. We show that our method has sufficient power to distinguish natural selection from neutrality even if relatively old selection increased the frequency of a pre-existing allele from 20% to 50% or from 40% to 80%. Our ABC can accurately estimate the time of onset of selection on a new mutation. However, estimates are prone to bias under the standing variation model, possibly due to the uncertainty in the allele frequency at the onset of selection. We further extend our approach to take advantage of ancient DNA data that provides information on the allele frequency path of the beneficial allele. Applying our ABC, including both modern and ancient human DNA data, to four pigmentation alleles in Europeans, we detected selection on standing variants that occurred after the dispersal from Africa even though models of selection on a new mutation were initially supported for two of these alleles without the ancient data.


2018 ◽  
Author(s):  
Andrew Melfi ◽  
Divakar Viswanath

AbstractThe first terms of the Wright-Fisher (WF) site frequency spectrum that follow the coalescent approximation are determined precisely, with a view to understanding the accuracy of the coalescent approximation for large samples. The perturbing terms show that the probability of a single mutant in the sample (singleton probability) is elevated in WF but the rest of the frequency spectrum is lowered. A part of the perturbation can be attributed to a mismatch in rates of merger between WF and the coalescent. The rest of it can be attributed to the difference in the way WF and the coalescent partition children between parents. In particular, the number of children of a parent is approximately Poisson under WF and approximately geometric under the coalescent. Whereas the mismatch in rates raises the probability of singletons under WF, its offspring distribution being approximately Poisson lowers it. The two effects are of opposite sense everywhere except at the tail of the frequency spectrum. The WF frequency spectrum begins to depart from that of the coalescent only for sample sizes that are comparable to the population size. These conclusions are confirmed by a separate analysis that assumes the sample size n to be equal to the population size N. Partly thanks to the canceling effects, the total variation distance of WF minus coalescent is 0.12/ log N for a population sized sample with n = N, which is only 1% for N = 2 × 104.


Sign in / Sign up

Export Citation Format

Share Document