The Wright-Fisher Site Frequency Spectrum as a Perturbation of the Coalescent’s

Mapping Intimacies ◽

10.1101/332817 ◽

2018 ◽

Cited By ~ 1

Author(s):

Andrew Melfi ◽

Divakar Viswanath

Keyword(s):

Sample Size ◽

Population Size ◽

Total Variation ◽

Frequency Spectrum ◽

Separate Analysis ◽

Number Of Children ◽

Large Samples ◽

Variation Distance ◽

Site Frequency Spectrum ◽

The Difference

Abstract The first terms of the Wright-Fisher (WF) site frequency spectrum that follow the coalescent approximation are determined precisely, with a view to understanding the accuracy of the coalescent approximation for large samples. The perturbing terms show that the probability of a single mutant in the sample (singleton probability) is elevated in WF but the rest of the frequency spectrum is lowered. A part of the perturbation can be attributed to a mismatch in rates of merger between WF and the coalescent. The rest of it can be attributed to the difference in the way WF and the coalescent partition children between parents. In particular, the number of children of a parent is approximately Poisson under WF and approximately geometric under the coalescent. Whereas the mismatch in rates raises the probability of singletons under WF, its offspring distribution being approximately Poisson lowers it. The two effects are of opposite sense everywhere except at the tail of the frequency spectrum. The WF frequency spectrum begins to depart from that of the coalescent only for sample sizes that are comparable to the population size. These conclusions are confirmed by a separate analysis that assumes the sample size n to be equal to the population size N. Partly thanks to the canceling effects, the total variation distance between WF and the coalescent is 0.12/log N for a population-sized sample with n = N, which is only 1% for N = 2 × 10^4.
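
The closing figure of this abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration, not the authors' code: the function name is hypothetical, and it assumes that "log" means the natural logarithm, which is the reading consistent with the quoted value of about 1% at N = 2 × 10^4.

```python
import math


def tv_distance_estimate(population_size: float, coefficient: float = 0.12) -> float:
    """Evaluate coefficient / ln(N), the total variation estimate quoted for n = N.

    Assumption: 'log' in the abstract is the natural logarithm.
    """
    return coefficient / math.log(population_size)


# Value for the population size used in the abstract, N = 2 x 10^4.
estimate = tv_distance_estimate(2e4)
print(f"0.12 / ln(2e4) = {estimate:.4f}")
```

With a base-10 logarithm the same expression would give roughly 2.8%, so the natural-log reading appears to be the intended one.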


Non-parametric estimation of population size changes from the site frequency spectrum

10.1101/125351 ◽

2017 ◽

Author(s):

Berit Lindum Waltoft ◽

Asger Hobolth

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Weighted Average ◽

Cubic Spline ◽

Parametric Estimation ◽

New Method ◽

Eigenvalue Decomposition ◽

Human Populations ◽

Site Frequency Spectrum

Abstract The variability in population size is a key quantity for understanding the evolutionary history of a species. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from the site frequency spectrum. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the variability in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to data from nine different human populations.


Contemporary Demographic Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in Tasmanian Devils

Molecular Biology and Evolution ◽

10.1093/molbev/msz191 ◽

2019 ◽

Vol 36 (12) ◽

pp. 2906-2921 ◽

Cited By ~ 20

Author(s):

Austin H Patton ◽

Mark J Margres ◽

Amanda R Stahlke ◽

Sarah Hendricks ◽

Kevin Lewallen ◽

...

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Genome Assembly ◽

Genomic Sequence ◽

Demographic History ◽

Tasmanian Devil ◽

Assembly Quality ◽

Reconstruction Methods ◽

Site Frequency Spectrum ◽

Genome Assemblies

Abstract Reconstructing species’ demographic histories is a central focus of molecular ecology and evolution. Recently, an expanding suite of methods leveraging either the sequentially Markovian coalescent (SMC) or the site-frequency spectrum has been developed to reconstruct population size histories from genomic sequence data. However, few studies have investigated the robustness of these methods to genome assemblies of varying quality. In this study, we first present an improved genome assembly for the Tasmanian devil using the Chicago library method. Compared with the original reference genome, our new assembly reduces the number of scaffolds (from 35,975 to 10,010) and increases the scaffold N90 (from 0.101 to 2.164 Mb). Second, we assess the performance of four contemporary genomic methods for inferring population size history (PSMC, MSMC, SMC++, Stairway Plot), using the two devil genome assemblies as well as simulated, artificially fragmented genomes that approximate the hypothesized demographic history of Tasmanian devils. We demonstrate that each method is robust to assembly quality, producing similar estimates of Ne when simulated genomes were fragmented into up to 5,000 scaffolds. Overall, methods reliant on the SMC are most reliable between ∼300 generations before present (gbp) and 100 kgbp, whereas methods exclusively reliant on the site-frequency spectrum are most reliable between the present and 30 gbp. Our results suggest that when used in concert, genomic methods for reconstructing species’ effective population size histories 1) can be applied to nonmodel organisms without highly contiguous reference genomes, and 2) are capable of detecting independently documented effects of historical geological events.


The Validity of the Coalescent Approximation for Large Samples

10.1101/170928 ◽

2017 ◽

Author(s):

Andrew Melfi ◽

Divakar Viswanath

Keyword(s):

Sample Size ◽

Population Size ◽

Exponential Growth ◽

Asymptotic Theory ◽

Triple Collision ◽

Demographic Models ◽

Large Samples ◽

Population Sizes ◽

Sample Frequency ◽

Haploid Population

Abstract The Kingman coalescent, widely used in genetics, is known to be a good approximation when the sample size is small relative to the population size. In this article, we investigate how large the sample size can get without violating the coalescent approximation. If the haploid population size is 2N, we prove that for samples of size N^(1/3−ϵ), ϵ > 0, coalescence under the Wright-Fisher (WF) model converges in probability to the Kingman coalescent in the limit of large N. For samples of size N^(2/5−ϵ) or smaller, the WF coalescent converges to a mixture of the Kingman coalescent and what we call the mod-2 coalescent. For samples of size N^(1/2) or larger, triple collisions in the WF genealogy of the sample become important. The sample size for which the probability of conformance with the Kingman coalescent is 95% is found to be 1.47 × N^0.31 for N ∈ [10^3, 10^5], showing the pertinence of the asymptotic theory. The probability of no triple collisions is found to be 95% for sample sizes equal to 0.92 × N^0.49, which too is in accord with the asymptotic theory. Varying population sizes are handled using algorithms that calculate the probability of WF coalescence agreeing with the Kingman model or taking place without triple collisions. For a sample of size 100, the probabilities of coalescence according to the Kingman model are 2%, 0%, 1%, and 0% in four models of human population with constant N, constant N except for two bottlenecks, recent exponential growth, and increasing recent exponential growth, respectively. For the same four demographic models and the same sample size, the probabilities of coalescence with no triple collision are 92%, 73%, 88%, and 87%, respectively. Visualizations of the algorithm show that even distant bottlenecks can impede agreement between the coalescent and the WF model. Finally, we prove that the WF sample frequency spectrum for samples of size N^(1/3−ϵ) or smaller converges to the classical answer for the coalescent.
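
The two empirical fits quoted in this abstract are easy to evaluate numerically. The following sketch simply plugs a value of N into 1.47 × N^0.31 and 0.92 × N^0.49; the function names are hypothetical, and N is taken in the abstract's convention (haploid population size 2N). The abstract states the range N ∈ [10^3, 10^5] for the first fit only, so applying the second fit over the same range is an assumption here.

```python
def kingman_95_sample_size(N: float) -> float:
    """Sample size at which conformance with the Kingman coalescent is 95%,
    using the fit 1.47 * N**0.31 quoted for N in [1e3, 1e5]."""
    return 1.47 * N ** 0.31


def no_triple_collision_95_sample_size(N: float) -> float:
    """Sample size at which the probability of no triple collisions is 95%,
    using the quoted fit 0.92 * N**0.49."""
    return 0.92 * N ** 0.49


for N in (1e3, 1e4, 1e5):
    print(
        f"N = {N:.0e}: Kingman-95% n ~ {kingman_95_sample_size(N):.1f}, "
        f"no-triple-collision-95% n ~ {no_triple_collision_95_sample_size(N):.1f}"
    )
```

At N = 10^4 the fits give sample sizes of roughly 26 and 84, which suggests how quickly a sample of size 100 moves outside the strict Kingman regime.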


Accuracy of Approximation for Discrete Distributions

Journal of Probability and Statistics ◽

10.1155/2016/6212567 ◽

2016 ◽

Vol 2016 ◽

pp. 1-6

Author(s):

Tamás F. Móri

Keyword(s):

Total Variation ◽

Generating Functions ◽

Probability Distributions ◽

Discrete Distributions ◽

Discrete Probability ◽

Order Of Magnitude ◽

Variation Distance ◽

The Difference ◽

Discrete Probability Distributions ◽

Accuracy Of Approximation

The paper is a contribution to the problem of estimating the deviation of two discrete probability distributions in terms of the supremum distance between their generating functions over the interval [0,1]. Deviation can be measured by the difference of the kth terms or by total variation distance. Our new bounds have better order of magnitude than those proved previously, and they are even sharp in certain cases.
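
The two quantities this abstract compares can be computed directly for small examples. The sketch below is an illustration only, under stated assumptions: total variation distance is taken as half the L1 distance between the probability vectors (the paper may use a different normalization), the supremum over [0,1] is approximated on a finite grid, and all function names are hypothetical.

```python
def total_variation(p, q):
    """Total variation distance, taken here as half the L1 distance."""
    size = max(len(p), len(q))
    p = list(p) + [0.0] * (size - len(p))  # pad the shorter vector with zeros
    q = list(q) + [0.0] * (size - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def generating_function(p, z):
    """Probability generating function sum_k p[k] * z**k."""
    return sum(pk * z ** k for k, pk in enumerate(p))


def sup_generating_function_distance(p, q, grid_points=1001):
    """Grid approximation of sup over z in [0, 1] of |G_p(z) - G_q(z)|."""
    zs = [i / (grid_points - 1) for i in range(grid_points)]
    return max(abs(generating_function(p, z) - generating_function(q, z)) for z in zs)


# Bernoulli(1/2) versus Binomial(2, 1/2), both supported on {0, 1, 2}.
p = [0.5, 0.5]
q = [0.25, 0.5, 0.25]
print(total_variation(p, q))
print(sup_generating_function_distance(p, q))
```

For this pair the difference of generating functions is 0.25 − 0.25z², so its supremum on [0,1] is attained at z = 0.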


Distinguishing multiple-merger from Kingman coalescence using two-site frequency spectra

10.1101/461517 ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel P. Rice ◽

John Novembre ◽

Michael M. Desai

Keyword(s):

Mutual Information ◽

Population Size ◽

Frequency Spectrum ◽

Graphical Model ◽

Genomic Diversity ◽

Size Change ◽

Frequency Spectra ◽

Offspring Number ◽

Site Frequency Spectrum ◽

Pointwise Mutual Information

Abstract Demographic inference methods in population genetics typically assume that the ancestry of a sample can be modeled by the Kingman coalescent. A defining feature of this stochastic process is that it generates genealogies that are binary trees: no more than two ancestral lineages may coalesce at the same time. However, this assumption breaks down under several scenarios. For example, pervasive natural selection and extreme variation in offspring number can both generate genealogies with “multiple-merger” events in which more than two lineages coalesce instantaneously. Therefore, detecting multiple mergers is important both for understanding which forces have shaped the diversity of a population and for avoiding fitting misspecified models to data. Current methods to detect multiple mergers in genomic data rely on the site frequency spectrum (SFS). However, the signatures of multiple mergers in the SFS are also consistent with a Kingman coalescent with a time-varying population size. Here, we present a new method for detecting multiple mergers based on the pointwise mutual information of the two-site frequency spectrum for pairs of linked sites. Unlike the SFS, the pointwise mutual information depends mostly on the topologies of genealogies rather than on their branch lengths and is therefore largely insensitive to population size change. This statistic is global in the sense that it can detect when the genome-wide genetic diversity is inconsistent with the Kingman coalescent, rather than detecting outlier regions, as in selection scan methods. Finally, we demonstrate a graphical model-checking procedure based on the pointwise mutual information using genomic diversity data from Drosophila melanogaster.
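
For readers unfamiliar with pointwise mutual information, the generic definition is PMI(i, j) = log[P(i, j) / (P(i) P(j))]. The sketch below applies that textbook definition to a small joint frequency table; it is not the authors' estimator, the function name is hypothetical, and the toy table stands in for a two-site frequency spectrum only as an assumption for illustration.

```python
import math


def pointwise_mutual_information(joint):
    """Return the matrix log(P(i,j) / (P(i) * P(j))) for a joint table.

    The table is normalized to sum to 1 first; cells with zero
    probability are reported as -inf.
    """
    total = sum(sum(row) for row in joint)
    probs = [[cell / total for cell in row] for row in joint]
    row_marginals = [sum(row) for row in probs]
    col_marginals = [sum(col) for col in zip(*probs)]
    pmi = []
    for i, row in enumerate(probs):
        pmi_row = []
        for j, p_ij in enumerate(row):
            if p_ij > 0:
                pmi_row.append(math.log(p_ij / (row_marginals[i] * col_marginals[j])))
            else:
                pmi_row.append(float("-inf"))
        pmi.append(pmi_row)
    return pmi


# Toy joint distribution in which matching categories co-occur more often
# than independence would predict.
toy_joint = [[0.4, 0.1],
             [0.1, 0.4]]
pmi = pointwise_mutual_information(toy_joint)
print(pmi)
```

Positive entries mark pairs that co-occur more often than under independence, negative entries pairs that co-occur less often.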


Non-parametric estimation of population size changes from the site frequency spectrum

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0061 ◽

2018 ◽

Vol 17 (3) ◽

Cited By ~ 6

Author(s):

Berit Lindum Waltoft ◽

Asger Hobolth

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Weighted Average ◽

Cubic Spline ◽

Parametric Estimation ◽

New Method ◽

Eigenvalue Decomposition ◽

Human Populations ◽

Site Frequency Spectrum

Abstract Changes in population size are a useful quantity for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
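
The definition of the SFS given in this abstract translates directly into code. The sketch below is a minimal illustration, not part of CubSFS: it assumes the sample is given as n sequences of 0/1 values (1 = mutant base, 0 = ancestral base), and the function names and toy data are hypothetical. The folding step uses the common convention of counting the minor allele, which is an assumption about what "folded" means here.

```python
def site_frequency_spectrum(haplotypes):
    """Unfolded SFS: entry i (1-based) counts the sites where the mutant
    base appears i times among the n sampled sequences."""
    n = len(haplotypes)
    sfs = [0] * (n - 1)
    for site in zip(*haplotypes):       # iterate over columns (sites)
        mutant_count = sum(site)
        if 0 < mutant_count < n:        # skip sites that are not variable
            sfs[mutant_count - 1] += 1
    return sfs


def fold(sfs):
    """Folded SFS: merge entry i with entry n - i (minor-allele count)."""
    n = len(sfs) + 1
    folded = [0] * (n // 2)
    for i, count in enumerate(sfs, start=1):
        folded[min(i, n - i) - 1] += count
    return folded


# Toy sample: n = 4 sequences, 5 sites.
toy_haplotypes = [
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
]
unfolded = site_frequency_spectrum(toy_haplotypes)
print(unfolded)        # [2, 1, 1]
print(fold(unfolded))  # [3, 1]
```

The second site is monomorphic in the toy sample and therefore contributes to no entry of the spectrum.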


The Site Frequency Spectrum under Finite and Time-Varying Mutation Rates

10.1101/375907 ◽

2018 ◽

Author(s):

Andrew Melfi ◽

Divakar Viswanath

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Genomic Data ◽

Mutation Rates ◽

Single Mutation ◽

Time Varying ◽

Mutation Probability ◽

Human Genomes ◽

Haploid Population ◽

Site Frequency Spectrum

Abstract The diversity in genomes is due to the accumulation of mutations and the site frequency spectrum (SFS) is a popular statistic for summarizing genomic data. The current coalescent algorithm for calculating the SFS for a given demography assumes the μ → 0 limit, where μ is the mutation probability (or rate) per base pair per generation. The algorithm is applicable when μN, N being the haploid population size, is negligible. We derive a coalescent based algorithm for calculating the SFS that allows the mutation rate μ(t) as well as the population size N(t) to vary arbitrarily as a function of time. That algorithm shows that the probability of two mutations in the genealogy becomes noticeable already for μ = 10^-8 for samples of n = 10^5 haploid human genomes and increases rapidly with μ. Our algorithm calculates the SFS under the assumption of a single mutation in the genealogy, and the part of the SFS due to a single mutation depends only mildly on the finiteness of μ. However, the dependence of the SFS on variation in μ can be substantial for even n = 100 samples. In addition, increasing and decreasing mutation rates alter the SFS in different ways and to different extents.


Inferring the model and onset of natural selection under varying population size from the site frequency spectrum and haplotype structure

Proceedings of The Royal Society B Biological Sciences ◽

10.1098/rspb.2018.2541 ◽

2019 ◽

Vol 286 (1896) ◽

pp. 20182541 ◽

Cited By ~ 3

Author(s):

Shigeki Nakagome ◽

Richard R. Hudson ◽

Anna Di Rienzo

Keyword(s):

Natural Selection ◽

Population Size ◽

Allele Frequency ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Haplotype Structure ◽

New Mutation ◽

Standing Variation ◽

Site Frequency Spectrum ◽

Varying Population

A fundamental question about adaptation in a population is the time of onset of the selective pressure acting on beneficial alleles. Inferring this time, in turn, depends on the selection model. We develop a framework of approximate Bayesian computation (ABC) that enables the use of the full site frequency spectrum and haplotype structure to test the goodness-of-fit of selection models and estimate the timing of selection under varying population size scenarios. We show that our method has sufficient power to distinguish natural selection from neutrality even if relatively old selection increased the frequency of a pre-existing allele from 20% to 50% or from 40% to 80%. Our ABC can accurately estimate the time of onset of selection on a new mutation. However, estimates are prone to bias under the standing variation model, possibly due to the uncertainty in the allele frequency at the onset of selection. We further extend our approach to take advantage of ancient DNA data that provides information on the allele frequency path of the beneficial allele. Applying our ABC, including both modern and ancient human DNA data, to four pigmentation alleles in Europeans, we detected selection on standing variants that occurred after the dispersal from Africa even though models of selection on a new mutation were initially supported for two of these alleles without the ancient data.


Fast and accurate approximation of the joint site frequency spectrum of multiple populations

10.1101/2020.05.01.073213 ◽

2020 ◽

Author(s):

Ethan M. Jewett

Keyword(s):

Genetic Variation ◽

Sample Size ◽

Frequency Spectrum ◽

Dna Sequences ◽

Population Genetic ◽

Accurate Approximation ◽

Computationally Efficient ◽

Multiple Populations ◽

Site Frequency Spectrum ◽

Approximate Formulas

Abstract The site frequency spectrum (SFS) is a statistic that summarizes the distribution of derived allele frequencies in a sample of DNA sequences. The SFS provides useful information about genetic variation within and among populations and it can be used to make population genetic inferences. Methods for computing the SFS based on the diffusion approximation are computationally efficient when computing all terms of the SFS simultaneously and they can handle complicated demographic scenarios. However, in practice it is sometimes only necessary to compute a subset of terms of the SFS, in which case coalescent-based methods can achieve greater computational efficiency. Here, we present simple and accurate approximate formulas for the expected joint SFS for multiple populations connected by migration. Compared with existing exact approaches, our approximate formulas greatly reduce the complexity of computing each entry of the SFS and have simple forms. The computational complexity of our method depends on the index of the entry to be computed, rather than on the sample size, and the accuracy of our approximation improves as the sample size increases.


Smart Estimation Ver-H.1.0

10.31227/osf.io/2tyqk ◽

2018 ◽

Author(s):

Sigit Haryadi

Keyword(s):

Sample Size ◽

Confidence Level ◽

Time Span ◽

Regression Equation ◽

Weight Factor ◽

The Past ◽

Level Of Confidence ◽

The Future ◽

Future Value ◽

The Difference

We cannot be sure exactly what will happen; we can only estimate it using a particular method, and each method must have a formula for creating a regression equation and a formula for calculating the confidence level of the estimated value. This paper presents a method for estimating future values in which the formula for creating the regression equation is based on the assumption that the future value depends on the difference of the past values divided by a weight factor that corresponds to the time span to the present, and the formula for calculating the level of confidence uses "the Haryadi Index". The advantage of this method is that it remains accurate regardless of the sample size and can ignore past values that are considered irrelevant.
