Non-parametric estimation of population size changes from the site frequency spectrum

Mapping Intimacies ◽

10.1101/125351 ◽

2017 ◽

Author(s):

Berit Lindum Waltoft ◽

Asger Hobolth

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Weighted Average ◽

Cubic Spline ◽

Parametric Estimation ◽

New Method ◽

Eigenvalue Decomposition ◽

Human Populations ◽

Site Frequency Spectrum

Abstract The variability in population size is a key quantity for understanding the evolutionary history of a species. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from the site frequency spectrum (SFS). First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the variability in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to data from nine different human populations.
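The weighted-average criterion in this abstract can be written schematically as follows. This is only a sketch under stated assumptions: the symbols $\alpha$ (the weight), $G$ (the goodness-of-fit term), $\xi^{\mathrm{obs}}$ (the observed SFS), and the squared-second-derivative roughness penalty are illustrative choices typical of smoothing splines, not the paper's stated definitions.

```latex
\hat{N} \;=\; \arg\min_{N(\cdot)\ \text{cubic spline}}
  \Big\{ \alpha \, G\big(\xi^{\mathrm{obs}},\, \mathbb{E}[\xi \mid N]\big)
  \;+\; (1-\alpha) \int \big(N''(t)\big)^{2} \, dt \Big\},
  \qquad 0 \le \alpha \le 1 .
```

Under this reading, a weight near 1 favors fidelity to the observed SFS, a weight near 0 favors a smooth population size curve, and cross-validation selects the value in between.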

Non-parametric estimation of population size changes from the site frequency spectrum

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0061 ◽

2018 ◽

Vol 17 (3) ◽

Cited By ~ 6

Author(s):

Berit Lindum Waltoft ◽

Asger Hobolth

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Weighted Average ◽

Cubic Spline ◽

Parametric Estimation ◽

New Method ◽

Eigenvalue Decomposition ◽

Human Populations ◽

Site Frequency Spectrum

Abstract Changes in population size are useful for understanding the evolutionary history of a species. Genetic variation within a species can be summarized by the site frequency spectrum (SFS). For a sample of size n, the SFS is a vector of length n − 1 where entry i is the number of sites where the mutant base appears i times and the ancestral base appears n − i times. We present a new method, CubSFS, for estimating the changes in population size of a panmictic population from an observed SFS. First, we provide a straightforward proof for the expression of the expected site frequency spectrum depending only on the population size. Our derivation is based on an eigenvalue decomposition of the instantaneous coalescent rate matrix. Second, we solve the inverse problem of determining the changes in population size from an observed SFS. Our solution is based on a cubic spline for the population size. The cubic spline is determined by minimizing the weighted average of two terms, namely (i) the goodness of fit to the observed SFS, and (ii) a penalty term based on the smoothness of the changes. The weight is determined by cross-validation. The new method is validated on simulated demographic histories and applied to unfolded and folded SFS from 26 different human populations from the 1000 Genomes Project.
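The SFS definition given in this abstract can be illustrated with a short sketch. The function name `unfolded_sfs` and the toy genotype matrix are hypothetical; the code assumes a 0/1 matrix with one row per site and one column per sampled sequence, where 1 marks the mutant (derived) base.

```python
import numpy as np

def unfolded_sfs(genotypes: np.ndarray) -> np.ndarray:
    """Return the unfolded SFS of a (sites x n) 0/1 matrix (1 = mutant base)."""
    n = genotypes.shape[1]
    # Number of mutant copies observed at each site (0..n).
    mutant_counts = genotypes.sum(axis=1)
    # Histogram of those counts, one bin for each value 0..n.
    counts = np.bincount(mutant_counts, minlength=n + 1)
    # Drop the monomorphic classes (0 and n copies): length n - 1 remains.
    return counts[1:n]

# Toy data: 5 sites, sample of size n = 4 (illustrative values only).
genotypes = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
])
print(unfolded_sfs(genotypes))  # [2 1 1]
```

Entry i of the result counts the sites where the mutant base appears i times, so the two singleton sites land in the first entry and the all-ancestral site is discarded.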

Inferring the model and onset of natural selection under varying population size from the site frequency spectrum and haplotype structure

Proceedings of The Royal Society B Biological Sciences ◽

10.1098/rspb.2018.2541 ◽

2019 ◽

Vol 286 (1896) ◽

pp. 20182541 ◽

Cited By ~ 3

Author(s):

Shigeki Nakagome ◽

Richard R. Hudson ◽

Anna Di Rienzo

Keyword(s):

Natural Selection ◽

Population Size ◽

Allele Frequency ◽

Frequency Spectrum ◽

Goodness Of Fit ◽

Haplotype Structure ◽

New Mutation ◽

Standing Variation ◽

Site Frequency Spectrum ◽

Varying Population

A fundamental question about adaptation in a population is the time of onset of the selective pressure acting on beneficial alleles. Inferring this time, in turn, depends on the selection model. We develop a framework of approximate Bayesian computation (ABC) that enables the use of the full site frequency spectrum and haplotype structure to test the goodness-of-fit of selection models and estimate the timing of selection under varying population size scenarios. We show that our method has sufficient power to distinguish natural selection from neutrality even if relatively old selection increased the frequency of a pre-existing allele from 20% to 50% or from 40% to 80%. Our ABC can accurately estimate the time of onset of selection on a new mutation. However, estimates are prone to bias under the standing variation model, possibly due to the uncertainty in the allele frequency at the onset of selection. We further extend our approach to take advantage of ancient DNA data that provides information on the allele frequency path of the beneficial allele. Applying our ABC, including both modern and ancient human DNA data, to four pigmentation alleles in Europeans, we detected selection on standing variants that occurred after the dispersal from Africa even though models of selection on a new mutation were initially supported for two of these alleles without the ancient data.

Contemporary Demographic Reconstruction Methods Are Robust to Genome Assembly Quality: A Case Study in Tasmanian Devils

Molecular Biology and Evolution ◽

10.1093/molbev/msz191 ◽

2019 ◽

Vol 36 (12) ◽

pp. 2906-2921 ◽

Cited By ~ 20

Author(s):

Austin H Patton ◽

Mark J Margres ◽

Amanda R Stahlke ◽

Sarah Hendricks ◽

Kevin Lewallen ◽

...

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Genome Assembly ◽

Genomic Sequence ◽

Demographic History ◽

Tasmanian Devil ◽

Assembly Quality ◽

Reconstruction Methods ◽

Site Frequency Spectrum ◽

Genome Assemblies

Abstract Reconstructing species’ demographic histories is a central focus of molecular ecology and evolution. Recently, an expanding suite of methods leveraging either the sequentially Markovian coalescent (SMC) or the site-frequency spectrum has been developed to reconstruct population size histories from genomic sequence data. However, few studies have investigated the robustness of these methods to genome assemblies of varying quality. In this study, we first present an improved genome assembly for the Tasmanian devil using the Chicago library method. Compared with the original reference genome, our new assembly reduces the number of scaffolds (from 35,975 to 10,010) and increases the scaffold N90 (from 0.101 to 2.164 Mb). Second, we assess the performance of four contemporary genomic methods for inferring population size history (PSMC, MSMC, SMC++, Stairway Plot), using the two devil genome assemblies as well as simulated, artificially fragmented genomes that approximate the hypothesized demographic history of Tasmanian devils. We demonstrate that each method is robust to assembly quality, producing similar estimates of Ne when simulated genomes were fragmented into up to 5,000 scaffolds. Overall, methods reliant on the SMC are most reliable between ∼300 generations before present (gbp) and 100 kgbp, whereas methods exclusively reliant on the site-frequency spectrum are most reliable between the present and 30 gbp. Our results suggest that when used in concert, genomic methods for reconstructing species’ effective population size histories 1) can be applied to nonmodel organisms without highly contiguous reference genomes, and 2) are capable of detecting independently documented effects of historical geological events.
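The scaffold N90 reported in this abstract is a standard contiguity statistic. A minimal sketch of how such a value is usually computed is shown below; the function name `nx` and the toy scaffold lengths are hypothetical, and the convention assumed is the common one (the length of the shortest scaffold in the smallest set of longest scaffolds covering at least x% of the assembly).

```python
def nx(lengths, x=90):
    """Length L such that scaffolds of length >= L cover at least x% of the assembly."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        # Stop at the first scaffold that pushes coverage to the threshold.
        if running >= threshold:
            return length
    raise ValueError("no scaffold lengths given")

# Toy assembly of 100 units (illustrative values only).
scaffolds = [50, 30, 10, 5, 5]
print(nx(scaffolds, 50))  # 50
print(nx(scaffolds, 90))  # 10
```

A higher N90 means that most of the assembly sits in long scaffolds, which is the sense in which the new devil assembly improves on the original.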

Inference of super-exponential human population growth via efficient computation of the site frequency spectrum for generalized models

10.1101/022574 ◽

2015 ◽

Author(s):

Feng Gao ◽

Alon Keinan

Keyword(s):

Exome Sequencing ◽

Frequency Spectrum ◽

Equivalent Model ◽

P Value ◽

Growth Speed ◽

Human Populations ◽

Summary Statistics ◽

Efficient Computation ◽

Effective Population ◽

Site Frequency Spectrum

The site frequency spectrum (SFS) and other genetic summary statistics are at the heart of many population genetics studies. Previous studies have shown that human populations have undergone a recent epoch of fast growth in effective population size. These studies assumed that growth is exponential, and the ensuing models leave an excess of extremely rare variants unexplained. This suggests that human populations might have experienced recent growth at a speed faster than exponential. Recent studies have introduced a generalized growth model where the growth speed can be faster or slower than exponential. However, only simulation approaches were available for obtaining summary statistics under such models. In this study, we provide expressions to accurately and efficiently evaluate the SFS and other summary statistics under generalized models, which we further implement in publicly available software. Investigating the power to infer deviation of growth from being exponential, we observed that decent sample sizes facilitate accurate inference, e.g. a sample of 3000 individuals with the amount of data expected from exome sequencing allows observing and accurately estimating growth with speed deviating by 10% or more from that of exponential. Applying our inference framework to data from the NHLBI Exome Sequencing Project, we found that a model with a generalized growth epoch fits the observed SFS significantly better than the equivalent model with exponential growth (p-value = 3.85 × 10⁻⁶). The estimated growth speed significantly deviates from exponential (p-value ≪ 10⁻¹²), with the best-fit estimate being a growth speed 12% faster than exponential.

Molecular Population Genetics

A Primer of Population Genetics and Genomics ◽

10.1093/oso/9780198862291.003.0007 ◽

2020 ◽

pp. 179-224

Author(s):

Daniel L. Hartl

Keyword(s):

Population Genetics ◽

Population Dynamics ◽

Amino Acid ◽

Frequency Spectrum ◽

Demographic History ◽

Human Populations ◽

Noncoding Dna ◽

Molecular Population Genetics ◽

Amino Acid Divergence ◽

Site Frequency Spectrum

Chapter 7 is an introduction to molecular population genetics that includes the principal concepts of nucleotide polymorphism and divergence, the site frequency spectrum, and tests of selection and their limitations. Highlighted are rates of nucleotide substitution in coding and noncoding DNA, nucleotide and amino acid divergence between species, corrections for multiple substitutions, and the molecular clock. Discussion of the folded and unfolded site frequency spectrum includes the strengths and limitations of Tajima’s D, Fay and Wu’s H, and other measures. The chapter also discusses an emerging consensus to resolve the celebrated selection–neutrality controversy. It also includes examination of demographic history through the use of ancient DNA with special emphasis on the surprising findings in regard to the ancestral makeup of contemporary human populations. Also discussed are the population dynamics of transposable elements in prokaryotes and eukaryotes.
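As an illustration of one of the SFS-based measures named in this chapter summary, the sketch below computes Tajima's D from an unfolded SFS using the standard constants from Tajima (1989). The function name `tajimas_d` and the toy spectra are hypothetical, and the code is a sketch rather than the chapter's own presentation.

```python
from math import sqrt

def tajimas_d(sfs):
    """Tajima's D from an unfolded SFS (entry i = number of sites with i mutant copies)."""
    n = len(sfs) + 1                      # sample size
    S = sum(sfs)                          # number of segregating sites
    pairs = n * (n - 1) / 2
    # Mean pairwise differences: a site with i mutant copies separates i*(n-i) pairs.
    pi = sum(c * i * (n - i) for i, c in enumerate(sfs, start=1)) / pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))

# A spectrum proportional to 1/i (the neutral, constant-size expectation) gives D near 0.
neutral = [100 / i for i in range(1, 10)]
# An excess of singletons, as expected after population growth, gives negative D.
singleton_excess = [10, 0, 0, 0]
print(tajimas_d(neutral), tajimas_d(singleton_excess))
```

The sign of D therefore reflects how the observed spectrum departs from the 1/i shape, which is also why demographic change and selection can be confounded by this statistic.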

Distinguishing multiple-merger from Kingman coalescence using two-site frequency spectra

10.1101/461517 ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel P. Rice ◽

John Novembre ◽

Michael M. Desai

Keyword(s):

Mutual Information ◽

Population Size ◽

Frequency Spectrum ◽

Graphical Model ◽

Genomic Diversity ◽

Size Change ◽

Frequency Spectra ◽

Offspring Number ◽

Site Frequency Spectrum ◽

Pointwise Mutual Information

Abstract Demographic inference methods in population genetics typically assume that the ancestry of a sample can be modeled by the Kingman coalescent. A defining feature of this stochastic process is that it generates genealogies that are binary trees: no more than two ancestral lineages may coalesce at the same time. However, this assumption breaks down under several scenarios. For example, pervasive natural selection and extreme variation in offspring number can both generate genealogies with “multiple-merger” events in which more than two lineages coalesce instantaneously. Therefore, detecting multiple mergers is important both for understanding which forces have shaped the diversity of a population and for avoiding fitting misspecified models to data. Current methods to detect multiple mergers in genomic data rely on the site frequency spectrum (SFS). However, the signatures of multiple mergers in the SFS are also consistent with a Kingman coalescent with a time-varying population size. Here, we present a new method for detecting multiple mergers based on the pointwise mutual information of the two-site frequency spectrum for pairs of linked sites. Unlike the SFS, the pointwise mutual information depends mostly on the topologies of genealogies rather than on their branch lengths and is therefore largely insensitive to population size change. This statistic is global in the sense that it can detect when the genome-wide genetic diversity is inconsistent with the Kingman coalescent, rather than detecting outlier regions, as in selection scan methods. Finally, we demonstrate a graphical model-checking procedure based on the pointwise mutual information using genomic diversity data from Drosophila melanogaster.
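The pointwise mutual information mentioned in this abstract can be sketched generically. The code below applies the textbook definition, log of the joint probability over the product of the marginals, to a table of joint counts; the function name and the toy tables are hypothetical, and the exact normalization the authors apply to the two-site frequency spectrum is not stated in the abstract, so this should be read as an assumption.

```python
import numpy as np

def pointwise_mutual_information(joint_counts):
    """PMI[i, j] = log( P(i, j) / (P(i) * P(j)) ) for a table of joint counts."""
    joint = np.asarray(joint_counts, dtype=float)
    joint = joint / joint.sum()                 # joint probabilities
    row = joint.sum(axis=1, keepdims=True)      # marginal of the first site
    col = joint.sum(axis=0, keepdims=True)      # marginal of the second site
    # Empty cells give -inf (or nan when a marginal is also empty).
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log(joint / (row * col))

# Independent frequencies: every PMI entry is 0.
independent = np.outer([1, 3], [2, 2])
# Positively associated frequencies: diagonal entries are positive.
associated = np.array([[4, 1], [1, 4]])
print(pointwise_mutual_information(independent))
print(pointwise_mutual_information(associated))
```

Because the marginals are divided out, the statistic measures association between the frequencies at two sites rather than the frequencies themselves.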

The Site Frequency Spectrum under Finite and Time-Varying Mutation Rates

10.1101/375907 ◽

2018 ◽

Author(s):

Andrew Melfi ◽

Divakar Viswanath

Keyword(s):

Population Size ◽

Frequency Spectrum ◽

Genomic Data ◽

Mutation Rates ◽

Single Mutation ◽

Time Varying ◽

Mutation Probability ◽

Human Genomes ◽

Haploid Population ◽

Site Frequency Spectrum

Abstract The diversity in genomes is due to the accumulation of mutations and the site frequency spectrum (SFS) is a popular statistic for summarizing genomic data. The current coalescent algorithm for calculating the SFS for a given demography assumes the μ → 0 limit, where μ is the mutation probability (or rate) per base pair per generation. The algorithm is applicable when μN, N being the haploid population size, is negligible. We derive a coalescent based algorithm for calculating the SFS that allows the mutation rate μ(t) as well as the population size N(t) to vary arbitrarily as a function of time. That algorithm shows that the probability of two mutations in the genealogy becomes noticeable already for μ = 10⁻⁸ for samples of n = 10⁵ haploid human genomes and increases rapidly with μ. Our algorithm calculates the SFS under the assumption of a single mutation in the genealogy, and the part of the SFS due to a single mutation depends only mildly on the finiteness of μ. However, the dependence of the SFS on variation in μ can be substantial for even n = 100 samples. In addition, increasing and decreasing mutation rates alter the SFS in different ways and to different extents.

The Wright-Fisher Site Frequency Spectrum as a Perturbation of the Coalescent’s

10.1101/332817 ◽

2018 ◽

Cited By ~ 1

Author(s):

Andrew Melfi ◽

Divakar Viswanath

Keyword(s):

Sample Size ◽

Population Size ◽

Total Variation ◽

Frequency Spectrum ◽

Separate Analysis ◽

Number Of Children ◽

Large Samples ◽

Variation Distance ◽

Site Frequency Spectrum ◽

The Difference

Abstract The first terms of the Wright-Fisher (WF) site frequency spectrum that follow the coalescent approximation are determined precisely, with a view to understanding the accuracy of the coalescent approximation for large samples. The perturbing terms show that the probability of a single mutant in the sample (singleton probability) is elevated in WF but the rest of the frequency spectrum is lowered. A part of the perturbation can be attributed to a mismatch in rates of merger between WF and the coalescent. The rest of it can be attributed to the difference in the way WF and the coalescent partition children between parents. In particular, the number of children of a parent is approximately Poisson under WF and approximately geometric under the coalescent. Whereas the mismatch in rates raises the probability of singletons under WF, its offspring distribution being approximately Poisson lowers it. The two effects are of opposite sense everywhere except at the tail of the frequency spectrum. The WF frequency spectrum begins to depart from that of the coalescent only for sample sizes that are comparable to the population size. These conclusions are confirmed by a separate analysis that assumes the sample size n to be equal to the population size N. Partly thanks to the canceling effects, the total variation distance of WF minus coalescent is 0.12/log N for a population-sized sample with n = N, which is only 1% for N = 2 × 10⁴.
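The closing figure can be checked with a line of arithmetic, reading the population size as N = 2 × 10^4 and assuming that "log" denotes the natural logarithm (an assumption; the abstract does not state the base, although a base-10 logarithm would give about 2.8% rather than about 1%):

```latex
\frac{0.12}{\ln\!\left(2 \times 10^{4}\right)}
  \;=\; \frac{0.12}{\ln 2 + 4 \ln 10}
  \;\approx\; \frac{0.12}{9.90}
  \;\approx\; 0.012 ,
```

that is, a total variation distance of roughly 1%.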
