Exact Likelihood Calculation under the Infinite Sites Model

Computation ◽  
2015 ◽  
Vol 3 (4) ◽  
pp. 701-713 ◽  
Author(s):  
Muhammad Faisal ◽  
Andreas Futschik ◽  
Claus Vogl
Genetics ◽  
1996 ◽  
Vol 144 (4) ◽  
pp. 1941-1950 ◽  
Author(s):  
Ziheng Yang

Statistical properties of a DNA sample from a random-mating population of constant size are studied under the finite-sites model. It is assumed that there is no migration and no recombination occurs within the locus. A Markov process model is used for nucleotide substitution, allowing for multiple substitutions at a single site. The evolutionary rates among sites are treated as either constant or variable. The general likelihood calculation using numerical integration involves intensive computation and is feasible for three or four sequences only; it may be used for validating approximate algorithms. Methods are developed to approximate the probability distribution of the number of segregating sites in a random sample of n sequences, with either constant or variable substitution rates across sites. Calculations using parameter estimates obtained for human D-loop mitochondrial DNAs show that among-site rate variation has a major effect on the distribution of the number of segregating sites; the distribution under the finite-sites model with variable rates among sites is quite different from that under the infinite-sites model.
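For comparison, the infinite-sites baseline mentioned at the end of this abstract has well-known moments for the number of segregating sites S in a sample of n sequences (Watterson 1975, for a neutral locus without recombination). The sketch below computes them; the function name `watterson_moments` and the parameter name `theta` (the population-scaled mutation rate for the whole locus) are illustrative choices, not taken from the paper, and the finite-sites approximations the paper develops are not reproduced here.

```python
def watterson_moments(n, theta):
    """Mean and variance of the number of segregating sites S under the
    neutral infinite-sites model without recombination (Watterson 1975).

    n     -- number of sampled sequences (n >= 2)
    theta -- population-scaled mutation rate for the locus (assumed name)
    """
    if n < 2:
        raise ValueError("need at least two sequences")
    # Harmonic-type sums over i = 1 .. n-1
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    mean = theta * a1
    variance = theta * a1 + theta ** 2 * a2
    return mean, variance


if __name__ == "__main__":
    for n in (2, 10, 50):
        print(n, watterson_moments(n, theta=5.0))
```

A finite-sites distribution with rate variation among sites could then be compared against these two moments, which is the kind of contrast the abstract reports.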


Genetics ◽  
2000 ◽  
Vol 155 (4) ◽  
pp. 2011-2014 ◽  
Author(s):  
Richard R Hudson

A new statistic for detecting genetic differentiation of subpopulations is described. The statistic can be calculated when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers. Using a symmetric island model, and assuming an infinite-sites model of mutation, it is found that the new statistic is as powerful as or more powerful than previously proposed statistics for a wide range of parameter values.


2001 ◽  
Vol 77 (2) ◽  
pp. 153-166 ◽  
Author(s):  
BRIAN CHARLESWORTH

Formulae for the effective population sizes of autosomal, X-linked, Y-linked and maternally transmitted loci in age-structured populations are developed. The approximations used here predict both asymptotic rates of increase in probabilities of identity, and equilibrium levels of neutral nucleotide site diversity under the infinite-sites model. The applications of the results to the interpretation of data on DNA sequence variation in Drosophila, plant, and human populations are discussed. It is concluded that sex differences in demographic parameters such as adult mortality rates generally have small effects on the relative effective population sizes of loci with different modes of inheritance, whereas differences between the sexes in variance in reproductive success can have major effects, either increasing or reducing the effective population size for X-linked loci relative to autosomal or Y-linked loci. These effects need to be accounted for when trying to understand data on patterns of sequence variation for genes with different transmission modes.


2014 ◽  
Author(s):  
Susanta Tewari ◽  
John L Spouge

Importance sampling is widely used in coalescent theory to compute data likelihood. Efficient importance sampling requires a trial distribution close to the target distribution of the genealogies conditioned on the data. Moreover, an efficient proposal requires intuition about how the data influence the target distribution. Different proposals might work under similar conditions, and sometimes the corresponding concepts overlap extensively. Currently, there is no framework available for coalescent theory that evaluates proposals in an integrated manner. Typically, problems are not modeled, optimization is performed vigorously on limited datasets, user interaction requires thorough knowledge, and programs are not aligned with the current demands of open science. We have designed a general framework (http://coalescent.sourceforge.net) for importance sampling, to compute data likelihood under the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. The framework computes the data likelihood and provides maximum likelihood estimates of the mutation parameter. Well-known benchmarks in the coalescent literature validate the framework’s accuracy. We evaluate several proposals in the coalescent literature, to discover that the order of efficiency among three standard proposals changes when running time is considered along with the effective sample size. The framework provides an intuitive user interface with minimal clutter. For speed, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework accessible to a large community.
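The two quantities this abstract compares proposals on, the likelihood estimate and the effective sample size, can be illustrated with a minimal sketch. It assumes each importance weight has already been computed as the ratio of target to proposal probability for one sampled genealogy, and it uses one common definition of effective sample size, (sum of weights)^2 / (sum of squared weights). The function name `importance_estimate` is hypothetical and is not part of the framework described above.

```python
def importance_estimate(weights):
    """Return (likelihood_estimate, effective_sample_size) from a list of
    non-negative importance weights, one per sampled genealogy.

    The likelihood estimate is the mean weight; the effective sample size
    uses the common (sum w)^2 / sum(w^2) definition.
    """
    if not weights:
        raise ValueError("need at least one weight")
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    estimate = total / len(weights)
    ess = (total * total) / total_sq if total_sq > 0 else 0.0
    return estimate, ess


if __name__ == "__main__":
    # Equal weights: every draw contributes, so the ESS equals the sample size.
    print(importance_estimate([2.0, 2.0, 2.0, 2.0]))
    # One dominant weight: the ESS collapses towards 1.
    print(importance_estimate([1.0, 0.0, 0.0, 0.0]))
```

Dividing the effective sample size by running time gives one way to rank proposals when cost matters, which is the kind of comparison the abstract says can reorder the standard proposals.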


2019 ◽  
Author(s):  
Peter Ralph ◽  
Kevin Thornton ◽  
Jerome Kelleher

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates “sample weights” within the genealogical tree at each position on the genome, which are then combined using a “summary function”; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite-sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding “branch” statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.


Genetics ◽  
2020 ◽  
Vol 215 (3) ◽  
pp. 779-797 ◽  
Author(s):  
Peter Ralph ◽  
Kevin Thornton ◽  
Jerome Kelleher

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
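The idea that different statistics arise from different choices of weight and summary function can be illustrated on a plain genotype matrix. The sketch below is only a flat, matrix-based illustration of the "site" mode; it is not the tree-sequence algorithm from the paper, which the abstract reports to be far more efficient. Every sample is given weight 1, so the accumulated weight at a site is the derived-allele count, and all function names are hypothetical.

```python
def site_statistic(genotypes, summary):
    """Sum summary(k, n) over sites.

    genotypes -- list of sites; each site is a list of 0/1 alleles,
                 one entry per sample (0 = ancestral, 1 = derived)
    summary   -- function of (k, n): derived-allele count and sample size
    """
    total = 0.0
    for site in genotypes:
        n = len(site)
        k = sum(site)  # weight 1 per sample, accumulated over carriers
        total += summary(k, n)
    return total


def diversity_summary(k, n):
    """Probability that two distinct samples differ at the site."""
    return 2.0 * k * (n - k) / (n * (n - 1))


def segregating_summary(k, n):
    """1 if the site is polymorphic in the sample, else 0."""
    return 1.0 if 0 < k < n else 0.0


if __name__ == "__main__":
    sites = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
    print(site_statistic(sites, diversity_summary))
    print(site_statistic(sites, segregating_summary))
```

Swapping the summary function changes the statistic (pairwise diversity versus number of segregating sites) without touching the accumulation loop, which mirrors the separation the abstract describes.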


1981 ◽  
Vol 18 (01) ◽  
pp. 42-51 ◽  
Author(s):  
R. C. Griffiths

The transient distribution of the number of segregating sites in a sample from a large population of 2N genes is found. Segregating sites are split into those in common with the sites segregating in the initial population and those segregating due to new mutations. The waiting time distribution for a population to lose all of its initial sites is also studied. A neutral infinite-sites model with no recombination is used. This paper extends the work of Watterson (1975), from the stationary case; and of Li (1977) from the transient distribution in a sample of 2.


2019 ◽  
Author(s):  
Yufeng Wu

Motivation: Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results: In this article, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information: Supplementary data are available at Bioinformatics online.
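Under the infinite sites model, a binary genotype matrix with known ancestral state 0 admits a rooted perfect phylogeny exactly when no pair of sites shows all three of the patterns (0,1), (1,0) and (1,1) across cells. The sketch below implements that standard pairwise check as an illustration of the constraint the called genotypes must satisfy; it is not ScisTree's heuristic, it ignores genotype probabilities, and the function name is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Check the pairwise compatibility condition for a rooted perfect phylogeny.

    matrix -- list of rows, one per cell; each row is a list of 0/1 genotypes,
              one per site (0 = ancestral, 1 = derived)
    """
    if not matrix:
        return True
    n_sites = len(matrix[0])
    conflict = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        # Collect the genotype pairs observed at sites i and j across all cells.
        seen = {(row[i], row[j]) for row in matrix}
        if conflict <= seen:
            return False  # the two sites cannot both have mutated only once
    return True


if __name__ == "__main__":
    print(admits_perfect_phylogeny([[1, 0], [1, 1], [0, 0]]))
    print(admits_perfect_phylogeny([[1, 0], [0, 1], [1, 1]]))
```

A method that calls genotypes from noisy data could use a check of this kind to confirm that its output matrix is consistent with a single tree.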


2008 ◽  
Vol 105 (38) ◽  
pp. 14254-14261 ◽  
Author(s):  
J. Ma ◽  
A. Ratan ◽  
B. J. Raney ◽  
B. B. Suh ◽  
W. Miller ◽  
...  
