Exact Likelihood Calculation under the Infinite Sites Model

Muhammad Faisal; Andreas Futschik; Claus Vogl

doi:10.3390/computation3040701

Statistical Properties of a DNA Sample Under the Finite-Sites Model

Genetics ◽

10.1093/genetics/144.4.1941 ◽

1996 ◽

Vol 144 (4) ◽

pp. 1941-1950 ◽

Cited By ~ 1

Author(s):

Ziheng Yang

Keyword(s):

Process Model ◽

Random Mating ◽

Major Effect ◽

Statistical Properties ◽

Parameter Estimates ◽

Rate Variation ◽

Markov Process Model ◽

D Loop ◽

Infinite Sites Model ◽

Segregating Sites

Statistical properties of a DNA sample from a random-mating population of constant size are studied under the finite-sites model. It is assumed that there is no migration and no recombination occurs within the locus. A Markov process model is used for nucleotide substitution, allowing for multiple substitutions at a single site. The evolutionary rates among sites are treated as either constant or variable. The general likelihood calculation using numerical integration involves intensive computation and is feasible for three or four sequences only; it may be used for validating approximate algorithms. Methods are developed to approximate the probability distribution of the number of segregating sites in a random sample of n sequences, with either constant or variable substitution rates across sites. Calculations using parameter estimates obtained for human D-loop mitochondrial DNAs show that among-site rate variation has a major effect on the distribution of the number of segregating sites; the distribution under the finite-sites model with variable rates among sites is quite different from that under the infinite-sites model.


A New Statistic for Detecting Genetic Differentiation

Genetics ◽

10.1093/genetics/155.4.2011 ◽

2000 ◽

Vol 155 (4) ◽

pp. 2011-2014 ◽

Cited By ~ 10

Author(s):

Richard R Hudson

Keyword(s):

Genetic Differentiation ◽

Dna Sequences ◽

Genetic Data ◽

Island Model ◽

Wide Range ◽

Infinite Sites Model ◽

Parameter Values

A new statistic for detecting genetic differentiation of subpopulations is described. The statistic can be calculated when genetic data are collected on individuals sampled from two or more localities. It is assumed that haplotypic data are obtained, either in the form of DNA sequences or data on many tightly linked markers. Using a symmetric island model, and assuming an infinite-sites model of mutation, it is found that the new statistic is as powerful or more powerful than previously proposed statistics for a wide range of parameter values.


The effect of life-history and mode of inheritance on neutral genetic variability

Genetics Research ◽

10.1017/s0016672301004979 ◽

2001 ◽

Vol 77 (2) ◽

pp. 153-166 ◽

Cited By ~ 114

Author(s):

Brian Charlesworth

Keyword(s):

Sequence Variation ◽

Adult Mortality ◽

Structured Populations ◽

Human Populations ◽

Effective Population ◽

Transmission Modes ◽

Population Sizes ◽

Age Structured ◽

Infinite Sites Model ◽

Site Diversity

Formulae for the effective population sizes of autosomal, X-linked, Y-linked and maternally transmitted loci in age-structured populations are developed. The approximations used here predict both asymptotic rates of increase in probabilities of identity, and equilibrium levels of neutral nucleotide site diversity under the infinite-sites model. The applications of the results to the interpretation of data on DNA sequence variation in Drosophila, plant, and human populations are discussed. It is concluded that sex differences in demographic parameters such as adult mortality rates generally have small effects on the relative effective population sizes of loci with different modes of inheritance, whereas differences between the sexes in variance in reproductive success can have major effects, either increasing or reducing the effective population size for X-linked loci relative to autosomal or Y-linked loci. These effects need to be accounted for when trying to understand data on patterns of sequence variation for genes with different transmission modes.


Maximum Likelihood Estimation of Population Divergence Times and Population Phylogenies under the Infinite Sites Model

Theoretical Population Biology ◽

10.1006/tpbi.1997.1348 ◽

1998 ◽

Vol 53 (2) ◽

pp. 143-151 ◽

Cited By ~ 37

Author(s):

Rasmus Nielsen

Keyword(s):

Maximum Likelihood ◽

Maximum Likelihood Estimation ◽

Likelihood Estimation ◽

Divergence Times ◽

Population Divergence ◽

Infinite Sites Model


Coalescent: an Open-Science framework for Importance Sampling in Coalescent theory

10.7287/peerj.preprints.395v1 ◽

2014 ◽

Author(s):

Susanta Tewari ◽

John L Spouge

Keyword(s):

Importance Sampling ◽

User Interaction ◽

Open Science ◽

Maximum Likelihood Estimates ◽

Effective Sample Size ◽

Data Sets ◽

Coalescent Theory ◽

Target Distribution ◽

Compute Data ◽

Infinite Sites Model

Importance sampling is widely used in coalescent theory to compute data likelihood. Efficient importance sampling requires a trial distribution close to the target distribution of the genealogies conditioned on the data. Moreover, an efficient proposal requires intuition about how the data influence the target distribution. Different proposals might work under similar conditions, and sometimes the corresponding concepts overlap extensively. Currently, there is no framework available for coalescent theory that evaluates proposals in an integrated manner. Typically, problems are not modeled, optimization is performed vigorously on limited datasets, user interaction requires thorough knowledge, and programs are not aligned with the current demands of open science. We have designed a general framework (http://coalescent.sourceforge.net) for importance sampling, to compute data likelihood under the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. The framework computes the data likelihood and provides maximum likelihood estimates of the mutation parameter. Well-known benchmarks in the coalescent literature validate the framework’s accuracy. We evaluate several proposals in the coalescent literature, to discover that the order of efficiency among three standard proposals changes when running time is considered along with the effective sample size. The framework provides an intuitive user interface with minimal clutter. For speed, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework accessible to a large community.
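The estimator and the effective sample size mentioned in this abstract can be illustrated with a minimal, generic sketch. It is not the API of the framework described above: the function name `importance_sampling` and its parameters are hypothetical, and the toy target is a one-dimensional function rather than a distribution over genealogies. In the coalescent setting, each draw would be a genealogy sampled from a proposal, and the weight would be the joint probability of data and genealogy divided by the proposal probability.

```python
import math
import random


def importance_sampling(target, draw, proposal_density, num_draws, rng):
    """Estimate the integral of `target` by averaging importance weights.

    target           -- function whose integral is wanted (hypothetical name)
    draw             -- function rng -> one sample from the proposal
    proposal_density -- density of the proposal at a sampled point
    Returns (estimate, effective_sample_size).
    """
    weights = []
    for _ in range(num_draws):
        x = draw(rng)
        weights.append(target(x) / proposal_density(x))
    estimate = sum(weights) / num_draws
    # Effective sample size: (sum of weights)^2 / (sum of squared weights).
    # It equals num_draws when all weights are equal and shrinks as the
    # weights become more uneven.
    ess = sum(weights) ** 2 / sum(w * w for w in weights)
    return estimate, ess


# Toy usage: integrate exp(-x) over [0, inf), whose true value is 1,
# with an Exponential(rate=0.5) proposal.
rng = random.Random(1)
estimate, ess = importance_sampling(
    target=lambda x: math.exp(-x),
    draw=lambda r: r.expovariate(0.5),
    proposal_density=lambda x: 0.5 * math.exp(-0.5 * x),
    num_draws=20000,
    rng=rng,
)
```

A proposal closer to the target gives more even weights and therefore a larger effective sample size for the same number of draws, which is the sense in which the abstract compares proposals.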


Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes

10.1101/779132 ◽

2019 ◽

Cited By ~ 3

Author(s):

Peter Ralph ◽

Kevin Thornton ◽

Jerome Kelleher

Keyword(s):

Genome Sequence ◽

General Framework ◽

Simulated Data ◽

Genetic Mutation ◽

Genealogical Tree ◽

Computational Performance ◽

Infinite Sites Model ◽

And Function ◽

General Duality ◽

Dual Site

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates “sample weights” within the genealogical tree at each position on the genome, which are then combined using a “summary function”; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite-sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding “branch” statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.


Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes

Genetics ◽

10.1534/genetics.120.303253 ◽

2020 ◽

Vol 215 (3) ◽

pp. 779-797 ◽

Cited By ~ 3

Author(s):

Peter Ralph ◽

Kevin Thornton ◽

Jerome Kelleher

Keyword(s):

Genome Sequence ◽

General Framework ◽

Simulated Data ◽

Genetic Mutation ◽

Data Set ◽

Genealogical Tree ◽

Computational Performance ◽

Infinite Sites Model ◽

Project Data ◽

And Function

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
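The "sample weight" and "summary function" idea in this abstract can be sketched for the by-site case in a few lines. This is an illustration under stated assumptions, not the authors' implementation or any library's API: the names `site_statistic`, `diversity`, and `brute_force_diversity` are hypothetical, sites are assumed biallelic with 0 as the ancestral state, and the weights are summed directly over a genotype matrix instead of being accumulated within a tree sequence.

```python
from itertools import combinations


def site_statistic(genotypes, weights, summary):
    """Sum summary(x) over sites, where x is the total weight of the
    samples carrying the derived allele (1) at that site.

    genotypes -- list of sites, each a list of 0/1 values, one per sample
    weights   -- one weight per sample
    summary   -- function of the accumulated weight x
    """
    total = 0.0
    for site in genotypes:
        x = sum(w for g, w in zip(site, weights) if g == 1)
        total += summary(x)
    return total


def diversity(genotypes):
    """Mean number of pairwise differences: weight 1 per sample and
    summary f(x) = x * (n - x) / C(n, 2)."""
    n = len(genotypes[0])
    pairs = n * (n - 1) / 2
    return site_statistic(genotypes, [1] * n, lambda x: x * (n - x) / pairs)


def brute_force_diversity(genotypes):
    """Same quantity computed by comparing every pair of samples."""
    n = len(genotypes[0])
    diffs = [
        sum(site[i] != site[j] for site in genotypes)
        for i, j in combinations(range(n), 2)
    ]
    return sum(diffs) / len(diffs)


# Two sites, three samples; both functions should agree.
example = [[1, 0, 0], [1, 1, 0]]
```

Choosing a different weight vector or summary function in `site_statistic` yields a different statistic, which is the design point the abstract describes.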


Transient distribution of the number of segregating sites in a neutral infinite-sites model with no recombination

Journal of Applied Probability ◽

10.1017/s002190020009759x ◽

1981 ◽

Vol 18 (01) ◽

pp. 42-51 ◽

Cited By ~ 3

Author(s):

R. C. Griffiths

Keyword(s):

Waiting Time ◽

Time Distribution ◽

Large Population ◽

Initial Population ◽

Stationary Case ◽

Waiting Time Distribution ◽

Transient Distribution ◽

Infinite Sites Model ◽

Segregating Sites ◽

New Mutations

The transient distribution of the number of segregating sites in a sample from a large population of 2N genes is found. Segregating sites are split into those in common with the sites segregating in the initial population and those segregating due to new mutations. The waiting time distribution for a population to lose all of its initial sites is also studied. A neutral infinite-sites model with no recombination is used. This paper extends the work of Watterson (1975) on the stationary case, and of Li (1977) on the transient distribution in a sample of 2.
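For orientation, the stationary case attributed to Watterson (1975) has well-known closed forms for the mean and variance of the number of segregating sites S in a sample of n sequences: E[S] = θ·a1 and Var[S] = θ·a1 + θ²·a2, with a1 = Σ 1/i and a2 = Σ 1/i² for i = 1..n-1. The sketch below computes these; it does not reproduce the transient results of the paper above. The function names are hypothetical, and θ is assumed to be the scaled mutation rate 4Nμ for the whole locus.

```python
def harmonic_sums(n):
    """Return (a1, a2) = (sum of 1/i, sum of 1/i^2) for i = 1..n-1."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    return a1, a2


def segregating_sites_moments(n, theta):
    """Stationary mean and variance of S under the neutral
    infinite-sites model with no recombination."""
    a1, a2 = harmonic_sums(n)
    mean = theta * a1
    variance = theta * a1 + theta ** 2 * a2
    return mean, variance


def watterson_theta(num_segregating, n):
    """Moment estimator of theta obtained by solving E[S] = theta * a1."""
    a1, _ = harmonic_sums(n)
    return num_segregating / a1


# For n = 3 and theta = 2: a1 = 1.5, a2 = 1.25, so mean = 3 and variance = 8.
mean, variance = segregating_sites_moments(3, 2.0)
```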


Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach

Bioinformatics ◽

10.1093/bioinformatics/btz676 ◽

2019 ◽

Author(s):

Yufeng Wu

Keyword(s):

Single Cell ◽

Cell Lineage ◽

Large Data ◽

Genomic Variation ◽

Supplementary Information ◽

Perfect Phylogeny ◽

Tree Inference ◽

Lineage Tree ◽

Infinite Sites Model ◽

Cell Data

Motivation: Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results: In this article, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information: Supplementary data are available at Bioinformatics online.
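The perfect phylogeny condition mentioned in this abstract has a simple check for noise-free binary data: with 0 as the ancestral state, a genotype matrix admits a rooted perfect phylogeny exactly when, for every pair of sites, the sets of cells carrying the derived allele are either disjoint or nested. The sketch below implements only that check. It is not ScisTree's heuristic, which works with genotype probabilities, and the function name `admits_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Check the pairwise compatibility condition on a binary matrix.

    matrix -- list of rows (cells); each row lists 0 (ancestral) or
              1 (derived) for every site.
    Returns True if every pair of sites has carrier sets that are
    disjoint or nested, False otherwise.
    """
    if not matrix:
        return True
    num_sites = len(matrix[0])
    # For each site, the set of cells that carry the derived allele.
    carriers = [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(num_sites)
    ]
    for a, b in combinations(carriers, 2):
        overlapping = bool(a & b)
        nested = a <= b or b <= a
        if overlapping and not nested:
            return False
    return True


# Nested carrier sets {0, 1} and {1}: compatible.
compatible = [[1, 0], [1, 1], [0, 0]]
# Overlapping, non-nested carrier sets {0, 1} and {1, 2}: incompatible.
conflicting = [[1, 0], [1, 1], [0, 1]]
```

With noisy data, a method in the spirit of the abstract would search for genotype calls that satisfy this condition while maximizing the likelihood; the check above only tests a single candidate matrix.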
