infinite sites model
Recently Published Documents


2021 ◽  
Author(s):  
Jonas Demeulemeester ◽  
Stefan C Dentro ◽  
Moritz Gerstung ◽  
Peter Van Loo

The infinite sites model of molecular evolution requires that every position in the genome is mutated at most once. It is a cornerstone of tumour phylogenetic analysis, and is often implied when calling, phasing and interpreting variants or studying the mutational landscape as a whole. Here we identify 20,555 biallelic mutations, where the same base is mutated independently on both parental copies, in 722 (26.0%) bulk sequencing samples from the Pan-Cancer Analysis of Whole Genomes study (PCAWG). Biallelic mutations reveal UV damage hotspots at ETS and NFAT binding sites, and hypermutable motifs in POLE-mutant and other cancers. We formulate recommendations for variant calling and provide frameworks to model and detect biallelic mutations. These results highlight the need for accurate models of mutation rates and tumour evolution, as well as their inference from sequencing data.
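As a rough illustration of why coincident hits are unexpected under a uniform-rate null model, the expected number of positions mutated on both parental copies can be sketched as below. This is not the authors' framework: it assumes mutated positions on each copy are placed uniformly at random, and the function name and input numbers are hypothetical.

```python
def expected_biallelic_hits(m_a: int, m_b: int, genome_length: int) -> float:
    """Expected number of positions mutated on both parental copies.

    Assumption: the m_a mutated positions on copy A and the m_b mutated
    positions on copy B are each placed uniformly at random, without
    replacement, among genome_length positions. The overlap is then
    hypergeometric with mean m_a * m_b / genome_length.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return m_a * m_b / genome_length


# Hypothetical tumour: 50,000 mutations per copy, 3e9 callable positions.
print(expected_biallelic_hits(50_000, 50_000, 3_000_000_000))
```

With these illustrative inputs the expectation is below one site per sample, so a large number of observed biallelic hits would point to departures from uniformity such as the hotspots and hypermutable motifs described above.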


2020 ◽  
Author(s):  
Erin K. Molloy ◽  
John Gatesy ◽  
Mark S. Springer

A major shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting (ILS). Coalescence methods explicitly address this problem, but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescence methods, retroelement insertions have emerged as powerful phylogenomic markers for species tree estimation. We show that two recently proposed methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the species tree under the multispecies coalescent model, with retroelement insertions following a neutral infinite sites model of mutation. The accuracy of these and other methods for inferring species trees with retroelements has not been assessed in simulation studies. We simulate retroelements for four different species trees, including three with short branch lengths in the anomaly zone, and assess the performance of eight different methods for recovering the correct species tree. We also examine whether ASTRAL_BP recovers accurate internal branch lengths for internodes of various lengths (in coalescent units). Our results indicate that two recently proposed ILS-aware methods, ASTRAL_BP and SDPquartets, as well as the newly proposed ASTRID_BP, always recover the correct species tree on data sets with large numbers of retroelements even when there are extremely short species-tree branches in the anomaly zone. Dollo parsimony performed almost as well as these ILS-aware methods. By contrast, unordered parsimony, polymorphism parsimony, and MDC recovered the correct species tree in the case of a pectinate tree with four ingroup taxa in the anomaly zone, but failed to recover the correct tree in more complex anomaly-zone situations with additional lineages impacted by extensive incomplete lineage sorting. Camin-Sokal parsimony always reconstructed an incorrect tree in the anomaly zone. ASTRAL_BP accurately estimated branch lengths when internal branches were very short as in anomaly zone situations, but branch lengths were upwardly biased by more than 35% when species tree branches were longer. We derive a mathematical correction for these distortions, assuming the expected number of new retroelement insertions per generation is constant across the species tree. We also show that short branches do not need to be corrected even when this assumption does not hold; therefore, the branch length estimates produced by ASTRAL_BP may provide insight into whether an estimated species tree is in the anomaly zone.


2020 ◽  
Vol 501 ◽  
pp. 110335
Author(s):  
Chris D. Greenman ◽  
Luca Penso-Dolfin ◽  
Taoyang Wu

Genetics ◽  
2020 ◽  
Vol 215 (3) ◽  
pp. 779-797 ◽  
Author(s):  
Peter Ralph ◽  
Kevin Thornton ◽  
Jerome Kelleher

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
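The duality between branch and site statistics can be made concrete for pairwise diversity on a single tree. The sketch below is a minimal pure-Python illustration, not the tree sequence implementation described in the abstract: the tree encoding as `(length, k)` pairs, the function names, and the toy tree are all assumptions made for the example.

```python
from math import comb


def branch_diversity(branches, n_samples):
    """Mean branch length separating two distinct samples.

    branches: iterable of (length, k) pairs, where k is the number of
    samples descending from that branch. Such a branch lies on the path
    between k * (n_samples - k) of the comb(n_samples, 2) sample pairs.
    """
    pairs = comb(n_samples, 2)
    return sum(length * k * (n_samples - k) for length, k in branches) / pairs


def expected_site_diversity(branches, n_samples, mu):
    """Expected pairwise differences under the infinite sites model.

    Assumption: mutations fall on branches as a Poisson process with
    rate mu per unit branch length, and every mutation hits a new site.
    """
    return mu * branch_diversity(branches, n_samples)


# Toy tree ((A:1,B:1):1,C:2): three leaf branches and one internal branch.
toy_tree = [(1.0, 1), (1.0, 1), (1.0, 2), (2.0, 1)]
print(branch_diversity(toy_tree, 3))
print(expected_site_diversity(toy_tree, 3, mu=0.5))
```

In the toy tree the pairwise path lengths are 2, 4 and 4, so the branch statistic is their mean, and the expected site statistic is that mean scaled by the mutation rate.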


2019 ◽  
Author(s):  
Peter Ralph ◽  
Kevin Thornton ◽  
Jerome Kelleher

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates “sample weights” within the genealogical tree at each position on the genome, which are then combined using a “summary function”; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite-sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding “branch” statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.


2019 ◽  
Author(s):  
Yufeng Wu

Motivation: Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results: In this article, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information: Supplementary data are available at Bioinformatics online.
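The perfect phylogeny condition mentioned here can be checked on a noise-free binary genotype matrix with the classic pairwise compatibility test. The sketch below is not ScisTree's heuristic and does not use its code; it assumes genotypes are already called, that 0 is the known ancestral state, and the function name is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Return True if a binary cells-by-sites matrix fits a rooted perfect phylogeny.

    Assumption: 0 is the ancestral state and 1 the mutated state at every
    site. Under the infinite sites model, two sites conflict exactly when
    the cells show all three patterns (0, 1), (1, 0) and (1, 1) at them.
    """
    n_sites = len(matrix[0]) if matrix else 0
    conflict = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        seen = {(row[i], row[j]) for row in matrix}
        if conflict <= seen:
            return False
    return True


compatible = [[1, 0], [1, 1], [0, 0]]    # site 2 is nested inside site 1
incompatible = [[1, 0], [0, 1], [1, 1]]  # needs a repeated or back mutation
print(admits_perfect_phylogeny(compatible), admits_perfect_phylogeny(incompatible))
```

A method working with genotype probabilities, as described above, would search over called genotypes that pass such a check while maximizing likelihood; this sketch covers only the check itself.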


2019 ◽  
Author(s):  
Yufeng Wu

Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling-based and can be very slow for large data. In this paper, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability: The program ScisTree is available for download at: https://github.com/yufengwudcs/[email protected]


Computation ◽  
2015 ◽  
Vol 3 (4) ◽  
pp. 701-713 ◽  
Author(s):  
Muhammad Faisal ◽  
Andreas Futschik ◽  
Claus Vogl

2015 ◽  
Author(s):  
Sha Zhu ◽  
James H Degnan ◽  
Sharyn J Goldstien ◽  
Bjarki Eldon

Background: There has been increasing interest in coalescent models that admit multiple mergers of ancestral lineages, and in modeling hybridization and coalescence simultaneously. Results: Hybrid-Lambda is a software package that simulates gene genealogies under multiple merger and Kingman's coalescent processes within species networks or species trees. Hybrid-Lambda allows different coalescent processes to be specified for different populations, and allows for time to be converted between generations and coalescent units, by specifying a population size for each population. In addition, Hybrid-Lambda can generate simulated datasets, assuming the infinitely many sites mutation model, and compute the Fst statistic. As an illustration, we apply Hybrid-Lambda to infer the time of subdivision of certain marine invertebrates under different coalescent processes. Conclusions: Hybrid-Lambda makes it possible to investigate biogeographic concordance among high fecundity species exhibiting skewed offspring distribution. Keywords: hybridization; multiple merger; gene tree; coalescent; FST; infinite sites model; hybrid-lambda; skewed offspring distribution
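The conversion between generations and coalescent units can be sketched for the Kingman case. This is a minimal illustration under the textbook Wright-Fisher scaling, where one coalescent unit equals ploidy times population size generations; the scaling Hybrid-Lambda itself applies, particularly for multiple-merger processes, is not stated here and may differ. The function names are hypothetical.

```python
def generations_to_coalescent_units(generations, population_size, ploidy=2):
    """Convert generations to coalescent units under Kingman's coalescent.

    Assumption: standard Wright-Fisher scaling, with one coalescent unit
    equal to ploidy * population_size generations (2N for diploids).
    """
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return generations / (ploidy * population_size)


def coalescent_units_to_generations(units, population_size, ploidy=2):
    """Inverse conversion under the same assumed scaling."""
    if population_size <= 0:
        raise ValueError("population_size must be positive")
    return units * ploidy * population_size


# Hypothetical diploid population of 10,000 individuals.
print(generations_to_coalescent_units(20_000, 10_000))
```

Because each population carries its own size, the same number of generations corresponds to different coalescent durations on different branches of a species tree or network.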


2014 ◽  
Author(s):  
Susanta Tewari ◽  
John L Spouge

Importance sampling is widely used in coalescent theory to compute data likelihood. Efficient importance sampling requires a trial distribution close to the target distribution of the genealogies conditioned on the data. Moreover, an efficient proposal requires intuition about how the data influence the target distribution. Different proposals might work under similar conditions, and sometimes the corresponding concepts overlap extensively. Currently, there is no framework available for coalescent theory that evaluates proposals in an integrated manner. Typically, problems are not modeled, optimization is performed vigorously on limited datasets, user interaction requires thorough knowledge, and programs are not aligned with the current demands of open science. We have designed a general framework (http://coalescent.sourceforge.net) for importance sampling, to compute data likelihood under the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. The framework computes the data likelihood and provides maximum likelihood estimates of the mutation parameter. Well-known benchmarks in the coalescent literature validate the framework’s accuracy. We evaluate several proposals in the coalescent literature, to discover that the order of efficiency among three standard proposals changes when running time is considered along with the effective sample size. The framework provides an intuitive user interface with minimal clutter. For speed, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework accessible to a large community.
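The two quantities used here to compare proposals, the likelihood estimate and the effective sample size, can be sketched from a list of importance weights. This is a generic illustration, not code from the framework described above: it assumes each weight is the ratio of the target density to the proposal density for one sampled genealogy, and it uses the common (sum of weights) squared over sum of squared weights definition of effective sample size. The function names are hypothetical.

```python
def likelihood_estimate(weights):
    """Importance sampling estimate of the data likelihood.

    Assumption: each weight is target density / proposal density for one
    genealogy drawn from the proposal, so the mean weight is an unbiased
    estimate of the likelihood.
    """
    if not weights:
        raise ValueError("at least one weight is required")
    return sum(weights) / len(weights)


def effective_sample_size(weights):
    """Effective sample size: (sum of w)^2 / (sum of w^2).

    Equals len(weights) when all weights are equal, and approaches 1
    when a single weight dominates.
    """
    total = sum(weights)
    squares = sum(w * w for w in weights)
    if squares == 0:
        raise ValueError("weights must not all be zero")
    return total * total / squares


balanced = [0.25, 0.25, 0.25, 0.25]
skewed = [1.0, 0.0, 0.0, 0.0]
print(effective_sample_size(balanced), effective_sample_size(skewed))
```

Comparing proposals on effective sample size per unit of running time, rather than on effective sample size alone, is the kind of evaluation under which the abstract reports that the ranking of the standard proposals changes.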

