infinite sites model Latest Research Papers

Biallelic mutations in cancer genomes reveal local mutational determinants

10.1101/2021.03.29.437407 ◽

2021 ◽

Author(s):

Jonas Demeulemeester ◽

Stefan C Dentro ◽

Moritz Gerstung ◽

Peter Van Loo

Keyword(s):

Binding Sites ◽

Variant Calling ◽

Mutation Rates ◽

UV Damage ◽

Sequencing Data ◽

Whole Genomes ◽

Cancer Genomes ◽

Infinite Sites Model ◽

Biallelic Mutations ◽

Pan Cancer

The infinite sites model of molecular evolution requires that every position in the genome is mutated at most once. It is a cornerstone of tumour phylogenetic analysis, and is often implied when calling, phasing and interpreting variants or studying the mutational landscape as a whole. Here we identify 20,555 biallelic mutations, where the same base is mutated independently on both parental copies, in 722 (26.0%) bulk sequencing samples from the Pan-Cancer Analysis of Whole Genomes study (PCAWG). Biallelic mutations reveal UV damage hotspots at ETS and NFAT binding sites, and hypermutable motifs in POLE-mutant and other cancers. We formulate recommendations for variant calling and provide frameworks to model and detect biallelic mutations. These results highlight the need for accurate models of mutation rates and tumour evolution, as well as their inference from sequencing data.
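
To illustrate why local mutation rates matter for biallelic mutations, the following is a minimal sketch, not the authors' framework. It assumes each parental copy is mutated independently, with every site in a rate class having the same per-copy mutation probability. The function name, genome size, mutation burden, and hotspot proportions are all hypothetical values chosen for illustration.

```python
def expected_biallelic_sites(site_counts, per_copy_probs):
    """Expected number of sites mutated independently on both parental copies.

    site_counts[i] sites each carry per-copy mutation probability
    per_copy_probs[i]; the two copies are assumed to mutate independently,
    so a site is hit on both copies with probability p * p.
    """
    return sum(n * p * p for n, p in zip(site_counts, per_copy_probs))


# Hypothetical numbers, for illustration only.
genome_length = 3_000_000_000   # sites per haploid copy
burden = 30_000                 # mutations per parental copy

# Baseline: every site is equally likely to be mutated.
uniform = expected_biallelic_sites([genome_length], [burden / genome_length])

# Hotspot scenario: 1% of sites receive half of all mutations.
hot_sites = genome_length // 100
cold_sites = genome_length - hot_sites
hotspot = expected_biallelic_sites(
    [hot_sites, cold_sites],
    [0.5 * burden / hot_sites, 0.5 * burden / cold_sites],
)

print(f"uniform rate:  {uniform:.3f} expected biallelic sites")
print(f"with hotspots: {hotspot:.3f} expected biallelic sites")
```

With the same total burden, concentrating mutations in a small set of sites raises the expected number of double hits, which is one reason a uniform-rate model can understate violations of the infinite sites assumption.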

ILS-Aware Analyses of Retroelement Insertions in the Anomaly Zone

10.1101/2020.09.29.319038 ◽

2020 ◽

Author(s):

Erin K. Molloy ◽

John Gatesy ◽

Mark S. Springer

Keyword(s):

DNA Sequences ◽

Incomplete Lineage Sorting ◽

Species Tree ◽

Species Trees ◽

Lineage Sorting ◽

Short Branch ◽

Multispecies Coalescent ◽

Branch Lengths ◽

Tree Estimation ◽

Infinite Sites Model

A major shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting (ILS). Coalescence methods explicitly address this problem, but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescence methods, retroelement insertions have emerged as powerful phylogenomic markers for species tree estimation. We show that two recently proposed methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the species tree under the multispecies coalescent model, with retroelement insertions following a neutral infinite sites model of mutation. The accuracy of these and other methods for inferring species trees with retroelements has not been assessed in simulation studies. We simulate retroelements for four different species trees, including three with short branch lengths in the anomaly zone, and assess the performance of eight different methods for recovering the correct species tree. We also examine whether ASTRAL_BP recovers accurate internal branch lengths for internodes of various lengths (in coalescent units). Our results indicate that two recently proposed ILS-aware methods, ASTRAL_BP and SDPquartets, as well as the newly proposed ASTRID_BP, always recover the correct species tree on data sets with large numbers of retroelements even when there are extremely short species-tree branches in the anomaly zone. Dollo parsimony performed almost as well as these ILS-aware methods. By contrast, unordered parsimony, polymorphism parsimony, and MDC recovered the correct species tree in the case of a pectinate tree with four ingroup taxa in the anomaly zone, but failed to recover the correct tree in more complex anomaly-zone situations with additional lineages impacted by extensive incomplete lineage sorting. Camin-Sokal parsimony always reconstructed an incorrect tree in the anomaly zone. ASTRAL_BP accurately estimated branch lengths when internal branches were very short as in anomaly zone situations, but branch lengths were upwardly biased by more than 35% when species tree branches were longer. We derive a mathematical correction for these distortions, assuming the expected number of new retroelement insertions per generation is constant across the species tree. We also show that short branches do not need to be corrected even when this assumption does not hold; therefore, the branch length estimates produced by ASTRAL_BP may provide insight into whether an estimated species tree is in the anomaly zone.

The complexity of genome rearrangement combinatorics under the infinite sites model

Journal of Theoretical Biology ◽

10.1016/j.jtbi.2020.110335 ◽

2020 ◽

Vol 501 ◽

pp. 110335

Author(s):

Chris D. Greenman ◽

Luca Penso-Dolfin ◽

Taoyang Wu

Keyword(s):

Genome Rearrangement ◽

Infinite Sites Model

Efficiently Summarizing Relationships in Large Samples: A General Duality Between Statistics of Genealogies and Genomes

Genetics ◽

10.1534/genetics.120.303253 ◽

2020 ◽

Vol 215 (3) ◽

pp. 779-797 ◽

Cited By ~ 3

Author(s):

Peter Ralph ◽

Kevin Thornton ◽

Jerome Kelleher

Keyword(s):

Genome Sequence ◽

General Framework ◽

Simulated Data ◽

Genetic Mutation ◽

Data Set ◽

Genealogical Tree ◽

Computational Performance ◽

Infinite Sites Model ◽

Project Data ◽

And Function

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates sample weights within the genealogical tree at each position on the genome, which are then combined using a summary function; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding branch statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project data set, and discuss ways in which deviations may encode interesting biological signals.
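
The weight-and-summary-function idea can be sketched on a toy genealogy. This is a pure-Python illustration under stated assumptions, not the tree sequence implementation the abstract describes: the tree, its branch lengths, the helper names, and the choice of nucleotide diversity as the statistic are all hypothetical. Each sample carries weight 1, weights are accumulated up the tree, and the summary function gives the fraction of sample pairs separated by a branch or site.

```python
# Toy genealogy (hypothetical): child -> (parent, branch length).
# Samples A and B coalesce at time 1; their ancestor joins C at time 2.
TREE = {
    "A": ("AB", 1.0),
    "B": ("AB", 1.0),
    "AB": ("root", 1.0),
    "C": ("root", 2.0),
}
SAMPLES = ["A", "B", "C"]


def samples_below(tree, samples):
    """Accumulate a weight of 1 per sample on every node above it."""
    counts = {node: 0 for node in tree}
    for sample in samples:
        node = sample
        while node in tree:          # stop at the root, which has no parent
            counts[node] += 1
            node = tree[node][0]
    return counts


def diversity_summary(x, n):
    """Fraction of the n-choose-2 sample pairs split into x versus n - x."""
    return x * (n - x) / (n * (n - 1) / 2)


def branch_diversity(tree, samples):
    """Branch mode: branch lengths weighted by the summary function."""
    n = len(samples)
    counts = samples_below(tree, samples)
    return sum(
        length * diversity_summary(counts[node], n)
        for node, (_parent, length) in tree.items()
    )


def site_diversity(mutation_nodes, tree, samples):
    """Site mode: one term per mutation, placed on the branch above a node."""
    n = len(samples)
    counts = samples_below(tree, samples)
    return sum(diversity_summary(counts[node], n) for node in mutation_nodes)


print(branch_diversity(TREE, SAMPLES))            # mean pairwise path length
print(site_diversity(["A", "C"], TREE, SAMPLES))  # mutations above A and above C
```

Under the infinite sites model with mutations falling on branches in proportion to their length, the expected site statistic equals the mutation rate times the branch statistic, which is the duality the abstract refers to. Swapping in a different summary function would yield a different statistic from the same accumulated weights.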

Efficiently summarizing relationships in large samples: a general duality between statistics of genealogies and genomes

10.1101/779132 ◽

2019 ◽

Cited By ~ 3

Author(s):

Peter Ralph ◽

Kevin Thornton ◽

Jerome Kelleher

Keyword(s):

Genome Sequence ◽

General Framework ◽

Simulated Data ◽

Genetic Mutation ◽

Genealogical Tree ◽

Computational Performance ◽

Infinite Sites Model ◽

And Function ◽

General Duality ◽

Dual Site

As a genetic mutation is passed down across generations, it distinguishes those genomes that have inherited it from those that have not, providing a glimpse of the genealogical tree relating the genomes to each other at that site. Statistical summaries of genetic variation therefore also describe the underlying genealogies. We use this correspondence to define a general framework that efficiently computes single-site population genetic statistics using the succinct tree sequence encoding of genealogies and genome sequence. The general approach accumulates “sample weights” within the genealogical tree at each position on the genome, which are then combined using a “summary function”; different statistics result from different choices of weight and function. Results can be reported in three ways: by site, which corresponds to statistics calculated as usual from genome sequence; by branch, which gives the expected value of the dual site statistic under the infinite-sites model of mutation, and by node, which summarizes the contribution of each ancestor to these statistics. We use the framework to implement many currently-defined statistics of genome sequence (making the statistics’ relationship to the underlying genealogical trees concrete and explicit), as well as the corresponding “branch” statistics of tree shape. We evaluate computational performance using simulated data, and show that calculating statistics from tree sequences using this general framework is several orders of magnitude more efficient than optimized matrix-based methods in terms of both run time and memory requirements. We also explore how well the duality between site and branch statistics holds in practice on trees inferred from the 1000 Genomes Project dataset, and discuss ways in which deviations may encode interesting biological signals.

Accurate and efficient cell lineage tree inference from noisy single cell data: the maximum likelihood perfect phylogeny approach

Bioinformatics ◽

10.1093/bioinformatics/btz676 ◽

2019 ◽

Author(s):

Yufeng Wu

Keyword(s):

Single Cell ◽

Cell Lineage ◽

Large Data ◽

Genomic Variation ◽

Supplementary Information ◽

Perfect Phylogeny ◽

Tree Inference ◽

Lineage Tree ◽

Infinite Sites Model ◽

Cell Data

Motivation: Cells in an organism share a common evolutionary history, called the cell lineage tree. The cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually have non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data.

Results: In this article, we propose a new method called ScisTree, which infers the cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring the cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets.

Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree.

Supplementary information: Supplementary data are available at Bioinformatics online.
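
The link between the infinite sites model and a perfect phylogeny can be illustrated with the classical pairwise compatibility check for binary genotypes. This is a minimal sketch, not ScisTree's heuristic: it assumes error-free 0/1 genotypes with 0 as the ancestral state, and the function name and example matrices are hypothetical. Two sites conflict when the cells jointly show the patterns (0, 1), (1, 0) and (1, 1), which cannot arise on a single rooted tree if each site mutates at most once.

```python
from itertools import combinations


def admits_perfect_phylogeny(matrix):
    """Check pairwise site compatibility for a binary genotype matrix.

    matrix is a list of rows (cells); each row lists 0/1 genotypes per site,
    with 0 assumed to be the ancestral state. The matrix admits a rooted
    perfect phylogeny if no pair of sites shows all three of the patterns
    (0, 1), (1, 0) and (1, 1) across cells.
    """
    if not matrix:
        return True
    n_sites = len(matrix[0])
    conflict = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        observed = {(row[i], row[j]) for row in matrix}
        if conflict <= observed:
            return False
    return True


# Hypothetical genotype matrices: rows are cells, columns are sites.
compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
conflicting = [
    [1, 0],
    [0, 1],
    [1, 1],
]

print(admits_perfect_phylogeny(compatible))
print(admits_perfect_phylogeny(conflicting))
```

Real single cell data contain dropout and false positives, so a check like this would reject most observed matrices; that is why methods in this area work with genotype probabilities rather than hard calls.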

Accurate and Efficient Cell Lineage Tree Inference from Noisy Single Cell Data: the Maximum Likelihood Perfect Phylogeny Approach

10.1101/742395 ◽

2019 ◽

Author(s):

Yufeng Wu

Keyword(s):

Single Cell ◽

Cell Lineage ◽

Large Data ◽

Genomic Variation ◽

Perfect Phylogeny ◽

Tree Inference ◽

Lineage Tree ◽

Infinite Sites Model ◽

New Applications ◽

Cell Data

Cells in an organism share a common evolutionary history, called cell lineage tree. Cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling-based and can be very slow for large data. In this paper, we propose a new method called ScisTree, which infers cell lineage tree and calls genotypes from noisy single cell genotype data. Different from most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability: The program ScisTree is available for download at: https://github.com/yufengwudcs/[email protected]

Exact Likelihood Calculation under the Infinite Sites Model

Computation ◽

10.3390/computation3040701 ◽

2015 ◽

Vol 3 (4) ◽

pp. 701-713 ◽

Cited By ~ 1

Author(s):

Muhammad Faisal ◽

Andreas Futschik ◽

Claus Vogl

Keyword(s):

Infinite Sites Model

Hybrid-Lambda: simulation of multiple merger and Kingman gene genealogies in species networks and species trees

10.1101/023465 ◽

2015 ◽

Author(s):

Sha Zhu ◽

James H Degnan ◽

Sharyn J Goldstien ◽

Bjarki Eldon

Keyword(s):

Software Package ◽

Marine Invertebrates ◽

Gene Tree ◽

Species Trees ◽

High Fecundity ◽

Offspring Distribution ◽

Mutation Model ◽

Different Populations ◽

Infinite Sites Model ◽

Kingman’s Coalescent

Background: There has been increasing interest in coalescent models that admit multiple mergers of ancestral lineages, and in modelling hybridization and coalescence simultaneously. Results: Hybrid-Lambda is a software package that simulates gene genealogies under multiple merger and Kingman's coalescent processes within species networks or species trees. Hybrid-Lambda allows different coalescent processes to be specified for different populations, and allows for time to be converted between generations and coalescent units, by specifying a population size for each population. In addition, Hybrid-Lambda can generate simulated datasets, assuming the infinitely many sites mutation model, and compute the Fst statistic. As an illustration, we apply Hybrid-Lambda to infer the time of subdivision of certain marine invertebrates under different coalescent processes. Conclusions: Hybrid-Lambda makes it possible to investigate biogeographic concordance among high fecundity species exhibiting skewed offspring distribution. Keywords: hybridization; multiple merger; gene tree; coalescent; FST; infinite sites model; hybrid-lambda; skewed offspring distribution

Coalescent: an Open-Science framework for Importance Sampling in Coalescent theory

10.7287/peerj.preprints.395v1 ◽

2014 ◽

Author(s):

Susanta Tewari ◽

John L Spouge

Keyword(s):

Importance Sampling ◽

User Interaction ◽

Open Science ◽

Maximum Likelihood Estimates ◽

Effective Sample Size ◽

Data Sets ◽

Coalescent Theory ◽

Target Distribution ◽

Compute Data ◽

Infinite Sites Model

Importance sampling is widely used in coalescent theory to compute data likelihood. Efficient importance sampling requires a trial distribution close to the target distribution of the genealogies conditioned on the data. Moreover, an efficient proposal requires intuition about how the data influence the target distribution. Different proposals might work under similar conditions, and sometimes the corresponding concepts overlap extensively. Currently, there is no framework available for coalescent theory that evaluates proposals in an integrated manner. Typically, problems are not modeled, optimization is performed vigorously on limited datasets, user interaction requires thorough knowledge, and programs are not aligned with the current demands of open science. We have designed a general framework (http://coalescent.sourceforge.net) for importance sampling, to compute data likelihood under the infinite sites model of mutation. The framework models the necessary core concepts, comes integrated with several data sets of varying size, implements the standard competing proposals, and integrates tightly with our previous framework for calculating exact probabilities. The framework computes the data likelihood and provides maximum likelihood estimates of the mutation parameter. Well-known benchmarks in the coalescent literature validate the framework’s accuracy. We evaluate several proposals in the coalescent literature, to discover that the order of efficiency among three standard proposals changes when running time is considered along with the effective sample size. The framework provides an intuitive user interface with minimal clutter. For speed, the framework switches automatically to modern multicore hardware, if available. It runs on three major platforms (Windows, Mac and Linux). Extensive tests and coverage make the framework accessible to a large community.
