Data, methods and assumptions in phylogenetic inference

Australian Systematic Botany ◽

10.1071/sb9900009 ◽

1990 ◽

Vol 3 (1) ◽

pp. 9 ◽

Cited By ~ 10

Author(s):

JG West ◽

DP Faith

Keyword(s):

Quantitative Data ◽

Sequence Data ◽

Phylogenetic Inference ◽

Consistency Index ◽

Significance Testing ◽

Biological Knowledge ◽

Recent Developments ◽

Distance Data ◽

Evaluation Of Data ◽

Process Based Models

We consider data, methods and assumptions in phylogenetic inference under two main themes: assumptions and models relating to methodology (independent of the data), and assumptions and models relating to kinds of data (independent of the methods). Some aspects of methodological assumptions are well known, e.g. those contrasting cladistics and phenetics, but more detailed study of methodological assumptions is needed in relation to the analysis of quantitative data and of distance data. Debate over assumptions relating purely to methodology has perhaps overshadowed consideration of assumptions relating to data. Pattern-based evaluation of data includes congruence/consensus measures, iterative weighting schemes and the calculation of statistics such as the consistency index. These strategies are complementary to recent attempts to build more process-based models, for example in the rationale for weighting transversions over transitions in the analysis of sequence data. Recent developments in significance testing bridge the gap between pattern-based and process-based models. In both of these contexts biological knowledge and interpretation will play an important role.
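
The two data-evaluation ideas named in this abstract, weighting transversions over transitions and the consistency index, can be illustrated with a minimal Python sketch. The function names, the default weight of 2.0 and the example inputs are assumptions made for illustration only; they are not taken from the paper. The consistency index is computed here in its usual ensemble form, the sum of the minimum possible steps per character divided by the sum of the steps observed on the tree.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None if the bases match or are ambiguous."""
    a, b = a.upper(), b.upper()
    if a == b or a not in BASES or b not in BASES:
        return None
    # A transition stays within one chemical class (purine or pyrimidine).
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def weighted_difference(seq1, seq2, transversion_weight=2.0):
    """Sum site differences, counting each transversion more heavily than a transition.

    The weight is an illustrative assumption, not a recommended value.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    score = 0.0
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind == "transition":
            score += 1.0
        elif kind == "transversion":
            score += transversion_weight
    return score


def consistency_index(min_steps, observed_steps):
    """Ensemble consistency index: total minimum steps / total observed steps."""
    return sum(min_steps) / sum(observed_steps)


if __name__ == "__main__":
    print(weighted_difference("ACGT", "GCTT"))
    print(consistency_index([1, 1, 2], [1, 2, 3]))
```

A value of 1.0 for the consistency index means no character changes more often on the tree than it must; lower values indicate more homoplasy.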

Evolution of the C-Type Lectin-Like Receptor Genes of the DECTIN-1 Cluster in the NK Gene Complex

The Scientific World JOURNAL ◽

10.1100/2012/931386 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 16

Author(s):

Susanne Sattler ◽

Hormas Ghadially ◽

Erhard Hofer

Keyword(s):

Pattern Recognition ◽

Sequence Data ◽

Gene Duplications ◽

Gene Complex ◽

Adaptive Immune ◽

Recognition Function ◽

Recent Developments ◽

Evolutionary Emergence ◽

Related Proteins ◽

Nk Gene Complex

Pattern recognition receptors are crucial in initiating and shaping innate and adaptive immune responses and often belong to families of structurally and evolutionarily related proteins. The human C-type lectin-like receptors encoded in the DECTIN-1 cluster within the NK gene complex contain prominent receptors with pattern recognition function, such as DECTIN-1 and LOX-1. All members of this cluster share significant homology and are considered to have arisen from subsequent gene duplications. Recent developments in sequencing and the availability of comprehensive sequence data comprising many species showed that the receptors of the DECTIN-1 cluster are not only homologous to each other but also highly conserved between species. Even in Caenorhabditis elegans, genes displaying homology to the mammalian C-type lectin-like receptors have been detected. In this paper, we conduct a comprehensive phylogenetic survey and give an up-to-date overview of the currently available data on the evolutionary emergence of the DECTIN-1 cluster genes.

[34] Analysis of DNA sequence data: Phylogenetic inference

Methods in Enzymology - Molecular Evolution: Producing the Biochemical Data ◽

10.1016/0076-6879(93)24035-s ◽

1993 ◽

pp. 456-487 ◽

Cited By ~ 72

Author(s):

David M. Hillis ◽

Marc W. Allard ◽

Michael M. Miyamoto

Keyword(s):

Dna Sequence ◽

Sequence Data ◽

Phylogenetic Inference ◽

Dna Sequence Data

Genes and Other Samples of DNA Sequence Data for Phylogenetic Inference

Biological Bulletin ◽

10.2307/1542967 ◽

1999 ◽

Vol 196 (3) ◽

pp. 345-350 ◽

Cited By ~ 10

Author(s):

M. P. Cummings ◽

S. P. Otto ◽

J. Wakeley

Keyword(s):

Dna Sequence ◽

Sequence Data ◽

Phylogenetic Inference ◽

Dna Sequence Data

Synonymization of the male-based ant genus Phaulomyrma (Hymenoptera, Formicidae) with Leptanilla based upon Bayesian total-evidence phylogenetic inference

10.1101/2020.08.28.272799 ◽

2020 ◽

Author(s):

Zachary H. Griebenow

Keyword(s):

Sequence Data ◽

Phylogenetic Inference ◽

Morphological Characters ◽

Molecular Data ◽

Total Evidence ◽

Morphological Data ◽

Systematic Revision ◽

Dna Sequence Data ◽

Evidence Framework ◽

Nuclear Loci

Although molecular data have proven indispensable in confidently resolving the phylogeny of many clades across the tree of life, these data may be inaccessible for certain taxa. The resolution of taxonomy in the ant subfamily Leptanillinae is made problematic by the absence of DNA sequence data for leptanilline taxa that are known only from male specimens, including the monotypic genus Phaulomyrma Wheeler & Wheeler. Focusing upon the considerable diversity of undescribed male leptanilline morphospecies, the phylogeny of 35 putative morphospecies sampled from across the Leptanillinae, plus an outgroup, is inferred from 11 nuclear loci and 41 discrete male morphological characters using a Bayesian total-evidence framework, with Phaulomyrma represented by morphological data only. Based upon the results of this analysis Phaulomyrma is synonymized with Leptanilla Emery, and male-based diagnoses for Leptanilla that are grounded in phylogeny are provided, under both broad and narrow circumscriptions of that genus. This demonstrates the potential utility of a total-evidence approach in inferring the phylogeny of rare extant taxa for which molecular data are unavailable and begins a long-overdue systematic revision of the Leptanillinae that is focused on male material.

Machine learning using intrinsic genomic signatures for rapid classification of novel pathogens: COVID-19 case study

10.1101/2020.02.03.932350 ◽

2020 ◽

Cited By ~ 10

Author(s):

Gurjit S. Randhawa ◽

Maximillian P.M. Soltysiak ◽

Hadi El Roz ◽

Camila P.E. de Souza ◽

Kathleen A. Hill ◽

...

Keyword(s):

Machine Learning ◽

Death Rate ◽

Genomic Sequence ◽

Sequence Data ◽

Rank Correlation ◽

Taxonomic Classification ◽

Supervised Machine Learning ◽

Biological Knowledge ◽

Alignment Free

As of February 20, 2020, the 2019 novel coronavirus (renamed to COVID-19) spread to 30 countries with 2130 deaths and more than 75500 confirmed cases. COVID-19 is being compared to the infamous SARS coronavirus, which resulted, between November 2002 and July 2003, in 8098 confirmed cases worldwide with a 9.6% death rate and 774 deaths. Though COVID-19 has a death rate of 2.8% as of 20 February, the 75752 confirmed cases in a few weeks (December 8, 2019 to February 20, 2020) are alarming, with cases likely being under-reported given the comparatively longer incubation period. Such outbreaks demand elucidation of taxonomic classification and origin of the virus genomic sequence, for strategic planning, containment, and treatment. This paper identifies an intrinsic COVID-19 genomic signature and uses it together with a machine learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of whole COVID-19 genomes. The proposed method combines supervised machine learning with digital signal processing for genome analyses, augmented by a decision tree approach to the machine learning component, and a Spearman’s rank correlation coefficient analysis for result validation. These tools are used to analyze a large dataset of over 5000 unique viral genomic sequences, totalling 61.8 million bp. Our results support a hypothesis of a bat origin and classify COVID-19 as Sarbecovirus, within Betacoronavirus. Our method achieves high levels of classification accuracy and discovers the most relevant relationships among over 5,000 viral genomes within a few minutes, ab initio, using raw DNA sequence data alone, and without any specialized biological knowledge, training, gene or genome annotations. This suggests that, for novel viral and pathogen genome sequences, this alignment-free whole-genome machine-learning approach can provide a reliable real-time option for taxonomic classification.
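
To make the idea of an alignment-free comparison concrete, the sketch below builds a simple k-mer frequency vector for each sequence and compares two vectors with Spearman's rank correlation. This is a stand-in chosen for brevity, not the authors' pipeline: the abstract describes digital signal processing and supervised learning, whereas the k-mer signature, the function names and the choice of k here are all assumptions for illustration.

```python
from itertools import product


def kmer_signature(seq, k=3):
    """Relative frequency of every DNA k-mer, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows containing ambiguous bases are skipped
            counts[word] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    return [counts[w] / total for w in kmers]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    a = kmer_signature("ACGTACGTTGCAACGT", k=2)
    b = kmer_signature("ACGTTCGATGCAACGA", k=2)
    print(spearman(a, b))
```

Because no alignment is computed, the cost of building each signature grows linearly with sequence length, which is the property that makes approaches of this kind attractive for large collections of genomes.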

Significance testing and genomic inflation factor using high‐density genotypes or whole‐genome sequence data

Journal of Animal Breeding and Genetics ◽

10.1111/jbg.12419 ◽

2019 ◽

Vol 136 (6) ◽

pp. 418-429 ◽

Cited By ~ 1

Author(s):

Sanne den Berg ◽

Jérémie Vandenplas ◽

Fred A. Eeuwijk ◽

Marcos S. Lopes ◽

Roel F. Veerkamp

Keyword(s):

Genome Sequence ◽

Sequence Data ◽

High Density ◽

Whole Genome Sequence ◽

Significance Testing ◽

Whole Genome ◽

Genome Sequence Data ◽

Inflation Factor

Large-Scale Phylogenetic Analysis on Current HPC Architectures

Scientific Programming ◽

10.1155/2008/395908 ◽

2008 ◽

Vol 16 (2-3) ◽

pp. 255-270 ◽

Cited By ~ 8

Author(s):

Michael Ott ◽

Jaroslaw Zola ◽

Srinivas Aluru ◽

Andrew D. Johnson ◽

Daniel Janies ◽

...

Keyword(s):

Large Scale ◽

Sequence Data ◽

Phylogenetic Inference ◽

Speed Ratio ◽

Grand Challenge ◽

Nucleotide Polymorphisms ◽

Base Pairs ◽

Fine Grained ◽

Biological Studies ◽

Fine Grained Parallelism

Phylogenetic inference is considered a grand challenge in Bioinformatics due to its immense computational requirements. The increasing popularity and availability of large multi-gene alignments as well as comprehensive datasets of single nucleotide polymorphisms (SNPs) in current biological studies, coupled with rapid accumulation of sequence data in general, pose new challenges for high performance computing. By example of RAxML, which is currently among the fastest and most accurate programs for phylogenetic inference under the Maximum Likelihood (ML) criterion, we demonstrate how the phylogenetic ML function can be efficiently scaled to current supercomputer architectures like the IBM BlueGene/L (BG/L) and SGI Altix. This is achieved by simultaneous exploitation of coarse- and fine-grained parallelism which is inherent to every ML-based biological analysis. Performance is assessed using datasets consisting of 270 sequences and 566,470 base pairs (haplotype map dataset), and 2,182 sequences and 51,089 base pairs, respectively. To the best of our knowledge, these are the largest datasets analyzed under ML to date. Experimental results indicate that the fine-grained parallelization scales well up to 1,024 processors. Moreover, a larger number of processors can be efficiently exploited by a combination of coarse- and fine-grained parallelism. We also demonstrate that our parallelization scales equally well on an AMD Opteron cluster with a less favorable network latency to processor speed ratio. Finally, we underline the practical relevance of our approach by including a biological discussion of the results from the haplotype map dataset analysis, which revealed novel biological insights via phylogenetic inference.

Construction and annotation of large phylogenetic trees

Australian Systematic Botany ◽

10.1071/sb07006 ◽

2007 ◽

Vol 20 (4) ◽

pp. 287 ◽

Cited By ~ 16

Author(s):

Michael J. Sanderson

Keyword(s):

Phylogenetic Analysis ◽

Phylogenetic Trees ◽

De Novo ◽

Sequence Data ◽

Divergence Time ◽

Phylogenetic Inference ◽

Analysis Data ◽

Molecular Sequence Data ◽

Divergence Time Estimates ◽

Large Trees

Broad availability of molecular sequence data allows construction of phylogenetic trees with 1000s or even 10 000s of taxa. This paper reviews methodological, technological and empirical issues raised in phylogenetic inference at this scale. Numerous algorithmic and computational challenges have been identified surrounding the core problem of reconstructing large trees accurately from sequence data, but many other obstacles, both upstream and downstream of this step, are less well understood. Before phylogenetic analysis, data must be generated de novo or extracted from existing databases, compiled into blocks of homologous data with controlled properties, aligned, examined for the presence of gene duplications or other kinds of complicating factors, and finally, combined with other evidence via supermatrix or supertree approaches. After phylogenetic analysis, confidence assessments are usually reported, along with other kinds of annotations, such as clade names, or annotations requiring additional inference procedures, such as trait evolution or divergence time estimates. Prospects for partial automation of large-tree construction are also discussed, as well as risks associated with ‘outsourcing’ phylogenetic inference beyond the systematics community.

The effect of model choice on phylogenetic inference using mitochondrial sequence data: Lessons from the scorpions

Molecular Phylogenetics and Evolution ◽

10.1016/j.ympev.2006.11.017 ◽

2007 ◽

Vol 43 (2) ◽

pp. 583-595 ◽

Cited By ~ 24

Author(s):

Martin Jones ◽

Benjamin Gantenbein ◽

Victor Fet ◽

Mark Blaxter

Keyword(s):

Sequence Data ◽

Phylogenetic Inference ◽

Model Choice ◽

Mitochondrial Sequence

An Evaluation of Phylogenetic Workflows in Viral Molecular Epidemiology

10.1101/2020.11.24.396820 ◽

2020 ◽

Author(s):

Colin Young ◽

Sarah Meng ◽

Niema Moshiri

Keyword(s):

Sequence Alignment ◽

Multiple Sequence Alignment ◽

Sequence Data ◽

Phylogenetic Inference ◽

The Other ◽

Computational Techniques ◽

Viral Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Branch Lengths

The use of computational techniques to analyze viral sequence data and ultimately inform public health intervention has become increasingly common in the realm of epidemiology. These methods typically attempt to make epidemiological inferences based on multiple sequence alignments and phylogenies estimated from the raw sequence data. Like all estimation techniques, multiple sequence alignment and phylogenetic inference tools are error-prone, and the impacts of such imperfections on downstream epidemiological inferences are poorly understood. To address this, we executed multiple commonly-used workflows for conducting viral phylogenetic analyses on simulated viral sequence data modeling HIV, HCV, and Ebola, and we computed multiple methods of accuracy motivated by transmission clustering techniques. For multiple sequence alignment, MAFFT consistently outperformed MUSCLE and Clustal Omega in both accuracy and runtime. For phylogenetic inference, FastTree 2, IQ-TREE, RAxML-NG, and PhyML had similar topological accuracies, but branch lengths and pairwise distances were consistently most accurate in phylogenies inferred by RAxML-NG. However, FastTree 2 was orders of magnitude faster than the other tools, and when the other tools were used to optimize branch lengths along a fixed topology provided by FastTree 2 (i.e., no tree search), the resulting phylogenies had accuracies that were indistinguishable from their original counterparts, but with a fraction of the runtime. Our results indicate that an ideal workflow for viral phylogenetic inference is to (1) use MAFFT to perform MSA, (2) use FastTree 2 under the GTR model with discrete gamma-distributed site-rate heterogeneity to quickly obtain a reasonable tree topology, and (3) use RAxML-NG to optimize branch lengths along the fixed FastTree 2 topology.
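
The three-step workflow recommended above can be sketched as a small Python driver that assembles the command lines. The abstract does not give the exact invocations, so the flags chosen here (`--auto` for MAFFT, `-nt -gtr -gamma` for FastTree, and `--evaluate` with `GTR+G` for RAxML-NG), the executable names and the output file names are assumptions; check them against the versions installed locally before relying on them.

```python
import subprocess


def build_workflow(fasta, prefix):
    """Return (argv, stdout_path) pairs for alignment, topology and branch lengths.

    stdout_path is None when the tool writes its own output files.
    All flags and file names are illustrative assumptions.
    """
    aln = f"{prefix}.aln.fasta"
    tree = f"{prefix}.fasttree.nwk"
    return [
        # (1) Multiple sequence alignment; MAFFT prints the alignment to stdout.
        (["mafft", "--auto", fasta], aln),
        # (2) Fast topology under GTR with gamma rate heterogeneity; the tree goes to stdout.
        (["FastTree", "-nt", "-gtr", "-gamma", aln], tree),
        # (3) Branch-length optimization on the fixed topology, with no tree search.
        (["raxml-ng", "--evaluate", "--msa", aln, "--tree", tree,
          "--model", "GTR+G", "--prefix", f"{prefix}.raxml"], None),
    ]


def run_workflow(steps):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, check=True, stdout=out)


if __name__ == "__main__":
    # Print the planned commands; call run_workflow(...) to execute them.
    for argv, stdout_path in build_workflow("sequences.fasta", "run1"):
        redirect = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + redirect)
```

Separating command construction from execution lets the planned steps be inspected or logged before any of the external tools is run.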
