Phylogenetic Signal, Congruence, and Uncertainty across Bacteria and Archaea

Molecular Biology and Evolution ◽

10.1093/molbev/msab254 ◽

2021 ◽

Author(s):

Carolina A Martinez-Gutierrez ◽

Frank O Aylward

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Phylogenetic Reconstruction ◽

Sister Group ◽

Tree Of Life ◽

Marker Genes ◽

Sequence Composition ◽

Tree Construction ◽

Taxonomic Groups ◽

The Impact

Abstract Reconstruction of the Tree of Life is a central goal in biology. Although numerous novel phyla of bacteria and archaea have recently been discovered, inconsistent phylogenetic relationships are routinely reported, and many inter-phylum and inter-domain evolutionary relationships remain unclear. Here, we benchmark different marker genes often used in constructing multidomain phylogenetic trees of bacteria and archaea and present a set of marker genes that perform best for multidomain trees constructed from concatenated alignments. We use recently developed Tree Certainty metrics to assess the confidence of our results and to obviate the complications of traditional bootstrap-based metrics. Given the vastly disparate number of genomes available for different phyla of bacteria and archaea, we also assessed the impact of taxon sampling on multidomain tree construction. Our results demonstrate that biases in the representation of different taxonomic groups can dramatically impact the topology of resulting trees. Inspection of our highest-quality tree supports the division of most bacteria into Terrabacteria and Gracilicutes, with Thermotogota and Synergistota branching earlier from these superphyla. This tree also supports the inclusion of the Patescibacteria within the Terrabacteria as a sister group to the Chloroflexota instead of as a basal-branching lineage. For the Archaea, our tree supports three monophyletic lineages (DPANN, Euryarchaeota, and TACK/Asgard), although we note the basal placement of the DPANN may still represent an artifact caused by biased sequence composition. Our findings provide a robust and standardized framework for multidomain phylogenetic reconstruction that can be used to evaluate inter-phylum relationships and assess uncertainty in conflicting topologies of the Tree of Life.
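
The abstract refers to Tree Certainty metrics without giving formulas. The sketch below assumes the commonly used internode certainty (IC) definition of Salichos and Rokas, in which each internal branch is scored from the frequencies of its bipartition and the most prevalent conflicting bipartition, and tree certainty is the sum of IC over internodes. Whether this is the exact variant used in the paper is an assumption, and the function names and counts are hypothetical.

```python
import math


def internode_certainty(support: int, conflict: int) -> float:
    """IC for one internode.

    support  -- number of trees containing the bipartition
    conflict -- number of trees containing its most prevalent conflicting bipartition
    Returns 1.0 when there is no conflict and 0.0 when both are equally frequent.
    """
    total = support + conflict
    if total == 0:
        raise ValueError("at least one tree must contain either bipartition")
    ic = 1.0  # log2(2), the maximum entropy for two alternatives
    for count in (support, conflict):
        p = count / total
        if p > 0:
            ic += p * math.log2(p)
    return ic


def tree_certainty(internodes) -> float:
    """Sum of IC values over all internodes, given (support, conflict) pairs."""
    return sum(internode_certainty(s, c) for s, c in internodes)


# Hypothetical counts for three internal branches of one tree.
print(tree_certainty([(100, 0), (50, 50), (90, 10)]))
```

This sketch does not reverse the sign of IC when the conflicting bipartition is the more frequent one, which some implementations do.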


A Tree of Human Gut Bacterial Species and its Applications to Metagenomics and Metaproteomics Data Analysis

10.1101/2020.09.24.311720 ◽

2020 ◽

Author(s):

Moses Stamboulian ◽

Thomas G. Doak ◽

Yuzhen Ye

Keyword(s):

Gut Microbiome ◽

Phylogenetic Trees ◽

Bacterial Species ◽

Taxonomic Composition ◽

Marker Genes ◽

Missing Information ◽

Human Gut ◽

Taxonomic Profiling ◽

Tree Building ◽

The Impact

Abstract Background Recent advances in genome and metagenome sequencing have dramatically enriched the collection of genomes of bacterial species related to human health and diseases. In metagenomic studies phylogenetic trees are commonly used to depict, describe, and compare the bacterial members of the community under study. The most accurate tree-building algorithms now use large sets of marker genes taken from across genomes. However, many of the current bacterial genomes were assembled from metagenomic datasets (i.e., metagenome assembled genomes, MAGs), and often contain missing information. It is therefore important to study how well the phylogeny approach performs on such genomes. Further, phylogeny methods are not perfect and it is important to know how reliable an inferred tree is. Results Here we examined the impact of incompleteness of the genomes on the tree reconstruction, and we showed that phylogeny approaches including RAxML (which handles missing data explicitly) and FastTree generally performed well on a simulated collection of 400 genomes with missing information. As RAxML is computationally prohibitive for the much larger collections of gut genomes, we chose FastTree to build a unified tree of human-gut associated bacterial species (referred to as gut tree), including more than 3000 genomes, most of which are incomplete. We developed two downstream applications of the gut tree: peptide-centric analysis of metaproteomics datasets; and taxonomic characterization of metagenomic sequences. In both applications, the gut tree provided the basis for quantification of species composition at various taxonomic resolutions. Conclusions The gut tree presented in this study provides a useful framework for taxonomic profiling of the human gut microbiome. Including MAGs in the tree provides a more comprehensive representation of the microbial species diversity associated with the human gut, important for studying the taxonomic composition of the gut microbiome. Availability and Implementation The tree construction pipeline and downstream applications of the gut tree are freely available at https://github.com/mgtools/guttree.


Fast-evolving alignment sites are highly informative for reconstructions of deep Tree of Life phylogenies

10.1101/835504 ◽

2019 ◽

Author(s):

L. Thibério Rangel ◽

Gregory P. Fournier

Keyword(s):

Signal To Noise Ratio ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Phylogenetic Reconstruction ◽

Tree Of Life ◽

Signal To Noise ◽

Fast Analysis ◽

Substantial Impact ◽

Substitution Saturation ◽

Noise Ratio

Abstract The trimming of fast-evolving sites, often known as “slow-fast” analysis, is broadly used in microbial phylogenetic reconstruction under the assumption that fast-evolving sites do not retain accurate phylogenetic signal due to substitution saturation. Therefore, removing sites that have experienced multiple substitutions would improve the signal-to-noise ratio in phylogenetic analyses, with the remaining slower-evolving sites preserving a more reliable record of evolutionary relationships. Here we show that, contrary to this assumption, even the fastest-evolving sites, present in conserved proteins often used in Tree of Life studies, contain reliable and valuable phylogenetic information, and that the trimming of such sites can negatively impact the accuracy of phylogenetic reconstruction. Simulated alignments modeled after ribosomal protein datasets used in Tree of Life studies consistently show that slow-evolving sites are less likely to recover true bipartitions than even the fastest-evolving sites. Furthermore, site-specific substitution rates are positively correlated with the frequency of accurately recovered short-branched bipartitions, as slowly evolving sites are less likely to have experienced substitutions along these intervals. Using published Tree of Life sequence alignment datasets, we additionally show that both slow- and fast-evolving sites contain similarly inconsistent phylogenetic signals, and that, for fast-evolving sites, this inconsistency can be attributed to poor alignment quality. Furthermore, trimming fast sites, slow sites, or both is shown to have substantial impact on phylogenetic reconstruction across multiple evolutionary models. This is perhaps most evident in the resulting placements of Eukarya and Asgardarchaeota groups, which are especially sensitive to the implementation of different trimming schemes. Significance Statement It is common practice among comprehensive microbial phylogenetic studies to trim fast-evolving sites from the source alignment in the expectation of increasing the signal-to-noise ratio. Here we show that despite fast-evolving sites being more sensitive to parameter misspecifications than mid-rate evolving sites, such sensitivity is comparable to, if not smaller than, what we observe among slow-evolving sites. Through the use of both empirical and simulated datasets we also show that, besides the lack of evidence regarding the noisy nature of fast-evolving sites, such sites are of core importance for the reliable reconstruction of short-branched bipartitions. Such points are exemplified by the variations in the Eukarya+Archaea Tree of Life when subjective alignment trimming strategies are employed.
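
To make the trimming step concrete, the sketch below removes the fastest-evolving fraction of columns from a toy alignment. It is a minimal illustration, not the authors' procedure: per-column Shannon entropy stands in as a crude proxy for site rate (published analyses typically use model-based rate estimates), and the function names, gap symbols, and example sequences are assumptions.

```python
import math
from collections import Counter


def column_entropy(column: str) -> float:
    """Shannon entropy (bits) of one alignment column, ignoring gaps and unknowns."""
    counts = Counter(ch for ch in column if ch not in "-?X")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def trim_fastest_sites(alignment: dict, fraction: float) -> dict:
    """Drop the given fraction of columns with the highest entropy.

    alignment -- {sequence name: aligned sequence}, all of equal length
    fraction  -- share of columns to remove, between 0 and 1
    Remaining columns keep their original order.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = ["".join(seq[i] for seq in alignment.values()) for i in range(n_sites)]
    ranked = sorted(range(n_sites), key=lambda i: column_entropy(columns[i]))
    n_keep = n_sites - int(n_sites * fraction)
    keep = sorted(ranked[:n_keep])
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


toy = {"a": "AAAA", "b": "AACA", "c": "AAGT", "d": "AATT"}
print(trim_fastest_sites(toy, 0.25))
```

Sorting the same ranking in reverse would trim the slowest sites instead, which is the comparison the abstract argues should be made before discarding fast sites.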


Taxon sampling unequally affects individual nodes in a phylogenetic tree: consequences for model gene tree construction in SwissTree

10.1101/181966 ◽

2017 ◽

Cited By ~ 3

Author(s):

Brigitte Boeckmann ◽

David Dylus ◽

Sebastien Moretti ◽

Adrian Altenhoff ◽

Clément-Marie Train ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Gene Tree ◽

Evolutionary Relationship ◽

Taxon Sampling ◽

Gene Trees ◽

Data Types ◽

Large Gene ◽

Tree Construction ◽

Taxonomic Range

Abstract Medium to large phylogenetic gene trees constructed from datasets of different species density and taxonomic range are rarely topologically consistent because of missing phylogenetic signal, non-phylogenetic signal and error. In this study, we first use simulations to show that taxon sampling unequally affects nodes in a gene tree, which likely contributes to controversial conclusions from taxon sampling experiments and contradicting species phylogenies such as for the boreoeutherians. Hence, because it is unlikely that a large gene tree can be reconstructed correctly based on a single optimized dataset, we take a two-step approach for the construction of model gene trees. First, stable and unstable clades are identified by comparing phylogenetic trees inferred from multiple datasets and data types (nucleotide, amino acid, codon) from the same gene family. Subsequently, data subsets are optimized for the analysis of individual uncertain clades. Results are summarized in the form of a model tree that illustrates the evolutionary relationship of gene loci. A case study shows how a seemingly complex gene phylogeny becomes increasingly consistent with the reference species tree by attentive taxon sampling and subtree analysis. The procedure is progressively introduced to SwissTree (http://swisstree.vital-it.ch), a resource of high confidence model gene (locus) trees. Finally, we demonstrate the usefulness of SwissTree for orthology benchmarking.
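
The first step described above, separating stable from unstable clades by comparing trees from several datasets, can be sketched as a clade-frequency count. This is a simplified illustration under stated assumptions: each tree is represented only as a collection of clades (sets of tip names), and the stability threshold and function names are hypothetical rather than taken from SwissTree.

```python
from collections import Counter


def clade_frequencies(trees) -> dict:
    """Fraction of trees in which each clade appears.

    trees -- list of trees, each an iterable of clades (frozensets of tip names)
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    return {clade: n / len(trees) for clade, n in counts.items()}


def split_stable_unstable(trees, threshold: float = 1.0):
    """Return (stable, unstable) clade sets; stable clades reach the threshold frequency."""
    freqs = clade_frequencies(trees)
    stable = {c for c, f in freqs.items() if f >= threshold}
    unstable = set(freqs) - stable
    return stable, unstable


# Three hypothetical trees of the same gene family from different data types.
AB, AC, ABC = frozenset("AB"), frozenset("AC"), frozenset("ABC")
trees = [[AB, ABC], [AB, ABC], [AC, ABC]]
stable, unstable = split_stable_unstable(trees)
print(sorted(map(sorted, stable)), sorted(map(sorted, unstable)))
```

Clades flagged as unstable would then be the candidates for the second step, where data subsets are tuned for each uncertain clade.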


Towards a unified classification for human respiratory syncytial virus genotypes

Virus Evolution ◽

10.1093/ve/veaa052 ◽

2020 ◽

Vol 6 (2) ◽

Cited By ~ 1

Author(s):

Kaat Ramaekers ◽

Annabel Rector ◽

Lize Cuypers ◽

Philippe Lemey ◽

Els Keyaerts ◽

...

Keyword(s):

Respiratory Syncytial Virus ◽

Phylogenetic Trees ◽

Phylogenetic Signal ◽

Phylogenetic Reconstruction ◽

Hypervariable Region ◽

Human Respiratory Syncytial Virus ◽

Bootstrap Support ◽

Whole Genome ◽

Patristic Distance ◽

Syncytial Virus

Abstract Since the first human respiratory syncytial virus (HRSV) genotype classification in 1998, inconsistent conclusions have been drawn regarding the criteria that define HRSV genotypes and their nomenclature, challenging data comparisons between research groups. In this study, we aim to unify the field of HRSV genotype classification by reviewing the different methods that have been used in the past to define HRSV genotypes and by proposing a new classification procedure, based on well-established phylogenetic methods. All available complete HRSV genomes (>12,000 bp) were downloaded from GenBank and divided into the two subgroups: HRSV-A and HRSV-B. From whole-genome alignments, the regions that correspond to the open reading frame of the glycoprotein G and the second hypervariable region (HVR2) of the ectodomain were extracted. In the resulting partial alignments, the phylogenetic signal within each fragment was assessed. Maximum likelihood phylogenetic trees were reconstructed using the complete genome alignments. Patristic distances were calculated between all pairs of tips in the phylogenetic tree and summarized as a density plot in order to determine a cutoff value at the lowest point following the major distance peak. Our data show that neither the HVR2 fragment nor the G gene contains sufficient phylogenetic signal to perform reliable phylogenetic reconstruction. Therefore, whole-genome alignments were used to determine HRSV genotypes. We define a genotype using the following criteria: a bootstrap support of ≥70 per cent for the respective clade and a maximum patristic distance between all members of the clade of ≤0.018 substitutions per site for HRSV-A or ≤0.026 substitutions per site for HRSV-B. By applying this definition, we distinguish twenty-three genotypes within subtype HRSV-A and six genotypes within subtype HRSV-B. Applying the genotype criteria on subsampled data sets confirmed the robustness of the method.
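
The genotype definition above combines two tests on a clade: bootstrap support of at least 70 per cent and a maximum patristic distance within the clade of at most 0.018 (HRSV-A) or 0.026 (HRSV-B) substitutions per site. The sketch below applies both tests to a toy tree. The thresholds come from the abstract; the tree encoding (a child to parent map with branch lengths), the function names, and the example values are assumptions for illustration.

```python
from itertools import combinations


def path_to_root(node, parent) -> dict:
    """Map each ancestor of node (and node itself) to its distance from node."""
    dist = {node: 0.0}
    travelled = 0.0
    while node in parent:
        node, length = parent[node]
        travelled += length
        dist[node] = travelled
    return dist


def patristic(a, b, parent) -> float:
    """Sum of branch lengths on the path between tips a and b."""
    da, db = path_to_root(a, parent), path_to_root(b, parent)
    # The most recent common ancestor gives the smallest combined distance.
    return min(da[n] + db[n] for n in da if n in db)


def is_genotype(tips, bootstrap, parent, max_dist, min_bootstrap=70.0) -> bool:
    """Check both criteria: clade support and maximum within-clade patristic distance."""
    widest = max((patristic(a, b, parent) for a, b in combinations(tips, 2)), default=0.0)
    return bootstrap >= min_bootstrap and widest <= max_dist


# Toy tree: child -> (parent, branch length in substitutions per site).
parent = {"n1": ("root", 0.010), "t1": ("n1", 0.004), "t2": ("n1", 0.005), "t3": ("root", 0.030)}
HRSV_A_MAX = 0.018  # threshold stated in the abstract for HRSV-A
print(is_genotype(["t1", "t2"], 85, parent, HRSV_A_MAX))
```

For real trees, a phylogenetics library would normally supply the patristic distances; the hand-rolled traversal here only keeps the example self-contained.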


Phylogenetic Analyses of Sites in Different Protein Structural Environments Result in Distinct Placements of the Metazoan Root

Biology ◽

10.3390/biology9040064 ◽

2020 ◽

Vol 9 (4) ◽

pp. 64 ◽

Cited By ~ 6

Author(s):

Akanksha Pandey ◽

Edward L. Braun

Keyword(s):

Amino Acids ◽

Amino Acid ◽

Solvent Accessibility ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Sister Group ◽

Striking Difference ◽

Relative Solvent Accessibility ◽

Protein Datasets ◽

The Impact

Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life; this could reflect, at least in part, the poor fit of the models used to analyze heterogeneous datasets. Some of the heterogeneity may reflect the different patterns of selection on proteins based on their structures. To test that hypothesis, we developed a pipeline to divide phylogenomic protein datasets into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had distinct signals for the topology of the deepest branches in the metazoan tree. We focused on a dataset that appeared to have a mixture of signals and we found that the most striking difference in phylogenetic signal reflected relative solvent accessibility. Analyses of exposed sites (residues located on the surface of proteins) yielded a tree that placed ctenophores sister to all other animals whereas sites buried inside proteins yielded a tree with a sponge+ctenophore clade. These differences in phylogenetic signal were not ameliorated when we conducted analyses using a set of maximum-likelihood profile mixture models. These models are very similar to the Bayesian CAT model, which has been used in many analyses of deep metazoan phylogeny. In contrast, analyses conducted after recoding amino acids to limit the impact of deviations from compositional stationarity increased the congruence in the estimates of phylogeny for exposed and buried sites; after recoding, amino acid trees estimated using the exposed and buried sites both supported placement of ctenophores sister to all other animals. Although the central conclusion of our analyses is that sites in different structural environments yield distinct trees when analyzed using models of protein evolution, our amino acid recoding analyses also have implications for metazoan evolution. Specifically, our results add to the evidence that ctenophores are the sister group of all other animals and they further suggest that the placozoa+cnidaria clade found in some other studies deserves more attention. Taken as a whole, these results provide striking evidence that it is necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.


Synthesis of phylogeny and taxonomy into a comprehensive tree of life

10.1101/012260 ◽

2014 ◽

Cited By ~ 2

Author(s):

Cody Hinchliff ◽

Stephen A Smith ◽

James F Allman ◽

J Gordon Burleigh ◽

Ruchi Chaudhary ◽

...

Keyword(s):

Phylogenetic Trees ◽

Biological Diversity ◽

Phylogenetic Reconstruction ◽

Tree Of Life ◽

Grand Challenge ◽

Community Resources ◽

Starting Point ◽

Community Contribution ◽

Fundamental Research ◽

Digital Objects

Reconstructing the phylogenetic relationships that unite all lineages (the tree of life) is a grand challenge. The paucity of homologous character data across disparately related lineages currently renders direct phylogenetic inference untenable. To reconstruct a comprehensive tree of life we therefore synthesized published phylogenies, together with taxonomic classifications for taxa never incorporated into a phylogeny. We present a draft tree containing 2.3 million tips -- the Open Tree of Life. Realization of this tree required the assembly of two additional community resources: 1) a novel comprehensive global reference taxonomy; and 2) a database of published phylogenetic trees mapped to this taxonomy. Our open source framework facilitates community comment and contribution, enabling the tree to be continuously updated when new phylogenetic and taxonomic data become digitally available. While data coverage and phylogenetic conflict across the Open Tree of Life illuminate gaps in both the underlying data available for phylogenetic reconstruction and the publication of trees as digital objects, the tree provides a compelling starting point for community contribution. This comprehensive tree will fuel fundamental research on the nature of biological diversity, ultimately providing up-to-date phylogenies for downstream applications in comparative biology, ecology, conservation biology, climate change, agriculture, and genomics.


For common community phylogenetic analyses, go ahead and use synthesis phylogenies

10.1101/370353 ◽

2018 ◽

Cited By ~ 3

Author(s):

Daijiang Li ◽

Lauren Trotta ◽

Hannah E. Marx ◽

Julie M. Allen ◽

Miao Sun ◽

...

Keyword(s):

Phylogenetic Trees ◽

Phylogenetic Diversity ◽

Gene Sequence ◽

Sequence Data ◽

Phylogenetic Signal ◽

Phylogenetic Analyses ◽

Tree Of Life ◽

Pairwise Distance ◽

Highly Correlated ◽

Gene Sequence Data

Abstract Should we build our own phylogenetic trees based on gene sequence data, or can we simply use available synthesis phylogenies? This is a fundamental question that any study involving a phylogenetic framework must face at the beginning of the project. Building a phylogeny from gene sequence data (purpose-built phylogeny) requires more effort, expertise, and cost than subsetting an already available phylogeny (synthesis-based phylogeny). However, we still lack a comparison of how these two approaches to building phylogenetic trees influence common community phylogenetic analyses such as comparing community phylogenetic diversity and estimating trait phylogenetic signal. Here, we generated three purpose-built phylogenies and their corresponding synthesis-based trees (two from Phylomatic and one from the Open Tree of Life [OTL]). We simulated 1,000 communities and 12,000 continuous traits along each purpose-built phylogeny. We then compared the effects of different trees on estimates of phylogenetic diversity (alpha and beta) and phylogenetic signal (Pagel’s λ and Blomberg’s K). Synthesis-based phylogenies generally yielded higher estimates of phylogenetic diversity when compared to purpose-built phylogenies. However, resulting measures of phylogenetic diversity from both types of phylogenies were highly correlated (Spearman’s ρ > 0.8 in most cases). Mean pairwise distance (both alpha and beta) is the index that is most robust to the differences in tree construction that we tested. Measures of phylogenetic diversity based on the OTL showed the highest correlation with measures based on the purpose-built phylogenies. Trait phylogenetic signal estimates from synthesis-based phylogenies, especially from the OTL, were also highly correlated with estimates of Blomberg’s K, or close to estimates of Pagel’s λ, from purpose-built phylogenies when traits were simulated under Brownian motion. For commonly employed community phylogenetic analyses, our results justify taking advantage of recently developed and continuously improving synthesis trees, especially the Open Tree of Life.
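
Mean pairwise distance, the index the abstract reports as most robust, is simply the average phylogenetic distance over all pairs of species in a community. The sketch below computes the unweighted alpha form from a precomputed distance table. The data layout (distances keyed by unordered species pairs), the function name, and the example values are assumptions; abundance weighting and the beta form are not shown.

```python
from itertools import combinations


def mean_pairwise_distance(community, dist) -> float:
    """Average phylogenetic distance over all species pairs in a community.

    community -- iterable of species names
    dist      -- {frozenset({species_a, species_b}): phylogenetic distance}
    Returns 0.0 for communities with fewer than two species.
    """
    pairs = list(combinations(sorted(set(community)), 2))
    if not pairs:
        return 0.0
    return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)


# Hypothetical distances read from one tree: A and B are close, C is distant.
dist = {frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0}
print(mean_pairwise_distance(["A", "B", "C"], dist))
```

Running the same function with distance tables taken from a purpose-built tree and from a synthesis tree, then rank-correlating the results across communities, mirrors the kind of comparison the abstract describes.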


Physcraper: a Python package for continually updated phylogenetic trees using the Open Tree of Life

BMC Bioinformatics ◽

10.1186/s12859-021-04274-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Luna L. Sánchez-Reyes ◽

Martha Kandziora ◽

Emily Jane McTavish

Keyword(s):

Dna Sequences ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

Open Science ◽

Tree Of Life ◽

Matrix Assembly ◽

Life Project ◽

Character Matrix ◽

Molecular Dataset ◽

Phylogenetic Hypotheses

Abstract Background Phylogenies are a key part of research in many areas of biology. Tools that automate some parts of the process of phylogenetic reconstruction, mainly molecular character matrix assembly, have been developed for the benefit of both specialists in the field of phylogenetics and non-specialists. However, interpretation of results, comparison with previously available phylogenetic hypotheses, and selection of one phylogeny for downstream analyses and discussion still pose difficulties for those who are not specialists in either phylogenetic methods or a particular study group. Results Physcraper is a command-line Python program that automates the update of published phylogenies by adding public DNA sequences to underlying alignments of previously published phylogenies. It also provides a framework for straightforward comparison of published phylogenies with their updated versions, by leveraging tools from the Open Tree of Life project to link taxonomic information across databases. The program can be used by the nonspecialist, as a tool to generate phylogenetic hypotheses based on publicly available expert phylogenetic knowledge. Phylogeneticists and taxonomic group specialists will find it useful as a tool to facilitate molecular dataset gathering and comparison of alternative phylogenetic hypotheses (topologies). Conclusion The Physcraper workflow showcases the benefits of doing open science for phylogenetics, encouraging researchers to strive for better scientific sharing practices. Physcraper can be used with any OS and is released under an open-source license. Detailed instructions for installation and usage are available at https://physcraper.readthedocs.io.


Using mitochondrial genomes to infer phylogenetic relationships among the oldest extant winged insects (Palaeoptera)

10.1101/164459 ◽

2017 ◽

Cited By ~ 2

Author(s):

Sereina Rutschmann ◽

Ping Chen ◽

Changfa Zhou ◽

Michael T. Monaghan

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Phylogenetic Reconstruction ◽

Sister Group ◽

Data Matrix ◽

Mitochondrial Genomes ◽

Sampling Effort ◽

Reconstruction Method ◽

Marker Selection ◽

Phylogenetic Resolution

Abstract Phylogenetic relationships among the basal orders of winged insects remain unclear, in particular the relationship of the Ephemeroptera (mayflies) and the Odonata (dragonflies and damselflies) with the Neoptera. Insect evolution is thought to have followed rapid divergence in the distant past and phylogenetic reconstruction may therefore be susceptible to problems of taxon sampling, choice of outgroup, marker selection, and tree reconstruction method. Here we newly sequenced three mitochondrial genomes representing the two most diverse families of the Ephemeroptera, one of which is a basal lineage of the order. We then used an additional 90 insect mitochondrial genomes to reconstruct their phylogeny using Bayesian and maximum likelihood approaches. Bayesian analysis supported a basal Odonata hypothesis, with Ephemeroptera as sister group to the remaining insects. This was only supported when using an optimized data matrix from which rogue taxa and terminals affected by long-branch attraction were removed. None of our analyses supported a basal Ephemeroptera hypothesis or Ephemeroptera + Odonata as a monophyletic clade sister to other insects (i.e., the Palaeoptera hypothesis). Our newly sequenced mitochondrial genomes of Baetis rutilocylindratus, Cloeon dipterum, and Habrophlebiodes zijinensis had a complete set of protein coding genes and a conserved orientation except for two inverted tRNAs in H. zijinensis. Increased mayfly sampling, removal of problematic taxa, and a Bayesian phylogenetic framework were needed to infer phylogenetic relationships within the three ancient insect lineages of Odonata, Ephemeroptera, and Neoptera. Pruning of rogue taxa improved the number of supported nodes in all phylogenetic trees. Our results add to previous evidence for the Odonata hypothesis and indicate that the phylogenetic resolution of the basal insects can be improved with more data and sampling effort.


DNA Analyses Have Revolutionized Studies on the Taxonomy and Evolution in Birds

10.5772/intechopen.97013 ◽

2021 ◽

Author(s):

Michael Wink

Keyword(s):

Dna Sequences ◽

Biochemical Markers ◽

Phylogenetic Trees ◽

Morphological Characters ◽

Tree Of Life ◽

Marker Genes ◽

Species Complexes ◽

Systematic Relationships ◽

Next Generation Sequencing Ngs ◽

Generation Sequencing

Whereas Linné aimed to classify all species of our planet by a unique binomial Latin name, later generations of taxonomists and systematists intended to place the taxa in a natural system according to their phylogeny. This also happened in ornithology, and scientists are still working toward the ultimate “Avian Tree of Life”. Formerly, systematic relationships were studied by comparing morphological characters. Since adaptive character evolution occurred frequently, convergences could lead to misleading conclusions. An alternative to morphological characters are biochemical markers, especially nucleotide sequences of marker genes or of complete genomes. They are less prone to convergent evolution. The use of DNA sequences of marker genes for bird systematics started around 1990. The introduction of Next Generation Sequencing (NGS) facilitated the sequence analysis of large parts of bird genomes and the reconstruction of the Avian Tree of Life. The genetic analyses allowed the reconstruction of phylogenetic trees and the detection of monophyletic clades, which should be the basis for a phylogenetic classification. In consequence, several orders, families and genera of birds had to be rearranged. In addition, a number of species were split into several new species because DNA data could reveal hidden lineages in cryptic species or in species complexes.
