Genomic characterization and curation of UCEs improves species tree reconstruction

2019 ◽  
Author(s):  
Matthew H. Van Dam ◽  
James B. Henderson ◽  
Lauren Esposito ◽  
Michelle Trautwein

ABSTRACT Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes is agnostic to genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here we characterized UCEs from 12 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated 4 different sets of UCE markers by genomic category from 5 different studies including: birds, mammals, fish, Hymenoptera (ants, wasps and bees) and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by 2 or more UCEs, corresponding to non-overlapping segments of a single gene. We considered these UCEs to be non-independent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging co-genic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all datasets. Increased loci length appears to drive this increase in bootstrap support. Additionally, we found that gene trees generated from merged UCEs were more accurate than those generated by unmerged and randomly merged UCEs, based on our simulation study. This modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses.

2020 ◽  
Author(s):  
Matthew H Van Dam ◽  
James B Henderson ◽  
Lauren Esposito ◽  
Michelle Trautwein

Abstract Ultraconserved genomic elements (UCEs) are generally treated as independent loci in phylogenetic analyses. The identification pipeline for UCE probes does not require prior knowledge of genetic identity, only selecting loci that are highly conserved, single copy, without repeats, and of a particular length. Here, we characterized UCEs from 11 phylogenomic studies across the animal tree of life, from birds to marine invertebrates. We found that within vertebrate lineages, UCEs are mostly intronic and intergenic, while in invertebrates, the majority are in exons. We then curated four different sets of UCE markers by genomic category from five different studies including: birds, mammals, fish, Hymenoptera (ants, wasps, and bees), and Coleoptera (beetles). Of genes captured by UCEs, we find that many are represented by two or more UCEs, corresponding to nonoverlapping segments of a single gene. We considered these UCEs to be nonindependent, merged all UCEs that belonged to a particular gene, constructed gene and species trees, and then evaluated the subsequent effect of merging cogenic UCEs on gene and species tree reconstruction. Average bootstrap support for merged UCE gene trees was significantly improved across all data sets, apparently driven by the increase in loci length. Additionally, we conducted simulations and found that gene trees generated from merged UCEs were more accurate than those generated by unmerged UCEs. As loci length improves gene tree accuracy, this modest degree of UCE characterization and curation impacts downstream analyses and demonstrates the advantages of incorporating basic genomic characterizations into phylogenomic analyses. [Anchored hybrid enrichment; ants; ASTRAL; bait capture; carangimorph; Coleoptera; conserved nonexonic elements; exon capture; gene tree; Hymenoptera; mammal; phylogenomic markers; songbird; species tree; ultraconserved elements; weevils.]
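The merging step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `merge_cogenic_uces`, the dictionary layout, and the UCE-to-gene map are all hypothetical, and the sketch concatenates UCEs in sorted ID order rather than genomic order.

```python
from collections import defaultdict


def merge_cogenic_uces(alignments, uce_to_gene):
    """Concatenate UCE alignments that map to the same gene.

    alignments:  {uce_id: {taxon: aligned_sequence}}  (hypothetical layout)
    uce_to_gene: {uce_id: gene_id}; UCEs missing from the map stay separate.
    """
    # Group UCE ids by gene; an unmapped UCE forms its own single-member group.
    groups = defaultdict(list)
    for uce_id in sorted(alignments):
        groups[uce_to_gene.get(uce_id, uce_id)].append(uce_id)

    merged = {}
    for locus, uce_ids in groups.items():
        taxa = sorted({t for u in uce_ids for t in alignments[u]})
        merged[locus] = {t: "" for t in taxa}
        for u in uce_ids:
            length = len(next(iter(alignments[u].values())))
            for t in taxa:
                # Pad taxa absent from this UCE with gap characters.
                merged[locus][t] += alignments[u].get(t, "-" * length)
    return merged


# Toy data: two UCEs fall in the same (hypothetical) gene, one is unassigned.
alignments = {
    "uce-1": {"A": "ACGT", "B": "ACGA"},
    "uce-2": {"A": "TT", "C": "TA"},
    "uce-3": {"A": "GGG", "B": "GGC"},
}
uce_to_gene = {"uce-1": "geneX", "uce-2": "geneX"}
merged = merge_cogenic_uces(alignments, uce_to_gene)
```

Each merged locus is longer than its component UCEs, which is the property the abstract links to higher bootstrap support.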


2019 ◽  
Vol 69 (2) ◽  
pp. 384-391 ◽  
Author(s):  
Maryam Rabiee ◽  
Siavash Mirarab

Abstract Phylogenomic analyses have increasingly adopted species tree reconstruction using methods that account for gene tree discordance using pipelines that require both human effort and computational resources. As the number of available genomes continues to increase, a new problem is facing researchers. Once more species become available, they have to repeat the whole process from the beginning because updating species trees is currently not possible. However, the de novo inference can be prohibitively costly in human effort or machine time. In this article, we introduce INSTRAL, a method that extends ASTRAL to enable phylogenetic placement. INSTRAL is designed to place a new species on an existing species tree after sequences from the new species have already been added to gene trees; thus, INSTRAL is complementary to existing placement methods that update gene trees. [ASTRAL; ILS; phylogenetic placement; species tree reconstruction.]


1990 ◽  
Vol 3 (1) ◽  
pp. 111 ◽  
Author(s):  
RH Crozier

Mitochondrial DNA (mtDNA) is clonally and maternally inherited in all animals and in most plants. Mitochondrial gene content is similar although not identical in all eukaryotes. Because of these characteristics, mtDNA has a number of features useful to systematists for all levels of evolutionary divergence. Clonal inheritance leads to unusual confidence in constructing gene trees which are useful in population-level studies, such as in the detection of population subdivision. Maternal inheritance presents the opportunity to distinguish paternal from maternal gene flow. The clonal, or single-gene, nature of mtDNA inheritance leads to consideration of the expected convergence between gene- and species-trees. For closely related populations or species, it is desirable to use several genes to be sure that the correct species-tree is discovered; this means that, although mtDNA will be the most precise guide to the species tree because of its lower effective population size, nuclear genes should also be used in such studies. Although restriction fragment length polymorphisms dominated the field until recently, sequencing following DNA amplification using the polymerase chain reaction is now easier and opens up the use of preserved specimens to molecular systematists. Because mitochondrial genes evolve at different rates, one of appropriate rate can be selected for almost any phylogenetic problem.


2021 ◽  
Author(s):  
Caesar Al Jewari ◽  
Sandra L Baldauf

Phylogenomics uses multiple genetic loci to reconstruct evolutionary trees, under the stipulation that all combined loci share a common phylogenetic history, i.e., they are congruent. Congruence is primarily evaluated via single-gene trees, but these trees invariably lack sufficient signal to resolve deep nodes, making it difficult to assess congruence at these levels. Two methods were developed to systematically assess congruence in multi-locus data. Protocol_1 uses gene jackknifing to measure deviation from a central mean to identify taxon-specific incongruencies in the form of persistent outliers. Protocol_2 assesses congruence at the sub-gene level using a sliding window. Both protocols were tested on a controversial data set of 76 mitochondrial proteins previously used in various combinations to assess the eukaryote root. Protocol_1 showed a concentration of outliers in under-sampled taxa, including the pivotal taxon Discoba. Further analysis of Discoba using Protocol_2 detected a surprising number of apparently exogenous gene fragments, some of which overlap with Protocol_1 outliers and others that do not. Phylogenetic analyses of the full data using the static LG-gamma evolutionary model support a neozoan-excavate root for eukaryotes (Discoba sister), which rises to 99-100% bootstrap support with data masked according to either Protocol_1 or Protocol_2. In contrast, site-heterogeneous (mixture) models perform inconsistently with these data, yielding all three possible roots depending on presence/absence/type of masking and/or extent of missing data. The neozoan-excavate root places Amorphea (including animals and fungi) and Diaphoretickes (including plants) as more closely related to each other than either is to Discoba (Jakobida, Heterolobosea, and Euglenozoa), regardless of the presence/absence of additional taxa.
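The abstract does not say what Protocol_2 measures in each window, so the following Python sketch only illustrates the general idea of scanning an alignment at the sub-gene level. The per-window statistic (proportion of mismatches against a reference sequence) and the name `window_p_distances` are assumptions made for illustration, not the authors' method.

```python
def window_p_distances(query, reference, size, step):
    """Proportion of mismatched sites per window, ignoring gapped positions.

    query, reference: two rows of the same alignment (equal length).
    Returns a list of (window_start, distance) pairs; distance is None
    when a window contains no ungapped site pairs.
    """
    if len(query) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(query) - size + 1, step):
        pairs = [
            (q, r)
            for q, r in zip(query[start:start + size], reference[start:start + size])
            if q != "-" and r != "-"
        ]
        if pairs:
            dist = sum(q != r for q, r in pairs) / len(pairs)
        else:
            dist = None
        results.append((start, dist))
    return results


# A fragment that matches the reference in its first half and not its second.
profile = window_p_distances("AAAACCCC", "AAAAGGGG", size=4, step=4)
```

A window whose statistic departs sharply from the rest of the gene would be a candidate for closer inspection; how such windows are actually scored and masked is specific to the published protocol.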


2017 ◽  
Author(s):  
Damien M. de Vienne ◽  
Fran Supek ◽  
Toni Gabaldon

Abstract Background Overtraining occurs when an optimization process is applied for too many steps, leading to a model describing noise in addition to the signal present in the data. This effect may affect typical approaches for species tree reconstruction that use maximum likelihood optimization procedures on a small sample of concatenated genes. In this context, overtraining may result in trees better describing the specific evolutionary history of the sampled genes rather than the sought evolutionary relationships among the species. Results Using a cross-validation-like approach on real and simulated datasets we showed that overtraining occurs in a significant fraction of cases, leading to species trees that are more distant from a gold-standard reference tree than a previously considered (and rejected) solution in the optimization process. However, we show that the shape of the likelihood curve is informative of the optimal stopping point. As expected, overtraining is aggravated in smaller gene samples and in datasets with increased levels of topological variation among gene trees, but occurs also in controlled, simulated scenarios where a common underlying topology is enforced. Conclusions Overtraining is frequent in species tree reconstruction and leads to a final tree that is worse in describing the evolutionary relationships of the species under study than an earlier (and rejected) solution encountered during the likelihood optimization process. This result should help develop specific methods for species tree reconstruction in the future, and may improve our understanding of the complexity of tree likelihood landscapes.


Phytotaxa ◽  
2014 ◽  
Vol 166 (1) ◽  
pp. 1 ◽  
Author(s):  
KENNETH BAUTERS ◽  
ISABEL LARRIDON ◽  
MARC REYNDERS ◽  
PIETER ASSELMAN ◽  
ALEXANDER VRIJDAGHS ◽  
...  

Recent molecular phylogenetic analyses showed that Lipocarpha and Volkiella are nested in a paraphyletic Cyperus s.s. and therefore should be viewed as part of a broadly circumscribed genus Cyperus (Cyperaceae). In this paper, molecular phylogenetic analyses of Lipocarpha and Volkiella based on nuclear ribosomal ETS1f and plastid rpl32-trnL and trnH-psbA markers are presented. Separate gene trees as well as a species tree were constructed. Results indicate a polyphyletic Lipocarpha s.l. consisting of a paraphyletic core Lipocarpha s.s in which the monotypic Volkiella is included, and a small non-related clade with species formerly placed in the genus Rikliella. Core Lipocarpha s.s. encompasses six clades, which can be distinguished based on morphological characters. Floral developmental data for Lipocarpha rehmannii (the type of Rikliella) confirms that this species is not a true Lipocarpha s.s. Based on our findings, Lipocarpha s.l. and Volkiella are here included in Cyperus subg. Cyperus. New names and combinations for Lipocarpha s.l. and Volkiella species and a new sectional classification for these species are proposed. 


2020 ◽  
Vol 15 (1) ◽  
Author(s):  
Sarah Christensen ◽  
Erin K. Molloy ◽  
Pranjal Vachaspati ◽  
Ananya Yammanuru ◽  
Tandy Warnow

Abstract Motivation Estimated gene trees are often inaccurate, due to insufficient phylogenetic signal in the single gene alignment, among other causes. Gene tree correction aims to improve the accuracy of an estimated gene tree by using computational techniques along with auxiliary information, such as a reference species tree or sequencing data. However, gene trees and species trees can differ as a result of gene duplication and loss (GDL), incomplete lineage sorting (ILS), and other biological processes. Thus gene tree correction methods need to take estimation error as well as gene tree heterogeneity into account. Many prior gene tree correction methods have been developed for the case where GDL is present. Results Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to ILS and/or HGT. We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-optimal tree refinement and completion (RF-OTRC) Problem, which seeks a refinement and completion of a singly-labeled gene tree with respect to a given singly-labeled species tree so as to minimize the Robinson−Foulds (RF) distance. Our extensive simulation study on 68,000 estimated gene trees shows that TRACTION matches or improves on the accuracy of well-established methods from the GDL literature when HGT and ILS are both present, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. We also show that a naive generalization of the RF-OTRC problem to multi-labeled trees is possible, but can produce misleading results where gene tree heterogeneity is due to GDL.
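The Robinson-Foulds distance that RF-OTRC minimizes can be made concrete with a short Python sketch. This is not TRACTION itself: it only computes the RF distance between two trees on the same leaf set, and the nested-tuple tree encoding and function names are assumptions chosen for brevity.

```python
def leaf_set(tree):
    """Leaves of a tree written as nested tuples with string leaf labels."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Non-trivial splits of the tree, viewed as unrooted."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Skip trivial splits (a single leaf on either side) and the root.
        if 1 < len(below) < len(all_leaves) - 1:
            # Keep the side without the anchor leaf so each split has one form.
            splits.add(all_leaves - below if anchor in below else below)
        return below

    visit(tree)
    return splits


def rf_distance(tree1, tree2):
    """Number of splits present in exactly one of the two trees."""
    return len(bipartitions(tree1) ^ bipartitions(tree2))


gene_tree = (("A", "C"), ("B", "D"), "E")
species_tree = (("A", "B"), ("C", "D"), "E")
distance = rf_distance(gene_tree, species_tree)
```

Because splits are stored in a canonical form, rooting the same unrooted tree at a different edge does not change the result.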


2020 ◽  
Vol 37 (11) ◽  
pp. 3292-3307
Author(s):  
Chao Zhang ◽  
Celine Scornavacca ◽  
Erin K Molloy ◽  
Siavash Mirarab

Abstract Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programing. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
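For orientation, the single-copy quartet score that ASTRAL-style methods build on can be sketched in a few lines of Python. This brute-force version is not ASTRAL-Pro's measure (it ignores duplication, loss, orthology, and paralogy) and not its dynamic programing; the tree encoding and function names are assumptions.

```python
from itertools import combinations


def leaves(node):
    """Leaves of a tree written as nested tuples with string leaf labels."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(tree):
    """Leaf sets found below every internal node of the tree."""
    found = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.append(below)
        return below

    visit(tree)
    return found


def quartet_topology(tree, quartet):
    """Pairing the tree induces on four of its taxa, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clades(tree):
        inside = clade & q
        if len(inside) == 2:
            # An edge separating two of the taxa from the other two fixes the pairing.
            return frozenset([inside, q - inside])
    return None


def shared_quartets(tree1, tree2):
    """Count four-taxon subsets resolved identically by both trees."""
    taxa = sorted(leaves(tree1) & leaves(tree2))
    score = 0
    for quartet in combinations(taxa, 4):
        topology = quartet_topology(tree1, quartet)
        if topology is not None and topology == quartet_topology(tree2, quartet):
            score += 1
    return score


gene_tree = ((("A", "C"), "B"), ("D", "E"))
species_tree = ((("A", "B"), "C"), ("D", "E"))
score = shared_quartets(gene_tree, species_tree)
```

Enumerating all quartets grows with the fourth power of the number of taxa, which is one reason practical tools rely on dynamic programing instead.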


2018 ◽  
Author(s):  
Maryam Rabiee ◽  
Siavash Mirarab

ABSTRACT Phylogenomic analyses have increasingly adopted species tree reconstruction using methods that account for gene tree discordance using pipelines that require both human effort and computational resources. As the number of available genomes continues to increase, a new problem is facing researchers. Once more species become available, they have to repeat the whole process from the beginning because updating species trees is currently not possible. However, the de novo inference can be prohibitively costly in human effort or machine time. In this paper, we introduce INSTRAL, a method that extends ASTRAL to enable phylogenetic placement. INSTRAL is designed to place a new species on an existing species tree after sequences from the new species have already been added to gene trees; thus, INSTRAL is complementary to existing placement methods that update gene trees.


Author(s):  
Chao Zhang ◽  
Celine Scornavacca ◽  
Erin K. Molloy ◽  
Siavash Mirarab

Abstract Species tree inference via summary methods that combine gene trees has become an increasingly common analysis in recent phylogenomic studies. This broad adoption has been partly due to the greater availability of genome-wide data and ample recognition that gene trees and species trees can differ due to biological processes such as gene duplication and gene loss. This increase has also been encouraged by the recent development of accurate and scalable summary methods, such as ASTRAL. However, most of these methods, including ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. In this paper, we introduce a measure of quartet similarity between single-copy and multi-copy trees (accounting for orthology and paralogy relationships) that can be optimized via a scalable dynamic programming algorithm similar to the one used by ASTRAL. We then present a new quartet-based species tree inference method: ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs). By studying its performance on an extensive collection of simulated datasets and on a real plant dataset, we show that ASTRAL-Pro is more accurate than alternative methods when gene trees differ from the species tree due to the simultaneous presence of gene duplication, gene loss, incomplete lineage sorting, and estimation errors.

