Pitfalls in supermatrix phylogenomics

Author(s):  
Hervé Philippe ◽  
Damien M. de Vienne ◽  
Vincent Ranwez ◽  
Béatrice Roure ◽  
Denis Baurain ◽  
...  

In the mid-2000s, molecular phylogenetics turned into phylogenomics, a development that improved the resolution of phylogenetic trees through a dramatic reduction in stochastic error. While some then predicted “the end of incongruence”, it soon appeared that analysing large amounts of sequence data without an adequate model of sequence evolution amplifies systematic error and leads to phylogenetic artefacts. With the increasing flood of (sometimes low-quality) genomic data resulting from the rise of high-throughput sequencing, a new type of error has emerged. Termed here “data errors”, it lumps together several kinds of issues affecting the construction of phylogenomic supermatrices (e.g., sequencing and annotation errors, contaminant sequences). While easy to deal with at a single-gene scale, such errors become very difficult to avoid at the genomic scale, both because hand curating thousands of sequences is prohibitively time-consuming and because the suitable automated bioinformatics tools are still in their infancy. In this paper, we first review the pitfalls affecting the construction of supermatrices and the strategies to limit their adverse effects on phylogenomic inference. Then, after discussing the relative non-issue of missing data in supermatrices, we briefly present the approaches commonly used to reduce systematic error.

2019 ◽  
Author(s):  
Sarah Unruh

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Phylogenetic trees show us how organisms are related and provide frameworks for studying and testing evolutionary hypotheses. To better understand the evolution of orchids and their mycorrhizal fungi, I used high-throughput sequencing data and bioinformatic analyses to build phylogenetic hypotheses. In Chapter 2, I used transcriptome sequences both to build a phylogeny of the slipper orchid genera and to confirm the placement of a polyploidy event at the base of the orchid family. Polyploidy is hypothesized to be a strong driver of evolution and a source of unique traits, so confirming this event brings us closer to explaining extant orchid diversity. The list of orthologous genes generated from this study will provide a less expensive and more powerful method for researchers examining the evolutionary relationships in Orchidaceae. In Chapter 3, I generated genomic sequence data for 32 fungal isolates that were collected from orchids across North America. I inferred the first multi-locus nuclear phylogenetic tree for these fungal clades. The phylogenetic structure of these fungi will improve the taxonomy of these clades by providing evidence for new species and for revising problematic species designations. A robust taxonomy is necessary for studying the role of fungi in the orchid mycorrhizal symbiosis. In Chapter 4, I summarize my work and outline the future directions of my lab at Illinois College, including addressing the remaining aims of my Community Sequencing Proposal with the Joint Genome Institute by analyzing the 15 fungal reference genomes I generated during my PhD. Together, these chapters are the start of a life-long research project into the evolution and function of the orchid/fungal symbiosis.


2021 ◽  
Vol 17 (1) ◽  
pp. e1008678
Author(s):  
Carlos Valiente-Mullor ◽  
Beatriz Beamud ◽  
Iván Ansari ◽  
Carlos Francés-Cuesta ◽  
Neris García-González ◽  
...  

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.


2021 ◽  
Author(s):  
Hamid Reza Ghanavi ◽  
Victoria Twort ◽  
Tobias Joannes Hartman ◽  
Reza Zahiri ◽  
Niklas Wahlberg

The use of molecular data to study the evolutionary history of different organisms revolutionized the field of systematics. Now, with the appearance of high-throughput sequencing (HTS) technologies, more and more genetic sequence data is available. One of the important sources of genetic data for phylogenetic analyses has been mitochondrial DNA. The limitations of mitochondrial DNA for the study of phylogenetic relationships have been thoroughly explored in the age of single-locus phylogenies. Now, with the appearance of genomic-scale data, more and more mitochondrial genomes are available. Here we assemble 47 mitochondrial genomes using whole-genome Illumina short reads of representatives of the family Erebidae (Lepidoptera), in order to evaluate the accuracy of mitochondrial genome application in resolving deep phylogenetic relationships. We find that mitogenomes are inadequate for resolving subfamily-level relationships in Erebidae, but given good taxon sampling, we see their potential in resolving lower-level phylogenetic relationships.


2021 ◽  
Vol 22 (9) ◽  
Author(s):  
Irvan Fadli Wanda ◽  
Nina Ratna Djuita ◽  
Tatik Chikmawati

Abstract. Wanda IF, Djuita NR, Chikmawati T. 2021. Molecular phylogenetics of Malesian Diospyros (Ebenaceae) based trnL-F spacer sequences. Biodiversitas 22: 4106-4114. Diospyros is a highly diverse genus that includes species with potential as economic commodities. However, there is limited information on the phylogenetic relationships of this genus in the Malesian region. This study aimed to provide information on species diversity through a DNA barcoding approach and to reveal the phylogenetic relationships of Diospyros spp. in the Malesian region. This study used 20 Diospyros accessions from Bogor Botanical Garden collections, together with 40 Diospyros accessions and four outgroup accessions obtained from the NCBI database. The DNA barcoding primers targeted the plastid trnL-F intergenic spacer. The phylogenetic trees were constructed using the Maximum-Parsimony method. A total of 20 accessions of Diospyros were validated against sequence data in GenBank. The results showed that all accessions had relationships with 44 other Diospyros species globally. Here, we report 10 new trnL-F intergenic spacer sequences of Malesian Diospyros species. A phylogenetic tree grouped 64 monophyletic Diospyros species into seven clades. The phylogenetic results support the biogeographic hypothesis: the Malesian region, the Malesian-Caledonian region, and cosmopolitan species in almost all bioregions.


2021 ◽  
Author(s):  
Yueyu Jiang ◽  
Metin Balaban ◽  
Qiyun Zhu ◽  
Siavash Mirarab

Identifying samples in an evolutionary context is a fundamental step in the study of the microbiome and, more broadly, biodiversity. Extending a reference phylogeny by placing new query sequences onto it has been increasingly used for sample identification and other applications. Existing phylogenetic placement methods have assumed that the query sequence is homologous to the data used to infer the reference phylogeny. Thus, they are designed to place data from a single gene onto a gene tree (e.g., they can place 16S sequences onto a 16S gene tree). While this assumption is reasonable, ultimately, sample identification is a question of identifying the species, not individual genes. The placement of single-gene data on a gene tree is therefore used as a proxy for a more ambitious goal: extending a species tree given sequence data from one or more genes. This goal poses difficult algorithmic questions. Nevertheless, a sufficiently accurate solution would not only improve sample identification using marker genes; it would also help achieve the long-standing goal of combining 16S and metagenomic data. We approach this problem using deep neural networks (DNN) and introduce a method called DEPP. Given a reference species tree and sequence data from one (or a handful of) genes, DEPP learns how to extend the species tree to include new species. DEPP does not rely on pre-specified models of sequence evolution or gene tree discordance; instead, it uses highly parameterized DNNs to learn both aspects from the data. We test DEPP both in simulations and on real microbial data and show high accuracy.


2000 ◽  
Vol 57 (8) ◽  
pp. 1701-1717 ◽  
Author(s):  
Carol A Stepien ◽  
Alison K Dillon ◽  
Amy K Patterson

Population genetic, phylogeographic, and systematic relationships are elucidated among the three species comprising the thornyhead rockfish genus Sebastolobus (Teleostei: Scorpaenidae). Genetic variation among sampling sites representing their extensive ranges along the deep continental slopes of the northern Pacific Ocean is compared using sequence data from the left domain of the mtDNA control region. Comparisons are made among the shortspine thornyhead (S. alascanus) (from seven locations), the longspine thornyhead (S. altivelis) (from five sites), which are sympatric in the northeast, and the broadbanded thornyhead (S. macrochir) (a single site) from the northwest. Phylogenetic trees rooted to Sebastes show that S. macrochir is the sister taxon of S. alascanus and S. altivelis. Intraspecific genetic variability is appreciable, with most individuals having unique haplotypes. Gene flow is substantial among some locations, whereas others have diverged significantly. Genetic divergences among sampling sites for S. alascanus indicate an isolation by geographic distance pattern. Genetic divergences for S. altivelis are unrelated to the hypothesis of isolation by geographic distance and appear to be more consistent with the hypothesis of larval retention in currents and gyres. Differences in geographic genetic patterns between the species are attributed to life history differences in their relative mobilities as juveniles and adults.


2014 ◽  
Vol 95 (11) ◽  
pp. 2372-2376 ◽  
Author(s):  
Andi Krumbholz ◽  
Jeannette Lange ◽  
Andreas Sauerbrei ◽  
Marco Groth ◽  
Matthias Platzer ◽  
...  

The avian-like swine influenza viruses emerged in 1979 in Belgium and Germany. Thereafter, they spread through many European swine-producing countries, replaced the circulating classical swine H1N1 influenza viruses, and became endemic. Serological and subsequent molecular data indicated an avian source, but details remained obscure due to a lack of relevant avian influenza virus sequence data. Here, the origin of the European avian-like swine influenza viruses was analysed using a collection of 16 European swine H1N1 influenza viruses sampled in 1979–1981 in Germany, the Netherlands, Belgium, Italy and France, as well as several contemporaneous avian influenza viruses of various serotypes. The phylogenetic trees suggested a triple reassortant with a unique genotype constellation. Time-resolved maximum clade credibility trees indicated times to the most recent common ancestors of 34–46 years (before 2008) depending on the RNA segment and the method of tree inference.


1980 ◽  
Vol 187 (1) ◽  
pp. 65-74 ◽  
Author(s):  
D Penny ◽  
M D Hendy ◽  
L R Foulds

We have recently reported a method to identify the shortest possible phylogenetic tree for a set of protein sequences [Foulds, Hendy & Penny (1979) J. Mol. Evol. 13, 127–150; Foulds, Penny & Hendy (1979) J. Mol. Evol. 13, 151–166]. The present paper discusses issues that arise during the construction of minimal phylogenetic trees from protein-sequence data. The conversion of the data from amino acid sequences into nucleotide sequences is shown to be advantageous. A new variation of a method for constructing a minimal tree is presented. Our previous methods have involved first constructing a tree and then either proving that it is minimal or transforming it into a minimal tree. The approach presented in the present paper progressively builds up a tree, taxon by taxon. We illustrate this approach by using it to construct a minimal tree for ten mammalian haemoglobin alpha-chain sequences. Finally, we define a measure of the complexity of the data and illustrate a method to derive a directed phylogenetic tree from the minimal tree.

