Phylogenomics of orchids and their mycorrhizal fungi : trees, diversity, and the pursuit of symbiosis

Mapping Intimacies ◽

10.32469/10355/72205 ◽

2019 ◽

Author(s):

Sarah Unruh

Keyword(s):

Mycorrhizal Fungi ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Genomic Sequence ◽

Sequence Data ◽

Mycorrhizal Symbiosis ◽

Sequencing Data ◽

Phylogenetic Structure ◽

University Of Missouri ◽

Fungal Symbiosis

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Phylogenetic trees show us how organisms are related and provide frameworks for studying and testing evolutionary hypotheses. To better understand the evolution of orchids and their mycorrhizal fungi, I used high-throughput sequencing data and bioinformatic analyses to build phylogenetic hypotheses. In Chapter 2, I used transcriptome sequences both to build a phylogeny of the slipper orchid genera and to confirm the placement of a polyploidy event at the base of the orchid family. Polyploidy is hypothesized to be a strong driver of evolution and a source of unique traits, so confirming this event brings us closer to explaining extant orchid diversity. The list of orthologous genes generated from this study will provide a less expensive and more powerful method for researchers examining evolutionary relationships in Orchidaceae. In Chapter 3, I generated genomic sequence data for 32 fungal isolates that were collected from orchids across North America. I inferred the first multi-locus nuclear phylogenetic tree for these fungal clades. The phylogenetic structure of these fungi will improve the taxonomy of these clades by providing evidence for new species and for revising problematic species designations. A robust taxonomy is necessary for studying the role of fungi in the orchid mycorrhizal symbiosis. In Chapter 4, I summarize my work and outline the future directions of my lab at Illinois College, including addressing the remaining aims of my Community Sequencing Proposal with the Joint Genome Institute by analyzing the 15 fungal reference genomes I generated during my PhD. Together, these chapters are the start of a lifelong research project into the evolution and function of the orchid/fungal symbiosis.

Application of genotyping-by-sequencing data on inferring the phylogeny of Curcuma (Zingiberaceae) from China

10.21203/rs.2.15210/v1 ◽

2019 ◽

Author(s):

Heng Liang ◽

Yan Zhang ◽

Jiabing Deng ◽

Gang Gao ◽

Chunbang Ding ◽

...

Keyword(s):

Phylogenetic Relationships ◽

Phylogenetic Trees ◽

Large Scale ◽

Reference Genome ◽

Genomic Sequence ◽

Sequence Data ◽

Morphological Differentiation ◽

Genotyping By Sequencing ◽

Tibet Plateau ◽

Sequencing Data

Abstract Background: Genotyping-by-sequencing (GBS), one of the next-generation sequencing techniques, has been applied to large-scale genotyping in plants that show poor morphological differentiation and low genetic divergence among species. Curcuma is an important medicinal and edible genus. Resolving its phylogenetic relationships and delimiting its species remain challenging because of poor morphological differentiation and the lack of a reference genome. Results: High-throughput genomic sequence data obtained through GBS protocols were used to investigate the relationships among 8 species of Curcuma (60 samples in total). Using the ipyrad software, 437,061 loci and 997,988 filtered SNPs were produced without reliance on a reference genome. After quality control (QC) of the filtered SNPs, 1,295 high-quality SNPs were used to clarify the phylogenetic relationships among Curcuma species. Based on these data, a supermatrix approach was used to infer the phylogenetic trees and relationships. Conclusions: The varying degrees of support can be explained, as can the diversification events for Chinese Curcuma. The diversification events showed that the third intense uplift of the Qinghai–Tibet Plateau (QTP) and the formation of the Hengduan Mountains may have accelerated interspecific divergence of Curcuma in China. The PCA was consistent with the topology of the phylogenetic tree. The genetic structure analysis revealed that extensive hybridization may exist in Chinese Curcuma. Additionally, GBS is a promising approach for future phylogenetic and systematic studies.

Implication of the Identification of an Earlier Pseudorabies Virus (PRV) Strain HLJ-2013 to the Evolution of Chinese PRVs

Frontiers in Microbiology ◽

10.3389/fmicb.2020.612474 ◽

2020 ◽

Vol 11 ◽

Author(s):

Huimin Liu ◽

Zhibin Shi ◽

Chunguo Liu ◽

Pengfei Wang ◽

Ming Wang ◽

...

Keyword(s):

Phylogenetic Trees ◽

Pseudorabies Virus ◽

High Throughput Sequencing ◽

Genomic Sequence ◽

Full Genome Sequence ◽

Genome Sequences ◽

Protein Coding ◽

One Step ◽

Human Infections ◽

Full Length Genome

Pseudorabies viruses (PRVs) pose a great threat to the pig industry in many countries around the world. Human infections with PRV have also been reported occasionally in China. Therefore, understanding the epidemiology and evolution of PRVs is of great importance for disease control in both pig populations and humans. In this study, we isolated a PRV designated HLJ-2013 from PRV-positive samples that had been collected in Heilongjiang, China, in 2013. The full genome sequence of the virus was determined to be ∼143 kbp in length using high-throughput sequencing. The genomic sequence identities between this isolate and 21 previously isolated PRV strains ranged from 92.4% (with Bartha) to 97.3% (with SC). Phylogenetic analysis based on the full-length genome sequences revealed that PRV HLJ-2013 clustered together with all the Chinese strains in one group belonging to Genotype II, but this virus occurred phylogenetically earlier than all the other Chinese PRV strains. Phylogenetic trees based on both protein-coding genes and non-coding regions revealed that HLJ-2013 probably obtained its genome sequences from three origins: a yet unknown parent virus, the European viruses, and the common ancestor of all Chinese PRVs. Recombination analysis showed that an HLJ-2013-like virus possibly donated the main framework of the genome of the Chinese PRVs. HLJ-2013 exhibited cytopathic and growth characteristics similar to those of the Chinese PRV strains SC and HeN1, but its pathogenicity in mice was higher than that of SC and lower than that of HeN1. The identification of HLJ-2013 takes us one step closer to understanding the origin of PRVs in China and provides new knowledge about the evolution of PRVs worldwide.

Inferring species compositions of complex fungal communities from long- and short-read sequence data

10.1101/2021.05.02.442318 ◽

2021 ◽

Author(s):

Yiheng Hu ◽

Laszlo Irinyi ◽

Minh Thuy Vi Hoang ◽

Tavish Eenjes ◽

Abigail Graetz ◽

...

Keyword(s):

Community Composition ◽

Pathogen Detection ◽

High Throughput Sequencing ◽

Sequence Data ◽

Whole Genome Sequence ◽

Composition Analysis ◽

Sequencing Data ◽

Species Classification ◽

Shotgun Metagenomics ◽

Query Coverage

Background: The kingdom Fungi is crucial for life on earth and is highly diverse. Yet fungi are challenging to characterize. They can be difficult to culture and may be morphologically indistinct in culture. They can have complex genomes of over 1 Gb in size and are still underrepresented in whole genome sequence databases. Overall, their description and analysis lag far behind those of other microbes such as bacteria. At the same time, classification of species via high-throughput sequencing without prior purification is increasingly becoming the norm for pathogen detection, microbiome studies, and environmental monitoring. However, standardized procedures for characterizing unknown fungi from complex sequencing data have not yet been established. Results: We compared different metagenomics sequencing and analysis strategies for the identification of fungal species. Using two fungal mock communities of 44 phylogenetically diverse species, we compared species classification and community composition analysis pipelines using shotgun metagenomics and amplicon sequencing data generated from both short- and long-read sequencing technologies. We show that regardless of the sequencing methodology used, the highest accuracy of species identification was achieved by sequence alignment against a fungi-specific database. During the assessment of classification algorithms, we found that applying cut-offs to the query coverage of each read or contig significantly improved the classification accuracy and community composition analysis without significant data loss. Conclusion: Overall, our study expands the toolkit for identifying fungi by improving sequence-based fungal classification, and provides a practical guide for the design of metagenomics analyses.
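
The query-coverage cut-off mentioned in this abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the `Hit` record, the `filter_by_coverage` helper, the 0.8 threshold, and the example hits are all hypothetical and chosen only to show the idea of discarding alignments that cover too little of a read or contig.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment of a read or contig against a reference database (hypothetical record)."""
    query_id: str
    query_length: int  # total length of the read or contig
    query_start: int   # 1-based, inclusive start of the aligned region on the query
    query_end: int     # 1-based, inclusive end of the aligned region on the query
    taxon: str         # taxon assigned by the best database match


def query_coverage(hit: Hit) -> float:
    """Fraction of the query sequence that is covered by the alignment."""
    return (hit.query_end - hit.query_start + 1) / hit.query_length


def filter_by_coverage(hits, min_coverage=0.8):
    """Keep only hits whose query coverage reaches the cut-off (0.8 is an assumed value)."""
    return [h for h in hits if query_coverage(h) >= min_coverage]


# Illustrative data only: one well-covered read and one partially aligned read.
hits = [
    Hit("read1", 1000, 1, 950, "Candida albicans"),
    Hit("read2", 1000, 400, 600, "Aspergillus fumigatus"),
]
kept = filter_by_coverage(hits, min_coverage=0.8)
print([h.query_id for h in kept])
```

In this toy input, the second read aligns over only about a fifth of its length, so it is dropped before any species assignment is counted. A suitable threshold in practice would depend on read length and error profile.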

‘There and back again’: revisiting the pathophysiological roles of human endogenous retroviruses in the post-genomic era

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2012.0504 ◽

2013 ◽

Vol 368 (1626) ◽

pp. 20120504 ◽

Cited By ~ 45

Author(s):

Gkikas Magiorkinis ◽

Robert Belshaw ◽

Aris Katzourakis

Keyword(s):

Human Genome ◽

High Throughput Sequencing ◽

Immune Escape ◽

Genomic Sequence ◽

Sequence Data ◽

Endogenous Retroviruses ◽

Human Endogenous Retroviruses ◽

The Past ◽

Sequencing Technologies ◽

Research Questions

Almost 8% of the human genome comprises endogenous retroviruses (ERVs). While they have been shown to cause specific pathologies in animals, such as cancer, their association with disease in humans remains controversial. The limited evidence is partly due to the physical and bioethical restrictions surrounding the study of transposons in humans, coupled with the major experimental and bioinformatics challenges surrounding the association of ERVs with disease in general. Two biotechnological landmarks of the past decade provide us with unprecedented research artillery: (i) the ultra-fine sequencing of the human genome and (ii) the emergence of high-throughput sequencing technologies. Here, we critically assemble research about potential pathologies of ERVs in humans. We argue that the time is right to revisit the long-standing questions of human ERV pathogenesis within a robust and carefully structured framework that makes full use of genomic sequence data. We also pose two thought-provoking research questions on potential pathophysiological roles of ERVs with respect to immune escape and regulation.

Metannot: A succinct data structure for compression of colors in dynamic de Bruijn graphs

10.1101/236711 ◽

2017 ◽

Cited By ~ 1

Author(s):

Harun Mustafa ◽

André Kahles ◽

Mikhail Karasikov ◽

Gunnar Rätsch

Keyword(s):

Data Structure ◽

High Throughput Sequencing ◽

Sequence Data ◽

Data Sets ◽

Sequencing Data ◽

Dynamic Data ◽

De Bruijn Graphs ◽

Dna And Rna ◽

Succinct Data Structure ◽

Dynamic Data Structure

Abstract: Much of the DNA and RNA sequencing data available is in the form of high-throughput sequencing (HTS) reads and is currently unindexed by established sequence search databases. Recent succinct data structures for indexing both reference sequences and HTS data, along with associated metadata, have been based on either hashing or graph models, but many of these structures are static in nature, and thus, not well-suited as backends for dynamic databases. We propose a parallel construction method for and novel application of the wavelet trie as a dynamic data structure for compressing and indexing graph metadata. By developing an algorithm for merging wavelet tries, we are able to construct large tries in parallel by merging smaller tries constructed concurrently from batches of data. When compared against general compression algorithms and those developed specifically for graph colors (VARI and Rainbowfish), our method achieves compression ratios superior to gzip and VARI, converging to compression ratios of 6.5% to 2% on data sets constructed from over 600 virus genomes. While marginally worse than compression by bzip2 or Rainbowfish, this structure allows for both fast extension and query. We also found that additionally encoding graph topology metadata improved compression ratios, particularly on data sets consisting of several mutually-exclusive reference genomes. It was also observed that the compression ratio of wavelet tries grew sublinearly with the density of the annotation matrices. This work is a significant step towards implementing a dynamic data structure for indexing large annotated sequence data sets that supports fast query and update operations. At the time of writing, no established standard tool has filled this niche.

A benchmarking of human mitochondrial DNA haplogroup classifiers from whole-genome and whole-exome sequence data

10.1101/2021.02.11.430775 ◽

2021 ◽

Author(s):

Víctor García-Olivares ◽

Adrián Muñoz-Barrera ◽

José Miguel Lorenzo-Salazar ◽

Carlos Zaragoza-Trello ◽

Luis A. Rubio-Rodríguez ◽

...

Keyword(s):

High Throughput Sequencing ◽

De Novo ◽

Sequence Data ◽

Qualitative Assessment ◽

Whole Genome ◽

Third Generation ◽

Sequencing Data ◽

Short Read ◽

Bioinformatic Tools ◽

Whole Exome

Abstract: The mitochondrial genome (mtDNA) is of interest for a range of fields including evolutionary, forensic, and medical genetics. Human mitogenomes can be classified into evolutionarily related haplogroups that provide ancestral information and pedigree relationships. Because of this and the advent of high-throughput sequencing (HTS) technology, there is a diversity of bioinformatic tools for haplogroup classification. We present a benchmarking of the 11 most salient tools for human mtDNA classification using empirical whole-genome (WGS) and whole-exome (WES) short-read sequencing data from 36 unrelated donors. In addition, because of its relevance, we also assess the best-performing tool on third-generation long noisy read WGS data obtained with nanopore technology for a subset of the donors. We found that, for short-read WGS, most of the tools exhibit high accuracy for haplogroup classification irrespective of the input file used for the analysis. However, for short-read WES, Haplocheck and MixEmt were the most accurate tools. Based on the performance shown for WGS and WES, and the accompanying qualitative assessment, Haplocheck stands out as the most complete tool. For third-generation HTS data, we also showed that Haplocheck was able to accurately retrieve mtDNA haplogroups for all samples assessed, although only after following assembly-based approaches (based on either a reference-based assembly or a hybrid de novo assembly). Taken together, our results provide guidance for researchers to select the most suitable tool to conduct mtDNA analyses from HTS data.

Quality Assessment of Domesticated Animal Genome Assemblies

Bioinformatics and Biology Insights ◽

10.4137/bbi.s29333 ◽

2015 ◽

Vol 9S4 ◽

pp. BBI.S29333 ◽

Cited By ~ 3

Author(s):

Stefan E. Seemann ◽

Christian Anthon ◽

Oana Palasca ◽

Jan Gorodkin

Keyword(s):

High Throughput Sequencing ◽

Genomic Sequence ◽

Rna Seq ◽

Sequencing Data ◽

Assembly Quality ◽

High Quality ◽

Rnaseq Data ◽

Genome Assemblies ◽

Animal Genomes

The era of high-throughput sequencing has made it relatively simple to sequence genomes and transcriptomes of individuals from many species. In order to analyze the resulting sequencing data, high-quality reference genome assemblies are required. However, this is still a major challenge, and many domesticated animal genomes still need to be sequenced more deeply in order to produce high-quality assemblies. Meanwhile, ironically, the extent to which RNA-seq and other next-generation data are produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often affected by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNA-seq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.

Systematic processing of ribosomal RNA gene amplicon sequencing data

GigaScience ◽

10.1093/gigascience/giz146 ◽

2019 ◽

Vol 8 (12) ◽

Cited By ~ 10

Author(s):

Julien Tremblay ◽

Etienne Yergeau

Keyword(s):

Ribosomal Rna ◽

High Performance ◽

High Throughput Sequencing ◽

Sequence Data ◽

Low Cost ◽

Marker Gene ◽

Fine Tuning ◽

Rrna Gene ◽

Sequencing Data ◽

Data Types

Abstract Background: With the advent of high-throughput sequencing, microbiology is becoming increasingly data-intensive. Because of its low cost, robust databases, and established bioinformatic workflows, sequencing of 16S/18S/ITS ribosomal RNA (rRNA) gene amplicons, which provides a marker of choice for phylogenetic studies, has become ubiquitous. Many established end-to-end bioinformatic pipelines are available to perform short amplicon sequence data analysis. These pipelines suit a general audience, but few options exist for more specialized users who are experienced in code scripting, Linux-based systems, and high-performance computing (HPC) environments. For such an audience, existing pipelines can make it difficult to fully leverage modern HPC capabilities and to perform tweaking and optimization operations. Moreover, a wealth of stand-alone software packages that perform specific targeted bioinformatic tasks are increasingly accessible, and finding a way to easily integrate these applications in a pipeline is critical to the evolution of bioinformatic methodologies. Results: Here we describe AmpliconTagger, a short rRNA marker gene amplicon pipeline coded in a Python framework that enables fine-tuning and integration of virtually any potential rRNA gene amplicon bioinformatic procedure. It is designed to work within an HPC environment, supporting a complex network of job dependencies with a smart-restart mechanism in case of job failure or parameter modifications. As proof of concept, we present end results obtained with AmpliconTagger using 16S, 18S, ITS rRNA short gene amplicons and Pacific Biosciences long-read amplicon data types as input. Conclusions: Using a selection of published algorithms for generating operational taxonomic units and amplicon sequence variants and for computing downstream taxonomic summaries and diversity metrics, we demonstrate the performance and versatility of our pipeline for systematic analyses of amplicon sequence data.

Pitfalls in supermatrix phylogenomics

European Journal of Taxonomy ◽

10.5852/ejt.2017.283 ◽

2017 ◽

Cited By ~ 13

Author(s):

Hervé Philippe ◽

Damien M. de Vienne ◽

Vincent Ranwez ◽

Béatrice Roure ◽

Denis Baurain ◽

...

Keyword(s):

Systematic Error ◽

Phylogenetic Trees ◽

Molecular Phylogenetics ◽

High Throughput Sequencing ◽

Sequence Data ◽

Single Gene ◽

Sequence Evolution ◽

Adequate Model ◽

Stochastic Error ◽

Genomic Scale

In the mid-2000s, molecular phylogenetics turned into phylogenomics, a development that improved the resolution of phylogenetic trees through a dramatic reduction in stochastic error. While some then predicted “the end of incongruence”, it soon appeared that analysing large amounts of sequence data without an adequate model of sequence evolution amplifies systematic error and leads to phylogenetic artefacts. With the increasing flood of (sometimes low-quality) genomic data resulting from the rise of high-throughput sequencing, a new type of error has emerged. Termed here “data errors”, it lumps together several kinds of issues affecting the construction of phylogenomic supermatrices (e.g., sequencing and annotation errors, contaminant sequences). While easy to deal with at a single-gene scale, such errors become very difficult to avoid at the genomic scale, both because hand curating thousands of sequences is prohibitively time-consuming and because the suitable automated bioinformatics tools are still in their infancy. In this paper, we first review the pitfalls affecting the construction of supermatrices and the strategies to limit their adverse effects on phylogenomic inference. Then, after discussing the relative non-issue of missing data in supermatrices, we briefly present the approaches commonly used to reduce systematic error.
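
As a rough illustration of what constructing a supermatrix involves, the following Python sketch concatenates per-gene alignments and pads taxa that lack a gene with missing-data characters. It is a toy example, not a method taken from the paper: `build_supermatrix`, the use of `?` for missing data, and the three-taxon input are assumptions made for illustration.

```python
def build_supermatrix(gene_alignments):
    """Concatenate per-gene alignments into one supermatrix.

    Each alignment is a dict mapping taxon name -> aligned sequence.
    A taxon absent from a gene is padded with '?' (missing data).
    """
    # Collect every taxon seen in any gene so that all rows have equal length.
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    rows = {taxon: [] for taxon in taxa}

    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, "?" * width))

    return {taxon: "".join(parts) for taxon, parts in rows.items()}


# Illustrative input: taxon B has no sequence for the second gene.
genes = [
    {"A": "ACGT", "B": "ACGA", "C": "ACGG"},
    {"A": "TTG", "C": "TTA"},
]
supermatrix = build_supermatrix(genes)
for taxon, row in supermatrix.items():
    print(taxon, row)
```

The sketch also hints at why data errors are hard to catch at scale: a contaminant or misannotated sequence placed in one of thousands of gene alignments is concatenated like any other and is no longer visible as a separate unit in the final matrix.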

One is not enough: On the effects of reference genome for the mapping and subsequent analyses of short-reads

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008678 ◽

2021 ◽

Vol 17 (1) ◽

pp. e1008678

Author(s):

Carlos Valiente-Mullor ◽

Beatriz Beamud ◽

Iván Ansari ◽

Carlos Francés-Cuesta ◽

Neris García-González ◽

...

Keyword(s):

Legionella Pneumophila ◽

Phylogenetic Trees ◽

High Throughput Sequencing ◽

Reference Genome ◽

Sequence Data ◽

Genetic Distances ◽

Genomic Diversity ◽

Nucleotide Polymorphisms ◽

Recombination Rates ◽

Almost All

Mapping of high-throughput sequencing (HTS) reads to a single arbitrary reference genome is a frequently used approach in microbial genomics. However, the choice of a reference may represent a source of errors that may affect subsequent analyses such as the detection of single nucleotide polymorphisms (SNPs) and phylogenetic inference. In this work, we evaluated the effect of reference choice on short-read sequence data from five clinically and epidemiologically relevant bacteria (Klebsiella pneumoniae, Legionella pneumophila, Neisseria gonorrhoeae, Pseudomonas aeruginosa and Serratia marcescens). Publicly available whole-genome assemblies encompassing the genomic diversity of these species were selected as reference sequences, and read alignment statistics, SNP calling, recombination rates, dN/dS ratios, and phylogenetic trees were evaluated depending on the mapping reference. The choice of different reference genomes proved to have an impact on almost all the parameters considered in the five species. In addition, these biases had potential epidemiological implications such as including/excluding isolates of particular clades and the estimation of genetic distances. These findings suggest that the single reference approach might introduce systematic errors during mapping that affect subsequent analyses, particularly for data sets with isolates from genetically diverse backgrounds. In any case, exploring the effects of different references on the final conclusions is highly recommended.
