Consensus Phylogenetic trees of Fifteen Prokaryotic Aminoacyl-tRNA Synthetase Polypeptides based on Euclidean Geometry of All-Pairs Distances and Concatenation

Mapping Intimacies ◽

10.1101/051623 ◽

2016 ◽

Author(s):

Rhishikesh Bargaje ◽

M.Milner Kumar ◽

Sohan Prabhakar Modak

Keyword(s):

Phylogenetic Tree ◽

Phylogenetic Trees ◽

Molecular Phylogenetics ◽

Euclidean Geometry ◽

Trna Synthetase ◽

Aminoacyl Trna Synthetase ◽

Sequence Alignments ◽

Multiple Sequence ◽

Relative Closeness ◽

Tree Topologies

Abstract

Background: Most molecular phylogenetic trees depict the relative closeness or the extent of similarity among a set of taxa based on comparison of sequences of homologous genes or proteins. Since the tree topology for individual monogenic traits varies among the same set of organisms and does not overlap the taxonomic hierarchy, there is a need to generate multidimensional phylogenetic trees.

Results: Phylogenetic trees were constructed for 119 prokaryotes representing 2 phyla under Archaea and 11 phyla under Bacteria after comparing multiple sequence alignments for 15 different aminoacyl-tRNA synthetase polypeptides. The topology of Neighbor Joining (NJ) trees for individual tRNA synthetase polypeptides varied substantially. We used Euclidean geometry to estimate all-pairs distances in order to construct phylogenetic trees. Further, we used a novel “Taxonomic fidelity” algorithm to estimate clade-by-clade similarity between the phylogenetic tree and the taxonomic tree. We find that, compared to trees for individual tRNA synthetase polypeptides and rDNA sequences, the topology of our Euclidean tree and that for aligned and concatenated sequences of 15 proteins are closer to the taxonomic trees and offer the best consensus. We have also aligned sequences after concatenation, and find that the tree topologies vary when the order of sequence joining prior to alignment is changed. In contrast, changing the types of polypeptides in the grouping for Euclidean trees does not affect the tree topologies.

Conclusions: We show that a consensus phylogenetic tree of 15 polypeptides from 14 aminoacyl-tRNA synthetases for 119 prokaryotes using Euclidean geometry exhibits better taxonomic fidelity than trees for individual tRNA synthetase polypeptides as well as 16S rDNA. We have also examined Euclidean N-dimensional trees for 15 tRNA synthetase polypeptides, which give the same topology as that constructed after amalgamating 3-dimensional Euclidean trees for groups of 3 polypeptides. Euclidean N-dimensional trees offer a reliable future for multi-genic molecular phylogenetics.
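As a rough illustration of the all-pairs Euclidean idea described in this abstract, the sketch below treats each polypeptide's pairwise distance as one coordinate axis and combines the axes into a single N-dimensional distance. The abstract does not state the exact formula, so the root-sum-of-squares combination, the function name `euclidean_combine`, and the toy matrices are assumptions made purely for illustration.

```python
import math

def euclidean_combine(matrices):
    """Combine per-protein distance matrices into one Euclidean matrix.

    Assumption (not taken from the paper):
    D[i][j] = sqrt(sum over proteins k of d_k[i][j] ** 2).
    """
    n = len(matrices[0])
    combined = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            combined[i][j] = math.sqrt(sum(m[i][j] ** 2 for m in matrices))
    return combined

# Toy data: three taxa, two hypothetical proteins.
protein_a = [[0.0, 0.3, 0.4], [0.3, 0.0, 0.5], [0.4, 0.5, 0.0]]
protein_b = [[0.0, 0.4, 0.3], [0.4, 0.0, 1.2], [0.3, 1.2, 0.0]]
combined = euclidean_combine([protein_a, protein_b])
```

In this sketch the combined distance is a sum of squares, so it does not depend on the order in which the proteins are supplied. Concatenated alignments differ on this point: the abstract reports that their topologies vary with joining order.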

High-Resolution, Multidimensional Phylogenetic Metrics Identify Class I Aminoacyl-tRNA Synthetase Evolutionary Mosaicity and Inter-modular Coupling

10.1101/2020.04.09.033712 ◽

2020 ◽

Cited By ~ 1

Author(s):

Charles W. Carter ◽

Alex Popinga ◽

Remco Bouckaert ◽

Peter R. Wills

Keyword(s):

Amino Acid ◽

Trna Synthetase ◽

Class I ◽

Evolutionary Divergence ◽

Aminoacyl Trna Synthetase ◽

Insertion Element ◽

Sequence Alignments ◽

Multiple Sequence ◽

Residue Conservation ◽

Genetic Coding

Abstract: The provenance of the aminoacyl-tRNA synthetases (aaRS) poses unusually challenging questions because of their role in the emergence and evolution of genetic coding. We investigate evidence about their ancestry from highly curated structure-based multiple sequence alignments of a small “scaffold” that is structurally invariant in all 10 canonical Class I aaRS. Statistically different values of two uncorrelated phylogenetic metrics (residue-by-residue conservation derived from Clustal and row-by-row cladistic congruence derived from BEAST2) suggest that the Class I scaffold is a mosaic assembled from distinct, successive genetic sources. These data are especially significant in light of: (i) experimental fragmentations of the Class I scaffold into three partitions that retain catalytic activities in proportion to their length; and (ii) multiple sources of evidence that two of these partitions arose from an ancestral Class I aaRS gene encoding a Class II ancestor in frame on the opposite strand. Two additional metrics output by BEAST2 vary in accordance with the presumed functionality endowed by the various modules. The new evidence supplements previous aaRS phylogenies. It identifies a previously characterized 46-residue Class I “protozyme” as preceding the adaptive radiation of the superfamily containing variations of the Rossmann dinucleotide binding fold related to amino acid discrimination, and thus as root of that molecular tree. Such a rooting is consistent with near simultaneous emergence of genetic coding and the origin of the proteome, resolving a conundrum posed by previous inferences that Class I aaRS evolved long after the genetic code had been implemented in an RNA world. Further, it establishes a timeline for the growth of coding from a binary amino acid alphabet by pinpointing discontinuous enhancements of aaRS fidelity.

Author Summary: Phylogenetic analysis uncovers evolutionary connections between different protein superfamily members. We describe complementary, uncorrelated, phylogenetic metrics that support multiple evolutionary histories for different segments within members of the Class I aminoacyl-tRNA synthetase superfamily. Using a carefully curated 3D crystal structure superposition as the primary source of the multiple sequence alignment substantially reduced dependence of these metrics on empirical amino acid substitution matrices. Two metrics are derived from the amino acid distribution observed in each successive position. A third depends on how individual sequences distribute into phylogenetic tree branches for each of the ten amino acids activated by the superfamily. All metrics confirm that a segment previously identified as an inserted element is, indeed, a more recent acquisition, despite its structural conservation. The residue-by-residue conservation metrics reveal significant co-variation of mutational frequencies between a core segment that forms the amino acid binding site and a neighboring segment derived from the more recent insertion element. We attribute that covariation to the differentiation of superfamily members as evolutionary divergence enhanced amino acid specificity. Finally, evidence that the insertion element is a recent acquisition implies a new branching order for much of the proteome.

Is cytochrome c oxidase subunit I (COI) the right DNA barcoding marker for the Chaetopteryx villosa group?

ARPHA Conference Abstracts ◽

10.3897/aca.4.e64707 ◽

2021 ◽

Vol 4 ◽

Author(s):

Dalila Destanović ◽

Lejla Ušanović ◽

Lejla Lasić ◽

Jasna Hanjalić ◽

Belma Kalamujić Stroil

Keyword(s):

Phylogenetic Trees ◽

Pairwise Distance ◽

Zoological Museum ◽

Species Determination ◽

Sequence Alignments ◽

Multiple Sequence ◽

Software Analysis ◽

Group Data ◽

Species Specific ◽

The Right

Chaetopteryx villosa (Fabricius, 1798) is a caddisfly species distributed throughout Europe, except in the Balkan and Apennine Peninsulas. However, phylogenetically close species belonging to the C. villosa group are widespread throughout Europe. Species of this group (C. villosa, C. gessneri, C. fusca, C. sahlbergi, C. atlantica, C. bosniaca, C. vulture, and C. trinacriae) have distinct distributions with some overlaps. Adult forms of these species are morphologically similar, whereas larval morphology is only known for some species. There are also indications of species hybridization (e.g., C. villosa x fusca). Presumably, a molecular approach to species determination in this group would be highly beneficial. In the BOLD database, there are 154 specimens of C. villosa with COI-5P barcodes. Of the remaining species, C. sahlbergi has 27 specimens with a barcode, C. fusca 20, C. gessneri 5, C. bosniaca 5, and C. atlantica 1, whereas sequences from the species C. vulture and C. trinacriae are missing. Therefore, we tested the power of discrimination of the COI-5P marker, the most common barcoding marker for species identification in animals, in the C. villosa group. Only sequences from public records originating from experienced research groups or taxonomists and containing a specimen photograph were taken as input. A total of 75 sequences from the BOLD database were obtained. Of these sequences, 11 belonged to C. fusca, 5 to C. gessneri, 52 to C. villosa, 5 to C. bosniaca, and 2 to C. sahlbergi. For the generation of overview trees, COI-5P barcodes of Rhyacophila fasciata and Rh. nubila were used as outgroups. All sequences were trimmed at the 5’ and 3’ ends, resulting in a final alignment length of 516 base pairs. Multiple sequence alignments and editing were done in the MEGA-X software. Analysis of nucleotide polymorphism was done in DNASP6 software. MEGA-X was used to calculate the pairwise distance and overall mean p-distance, and to construct the overview trees. Analysis of DNA polymorphism revealed 14 haplotypes of C. villosa, 3 haplotypes of C. fusca, 2 haplotypes of C. gessneri, and one each for C. bosniaca and C. sahlbergi. There were no significant interspecific and intraspecific differences among haplotypes based on pairwise distances. The p-distance between one of the haplotypes of C. fusca and C. villosa was 0.000, whereas the p-distance among haplotypes of C. villosa varied from 0.001 to about 0.055. The mean overall p-distance among haplotypes of all species equaled 0.03. No species-specific clusters were observed when phylogenetic trees were constructed, except for C. gessneri, regardless of the method used (i.e., NJ, UPGMA, ML, ME, or MP). To minimize the possibility of species misidentification, we used only records submitted by NTNU-Norwegian University of Science and Technology (Norway), SNSB-Zoologische Staatssammlung Muenchen (Germany), Zoologisches Forschungsmuseum Alexander Koenig (Germany), University of Oulu, Zoological Museum (Finland), Prof. Hans Malicky, and Prof. Mladen Kučinić. No records identified as hybrids were included in the analyses. With the exception of C. gessneri, the COI-5P marker failed to separate the species of the C. villosa group. However, it is highly unlikely that poor species determination was the basis for such a result. To enable a comprehensive and unbiased evaluation of the relationships within this group, data coverage in the BOLD database for most of the studied species should be enhanced, encompassing the different geographical distributions of samples. Further studies are needed to detect the array of molecular markers suitable for species delineation in a complex group such as C. villosa.
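The p-distance reported in this abstract is the proportion of aligned sites at which two sequences differ. The following is a minimal sketch of that calculation, not the MEGA-X implementation: the function name `p_distance`, the choice to skip sites containing a gap or `N`, and the toy haplotypes are assumptions for illustration.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Assumption: sites where either sequence has a gap ('-') or an
    ambiguous base ('N') are skipped rather than counted.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared

# Hypothetical haplotypes, not data from the study.
haplotypes = {
    "hap1": "ACGTACGTAC",
    "hap2": "ACGTACGTAC",
    "hap3": "ACGTTCGTAA",
}
```

Two haplotypes from different species that give `p_distance(...) == 0.0`, as reported here for one C. fusca and one C. villosa haplotype, cannot be told apart by this marker alone.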

Accurate inference of tree topologies from multiple sequence alignments using deep learning

10.1101/559054 ◽

2019 ◽

Cited By ~ 2

Author(s):

Anton Suvorov ◽

Joshua Hochuli ◽

Daniel R. Schrider

Keyword(s):

Deep Learning ◽

Parameter Space ◽

Phylogenetic Trees ◽

Strong Support ◽

Biological Research ◽

Learning Approaches ◽

Sequence Alignments ◽

Traditional Methods ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Abstract: Reconstructing the phylogenetic relationships between species is one of the most formidable tasks in evolutionary biology. Multiple methods exist to reconstruct phylogenetic trees, each with their own strengths and weaknesses. Both simulation and empirical studies have identified several “zones” of parameter space where accuracy of some methods can plummet, even for four-taxon trees. Further, some methods can have undesirable statistical properties such as statistical inconsistency and/or the tendency to be positively misleading (i.e. assert strong support for the incorrect tree topology). Recently, deep learning techniques have made inroads on a number of both new and longstanding problems in biological research. Here we designed a deep convolutional neural network (CNN) to infer quartet topologies from multiple sequence alignments. This CNN can readily be trained to make inferences using both gapped and ungapped data. We show that our approach is highly accurate, often outperforming traditional methods, and is remarkably robust to bias-inducing regions of parameter space such as the Felsenstein zone and the Farris zone. We also demonstrate that the confidence scores produced by our CNN can more accurately assess support for the chosen topology than bootstrap and posterior probability scores from traditional methods. While numerous practical challenges remain, these findings suggest that deep learning approaches such as ours have the potential to produce more accurate phylogenetic inferences.
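For four taxa there are exactly three possible unrooted topologies, so quartet inference can be framed as a three-way classification over an encoded alignment. The sketch below shows only that framing. The one-hot encoding, the alphabet, and the names `one_hot_alignment` and `pick_topology` are assumptions for illustration and are not taken from the authors' network.

```python
# The three unrooted topologies for taxa A, B, C, D.
QUARTET_TOPOLOGIES = ("((A,B),(C,D))", "((A,C),(B,D))", "((A,D),(B,C))")

def one_hot_alignment(alignment, alphabet="ACGT-"):
    """Encode aligned sequences as a taxa x sites x symbols nested list.

    Characters outside `alphabet` raise KeyError in this simple sketch.
    """
    index = {ch: k for k, ch in enumerate(alphabet)}
    encoded = []
    for seq in alignment:
        row = []
        for ch in seq.upper():
            vec = [0] * len(alphabet)
            vec[index[ch]] = 1
            row.append(vec)
        encoded.append(row)
    return encoded

def pick_topology(scores):
    """Return the topology with the highest class score and that score."""
    best = max(range(len(scores)), key=lambda k: scores[k])
    return QUARTET_TOPOLOGIES[best], scores[best]

# Hypothetical gapped alignment of four taxa and hypothetical class scores.
encoded = one_hot_alignment(["AC-T", "ACGT", "GCGT", "GC-T"])
choice = pick_topology([0.1, 0.7, 0.2])
```

A classifier of this kind would take `encoded` as its input and produce the three scores. The winning score is the sort of per-topology confidence value that the abstract compares with bootstrap and posterior support.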

Plastid phylogenomics of the Gynoxoid group (Senecioneae, Asteraceae) highlights the importance of motif-based sequence alignment amid low genetic distances

10.1101/2021.04.23.441144 ◽

2021 ◽

Author(s):

Belen Escobari ◽

Thomas Borsch ◽

Taylor S. Quedensley ◽

Michael Gruenstaeudl

Keyword(s):

Dna Sequence ◽

Phylogenetic Trees ◽

Plastid Genome ◽

Intergenic Spacer ◽

Genetic Distances ◽

Sequence Alignments ◽

Multiple Sequence ◽

Plastid Genomes ◽

Tree Inference ◽

The Impact

Abstract

Premise: The genus Gynoxys and relatives form a species-rich lineage of Andean shrubs and trees with low genetic distances within the sunflower subtribe Tussilaginineae. Previous molecular phylogenetic investigations of the Tussilaginineae have included few, if any, representatives of this Gynoxoid group or reconstructed ambiguous patterns of relationships for it.

Methods: We sequenced complete plastid genomes of 21 species of the Gynoxoid group and related Tussilaginineae and conducted detailed comparisons of the phylogenetic relationships supported by the gene, intron, and intergenic spacer partitions of these genomes. We also evaluated the impact of manual, motif-based adjustments of automatic DNA sequence alignments on phylogenetic tree inference.

Results: Our results indicate that the inclusion of all plastid genome partitions is needed to infer fully resolved phylogenetic trees of the Gynoxoid group. Whole plastome-based tree inference suggests that the genera Gynoxys and Nordenstamia are polyphyletic and form the core clade of the Gynoxoid group. This clade is sister to a clade of Aequatorium and Paragynoxys and also includes some but not all representatives of Paracalia.

Conclusions: The concatenation and combined analysis of all plastid genome partitions and the construction of manually curated, motif-based DNA sequence alignments are found to be instrumental in the recovery of strongly supported relationships of the Gynoxoid group. We demonstrate that the correct assessment of homology in genome-level plastid sequence datasets is crucial for subsequent phylogeny reconstruction and that the manual post-processing of multiple sequence alignments improves the reliability of such reconstructions amid low genetic distances between taxa.

Whole genome sequencing of a novel, dichloromethane-fermenting Peptococcaceae from an enrichment culture

10.7287/peerj.preprints.27718 ◽

2019 ◽

Author(s):

Sophie I Holland ◽

Richard J Edwards ◽

Haluk Ertan ◽

Yie Kuan Wong ◽

Tonia L Russell ◽

...

Keyword(s):

Phylogenetic Trees ◽

De Novo ◽

Enrichment Culture ◽

Rrna Gene ◽

Strictly Anaerobic ◽

Whole Genome ◽

Sequence Alignments ◽

Multiple Sequence ◽

Protein Coding ◽

Fermenting Bacteria

Bacteria capable of dechlorinating the toxic environmental contaminant dichloromethane (DCM, CH2Cl2) are of great interest for potential bioremediation applications. A novel, strictly anaerobic, DCM-fermenting bacterium, "DCMF", was enriched from organochlorine-contaminated groundwater near Botany Bay, Australia. The enrichment culture was maintained in minimal, mineral salt medium amended with dichloromethane as the sole energy source. PacBio whole genome SMRT™ sequencing of DCMF allowed de novo, gap-free assembly despite the presence of cohabiting organisms in the culture. Illumina sequencing reads were utilised to correct minor indels. The single, circularised 6.44 Mb chromosome was annotated with the IMG pipeline and contains 5,773 predicted protein-coding genes. Based on 16S rRNA gene and predicted proteome phylogeny, the organism appears to be a novel member of the Peptococcaceae family. The DCMF genome is large in comparison to known DCM-fermenting bacteria and includes 96 predicted methylamine methyltransferases, which may provide clues to the basis of its DCM metabolism. Full annotation has been provided in a custom genome browser and search tool, in addition to multiple sequence alignments and phylogenetic trees for every predicted protein, available at http://www.slimsuite.unsw.edu.au/research/dcmf/.

Interim Report on Multiple Sequence Alignments and TaqMan Signature Mapping to Phylogenetic Trees

10.2172/1047247 ◽

2012 ◽

Author(s):

S Gardner ◽

C Jaing

Keyword(s):

Phylogenetic Trees ◽

Interim Report ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments

Evolutionary Relationships and Sequence-Structure Determinants in Human SARS Coronavirus-2 Spike Proteins for Host Receptor Recognition

10.26434/chemrxiv.12190449 ◽

2020 ◽

Author(s):

Lalitha Guruprasad

Keyword(s):

Phylogenetic Trees ◽

Disulfide Bridge ◽

Sequence Motifs ◽

Structural Determinants ◽

Sequence Alignments ◽

Multiple Sequence ◽

Angiotensin Converting Enzyme 2 ◽

Multiple Sequence Alignments ◽

Loop 2 ◽

Spike Proteins

Coronavirus disease 2019 (COVID-19) is a pandemic infectious disease caused by the novel Severe Acute Respiratory Syndrome coronavirus-2 (SARS CoV-2). SARS CoV-2 is transmitted more rapidly and readily than SARS CoV. Both SARS CoV and SARS CoV-2 recognize the human angiotensin converting enzyme-2 (ACE-2) receptor via their glycosylated spike proteins. We generated multiple sequence alignments and phylogenetic trees for representative spike proteins of CoV and CoV-2 from various host sources in order to analyze the specificity in SARS CoV-2 spike proteins required for causing infection in humans. Our results show that two sequence motifs in the N-terminal domain, "MESEFR" and "SYLTPG", are specific to human SARS CoV-2 and pangolin SARS CoV. In the receptor binding domain (RBD), three sequence loops, VGGNY (loop 1), YQAGSTPC (loop 2), and EGFNCY (loop 3), and a tethered disulfide bridge Cys480-Cys488 connecting loops 2 and 3 are structural determinants for the recognition of the human ACE-2 receptor. The complete genome analysis of representative SARS CoVs from bat, civet, pangolin, and human host sources and of human SARS CoV-2 identified the bat genome (GenBank code: MN996532.1) and the pangolin SARS CoV genomes as closest to the recent novel human SARS CoV-2 genomes. The bat CoV genomes (GenBank codes: MG772933 and MG772934) are evolutionary intermediates in the mutagenesis progression towards becoming human SARS CoV-2.

PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data

Bioinformatics ◽

10.1093/bioinformatics/btab096 ◽

2021 ◽

Author(s):

Jacob L Steenwyk ◽

Thomas J Buida ◽

Abigail L Labella ◽

Yuanning Li ◽

Xing-Xing Shen ◽

...

Keyword(s):

Information Content ◽

Phylogenetic Trees ◽

Supplementary Information ◽

Sequence Alignments ◽

Multiple Sequence ◽

Sequence Composition ◽

Multiple Sequence Alignments ◽

Functional Relationships ◽

Biology Process ◽

Rate Evaluation

Abstract

Motivation: Diverse disciplines in biology process and analyze multiple sequence alignments (MSAs) and phylogenetic trees to evaluate their information content, infer evolutionary events and processes, and predict gene function. However, automated processing of MSAs and trees remains a challenge due to the lack of a unified toolkit. To fill this gap, we introduce PhyKIT, a toolkit for the UNIX shell environment with 30 functions that process MSAs and trees, including but not limited to estimation of mutation rate, evaluation of sequence composition biases, calculation of the degree of violation of a molecular clock, and collapsing bipartitions (internal branches) with low support.

Results: To demonstrate the utility of PhyKIT, we detail three use cases: (1) summarizing information content in MSAs and phylogenetic trees for diagnosing potential biases in sequence or tree data; (2) evaluating gene-gene covariation of evolutionary rates to identify functional relationships, including novel ones, among genes; and (3) identifying lack of resolution events or polytomies in phylogenetic trees, which are suggestive of rapid radiation events or lack of data. We anticipate PhyKIT will be useful for processing, examining, and deriving biological meaning from increasingly large phylogenomic datasets.

Availability: PhyKIT is freely available on GitHub (https://github.com/JLSteenwyk/PhyKIT), PyPi (https://pypi.org/project/phykit/), and the Anaconda Cloud (https://anaconda.org/JLSteenwyk/phykit) under the MIT license with extensive documentation and user tutorials (https://jlsteenwyk.com/PhyKIT).

Supplementary information: Supplementary data are available on figshare (doi: 10.6084/m9.figshare.13118600) and are available at Bioinformatics online.
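The second use case, gene-gene covariation of evolutionary rates, can be pictured as a correlation between the branch lengths of two gene trees that share a topology. The sketch below is a generic Pearson correlation written for illustration only. It is not PhyKIT's code or command-line interface, and the function name `pearson` and the branch-length lists are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical branch lengths for matching branches in three gene trees.
gene_1 = [0.10, 0.20, 0.30, 0.40]
gene_2 = [0.20, 0.40, 0.60, 0.80]
gene_3 = [0.40, 0.30, 0.20, 0.10]
```

In this toy data `gene_1` and `gene_2` speed up and slow down on the same branches, which gives a correlation near 1, while `gene_3` shows the opposite pattern. A real analysis would also need to correct for branch-length differences shared by all genes, which this sketch omits.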

Whole genome sequencing of a novel, dichloromethane-fermenting Peptococcaceae from an enrichment culture

PeerJ ◽

10.7717/peerj.7775 ◽

2019 ◽

Vol 7 ◽

pp. e7775 ◽

Cited By ~ 1

Author(s):

Sophie I. Holland ◽

Richard J. Edwards ◽

Haluk Ertan ◽

Yie Kuan Wong ◽

Tonia L. Russell ◽

...

Keyword(s):

Phylogenetic Trees ◽

De Novo ◽

Enrichment Culture ◽

Rrna Gene ◽

Strictly Anaerobic ◽

Whole Genome ◽

Sequence Alignments ◽

Multiple Sequence ◽

Protein Coding ◽

Fermenting Bacteria

Bacteria capable of dechlorinating the toxic environmental contaminant dichloromethane (DCM, CH2Cl2) are of great interest for potential bioremediation applications. A novel, strictly anaerobic, DCM-fermenting bacterium, “DCMF”, was enriched from organochlorine-contaminated groundwater near Botany Bay, Australia. The enrichment culture was maintained in minimal, mineral salt medium amended with dichloromethane as the sole energy source. PacBio whole genome SMRT™ sequencing of DCMF allowed de novo, gap-free assembly despite the presence of cohabiting organisms in the culture. Illumina sequencing reads were utilised to correct minor indels. The single, circularised 6.44 Mb chromosome was annotated with the IMG pipeline and contains 5,773 predicted protein-coding genes. Based on 16S rRNA gene and predicted proteome phylogeny, the organism appears to be a novel member of the Peptococcaceae family. The DCMF genome is large in comparison to known DCM-fermenting bacteria. It includes an abundance of methyltransferases, which may provide clues to the basis of its DCM metabolism, as well as potential to metabolise additional methylated substrates such as quaternary amines. Full annotation has been provided in a custom genome browser and search tool, in addition to multiple sequence alignments and phylogenetic trees for every predicted protein, available at http://www.slimsuite.unsw.edu.au/research/dcmf/.

Beyond Simple Homology Searches: Multiple Sequence Alignments and Phylogenetic Trees

Current Protocols Essential Laboratory Techniques ◽

10.1002/9780470089941.et1103s01 ◽

2009 ◽

Vol 1 (1) ◽

Cited By ~ 1

Author(s):

Rebecca A. Zufall

Keyword(s):

Phylogenetic Trees ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments
