CAM: an alignment-free method to recover phylogenies using codon aversion motifs

PeerJ ◽  
2019 ◽  
Vol 7 ◽  
pp. e6984 ◽  
Author(s):  
Justin B. Miller ◽  
Lauren M. McKinnon ◽  
Michael F. Whiting ◽  
Perry G. Ridge

Background. Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an approach to quickly calculate distances between species based on codon aversion.
Methods. Utilizing a novel alignment-free character state, we present CAM, an alignment-free approach to recover phylogenies by comparing differences in codon aversion motifs (i.e., the set of unused codons within each gene) across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229,742,339 genes from 23,428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of shared codon aversion motifs divided by the total codon aversion motifs of the species, A or B, containing the fewest motifs. This approach allows us to calculate pairwise distances even when substantial differences in the number of genes or a high rate of divergence between species exists. Finally, we use neighbor-joining to recover phylogenies.
Results. Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies.
Availability. CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam.
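
The distance described in the Methods can be illustrated with a short Python sketch. It is not taken from the CAM source code: the function names (`aversion_motif`, `species_motifs`, `cam_distance`), the choice of all 64 codons (stop codons included) as the codon universe, and the dropping of a trailing partial codon are assumptions made for illustration only.

```python
from itertools import product

# Assumption: the codon universe is all 64 DNA triplets, stop codons included.
ALL_CODONS = frozenset("".join(p) for p in product("ACGT", repeat=3))


def aversion_motif(cds):
    """Return the set of codons that never occur in one coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # assumption: ignore a trailing partial codon
    used = {seq[i:i + 3] for i in range(0, usable, 3)}
    return frozenset(ALL_CODONS - used)


def species_motifs(genes):
    """Collect the distinct codon aversion motifs across all genes of a species."""
    return {aversion_motif(gene) for gene in genes}


def cam_distance(motifs_a, motifs_b):
    """One minus shared motifs divided by the motif count of the smaller set."""
    smaller = min(len(motifs_a), len(motifs_b))
    if smaller == 0:
        raise ValueError("each species needs at least one motif")
    return 1.0 - len(motifs_a & motifs_b) / smaller


# Toy data (hypothetical sequences, not real genes).
species_a = species_motifs(["ATGAAATAA", "ATGCCCTAA"])
species_b = species_motifs(["ATGAAATAA", "ATGGGGTAA", "ATGTTTTAA"])
print(cam_distance(species_a, species_b))  # 0.5
```

In this toy case the two species share one motif and the smaller species has two, so the distance is 1 - 1/2 = 0.5. A matrix of such pairwise distances is what a neighbor-joining implementation would take as input.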

2019 ◽  
Author(s):  
Justin B Miller ◽  
Lauren M McKinnon ◽  
Michael F Whiting ◽  
Perry G Ridge

Background. Common phylogenomic approaches for recovering phylogenies are often time-consuming and require annotations for orthologous gene relationships that are not always available. In contrast, alignment-free phylogenomic approaches typically use structure and oligomer frequencies to calculate pairwise distances between species. We have developed an algorithm to quickly calculate distances between species based on codon aversion.
Methods. Utilizing a novel alignment-free character state, we present CAM, an alignment-free approach to recover phylogenies by comparing differences in codon aversion motifs (i.e., the set of unused codons within each gene) across all genes within a species. Synonymous codon usage is non-random and differs between organisms, between genes, and even within a single gene, and many genes do not use all possible codons. We report a comprehensive analysis of codon aversion within 229,742,339 genes from 23,428 species across all kingdoms of life, and we provide an alignment-free framework for its use in a phylogenetic construct. For each species, we first construct a set of codon aversion motifs spanning all genes within that species. We define the pairwise distance between two species, A and B, as one minus the number of shared codon aversion motifs divided by the total codon aversion motifs of the species, A or B, containing the fewest motifs. This approach allows us to calculate pairwise distances even when substantial differences in the number of genes or a high rate of divergence between species exists. Finally, we use neighbor-joining to recover phylogenies.
Results. Using the Open Tree of Life and NCBI Taxonomy Database as expected phylogenies, our approach compares well, recovering phylogenies that largely match expected trees and are comparable to trees recovered using maximum likelihood and other alignment-free approaches. Our technique is much faster than maximum likelihood and similar in accuracy to other alignment-free approaches. Therefore, we propose that codon aversion be considered a phylogenetically conserved character that may be used in future phylogenomic studies.
Availability. CAM, documentation, and test files are freely available on GitHub at https://github.com/ridgelab/cam.


2019 ◽  
Vol 2 (1) ◽  
Author(s):  
Thomas Dencker ◽  
Chris-André Leimeister ◽  
Michael Gerth ◽  
Christoph Bleidorn ◽  
Sagi Snir ◽  
...  

Abstract Word-based or ‘alignment-free’ methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate ‘pairwise’ distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on ‘multiple’ sequence comparison and ‘maximum likelihood’. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program ‘Quartet MaxCut’ is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.


2020 ◽  
Vol 105 (3) ◽  
pp. 323-376
Author(s):  
Li-E Yang ◽  
Lu Lu ◽  
Kevin S. Burgess ◽  
Hong Wang ◽  
De-Zhu Li

Lamiids, a clade composed of approximately 15% of all flowering plants, contains more than 50,000 species dispersed across 49 families and eight orders (APG IV, 2016). This paper is the eighth in a series that analyzes pollen characters across angiosperms. We reconstructed a maximum likelihood tree based on the most recent phylogenetic studies for the Lamiids, comprising 150 terminal genera (including six outgroups) and covering all eight orders and 49 families within the clade. To illustrate pollen diversity across the Lamiids, pollen grains from 22 species (22 genera in 14 families) were imaged under light, scanning, and transmission electron microscopy. Eighteen pollen characters that were documented from previous publications, websites, and our new observations were coded and optimized onto the reconstructed phylogenetic tree using Fitch parsimony, maximum likelihood, and hierarchical Bayesian analysis. Pollen morphology of the Lamiids is highly diverse, particularly in shape class, pollen size, aperture number, endoaperture shape, supratectal element shape, and tectum sculpture. In addition, some genera show relatively high infrageneric pollen variation within the Lamiids: i.e., Coffea L., Jacquemontia Choisy, Justicia L., Pedicularis L., Psychotria L. nom. cons., Sesamum L., Stachytarpheta Vahl, and Veronica L. The plesiomorphic states for 16 pollen characters were inferred unambiguously, and 10 of them displayed consistent plesiomorphic states under all optimization methods. Seventy-one lineages at or above the family level are characterized by pollen character state transitions. We identified diagnostic character states for monophyletic clades and explored palynological evidence to shed light on unresolved relationships. For example, palynological evidence supports the monophyly of Garryales and Metteniusaceae, and sister relationships between Icacinaceae and Oncothecaceae, as well as between Vahliales and Solanales. 
The evolutionary patterns of pollen morphology found in this study reconfirm several previously postulated evolutionary trends, which include an increase in aperture number, a transition from equatorially arranged apertures to globally distributed ones, and an increase in exine ornamentation complexity. Furthermore, there is a significant correlation between pollen characters and a number of ecological factors, e.g., pollen size and pollination type, pollen ornamentation and pollination type, and shape class and plant growth form. Our results provide insight into the ecological, environmental, and evolutionary mechanisms driving pollen character state changes in the Lamiids.


Genetics ◽  
2001 ◽  
Vol 159 (3) ◽  
pp. 1191-1199
Author(s):  
Araxi O Urrutia ◽  
Laurence D Hurst

Abstract In numerous species, from bacteria to Drosophila, evidence suggests that selection acts even on synonymous codon usage: codon bias is greater in more abundantly expressed genes, the rate of synonymous evolution is lower in genes with greater codon bias, and there is consistency between genes in the same species in which codons are preferred. In contrast, in mammals, while nonequal use of alternative codons is observed, the bias is attributed to the background variance in nucleotide concentrations, reflected in the similar nucleotide composition of flanking noncoding and exonic third sites. However, a systematic examination of the covariants of codon usage controlling for background nucleotide content has yet to be performed. Here we present a new method to measure codon bias that corrects for background nucleotide content and apply this to 2396 human genes. Nearly all (99%) exhibit a higher amount of codon bias than expected by chance. The patterns associated with selectively driven codon bias are weakly recovered: Broadly expressed genes have a higher level of bias than do tissue-specific genes, the bias is higher for genes with lower rates of synonymous substitutions, and certain codons are repeatedly preferred. However, while these patterns are suggestive, the first two patterns appear to be methodological artifacts. The last pattern reflects in part biases in usage of nucleotide pairs. We conclude that we find no evidence for selection on codon usage in humans.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Giovanni Franzo ◽  
Claudia Maria Tucciarone ◽  
Matteo Legnardi ◽  
Mattia Cecchinato

Abstract Background Infectious bronchitis virus (IBV) is one of the most relevant viruses affecting the poultry industry, and several studies have investigated the factors involved in its biological cycle and evolution. However, very few of those studies focused on the effect of genome composition and the codon bias of different IBV proteins, despite the remarkable increase in available complete genomes. In the present study, all IBV complete genomes were downloaded (n = 383), and several statistics representative of genome composition and codon bias were calculated for each protein-coding sequence, including but not limited to, the nucleotide odds ratio, relative synonymous codon usage and effective number of codons. Additionally, viral codon usage was compared to host codon usage based on a collection of highly expressed genes in IBV target and nontarget tissues. Results The results obtained demonstrated a significant difference among structural, non-structural and accessory proteins, especially regarding dinucleotide composition, which appears under strong selective forces. In particular, some dinucleotide pairs, such as CpG, a probable target of the host innate immune response, are underrepresented in genes coding for pp1a, pp1ab, S and N. Although genome composition and dinucleotide bias appear to affect codon usage, additional selective forces may act directly on codon bias. Variability in relative synonymous codon usage and effective number of codons was found for different proteins, with structural proteins and polyproteins being more adapted to the codon bias of host target tissues. In contrast, accessory proteins had a more biased codon usage (i.e., lower number of preferred codons), which might contribute to the regulation of their expression level and timing throughout the cell cycle. Conclusions The present study confirms the existence of selective forces acting directly on the genome and not only indirectly through phenotype selection. 
This evidence might help in understanding IBV biology and in developing attenuated strains without affecting the protein phenotype and therefore immunogenicity.
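
One of the statistics named above, relative synonymous codon usage (RSCU), is commonly defined as a codon's observed count divided by the mean count of all synonymous codons for the same amino acid. The Python sketch below illustrates that standard definition; it is not the authors' pipeline. The function name `rscu`, the use of the standard genetic code, and the skipping of stop codons and of amino acids absent from the sequence are assumptions.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for every amino acid that occurs in the sequence."""
    seq = cds.upper().replace("U", "T")  # accept RNA input as well
    usable = len(seq) - len(seq) % 3     # assumption: ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons by the amino acid they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not present: RSCU is undefined, so skip it
        mean = total / len(codons)
        for c in codons:
            values[c] = counts[c] / mean
    return values


# Toy sequence (hypothetical): four alanine codons, three GCT and one GCC.
example = rscu("GCTGCTGCTGCC")
print(example["GCT"], example["GCC"], example["GCA"])  # 3.0 1.0 0.0
```

A value of 1.0 means the codon is used exactly as often as expected under uniform synonymous usage; values above or below 1.0 indicate over- or under-representation.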

