Scalable empirical mixture models that account for across-site compositional heterogeneity

bioRxiv ◽

10.1101/794263 ◽

2019 ◽

Cited By ~ 1

Author(s):

Dominik Schrempf ◽

Nicolas Lartillot ◽

Gergely Szöllősi

Keyword(s):

Cluster Analysis ◽

Amino Acid ◽

Mixture Models ◽

Empirical Distribution ◽

Amino Acid Replacement ◽

Compositional Heterogeneity ◽

Long Branch Attraction ◽

Distribution Mixture ◽

Improved Performance ◽

Cat Model

Abstract Biochemical demands constrain the range of amino acids acceptable at specific sites resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10 to C60 models). Here, we present a new, scalable method EDCluster for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases, or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4096 components. Detailed analyses of the UDM models demonstrate the removal of various long branch attraction artifacts and improved performance compared to the C10 to C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).

Scalable Empirical Mixture Models That Account for Across-Site Compositional Heterogeneity

Molecular Biology and Evolution ◽

10.1093/molbev/msaa145 ◽

2020 ◽

Vol 37 (12) ◽

pp. 3616-3631 ◽

Cited By ~ 1

Author(s):

Dominik Schrempf ◽

Nicolas Lartillot ◽

Gergely Szöllősi

Keyword(s):

Cluster Analysis ◽

Amino Acid ◽

Mixture Models ◽

Empirical Distribution ◽

Amino Acid Replacement ◽

Compositional Heterogeneity ◽

Long Branch Attraction ◽

Distribution Mixture ◽

Improved Performance ◽

Cat Model

Abstract Biochemical demands constrain the range of amino acids acceptable at specific sites resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10–C60 models). Here, we present a new, scalable method EDCluster for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4,096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared with the C10–C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).
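
The clustering idea described in this abstract can be illustrated with a minimal sketch. It is not the EDCluster implementation: the abstract does not name the coordinate transformations or the clustering algorithm, so the centered log-ratio transform, the plain k-means loop, and every function name below (`clr`, `kmeans`, `mixture_components`) are assumptions chosen for illustration. The sketch takes per-site amino acid frequency vectors, transforms them, clusters them, and reports each cluster's mean distribution and relative size as a mixture component and weight.

```python
import numpy as np

def clr(freqs, pseudo=1e-6):
    """Centered log-ratio transform of each row (an assumed stand-in transform)."""
    p = freqs + pseudo                      # avoid log(0) for absent amino acids
    p = p / p.sum(axis=1, keepdims=True)    # renormalize rows to sum to 1
    logp = np.log(p)
    return logp - logp.mean(axis=1, keepdims=True)

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):         # leave empty clusters where they are
                centers[j] = x[labels == j].mean(axis=0)
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def mixture_components(site_freqs, k):
    """Cluster site distributions; return (component distributions, weights)."""
    labels = kmeans(clr(site_freqs), k)
    comps, weights = [], []
    for j in range(k):
        members = site_freqs[labels == j]
        if len(members) == 0:
            continue                        # skip clusters that ended up empty
        comps.append(members.mean(axis=0))  # mean of distributions still sums to 1
        weights.append(len(members) / len(site_freqs))
    return np.array(comps), np.array(weights)
```

With a hypothetical input such as `np.random.default_rng(1).dirichlet(np.ones(20) * 0.5, size=200)`, `mixture_components(sites, k=4)` returns at most four 20-state distributions and weights that sum to one. Clustering in transformed rather than raw frequency space is the design choice the abstract highlights; which transform works best is a question for the paper itself.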

Phylogenetic mixture models for proteins

Philosophical Transactions of the Royal Society B Biological Sciences ◽

10.1098/rstb.2008.0180 ◽

2008 ◽

Vol 363 (1512) ◽

pp. 3965-3976 ◽

Cited By ~ 119

Author(s):

Si Quang Le ◽

Nicolas Lartillot ◽

Olivier Gascuel

Keyword(s):

Amino Acid ◽

Mixture Models ◽

Model Comparison ◽

Tertiary Structure ◽

Amino Acid Replacement ◽

Single Amino Acid ◽

Learning Approaches ◽

Solvent Exposure ◽

Substitution Pattern ◽

Better Than

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TreeBase. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TreeBase test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307–1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
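
To make the quantities in this abstract concrete, the sketch below shows how a mixture combines per-component site likelihoods and how an AIC gain per site could be computed. It assumes the usual definition AIC = 2k − 2 ln L and a per-site gain equal to the AIC difference divided by alignment length; the abstract does not spell out its exact formula. The function names and the numbers in the usage note are hypothetical, not taken from the paper.

```python
import math

def mixture_site_log_likelihood(component_likelihoods, weights):
    """Log-likelihood of one site under a mixture: log of the weighted sum."""
    total = sum(w * l for w, l in zip(weights, component_likelihoods))
    return math.log(total)

def aic(log_likelihood, n_params):
    """Akaike information criterion, assuming AIC = 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_likelihood

def aic_gain_per_site(lnl_ref, k_ref, lnl_new, k_new, n_sites):
    """Positive values mean the new model is preferred over the reference."""
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites
```

For example, with made-up values, a reference model with ln L = -10000 and 10 free parameters compared against a mixture with ln L = -9900 and 14 free parameters on a 500-site alignment gives `aic_gain_per_site(-10000, 10, -9900, 14, 500)`, which evaluates to 0.384 per site.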

A Novel Polymorphism in Exon 10 of the Factor V Gene Inducing an Amino Acid Replacement of Arginine to Lysine at Codon 485

Thrombosis and Haemostasis ◽

10.1055/s-0038-1665424 ◽

1997 ◽

Vol 78 (05) ◽

pp. 1419-1420 ◽

Cited By ~ 1

Author(s):

Tetsuo Ozawa ◽

Kenji Niiya ◽

Naoko Ejiri ◽

Nobuo Sakuragawa

Keyword(s):

Amino Acid ◽

Amino Acid Replacement ◽

Factor V ◽

Factor V Gene ◽

V Gene ◽

Exon 10

Assessing the Impact of Secondary Structure and Solvent Accessibility on Protein Evolution

Genetics ◽

10.1093/genetics/149.1.445 ◽

1998 ◽

Vol 149 (1) ◽

pp. 445-458 ◽

Cited By ~ 21

Author(s):

Nick Goldman ◽

Jeffrey L Thorne ◽

David T Jones

Keyword(s):

Amino Acid ◽

Secondary Structure ◽

Protein Evolution ◽

Solvent Accessibility ◽

Strong Association ◽

Length Distribution ◽

Parametric Bootstrap ◽

Amino Acid Replacement ◽

Physical Constraints ◽

The Impact

Abstract Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution—both above and below the species level—will not be well understood until the physical constraints that affect protein evolution are identified and characterized.

Evidence for Selection at the fused1 Locus of Drosophila americana

Genetics ◽

10.1093/genetics/158.1.279 ◽

2001 ◽

Vol 158 (1) ◽

pp. 279-290 ◽

Cited By ~ 4

Author(s):

Jorge Vieira ◽

Bryant F McAllister ◽

Brian Charlesworth

Keyword(s):

Amino Acid ◽

Sequence Variation ◽

Amino Acid Replacement ◽

Population Subdivision ◽

Amino Acid Substitutions ◽

Clinal Variation ◽

Haplotype Structure ◽

Common Amino Acid ◽

Chromosome Arrangement ◽

Dna Sequence Variation

Abstract We analyze genetic variation at fused1, a locus that is close to the centromere of the X chromosome-autosome (X/4) fusion in Drosophila americana. In contrast to other X-linked and autosomal genes, for which a lack of population subdivision in D. americana has been observed at the DNA level, we find strong haplotype structure associated with the alternative chromosomal arrangements. There are several derived fixed differences at fused1 (including one amino acid replacement) between two haplotype classes of this locus. From these results, we obtain an estimate of an age of ∼0.61 million years for the origin of the two haplotypes of the fused1 gene. Haplotypes associated with the X/4 fusion have less DNA sequence variation at fused1 than haplotypes associated with the ancestral chromosome arrangement. The X/4 haplotypes also exhibit clinal variation for the allele frequencies of the three most common amino acid replacement polymorphisms, but not for adjacent silent polymorphisms. These patterns of variation are best explained as a result of selection acting on amino acid substitutions, with geographic variation in selection pressures.

ReplacementMatrix: a web server for maximum-likelihood estimation of amino acid replacement rate matrices

Bioinformatics ◽

10.1093/bioinformatics/btr435 ◽

2011 ◽

Vol 27 (19) ◽

pp. 2758-2760 ◽

Cited By ~ 11

Author(s):

C. C. Dang ◽

V. Lefort ◽

V. S. Le ◽

Q. S. Le ◽

O. Gascuel

Keyword(s):

Amino Acid ◽

Maximum Likelihood ◽

Maximum Likelihood Estimation ◽

Web Server ◽

Likelihood Estimation ◽

Amino Acid Replacement ◽

Replacement Rate

An alternative model of amino acid replacement

Bioinformatics ◽

10.1093/bioinformatics/bti109 ◽

2004 ◽

Vol 21 (7) ◽

pp. 975-980 ◽

Cited By ~ 13

Author(s):

G. E. Crooks ◽

S. E. Brenner

Keyword(s):

Amino Acid ◽

Alternative Model ◽

Amino Acid Replacement

Nucleotide Variation and Conservation at the dpp Locus, a Gene Controlling Early Development in Drosophila

Genetics ◽

10.1093/genetics/145.2.311 ◽

1997 ◽

Vol 145 (2) ◽

pp. 311-323 ◽

Cited By ~ 2

Author(s):

Brent Richter ◽

Manyuan Long ◽

R C Lewontin ◽

Eiji Nitasaka

Keyword(s):

Linkage Disequilibrium ◽

Amino Acid ◽

Gene Conversion ◽

Haplotype Diversity ◽

Amino Acid Replacement ◽

Nucleotide Variation ◽

Nucleotide Level ◽

Untranslated Sequence ◽

Gene Coding ◽

Large Intron

A study of polymorphism and species divergence of the dpp gene of Drosophila has been made. Eighteen lines from a population of D. melanogaster were sequenced for 5200 bp of the Hin region of the gene, coding for the dpp polypeptide. A comparison was made with sequence from D. simulans. Ninety-six silent polymorphisms and three amino acid replacement polymorphisms were found. The overall silent polymorphism (0.0247) is low, but haplotype diversity (0.0066 for effectively silent sites and 0.0054 for all sites) is in the range found for enzyme loci. Amino acid variation is absent in the N-terminal signal peptide, the C-terminal TGF-β peptide and in the N-terminal half of the pro-protein region. At the nucleotide level there is strong conservation in the middle half of the large intron and in the 3′ untranslated sequence of the last exon. The 3′ untranslated conservation, which is perfect for 110 bp among all the divergent species, is unexplained. There is strong positive linkage disequilibrium among polymorphic sites, with stretches of apparent gene conversion among originally divergent sequences. The population apparently is a migration mixture of divergent clades.

Evolutionary divergence and salinity-mediated selection in halophilic archaea

Microbiology and Molecular Biology Reviews ◽

10.1128/mmbr.61.1.90-104.1997 ◽

1997 ◽

Vol 61 (1) ◽

pp. 90-104

Author(s):

P P Dennis ◽

L C Shimmin

Keyword(s):

Amino Acid ◽

Tertiary Structure ◽

Halophilic Archaea ◽

Amino Acid Replacement ◽

Evolutionary Divergence ◽

Ionic Balance ◽

Amino Acid Residues ◽

Nucleotide Substitutions ◽

Nonsynonymous Substitutions ◽

Environmental Salinity

Halophilic (literally salt-loving) archaea are a highly evolved group of organisms that are uniquely able to survive in and exploit hypersaline environments. In this review, we examine the potential interplay between fluctuations in environmental salinity and the primary sequence and tertiary structure of halophilic proteins. The proteins of halophilic archaea are highly adapted and magnificently engineered to function in an intracellular milieu that is in ionic balance with an external environment containing between 2 and 5 M inorganic salt. To understand the nature of halophilic adaptation and to visualize this interplay, the sequences of genes encoding the L11, L1, L10, and L12 proteins of the large ribosome subunit and Mn/Fe superoxide dismutase proteins from three genera of halophilic archaea have been aligned and analyzed for the presence of synonymous and nonsynonymous nucleotide substitutions. Compared to homologous eubacterial genes, these halophilic genes exhibit an inordinately high proportion of nonsynonymous nucleotide substitutions that result in amino acid replacement in the encoded proteins. More than one-third of the replacements involve acidic amino acid residues. We suggest that fluctuations in environmental salinity provide the driving force for fixation of the excessive number of nonsynonymous substitutions. Tinkering with the number, location, and arrangement of acidic and other amino acid residues influences the fitness (i.e., hydrophobicity, surface hydration, and structural stability) of the halophilic protein. Tinkering is also evident at halophilic protein positions monomorphic or polymorphic for serine; more than one-third of these positions use both the TCN and the AGY serine codons, indicating that there have been multiple nonsynonymous substitutions at these positions. Our model suggests that fluctuating environmental salinity prevents optimization of fitness for many halophilic proteins and helps to explain the unusual evolutionary divergence of their encoding genes.
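
The serine-codon argument in this abstract rests on a property of the standard genetic code that a short sketch can check: every TCN codon differs from every AGY codon at two or more nucleotide positions, and the single-step intermediates encode threonine (ACY) or cysteine (TGY), not serine. The helper names below are illustrative only, and the small lookup table covers just the four intermediate codons relevant here.

```python
from itertools import product

TCN = ["TC" + n for n in "TCAG"]   # the four-fold serine codon family
AGY = ["AG" + y for y in "TC"]     # the two-fold serine codon family

# Standard-code amino acids for the possible one-step intermediates.
INTERMEDIATE_AA = {"ACT": "Thr", "ACC": "Thr", "TGT": "Cys", "TGC": "Cys"}

def hamming(a, b):
    """Number of nucleotide positions at which two codons differ."""
    return sum(x != y for x, y in zip(a, b))

def min_changes_between_families():
    """Fewest nucleotide changes needed to move from any TCN to any AGY codon."""
    return min(hamming(a, b) for a, b in product(TCN, AGY))

def single_step_intermediates(src, dst):
    """Codons one change away from src that are also one change away from dst."""
    found = []
    for i in range(3):
        for n in "TCAG":
            codon = src[:i] + n + src[i + 1:]
            if codon != src and hamming(codon, dst) == 1:
                found.append(codon)
    return found
```

Calling `single_step_intermediates("TCT", "AGT")` yields the codons `ACT` and `TGT`, so a lineage moving between the two serine families one nucleotide at a time passes through a non-serine codon, which is consistent with the abstract's inference of multiple nonsynonymous substitutions.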

Comparison of Finite and Infinite Mixture Models for Capturing Compositional Heterogeneity Across Sites

10.22215/etd/2018-13324 ◽

2018 ◽

Author(s):

Thomas Bujaki

Keyword(s):

Mixture Models ◽

Compositional Heterogeneity
