Scalable empirical mixture models that account for across-site compositional heterogeneity

2019 ◽  
Author(s):  
Dominik Schrempf ◽  
Nicolas Lartillot ◽  
Gergely Szöllősi

Abstract Biochemical demands constrain the range of amino acids acceptable at specific sites resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10 to C60 models). Here, we present a new, scalable method EDCluster for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases, or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4096 components. Detailed analyses of the UDM models demonstrate the removal of various long branch attraction artifacts and improved performance compared to the C10 to C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).

2020 ◽  
Vol 37 (12) ◽  
pp. 3616-3631 ◽  
Author(s):  
Dominik Schrempf ◽  
Nicolas Lartillot ◽  
Gergely Szöllősi

Abstract Biochemical demands constrain the range of amino acids acceptable at specific sites resulting in across-site compositional heterogeneity of the amino acid replacement process. Phylogenetic models that disregard this heterogeneity are prone to systematic errors, which can lead to severe long-branch attraction artifacts. State-of-the-art models accounting for across-site compositional heterogeneity include the CAT model, which is computationally expensive, and empirical distribution mixture models estimated via maximum likelihood (C10–C60 models). Here, we present a new, scalable method EDCluster for finding empirical distribution mixture models involving a simple cluster analysis. The cluster analysis utilizes specific coordinate transformations which allow the detection of specialized amino acid distributions either from curated databases or from the alignment at hand. We apply EDCluster to the HOGENOM and HSSP databases in order to provide universal distribution mixture (UDM) models comprising up to 4,096 components. Detailed analyses of the UDM models demonstrate the removal of various long-branch attraction artifacts and improved performance compared with the C10–C60 models. Ready-to-use implementations of the UDM models are provided for three established software packages (IQ-TREE, Phylobayes, and RevBayes).
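
The clustering idea in the abstract above can be illustrated with a minimal sketch. It is not the EDCluster implementation: the centered log-ratio transform, the plain k-means loop, and all function names here are assumptions chosen for illustration, since the abstract does not say which coordinate transformations or clustering algorithm the authors use. The sketch takes per-site amino acid frequency vectors, transforms them, clusters them, and reads off mixture components and weights.

```python
import numpy as np

def clr(p, eps=1e-6):
    """Centered log-ratio transform of each row (assumed transformation)."""
    q = (p + eps) / (p + eps).sum(axis=1, keepdims=True)
    log_q = np.log(q)
    return log_q - log_q.mean(axis=1, keepdims=True)

def inverse_clr(z):
    """Map transformed coordinates back to probability vectors (softmax)."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def assign(x, centers):
    """Index of the nearest center (squared Euclidean distance) per row."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; empty clusters keep their previous center."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assign(x, centers)

def mixture_from_site_distributions(site_freqs, k):
    """Return (weights, components) of a k-component distribution mixture."""
    centers, labels = kmeans(clr(site_freqs), k)
    weights = np.bincount(labels, minlength=k) / len(labels)
    return weights, inverse_clr(centers)

# Hypothetical input: 500 simulated site distributions over 20 amino acids.
rng = np.random.default_rng(1)
site_freqs = rng.dirichlet(np.full(20, 0.5), size=500)
weights, components = mixture_from_site_distributions(site_freqs, k=4)
```

In a real analysis the site distributions would come from a curated database or from the alignment at hand, as the abstract describes, rather than from simulated Dirichlet draws.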


2008 ◽  
Vol 363 (1512) ◽  
pp. 3965-3976 ◽  
Author(s):  
Si Quang Le ◽  
Nicolas Lartillot ◽  
Olivier Gascuel

Standard protein substitution models use a single amino acid replacement rate matrix that summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code; solvent exposure; secondary and tertiary structure; protein function; etc. These impact the substitution pattern and, in most cases, a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models that combine several amino acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from TreeBase. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where in estimations we use the known category of each site, based on its exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models compared with the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the manner of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization and model comparison. Using an unsupervised model involving three matrices, the average AIC gain per site with TreeBase test alignments is 0.31, 0.49 and 0.61 compared with LG (named after Le & Gascuel 2008 Mol. Biol. Evol. 25, 1307–1320), WAG and JTT, respectively. This three-matrix model is significantly better than LG for 34 alignments (among 57), and significantly worse for 1 alignment only. Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures impacts not only the likelihood value but also the output tree. All our models and a PhyML implementation are available from http://atgc.lirmm.fr/mixtures.
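
As a rough illustration of the "AIC gain per site" figure quoted above, the sketch below uses the standard definition AIC = 2k - 2 ln L and divides the AIC difference between a reference model and an alternative model by the alignment length. That the paper computes the gain exactly this way is an assumption; the function names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def aic_gain_per_site(lnl_ref, k_ref, lnl_alt, k_alt, n_sites):
    """AIC improvement of the alternative model over the reference, per site.

    Positive values mean the alternative model is preferred.
    """
    return (aic(lnl_ref, k_ref) - aic(lnl_alt, k_alt)) / n_sites

# Hypothetical example: a single-matrix fit versus a mixture fit
# on an alignment of 300 sites.
gain = aic_gain_per_site(
    lnl_ref=-10000.0, k_ref=40,   # reference model (assumed values)
    lnl_alt=-9900.0, k_alt=45,    # mixture model (assumed values)
    n_sites=300,
)
```

Because the penalty term 2k grows with the number of free parameters, a mixture model only shows a positive gain when its likelihood improvement outweighs the extra parameters it introduces.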


1997 ◽  
Vol 78 (05) ◽  
pp. 1419-1420 ◽  
Author(s):  
Tetsuo Ozawa ◽  
Kenji Niiya ◽  
Naoko Ejiri ◽  
Nobuo Sakuragawa

Genetics ◽  
1998 ◽  
Vol 149 (1) ◽  
pp. 445-458 ◽  
Author(s):  
Nick Goldman ◽  
Jeffrey L Thorne ◽  
David T Jones

Abstract Empirically derived models of amino acid replacement are employed to study the association between various physical features of proteins and evolution. The strengths of these associations are statistically evaluated by applying the models of protein evolution to 11 diverse sets of protein sequences. Parametric bootstrap tests indicate that the solvent accessibility status of a site has a particularly strong association with the process of amino acid replacement that it experiences. Significant association between secondary structure environment and the amino acid replacement process is also observed. Careful description of the length distribution of secondary structure elements and of the organization of secondary structure and solvent accessibility along a protein did not always significantly improve the fit of the evolutionary models to the data sets that were analyzed. As indicated by the strength of the association of both solvent accessibility and secondary structure with amino acid replacement, the process of protein evolution—both above and below the species level—will not be well understood until the physical constraints that affect protein evolution are identified and characterized.


Genetics ◽  
2001 ◽  
Vol 158 (1) ◽  
pp. 279-290 ◽  
Author(s):  
Jorge Vieira ◽  
Bryant F McAllister ◽  
Brian Charlesworth

Abstract We analyze genetic variation at fused1, a locus that is close to the centromere of the X chromosome-autosome (X/4) fusion in Drosophila americana. In contrast to other X-linked and autosomal genes, for which a lack of population subdivision in D. americana has been observed at the DNA level, we find strong haplotype structure associated with the alternative chromosomal arrangements. There are several derived fixed differences at fused1 (including one amino acid replacement) between two haplotype classes of this locus. From these results, we obtain an estimate of an age of ∼0.61 million years for the origin of the two haplotypes of the fused1 gene. Haplotypes associated with the X/4 fusion have less DNA sequence variation at fused1 than haplotypes associated with the ancestral chromosome arrangement. The X/4 haplotypes also exhibit clinal variation for the allele frequencies of the three most common amino acid replacement polymorphisms, but not for adjacent silent polymorphisms. These patterns of variation are best explained as a result of selection acting on amino acid substitutions, with geographic variation in selection pressures.


2004 ◽  
Vol 21 (7) ◽  
pp. 975-980 ◽  
Author(s):  
G. E. Crooks ◽  
S. E. Brenner

Genetics ◽  
1997 ◽  
Vol 145 (2) ◽  
pp. 311-323 ◽  
Author(s):  
Brent Richter ◽  
Manyuan Long ◽  
R C Lewontin ◽  
Eiji Nitasaka

A study of polymorphism and species divergence of the dpp gene of Drosophila has been made. Eighteen lines from a population of D. melanogaster were sequenced for 5200 bp of the Hin region of the gene, coding for the dpp polypeptide. A comparison was made with sequence from D. simulans. Ninety-six silent polymorphisms and three amino acid replacement polymorphisms were found. The overall silent polymorphism (0.0247) is low, but haplotype diversity (0.0066 for effectively silent sites and 0.0054 for all sites) is in the range found for enzyme loci. Amino acid variation is absent in the N-terminal signal peptide, the C-terminal TGF-β peptide and in the N-terminal half of the pro-protein region. At the nucleotide level there is strong conservation in the middle half of the large intron and in the 3′ untranslated sequence of the last exon. The 3′ untranslated conservation, which is perfect for 110 bp among all the divergent species, is unexplained. There is strong positive linkage disequilibrium among polymorphic sites, with stretches of apparent gene conversion among originally divergent sequences. The population apparently is a migration mixture of divergent clades.


1997 ◽  
Vol 61 (1) ◽  
pp. 90-104
Author(s):  
P P Dennis ◽  
L C Shimmin

Halophilic (literally salt-loving) archaea are a highly evolved group of organisms that are uniquely able to survive in and exploit hypersaline environments. In this review, we examine the potential interplay between fluctuations in environmental salinity and the primary sequence and tertiary structure of halophilic proteins. The proteins of halophilic archaea are highly adapted and magnificently engineered to function in an intracellular milieu that is in ionic balance with an external environment containing between 2 and 5 M inorganic salt. To understand the nature of halophilic adaptation and to visualize this interplay, the sequences of genes encoding the L11, L1, L10, and L12 proteins of the large ribosome subunit and Mn/Fe superoxide dismutase proteins from three genera of halophilic archaea have been aligned and analyzed for the presence of synonymous and nonsynonymous nucleotide substitutions. Compared to homologous eubacterial genes, these halophilic genes exhibit an inordinately high proportion of nonsynonymous nucleotide substitutions that result in amino acid replacement in the encoded proteins. More than one-third of the replacements involve acidic amino acid residues. We suggest that fluctuations in environmental salinity provide the driving force for fixation of the excessive number of nonsynonymous substitutions. Tinkering with the number, location, and arrangement of acidic and other amino acid residues influences the fitness (i.e., hydrophobicity, surface hydration, and structural stability) of the halophilic protein. Tinkering is also evident at halophilic protein positions monomorphic or polymorphic for serine; more than one-third of these positions use both the TCN and the AGY serine codons, indicating that there have been multiple nonsynonymous substitutions at these positions. Our model suggests that fluctuating environmental salinity prevents optimization of fitness for many halophilic proteins and helps to explain the unusual evolutionary divergence of their encoding genes.
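
The serine codon argument in the abstract above can be made concrete with a small sketch. In the standard genetic code serine is encoded by two codon families, TCN (TCT, TCC, TCA, TCG) and AGY (AGT, AGC), and every TCN codon differs from every AGY codon at the first two positions at least. The helper names below are hypothetical, and the code only illustrates how an alignment column could be checked for use of both families; it is not the authors' analysis.

```python
from itertools import product

TCN = {"TCT", "TCC", "TCA", "TCG"}  # four-fold degenerate serine family
AGY = {"AGT", "AGC"}                # two-fold degenerate serine family

def serine_family(codon):
    """Return 'TCN', 'AGY', or None for a non-serine codon (RNA or DNA)."""
    codon = codon.upper().replace("U", "T")
    if codon in TCN:
        return "TCN"
    if codon in AGY:
        return "AGY"
    return None

def uses_both_families(codons):
    """True if the codons at one aligned position include both families."""
    families = {serine_family(c) for c in codons}
    return {"TCN", "AGY"} <= families

def hamming(a, b):
    """Number of differing nucleotide positions between two codons."""
    return sum(x != y for x, y in zip(a, b))

# Smallest number of nucleotide changes needed to switch family.
min_changes = min(hamming(a, b) for a, b in product(TCN, AGY))
```

Switching families takes at least two nucleotide changes, and any single-step intermediate is a non-serine codon, which is why a position using both families points to multiple nonsynonymous substitutions.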

