ModEnzA: Accurate Identification of Metabolic Enzymes Using Function Specific Profile HMMs with Optimised Discrimination Threshold and Modified Emission Probabilities

2011 ◽  
Vol 2011 ◽  
pp. 1-12 ◽  
Author(s):  
Dhwani K. Desai ◽  
Soumyadeep Nandi ◽  
Prashant K. Srivastava ◽  
Andrew M. Lynn

Various enzyme identification protocols involving homology transfer by sequence-sequence or profile-sequence comparisons have been devised which utilise Swiss-Prot sequences associated with EC numbers as the training set. A profile HMM constructed for a particular EC number might select sequences which perform a different enzymatic function due to the presence of certain fold-specific residues which are conserved in enzymes sharing a common fold. We describe a protocol, ModEnzA (HMM-ModE Enzyme Annotation), which generates profile HMMs that are highly specific at the functional level defined by EC numbers by incorporating information from negative training sequences. To increase sensitivity, we enrich the training dataset by mining sequences from the NCBI Non-Redundant database. We compare our method with other enzyme identification methods, both for assigning EC numbers to a genome and for identifying protein sequences associated with an enzymatic activity. We report a sensitivity of 88% and a specificity of 95% in identifying EC numbers and annotating enzymatic sequences from the E. coli genome, both higher than those of any other method compared. With next-generation sequencing methods producing huge amounts of sequence data, the development and use of fully automated yet accurate protocols such as ModEnzA is warranted for rapid annotation of newly sequenced genomes and metagenomic sequences.
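
The idea of a discrimination threshold tuned on positive and negative training sequences can be illustrated with a minimal sketch. It is not the ModEnzA implementation: the abstract does not state which criterion is optimised, so the sketch assumes the cutoff is chosen to maximise sensitivity plus specificity (Youden's J). The function names and the bit scores are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(pos_scores: Sequence[float],
                            neg_scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold count as hits."""
    true_pos = sum(score >= threshold for score in pos_scores)
    true_neg = sum(score < threshold for score in neg_scores)
    return true_pos / len(pos_scores), true_neg / len(neg_scores)


def best_threshold(pos_scores: Sequence[float],
                   neg_scores: Sequence[float]) -> float:
    """Pick the observed score that maximises sensitivity + specificity.

    Assumption: this criterion is used for illustration only; the abstract
    does not say which objective ModEnzA optimises.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(
        candidates,
        key=lambda t: sum(sensitivity_specificity(pos_scores, neg_scores, t)),
    )


# Hypothetical profile-HMM bit scores for one EC number.
positives = [52.1, 48.3, 60.7, 33.0]        # sequences annotated with the EC number
negatives = [12.4, 30.2, 41.0, 8.9, 25.5]   # sequences with a different function
cutoff = best_threshold(positives, negatives)
sens, spec = sensitivity_specificity(positives, negatives, cutoff)
print(f"cutoff={cutoff}, sensitivity={sens:.2f}, specificity={spec:.2f}")
```

With these toy scores one negative sequence scores above the weakest positive, so no cutoff separates the two sets perfectly and the choice becomes a trade-off between the two measures.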

2021 ◽  
Vol 9 (8) ◽  
pp. 1570
Author(s):  
Chien-Hsun Huang ◽  
Chih-Chieh Chen ◽  
Yu-Chun Lin ◽  
Chia-Hsuan Chen ◽  
Ai-Yun Lee ◽  
...  

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that cannot be distinguished from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out to find highly discriminating target genes whose interspecific sequence identity is significantly lower than that of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that the csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and its discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited clearly lower interspecific sequence identity (61.6–85.6%; average: 66.6%) than the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), which allowed the tested strains to be separated into distinct species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only a relatively low mutL sequence identity (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, suggesting that it could be reclassified as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
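
The sequence identity figures quoted in this abstract are pairwise percentages over aligned sequences. The sketch below shows one common way to compute such a value. It assumes the two sequences are already aligned and that gap columns are skipped, which is only one of several conventions; the abstract does not say which was used. The function name and the toy fragments are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for base_a, base_b in zip(aligned_a.upper(), aligned_b.upper()):
        if base_a == "-" or base_b == "-":
            continue  # skip gap columns (assumed convention)
        compared += 1
        matches += base_a == base_b
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments, not real mutL sequences.
identity = percent_identity("ATGCC-GTTA", "ATGACTGTCA")
print(f"{identity:.1f}% identity")
```

A marker gene is more discriminating when this value is lower between species, which is why a 61.6–85.6% range separates species better than a 96.7–100% range.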


Author(s):  
Daniella F Lato ◽  
G Brian Golding

Increasing evidence supports the notion that different regions of a genome have unique rates of molecular change. This variation is particularly evident in bacterial genomes, where previous studies have reported that gene expression and essentiality tend to decrease, while substitution rates usually increase, with increasing distance from the origin of replication. Genomic reorganization, such as rearrangements, occurs frequently in bacteria and allows for the introduction and restructuring of genetic content, creating gradients of molecular traits along genomes. Here, we explore the interplay of these phenomena by mapping substitutions to the genomes of Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti, quantifying how many substitutions have occurred at each position in the genome. Preceding work indicates that substitution rate significantly increases with distance from the origin. Using a larger sample size and accounting for genome rearrangements through ancestral reconstruction, our analysis demonstrates that the correlation between the number of substitutions and distance from the origin of replication is often significant but small and inconsistent in direction. Some replicons had a significantly decreasing trend (E. coli and the chromosome of S. meliloti), while others showed the opposite significant trend (B. subtilis, Streptomyces, pSymA and pSymB in S. meliloti). dN, dS and ω were examined across all genes and there was no significant correlation between those values and distance from the origin. This study highlights the impact that genomic rearrangements and location have on molecular trends in some bacteria, illustrating the importance of considering spatial trends in molecular evolutionary analysis. Assuming that molecular trends are exclusively in one direction can be problematic.
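
The distance measure at the centre of this abstract can be illustrated with a minimal sketch. It assumes a circular replicon replicated bidirectionally, so that distance is the shorter arc from the origin; linear replicons, such as Streptomyces chromosomes, would need a plain absolute difference instead. The function names, the toy coordinates and the use of a Pearson coefficient are assumptions for illustration, since the abstract does not specify the statistic used.

```python
from math import sqrt
from typing import Sequence


def distance_from_origin(position: int, origin: int, genome_length: int) -> int:
    """Shortest distance along a circular replicon from the origin of replication."""
    if not (0 <= position < genome_length and 0 <= origin < genome_length):
        raise ValueError("coordinates must lie within the replicon")
    offset = abs(position - origin)
    return min(offset, genome_length - offset)


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical 1000 bp circular replicon with its origin at coordinate 900.
positions = [950, 100, 400, 700]
distances = [distance_from_origin(p, origin=900, genome_length=1000) for p in positions]
substitutions = [1, 2, 5, 2]  # made-up substitution counts at those positions
print(distances, round(pearson(distances, substitutions), 3))
```

The sign of the coefficient corresponds to the direction of the trend discussed above: positive when substitutions increase with distance from the origin, negative when they decrease.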


2021 ◽  
Vol 16 (1) ◽  
Author(s):  
Kingshuk Mukherjee ◽  
Massimiliano Rossi ◽  
Leena Salmela ◽  
Christina Boucher

Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data: only one publicly available non-proprietary method and one proprietary program that is available as an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and therefore is unable to scale to relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method ran successfully on all three genomes, whereas the method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) ran successfully only on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.
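
The "numeric representation" mentioned in this abstract is an ordered list of fragment lengths between restriction sites. The sketch below shows that representation with an in silico digest of a toy sequence. It is not part of rmapper and does not model Rmap errors or the de Bruijn graph construction; it assumes a single-strand search and a cut placed at the start of each recognition site. The function name and the sequence are hypothetical.

```python
from typing import List


def fragment_lengths(sequence: str, site: str) -> List[int]:
    """Ordered fragment lengths between occurrences of a recognition site.

    Simplifications (assumed for illustration): only the given strand is
    searched and the cut is placed at the first base of each site.
    """
    sequence = sequence.upper()
    site = site.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    # Drop zero-length fragments that arise when a site sits at position 0.
    return [end - begin for begin, end in zip(boundaries, boundaries[1:]) if end > begin]


# Toy sequence containing three GAATTC sites.
toy_map = fragment_lengths("AAGAATTCGGGGAATTCTTGAATTCA", "GAATTC")
print(toy_map)
```

Real Rmaps carry sizing errors and missing or spurious cut sites, which is what makes their assembly harder than comparing exact lists like this one.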


2006 ◽  
Vol 400 (1) ◽  
pp. 105-114 ◽  
Author(s):  
Eric Di Luccio ◽  
Robert A. Elling ◽  
David K. Wilson

The AKRs (aldo-keto reductases) are a superfamily of enzymes which mainly rely on NADPH to reversibly reduce various carbonyl-containing compounds to the corresponding alcohols. A small number have been found with dual NADPH/NADH specificity, usually preferring NADPH, but none are exclusive for NADH. Crystal structures of the dual-specificity enzyme xylose reductase (AKR2B5) indicate that NAD+ is bound via a key interaction with a glutamate that is able to change conformations to accommodate the 2′-phosphate of NADP+. Sequence comparisons suggest that analogous glutamate or aspartate residues may function in other AKRs to allow NADH utilization. Based on this, nine putative enzymes with potential NADH specificity were identified and seven genes were successfully expressed and purified from Drosophila melanogaster, Escherichia coli, Schizosaccharomyces pombe, Sulfolobus solfataricus, Sinorhizobium meliloti and Thermotoga maritima. Each was assayed for co-substrate dependence with conventional AKR substrates. Three were exclusive for NADPH (AKR2E3, AKR3F2 and AKR3F3), two were dual-specific (AKR3C2 and AKR3F1) and one was specific for NADH (AKR11B2), the first such activity in an AKR. Fluorescence measurements of the seventh protein indicated that it bound both NADPH and NADH but had no activity. Mutation of the aspartate into an alanine residue or a more mobile glutamate in the NADH-specific E. coli protein converted it into an enzyme with dual specificity. These results show that the presence of this carboxylate is an indication of NADH dependence. This should allow improved prediction of co-substrate specificity and provide a basis for engineering enzymes with altered co-substrate utilization for this class of enzymes.


Plant Disease ◽  
2021 ◽  
Author(s):  
Terry Torres-Cruz ◽  
Briana Whitaker ◽  
Robert Proctor ◽  
Kirk Broders ◽  
Imane Laraba ◽  
...  

Species within Fusarium are of global agricultural, medical, and food/feed safety concern and have been extensively characterized. However, accurate identification of species is challenging and usually requires DNA sequence data. FUSARIUM-ID (http://isolate.fusariumdb.org/) is a publicly available database designed to support the identification of Fusarium species using sequences of multiple phylogenetically informative loci, especially the highly informative ~680 bp 5' portion of the translation elongation factor 1-alpha (TEF1) gene that has been adopted as the primary barcoding locus in the genus. However, FUSARIUM-ID v.1.0 and 2.0 had several limitations, including inconsistent metadata annotation for the archived sequences and poor representation of some species complexes and marker loci. Here, we present FUSARIUM-ID v.3.0, which provides the following improvements: (i) additional and updated annotation of metadata for isolates associated with each sequence, (ii) expanded taxon representation in the TEF1 sequence database, (iii) availability of the sequence database as a downloadable file to enable local BLAST queries, and (iv) a tutorial file for users to perform local BLAST searches using either freely-available software, such as SequenceServer, BLAST+ executable in the command line, and Galaxy, or the proprietary Geneious software. FUSARIUM-ID will be updated on a regular basis by archiving sequences of TEF1 and other loci from newly identified species and greater in-depth sampling of currently recognized species.


1982 ◽  
Vol 39 (1) ◽  
pp. 1-30 ◽  
Author(s):  
George L. Gabor Miklos ◽  
Amanda Clare Gill

The nucleotide sequence data from highly repeated DNAs of invertebrates and mammals are summarized and briefly discussed. Very similar conclusions can be drawn from the two data bases. Sequence complexities can vary from 2 bp to at least 359 bp in invertebrates and from 3 bp to at least 2350 bp in mammals. The larger sequences may or may not exhibit a substructure. Significant sequence variation occurs for any given repeated array within a species, but the sources of this heterogeneity have not been systematically partitioned. The types of alterations in a basic repeating unit can involve base changes as well as deletions or additions which can vary from 1 bp to at least 98 bp in length. These changes indicate that sequence per se is unlikely to be under significant biological constraints and may sensibly be examined by analogy to Kimura's neutral theory for allelic variation. It is not possible with the present evidence to discriminate between the roles of neutral and selective mechanisms in the evolution of highly repeated DNA. Tandemly repeated arrays are constantly subjected to cycles of amplification and deletion by mechanisms for which the available data stem largely from ribosomal genes. It is a matter of conjecture whether the solutions to the mechanistic puzzles involved in amplification or rapid redeployment of satellite sequences throughout a genome will necessarily give any insight into biological functions. The lack of significant somatic effects when the satellite DNA content of a genome is significantly perturbed indicates that the hunt for specific functions at the cellular level is unlikely to prove profitable. The presence or in some cases the amount of satellite DNA on a chromosome, however, can have significant effects in the germ line.
There the data show that localized condensed chromatin, rich in satellite DNA, can have the effect of rendering adjacent euchromatic regions rec−, or of altering levels of recombination on different chromosomes. No data stemming from natural populations however are yet available to tell us if these effects are of adaptive or evolutionary significance.


2011 ◽  
Vol 13 (2) ◽  
pp. 214-224 ◽  
Author(s):  
Nicholas R. Sandoval ◽  
Tirzah Y. Mills ◽  
Min Zhang ◽  
Ryan T. Gill
Keyword(s):  
E Coli ◽  

2007 ◽  
Vol 32 (5) ◽  
pp. 373-380 ◽  
Author(s):  
Jorge F. Pereira ◽  
Mariana D.C. Ignacchiti ◽  
Elza F. Araújo ◽  
Sérgio H. Brommonschenkel ◽  
Júlio C.M. Cascardo ◽  
...  

Reverse transcriptase (RT) sequence analysis is an important technique used to detect the presence of transposable elements in a genome. Putative RT sequences were analyzed in the genome of the pathogenic fungus C. perniciosa, the causal agent of witches' broom disease of cocoa. A 394 bp fragment was amplified from genomic DNA of different isolates of C. perniciosa belonging to C-, L-, and S-biotypes and collected from various geographical areas. The cleavage of PCR products with restriction enzymes and the sequencing of various RT fragments indicated the presence of several sequences showing transition events (G:C to A:T). Southern blot analysis revealed high copy numbers of RT signals, forming different patterns among C-, S-, and L-biotype isolates. Sequence comparisons of the predicted RT peptide indicate a close relationship with the RT protein from the gypsy family of LTR-retrotransposons. The possible role of these retrotransposons in generating genetic variability in the homothallic C. perniciosa is discussed.


Parasitology ◽  
2009 ◽  
Vol 136 (5) ◽  
pp. 469-485 ◽  
Author(s):  
A. S. TAFT ◽  
J. J. VERMEIRE ◽  
J. BERNIER ◽  
S. R. BIRKELAND ◽  
M. J. CIPRIANO ◽  
...  

Infection of the snail, Biomphalaria glabrata, by the free-swimming miracidial stage of the human blood fluke, Schistosoma mansoni, and its subsequent development to the parasitic sporocyst stage is critical to establishment of viable infections and continued human transmission. We performed a genome-wide expression analysis of the S. mansoni miracidia and developing sporocyst using Long Serial Analysis of Gene Expression (LongSAGE). Five cDNA libraries were constructed from miracidia and in vitro cultured 6- and 20-day-old sporocysts maintained in sporocyst medium (SM) or in SM conditioned by previous cultivation with cells of the B. glabrata embryonic (Bge) cell line. We generated 21 440 SAGE tags and mapped 13 381 to the S. mansoni gene predictions (v4.0e) either by estimating theoretical 3′ UTR lengths or using existing 3′ EST sequence data. Overall, 432 transcripts were found to be differentially expressed amongst all 5 libraries. In total, 172 tags were differentially expressed between miracidia and 6-day conditioned sporocysts and 152 were differentially expressed between miracidia and 6-day unconditioned sporocysts. In addition, 53 and 45 tags, respectively, were differentially expressed in 6-day and 20-day cultured sporocysts, due to the effects of exposure to Bge cell-conditioned medium.


2020 ◽  
Author(s):  
Brett Whitty ◽  
John F. Thompson

Background: Low levels of sample contamination can have disastrous effects on the accurate identification of somatic variation in tumor samples. Detection of sample contamination in DNA is generally based on observation of low frequency variants that suggest more than a single source of DNA is present. This strategy works with standard DNA samples but is especially problematic in solid tumor FFPE samples because there can be huge variations in allele frequency (AF) due to massive copy number changes arising from large gains and losses across the genome. The tremendously variable allele frequencies make detection of contamination challenging. A method not based on individual AF is needed for accurate determination of whether a sample is contaminated and to what degree. Methods: We used microhaplotypes to determine whether sample contamination is present. Microhaplotypes are sets of variants on the same sequencing read that can be unambiguously phased. Instead of measuring AF, the number and frequency of microhaplotypes is determined. Contamination detection becomes based on fundamental genomic properties, linkage disequilibrium (LD) and the diploid nature of human DNA, rather than variant frequencies. We optimized microhaplotype content based on 164 single nucleotide variant sets located in genes already sequenced within a cancer panel. Thus, contamination detection uses existing sequence data and does not require sequencing of any extraneous regions. The content is chosen based on LD data from the 1000 Genomes Project to be ancestry agnostic, providing the same sensitivity for contamination detection with samples from individuals of African, East Asian, and European ancestry. Results: Detection of contamination at 1% and below is possible using this design. The methods described here can also be extended to other DNA mixtures such as forensic and non-invasive prenatal testing samples where DNA mixes of 1% or less can be similarly detected. Conclusions: The microhaplotype method allows sensitive detection of DNA contamination in FFPE tumor samples. These methods provide a foundation for examining DNA mixtures in a variety of contexts. With the appropriate panels and high sequencing depth, low levels of secondary DNA can be detected and this can be valuable in a variety of applications.
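
The reasoning behind the microhaplotype approach can be sketched in a few lines: a diploid sample should show at most two haplotypes at a locus, so a third haplotype supported by enough reads suggests a second DNA source. This is a simplified illustration, not the authors' method. The function names, the noise floor of 0.5% and the read counts are assumptions, and a real analysis would also have to account for sequencing error and for the panel design described above.

```python
from collections import Counter
from typing import Dict, Iterable


def haplotype_fractions(read_haplotypes: Iterable[str]) -> Dict[str, float]:
    """Fraction of reads supporting each phased haplotype at one locus."""
    counts = Counter(read_haplotypes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hap: n / total for hap, n in counts.items()}


def extra_haplotypes(read_haplotypes: Iterable[str],
                     min_fraction: float = 0.005,
                     ploidy: int = 2) -> Dict[str, float]:
    """Haplotypes beyond the `ploidy` most frequent ones that exceed a noise floor.

    `min_fraction` is a hypothetical cutoff meant to ignore haplotypes that
    could be explained by sequencing error alone.
    """
    fractions = haplotype_fractions(read_haplotypes)
    ranked = sorted(fractions.items(), key=lambda item: item[1], reverse=True)
    return {hap: frac for hap, frac in ranked[ploidy:] if frac >= min_fraction}


# Made-up reads at one two-variant locus: two expected haplotypes plus a third.
reads = ["AC"] * 500 + ["GT"] * 480 + ["AT"] * 20
suspicious = extra_haplotypes(reads)
print(suspicious)
```

Because the check counts distinct haplotypes rather than comparing allele frequencies, it does not depend on the two expected haplotypes being present in equal proportions.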

