ModEnzA: Accurate Identification of Metabolic Enzymes Using Function Specific Profile HMMs with Optimised Discrimination Threshold and Modified Emission Probabilities

Advances in Bioinformatics ◽

10.1155/2011/743782 ◽

2011 ◽

Vol 2011 ◽

pp. 1-12 ◽

Cited By ~ 8

Author(s):

Dhwani K. Desai ◽

Soumyadeep Nandi ◽

Prashant K. Srivastava ◽

Andrew M. Lynn

Keyword(s):

Sequence Data ◽

Training Dataset ◽

Accurate Identification ◽

Sequence Comparisons ◽

Profile Hmms ◽

E Coli ◽

Enzymatic Function ◽

Training Sequences ◽

A Genome ◽

Increased Sensitivity

Various enzyme identification protocols involving homology transfer by sequence-sequence or profile-sequence comparisons have been devised which utilise Swiss-Prot sequences associated with EC numbers as the training set. A profile HMM constructed for a particular EC number might select sequences which perform a different enzymatic function due to the presence of certain fold-specific residues which are conserved in enzymes sharing a common fold. We describe a protocol, ModEnzA (HMM-ModE Enzyme Annotation), which generates profile HMMs highly specific at a functional level as defined by the EC numbers by incorporating information from negative training sequences. We enrich the training dataset by mining sequences from the NCBI Non-Redundant database for increased sensitivity. We compare our method with other enzyme identification methods, both for assigning EC numbers to a genome and for identifying protein sequences associated with an enzymatic activity. We report a sensitivity of 88% and specificity of 95% in identifying EC numbers and annotating enzymatic sequences from the E. coli genome, which are higher than those of any other method. With next-generation sequencing methods producing a huge amount of sequence data, the development and use of fully automated yet accurate protocols such as ModEnzA is warranted for rapid annotation of newly sequenced genomes and metagenomic sequences.
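
The sensitivity and specificity figures above, and the idea of an optimised discrimination threshold, can be illustrated with a minimal Python sketch. It assumes each sequence already has a profile-HMM score, and it picks the cutoff that maximises Youden's J (sensitivity + specificity - 1). That criterion, the function names, and the toy scores are illustrative assumptions; the abstract does not state which criterion ModEnzA optimises.

```python
def sensitivity_specificity(pos_scores, neg_scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = sum(s >= threshold for s in pos_scores)   # true members scoring above the cutoff
    fn = len(pos_scores) - tp                      # true members missed
    tn = sum(s < threshold for s in neg_scores)    # non-members correctly rejected
    fp = len(neg_scores) - tn                      # non-members wrongly accepted
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(pos_scores, neg_scores):
    """Pick the observed score that maximises Youden's J (an assumed criterion)."""
    candidates = sorted(set(pos_scores) | set(neg_scores))

    def youden(t):
        sens, spec = sensitivity_specificity(pos_scores, neg_scores, t)
        return sens + spec - 1.0

    return max(candidates, key=youden)


# Hypothetical scores for sequences with (positive) and without (negative) the EC function.
positives = [50, 42, 38, 12]
negatives = [30, 15, 10, 5]
cutoff = best_threshold(positives, negatives)
print(cutoff, sensitivity_specificity(positives, negatives, cutoff))
```

Raising the cutoff trades sensitivity for specificity, which is why negative training sequences are useful for placing it.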


The mutL Gene as a Genome-Wide Taxonomic Marker for High Resolution Discrimination of Lactiplantibacillus plantarum and Its Closely Related Taxa

Microorganisms ◽

10.3390/microorganisms9081570 ◽

2021 ◽

Vol 9 (8) ◽

pp. 1570

Author(s):

Chien-Hsun Huang ◽

Chih-Chieh Chen ◽

Yu-Chun Lin ◽

Chia-Hsuan Chen ◽

Ai-Yun Lee ◽

...

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Target Genes ◽

Marker Genes ◽

Rrna Gene ◽

Accurate Identification ◽

Discrimination Power ◽

Sequence Identity ◽

Genome Wide ◽

A Genome

The current taxonomy of the Lactiplantibacillus plantarum group comprises 17 closely related species that are indistinguishable from each other by commonly used 16S rRNA gene sequencing. In this study, a whole-genome-based analysis was carried out for exploring the highly distinguished target genes whose interspecific sequence identity is significantly less than those of 16S rRNA or conventional housekeeping genes. In silico analyses of 774 core genes by the cano-wgMLST_BacCompare analytics platform indicated that csbB, morA, murI, mutL, ntpJ, rutB, trmK, ydaF, and yhhX genes were the most promising candidates. Subsequently, the mutL gene was selected, and the discrimination power was further evaluated using Sanger sequencing. Among the type strains, mutL exhibited a clearly superior sequence identity (61.6–85.6%; average: 66.6%) to the 16S rRNA gene (96.7–100%; average: 98.4%) and the conventional phylogenetic marker genes (e.g., dnaJ, dnaK, pheS, recA, and rpoA), respectively, which could be used to separate the tested strains into various species clusters. Consequently, species-specific primers were developed for fast and accurate identification of L. pentosus, L. argentoratensis, L. plantarum, and L. paraplantarum. During this study, one strain (BCRC 06B0048, L. pentosus) exhibited not only relatively low mutL sequence identities (97.0%) but also a low digital DNA–DNA hybridization value (78.1%) with the type strain DSM 20314T, signifying that it exhibits potential for reclassification as a novel subspecies. Our data demonstrate that mutL can be a genome-wide target for identifying and classifying the L. plantarum group species and for differentiating novel taxa from known species.
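
The sequence identity values quoted above are pairwise percentages. As a rough sketch of how such a figure can be computed, the Python function below takes two sequences that have already been aligned to the same length. Skipping columns that contain a gap is an assumption made here; the abstract does not say which identity definition the study used, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue                      # assumed convention: gapped columns are ignored
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments (not real mutL or 16S rRNA sequence).
print(percent_identity("ATGCATGC", "ATGAATGC"))
```

A marker whose interspecific identity is low, as reported for mutL, separates species more clearly than one whose identity is close to 100%.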


The Location of Substitutions and Bacterial Genome Arrangements

Genome Biology and Evolution ◽

10.1093/gbe/evaa260 ◽

2020 ◽

Author(s):

Daniella F Lato ◽

G Brian Golding

Keyword(s):

Sinorhizobium Meliloti ◽

Bacterial Genome ◽

Evolutionary Analysis ◽

Origin Of Replication ◽

Ancestral Reconstruction ◽

Molecular Change ◽

E Coli ◽

A Genome ◽

The Impact ◽

Molecular Evolutionary Analysis

Increasing evidence supports the notion that different regions of a genome have unique rates of molecular change. This variation is particularly evident in bacterial genomes where previous studies have reported that gene expression and essentiality tend to decrease, while substitution rates usually increase, with increasing distance from the origin of replication. Genomic reorganizations such as rearrangements occur frequently in bacteria and allow for the introduction and restructuring of genetic content, creating gradients of molecular traits along genomes. Here, we explore the interplay of these phenomena by mapping substitutions to the genomes of Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti, quantifying how many substitutions have occurred at each position in the genome. Preceding work indicates that substitution rate significantly increases with distance from the origin. Using a larger sample size and accounting for genome rearrangements through ancestral reconstruction, our analysis demonstrates that the correlation between the number of substitutions and distance from the origin of replication is often significant but small and inconsistent in direction. Some replicons had a significantly decreasing trend (E. coli and the chromosome of S. meliloti), while others showed the opposite significant trend (B. subtilis, Streptomyces, pSymA and pSymB in S. meliloti). dN, dS and ω were examined across all genes and there was no significant correlation between those values and distance from the origin. This study highlights the impact that genomic rearrangements and location have on molecular trends in some bacteria, illustrating the importance of considering spatial trends in molecular evolutionary analysis. Assuming that molecular trends are exclusively in one direction can be problematic.
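
The relationship examined above, substitution counts against distance from the origin of replication, can be sketched in Python as follows. Two assumptions are made here that the abstract does not confirm: distance is taken as the shorter arc around a circular replicon, and the association is summarised with a plain Pearson coefficient. The function names and the toy data are illustrative only.

```python
import math


def distance_from_origin(position, origin, genome_length):
    """Shortest distance around a circular replicon (an assumed definition)."""
    linear = abs(position - origin)
    return min(linear, genome_length - linear)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: window midpoints on a 1000 bp circular replicon with its origin at 100.
positions = [150, 300, 600, 900]
substitutions = [4, 7, 12, 5]
distances = [distance_from_origin(p, 100, 1000) for p in positions]
print(distances, pearson(distances, substitutions))
```

The sign of the coefficient gives the direction of the trend, which is the quantity the abstract reports as differing between replicons.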


Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph

Algorithms for Molecular Biology ◽

10.1186/s13015-021-00182-9 ◽

2021 ◽

Vol 16 (1) ◽

Author(s):

Kingshuk Mukherjee ◽

Massimiliano Rossi ◽

Leena Salmela ◽

Christina Boucher

Keyword(s):

Single Molecule ◽

De Bruijn Graph ◽

Anabas Testudineus ◽

E Coli ◽

Genome Wide ◽

A Genome ◽

De Bruijn ◽

Optical Maps ◽

Definition Of ◽

Numeric Representation

Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics’ Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770–15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under the GNU General Public License at https://github.com/kingufl/Rmapper.
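
For readers unfamiliar with the data structure, here is a minimal Python sketch of an ordinary de Bruijn graph built over a sequence of fragment sizes. It is not the bi-labelled graph used by rmapper: it assumes fragment sizes have already been quantised so that equal fragments compare equal, which real, noisy Rmap data would not satisfy. The function name and toy input are illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Build a plain de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = tuple(sequence[i:i + k])
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)      # edge from the k-mer's prefix to its suffix
    return dict(graph)


# Toy "Rmap": fragment sizes in kbp, assumed already quantised.
rmap = [3, 5, 2, 3, 5]
for node, successors in de_bruijn_graph(rmap, 3).items():
    print(node, "->", successors)
```

Assembly then amounts to finding paths through such a graph, which avoids the all-against-all overlap step of the OLC paradigm.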


Identification of a novel NADH-specific aldo-keto reductase using sequence and structural homologies

Biochemical Journal ◽

10.1042/bj20060660 ◽

2006 ◽

Vol 400 (1) ◽

pp. 105-114 ◽

Cited By ~ 27

Author(s):

Eric Di Luccio ◽

Robert A. Elling ◽

David K. Wilson

Keyword(s):

Escherichia Coli ◽

Sinorhizobium Meliloti ◽

Thermotoga Maritima ◽

Xylose Reductase ◽

Sulfolobus Solfataricus ◽

Sequence Comparisons ◽

E Coli ◽

Fluorescence Measurements ◽

Dual Specificity ◽

Substrate Dependence

The AKRs (aldo-keto reductases) are a superfamily of enzymes which mainly rely on NADPH to reversibly reduce various carbonyl-containing compounds to the corresponding alcohols. A small number have been found with dual NADPH/NADH specificity, usually preferring NADPH, but none are exclusive for NADH. Crystal structures of the dual-specificity enzyme xylose reductase (AKR2B5) indicate that NAD+ is bound via a key interaction with a glutamate that is able to change conformations to accommodate the 2′-phosphate of NADP+. Sequence comparisons suggest that analogous glutamate or aspartate residues may function in other AKRs to allow NADH utilization. Based on this, nine putative enzymes with potential NADH specificity were identified and seven genes were successfully expressed and purified from Drosophila melanogaster, Escherichia coli, Schizosaccharomyces pombe, Sulfolobus solfataricus, Sinorhizobium meliloti and Thermotoga maritima. Each was assayed for co-substrate dependence with conventional AKR substrates. Three were exclusive for NADPH (AKR2E3, AKR3F2 and AKR3F3), two were dual-specific (AKR3C2 and AKR3F1) and one was specific for NADH (AKR11B2), the first such activity in an AKR. Fluorescence measurements of the seventh protein indicated that it bound both NADPH and NADH but had no activity. Mutation of the aspartate into an alanine residue or a more mobile glutamate in the NADH-specific E. coli protein converted it into an enzyme with dual specificity. These results show that the presence of this carboxylate is an indication of NADH dependence. This should allow improved prediction of co-substrate specificity and provide a basis for engineering enzymes with altered co-substrate utilization for this class of enzymes.


FUSARIUM-ID v.3.0: An updated, downloadable resource for Fusarium species identification

Plant Disease ◽

10.1094/pdis-09-21-2105-sr ◽

2021 ◽

Author(s):

Terry Torres-Cruz ◽

Briana Whitaker ◽

Robert Proctor ◽

Kirk Broders ◽

Imane Laraba ◽

...

Keyword(s):

Sequence Data ◽

Safety Concern ◽

Fusarium Species ◽

Elongation Factor ◽

Sequence Database ◽

Accurate Identification ◽

Dna Sequence Data ◽

Species Complexes ◽

Marker Loci ◽

Feed Safety

Species within Fusarium are of global agricultural, medical, and food/feed safety concern and have been extensively characterized. However, accurate identification of species is challenging and usually requires DNA sequence data. FUSARIUM-ID (http://isolate.fusariumdb.org/) is a publicly available database designed to support the identification of Fusarium species using sequences of multiple phylogenetically informative loci, especially the highly informative ~680 bp 5' portion of the translation elongation factor 1-alpha (TEF1) gene that has been adopted as the primary barcoding locus in the genus. However, FUSARIUM-ID v.1.0 and 2.0 had several limitations, including inconsistent metadata annotation for the archived sequences and poor representation of some species complexes and marker loci. Here, we present FUSARIUM-ID v.3.0, which provides the following improvements: (i) additional and updated annotation of metadata for isolates associated with each sequence, (ii) expanded taxon representation in the TEF1 sequence database, (iii) availability of the sequence database as a downloadable file to enable local BLAST queries, and (iv) a tutorial file for users to perform local BLAST searches using either freely-available software, such as SequenceServer, BLAST+ executable in the command line, and Galaxy, or the proprietary Geneious software. FUSARIUM-ID will be updated on a regular basis by archiving sequences of TEF1 and other loci from newly identified species and greater in-depth sampling of currently recognized species.


Nucleotide sequences of highly repeated DNAs; compilation and comments

Genetics Research ◽

10.1017/s0016672300020711 ◽

1982 ◽

Vol 39 (1) ◽

pp. 1-30 ◽

Cited By ~ 33

Author(s):

George L. Gabor Miklos ◽

Amanda Clare Gill

Keyword(s):

Satellite Dna ◽

Germ Line ◽

Sequence Data ◽

Allelic Variation ◽

Neutral Theory ◽

Natural Populations ◽

Evolutionary Significance ◽

Nucleotide Sequence Data ◽

A Genome ◽

Repeated Dnas

The nucleotide sequence data from highly repeated DNAs of invertebrates and mammals are summarized and briefly discussed. Very similar conclusions can be drawn from the two data bases. Sequence complexities can vary from 2 bp to at least 359 bp in invertebrates and from 3 bp to at least 2350 bp in mammals. The larger sequences may or may not exhibit a substructure. Significant sequence variation occurs for any given repeated array within a species, but the sources of this heterogeneity have not been systematically partitioned. The types of alterations in a basic repeating unit can involve base changes as well as deletions or additions which can vary from 1 bp to at least 98 bp in length. These changes indicate that sequence per se is unlikely to be under significant biological constraints and may sensibly be examined by analogy to Kimura's neutral theory for allelic variation. It is not possible with the present evidence to discriminate between the roles of neutral and selective mechanisms in the evolution of highly repeated DNA. Tandemly repeated arrays are constantly subjected to cycles of amplification and deletion by mechanisms for which the available data stem largely from ribosomal genes. It is a matter of conjecture whether the solutions to the mechanistic puzzles involved in amplification or rapid redeployment of satellite sequences throughout a genome will necessarily give any insight into biological functions. The lack of significant somatic effects when the satellite DNA content of a genome is significantly perturbed indicates that the hunt for specific functions at the cellular level is unlikely to prove profitable. The presence or in some cases the amount of satellite DNA on a chromosome, however, can have significant effects in the germ line. There the data show that localized condensed chromatin, rich in satellite DNA, can have the effect of rendering adjacent euchromatic regions rec−, or of altering levels of recombination on different chromosomes. No data stemming from natural populations, however, are yet available to tell us if these effects are of adaptive or evolutionary significance.


Elucidating acetate tolerance in E. coli using a genome-wide approach

Metabolic Engineering ◽

10.1016/j.ymben.2010.12.001 ◽

2011 ◽

Vol 13 (2) ◽

pp. 214-224 ◽

Cited By ~ 49

Author(s):

Nicholas R. Sandoval ◽

Tirzah Y. Mills ◽

Min Zhang ◽

Ryan T. Gill

Keyword(s):

E Coli ◽

Genome Wide ◽

A Genome


PCR amplification and sequence analyses of reverse transcriptase-like genes in Crinipellis perniciosa isolates

Fitopatologia Brasileira ◽

10.1590/s0100-41582007000500001 ◽

2007 ◽

Vol 32 (5) ◽

pp. 373-380 ◽

Cited By ~ 3

Author(s):

Jorge F. Pereira ◽

Mariana D.C. Ignacchiti ◽

Elza F. Araújo ◽

Sérgio H. Brommonschenkel ◽

Júlio C.M. Cascardo ◽

...

Keyword(s):

Reverse Transcriptase ◽

Pathogenic Fungus ◽

Pcr Amplification ◽

Restriction Enzymes ◽

Sequence Comparisons ◽

Copy Numbers ◽

A Genome ◽

Close Relationship ◽

Pcr Products ◽

Broom Disease

Reverse transcriptase (RT) sequence analysis is an important technique used to detect the presence of transposable elements in a genome. Putative RT sequences were analyzed in the genome of the pathogenic fungus C. perniciosa, the causal agent of witches' broom disease of cocoa. A 394 bp fragment was amplified from genomic DNA of different isolates of C. perniciosa belonging to C-, L-, and S-biotypes and collected from various geographical areas. The cleavage of PCR products with restriction enzymes and the sequencing of various RT fragments indicated the presence of several sequences showing transition events (G:C to A:T). Southern blot analysis revealed high copy numbers of RT signals, forming different patterns among C-, S-, and L-biotype isolates. Sequence comparisons of the predicted RT peptide indicate a close relationship with the RT protein from the gypsy family of LTR-retrotransposons. The possible role of these retrotransposons in generating genetic variability in the homothallic C. perniciosa is discussed.


Transcriptome analysis of Schistosoma mansoni larval development using serial analysis of gene expression (SAGE)

Parasitology ◽

10.1017/s0031182009005733 ◽

2009 ◽

Vol 136 (5) ◽

pp. 469-485 ◽

Cited By ~ 22

Author(s):

A. S. TAFT ◽

J. J. VERMEIRE ◽

J. BERNIER ◽

S. R. BIRKELAND ◽

M. J. CIPRIANO ◽

...

Keyword(s):

Gene Expression ◽

Sequence Data ◽

Subsequent Development ◽

Differentially Expressed ◽

Cdna Libraries ◽

Genome Wide ◽

A Genome ◽

Genome Wide Expression ◽

Cell Conditioned Medium

Infection of the snail, Biomphalaria glabrata, by the free-swimming miracidial stage of the human blood fluke, Schistosoma mansoni, and its subsequent development to the parasitic sporocyst stage is critical to establishment of viable infections and continued human transmission. We performed a genome-wide expression analysis of the S. mansoni miracidia and developing sporocyst using Long Serial Analysis of Gene Expression (LongSAGE). Five cDNA libraries were constructed from miracidia and in vitro cultured 6- and 20-day-old sporocysts maintained in sporocyst medium (SM) or in SM conditioned by previous cultivation with cells of the B. glabrata embryonic (Bge) cell line. We generated 21 440 SAGE tags and mapped 13 381 to the S. mansoni gene predictions (v4.0e) either by estimating theoretical 3′ UTR lengths or using existing 3′ EST sequence data. Overall, 432 transcripts were found to be differentially expressed amongst all 5 libraries. In total, 172 tags were differentially expressed between miracidia and 6-day conditioned sporocysts and 152 were differentially expressed between miracidia and 6-day unconditioned sporocysts. In addition, 53 and 45 tags, respectively, were differentially expressed in 6-day and 20-day cultured sporocysts, due to the effects of exposure to Bge cell-conditioned medium.


Sensitive detection of DNA contamination in tumor samples via microhaplotypes

10.1101/2020.12.18.423488 ◽

2020 ◽

Author(s):

Brett Whitty ◽

John F. Thompson

Keyword(s):

Sequence Data ◽

Accurate Determination ◽

Sensitive Detection ◽

European Ancestry ◽

Somatic Variation ◽

Dna Mixtures ◽

Accurate Identification ◽

Sample Contamination ◽

Dna Contamination ◽

Low Levels

Background: Low levels of sample contamination can have disastrous effects on the accurate identification of somatic variation in tumor samples. Detection of sample contamination in DNA is generally based on observation of low frequency variants that suggest more than a single source of DNA is present. This strategy works with standard DNA samples but is especially problematic in solid tumor FFPE samples because there can be huge variations in allele frequency (AF) due to massive copy number changes arising from large gains and losses across the genome. The tremendously variable allele frequencies make detection of contamination challenging. A method not based on individual AF is needed for accurate determination of whether a sample is contaminated and to what degree. Methods: We used microhaplotypes to determine whether sample contamination is present. Microhaplotypes are sets of variants on the same sequencing read that can be unambiguously phased. Instead of measuring AF, the number and frequency of microhaplotypes is determined. Contamination detection becomes based on fundamental genomic properties, linkage disequilibrium (LD) and the diploid nature of human DNA, rather than variant frequencies. We optimized microhaplotype content based on 164 single nucleotide variant sets located in genes already sequenced within a cancer panel. Thus, contamination detection uses existing sequence data and does not require sequencing of any extraneous regions. The content is chosen based on LD data from the 1000 Genomes Project to be ancestry agnostic, providing the same sensitivity for contamination detection with samples from individuals of African, East Asian, and European ancestry. Results: Detection of contamination at 1% and below is possible using this design. The methods described here can also be extended to other DNA mixtures such as forensic and non-invasive prenatal testing samples where DNA mixes of 1% or less can be similarly detected. Conclusions: The microhaplotype method allows sensitive detection of DNA contamination in FFPE tumor samples. These methods provide a foundation for examining DNA mixtures in a variety of contexts. With the appropriate panels and high sequencing depth, low levels of secondary DNA can be detected and this can be valuable in a variety of applications.
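
The core idea, that a single diploid individual can contribute at most two microhaplotypes at a locus, can be sketched in Python as below. This is only an illustration under stated assumptions: each read has already been reduced to a phased haplotype string for one locus, and a fixed minimum read fraction (here 1%) stands in for whatever error model the authors use. The function names and the threshold are hypothetical, not the published method.

```python
from collections import Counter


def haplotypes_above_noise(read_haplotypes, min_fraction=0.01):
    """Return {haplotype: read fraction} for haplotypes at or above min_fraction."""
    counts = Counter(read_haplotypes)
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items() if c / total >= min_fraction}


def looks_contaminated(read_haplotypes, min_fraction=0.01, ploidy=2):
    """Flag a locus when more haplotypes are observed than one diploid genome allows."""
    return len(haplotypes_above_noise(read_haplotypes, min_fraction)) > ploidy


# Toy locus: two expected haplotypes plus a third seen in 2% of reads.
reads = ["AC"] * 60 + ["GT"] * 38 + ["AT"] * 2
print(haplotypes_above_noise(reads), looks_contaminated(reads))
```

Because the test counts distinct haplotypes rather than comparing allele frequencies, skewed frequencies from copy number changes do not by themselves trigger a call in this sketch.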
