DIMEdb: an integrated database and web service for metabolite identification in direct infusion mass spectrometry

2018 ◽  
Author(s):  
Keiron O’Shea ◽  
Divya Kattupalli ◽  
Luis AJ Mur ◽  
Nigel W Hardy ◽  
Biswapriya B Misra ◽  
...  

Abstract Motivation Metabolomics involves the characterisation, identification, and quantification of small molecules (metabolites) that act as the reaction intermediates of biological processes. Over the past few years, we have seen wide-scale improvements in data processing, database, and statistical analysis tools. Direct infusion mass spectrometry (DIMS) is a widely used platform that is able to produce a global fingerprint of the metabolome without requiring a prior chromatographic step, making it ideal for wide-scale high-throughput metabolomics analysis. In spite of these developments, metabolite identification remains a key bottleneck in untargeted mass spectrometry-based metabolomics studies. The first step of the metabolite identification task is to query masses against a metabolite database to get putative metabolite annotations. Existing metabolite databases differ in a number of aspects, including coverage, format, and accessibility, and often limit the user to a rudimentary web interface. Manually combining multiple search results for a single experiment, where there may be hundreds of masses to investigate, becomes an incredibly arduous task. Results To facilitate unified access to metabolite information we have created the Direct Infusion MEtabolite database (DIMEdb), a comprehensive web-based metabolite database that contains over 80,000 metabolites sourced from a number of renowned metabolite databases, which can be utilised in the analysis and annotation of DIMS data. To demonstrate the efficacy of DIMEdb, a simple use case for metabolite identification is presented. DIMEdb aims to provide a single point of access to metabolite information, and hopefully facilitate the development of much needed bioinformatic tools. Availability DIMEdb is freely available at https://[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
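The first identification step described above, querying observed masses against a reference table to obtain putative annotations, can be sketched as follows. This is a minimal illustration and not DIMEdb's own API: the tiny `REFERENCE` table, the function names, and the 5 ppm default tolerance are all assumptions, and the sketch compares neutral monoisotopic masses directly, ignoring the adduct handling that real DIMS data would need.

```python
from bisect import bisect_left, bisect_right

# Hypothetical miniature reference table: (name, monoisotopic neutral mass in Da).
# A real metabolite database would hold tens of thousands of entries.
REFERENCE = sorted(
    [
        ("Glycine", 75.03203),
        ("Alanine", 89.04768),
        ("Glucose", 180.06339),
        ("Fructose", 180.06339),
        ("Citric acid", 192.02700),
    ],
    key=lambda row: row[1],
)
MASSES = [mass for _, mass in REFERENCE]


def query_mass(observed, ppm=5.0):
    """Return names of reference entries within +/- ppm of the observed mass."""
    tolerance = observed * ppm / 1e6
    lo = bisect_left(MASSES, observed - tolerance)
    hi = bisect_right(MASSES, observed + tolerance)
    return [name for name, _ in REFERENCE[lo:hi]]


def annotate(peaks, ppm=5.0):
    """Map every observed mass to its list of putative annotations."""
    return {mass: query_mass(mass, ppm) for mass in peaks}


if __name__ == "__main__":
    for mass, names in annotate([75.0320, 180.0634, 100.0]).items():
        print(mass, names or "no match")
```

Isomers such as glucose and fructose share a mass, so a single peak can return several candidates; this is why mass matching alone yields putative rather than confirmed identifications.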

Author(s):  
Darawan Rinchai ◽  
Jessica Roelands ◽  
Mohammed Toufiq ◽  
Wouter Hendrickx ◽  
Matthew C Altman ◽  
...  

Abstract Motivation We previously described the construction and characterization of generic and reusable blood transcriptional module repertoires. More recently we released a third iteration (“BloodGen3” module repertoire) that comprises 382 functionally annotated gene sets (modules) and encompasses 14,168 transcripts. Custom bioinformatic tools are needed to support downstream analysis, visualization and interpretation relying on such fixed module repertoires. Results We have developed and describe here an R package, BloodGen3Module. The functions of our package permit group comparison analyses to be performed at the module-level, and to display the results as annotated fingerprint grid plots. A parallel workflow for computing module repertoire changes for individual samples rather than groups of samples is also available; these results are displayed as fingerprint heatmaps. An illustrative case is used to demonstrate the steps involved in generating blood transcriptome repertoire fingerprints of septic patients. Taken together, this resource could facilitate the analysis and interpretation of changes in blood transcript abundance observed across a wide range of pathological and physiological states. Availability The BloodGen3Module package and documentation are freely available from GitHub: https://github.com/Drinchai/BloodGen3Module Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Tomasz Zok

Abstract Motivation Biomolecular structures come in multiple representations and diverse data formats. Their incompatibility with the requirements of data analysis programs significantly hinders the analytics and the creation of new structure-oriented bioinformatic tools. Therefore, the need for robust libraries of data processing functions is still growing. Results BioCommons is an open-source, Java library for structural bioinformatics. It contains many functions working with the 2D and 3D structures of biomolecules, with a particular emphasis on RNA. Availability and implementation The library is available in Maven Central Repository and its source code is hosted on GitHub: https://github.com/tzok/BioCommons Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 35 (17) ◽  
pp. 3196-3198 ◽  
Author(s):  
Tobias Depke ◽  
Raimo Franke ◽  
Mark Brönstrup

Abstract Summary Compound identification is one of the most eminent challenges in the untargeted analysis of complex mixtures of small molecules by mass spectrometry. Similarity of tandem mass spectra can provide valuable information on putative structural similarities between known and unknown analytes and hence aids feature identification in the bioanalytical sciences. We have developed CluMSID (Clustering of MS2 spectra for metabolite identification), an R package that enables researchers to make use of tandem mass spectra and neutral loss pattern similarities as a part of their metabolite annotation workflow. CluMSID offers functions for all analysis steps from import of raw data to data mining by unsupervised multivariate methods along with respective (interactive) visualizations. A detailed tutorial with example data is provided as supplementary information. Availability and implementation CluMSID is available as an R package from https://github.com/tdepke/CluMSID/ and from https://bioconductor.org/packages/CluMSID/. Supplementary information Supplementary data are available at Bioinformatics online.
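To make the idea of spectral similarity concrete, the sketch below scores two tandem mass spectra by cosine similarity after binning their peaks. Cosine similarity is one common choice and is used here as an assumption; the abstract does not state which measure CluMSID applies, and this Python code is not part of that R package. The function names and the 1 Da bin width are illustrative.

```python
from math import sqrt


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks falling in the same m/z bin."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz // bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spectrum_a, spectrum_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = no shared bins)."""
    a = bin_spectrum(spectrum_a, bin_width)
    b = bin_spectrum(spectrum_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    known = [(85.03, 40.0), (127.04, 100.0), (145.05, 60.0)]
    unknown = [(85.03, 35.0), (127.04, 90.0), (163.06, 20.0)]
    print(round(cosine_similarity(known, unknown), 3))
```

A matrix of such pairwise scores is the kind of input that unsupervised methods (for example hierarchical clustering) can then group into families of structurally related analytes.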


2019 ◽  
Vol 35 (22) ◽  
pp. 4748-4753 ◽  
Author(s):  
Ahmad Borzou ◽  
Razie Yousefi ◽  
Rovshan G Sadygov

Abstract Motivation High throughput technologies are widely employed in modern biomedical research. They yield measurements of a large number of biomolecules in a single experiment. The number of experiments usually is much smaller than the number of measurements in each experiment. The simultaneous measurements of biomolecules provide a basis for a comprehensive, systems view for describing relevant biological processes. Often it is necessary to determine correlations between the data matrices under different conditions or pathways. However, the techniques for analyzing the data with a low number of samples for possible correlations within or between conditions are still in development. Earlier developed correlative measures, such as the RV coefficient, use the trace of the product of data matrices as the most relevant characteristic. However, a recent study has shown that the RV coefficient consistently overestimates the correlations in the case of low sample numbers. To correct for this bias, it was suggested to discard the diagonal elements of the outer products of each data matrix. In this work, a principled approach based on the matrix decomposition generates three trace-independent parts for every matrix. These components are unique, and they are used to determine different aspects of correlations between the original datasets. Results Simulations show that the decomposition results in the removal of high correlation bias and the dependence on the sample number intrinsic to the RV coefficient. We then use the correlations to analyze a real proteomics dataset. Availability and implementation The python code can be downloaded from http://dynamic-proteome.utmb.edu/MatrixCorrelations.aspx. Supplementary information Supplementary data are available at Bioinformatics online.
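For orientation, the earlier measures mentioned in the abstract can be written down directly. The sketch below computes the RV coefficient, tr(S_X S_Y) / sqrt(tr(S_X^2) tr(S_Y^2)) with S_X = X X^T for column-centred X, and the bias-corrected variant that discards the diagonal of each outer product. It does not implement the paper's three-part matrix decomposition, and the function names are illustrative assumptions.

```python
from math import sqrt


def center_columns(matrix):
    """Subtract each column's mean (rows are samples, columns are measurements)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def gram(matrix):
    """Sample-by-sample outer product X X^T."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]


def frobenius_inner(a, b):
    """Sum of elementwise products; equals tr(A B) for symmetric A and B."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rv_coefficient(x, y, drop_diagonal=False):
    """RV coefficient; drop_diagonal=True zeroes the diagonals of X X^T and Y Y^T first."""
    sx, sy = gram(center_columns(x)), gram(center_columns(y))
    if drop_diagonal:
        for s in (sx, sy):
            for i in range(len(s)):
                s[i][i] = 0.0
    denom = sqrt(frobenius_inner(sx, sx) * frobenius_inner(sy, sy))
    return frobenius_inner(sx, sy) / denom if denom else 0.0


if __name__ == "__main__":
    x = [[1.0, 2.0, 0.5], [3.0, 1.0, 2.5], [0.0, 4.0, 1.0], [2.0, 2.0, 3.0]]
    y = [[0.2, 1.1], [1.5, 0.3], [0.7, 2.2], [1.9, 0.8]]
    print(rv_coefficient(x, y), rv_coefficient(x, y, drop_diagonal=True))
```

Both matrices must share the same samples (rows) but may have different numbers of columns. The standard coefficient lies in [0, 1], whereas the diagonal-free variant can be negative.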


2019 ◽  
Author(s):  
Laura Avino Esteban ◽  
Lyubov R Lonishin ◽  
Daniil Bobrovskiy ◽  
Gregory Leleytner ◽  
Natalya S Bogatyreva ◽  
...  

Abstract Motivation Epistasis, the context-dependence of the contribution of an amino acid substitution to fitness, is common in evolution. To detect epistasis, fitness must be measured for at least four genotypes: the reference genotype, two different single mutants and a double mutant with both of the single mutations. For higher-order epistasis of the order n, fitness has to be measured for all 2^n genotypes of an n-dimensional hypercube in genotype space forming a “combinatorially complete dataset”. So far, only a handful of such datasets have been produced by manual curation. Concurrently, random mutagenesis experiments have produced measurements of fitness and other phenotypes in a high-throughput manner, potentially containing a number of combinatorially complete datasets. Results We present an effective recursive algorithm for finding all hypercube structures in random mutagenesis experimental data. To test the algorithm, we applied it to the data from a recent HIS3 protein dataset and found all 199,847,053 unique combinatorially complete genotype combinations of dimensionality ranging from two to twelve. The algorithm may be useful for researchers looking for higher-order epistasis in their high-throughput experimental data. Availability https://github.com/ivankovlab/HypercubeME.git Supplementary information Supplementary data are available at Bioinformatics online.
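The notion of a combinatorially complete dataset can be illustrated with a brute-force search. The sketch below is not the recursive HypercubeME algorithm and would not scale to large datasets. It also simplifies genotypes to sets of mutation labels relative to a reference, which is an assumption made for illustration; the mutation names and fitness values are made up.

```python
from itertools import combinations


def find_hypercubes(genotypes, dimension):
    """Return (base, mutations) pairs whose 2**dimension corners are all measured."""
    measured = set(genotypes)
    all_mutations = sorted(set().union(*measured)) if measured else []
    cubes = []
    for base in measured:
        candidates = [m for m in all_mutations if m not in base]
        for muts in combinations(candidates, dimension):
            corners = (
                base | frozenset(subset)
                for r in range(dimension + 1)
                for subset in combinations(muts, r)
            )
            if all(corner in measured for corner in corners):
                cubes.append((base, muts))
    return cubes


def pairwise_epistasis(fitness, base, a, b):
    """Additive-scale epistasis from the four corners of a 2D hypercube."""
    return (
        fitness[base | {a, b}] - fitness[base | {a}] - fitness[base | {b}] + fitness[base]
    )


if __name__ == "__main__":
    # Hypothetical fitness measurements keyed by the set of mutations carried.
    fitness = {
        frozenset(): 1.0,
        frozenset({"A12G"}): 0.8,
        frozenset({"L45P"}): 0.7,
        frozenset({"A12G", "L45P"}): 0.9,
        frozenset({"T90S"}): 0.95,
    }
    for base, (a, b) in find_hypercubes(fitness, 2):
        print(sorted(base), a, b, pairwise_epistasis(fitness, base, a, b))
```

In this toy data only one square is complete, because the double mutants involving T90S were never measured. The same completeness check generalises to any dimension n by requiring all 2^n corners.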


2019 ◽  
Vol 116 (48) ◽  
pp. 24206-24213 ◽  
Author(s):  
Paul R. Jaschke ◽  
Gabrielle A. Dotson ◽  
Kay S. Hung ◽  
Diane Liu ◽  
Drew Endy

We develop a method for completing the genetics of natural living systems by which the absence of expected future discoveries can be established. We demonstrate the method using bacteriophage øX174, the first DNA genome to be sequenced. Like many well-studied natural organisms, closely related genome sequences are available—23 Bullavirinae genomes related to øX174. Using bioinformatic tools, we first identified 315 potential open reading frames (ORFs) within the genome, including the 11 established essential genes and 82 highly conserved ORFs that have no known gene products or assigned functions. Using genome-scale design and synthesis, we made a mutant genome in which all 11 essential genes are simultaneously disrupted, leaving intact only the 82 conserved but cryptic ORFs. The resulting genome is not viable. Cell-free gene expression followed by mass spectrometry revealed only a single peptide expressed from both the cryptic ORF and wild-type genomes, suggesting a potential new gene. A second synthetic genome in which 71 conserved cryptic ORFs were simultaneously disrupted is viable but with ∼50% reduced fitness relative to the wild type. However, rather than finding any new genes, repeated evolutionary adaptation revealed a single point mutation that modulates expression of gene H, a known essential gene, and fully suppresses the fitness defect. Taken together, we conclude that the annotation of currently functional ORFs for the øX174 genome is formally complete. More broadly, we show that sequencing and bioinformatics followed by synthesis-enabled reverse genomics, proteomics, and evolutionary adaptation can definitely establish the sufficiency and completeness of natural genome annotations.


Author(s):  
Lijun Cai ◽  
Xuanbai Ren ◽  
Xiangzheng Fu ◽  
Li Peng ◽  
Mingyu Gao ◽  
...  

Abstract Motivation Enhancers are non-coding DNA fragments with high position variability and free scattering. They play an important role in controlling gene expression. As machine learning has become more widely used in identifying enhancers, a number of bioinformatic tools have been developed. Although several models for identifying enhancers and their strengths have been proposed, their accuracy and efficiency have yet to be improved. Results We propose a two-layer predictor called ‘iEnhancer-XG.’ It comprises a one-layer predictor (for identifying enhancers) and a second classifier (for their strength) and uses ‘XGBoost’ as a base classifier and five feature extraction methods, namely, k-Spectrum Profile, Mismatch k-tuple, Subsequence Profile, Position-specific scoring matrix (PSSM) and Pseudo dinucleotide composition (PseDNC). Each method has an independent output. We place the feature vector matrix into the ensemble learning for fusion. This experiment involves the method of ‘SHapley Additive explanations’ to provide interpretability for the previous black box machine learning methods and improve their credibility. The accuracies of the ensemble learning method are 0.811 (first layer) and 0.657 (second layer). The rigorous 10-fold cross-validation confirms that the proposed method is significantly better than existing technologies. Availability and implementation The source code and dataset for the enhancer predictions have been uploaded to https://github.com/jimmyrate/ienhancer-xg. Supplementary information Supplementary data are available at Bioinformatics online.
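As an illustration of the simplest of the five feature extraction methods, a k-spectrum profile can be computed as the normalised frequency of every possible k-mer in a DNA sequence. This is a minimal sketch under that common definition; the normalisation and the handling of ambiguous bases in iEnhancer-XG may differ, and the function name is an assumption.

```python
from itertools import product


def k_spectrum_profile(sequence, k=2, alphabet="ACGT"):
    """Return the frequency of each k-mer, ordered lexicographically over the alphabet."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing N or other symbols are skipped
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


if __name__ == "__main__":
    features = k_spectrum_profile("ACGTACGTTTGA", k=2)
    print(len(features), features)
```

The resulting fixed-length vector (4^k entries) is the kind of input a gradient-boosted classifier can consume, alongside vectors from the other feature extraction methods.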


2019 ◽  
Author(s):  
Jun Wang ◽  
Xi Xiang ◽  
Lixin Cheng ◽  
Xiuqing Zhang ◽  
Yonglun Luo

Abstract Motivation The CRISPR/Cas9 system has been broadly used in genetic engineering. However, risks of potential off-targets and the variability of on-target activity among different targets are two limiting factors. Several bioinformatic tools have been developed for CRISPR on-target activity and off-target prediction. However, the general application of the current prediction models is hampered by the great variation among different algorithms. Results In this study, we thoroughly re-analyzed 13 published datasets with eight regression models. We showed that the current models gave very low cross-dataset and cross-species prediction outcomes. To overcome these limitations, we have developed an improved model (a generalization score, GNL) based on normalized gene editing activity from 8,101 gRNAs and 2,488 features using a Bayesian Ridge Regression model. Our results demonstrated that the GNL model is a better general algorithm for CRISPR on-target activity prediction. Availability and implementation The prediction scorer is available on GitHub (https://github.com/TerminatorJ/GNL_Scorer). Contact J.W. ([email protected]) or Y.L. ([email protected]) Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Michael G Leeming ◽  
Sean O’Callaghan ◽  
Luana Licata ◽  
Marta Iannuccelli ◽  
Prisca Lo Surdo ◽  
...  

Abstract Motivation Mass spectrometry-based phosphoproteomics can routinely identify and quantify thousands of phosphorylated peptides from a single experiment. However, interrogating possible upstream kinases and identifying key literature for phosphorylation sites is laborious and time-consuming. Results Here, we present Phosphomatics, a publicly available web resource for interrogating phosphoproteomics data. Phosphomatics allows researchers to upload phosphoproteomics data and interrogate possible relationships from a substrate-, kinase- or pathway-centric viewpoint. Availability and implementation Phosphomatics is freely available via the internet at: https://phosphomatics.com. Supplementary information Supplementary data are available at Bioinformatics online.

