A comprehensive and scalable database search system for metaproteomics

2016 ◽  
Author(s):  
Sandip Chatterjee ◽  
Gregory S. Stupp ◽  
Sung Kyu (Robin) Park ◽  
Jean-Christophe Ducom ◽  
John R. Yates ◽  
...  

Abstract Background Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Results Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed “Blazmass”) to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy, and allowing for a more in-depth characterization of the functional landscape of the samples. Conclusions The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified.
The protein database, proteomics search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
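The idea of a peptide database organized for fast querying can be illustrated with a minimal sketch. It is not ComPIL's actual schema or Blazmass's matching code. It assumes a simple tryptic digestion rule (cleave after K or R unless followed by P, no missed cleavages), standard monoisotopic residue masses, and a sorted in-memory mass index. The function names and the 10 ppm tolerance are hypothetical choices made for illustration.

```python
import bisect

# Standard monoisotopic residue masses (Da); water is added once per peptide.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def build_index(proteins):
    """Return (sorted masses, matching (peptide, protein name) entries)."""
    rows = sorted(
        (peptide_mass(pep), pep, name)
        for name, seq in proteins.items()
        for pep in tryptic_digest(seq)
    )
    return [r[0] for r in rows], [(r[1], r[2]) for r in rows]


def query(masses, entries, mass, ppm=10.0):
    """Binary-search all peptides whose mass lies within +/- ppm of `mass`."""
    tol = mass * ppm / 1e6
    lo = bisect.bisect_left(masses, mass - tol)
    hi = bisect.bisect_right(masses, mass + tol)
    return entries[lo:hi]
```

With a sorted index, each precursor-mass lookup costs a binary search rather than a scan, which is the property that matters as the database grows. A production system would keep the index on disk or in a database rather than in a Python list.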

Antibiotics ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 757 ◽  
Author(s):  
Daniela Almeida ◽  
Dany Domínguez-Pérez ◽  
Ana Matos ◽  
Guillermin Agüero-Chapin ◽  
Hugo Osório ◽  
...  

Cephalopods, successful predators, can use a mixture of substances to subdue their prey, becoming interesting sources of bioactive compounds. In addition to neurotoxins and enzymes, the presence of antimicrobial compounds has been reported. Recently, the transcriptome and the whole proteome of the Octopus vulgaris salivary apparatus were released, but the role of some compounds—e.g., histones, antimicrobial peptides (AMPs), and toxins—remains unclear. Herein, we profiled the proteome of the posterior salivary glands (PSGs) of O. vulgaris using two sample preparation protocols combined with a shotgun-proteomics approach. Protein identification was performed against a composite database comprising data from the UniProtKB, all transcriptomes available from the cephalopods’ PSGs, and a comprehensive non-redundant AMPs database. Out of the 10,075 proteins clustered in 1868 protein groups, 90 clusters corresponded to venom protein toxin families. Additionally, we detected putative AMPs clustered with histones previously found as abundant proteins in the saliva of O. vulgaris. Some of these histones, such as H2A and H2B, are involved in systemic inflammatory responses and their antimicrobial effects have been demonstrated. These results not only confirm the production of enzymes and toxins by the O. vulgaris PSGs but also suggest their involvement in the first line of defense against microbes.


2016 ◽  
Vol 2016 ◽  
pp. 1-10 ◽  
Author(s):  
Qiang Yu ◽  
Hongwei Huo ◽  
Dazheng Feng

Identifying conserved patterns in DNA sequences, known as motif discovery, is an important and challenging computational task. High-throughput sequencing data sets, which contain hundreds of sequences or more, can improve the accuracy of motif discovery but demand higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
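The notion of "pairs of l-mers with relatively small Hamming distance" can be made concrete with a brute-force sketch. This is not PairMotifChIP's rapid extraction method, which the abstract does not describe; it is the naive baseline that such a method would improve on. The function names and the `max_dist` parameter are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All length-l substrings of seq, in order of occurrence."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def close_pairs(sequences, l, max_dist):
    """Pairs of l-mers from two different sequences within max_dist mismatches.

    Each result is (index of first sequence, l-mer, index of second sequence, l-mer).
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for x in lmers(s, l):
            for y in lmers(t, l):
                if hamming(x, y) <= max_dist:
                    pairs.append((i, x, j, y))
    return pairs
```

The nested loops compare every l-mer against every other, so the cost grows quadratically with the total input length, which is why a faster extraction step matters for large data sets.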


Author(s):  
Y. S. Golenko ◽  
A. A. Ismailova

Today, shotgun proteomics is a powerful approach for characterizing proteomes in biological samples. Unlike the top-down proteomics strategy, shotgun proteomics is characterized by high separation efficiency and mass spectral sensitivity. At the same time, it places higher demands on the computational and statistical methods required for peptide identification, protein identification, and label-free quantification. The main purpose of shotgun proteomics is to determine the form and quantity of each protein by combining liquid chromatography with tandem mass spectrometry. The analysis and interpretation of experimental data form the final and most important stage in proteomics, and they raise many problems that require complex computational solutions. One of the most important tasks is the identification of the proteins present in the experimental sample. As a rule, this task is divided into two main components: assigning experimental tandem mass spectra to peptides obtained from the protein database, and mapping peptides to proteins while quantitatively assessing the reliability of the identified proteins. Assessing the reliability of the resulting data can itself be a separate task that is no less important or complex. In this article, we treat protein identification purely as a problem of statistical inference and describe a number of methods that can be used to solve it. We classify the existing approaches into (1) rule-based methods, (2) combinatorial optimization methods, and (3) probabilistic inference methods. Integer programming and Bayesian inference frameworks are used to represent these methods. We also discuss the main problems of protein identification and suggest possible solutions to them.
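As a small illustration of the combinatorial optimization view of protein inference, the sketch below selects a small set of proteins that explains all identified peptides. It uses the standard greedy set-cover heuristic, chosen here for brevity; it is not the integer programming formulation the article refers to and does not guarantee a minimum set. The function name and the input layout (protein name mapped to a set of peptides) are assumptions.

```python
def greedy_protein_set(protein_to_peptides):
    """Greedily pick proteins until every identified peptide is explained.

    protein_to_peptides maps a protein name to the set of identified
    peptides it could have produced. Ties are broken alphabetically.
    """
    uncovered = set()
    for peptides in protein_to_peptides.values():
        uncovered |= peptides

    chosen = []
    while uncovered:
        # Pick the protein that explains the most still-unexplained peptides.
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen
```

For example, with proteins `P1 = {a, b, c}`, `P2 = {b}`, and `P3 = {c, d}`, the heuristic picks `P1` and then `P3`; `P2` is dropped because its only peptide is already explained. Probabilistic methods would instead assign each protein a posterior probability rather than a hard in-or-out decision.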


2018 ◽  
Vol 17 (7) ◽  
pp. 2249-2255 ◽  
Author(s):  
Lev I. Levitsky ◽  
Mark V. Ivanov ◽  
Anna A. Lobas ◽  
Julia A. Bubis ◽  
Irina A. Tarasova ◽  
...  

2021 ◽  
Vol 99 (2) ◽  
Author(s):  
Yuhua Fu ◽  
Pengyu Fan ◽  
Lu Wang ◽  
Ziqiang Shu ◽  
Shilin Zhu ◽  
...  

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.
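A correlation-based screen of the kind described can be sketched as follows. The abstract does not give the authors' statistic or cutoffs, so this example assumes Pearson correlation across matched samples and a hypothetical absolute-correlation threshold `min_abs_r`; the function names and data layout are also assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def candidate_pairs(mirna_expr, gene_expr, min_abs_r=0.8):
    """Rank miRNA-gene pairs whose expression is strongly correlated.

    Both arguments map a name to expression values over the same ordered
    samples. Returns (miRNA, gene, r) tuples, most negative r first.
    """
    hits = []
    for mirna, m_vals in mirna_expr.items():
        for gene, g_vals in gene_expr.items():
            r = pearson(m_vals, g_vals)
            if abs(r) >= min_abs_r:
                hits.append((mirna, gene, r))
    return sorted(hits, key=lambda h: h[2])
```

Pairs passing the threshold are candidates only; correlation across samples does not by itself establish direct regulation, so such a screen would normally be combined with sequence-based target prediction.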


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Sung Yong Park ◽  
Gina Faraci ◽  
Pamela M. Ward ◽  
Jane F. Emerson ◽  
Ha Youn Lee

Abstract COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ nasopharyngeal/oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who tested positive for COVID-19 from April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.


BMC Biology ◽  
2021 ◽  
Vol 19 (1) ◽  
Author(s):  
Timothy P. Jenkins ◽  
David I. Pritchard ◽  
Radu Tanasescu ◽  
Gary Telford ◽  
Marina Papaiakovou ◽  
...  

Abstract Background Helminth-associated changes in gut microbiota composition have been hypothesised to contribute to the immune-suppressive properties of parasitic worms. Multiple sclerosis is an immune-mediated autoimmune disease of the central nervous system whose pathophysiology has been linked to imbalances in gut microbial communities. Results In the present study, we investigated, for the first time, qualitative and quantitative changes in the faecal bacterial composition of human volunteers with remitting multiple sclerosis (RMS) prior to and following experimental infection with the human hookworm, Necator americanus (N+), and following anthelmintic treatment, and compared the findings with data obtained from a cohort of RMS patients subjected to placebo treatment (PBO). Bacterial 16S rRNA high-throughput sequencing data revealed significantly decreased alpha diversity in the faecal microbiota of PBO compared to N+ subjects over the course of the trial; additionally, we observed significant differences in the abundances of several bacterial taxa with putative immune-modulatory functions between study cohorts. Parabacteroides were significantly expanded in the faecal microbiota of N+ individuals for which no clinical and/or radiological relapses were recorded at the end of the trial. Conclusions Overall, our data lend support to the hypothesis of a contributory role of parasite-associated alterations in gut microbial composition to the immune-modulatory properties of hookworm parasites.


2020 ◽  
Vol 49 (D1) ◽  
pp. D877-D883
Author(s):  
Fangzhou Xie ◽  
Shurong Liu ◽  
Junhao Wang ◽  
Jiajia Xuan ◽  
Xiaoqin Zhang ◽  
...  

Abstract Eukaryotic genomes encode thousands of small and large non-coding RNAs (ncRNAs). However, the expression, functions and evolution of these ncRNAs are still largely unknown. In this study, we have updated deepBase to version 3.0 (deepBase v3.0, http://rna.sysu.edu.cn/deepbase3/index.html), an increasingly popular and openly licensed resource that facilitates integrative and interactive display and analysis of the expression, evolution, and functions of various ncRNAs by deeply mining thousands of high-throughput sequencing data sets from tissue, tumor and exosome samples. We updated deepBase v3.0 to provide the most comprehensive expression atlas of small RNAs and lncRNAs by integrating ∼67 620 data sets from 80 normal tissues and ∼50 cancer tissues. The extracellular patterns of various ncRNAs were profiled to explore their applications for discovery of noninvasive biomarkers. Moreover, we constructed survival maps of tRNA-derived RNA Fragments (tRFs), miRNAs, snoRNAs and lncRNAs by analyzing data from >45 000 cancer samples and the corresponding clinical information. We also developed interactive web tools to analyze the differential expression and biological functions of various ncRNAs in ∼50 types of cancers. This update is expected to provide a variety of new modules and graphic visualizations to facilitate analyses and explorations of the functions and mechanisms of various types of ncRNAs.

