A comprehensive and scalable database search system for metaproteomics

bioRxiv ◽

10.1101/053975 ◽

2016 ◽

Author(s):

Sandip Chatterjee ◽

Gregory S. Stupp ◽

Sung Kyu (Robin) Park ◽

Jean-Christophe Ducom ◽

John R. Yates ◽

...

Keyword(s):

Search Engine ◽

Protein Identification ◽

High Throughput Sequencing ◽

Shotgun Proteomics ◽

Identification Accuracy ◽

Sequencing Data ◽

Protein Database ◽

Healthy Human ◽

Genomic Libraries ◽

Sequence Databases

Abstract. Background: Mass spectrometry-based shotgun proteomics experiments rely on accurate matching of experimental spectra against a database of protein sequences. Existing computational analysis methods are limited in the size of their sequence databases, which severely restricts the proteomic sequencing depth and functional analysis of highly complex samples. The growing amount of public high-throughput sequencing data will only exacerbate this problem. We designed a broadly applicable metaproteomic analysis method (ComPIL) that addresses protein database size limitations. Results: Our approach to overcome this significant limitation in metaproteomics was to design a scalable set of sequence databases assembled for optimal library querying speeds. ComPIL was integrated with a modified version of the search engine ProLuCID (termed “Blazmass”) to permit rapid matching of experimental spectra. Proof-of-principle analysis of human HEK293 lysate with a ComPIL database derived from high-quality genomic libraries was able to detect nearly all of the same peptides as a search with a human database (~500x fewer peptides in the database), with a small reduction in sensitivity. We were also able to detect proteins from the adenovirus used to immortalize these cells. We applied our method to a set of healthy human gut microbiome proteomic samples and showed a substantial increase in the number of identified peptides and proteins compared to previous metaproteomic analyses, while retaining a high degree of protein identification accuracy and allowing for a more in-depth characterization of the functional landscape of the samples. Conclusions: The combination of ComPIL with Blazmass allows proteomic searches to be performed with database sizes much larger than previously possible. These large database searches can be applied to complex meta-samples with unknown composition or proteomic samples where unexpected proteins may be identified.
The protein database, proteomics search engine, and the proteomic data files for the 5 microbiome samples characterized and discussed herein are open source and available for use and additional analysis.
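To make the idea of matching spectra against a peptide database concrete, here is a minimal Python sketch of a mass-indexed peptide lookup, a common first step in database search engines. It is not ComPIL's or Blazmass's actual schema or code: the function names, the tryptic digestion rule without missed cleavages, the minimum peptide length, and the 10 ppm tolerance are all assumptions chosen for illustration.

```python
import bisect
import re

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residues


def tryptic_peptides(protein, min_len=6):
    """Cleave after K or R unless followed by P (no missed cleavages)."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_mass_index(proteins):
    """proteins: dict of name -> sequence (hypothetical input format).

    Returns (masses, entries), both sorted by mass, where each entry is
    (mass, peptide, protein_name).
    """
    entries = []
    for name, seq in proteins.items():
        for pep in tryptic_peptides(seq):
            entries.append((peptide_mass(pep), pep, name))
    entries.sort()
    masses = [e[0] for e in entries]
    return masses, entries


def candidates(masses, entries, precursor_mass, ppm=10.0):
    """Return all indexed peptides within +/- ppm of the precursor mass."""
    tol = precursor_mass * ppm / 1e6
    lo = bisect.bisect_left(masses, precursor_mass - tol)
    hi = bisect.bisect_right(masses, precursor_mass + tol)
    return entries[lo:hi]
```

Because the index is sorted by mass, each query is a binary search rather than a scan of the whole database; a real system at the scale described in the abstract would need a disk-backed or distributed store instead of in-memory lists.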


Putative Antimicrobial Peptides of the Posterior Salivary Glands from the Cephalopod Octopus vulgaris Revealed by Exploring a Composite Protein Database

Antibiotics ◽

10.3390/antibiotics9110757 ◽

2020 ◽

Vol 9 (11) ◽

pp. 757 ◽

Cited By ~ 1

Author(s):

Daniela Almeida ◽

Dany Domínguez-Pérez ◽

Ana Matos ◽

Guillermin Agüero-Chapin ◽

Hugo Osório ◽

...

Keyword(s):

Antimicrobial Peptides ◽

Salivary Glands ◽

Protein Identification ◽

Inflammatory Responses ◽

Shotgun Proteomics ◽

Octopus Vulgaris ◽

Protein Database ◽

Venom Protein ◽

Protein Toxin ◽

Proteomics Approach

Cephalopods, successful predators, can use a mixture of substances to subdue their prey, becoming interesting sources of bioactive compounds. In addition to neurotoxins and enzymes, the presence of antimicrobial compounds has been reported. Recently, the transcriptome and the whole proteome of the Octopus vulgaris salivary apparatus were released, but the role of some compounds—e.g., histones, antimicrobial peptides (AMPs), and toxins—remains unclear. Herein, we profiled the proteome of the posterior salivary glands (PSGs) of O. vulgaris using two sample preparation protocols combined with a shotgun-proteomics approach. Protein identification was performed against a composite database comprising data from the UniProtKB, all transcriptomes available from the cephalopods’ PSGs, and a comprehensive non-redundant AMPs database. Out of the 10,075 proteins clustered in 1868 protein groups, 90 clusters corresponded to venom protein toxin families. Additionally, we detected putative AMPs clustered with histones previously found as abundant proteins in the saliva of O. vulgaris. Some of these histones, such as H2A and H2B, are involved in systemic inflammatory responses and their antimicrobial effects have been demonstrated. These results not only confirm the production of enzymes and toxins by the O. vulgaris PSGs but also suggest their involvement in the first line of defense against microbes.


PairMotifChIP: A Fast Algorithm for Discovery of Patterns Conserved in Large ChIP-seq Data Sets

BioMed Research International ◽

10.1155/2016/4986707 ◽

2016 ◽

Vol 2016 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Qiang Yu ◽

Hongwei Huo ◽

Dazheng Feng

Keyword(s):

DNA Sequences ◽

Motif Discovery ◽

High Throughput Sequencing ◽

Hamming Distance ◽

Simulated Data ◽

Real Data ◽

Identification Accuracy ◽

Data Sets ◽

Sequencing Data ◽

Data Set

Identifying conserved patterns in DNA sequences, known as motif discovery, is an important and challenging computational task. Because a high-throughput sequencing data set contains hundreds of sequences or more, it can improve the identification accuracy of motif discovery, but it also demands higher computing performance. To efficiently identify motifs in large DNA data sets, a new algorithm called PairMotifChIP is proposed, which extracts and combines pairs of l-mers in the input that have a relatively small Hamming distance. In particular, a method for rapidly extracting pairs of l-mers is designed, which can be used not only for PairMotifChIP but also for other DNA data mining tasks with the same requirement. Experimental results on simulated data show that the proposed algorithm finds motifs successfully and runs faster than state-of-the-art motif discovery algorithms. Furthermore, the validity of the proposed algorithm has been verified on real data.
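The core operation named in the abstract, collecting pairs of l-mers with a small Hamming distance, can be sketched in Python as follows. This is a naive brute-force version written only to show what such a pair is; it is not the fast extraction method the paper proposes, and the function names and the `max_dist` parameter are assumptions.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def lmers(seq, l):
    """All substrings of length l, in order of occurrence."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def close_lmer_pairs(sequences, l, max_dist):
    """Brute-force search for l-mer pairs from two different sequences
    whose Hamming distance is at most max_dist.

    Returns a list of (seq_index_a, seq_index_b, lmer_a, lmer_b, distance).
    """
    pairs = []
    for (i, s), (j, t) in combinations(enumerate(sequences), 2):
        for a in lmers(s, l):
            for b in lmers(t, l):
                d = hamming(a, b)
                if d <= max_dist:
                    pairs.append((i, j, a, b, d))
    return pairs
```

The nested loops compare every l-mer in one sequence with every l-mer in another, so the cost grows quadratically with both sequence length and sequence count, which is the kind of expense a dedicated fast method would aim to avoid on large ChIP-seq inputs.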


MODERN COMPUTATIONAL STRATEGIES FOR PROTEIN INFERENCE IN SHOTGUN PROTEOMIC

NEWS OF THE NATIONAL ACADEMY OF SCIENCES OF THE REPUBLIC OF KAZAKHSTAN ◽

10.32014/2021.2518-1726.21 ◽

2021 ◽

Vol 2 (336) ◽

pp. 56-65

Author(s):

Y. S. Golenko ◽

A. A. Ismailova

Keyword(s):

Protein Identification ◽

Separation Efficiency ◽

Peptide Identification ◽

Shotgun Proteomics ◽

Optimization Methods ◽

Tandem Mass ◽

Experimental Sample ◽

Label Free ◽

Protein Database ◽

Tandem Mass Spectra

Today, shotgun proteomics is a powerful approach to characterizing proteomes in biological samples. Unlike the top-down proteomics strategy, shotgun proteomics is characterized by high separation efficiency and mass spectral sensitivity. At the same time, it places higher demands on the computational and statistical methods required for peptide identification, protein identification, and label-free quantification. The main purpose of shotgun proteomics is to determine the identity and amount of each protein by combining liquid chromatography with tandem mass spectrometry. The analysis and interpretation of experimental data is the final and most important stage in proteomics; it also raises many problems that require complex computational solutions. One of the most important tasks is the identification of the proteins present in the experimental sample. As a rule, this task is divided into two main stages: assigning experimental tandem mass spectra to peptides obtained from the protein database, and then mapping the peptides to proteins and quantitatively assessing the reliability of the identified proteins. Assessing the reliability of the resulting data can itself be a separate task that is no less important or complex. In this article, we propose to consider protein identification only as a problem of statistical inference, and we describe a number of methods that can be used to solve it. We classify the existing approaches into (1) rule-based methods, (2) combinatorial optimization methods, and (3) probabilistic inference methods. Integer programming and Bayesian inference frameworks are used to represent these methods. We also discuss the main problems of protein identification and suggest possible solutions to them.
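As an illustration of the combinatorial optimization view of protein inference, the Python sketch below uses a greedy set-cover heuristic to choose a small set of proteins that explains all identified peptides. It is a generic parsimony heuristic, not a method taken from the article; the function name and the input format (a mapping from peptide to the set of proteins containing it) are assumptions, and a greedy pass only approximates the minimum cover that an integer program would find exactly.

```python
def greedy_protein_inference(peptide_to_proteins):
    """Pick proteins greedily until every identified peptide is explained.

    peptide_to_proteins: dict of peptide -> set of protein IDs containing it
    (hypothetical input format). Returns the selected protein IDs in the
    order they were chosen.
    """
    # Invert the mapping: protein -> set of peptides it can explain.
    protein_to_peptides = {}
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides.setdefault(prot, set()).add(pep)

    uncovered = set(peptide_to_proteins)
    selected = []
    while uncovered and protein_to_peptides:
        # Sorting first makes tie-breaking deterministic (alphabetical).
        best = max(
            sorted(protein_to_peptides),
            key=lambda p: len(protein_to_peptides[p] & uncovered),
        )
        gained = protein_to_peptides[best] & uncovered
        if not gained:
            break  # remaining peptides map to no protein at all
        selected.append(best)
        uncovered -= gained
    return selected
```

In the shared-peptide case where P1 explains {pepA, pepB}, P2 explains {pepB, pepC}, and P3 explains {pepC, pepD}, the heuristic reports P1 and P3 and drops P2, since every peptide of P2 is already explained by the other two.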


IdentiPy: An Extensible Search Engine for Protein Identification in Shotgun Proteomics

Journal of Proteome Research ◽

10.1021/acs.jproteome.7b00640 ◽

2018 ◽

Vol 17 (7) ◽

pp. 2249-2255 ◽

Cited By ~ 17

Author(s):

Lev I. Levitsky ◽

Mark V. Ivanov ◽

Anna A. Lobas ◽

Julia A. Bubis ◽

Irina A. Tarasova ◽

...

Keyword(s):

Search Engine ◽

Protein Identification ◽

Shotgun Proteomics


Faculty Opinions recommendation of Coalescent Inference Using Serially Sampled, High-Throughput Sequencing Data from Intrahost HIV Infection.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.726132071.793531014 ◽

2017 ◽

Author(s):

Sarah Rowland-Jones ◽

Sophie Andrews

Keyword(s):

HIV Infection ◽

High Throughput ◽

High Throughput Sequencing ◽

Sequencing Data ◽

High Throughput Sequencing Data


BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution

Bioinformatics ◽

10.1093/bioinformatics/btu010 ◽

2014 ◽

Vol 30 (9) ◽

pp. 1214-1219 ◽

Cited By ~ 6

Author(s):

C. Ye ◽

C. Hsiao ◽

H. Corrada Bravo

Keyword(s):

High Throughput ◽

High Throughput Sequencing ◽

Blind Deconvolution ◽

Sequencing Data ◽

Base Calling ◽

High Throughput Sequencing Data


Improvement, identification, and target prediction for miRNAs in the porcine genome by using massive, public high-throughput sequencing data

Journal of Animal Science ◽

10.1093/jas/skab018 ◽

2021 ◽

Vol 99 (2) ◽

Author(s):

Yuhua Fu ◽

Pengyu Fan ◽

Lu Wang ◽

Ziqiang Shu ◽

Shilin Zhu ◽

...

Keyword(s):

High Throughput Sequencing ◽

Target Genes ◽

Target Prediction ◽

Large Data ◽

Sequencing Data ◽

Regulate Gene Expression ◽

High Throughput Sequencing Data ◽

Annotation Information ◽

Public Data ◽

Broad Variety

Abstract Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation for known miRNAs and identify novel miRNAs that have not been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic due to redundant/missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, which was twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs by using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. The correlation-based strategy also identified the regulatory pairs that were controlled by nonbinding sites with a particular pattern, which provided abundant complementarity for studying the mechanism of miRNAs that regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs by using massive public data. This large data-based strategy is also applicable for other nonmodel organisms with incomplete annotation information.
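A correlation-based target search of the kind the abstract describes could look roughly like the Python sketch below, which scores each gene by the Pearson correlation between its expression and a miRNA's expression across matched samples. The abstract does not give the statistic, the sign convention, or any cutoff, so the use of Pearson correlation, the focus on negative correlation, the `threshold` value, and all names here are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_candidate_targets(mirna_expr, gene_expr, threshold=-0.8):
    """Return (gene, r) pairs with r <= threshold, most negative first.

    mirna_expr: expression of one miRNA across samples.
    gene_expr: dict of gene -> expression across the same samples.
    The negative-correlation criterion is an assumption of this sketch.
    """
    hits = []
    for gene, values in gene_expr.items():
        r = pearson(mirna_expr, values)
        if r <= threshold:
            hits.append((gene, r))
    return sorted(hits, key=lambda h: h[1])
```

Such a score uses only expression evidence, so it can complement sequence-based prediction rather than replace it, which matches the abstract's framing of correlation as additional evidence.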


High-precision and cost-efficient sequencing for real-time COVID-19 surveillance

Scientific Reports ◽

10.1038/s41598-021-93145-4 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sung Yong Park ◽

Gina Faraci ◽

Pamela M. Ward ◽

Jane F. Emerson ◽

Ha Youn Lee

Keyword(s):

Los Angeles ◽

Whole Genome Sequencing ◽

Real Time ◽

Genome Sequencing ◽

High Precision ◽

High Throughput Sequencing ◽

Whole Genome ◽

Sequencing Data ◽

Public Health Response ◽

Cost Efficient

Abstract COVID-19 global cases have climbed to more than 33 million, with over a million total deaths, as of September 2020. Real-time massive SARS-CoV-2 whole genome sequencing is key to tracking chains of transmission and estimating the origin of disease outbreaks. Yet no methods have simultaneously achieved high precision, simple workflow, and low cost. We developed a high-precision, cost-efficient SARS-CoV-2 whole genome sequencing platform for COVID-19 genomic surveillance, CorvGenSurv (Coronavirus Genomic Surveillance). CorvGenSurv directly amplified viral RNA from COVID-19 patients’ nasopharyngeal/oropharyngeal (NP/OP) swab specimens and sequenced the SARS-CoV-2 whole genome in three segments by long-read, high-throughput sequencing. Sequencing of the whole genome in three segments significantly reduced sequencing data waste, thereby preventing dropouts in genome coverage. We validated the precision of our pipeline by both control genomic RNA sequencing and Sanger sequencing. We produced near full-length whole genome sequences from individuals who were COVID-19 test positive during April to June 2020 in Los Angeles County, California, USA. These sequences were highly diverse in the G clade with nine novel amino acid mutations including NSP12-M755I and ORF8-V117F. With its readily adaptable design, CorvGenSurv grants wide access to genomic surveillance, permitting immediate public health response to sudden threats.


Experimental infection with the hookworm, Necator americanus, is associated with stable gut microbial diversity in human volunteers with relapsing multiple sclerosis

BMC Biology ◽

10.1186/s12915-021-01003-6 ◽

2021 ◽

Vol 19 (1) ◽

Author(s):

Timothy P. Jenkins ◽

David I. Pritchard ◽

Radu Tanasescu ◽

Gary Telford ◽

Marina Papaiakovou ◽

...

Keyword(s):

Multiple Sclerosis ◽

Experimental Infection ◽

High Throughput Sequencing ◽

Alpha Diversity ◽

Placebo Treatment ◽

Sequencing Data ◽

Faecal Microbiota ◽

Microbial Composition ◽

Necator Americanus ◽

Human Volunteers

Abstract. Background: Helminth-associated changes in gut microbiota composition have been hypothesised to contribute to the immune-suppressive properties of parasitic worms. Multiple sclerosis is an immune-mediated autoimmune disease of the central nervous system whose pathophysiology has been linked to imbalances in gut microbial communities. Results: In the present study, we investigated, for the first time, qualitative and quantitative changes in the faecal bacterial composition of human volunteers with remitting multiple sclerosis (RMS) prior to and following experimental infection with the human hookworm, Necator americanus (N+), and following anthelmintic treatment, and compared the findings with data obtained from a cohort of RMS patients subjected to placebo treatment (PBO). Bacterial 16S rRNA high-throughput sequencing data revealed significantly decreased alpha diversity in the faecal microbiota of PBO compared to N+ subjects over the course of the trial; additionally, we observed significant differences in the abundances of several bacterial taxa with putative immune-modulatory functions between study cohorts. Parabacteroides were significantly expanded in the faecal microbiota of N+ individuals for which no clinical and/or radiological relapses were recorded at the end of the trial. Conclusions: Overall, our data lend support to the hypothesis of a contributory role of parasite-associated alterations in gut microbial composition to the immune-modulatory properties of hookworm parasites.


deepBase v3.0: expression atlas and interactive analysis of ncRNAs from thousands of deep-sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkaa1039 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D877-D883

Author(s):

Fangzhou Xie ◽

Shurong Liu ◽

Junhao Wang ◽

Jiajia Xuan ◽

Xiaoqin Zhang ◽

...

Keyword(s):

High Throughput Sequencing ◽

Clinical Information ◽

Sequencing Data ◽

Normal Tissues ◽

Interactive Analysis ◽

High Throughput Sequencing Data ◽

Expression Atlas ◽

Expression Evolution ◽

Noninvasive Biomarkers ◽

Cancer Tissues

Abstract Eukaryotic genomes encode thousands of small and large non-coding RNAs (ncRNAs). However, the expression, functions and evolution of these ncRNAs are still largely unknown. In this study, we have updated deepBase to version 3.0 (deepBase v3.0, http://rna.sysu.edu.cn/deepbase3/index.html), an increasingly popular and openly licensed resource that facilitates integrative and interactive display and analysis of the expression, evolution, and functions of various ncRNAs by deeply mining thousands of high-throughput sequencing data from tissue, tumor and exosome samples. We updated deepBase v3.0 to provide the most comprehensive expression atlas of small RNAs and lncRNAs by integrating ∼67 620 data from 80 normal tissues and ∼50 cancer tissues. The extracellular patterns of various ncRNAs were profiled to explore their applications for discovery of noninvasive biomarkers. Moreover, we constructed survival maps of tRNA-derived RNA Fragments (tRFs), miRNAs, snoRNAs and lncRNAs by analyzing >45 000 cancer sample data and corresponding clinical information. We also developed interactive webs to analyze the differential expression and biological functions of various ncRNAs in ∼50 types of cancers. This update is expected to provide a variety of new modules and graphic visualizations to facilitate analyses and explorations of the functions and mechanisms of various types of ncRNAs.
