ProteoClade: a taxonomic toolkit for multi-species and metaproteomic analysis

2019 ◽  
Author(s):  
Arshag D. Mooradian ◽  
Sjoerd van der Post ◽  
Kristen M. Naegle ◽  
Jason M. Held

Abstract We present ProteoClade, a Python toolkit that performs taxa-specific peptide assignment, protein inference, and quantitation for multi-species proteomics experiments. ProteoClade scales to hundreds of millions of protein sequences, requires minimal computational resources, and is open source, multi-platform, and accessible to non-programmers. We demonstrate its utility for processing quantitative proteomic data from patient-derived xenografts, and its speed and scalability enable a novel de novo proteomic workflow for complex microbiota samples.
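As a rough illustration of what taxa-specific peptide assignment involves, the following is a minimal sketch and not ProteoClade's actual API. It digests a few toy protein sequences with a simplified trypsin rule and labels each observed peptide by the taxa that contain it. The function names, the made-up sequences, and the cleavage rule (cut after K or R, ignoring the proline exception and missed cleavages) are all assumptions made for the example.

```python
from collections import defaultdict


def tryptic_digest(sequence, min_length=6):
    """Split a protein after every K or R (simplified trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in "KR":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return [p for p in peptides if len(p) >= min_length]


def build_peptide_index(proteins_by_taxon):
    """Map each peptide to the set of taxa whose proteins contain it."""
    index = defaultdict(set)
    for taxon, sequences in proteins_by_taxon.items():
        for sequence in sequences:
            for peptide in tryptic_digest(sequence):
                index[peptide].add(taxon)
    return index


def assign_taxa(peptides, index):
    """Label each observed peptide as taxon-specific, shared, or unmatched."""
    labels = {}
    for peptide in peptides:
        taxa = index.get(peptide, set())
        if len(taxa) == 1:
            labels[peptide] = next(iter(taxa))
        elif taxa:
            labels[peptide] = "shared"
        else:
            labels[peptide] = "unmatched"
    return labels


# Hypothetical toy database: made-up sequences, not real proteins.
toy_db = {
    "human": ["MAAGTLLKSSPEEVDRGGHHWWYK"],
    "mouse": ["MAAGTLLKSSPEEVDRTTNNQQLLK"],
}
peptide_index = build_peptide_index(toy_db)
labels = assign_taxa(
    ["SSPEEVDR", "GGHHWWYK", "TTNNQQLLK", "AAAAAAK"], peptide_index
)
```

In a xenograft setting, peptides labelled with a single taxon could be attributed to graft or host, while shared peptides would need a separate rule. How ProteoClade itself resolves such cases is not stated in the abstract.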

Life ◽  
2019 ◽  
Vol 9 (1) ◽  
pp. 8 ◽  
Author(s):  
Michael S. Wang ◽  
Kenric J. Hoegler ◽  
Michael H. Hecht

Life as we know it would not exist without the ability of protein sequences to bind metal ions. Transition metals, in particular, play essential roles in a wide range of structural and catalytic functions. The ubiquitous occurrence of metalloproteins in all organisms leads one to ask whether metal binding is an evolved trait that occurred only rarely in ancestral sequences, or alternatively, whether it is an innate property of amino acid sequences, occurring frequently in unevolved sequence space. To address this question, we studied 52 proteins from a combinatorial library of novel sequences designed to fold into 4-helix bundles. Although these sequences were neither designed nor evolved to bind metals, the majority of them have innate tendencies to bind the transition metals copper, cobalt, and zinc with high nanomolar to low-micromolar affinity.


Author(s):  
Yuansheng Liu ◽  
Xiaocai Zhang ◽  
Quan Zou ◽  
Xiangxiang Zeng

Abstract Summary: Removing duplicate and near-duplicate reads generated by high-throughput sequencing technologies can reduce the computational resources required by downstream applications. Here we develop minirmd, a de novo tool that removes duplicate reads via multiple rounds of clustering using minimizers of different lengths. Experiments demonstrate that minirmd removes more near-duplicate reads than existing clustering approaches and is faster than existing multi-core tools. To the best of our knowledge, minirmd is the first tool to remove near-duplicates on the reverse-complementary strand. Availability and implementation: https://github.com/yuansliu/minirmd. Supplementary information: Supplementary data are available at Bioinformatics online.
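The idea of clustering reads by minimizers over several rounds can be sketched as follows. This is a simplified illustration under stated assumptions and not minirmd's implementation: it takes one minimizer per read (the lexicographically smallest k-mer over both strands), buckets reads by that minimizer, and drops a read if it is within a small Hamming distance of a read already kept in the same bucket, on either strand. The function names, the default k values, and the mismatch threshold are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    return read.translate(_COMPLEMENT)[::-1]


def canonical_minimizer(read, k):
    """Smallest k-mer over the read and its reverse complement."""
    best = None
    for strand in (read, reverse_complement(read)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if best is None or kmer < best:
                best = kmer
    return best


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(a, b, max_mismatches):
    """Compare equal-length reads on either strand."""
    if len(a) != len(b):
        return False
    return (hamming(a, b) <= max_mismatches
            or hamming(a, reverse_complement(b)) <= max_mismatches)


def remove_near_duplicates(reads, ks=(8, 6), max_mismatches=1):
    """Run one bucketing round per minimizer length, keeping first-seen reads."""
    kept = list(reads)
    for k in ks:
        buckets = defaultdict(list)
        survivors = []
        for read in kept:
            key = canonical_minimizer(read, k)
            if any(is_near_duplicate(read, other, max_mismatches)
                   for other in buckets[key]):
                continue  # near-duplicate of a read already kept
            buckets[key].append(read)
            survivors.append(read)
        kept = survivors
    return kept


# Made-up reads: an original, an exact copy, its reverse complement,
# a one-mismatch variant, and an unrelated read.
reads = [
    "ACGTTGCAAGCT",
    "ACGTTGCAAGCT",
    "AGCTTGCAACGT",
    "ACGTTGCAAGCA",
    "GGGGCCCCTTTT",
]
unique_reads = remove_near_duplicates(reads)
```

A mismatch that falls inside the minimizer sends two near-identical reads to different buckets, which is one plausible reason for repeating the round with a different minimizer length.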


2005 ◽  
Vol 71 (11) ◽  
pp. 7152-7163 ◽  
Author(s):  
Christophe Gitton ◽  
Mickael Meyrand ◽  
Juhui Wang ◽  
Christophe Caron ◽  
Alain Trubuil ◽  
...  

ABSTRACT We have compared the proteomic profiles of L. lactis subsp. cremoris NCDO763 growing in the synthetic medium M17Lac, skim milk microfiltrate (SMM), and skim milk. SMM was used as a simple model medium to reproduce the initial phase of growth of L. lactis in milk. To widen the analysis of the cytoplasmic proteome, we used two different gel systems (pH ranges of 4 to 7 and 4.5 to 5.5), and the proteins associated with the cell envelopes were also studied by two-dimensional electrophoresis. In the course of the study, we analyzed about 800 spots and identified 330 proteins by mass spectrometry. We observed that the levels of more than 50 and 30 proteins were significantly increased upon growth in SMM and milk, respectively. The large redeployment of protein synthesis was essentially associated with an activation of pathways involved in the metabolism of nitrogenous compounds: peptidolytic and peptide transport systems, amino acid biosynthesis and interconversion, and de novo biosynthesis of purines. We also showed that enzymes involved in reactions feeding the purine biosynthetic pathway in one-carbon units and amino acids have an increased level in SMM and milk. The analysis of the proteomic data suggested that the glutamine synthetase (GS) would play a pivotal role in the adaptation to SMM and milk. The analysis of glnA expression during growth in milk and the construction of a glnA-defective mutant confirmed that GS is an essential enzyme for the development of L. lactis in dairy media. This analysis thus provides a proteomic signature of L. lactis, a model lactic acid bacterium, growing in its technological environment.


2017 ◽  
Vol 61 (4) ◽  
pp. 421-426 ◽  
Author(s):  
Joanna Kołsut ◽  
Paulina Borówka ◽  
Błażej Marciniak ◽  
Ewelina Wójcik ◽  
Arkadiusz Wojtasik ◽  
...  

Abstract Introduction: Colibacillosis, the most common disease of poultry, is caused mainly by avian pathogenic Escherichia coli (APEC). However, no undisputed pattern in the molecular basis of the pathogenicity of these bacteria has yet been established. In this study, genomes of APEC were investigated to assess the importance and explore the distribution of 16 genes recognised as their virulence factors. Material and Methods: A total of 14 E. coli strains pathogenic for poultry were isolated, and their DNA was sequenced, assembled de novo, and annotated. Amino acid sequences from these bacteria and an additional 16 freely available APEC amino acid sequences were analysed with the DIFFIND tool to define their virulence factors. Results: The DIFFIND tool enabled quick, reliable, and convenient assessment of the differences between compared amino acid sequences from bacterial genomes. The presence of 16 protein sequences indicated as pathogenicity factors in poultry was used to generate a heatmap that categorises genomes by the presence and similarity of the analysed protein sequences. Conclusion: The proposed method of detecting virulence factors with the DIFFIND tool may be useful for analysing similarities among E. coli sequences and sequences from other bacteria. Phylogenetic analysis resulted in reliable segregation of 30 APEC strains into five main clusters containing various virulence-associated genes (VAGs).


2021 ◽  
Vol 8 ◽  
Author(s):  
Charles Christoffer ◽  
Vijay Bharadwaj ◽  
Ryan Luu ◽  
Daisuke Kihara

Protein-protein docking is a useful tool for modeling the structures of protein complexes that have yet to be experimentally determined. Understanding the structures of protein complexes is a key component for formulating hypotheses in biophysics regarding the functional mechanisms of complexes. Protein-protein docking is an established technique for cases where the structures of the subunits have been determined. While the number of known structures deposited in the Protein Data Bank is increasing, there are still many cases where the structures of individual proteins that users want to dock are not determined yet. Here, we have integrated the AttentiveDist method for protein structure prediction into our LZerD webserver for protein-protein docking, which enables users to simply submit protein sequences and obtain full-complex atomic models, without having to supply any structure themselves. We have further extended the LZerD docking interface with a symmetrical homodimer mode. The LZerD server is available at https://lzerd.kiharalab.org/.


2017 ◽  
Author(s):  
Pierre Peterlongo ◽  
Chloé Riou ◽  
Erwan Drezen ◽  
Claire Lemaitre

Abstract Motivation: Next Generation Sequencing (NGS) data provide unprecedented access to life mechanisms. In particular, these data make it possible to detect polymorphisms such as SNPs and indels. As these polymorphisms represent a fundamental source of information in agronomy, the environment, or medicine, their detection in NGS data is now a routine task. The main methods for their prediction usually need a reference genome. However, non-model organisms and highly divergent genomes, such as those in cancer studies, are also extensively investigated. Results: We propose DiscoSnp++, in which we revisit the DiscoSnp algorithm. DiscoSnp++ is designed for detecting and ranking all kinds of SNPs and small indels from raw read set(s). It outputs files in fasta and VCF formats. In particular, predicted variants can be automatically localized afterwards on a reference genome if available. Its usage is extremely simple and its low resource requirements make it usable on common desktop computers. Results show that DiscoSnp++ performs better than state-of-the-art methods in terms of computational resources and in terms of result quality. An important novelty is the de novo detection of indels, for which we obtained 99% precision when calling indels on simulated human datasets and 90% recall on high-confidence indels from the Platinum dataset. License: GNU Affero General Public License. Availability: https://github.com/GATB/[email protected]
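For readers unfamiliar with the two figures quoted above, precision and recall for a set of variant calls can be computed as in this minimal sketch. The variant keys and the example calls are invented for illustration; this is not DiscoSnp++'s evaluation code, and real benchmarks typically need to normalise indel representations before comparing.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for two collections of variant keys."""
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical indels keyed by (chromosome, position, ref allele, alt allele).
truth_set = {
    ("chr1", 100, "AT", "A"),
    ("chr1", 250, "G", "GCC"),
    ("chr2", 40, "CTT", "C"),
    ("chr2", 90, "A", "AG"),
}
calls = {
    ("chr1", 100, "AT", "A"),
    ("chr1", 250, "G", "GCC"),
    ("chr2", 40, "CTT", "C"),
    ("chr3", 10, "T", "TA"),  # not in the truth set: a false positive
}
precision, recall = precision_recall(calls, truth_set)
```

Here three of the four calls are correct and three of the four true indels are found, so both values are 0.75.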


Blood ◽  
2021 ◽  
Vol 138 (Supplement 1) ◽  
pp. 3427-3427
Author(s):  
Michael H Kramer ◽  
Qiang Zhang ◽  
Robert W. Sprung ◽  
Petra Erdmann-Gilmore ◽  
Daniel R George ◽  
...  

Abstract Introduction: Proteins, despite being the primary effectors of cellular processes, are often studied only indirectly through analysis of the transcriptome. However, it is clear that the relationship between mRNA expression and protein expression is approximate at best. In Acute Myeloid Leukemia (AML), the genome and transcriptome have been thoroughly characterized, but the proteome has been less well studied. Here, we present a deep-scale study of the proteomes of 44 primary AML bone marrow samples representing a wide range of AML across the spectrum of cytogenetic risk, common mutations, and driver fusions. Methods: Bone marrow samples were collected at presentation from 44 adult patients with de novo AML as part of an institutional banking protocol, and buffy coat cells were immediately cryopreserved without further manipulation. Cryovials were thawed in the presence of the cell permeable serine protease inhibitor diisopropyl fluorophosphate (DFP) to inactivate the abundant neutrophil serine proteases (ELANE, CTSG, PRTN3, and PRSS57), and further processed for nano-liquid chromatography mass spectrometry in the presence of an extensive cocktail of protease inhibitors. Both label-free quantification (LFQ) and tandem-mass-tag (TMT) deep-scale proteomics were performed on these 44 patient samples, as well as 3 lineage-depleted bone marrow samples from healthy adult donors. Matching RNA-seq and exome sequencing data were available for the same samples as part of The Cancer Genome Atlas (TCGA) AML project. Results: 10,651 and 6,679 unique proteins were detected in the TMT and LFQ experiments, respectively. The correlation between measurements derived from the independent proteomic platforms (i.e. TMT and LFQ) was higher (mean Spearman correlation, 0.60, Figure 1A) than the correlation between proteomic (TMT) and transcriptomic measurements from bulk RNA-seq data (Spearman 0.43, Figure 1B). 
Quality checks of the proteomic data strongly supported the reliability of quantification of protein measurements; for example, the mean ratio of beta globin protein (HBB) to alpha globin (HBA1) was 1.2 +/- 0.25 (Figure 1C), and several proteins known to be dysregulated by specific AML-initiating fusion proteins (for PML-RARA, HGF and RARA; for RUNX1-RUNX1T1, RUNX1T1; and for CBFB-MYH11, MYH11) were detected in the expected samples (Figure 1D). Globally, 1,364 proteins were differentially expressed in the AML samples (corrected p-value <0.05, fold change ≥ 1.5) compared to the lineage-depleted, healthy bone marrow samples. Globally overexpressed proteins were enriched for ribosomal RNA modification, mitochondrial protein import, nuclear export, and the mitochondrial electron transport chain, among others. These overexpressed proteins include 61 cell surface proteins that could potentially represent therapeutic targets (overexpressed on average in 82% of AML samples, range 25-97%). Globally downregulated proteins in AML samples were enriched for glycogen metabolism and protein groups associated with mature neutrophils (reflecting the expected maturation block in AML), among others. 771 of the 1364 differentially expressed proteins (56.5%) showed only minimal variability in mRNA expression levels (fold change of <1.1 between AML and normal marrow CD34 cell mRNA) that could not explain dysregulated protein expression. Several protein complexes likewise showed coordinated differential expression in the proteomic data, but no change in the transcriptome, including the THO complex (Figure 1E) and the phosphorylase kinase complex (Figure 1F), among others, indicating the presence of posttranscriptional regulation of the levels of many proteins in AML samples. Conclusion: We have created a deep-scale proteomic database from a set of well-characterized AML samples, allowing for a proteogenomic study of AML. 
We have identified many examples of post-transcriptional regulation of key metabolic pathways that may be relevant for better understanding AML cell metabolism and therapeutic vulnerabilities. Additional studies linking patterns of protein dysregulation with a variety of AML covariates are underway. Figure 1. Disclosures: No relevant conflicts of interest to declare.
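The Spearman correlations reported in the Results compare the rank order of paired measurements rather than their raw values. A minimal standard-library sketch is given below; the abundance values and variable names are invented, and this is not the authors' analysis pipeline.

```python
def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical abundances of five proteins on two platforms.
tmt_values = [1.0, 2.0, 3.0, 4.0, 5.0]
lfq_values = [2.1, 1.9, 3.5, 3.4, 6.0]
rho = spearman(tmt_values, lfq_values)
```

For these invented values two neighbouring pairs swap rank order, which gives a coefficient of 0.8.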


Author(s):  
R. Zhang ◽  
M. Mirdita ◽  
E. Levy Karin ◽  
C. Norroy ◽  
C. Galiez ◽  
...  

Summary: SpacePHARER (CRISPR Spacer Phage-Host Pair Finder) is a sensitive and fast tool for de novo prediction of phage-host relationships by identifying phage genomes that match CRISPR spacers in genomic or metagenomic data. SpacePHARER gains sensitivity by comparing spacers and phages at the protein level, optimizing its scores for matching very short sequences, and combining evidence from multiple matches, while controlling for false positives. We demonstrate SpacePHARER by searching a comprehensive spacer list against all complete phage genomes. Availability and implementation: SpacePHARER is available as open-source (GPLv3), user-friendly command-line software for Linux and macOS at spacepharer.soedinglab.org.


2015 ◽  
Author(s):  
Alejandro Hernandez Wences ◽  
Michael Schatz

Genome assembly projects typically run multiple algorithms in an attempt to find the single best assembly, although those assemblies often have complementary, if untapped, strengths and weaknesses. We present our metassembler algorithm that merges multiple assemblies of a genome into a single superior sequence. We apply it to the four genomes from the Assemblathon competitions and show it consistently and substantially improves the contiguity and quality of each assembly. We also develop guidelines for metassembly by systematically evaluating 120 permutations of merging the top 5 assemblies of the first Assemblathon competition. The software is open-source at http://metassembler.sourceforge.net.
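If the 120 permutations correspond to every possible ordering of the five input assemblies (an assumption; the abstract does not spell this out), the count follows from 5! = 120. The short check below uses placeholder labels rather than the names of the actual Assemblathon entries.

```python
from itertools import permutations

# Placeholder labels standing in for the top five assemblies.
assemblies = ["A1", "A2", "A3", "A4", "A5"]

# Every ordering in which the five assemblies could be merged.
merge_orders = list(permutations(assemblies))
print(len(merge_orders))  # 120
```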


Author(s):  
Taiga Abe ◽  
Ian Kinsella ◽  
Shreya Saxena ◽  
Liam Paninski ◽  
John P. Cunningham

Abstract A major goal of computational neuroscience is to develop powerful analysis tools that operate on large datasets. These methods provide an essential toolset to unlock scientific insights from new experiments. Unfortunately, a major obstacle currently impedes progress: while existing analysis methods are frequently shared as open source software, the infrastructure needed to deploy these methods (at scale, reproducibly, cheaply, and quickly) remains totally inaccessible to all but a minority of expert users. As a result, many users cannot fully exploit these tools, due to constrained computational resources (limited or costly compute hardware) and/or mismatches in expertise (experimentalists vs. large-scale computing experts). In this work we develop Neuroscience Cloud Analysis As a Service (NeuroCAAS): a fully-managed infrastructure platform, based on modern large-scale computing advances, that makes state-of-the-art data analysis tools accessible to the neuroscience community. We offer NeuroCAAS as an open source service with a drag-and-drop interface, entirely removing the burden of infrastructure expertise, purchasing, maintenance, and deployment. NeuroCAAS is enabled by three key contributions. First, NeuroCAAS cleanly separates tool implementation from usage, allowing cutting-edge methods to be served directly to the end user with no need to read or install any analysis software. Second, NeuroCAAS automatically scales as needed, providing reliable, highly elastic computational resources that are more efficient than personal or lab-supported hardware, without management overhead. Finally, we show that many popular data analysis tools offered through NeuroCAAS outperform typical analysis solutions (in terms of speed and cost) while improving ease of use and maintenance, dispelling the myth that cloud compute is prohibitively expensive and technically inaccessible. 
By removing barriers to fast, efficient cloud computation, NeuroCAAS can dramatically accelerate both the dissemination and the effective use of cutting-edge analysis tools for neuroscientific discovery.

