An iterative and automated computational pipeline for untargeted strain-level identification using MS/MS spectra from pathogenic samples

Mapping Intimacies ◽

10.1101/812313 ◽

2019 ◽

Author(s):

Mathias Kuhring ◽

Joerg Doellinger ◽

Andreas Nitsche ◽

Thilo Muth ◽

Bernhard Y. Renard

Keyword(s):

Statistical Power ◽

Sequence Data ◽

A Priori ◽

Search Space ◽

Strain Level ◽

Reference Sequence ◽

Viral Origin ◽

Identification Of Species ◽

Taxonomic Assignments

Abstract Untargeted accurate strain-level classification of a priori unidentified organisms using tandem mass spectrometry is a challenging task. Reference databases often lack taxonomic depth, limiting peptide assignments to the species level. However, the extension with detailed strain information increases runtime and decreases statistical power. In addition, larger databases contain a higher number of similar proteomes. We present TaxIt, an iterative workflow to address the increasing search space required for MS/MS-based strain-level classification of samples with unknown taxonomic origin. TaxIt first applies reference sequence data for initial identification of species candidates, followed by automated acquisition of relevant strain sequences for low-level classification. Furthermore, proteome similarities resulting in ambiguous taxonomic assignments are addressed with an abundance weighting strategy to improve candidate confidence. We apply our iterative workflow on several samples of bacterial and viral origin. In comparison to non-iterative approaches using unique peptides or advanced abundance correction, TaxIt identifies microbial strains correctly in all examples presented (with one tie), thereby demonstrating the potential for untargeted and deeper taxonomic classification. TaxIt makes extensive use of public, unrestricted and continuously growing sequence resources such as the NCBI databases and is available under open-source license at https://gitlab.com/rki_bioinformatics.
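The two-stage idea in this abstract (search a small species-level database first, then search only the strains of the best species candidates) can be illustrated with a toy sketch. Everything below is hypothetical: the peptide strings, the database layout, and the function names are made up for illustration, exact string lookup stands in for a real MS/MS search engine, and the abundance weighting step is not modeled. It is not TaxIt's implementation.

```python
from collections import Counter

# Hypothetical toy databases: taxon -> set of peptide sequences.
SPECIES_DB = {
    "species_A": {"PEPTIDEK", "LAGER", "SHAREDK"},
    "species_B": {"WINTERK", "SHAREDK"},
}
STRAIN_DB = {
    "species_A": {
        "strain_A1": {"PEPTIDEK", "LAGER", "SHAREDK", "VALIDK"},
        "strain_A2": {"PEPTIDEK", "LAGER", "SHAREDK", "MEDIAR"},
    },
    "species_B": {
        "strain_B1": {"WINTERK", "SHAREDK", "CHAIR"},
    },
}


def count_matches(observed, database):
    """Count, per taxon, how many observed peptides occur in its proteome."""
    counts = Counter()
    for taxon, peptides in database.items():
        counts[taxon] = sum(1 for p in observed if p in peptides)
    return counts


def two_stage_identify(observed, top_n=1):
    """Stage 1 picks species candidates; stage 2 searches only their strains."""
    species_counts = count_matches(observed, SPECIES_DB)
    candidates = [s for s, _ in species_counts.most_common(top_n)]

    # Build the (much smaller) strain-level search space on demand.
    strain_db = {}
    for species in candidates:
        strain_db.update(STRAIN_DB.get(species, {}))

    return candidates, count_matches(observed, strain_db)


observed = ["PEPTIDEK", "LAGER", "SHAREDK", "MEDIAR"]
candidates, strain_counts = two_stage_identify(observed)
best_strain = strain_counts.most_common(1)[0][0]
print(candidates, dict(strain_counts), best_strain)
```

In this sketch the strain-level search never touches the strains of species_B, which is the point of the iteration: the second search space grows with the number of candidates, not with the size of the full strain collection.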


Rooting morphologically divergent taxa – slow-evolving sequence data might help

10.1101/2020.03.15.983684 ◽

2020 ◽

Author(s):

Jorge Flores ◽

Alexander C. Bippus ◽

Alexandru Tomescu ◽

Neil Bell ◽

Jaakko Hyvönen

Keyword(s):

Sequence Data ◽

Phylogenetic Analyses ◽

Mitochondrial Gene ◽

A Priori ◽

Nuclear Gene ◽

Search Space ◽

Detailed Examination ◽

Morphological Characters ◽

Current Data ◽

Parsimonious Tree

Abstract When fossils are sparse and the lineages studied are very divergent morphologically, analyses based exclusively on morphology may lead to conflicting and unexpected hypotheses. Through integration of data from conservative genes/gene regions the terminals including these data can anchor or constrain the search, thereby practically circumscribing the search space of the combined analyses. In this study, we revisit the phylogeny of a highly divergent group of mosses, class Polytrichopsida. We supplemented the morphological matrix by adding sequence data of the nuclear gene 18S, chloroplast genes rbcL and rps4, plus the mitochondrial gene nad5. For the phylogenetic analyses we used parsimony as the optimality criterion. Analyses that included all the terminals resulted in one most parsimonious tree with a clade comprised of Alophosia azorica and the fossil Meantoinea alophosioides representing the basal-most lineage. Analyses with different outgroup sampling produced the same topology for most ingroup relationships. An analysis excluding morphological characters and the four terminals for which only morphological characters were scored (the two fossil and two extant terminals) resulted in one optimal tree with identical topology to the one obtained when including all terminals. These results are largely congruent with those obtained in the recent analyses based exclusively on sequence level data of a larger number of terminals. Our results indicate that large size and complexity of the gametophyte have evolved independently in several lineages. Notably, the nodes of the backbone of the most parsimonious tree have very low support values, thus these inferred relationships could change if new additional information conflicts with the current data. Future studies should be aimed at incorporating all terminals into phylogenetic analyses, which is not an unrealistic goal for a group with less than 200 species. Also, additional fossils, some of which await detailed examination and description, need to be included. Whether these will affect the overall pattern of phylogeny presented here remains to be seen. In a group that is obviously very ancient, we cannot assume, a priori, that currently known fossil taxa, which go back in time less than 140 Ma, represent the oldest lineages of the group.


How to get your goat: automated identification of species from MALDI-ToF spectra

Bioinformatics ◽

10.1093/bioinformatics/btaa181 ◽

2020 ◽

Vol 36 (12) ◽

pp. 3719-3725

Author(s):

Simon Hickinbotham ◽

Sarah Fiddyment ◽

Timothy L Stinson ◽

Matthew J Collins

Keyword(s):

Sequence Data ◽

New Method ◽

Supplementary Information ◽

Automated Identification ◽

Confidence Measure ◽

MALDI-ToF ◽

Automated Method ◽

Identification Of Species ◽

Quantifiable Level

Abstract Motivation: Classification of archaeological animal samples is commonly achieved via manual examination of matrix-assisted laser desorption/ionization time-of-flight (MALDI-ToF) spectra. This is a time-consuming process which requires significant training and which does not produce a measure of confidence in the classification. We present a new, automated method for arriving at a classification of a MALDI-ToF sample, provided the collagen sequences for each candidate species are available. The approach derives a set of peptide masses from the sequence data for comparison with the sample data, which is carried out by cross-correlation. A novel way of combining evidence from multiple marker peptides is used to interpret the raw alignments and arrive at a classification with an associated confidence measure. Results: To illustrate the efficacy of the approach, we tested the new method with a previously published classification of parchment folia from a copy of the Gospel of Luke, produced around 1120 C.E. by scribes at St Augustine’s Abbey in Canterbury, UK. In total, 80 of the 81 samples were given identical classifications by both methods. In addition, the new method gives a quantifiable level of confidence in each classification. Availability and implementation: The software can be found at https://github.com/bioarch-sjh/bacollite, and can be installed in R using devtools. Supplementary information: Supplementary data are available at Bioinformatics online.
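The step of deriving peptide masses from sequence data and comparing them with a sample by cross-correlation can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the bacollite implementation: it assumes a plain tryptic digest, singly protonated monoisotopic masses, no post-translational modifications or isotope envelopes, and a made-up sequence fragment and peak list. All function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        following = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and following != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mz(peptide):
    """m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON


def binned(peaks, width=0.1):
    """Turn (m/z, intensity) pairs into a sparse binned spectrum."""
    spectrum = {}
    for mz, intensity in peaks:
        index = round(mz / width)
        spectrum[index] = spectrum.get(index, 0.0) + intensity
    return spectrum


def cross_correlation(theoretical, observed, max_lag=5):
    """Score each bin shift (lag) by the overlap of the two spectra."""
    return {
        lag: sum(t * observed.get(b + lag, 0.0) for b, t in theoretical.items())
        for lag in range(-max_lag, max_lag + 1)
    }


# Hypothetical sequence fragment and sample peaks, for illustration only.
fragment = "GPAGPKGDAGAR"
theoretical = binned([(peptide_mz(p), 1.0) for p in tryptic_peptides(fragment)])
sample_peaks = [(526.50, 80.0), (546.52, 120.0), (600.00, 30.0)]
scores = cross_correlation(theoretical, binned(sample_peaks))
best_lag = max(scores, key=scores.get)
print(best_lag, scores[best_lag])
```

Scanning over lags lets the comparison tolerate a small calibration offset between the theoretical and measured m/z axes; here the best lag of two bins corresponds to a shift of about 0.2 Da.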


Gist – an ensemble approach to the taxonomic classification of metatranscriptomic sequence data

10.1101/081026 ◽

2016 ◽

Author(s):

Samantha Halliday ◽

John Parkinson

Keyword(s):

Species Interactions ◽

Sequence Data ◽

RNA-Seq ◽

Sequencing Data ◽

Community Profiling ◽

Search Tool ◽

Functional Identities ◽

Taxonomic Assignments ◽

Generation Sequencing

Abstract The study of whole microbial communities through RNA-seq, or metatranscriptomics, offers a unique view of the relative levels of activity for different genes across a large number of species simultaneously. To make sense of these sequencing data, it is necessary to be able to assign both taxonomic and functional identities to each sequenced read. High-quality identifications are important not only for community profiling, but to also ensure that functional assignments of sequence reads are correctly attributed to their source taxa. Such assignments allow biochemical pathways to be appropriately allocated to discrete species, enabling the capture of cross-species interactions. Typically read annotation is performed by a single alignment-based search tool such as BLAST. However, due to the vast extent of bacterial diversity, these approaches tend to be highly error prone, particularly for taxonomic assignments. Here we introduce a novel program for generating taxonomic assignments, called Gist, which integrates information from a number of machine learning methods and the Burrows-Wheeler Aligner. Uniquely Gist establishes the most appropriate weightings of methods for individual genomes, facilitating high classification accuracy on next-generation sequencing reads. We validate our approach using a synthetic metatranscriptome generator based on Flux Simulator, termed Genepuddle. Further, unlike previous taxonomic classifiers, we demonstrate the capacity of composition-based techniques to accurately inform on taxonomic origin without resorting to longer scanning windows that mimic alignment-based methods. Gist is made freely available under the terms of the GNU General Public License at compsysbio.org/gist.


Replication of the Superstition and Performance Study by

Social Psychology ◽

10.1027/1864-9335/a000190 ◽

2014 ◽

Vol 45 (3) ◽

pp. 239-245 ◽

Cited By ~ 18

Author(s):

Robert J. Calin-Jageman ◽

Tracy L. Caldwell

Keyword(s):

Task Difficulty ◽

Statistical Power ◽

Meta Analysis ◽

A Priori ◽

Significant Heterogeneity ◽

Performance Study ◽

Improve Performance ◽

Research Designs ◽

Series Of Experiments ◽

And Performance

A recent series of experiments suggests that fostering superstitions can substantially improve performance on a variety of motor and cognitive tasks (Damisch, Stoberock, & Mussweiler, 2010). We conducted two high-powered and precise replications of one of these experiments, examining if telling participants they had a lucky golf ball could improve their performance on a 10-shot golf task relative to controls. We found that the effect of superstition on performance is elusive: Participants told they had a lucky ball performed almost identically to controls. Our failure to replicate the target study was not due to lack of impact, lack of statistical power, differences in task difficulty, nor differences in participant belief in luck. A meta-analysis indicates significant heterogeneity in the effect of superstition on performance. This could be due to an unknown moderator, but no effect was observed among the studies with the strongest research designs (e.g., high power, a priori sampling plan).


On-Shelf Utility Mining of Sequence Data

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3457570 ◽

2021 ◽

Vol 16 (2) ◽

pp. 1-31

Author(s):

Chunkai Zhang ◽

Zilin Du ◽

Yuting Yang ◽

Wensheng Gan ◽

Philip S. Yu

Keyword(s):

High Efficiency ◽

Sequence Data ◽

Real Life ◽

Search Space ◽

Upper Bounds ◽

Utility Mining ◽

Limited Memory ◽

Time Periods ◽

High Utility ◽

Synthetic Datasets

Utility mining has emerged as an important and interesting topic owing to its wide application and considerable popularity. However, conventional utility mining methods have a bias toward items that have longer on-shelf time as they have a greater chance to generate a high utility. To eliminate the bias, the problem of on-shelf utility mining (OSUM) is introduced. In this article, we focus on the task of OSUM of sequence data, where the sequential database is divided into several partitions according to time periods and items are associated with utilities and several on-shelf time periods. To address the problem, we propose two methods, OSUM of sequence data (OSUMS) and OSUMS+, to extract on-shelf high-utility sequential patterns. For further efficiency, we also design several strategies to reduce the search space and avoid redundant calculation with two upper bounds: time prefix extension utility (TPEU) and time reduced sequence utility (TRSU). In addition, two novel data structures are developed for facilitating the calculation of upper bounds and utilities. Substantial experimental results on certain real and synthetic datasets show that the two methods outperform the state-of-the-art algorithm. In conclusion, OSUMS may consume a large amount of memory and is unsuitable for cases with limited memory, while OSUMS+ has wider real-life applications owing to its high efficiency.
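The bias correction described here can be made concrete with a small sketch. It assumes the definition commonly used in on-shelf utility mining: a pattern's utility is divided by the total utility of only those time periods in which all of its items are on shelf. The data, the single-item-per-element sequence format, and the function names are hypothetical simplifications; the sketch does not reproduce OSUMS, OSUMS+, or the TPEU and TRSU bounds.

```python
NEG_INF = float("-inf")


def pattern_utility(pattern, sequence):
    """Maximum utility over all occurrences of pattern as a subsequence.

    sequence is a list of (item, utility) pairs; returns 0.0 if absent.
    """
    best = [0.0] + [NEG_INF] * len(pattern)
    for item, utility in sequence:
        # Go right to left so each element extends a match at most once.
        for j in range(len(pattern), 0, -1):
            if item == pattern[j - 1] and best[j - 1] != NEG_INF:
                best[j] = max(best[j], best[j - 1] + utility)
    return best[-1] if best[-1] != NEG_INF else 0.0


def on_shelf_utility(pattern, database, shelf_periods):
    """Pattern utility relative to the periods where all its items are on shelf."""
    periods = set.intersection(*(shelf_periods[item] for item in pattern))
    pattern_total = 0.0
    period_total = 0.0
    for period in periods:
        for sequence in database.get(period, []):
            pattern_total += pattern_utility(pattern, sequence)
            period_total += sum(utility for _, utility in sequence)
    return pattern_total / period_total if period_total else 0.0


# Hypothetical database: time period -> list of sequences.
database = {
    1: [[("a", 4.0), ("b", 2.0)], [("a", 4.0), ("c", 6.0)]],
    2: [[("a", 4.0), ("b", 2.0), ("d", 10.0)], [("d", 10.0), ("b", 2.0)]],
}
# Hypothetical on-shelf periods per item; "d" is only sold in period 2.
shelf_periods = {"a": {1, 2}, "b": {1, 2}, "c": {1}, "d": {2}}

ratio_ab = on_shelf_utility(("a", "b"), database, shelf_periods)
ratio_d = on_shelf_utility(("d",), database, shelf_periods)
print(round(ratio_ab, 3), round(ratio_d, 3))
```

Item "d" earns 20 of the 28 utility units generated in the one period it is on shelf; measured against the whole database (44 units) its share would look smaller, which is the bias toward long-shelved items that the abstract refers to.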


Enriching Textual Search Results at Query Time Using Entity Mining, Linked Data and Link Analysis

International Journal of Semantic Computing ◽

10.1142/s1793351x14400170 ◽

2014 ◽

Vol 08 (04) ◽

pp. 515-544 ◽

Cited By ~ 3

Author(s):

Pavlos Fafalios ◽

Panagiotis Papadakos ◽

Yannis Tzitzikas

Keyword(s):

Semantic Information ◽

A Priori ◽

Open Data ◽

Search Space ◽

Link Analysis ◽

Query Time ◽

Search Results ◽

Ranking Scheme ◽

Web Of Data ◽

Comparative Results

The integration of the classical Web (of documents) with the emerging Web of Data is a challenging vision. In this paper we focus on an integration approach during searching which aims at enriching the responses of non-semantic search systems with semantic information, i.e. Linked Open Data (LOD), and exploiting the outcome for offering advanced exploratory search services which provide an overview of the search space and allow the users to explore the related LOD. We use named entities identified in the search results for automatically connecting search hits with LOD and we consider a scenario where this entity-based integration is performed at query time with no human effort and no a-priori indexing which is beneficial in terms of configurability and freshness. However, the number of identified entities can be high and the same is true for the semantic information about these entities that can be fetched from the available LOD. To this end, in this paper we propose a Link Analysis-based method which is used for ranking (and thus selecting to show) the more important semantic information related to the search results. We report the results of a survey regarding the marine domain with promising results, and comparative results that illustrate the effectiveness of the proposed (PageRank-based) ranking scheme. Finally, we report experimental results regarding efficiency showing that the proposed functionality can be offered even at query time.
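The PageRank-based ranking mentioned above can be illustrated with a plain power-iteration sketch over a small entity graph. The graph, its marine-themed node names, and the parameter values are assumptions chosen for illustration; how the paper actually builds its graph from search hits and Linked Open Data is not specified in the abstract and is not modeled here.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict mapping node -> list of linked nodes.

    Every link target must also appear as a key of graph.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical entity graph: edges point from an entity to related entities.
entity_graph = {
    "Thunnus albacares": ["Scombridae", "Pacific Ocean"],
    "Katsuwonus pelamis": ["Scombridae"],
    "Scombridae": ["Perciformes"],
    "Perciformes": [],
    "Pacific Ocean": [],
}

ranks = pagerank(entity_graph)
ranking = sorted(ranks, key=ranks.get, reverse=True)
for entity in ranking:
    print(f"{entity}: {ranks[entity]:.3f}")
```

Sorting by score and keeping only the top few entries is one simple way to select which related entities to show alongside the search results.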


Using next generation sequencing of alpine plants to improve fecal metabarcoding diet analysis for Dall’s sheep

BMC Research Notes ◽

10.1186/s13104-021-05590-z ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelly E. Williams ◽

Damian M. Menning ◽

Eric J. Wald ◽

Sandra L. Talbot ◽

Kumi L. Rattenbury ◽

...

Keyword(s):

Sequence Data ◽

Vascular Plant ◽

Alpine Plants ◽

Diet Analysis ◽

Reference Sequence ◽

Reference Library ◽

Ovis Dalli ◽

Plant Animal Interactions ◽

Dall’s Sheep ◽

Northwestern North America

Abstract Objectives: Dall’s sheep (Ovis dalli dalli) are important herbivores in the mountainous ecosystems of northwestern North America, and recent declines in some populations have sparked concern. Our aim was to improve capabilities for fecal metabarcoding diet analysis of Dall’s sheep and other herbivores by contributing new sequence data for arctic and alpine plants. This expanded reference library will provide critical reference sequence data that will facilitate metabarcoding diet analysis of Dall’s sheep and thus improve understanding of plant-animal interactions in a region undergoing rapid climate change. Data description: We provide sequences for the chloroplast rbcL gene of 16 arctic-alpine vascular plant species that are known to comprise the diet of Dall’s sheep. These sequences contribute to a growing reference library that can be used in diet studies of arctic herbivores.


Sequence data from isolated lichen-associated melanized fungi enhance delimitation of two new lineages within Chaetothyriomycetidae

Mycological Progress ◽

10.1007/s11557-021-01706-8 ◽

2021 ◽

Vol 20 (7) ◽

pp. 911-927

Author(s):

Lucia Muggia ◽

Yu Quan ◽

Cécile Gueidan ◽

Abdullah M. S. Al-Hatmi ◽

Martin Grube ◽

...

Keyword(s):

Sequence Data ◽

Single Species ◽

Sister Group ◽

Asexual Propagation ◽

DNA Sequence Data ◽

Wide Range ◽

The Family ◽

Rock Inhabiting Fungi ◽

Stable Habitat

Abstract Lichen thalli provide a long-lived and stable habitat for colonization by a wide range of microorganisms. Increased interest in these lichen-associated microbial communities has revealed an impressive diversity of fungi, including several novel lineages which still await formal taxonomic recognition. Among these, members of the Eurotiomycetes and Dothideomycetes usually occur asymptomatically in the lichen thalli, even if they share ancestry with fungi that may be parasitic on their host. Mycelia of the isolates are characterized by melanized cell walls and the fungi display exclusively asexual propagation. Their taxonomic placement requires, therefore, the use of DNA sequence data. Here, we consider recently published sequence data from lichen-associated fungi and characterize and formally describe two new, individually monophyletic lineages at family, genus, and species levels. The Pleostigmataceae fam. nov. and Melanina gen. nov. both comprise rock-inhabiting fungi that associate with epilithic, crust-forming lichens in subalpine habitats. The phylogenetic placement and the monophyly of Pleostigmataceae lack statistical support, but the family was resolved as sister to the order Verrucariales. This family comprises the species Pleostigma alpinum sp. nov., P. frigidum sp. nov., P. jungermannicola, and P. lichenophilum sp. nov. The placement of the genus Melanina is supported as a lineage within the Chaetothyriales. To date, this genus comprises the single species M. gunde-cimermaniae sp. nov. and forms a sister group to a large lineage including Herpotrichiellaceae, Chaetothyriaceae, Cyphellophoraceae, and Trichomeriaceae. The new phylogenetic analysis of the subclass Chaetothyriomycetidae provides new insight into genus and family level delimitation and classification of this ecologically diverse group of fungi.


Modeling the Process of Event Sequence Data Generated for Working Condition Diagnosis

Mathematical Problems in Engineering ◽

10.1155/2015/693450 ◽

2015 ◽

Vol 2015 ◽

pp. 1-13

Author(s):

Jianwei Ding ◽

Yingbo Liu ◽

Li Zhang ◽

Jianmin Wang

Keyword(s):

Working Condition ◽

Sequence Data ◽

A Priori ◽

Real Data ◽

Data Sets ◽

Main Task ◽

Event Sequence ◽

Telemetry Data ◽

Condition Monitoring Systems ◽

Condition Diagnosis

Condition monitoring systems are widely used to monitor the working condition of equipment, generating a vast amount and variety of telemetry data in the process. The main task of surveillance is to analyze these routinely collected telemetry data to help assess the working condition of the equipment. However, with the rapid increase in the volume of telemetry data, it is a nontrivial task to analyze all the telemetry data to understand the working condition of the equipment without any a priori knowledge. In this paper, we propose a probabilistic generative model called the working condition model (WCM), which is capable of simulating the process by which event sequence data are generated and of depicting the working condition of equipment at runtime. With the help of WCM, we are able to analyze how the event sequence data behave in different working modes and to detect the working mode of an event sequence (working condition diagnosis). Furthermore, we have applied WCM to illustrative applications such as automated detection of anomalous event sequences during the runtime of equipment. Our experimental results on real data sets demonstrate the effectiveness of the model.
