Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection

Mapping Intimacies ◽

10.1101/334722 ◽

2018 ◽

Cited By ~ 1

Author(s):

Ehsaneddin Asgari ◽

Philipp C. Münch ◽

Till R. Lesker ◽

Alice C. McHardy ◽

Mohammad R.K. Mofrad

Keyword(s):

Rheumatoid Arthritis ◽

16S rRNA ◽

State Of The Art ◽

Data Representation ◽

The State ◽

Biomarker Detection ◽

New Paradigm ◽

Phenotype Prediction ◽

Nucleotide Pair ◽

Phenotype Classification

Abstract: Identifying combinations of taxa distinctive for microbiome-associated diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype classification and biomarker detection. This method and software, called DiTaxa, substitutes standard OTU-clustering or sequence-level analysis by segmenting 16S rRNA reads into the most frequent variable-length subsequences. These subsequences are then used as the data representation for downstream phenotype prediction, biomarker detection and taxonomic analysis. Our proposed sequence segmentation, called nucleotide-pair encoding (NPE), is an unsupervised data-driven segmentation inspired by byte-pair encoding, a data compression algorithm. The identified subsequences represent commonly occurring sequence portions, which we found to be distinctive for taxa at varying evolutionary distances and highly informative for predicting host phenotypes. We compared the performance of DiTaxa to the state-of-the-art methods in disease phenotype prediction and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa identified 17 out of 29 taxa with confirmed links to periodontitis (recall = 0.59), relative to 3 out of 29 taxa (recall = 0.10) by the state-of-the-art method. On synthetic benchmark data, DiTaxa obtained full precision and recall in biomarker detection, compared to 0.91 and 0.90, respectively. In addition, machine-learning classifiers trained to predict host disease phenotypes based on the NPE representation performed competitively with the state-of-the-art using OTUs or k-mers. For the rheumatoid arthritis dataset, DiTaxa substantially outperformed OTU features with a macro-F1 score of 0.76 compared to 0.65. Due to its alignment- and reference-free nature, DiTaxa can run efficiently on large datasets. The full analysis of a large 16S rRNA dataset of 1359 samples required ≈1.5 hours on 20 cores, while the standard pipeline needed ≈6.5 hours in the same setting.

Availability: An implementation of our method called DiTaxa is available under the Apache 2 licence at http://llp.berkeley.edu/ditaxa.
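The abstract above describes NPE only as a segmentation inspired by byte-pair encoding. As a rough illustration of that general idea, and not of DiTaxa's actual implementation, the following minimal sketch applies plain byte-pair-encoding merges to nucleotide strings. The function names `learn_merges` and `apply_merge`, the tie-breaking rule, and the toy reads are all assumptions made for this example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(reads, num_merges):
    """Learn up to `num_merges` BPE-style merges over a list of nucleotide reads.

    Hypothetical helper: a generic byte-pair-encoding loop, not DiTaxa's code.
    """
    # Start with every read split into single nucleotides.
    segmented = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all reads.
        pair_counts = Counter()
        for symbols in segmented:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an assumption chosen here only to make the result deterministic).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        segmented = [apply_merge(symbols, best) for symbols in segmented]
    return merges, segmented


# Toy reads, invented for illustration only.
example_merges, example_segments = learn_merges(["ACGACG", "ACGT"], 2)
print(example_merges)
print(example_segments)
```

Each merge turns the most frequent adjacent pair into a single longer symbol, so frequent sequence portions end up as variable-length segments, while concatenating the segments of a read always reproduces the original read.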


DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection

Bioinformatics ◽

10.1093/bioinformatics/bty954 ◽

2018 ◽

Vol 35 (14) ◽

pp. 2498-2500 ◽

Cited By ~ 6

Author(s):

Ehsaneddin Asgari ◽

Philipp C Münch ◽

Till R Lesker ◽

Alice C McHardy ◽

Mohammad R K Mofrad

Keyword(s):

16S rRNA ◽

State Of The Art ◽

Operational Taxonomic Unit ◽

Supplementary Information ◽

Biomarker Detection ◽

New Paradigm ◽

Nucleotide Pair ◽

Bowel Diseases ◽

Benchmark Datasets ◽

Inflammatory Bowel

Abstract

Summary: Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, substitutes standard operational taxonomic unit (OTU)-clustering by segmenting 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer-based state-of-the-art approach in phenotype prediction while outperforming the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, as evaluated over known links from the literature and synthetic benchmark datasets.

Availability and implementation: DiTaxa is available under the Apache 2 license at http://llp.berkeley.edu/ditaxa.

Supplementary information: Supplementary data are available at Bioinformatics online.


Multi-threaded ASP solving with clasp

Theory and Practice of Logic Programming ◽

10.1017/s1471068412000166 ◽

2012 ◽

Vol 12 (4-5) ◽

pp. 525-545 ◽

Cited By ~ 14

Author(s):

MARTIN GEBSER ◽

BENJAMIN KAUFMANN ◽

TORSTEN SCHAUB

Keyword(s):

Experimental Analysis ◽

State Of The Art ◽

Data Representation ◽

The State ◽

Communication Architecture ◽

Answer Set

Abstract: We present the new multi-threaded version of the state-of-the-art answer set solver clasp. We detail its component and communication architecture and illustrate how they support the principal functionalities of clasp. Also, we provide some insights into the data representation used for different constraint types handled by clasp. All this is accompanied by an extensive experimental analysis of the major features related to multi-threading in clasp.


Practical Picture Processing

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100051700 ◽

1974 ◽

Vol 32 ◽

pp. 338-339

Author(s):

T. A. Welton

Keyword(s):

Radiation Damage ◽

Coherence Length ◽

Spatial Information ◽

State Of The Art ◽

Coherent Radiation ◽

The State ◽

Energy Spread ◽

Electron Micrograph ◽

Picture Processing ◽

Molecular Skeleton

Various authors have emphasized the spatial information resident in an electron micrograph taken with adequately coherent radiation. In view of the completion of at least one such instrument, this opportunity is taken to summarize the state of the art of processing such micrographs. We use the usual symbols for the aberration coefficients, and supplement these with £ and 6 for the transverse coherence length and the fractional energy spread respectively. We also assume a weak, biologically interesting sample, with principal interest lying in the molecular skeleton remaining after obvious hydrogen loss and other radiation damage has occurred.
