Centrifuge: rapid and sensitive classification of metagenomic sequences

Mapping Intimacies ◽

10.1101/054965 ◽

2016 ◽

Cited By ~ 11

Author(s):

Daehwan Kim ◽

Li Song ◽

Florian P. Breitwieser ◽

Steven L. Salzberg

Keyword(s):

High Speed ◽

Classification Problem ◽

Data Sets ◽

Accurate Analysis ◽

Desktop Computers ◽

Metagenomics Data ◽

High Throughput DNA Sequencing ◽

Small Index ◽

Burrows Wheeler Transform

Abstract: Centrifuge is a novel microbial classification engine that enables rapid, accurate and sensitive labeling of reads and quantification of species on desktop computers. The system uses an indexing scheme based on the Burrows-Wheeler transform (BWT) and the Ferragina-Manzini (FM) index, optimized specifically for the metagenomic classification problem. Centrifuge requires a relatively small index (4.2 GB for 4,078 bacterial and 200 archaeal genomes) and classifies sequences at very high speed, allowing it to process the millions of reads from a typical high-throughput DNA sequencing run within a few minutes. Together these advances enable timely and accurate analysis of large metagenomics data sets on conventional desktop computers. Because of its space-optimized indexing schemes, Centrifuge also makes it possible to index the entire NCBI non-redundant nucleotide sequence database (a total of 109 billion bases) with an index size of 69 GB, in contrast to k-mer based indexing schemes, which require far more extensive space. Centrifuge is available as free, open-source software from www.ccb.jhu.edu/software/centrifuge
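To make the BWT and FM-index idea in this abstract concrete, here is a minimal Python sketch of exact-match counting by backward search. It is a toy illustration only, not Centrifuge's implementation: the function names `bwt_index` and `count_matches` are hypothetical, the suffix array is built by naive sorting, and the occurrence table is stored uncompressed, so none of the space optimizations the abstract describes are present.

```python
from collections import Counter


def bwt_index(text):
    """Build a toy FM index: suffix array, BWT string, C table and Occ table."""
    text += "$"  # sentinel; assumed absent from the input and smaller than any base
    # Naive suffix array construction (fine for short strings only).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C table: number of characters in the text strictly smaller than each symbol.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # Occ table: occ[ch][i] is the number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for sym in occ:
            occ[sym][i + 1] = occ[sym][i] + (1 if sym == ch else 0)
    return sa, bwt, c_table, occ


def count_matches(pattern, c_table, occ, n):
    """Count exact occurrences of pattern by backward search over the BWT."""
    lo, hi = 0, n  # half-open interval of suffix array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


sa, bwt, c_table, occ = bwt_index("GATTACAGATTACA")
print(count_matches("GATTACA", c_table, occ, len(bwt)))  # prints 2
```

Each pattern character narrows the suffix array interval in one step, so the search time depends on the pattern length rather than on the size of the indexed text.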


Automatic Classification of Morphologically Similar Fish Species Using Their Head Contours

Applied Sciences ◽

10.3390/app10103408 ◽

2020 ◽

Vol 10 (10) ◽

pp. 3408

Author(s):

Pere Marti-Puig ◽

Amalia Manjabacas ◽

Antoni Lombarte

Keyword(s):

Fish Species ◽

Image Data ◽

Classification Problem ◽

Marine Resources ◽

Data Sets ◽

Similar Form ◽

Multi Class Classification ◽

Demersal Species ◽

Do So

This work deals with the task of distinguishing between different Mediterranean demersal species of fish that share a remarkably similar form and that are also used for the evaluation of marine resources. The experts who are currently able to classify these types of species do so by considering only a segment of the contour of the fish, specifically its head, instead of using the entire silhouette of the animal. Based on this knowledge, a set of features to classify contour segments is presented to address both a binary and a multi-class classification problem. In addition to the difficulty present in successfully discriminating between very similar forms, we have the limitation of having small, unreliably labeled image data sets. The results obtained were comparable to those obtained by trained experts.


A Novel Algorithm for Imbalance Data Classification Based on Neighborhood Hypergraph

The Scientific World Journal ◽

10.1155/2014/876875 ◽

2014 ◽

Vol 2014 ◽

pp. 1-13 ◽

Cited By ~ 2

Author(s):

Feng Hu ◽

Xiao Liu ◽

Jin Dai ◽

Hong Yu

Keyword(s):

Rough Set Theory ◽

Classification Problem ◽

Boundary Region ◽

Training Data ◽

Data Sets ◽

Data Set ◽

Imbalance Data ◽

Average Accuracy ◽

Fold Cross Validation

The classification problem for imbalanced data is receiving increasing attention. Many significant methods have been proposed and applied in many fields, but more efficient methods are still needed. Although the hypergraph is an efficient tool for knowledge discovery, it may not be powerful enough to deal with data in the boundary region. In this paper, the neighborhood hypergraph is presented, combining rough set theory and the hypergraph. A novel classification algorithm for imbalanced data based on the neighborhood hypergraph is then developed, composed of three steps: initialization of hyperedges, classification of the training data set, and substitution of hyperedges. In an experiment using 10-fold cross validation on 18 data sets, the proposed algorithm achieved higher average accuracy than the other methods compared.
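The 10-fold cross validation protocol mentioned in this abstract can be sketched in a few lines of Python. This is a generic illustration, not the authors' experimental code: `k_fold_indices` and `cross_validate` are hypothetical names, the `fit` and `predict` callables are placeholders for any classifier, and the split is neither shuffled nor stratified, which a real experiment on imbalanced data would normally want.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) index lists for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def cross_validate(samples, labels, fit, predict, k=10):
    """Return the mean accuracy over k folds.

    fit(train_samples, train_labels) returns a model;
    predict(model, sample) returns a predicted label.
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        hits = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k
```

Every sample is used for testing exactly once, so the averaged accuracy reflects performance on data the model did not see during training in that fold.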


Investigating the impact of pre-processing techniques and pre-trained word embeddings in detecting Arabic health information on social media

Journal of Big Data ◽

10.1186/s40537-021-00488-w ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Yahya Albalawi ◽

Jim Buckley ◽

Nikola S. Nikolov

Keyword(s):

Social Media ◽

Deep Learning ◽

Comprehensive Evaluation ◽

Classification Problem ◽

Data Sets ◽

Word Embeddings ◽

Data Set ◽

Lower Accuracy ◽

Health Related ◽

The Impact

Abstract: This paper presents a comprehensive evaluation of data pre-processing and word embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processings applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processings improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier with F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to BLSTM led to the most accurate model with F1 score of 75.2% and accuracy of 90.7% compared to F1 score of 90.8% achieved by Mazajak CBOW for the same architecture but with lower accuracy of 70.89%. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first dataset, but significantly worse on the second dataset.


An aggregate method for thorax diseases classification

Scientific Reports ◽

10.1038/s41598-021-81765-9 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Bayu Adhi Nugroho

Keyword(s):

Network Architecture ◽

Classification Problem ◽

Calculation Algorithm ◽

Training Pattern ◽

Deep Network ◽

Network Training ◽

Chest X Ray ◽

Medical Image Classification ◽

Positive Pattern

Abstract: A common problem found in real-world medical image classification is the inherent imbalance of the positive and negative patterns in the dataset, where positive patterns are usually rare. Moreover, in the classification of multiple classes with a neural network, a training pattern is treated as a positive pattern in one output node and negative in all the remaining output nodes. In this paper, the weights of a training pattern in the loss function are designed based not only on the number of the training patterns in the class but also on the different nodes where one of them treats this training pattern as positive and the others treat it as negative. We propose a combined approach of a weight calculation algorithm for deep network training and the training optimization from the state-of-the-art deep network architecture for the thorax disease classification problem. Experimental results on the Chest X-Ray image dataset demonstrate that this new weighting scheme improves classification performance, and that the training optimization from EfficientNet improves performance further. We compare the aggregate method with several results from previous studies of thorax disease classification to provide fair comparisons against the proposed method.
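The general idea of weighting positive and negative patterns differently at each output node can be illustrated with a class-balanced binary cross-entropy. The Python sketch below is a common inverse-frequency scheme chosen for illustration; it is an assumption, not the weight calculation algorithm proposed in the paper, and the names `pos_neg_weights` and `weighted_bce` are hypothetical.

```python
import math


def pos_neg_weights(labels):
    """Compute (w_pos, w_neg) for each output node from a 0/1 label matrix.

    labels is a list of rows, one row per training pattern and one column
    per class. A rare positive class receives a large positive weight.
    """
    n = len(labels)
    weights = []
    for c in range(len(labels[0])):
        pos = sum(row[c] for row in labels)
        neg = n - pos
        weights.append((neg / n, pos / n))
    return weights


def weighted_bce(probs, labels, weights, eps=1e-7):
    """Mean per-pattern weighted binary cross-entropy over all output nodes."""
    total = 0.0
    for p_row, y_row in zip(probs, labels):
        for p, y, (w_pos, w_neg) in zip(p_row, y_row, weights):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return total / len(labels)


# Four patterns, two classes; each class has a single positive pattern.
labels = [[1, 0], [0, 0], [0, 0], [0, 1]]
weights = pos_neg_weights(labels)
print(weights)  # prints [(0.75, 0.25), (0.75, 0.25)]
```

With these weights, a missed positive for a rare class costs three times as much as a false alarm on one of its negatives, which counteracts the tendency of an unweighted loss to favor the majority pattern.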


Classification of jujube defects in small data sets based on transfer learning

Neural Computing and Applications ◽

10.1007/s00521-021-05715-2 ◽

2021 ◽

Author(s):

Jianping Ju ◽

Hong Zheng ◽

Xiaohang Xu ◽

Zhongyuan Guo ◽

Zhaohui Zheng ◽

...

Keyword(s):

Transfer Learning ◽

Loss Function ◽

Training Model ◽

Parameter Distribution ◽

Test Accuracy ◽

Small Data ◽

Data Sets ◽

Data Set ◽

Small Data Sets

Abstract: Although convolutional neural networks have achieved success in the field of image classification, there are still challenges in the field of agricultural product quality sorting such as machine vision-based jujube defects detection. The performance of jujube defect detection mainly depends on the feature extraction and the classifier used. Due to the diversity of the jujube materials and the variability of the testing environment, the traditional method of manually extracting the features often fails to meet the requirements of practical application. In this paper, a jujube sorting model in small data sets based on convolutional neural network and transfer learning is proposed to meet the actual demand of jujube defects detection. Firstly, the original images collected from the actual jujube sorting production line were pre-processed, and the data were augmented to establish a data set of five categories of jujube defects. The original CNN model is then improved by embedding the SE module and using the triplet loss function and the center loss function to replace the softmax loss function. Finally, the depth pre-training model on the ImageNet image data set was used to conduct training on the jujube defects data set, so that the parameters of the pre-training model could fit the parameter distribution of the jujube defects image, and the parameter distribution was transferred to the jujube defects data set to complete the transfer of the model and realize the detection and classification of the jujube defects. The classification results are visualized by heatmap through the analysis of classification accuracy and confusion matrix compared with the comparison models. The experimental results show that the SE-ResNet50-CL model optimizes the fine-grained classification problem of jujube defect recognition, and the test accuracy reaches 94.15%. The model has good stability and high recognition accuracy in complex environments.


Fast and robust identity-by-descent inference with the templated positional Burrows-Wheeler transform

Molecular Biology and Evolution ◽

10.1093/molbev/msaa328 ◽

2020 ◽

Author(s):

William A Freyman ◽

Kimberly F McManus ◽

Suyash S Shringarpure ◽

Ethan M Jewett ◽

Katarzyna Bryc ◽

...

Keyword(s):

Isolation By Distance ◽

False Negative ◽

Segment Length ◽

Data Sets ◽

Haplotype Sharing ◽

Binary File ◽

Inference Algorithms ◽

Out Of Sample ◽

Massive Scale ◽

Burrows Wheeler Transform

Abstract: Estimating the genomic location and length of identical-by-descent (IBD) segments among individuals is a crucial step in many genetic analyses. However, the exponential growth in the size of biobank and direct-to-consumer (DTC) genetic data sets makes accurate IBD inference a significant computational challenge. Here we present the templated positional Burrows-Wheeler transform (TPBWT) to make fast IBD estimates robust to genotype and phasing errors. Using haplotype data simulated over pedigrees with realistic genotyping and phasing errors, we show that the TPBWT outperforms other state-of-the-art IBD inference algorithms in terms of speed and accuracy. For each phase-aware method, we explore the false positive and false negative rates of inferring IBD by segment length and characterize the types of error commonly found. Our results highlight the fragility of most phased IBD inference methods; the accuracy of IBD estimates can be highly sensitive to the quality of haplotype phasing. Additionally, we compare the performance of the TPBWT against a widely used phase-free IBD inference approach that is robust to phasing errors. We introduce both in-sample and out-of-sample TPBWT-based IBD inference algorithms and demonstrate their computational efficiency on massive-scale datasets with millions of samples. Furthermore, we describe the binary file format for TPBWT-compressed haplotypes that results in fast and efficient out-of-sample IBD computes against very large cohort panels. Finally, we demonstrate the utility of the TPBWT in a brief empirical analysis exploring geographic patterns of haplotype sharing within Mexico. Hierarchical clustering of IBD shared across regions within Mexico reveals geographically structured haplotype sharing and a strong signal of isolation by distance. Our software implementation of the TPBWT is freely available for non-commercial use in the code repository https://github.com/23andMe/phasedibd.


Detection and Classification of Anomalies in Large Data Sets on the Basis of Information Granules

IEEE Transactions on Fuzzy Systems ◽

10.1109/tfuzz.2021.3076265 ◽

2021 ◽

pp. 1-1

Author(s):

Adam Kiersztyn ◽

Paweł Karczmarek ◽

Krystyna Kiersztyn ◽

Witold Pedrycz

Keyword(s):

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Information Granules


Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell RNA Sequencing ◽

Tree Algorithms ◽

R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree and C5.0. The details of these implementations are not the same, and hence their performances differ from one application to another. We are interested in their performance in the classification of cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compare the packages' prediction performances based on their Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score and AUC; C5.0 prefers more complex trees; tree is consistently much faster than the others, although its complexity is often higher.
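For reference, the Precision, Recall and F1-score named in this abstract follow the standard definitions, which the short Python sketch below computes for a binary labeling. It illustrates the metrics only and is not the benchmark code from the study; `precision_recall_f1` is a hypothetical helper name, and returning 0.0 when a denominator is zero is one convention among several.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Precision: fraction of predicted positives that are correct.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of true positives that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Two true positives, one false positive, one false negative.
p, r, f = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

For a multi-class cell annotation task, the same function can be called once per cell type with `positive` set to that type, and the results averaged.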


Classification of unlabeled online media

Scientific Reports ◽

10.1038/s41598-021-85608-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Sakthi Kumar Arul Prakash ◽

Conrad Tucker

Keyword(s):

Social Media ◽

Real World ◽

Graphical Model ◽

Ground Truth ◽

Classification Problem ◽

Machine Learning Algorithms ◽

Social Media Networks ◽

Online Social Media ◽

Wide Range

Abstract: This work investigates the ability to classify misinformation in online social media networks in a manner that avoids the need for ground truth labels. Rather than approach the classification problem as a task for humans or machine learning algorithms, this work leverages user–user and user–media (i.e., media likes) interactions to infer the type of information (fake vs. authentic) being spread, without needing to know the actual details of the information itself. To study the inception and evolution of user–user and user–media interactions over time, we create an experimental platform that mimics the functionality of real-world social media networks. We develop a graphical model that considers the evolution of this network topology to model the uncertainty (entropy) propagation when fake and authentic media disseminate across the network. The creation of a real-world social media network enables a wide range of hypotheses to be tested pertaining to users, their interactions with other users, and with media content. The discovery that the entropy of user–user and user–media interactions approximates fake and authentic media likes enables us to classify fake media in an unsupervised learning manner.
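The uncertainty measure referred to in this abstract is entropy. As a point of reference, the Python sketch below computes the Shannon entropy, in bits, of a list of observed events. It shows the basic quantity only, under the assumption that interactions can be summarized as discrete event counts; it is not the graphical model described in the paper, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(events):
    """Return the Shannon entropy in bits of the empirical event distribution."""
    counts = Counter(events)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# An even split between two outcomes carries one full bit of uncertainty.
print(shannon_entropy(["fake", "authentic", "fake", "authentic"]))  # prints 1.0
```

Entropy is zero when every observed event is the same and grows as the outcomes become more evenly spread.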


Action Recognition Network Using Stacked Short-Term Deep Features and Bidirectional Moving Average

Applied Sciences ◽

10.3390/app11125563 ◽

2021 ◽

Vol 11 (12) ◽

pp. 5563

Author(s):

Jinsol Ha ◽

Joongchol Shin ◽

Hasil Park ◽

Joonki Paik

Keyword(s):

Action Recognition ◽

High Speed ◽

Moving Average ◽

Video Clip ◽

Visual Surveillance ◽

Temporal Information ◽

Feature Maps ◽

Short Term ◽

Accurate Analysis ◽

Difference Images

Action recognition requires the accurate analysis of action elements in the form of a video clip and a properly ordered sequence of the elements. To solve the two sub-problems, it is necessary to learn both spatio-temporal information and the temporal relationship between different action elements. Existing convolutional neural network (CNN)-based action recognition methods have focused on learning only spatial or temporal information without considering the temporal relation between action elements. In this paper, we create short-term pixel-difference images from the input video, and take the difference images as an input to a bidirectional exponential moving average sub-network to analyze the action elements and their temporal relations. The proposed method consists of: (i) generation of RGB and differential images, (ii) extraction of deep feature maps using an image classification sub-network, (iii) weight assignment to extracted feature maps using a bidirectional, exponential, moving average sub-network, and (iv) late fusion with a three-dimensional convolutional (C3D) sub-network to improve the accuracy of action recognition. Experimental results show that the proposed method achieves a higher performance level than existing baseline methods. In addition, the proposed action recognition network takes only 0.075 seconds per action class, which guarantees various high-speed or real-time applications, such as abnormal action classification, human–computer interaction, and intelligent visual surveillance.
