Utilizing the Microbiota and Machine Learning Algorithms to Assess Risk of Salmonella Contamination in Poultry Rinsate

Journal of Food Protection ◽

10.4315/jfp-20-367 ◽

2021 ◽

Author(s):

Hannah Bolinger ◽

David Tran ◽

Kenneth Harary ◽

George C. Paoli ◽

Giselle Guron ◽

...

Keyword(s):

Machine Learning ◽

Machine Learning Algorithms ◽

Diagnostic Tools ◽

Sequencing Data ◽

Testing Methods ◽

16S Sequencing ◽

Sequencing Technologies ◽

Microbiological Testing ◽

Microbiome Data ◽

Larger Sample

Traditional microbiological testing methods are slow, and many molecular-based techniques rely on culture-based enrichment to overcome low limits of detection. Recent advancements in sequencing technologies may make it possible to utilize machine learning (ML) to identify patterns in microbiome data to potentially predict the presence or absence of pathogens. In this study, 299 poultry rinsate samples from various points in the processing chain were analyzed to determine if microbiota could inform about a sample’s risk for containing Salmonella. Samples were culture confirmed as Salmonella-positive or -negative following modified USDA MLG protocols. The culture confirmation result was used as a reference to compare with 16S sequencing data. Pre-chill samples tested positive (71/82) at a higher frequency than post-chill samples (30/217) and contained greater microbial diversity. Due to their larger sample size, post-chill samples were analyzed more deeply. Analysis of variance (ANOVA) identified a significant effect of chilling on the number of genera (p<0.001), but analysis of similarities (ANOSIM) failed to provide evidence for microbial dissimilarity between pre- and post-chill samples (p=0.001, R=0.443). Various ML models were trained using post-chill samples to predict if a sample contained Salmonella based on the samples’ microbiota pre-enrichment. The optimal model was a Random Forest-based model with a performance as follows: accuracy (88%), sensitivity (85%), specificity (90%). While the algorithms described in this paper are prototypes, these risk-based algorithms demonstrate the potential and need for further studies to provide insight alongside diagnostic tests. Combining risk-based information with diagnostic tools can help poultry processors make informed decisions to help identify and prevent the spread of Salmonella. These data add to the growing body of literature exploring novel ways to utilize microbiome data for predictive food safety.
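The positive rates and the three performance figures quoted in this abstract are simple ratios. The sketch below shows how they are computed. The counts 71/82 and 30/217 come from the abstract; the confusion-matrix counts (17, 4, 36, 3) are hypothetical, chosen only for illustration and not taken from the study, and the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),      # share of true positives that were detected
        "specificity": tn / (tn + fp),      # share of true negatives that were detected
    }

# Positive rates reported in the abstract.
pre_chill_rate = 71 / 82     # about 0.866
post_chill_rate = 30 / 217   # about 0.138

# Hypothetical held-out test set (NOT from the paper): 20 positives, 40 negatives.
example = classification_metrics(tp=17, fp=4, tn=36, fn=3)
print(round(pre_chill_rate, 3), round(post_chill_rate, 3), example)
```

Because accuracy mixes both classes, it depends on how many positives and negatives the test set contains, which is one reason sensitivity and specificity are reported alongside it.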


Comparison of 16S and whole genome dog microbiomes using machine learning

BioData Mining ◽

10.1186/s13040-021-00270-x ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Scott Lewis ◽

Andrea Nash ◽

Qinghong Li ◽

Tae-Hyuk Ahn

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Shotgun Sequencing ◽

Machine Learning Algorithms ◽

Whole Genome ◽

Sequencing Technology ◽

16S Sequencing ◽

Sequencing Technologies ◽

Whole Genome Shotgun Sequencing ◽

Study Designs

Abstract Background Recent advances in sequencing technologies have driven studies identifying the microbiome as a key regulator of overall health and disease in the host. Both 16S amplicon and whole genome shotgun sequencing technologies are currently being used to investigate this relationship, however, the choice of sequencing technology often depends on the nature and experimental design of the study. In principle, the outputs rendered by analysis pipelines are heavily influenced by the data used as input; it is then important to consider that the genomic features produced by different sequencing technologies may emphasize different results. Results In this work, we use public 16S amplicon and whole genome shotgun sequencing (WGS) data from the same dogs to investigate the relationship between sequencing technology and the captured gut metagenomic landscape in dogs. In our analyses, we compare the taxonomic resolution at the species and phyla levels and benchmark 12 classification algorithms in their ability to accurately identify host phenotype using only taxonomic relative abundance information from 16S and WGS datasets with identical study designs. Our best performing model, a random forest trained by the WGS dataset, identified a species (Bacteroides coprocola) that predominantly contributes to the abundance of leuB, a gene involved in branched chain amino acid biosynthesis; a risk factor for glucose intolerance, insulin resistance, and type 2 diabetes. This trend was not conserved when we trained the model using 16S sequencing profiles from the same dogs. Conclusions Our results indicate that WGS sequencing of dog microbiomes detects a greater taxonomic diversity than 16S sequencing of the same dogs at the species level and with respect to four gut-enriched phyla levels. This difference in detection does not significantly impact the performance metrics of machine learning algorithms after down-sampling. 
Although the important features extracted from our best performing model are not conserved between the two technologies, the important features extracted from either instance indicate the utility of machine learning algorithms in identifying biologically meaningful relationships between the host and microbiome community members. In conclusion, this work provides the first systematic machine learning comparison of dog 16S and WGS microbiomes derived from identical study designs.


LightCUD: a program for diagnosing IBD based on human gut microbiome data

BioData Mining ◽

10.1186/s13040-021-00241-2 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Congmin Xu ◽

Man Zhou ◽

Zhongjie Xie ◽

Mo Li ◽

Xi Zhu ◽

...

Keyword(s):

Machine Learning ◽

Gut Microbiome ◽

High Performance ◽

Machine Learning Algorithms ◽

Healthy Controls ◽

Sequencing Data ◽

Human Gut ◽

Data Set ◽

16S Sequencing ◽

Human Gut Microbiome

Abstract Background The diagnosis of inflammatory bowel disease (IBD) and discrimination between the types of IBD are clinically important. IBD is associated with marked changes in the intestinal microbiota. Advances in next-generation sequencing (NGS) technology and the improved hospital bioinformatics analysis ability motivated us to develop a diagnostic method based on the gut microbiome. Results Using a set of whole-genome sequencing (WGS) data from 349 human gut microbiota samples with two types of IBD and healthy controls, we assembled and aligned WGS short reads to obtain feature profiles of strains and genera. The genus and strain profiles were used for the 16S-based and WGS-based diagnostic modules construction respectively. We designed a novel feature selection procedure to select those case-specific features. With these features, we built discrimination models using different machine learning algorithms. The machine learning algorithm LightGBM outperformed other algorithms in this study and thus was chosen as the core algorithm. Specially, we identified two small sets of biomarkers (strains) separately for the WGS-based health vs IBD module and ulcerative colitis vs Crohn’s disease module, which contributed to the optimization of model performance during pre-training. We released LightCUD as an IBD diagnostic program built with LightGBM. The high performance has been validated through five-fold cross-validation and using an independent test data set. LightCUD was implemented in Python and packaged free for installation with customized databases. With WGS data or 16S rRNA sequencing data of gut microbiome samples as the input, LightCUD can discriminate IBD from healthy controls with high accuracy and further identify the specific type of IBD. The executable program LightCUD was released in open source with instructions at the webpage http://cqb.pku.edu.cn/ZhuLab/LightCUD/. 
The identified strain biomarkers could be used to study the critical factors for disease development and recommend treatments regarding changes in the gut microbial community. Conclusions As the first released human gut microbiome-based IBD diagnostic tool, LightCUD demonstrates high performance for both WGS and 16S sequencing data. The strains that either distinguish healthy controls from IBD patients or distinguish the specific type of IBD are expected to be clinically important as biomarkers.


Characterizing and Evaluating the Zoonotic Potential of Novel Viruses Discovered in Vampire Bats

Viruses ◽

10.3390/v13020252 ◽

2021 ◽

Vol 13 (2) ◽

pp. 252

Author(s):

Laura M. Bergner ◽

Nardus Mollentze ◽

Richard J. Orton ◽

Carlos Tello ◽

Alice Broos ◽

...

Keyword(s):

Machine Learning ◽

Phylogenetic Analyses ◽

Human Infection ◽

Machine Learning Algorithms ◽

Zoonotic Potential ◽

Metagenomic Sequencing ◽

Learning Models ◽

Sequencing Data ◽

Vampire Bats ◽

Machine Learning Models

The contemporary surge in metagenomic sequencing has transformed knowledge of viral diversity in wildlife. However, evaluating which newly discovered viruses pose sufficient risk of infecting humans to merit detailed laboratory characterization and surveillance remains largely speculative. Machine learning algorithms have been developed to address this imbalance by ranking the relative likelihood of human infection based on viral genome sequences, but are not yet routinely applied to viruses at the time of their discovery. Here, we characterized viral genomes detected through metagenomic sequencing of feces and saliva from common vampire bats (Desmodus rotundus) and used these data as a case study in evaluating zoonotic potential using molecular sequencing data. Of 58 detected viral families, including 17 which infect mammals, the only known zoonosis detected was rabies virus; however, additional genomes were detected from the families Hepeviridae, Coronaviridae, Reoviridae, Astroviridae and Picornaviridae, all of which contain human-infecting species. In phylogenetic analyses, novel vampire bat viruses most frequently grouped with other bat viruses that are not currently known to infect humans. In agreement, machine learning models built from only phylogenetic information ranked all novel viruses similarly, yielding little insight into zoonotic potential. In contrast, genome composition-based machine learning models estimated different levels of zoonotic potential, even for closely related viruses, categorizing one out of four detected hepeviruses and two out of three picornaviruses as having high priority for further research. We highlight the value of evaluating zoonotic potential beyond ad hoc consideration of phylogeny and provide surveillance recommendations for novel viruses in a wildlife host which has frequent contact with humans and domestic animals.


Predicting Growth and Carcass Traits in Swine Using Microbiome Data and Machine Learning Algorithms

Scientific Reports ◽

10.1038/s41598-019-43031-x ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 8

Author(s):

Christian Maltecca ◽

Duc Lu ◽

Constantino Schillebeeckx ◽

Nathan P. McNulty ◽

Clint Schwab ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Carcass Traits ◽

Machine Learning Algorithms ◽

Microbiome Data


Exploration of predictive and prognostic alternative splicing signatures in lung adenocarcinoma using machine learning methods

Journal of Translational Medicine ◽

10.1186/s12967-020-02635-y ◽

2020 ◽

Vol 18 (1) ◽

Author(s):

Qidong Cai ◽

Boxue He ◽

Pengfei Zhang ◽

Zhenyu Zhao ◽

Xiong Peng ◽

...

Keyword(s):

Machine Learning ◽

Alternative Splicing ◽

Lung Adenocarcinoma ◽

Prognostic Model ◽

Cox Regression ◽

Machine Learning Algorithms ◽

The Cancer Genome Atlas ◽

Sequencing Data ◽

Learning Methods ◽

Machine Learning Methods

Abstract Background Alternative splicing (AS) plays critical roles in generating protein diversity and complexity. Dysregulation of AS underlies the initiation and progression of tumors. Machine learning approaches have emerged as efficient tools to identify promising biomarkers. It is meaningful to explore pivotal AS events (ASEs) to deepen understanding and improve prognostic assessments of lung adenocarcinoma (LUAD) via machine learning algorithms. Method RNA sequencing data and AS data were extracted from The Cancer Genome Atlas (TCGA) database and TCGA SpliceSeq database. Using several machine learning methods, we identified 24 pairs of LUAD-related ASEs implicated in splicing switches and a random forest-based classifier for identifying lymph node metastasis (LNM) consisting of 12 ASEs. Furthermore, we identified key prognosis-related ASEs and established a 16-ASE-based prognostic model to predict overall survival for LUAD patients using a Cox regression model, random survival forest analysis, and a forward selection model. Bioinformatics analyses were also applied to identify underlying mechanisms and associated upstream splicing factors (SFs). Results Each pair of ASEs was spliced from the same parent gene, and exhibited perfect inverse intrapair correlation (correlation coefficient = −1). The 12-ASE-based classifier showed robust ability to evaluate LNM status of LUAD patients, with an area under the receiver operating characteristic (ROC) curve (AUC) of more than 0.7 in fivefold cross-validation. The prognostic model performed well at 1, 3, 5, and 10 years in both the training cohort and internal test cohort. Univariate and multivariate Cox regression indicated the prognostic model could be used as an independent prognostic factor for patients with LUAD. Further analysis revealed correlations between the prognostic model and American Joint Committee on Cancer stage, T stage, N stage, and living status. 
The splicing network constructed from survival-related SFs and ASEs depicts the regulatory relationships between them. Conclusion In summary, our study provides insight into LUAD research and management based on these AS biomarkers.
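The AUC quoted for the LNM classifier can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition follows. The function name `roc_auc` and the toy labels and scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example (hypothetical): two negatives and two positives.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(toy_auc)
```

In fivefold cross-validation this value would be computed once per held-out fold and then summarized across folds; an AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.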


OperonSEQer: A set of machine-learning algorithms with threshold voting for detection of operon pairs using short-read RNA-sequencing data

10.1101/2021.07.29.454062 ◽

2021 ◽

Author(s):

Raga Krishnakumar ◽

Anne M Ruffing

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

P Value ◽

Sequencing Data ◽

High Coverage ◽

Endogenous Gene ◽

Operon Prediction ◽

Long Read ◽

Transcriptomics Data

Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While promising results have been observed using these methods, most of them do not address uncertainty caused by technical variability between experiments, which is especially relevant when the amount of data available is small. In addition, many existing methods do not provide the flexibility to determine the stringency with which genes should be evaluated for being in an operon pair. We present OperonSEQer, a set of machine learning algorithms that uses the statistic and p-value from a non-parametric analysis of variance test (Kruskal-Wallis) to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. We implement a voting system to allow users to choose the stringency of operon calls depending on whether their priority is high coverage of operons or high accuracy of the calls. In addition, we provide the code so that users can retrain the algorithm and re-establish hyperparameters based on any data they choose, allowing for this method to be expanded on as additional data is generated and incorporated. We show that our approach detects operon pairs that are missed by current methods by comparing our predictions to publicly available long-read sequencing data. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility and adaptability.
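The voting idea described in this abstract can be sketched as a simple threshold over the yes/no calls of several models. This is an illustration of threshold voting in general, not OperonSEQer's actual code: the function name `call_operon_pair`, the six-model example, and the specific thresholds are all assumptions.

```python
def call_operon_pair(votes, threshold):
    """Call a gene pair an operon pair if at least `threshold` models vote yes.

    votes: list of booleans, one per model (hypothetical model outputs).
    threshold: minimum number of agreeing models, between 1 and len(votes).
    """
    if not 1 <= threshold <= len(votes):
        raise ValueError("threshold must be between 1 and the number of models")
    return sum(votes) >= threshold

# Hypothetical votes from six models for one pair of adjacent genes.
votes = [True, True, False, True, False, False]

lenient_call = call_operon_pair(votes, threshold=2)  # favors coverage of operons
strict_call = call_operon_pair(votes, threshold=5)   # favors accuracy of the calls
print(lenient_call, strict_call)
```

A low threshold accepts more pairs and so captures more true operons at the cost of more false calls; a high threshold does the opposite.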


Development of high affinity monobodies recognizing SARS-CoV-2 antigen

10.21203/rs.3.rs-25828/v1 ◽

2020 ◽

Author(s):

Yushen Du ◽

Tian-hao Zhang ◽

Xiangzhi Meng ◽

Yuan Shi ◽

Menglong Hu ◽

...

Keyword(s):

Machine Learning ◽

Deep Sequencing ◽

Machine Learning Algorithms ◽

Detection Sensitivity ◽

Enzyme Linked Immunosorbent Assay ◽

Mrna Display ◽

Global Public Health ◽

Sequencing Data ◽

Patient Identification ◽

Viral Antigens

Abstract The coronavirus disease 2019 (COVID-19) pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been a threat to global public health. Prompt patient identification and quarantine is the most effective way to control its rapid transmission, which can be facilitated by early detection of viral antigens. Here we present a platform to develop and optimize the fibronectin-based affinity-enhanced antibody mimetics (monobodies) for recognizing viral antigens. Specifically, we developed monobodies targeting SARS-CoV-2 nucleocapsid (N) protein. We showed that two monobodies, NN2 and NC2, bind to N protein’s N- and C-terminal domains respectively with a Kd in the nM range. The specificity of the recognition was confirmed with co-immunoprecipitation and immunofluorescence assays. Furthermore, we demonstrated that one round of in vitro maturation using mRNA display can improve the binding affinity of monobodies. Machine learning algorithms were integrated with deep sequencing data for selecting candidates that improve the detection sensitivity of N. Using this pair of monobodies, we have developed an enzyme-linked immunosorbent assay (ELISA) for viral detection. We were able to detect recombinant N at 4 pg/ml and detect N in viral culture supernatant, with no cross-reactivity with other CoV. Integrating high-density mutagenesis, mRNA display, deep sequencing and machine learning, this platform can be applied through iterations to identify and optimize monobodies against emerging viral antigens, potentiating point-of-care detection of communicable diseases in a cost- and time-sensitive manner. Authors Yushen Du, Tian-hao Zhang, Xiangzhi Meng, Yuan Shi, and Menglong Hu contributed equally to this work.


Supervised Machine Learning Algorithms For Early Diagnosis Of Alzheimer’s Disease

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c6646.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 7964-7967

Keyword(s):

Machine Learning ◽

Alzheimer's Disease ◽

Early Stage ◽

Memory Loss ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Diagnostic Tools ◽

Longitudinal Mri ◽

Permanent Brain Damage

Alzheimer’s is a neurodegenerative disease which can eventually lead to dementia. Mostly occurring in people over the age of 65, it is hard to detect and diagnose correctly. The most common symptoms include memory loss and slow deterioration of cognitive functions. Because these symptoms are common in old age, they hinder the detection of Alzheimer’s disease (AD). Alzheimer’s is currently incurable, but detection of the disease during its early stage is often beneficial to the patient, since there are treatments which can considerably improve the patient’s quality of life. However, this can only be done if the patient is diagnosed before any permanent brain damage has occurred. Most current methods for detecting and diagnosing AD are not good enough, and better, earlier diagnostic tools are urgently needed. With the improvements in the field of machine learning, we now have the tools needed to drastically improve detection of Alzheimer’s. We examine various machine learning methods and algorithms to find a method which can boost the chances of detecting the disease. We will use the following algorithms: Decision Tree, SVM, Random Forest and AdaBoost. The dataset used is the longitudinal MRI data included in the OASIS dataset. We will apply these algorithms to the dataset and compare the accuracies achieved to find the best-performing one.


OperonSEQer: A set of machine-learning algorithms with threshold voting for detection of operon pairs using short-read RNA-sequencing data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009731 ◽

2022 ◽

Vol 18 (1) ◽

pp. e1009731

Author(s):

Raga Krishnakumar ◽

Anne M. Ruffing

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

High Specificity ◽

Machine Learning Algorithms ◽

P Value ◽

Sequencing Data ◽

Endogenous Gene ◽

Operon Prediction ◽

Long Read ◽

Transcriptomics Data

Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While promising results have been observed using these methods, most of them do not address uncertainty caused by technical variability between experiments, which is especially relevant when the amount of data available is small. In addition, many existing methods do not provide the flexibility to determine the stringency with which genes should be evaluated for being in an operon pair. We present OperonSEQer, a set of machine learning algorithms that uses the statistic and p-value from a non-parametric analysis of variance test (Kruskal-Wallis) to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. We implement a voting system to allow users to choose the stringency of operon calls depending on whether your priority is high recall or high specificity. In addition, we provide the code so that users can retrain the algorithm and re-establish hyperparameters based on any data they choose, allowing for this method to be expanded as additional data is generated. We show that our approach detects operon pairs that are missed by current methods by comparing our predictions to publicly available long-read sequencing data. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility, and adaptability.
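The Kruskal-Wallis statistic and p-value mentioned in this abstract can be computed by hand for small inputs. The sketch below is a plain-Python illustration under stated assumptions, not OperonSEQer's implementation: it assumes exactly three groups of coverage values (hypothetically, gene A, the intergenic region, and gene B), omits the usual tie correction, and uses the chi-square approximation with 2 degrees of freedom, for which the p-value is exp(-H/2). All names and numbers are illustrative.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_three_groups(groups):
    """H statistic and approximate p-value for exactly three groups (no tie correction)."""
    if len(groups) != 3 or any(len(g) == 0 for g in groups):
        raise ValueError("expected three non-empty groups")
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    p = math.exp(-max(h, 0.0) / 2)  # chi-square survival function with 2 df
    return h, p

# Hypothetical coverage values for gene A, intergenic region, gene B.
h_stat, p_value = kruskal_wallis_three_groups([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(round(h_stat, 3), round(p_value, 4))
```

A small H (large p) indicates that coverage is similar across the three regions, which is consistent with the premise that genes in the same operon are expressed at similar levels.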


Machine learning meets genome assembly

Briefings in Bioinformatics ◽

10.1093/bib/bby072 ◽

2018 ◽

Vol 20 (6) ◽

pp. 2116-2129 ◽

Cited By ~ 4

Author(s):

Kleber Padovani de Souza ◽

João Carlos Setubal ◽

André Carlos Ponce de Leon F. de Carvalho ◽

Guilherme Oliveira ◽

Annie Chateau ◽

...

Keyword(s):

Machine Learning ◽

Approximate Solutions ◽

Machine Learning Algorithms ◽

Np Hard ◽

Sequencing Technologies ◽

Starting Point ◽

Hard Problems ◽

Dna Fragment ◽

Living Organisms ◽

Dna Fragment Assembly

Abstract Motivation: With the recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible for researchers. Several advances have been achieved because of it, especially in the health sciences. However, many challenges which emerge from the complexity of sequencing projects remain unsolved. Among them is the task of assembling DNA fragments from previously unsequenced organisms, which is classified as an NP-hard (nondeterministic polynomial time hard) problem, for which no efficient computational solution with reasonable execution time exists. However, several tools that produce approximate solutions have been used with results that have facilitated scientific discoveries, although there is ample room for improvement. As with other NP-hard problems, machine learning algorithms have been one of the approaches used in recent years in an attempt to find better solutions to the DNA fragment assembly problem, although still at a low scale. Results: This paper presents a broad review of pioneering literature comprising artificial intelligence-based DNA assemblers—particularly the ones that use machine learning—to provide an overview of state-of-the-art approaches and to serve as a starting point for further study in this field.
