Tracking Major Sources of Water Contamination Using Machine Learning

Machine Learning Algorithms for Biological Targets: Investigating the Error Tolerance in Various Computational Methods

10.31219/osf.io/zkumv ◽

2019 ◽

Author(s):

Thomas M. Kaiser ◽

Pieter B. Burger

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Probabilistic Neural Network ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Learning Models ◽

Bayes Network ◽

Insight Into ◽

Machine Learning Models

Machine learning continues to make strident advances in the prediction of desired properties concerning drug development. Problematically, the efficacy of machine learning in these arenas is reliant upon highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may yield insight into whether non-bench experimental sources of data may be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for a Naïve Bayes Network algorithm averaging 39% error in the training set required to lose predictivity on the test set. Additionally, a Random Forest tolerated a significant degree of categorical error introduced into the training set with an average error of 29% required to lose predictivity. However, we found the Probabilistic Neural Network algorithm did not tolerate as much categorical error requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.

Download Full-text

Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets

Molecules ◽

10.3390/molecules24112115 ◽

2019 ◽

Vol 24 (11) ◽

pp. 2115 ◽

Cited By ~ 2

Author(s):

Thomas M. Kaiser ◽

Pieter B. Burger

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Probabilistic Neural Network ◽

Naïve Bayes ◽

Machine Learning Algorithms ◽

Learning Models ◽

Bayes Network ◽

Insight Into ◽

Machine Learning Models

Machine learning continues to make strident advances in the prediction of desired properties concerning drug development. Problematically, the efficacy of machine learning in these arenas is reliant upon highly accurate and abundant data. These two limitations, high accuracy and abundance, are often taken together; however, insight into the dataset accuracy limitation of contemporary machine learning algorithms may yield insight into whether non-bench experimental sources of data may be used to generate useful machine learning models where there is a paucity of experimental data. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored the ability of a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) to generate machine learning models with useful retrospective capabilities. The categorical error tolerance was quite high for a Naïve Bayes Network algorithm averaging 39% error in the training set required to lose predictivity on the test set. Additionally, a Random Forest tolerated a significant degree of categorical error introduced into the training set with an average error of 29% required to lose predictivity. However, we found the Probabilistic Neural Network algorithm did not tolerate as much categorical error requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.

Download Full-text

Stochastic Analysis of LES Atmospheric Turbulence Solutions With Generative Machine Learning Models

Volume 1: Fluid Applications and Systems; Fluid Measurement and Instrumentation ◽

10.1115/fedsm2020-20127 ◽

2020 ◽

Author(s):

Arturo Rodriguez ◽

Carlos R. Cuellar ◽

Luis F. Rodriguez ◽

Armando Garcia ◽

V. S. Rao Gudimetla ◽

...

Keyword(s):

Machine Learning ◽

Atmospheric Turbulence ◽

Text Categorization ◽

Naive Bayes ◽

Naïve Bayes ◽

Machine Learning Techniques ◽

Learning Models ◽

Long Distance ◽

Turbulence Effects ◽

Machine Learning Models

Abstract The Large Eddy Simulations (LES) modeling of turbulence effects is computationally expensive even when not all scales are resolved, especially in the presence of deep turbulence effects in the atmosphere. Machine learning techniques provide a novel way to propagate the effects from inner- to outer-scale in atmospheric turbulence spectrum and to accelerate its characterization on long-distance laser propagation. We simulated the turbulent flow of atmospheric air in an idealized box with a temperature difference between the lower and upper surfaces of about 27 degrees Celsius with the LES method. The volume was voxelized, and several quantities, such as the velocity, temperature, and the pressure were obtained at regularly spaced grid points. These values were binned and converted into symbols that were concatenated along the length of the box to create a ‘text’ that was used to train a long short-term memory (LSTM) neural network and propose a way to use a naive Bayes model. LSTMs are used in speech recognition, and handwriting recognition tasks and naïve Bayes is used extensively in text categorization. The trained LSTM and the naïve Bayes models were used to generate instances of turbulent-like flows. Errors are quantified, and portrait as a difference that enables our studies to track error quantities passed through stochastic generative machine learning models — considering that our LES studies have a high state of the art high-fidelity approximation solutions of the Navier-Stokes. In the present work, LES solutions are imitated and compare against generative machine learning models.

Download Full-text

Using different machine learning approaches to evaluate performance on spare parts request for aircraft engines

E3S Web of Conferences ◽

10.1051/e3sconf/202019711014 ◽

2020 ◽

Vol 197 ◽

pp. 11014

Author(s):

Antonio Capodieci ◽

Antonio Caricato ◽

Antonio Paolo Carlucci ◽

Antonio Ficarella ◽

Luca Mainetti ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Naive Bayes ◽

Confusion Matrix ◽

Spare Part ◽

Naïve Bayes ◽

Spare Parts ◽

Working Time ◽

Learning Models ◽

Machine Learning Models

The Aircraft uptime is getting increasingly important as the transport solutions become more complex and the transport industry seeks new ways of being competitive. To reach this objective, traditional Fleet Management systems are gradually extended with new features to improve reliability and then provide better maintenance planning. Main goal of this work is the development of iterative algorithms based on Artificial Intelligence to define the engine removal plan and its maintenance work, optimizing engine availability at the customer and maintenance costs, as well as obtaining a procurement plan of integrated parts with planning of interventions and implementation of a maintenance strategy. In order to reach this goal, Machine Learning has been applied on a workshop dataset with the aim to optimize warehouse spare parts number, costs and lead-time. This dataset consists of the repair history of a specific engine type, from several years and several fleets, and contains information like repair claim, engine working time, forensic evidences and general information about processed spare parts. Using these data as input, several Machine Learning models have been built in order to predict the repair state of each spare part for a better warehouse handling. A multi-label classification approach has been used in order to build and train, for each spare part, a Machine Learning model that predicts the part repair state as a multiclass classifier does. Mainly, each classifier is requested to predict the repair state (classified as “Efficient”, “Repaired” or “Replaced”) of the corresponding part, starting from two variables: the repairing claim and the engine working time. Then, global results have been evaluated using the Confusion Matrix, from which Accuracy, Precision, Recall and F1-Score metrics are retrieved, in order to analyse the cost of incorrect prediction. These metrics are calculated for each spare part related model on test sets and, then, a final single performance value is obtained by averaging results. In this way, three Machine Learning models (Naïve Bayes, Logistic Regression and Random Forest classifiers) are applied and results are compared. Naïve Bayes and Logistic Regression, that are fully probabilistic methods, have best global performances with an accuracy value of almost 80%, making the models being correct most of the times.

Download Full-text

Document Preprocessing with TF-IDF to Improve the Polarity Classification Performance of Unstructured Sentiment Analysis

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v5i3.1066 ◽

2020 ◽

pp. 235-242

Author(s):

Farrikh Alzami ◽

Erika Devi Udayanti ◽

Dwi Puji Prabowo ◽

Rama Aria Megantara

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Random Forest ◽

Sentiment Analysis ◽

Classification Performance ◽

Document Preparation ◽

Learning Models ◽

Polarity Classification ◽

Negative Sentiment ◽

Machine Learning Models

Sentiment analysis in terms of polarity classification is very important in everyday life, with the existence of polarity, many people can find out whether the respected document has positive or negative sentiment so that it can help in choosing and making decisions. Sentiment analysis usually done manually. Therefore, an automatic sentiment analysis classification process is needed. However, it is rare to find studies that discuss extraction features and which learning models are suitable for unstructured sentiment analysis types with the Amazon food review case. This research explores some extraction features such as Word Bags, TF-IDF, Word2Vector, as well as a combination of TF-IDF and Word2Vector with several machine learning models such as Random Forest, SVM, KNN and Naïve Bayes to find out a combination of feature extraction and learning models that can help add variety to the analysis of polarity sentiments. By assisting with document preparation such as html tags and punctuation and special characters, using snowball stemming, TF-IDF results obtained with SVM are suitable for obtaining a polarity classification in unstructured sentiment analysis for the case of Amazon food review with a performance result of 87,3 percent.

Download Full-text

Random forest and long short-term memory based machine learning models for classification of ion mobility spectrometry spectra

Chemical, Biological, Radiological, Nuclear, and Explosives (CBRNE) Sensing XXII ◽

10.1117/12.2585829 ◽

2021 ◽

Author(s):

Patrick C. Riley ◽

Samir V. Deshpande ◽

Brian S. Ince ◽

Brian C. Hauck ◽

Kyle P. O'Donnell ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Ion Mobility ◽

Short Term Memory ◽

Learning Models ◽

Short Term ◽

Term Memory ◽

Long Short Term Memory ◽

Machine Learning Models

Download Full-text

Comparison of Resampling Algorithms to Address Class Imbalance when Developing Machine Learning Models to Predict Foodborne Pathogen Presence in Agricultural Water

Frontiers in Environmental Science ◽

10.3389/fenvs.2021.701288 ◽

2021 ◽

Vol 9 ◽

Author(s):

Daniel Lowell Weller ◽

Tanzy M. T. Love ◽

Martin Wiedmann

Keyword(s):

Machine Learning ◽

Random Forest ◽

Predictive Models ◽

Training Data ◽

Agricultural Water ◽

Learning Models ◽

Safety Hazards ◽

E Coli ◽

Resampling Method ◽

Machine Learning Models

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli-testing for assessing the potential presence of food safety hazards in water used for produce production. However, these studies used balanced training data and focused on enteric pathogens. As such, research is needed to determine 1) if predictive models can be used to assess Listeria contamination of agricultural water, and 2) how resampling (to deal with imbalanced data) affects performance of these models. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling method (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings 1) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing agricultural water for the presence of potential food safety hazards, and 2) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future with the ultimate aim of developing models that can be integrated into on-farm risk management programs. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water when the training data is imbalanced.

Download Full-text

Penerapan Klasifikasi Kueri untuk Meningkatkan Efektivitas Mesin Pencari

Seminar Nasional Official Statistics ◽

10.34123/semnasoffstat.v2021i1.914 ◽

2021 ◽

Vol 2021 (1) ◽

pp. 1012-1018

Author(s):

Handy Geraldy ◽

Lutfi Rahmatuti Maghfiroh

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Naïve Bayes ◽

Gradient Boosting

Dalam menjalankan peran sebagai penyedia data, Badan Pusat Statistik (BPS) memberikan layanan akses data BPS bagi masyarakat. Salah satu layanan tersebut adalah fitur pencarian di website BPS. Namun, layanan pencarian yang diberikan belum memenuhi harapan konsumen. Untuk memenuhi harapan konsumen, salah satu upaya yang dapat dilakukan adalah meningkatkan efektivitas pencarian agar lebih relevan dengan maksud pengguna. Oleh karena itu, penelitian ini bertujuan untuk membangun fungsi klasifikasi kueri pada mesin pencari dan menguji apakah fungsi tersebut dapat meningkatkan efektivitas pencarian. Fungsi klasifikasi kueri dibangun menggunakan model machine learning. Kami membandingkan lima algoritma yaitu SVM, Random Forest, Gradient Boosting, KNN, dan Naive Bayes. Dari lima algoritma tersebut, model terbaik diperoleh pada algoritma SVM. Kemudian, fungsi tersebut diimplementasikan pada mesin pencari yang diukur efektivitasnya berdasarkan nilai precision dan recall. Hasilnya, fungsi klasifikasi kueri dapat mempersempit hasil pencarian pada kueri tertentu, sehingga meningkatkan nilai precision. Namun, fungsi klasifikasi kueri tidak memengaruhi nilai recall.

Download Full-text

Preliminary Screening of COVID-19 Infection Employing Machine Learning Techniques From Simple Blood Profile

International Journal of Quantitative Structure-Property Relationships ◽

10.4018/ijqspr.2021070103 ◽

2021 ◽

Vol 6 (3) ◽

pp. 35-47

Author(s):

Anirudh Reddy Cingireddy ◽

Robin Ghosh ◽

Supratik Kar ◽

Venkata Melapu ◽

Sravanthi Joginipeli ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Naive Bayes ◽

Albert Einstein ◽

Naïve Bayes ◽

Machine Learning Techniques ◽

Support Vector ◽

Blood Profile ◽

Molecular Tests ◽

Large Populations

Frequent testing of the entire population would help to identify individuals with active COVID-19 and allow us to identify concealed carriers. Molecular tests, antigen tests, and antibody tests are being widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test will take a minimum of 3 hours to a maximum of 4 days for the results. The authors suggest using machine learning and data mining tools to filter large populations at a preliminary level to overcome this issue. The ML tools could reduce the testing population size by 20 to 30%. In this study, they have used a subset of features from full blood profile which are drawn from patients at Israelita Albert Einstein hospital located in Brazil. They used classification models, namely KNN, logistic regression, XGBooting, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron with k-fold cross-validation, to validate the models. Naïve bayes, KNN, and random forest stand out as the most predictive ones with 88% accuracy each.

Download Full-text

Application of Random Forest Machine Learning Models to Forecast Combustion Profile Parameters of a Natural Gas Spark Ignition Engine

10.1115/1.0004390v ◽

2021 ◽

Author(s):

Jinlong Liu ◽

Christopher Ulishney ◽

Cosmin Dumitrescu

Keyword(s):

Machine Learning ◽

Random Forest ◽

Natural Gas ◽

Spark Ignition ◽

Spark Ignition Engine ◽

Learning Models ◽

Machine Learning Models

Download Full-text