Comprehensive and empirical evaluation of machine learning algorithms for LC retention time prediction

2018 ◽  
Author(s):  
Robbin Bouwmeester ◽  
Lennart Martens ◽  
Sven Degroeve

Abstract Liquid chromatography is a core component of almost all mass spectrometric analyses of (bio)molecules. Because of the high-throughput nature of mass spectrometric analyses, the interpretation of these chromatographic data increasingly relies on informatics solutions that attempt to predict an analyte’s retention time. The key components of such predictive algorithms are the features they are supplied with, and the actual machine learning algorithm used to fit the model parameters. We therefore evaluate the performance of seven machine learning algorithms on 36 distinct metabolomics data sets, using two distinct feature sets. Interestingly, the results show that no single learning algorithm performs optimally for all data sets, with different algorithm types achieving top performance for different types of analytes or different protocols. Our results can thus be used to find an optimal retention time prediction algorithm for specific analytes or protocols. Importantly, however, our results also show that blending different types of models together decreases the error on outliers, indicating that the combination of several approaches holds substantial promise for the development of more generic, high-performing algorithms.
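The blending idea above can be sketched minimally. The per-model retention-time predictions below are invented numbers, not values from the study, and simple averaging stands in for whatever blending scheme the authors actually used:

```python
# Hypothetical sketch of blending retention-time (RT) predictors by simple
# averaging. All numbers are illustrative; they are not from the study.
def blend(predictions):
    """Average per-analyte predictions across several models."""
    return [sum(p) / len(p) for p in zip(*predictions)]

true_rt = [5.2, 7.8, 12.1]   # "true" RTs in minutes (made up)
model_a = [5.0, 8.1, 14.0]   # e.g. a tree model with one large outlier error
model_b = [5.5, 7.6, 11.5]   # e.g. a linear model
model_c = [5.1, 7.9, 12.6]   # e.g. a neural model

blended = blend([model_a, model_b, model_c])
worst_single = max(abs(t - p) for t, p in zip(true_rt, model_a))
worst_blend = max(abs(t - p) for t, p in zip(true_rt, blended))
print(worst_single, worst_blend)  # blending shrinks the worst-case error here
```

Even this crude average illustrates the abstract's point: the blend's largest per-analyte error is smaller than the single model's outlier error.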

Author(s):  
Lakshmi Prayaga ◽  
Krishna Devulapalli ◽  
Chandra Prayaga

Wearable devices are contributing heavily towards the proliferation of data and creating a rich minefield for data analytics. Recent trends in the design of wearable devices include several embedded sensors which also provide useful data for many applications. This research presents results obtained from studying human-activity-related data collected from wearable devices. The activities considered for this study were working at the computer, standing and walking, standing, walking, walking up and down the stairs, and talking while walking. The research entails the use of a portion of the data to train machine learning algorithms and build a model. The rest of the data is used as test data for predicting the activity of an individual. Details of data collection, processing, and presentation are also discussed. After studying the literature and the data sets, a Random Forest machine learning algorithm was determined to be the best applicable algorithm for analyzing data from wearable devices. The software used in this research includes the R statistical package and the SensorLog app.
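As a rough illustration of the preprocessing such sensor studies typically perform before training a classifier (the study itself used R and the SensorLog app; this Python sketch and its readings are hypothetical), a raw accelerometer stream is commonly reduced to per-window summary features:

```python
# Hypothetical sketch: reduce a raw 1-D accelerometer signal to
# (mean, standard deviation) features over fixed, non-overlapping windows.
# The readings below are made-up values, not data from the study.
import statistics

def window_features(samples, size):
    """Return (mean, stdev) for each full window of `size` samples."""
    feats = []
    for i in range(0, len(samples) - size + 1, size):
        w = samples[i:i + size]
        feats.append((statistics.fmean(w), statistics.stdev(w)))
    return feats

accel_x = [0.1, 0.2, 0.1, 0.9, 1.1, 1.0]   # illustrative axis readings
features = window_features(accel_x, 3)
print(features)
```

Feature vectors of this kind, one per window, are what a classifier such as a Random Forest would then be trained on.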


Author(s):  
Sotiris Kotsiantis ◽  
Dimitris Kanellopoulos ◽  
Panayotis Pintelas

In classification learning, the learning scheme is presented with a set of classified examples from which it is expected to learn a way of classifying unseen examples (see Table 1). Formally, the problem can be stated as follows: given training data {(x1, y1), …, (xn, yn)}, produce a classifier h: X → Y that maps an object x ∈ X to its classification label y ∈ Y. A large number of classification techniques have been developed based on artificial intelligence (logic-based techniques, perceptron-based techniques) and statistics (Bayesian networks, instance-based techniques). No single learning algorithm can uniformly outperform other algorithms over all data sets. The concept of combining classifiers is proposed as a new direction for the improvement of the performance of individual machine learning algorithms. Numerous methods have been suggested for the creation of ensembles of classifiers (Dietterich, 2000). Although, or perhaps because, many methods of ensemble creation have been proposed, there is as yet no clear picture of which method is best.
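One of the simplest classifier-combination schemes in the ensemble literature the chapter surveys is majority voting over independently built classifiers. A minimal sketch, where three placeholder threshold rules over a single numeric feature stand in for real trained models:

```python
# Minimal sketch of combining classifiers by majority vote.
# The three "classifiers" are hypothetical threshold rules, for illustration.
from collections import Counter

def clf_a(x): return "pos" if x > 0.4 else "neg"
def clf_b(x): return "pos" if x > 0.6 else "neg"
def clf_c(x): return "pos" if x > 0.5 else "neg"

def ensemble(x, classifiers):
    """Return the label predicted by the most classifiers for input x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

print(ensemble(0.55, [clf_a, clf_b, clf_c]))  # two of three vote "pos"
```

The appeal is that the ensemble can be right even when some individual members are wrong, which is the intuition behind the combination methods the text goes on to compare.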


Author(s):  
Shahadat Uddin ◽  
Arif Khan ◽  
Md Ekramul Hossain ◽  
Mohammad Ali Moni

Abstract Background Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently shown a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction. Methods In this study, extensive research efforts were made to identify those studies that applied more than one supervised machine learning algorithm to single disease prediction. Two databases (i.e., Scopus and PubMed) were searched with different types of search items. In total, we selected 48 articles for the comparison among variants of supervised machine learning algorithms for disease prediction. Results We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies), followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed comparatively superior accuracy. Of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which topped in 41% of the studies in which it was considered. Conclusion This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This information on relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.
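The "which algorithm tops the most studies" comparison the review performs can be sketched as a simple tally. The per-study accuracies below are invented for illustration; they are not the review's data:

```python
# Hypothetical sketch of tallying, across studies, how often each algorithm
# achieves the best accuracy among the algorithms that study applied.
# All accuracy values are made up for illustration.
studies = [
    {"SVM": 0.91, "RF": 0.94, "NB": 0.88},
    {"SVM": 0.89, "RF": 0.86},
    {"SVM": 0.93, "RF": 0.95, "NB": 0.90},
]

wins = {}      # times an algorithm had the top accuracy in a study
applied = {}   # times an algorithm was applied at all
for accs in studies:
    best = max(accs, key=accs.get)
    wins[best] = wins.get(best, 0) + 1
    for alg in accs:
        applied[alg] = applied.get(alg, 0) + 1

# fraction of studies (among those applying it) that each algorithm topped
top_rate = {alg: wins.get(alg, 0) / applied[alg] for alg in applied}
print(top_rate)
```

This is the shape of the statistic behind statements like "RF showed the highest accuracy in 9 of the 17 studies where it was applied, i.e., 53%."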


Author(s):  
John Yearwood ◽  
Adil Bagirov ◽  
Andrei V. Kelarev

The applications of machine learning algorithms to the analysis of data sets of DNA sequences are very important. The present chapter is devoted to the experimental investigation of applications of several machine learning algorithms for the analysis of a JLA data set consisting of DNA sequences derived from non-coding segments in the junction of the large single copy region and inverted repeat A of the chloroplast genome in Eucalyptus collected by Australian biologists. Data sets of this sort represent a new situation, where sophisticated alignment scores have to be used as a measure of similarity. The alignment scores do not satisfy properties of the Minkowski metric, and new machine learning approaches have to be investigated. The authors’ experiments show that machine learning algorithms based on local alignment scores achieve very good agreement with known biological classes for this data set. A new machine learning algorithm based on graph partitioning performed best for clustering of the JLA data set. Our novel k-committees algorithm produced most accurate results for classification. Two new examples of synthetic data sets demonstrate that the authors’ k-committees algorithm can outperform both the Nearest Neighbour and k-medoids algorithms simultaneously.


Author(s):  
Sanjay Kumar Singh ◽  
Anjali Goyal

Cervical cancer is the second most prevalent cancer in women all over the world, and the Pap smear is one of the most popular techniques used to diagnose cervical cancer at an early stage. Developing countries like India face the challenge of handling more cases day by day. In this article, various online and offline machine learning algorithms have been applied to benchmark data sets to detect cervical cancer. This article also addresses the problem of segmentation with hybrid techniques and optimizes the number of features using extra tree classifiers. Accuracy, precision score, recall score, and F1 score increase with the proportion of data used for training, reaching up to 100% for some algorithms. An algorithm like logistic regression with L1 regularization has an accuracy of 100%, but it is very costly in terms of CPU time in comparison to some of the algorithms which obtain 99% accuracy with less CPU time. The key finding in this article is the selection of the best machine learning algorithm with the highest accuracy. Cost effectiveness in terms of CPU time is also analysed.
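The four metrics the article reports are standard functions of the binary confusion matrix. A minimal sketch (the labels in the example call are illustrative, not drawn from the cervical-cancer data sets):

```python
# Minimal sketch of accuracy, precision, recall, and F1 for binary labels
# (1 = positive, 0 = negative). The example labels are made up.
def metrics(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

print(metrics([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```

Reporting all four together matters in medical screening because accuracy alone can look high while recall (the fraction of true cases detected) stays low.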


2020 ◽  
Author(s):  
Eman Alanazi ◽  
Alaa Abdou ◽  
Jake Luo

Stroke, a cerebrovascular disease, is one of the major causes of death. It also causes a health burden for both patients and healthcare systems. One important risk factor for stroke is health behavior, which is an increasing focus of prevention. In addition, chronic diseases such as hypertension, diabetes, cardiac diseases, and asthma are potential risk factors for stroke. Many machine learning models have been built using predictors such as lifestyle factors or radiology imaging. However, there are no models built using lab tests. The aim of this study is to fill this gap by building prediction models that predict stroke from lab tests. We utilized the National Health and Nutrition Examination Survey (NHANES) data sets to develop models that predict stroke from patient lab tests. We found that accurate and sensitive machine learning models can be created to predict stroke from lab tests. The results showed that prediction with the best tested algorithm, random forest, could reach the highest accuracy (ACC = 0.96) when all the attributes were used. The proposed model can be integrated with electronic health records to provide real-time prediction of stroke from lab tests. Due to the data, we could not predict whether the type of stroke was hemorrhagic or ischemic. In future studies, we aim to use data that distinguish the types of stroke and explore the data to build a prediction model for each type.


Metabolites ◽  
2020 ◽  
Vol 10 (6) ◽  
pp. 237 ◽  
Author(s):  
Bradley C. Naylor ◽  
J. Leon Catrow ◽  
J. Alan Maschek ◽  
James E. Cox

The use of retention time is often critical for the identification of compounds in metabolomic and lipidomic studies. Standards are frequently unavailable for the retention time measurement of many metabolites, thus the ability to predict retention time for these compounds is highly valuable. A number of studies have applied machine learning to predict retention times, but applying a published machine learning model to different lab conditions is difficult. This is due to variation between chromatographic equipment, methods, and columns used for analysis. Recreating a machine learning model is likewise difficult without a dedicated bioinformatician. Herein we present QSRR Automator, a software package to automate retention time prediction model creation and demonstrate its utility by testing data from multiple chromatography columns from previous publications and in-house work. Analysis of these data sets shows similar accuracy to published models, demonstrating the software’s utility in metabolomic and lipidomic studies.


2020 ◽  
pp. 1-11
Author(s):  
Jie Liu ◽  
Lin Lin ◽  
Xiufang Liang

The online English teaching system has certain requirements for the intelligent scoring system, and the most difficult stage of intelligent scoring in the English test is scoring English compositions with an intelligent model. To improve the intelligence of English composition scoring, this study builds on machine learning algorithms, combines them with intelligent image recognition technology, and proposes an improved MSER-based character candidate region extraction algorithm and a convolutional neural network-based pseudo-character region filtering algorithm. In addition, to verify whether the proposed algorithm model meets the requirements, that is, to verify the feasibility of the algorithm, the performance of the proposed model is analyzed through designed experiments. Moreover, the basic conditions for composition scoring are input into the model as constraints. The research results show that the proposed algorithm has a practical effect and can be applied to English assessment systems and online homework evaluation systems.


2021 ◽  
Author(s):  
Yingxian Liu ◽  
Cunliang Chen ◽  
Hanqing Zhao ◽  
Yu Wang ◽  
Xiaodong Han

Abstract Fluid properties are key factors in predicting single-well productivity, well test interpretation, and oilfield recovery prediction, and they directly affect the success of ODP program design. The most accurate and direct acquisition method is underground sampling. However, not every well has samples, due to technical reasons such as excessive well deviation or high cost during the exploration stage. Therefore, analogies or empirical formulas often have to be adopted to carry out research. But a large number of oilfield developments have shown that the errors caused by these methods are very large. Therefore, how to quickly and accurately obtain fluid physical properties is of great significance. In recent years, with the development and improvement of artificial intelligence and machine learning algorithms, their applications in oilfields have become more and more extensive. This paper proposes a method for predicting crude oil physical properties based on machine learning algorithms. The method uses PVT data from nearly 100 wells in the Bohai Oilfield; 75% of the data is used for training and learning to obtain the prediction model, and the remaining 25% is used for testing. Practice shows that the prediction results of the machine learning algorithm are very close to the actual data, with very small error. Finally, the method was applied to the preliminary plan design of the BZ29 oilfield, a new oilfield; in particular, fluid physical properties were predicted for unsampled sand bodies. The influence of the analogy method on the scheme is also compared, which provides potential and risk analysis for scheme design. This method will be applied to more oil fields in the Bohai Sea in the future and has important promotion value.
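The 75/25 train/test partition described above is a standard hold-out split. A minimal sketch, where the (well index, target) pairs are placeholders rather than real PVT measurements:

```python
# Hypothetical sketch of the 75/25 hold-out split described in the paper.
# The dataset is a dummy stand-in; real rows would be PVT feature/target pairs.
import random

def split(data, train_frac=0.75, seed=42):
    """Shuffle rows reproducibly and split them into train and test sets."""
    rows = list(data)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

wells = [(i, i * 2.0) for i in range(100)]   # ~100 wells, dummy pairs
train, test = split(wells)
print(len(train), len(test))  # 75 25
```

Holding out wells that the model never sees during training is what lets the reported test error stand in for performance on genuinely unsampled sand bodies.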

