Towards improving machine learning algorithms accuracy by benefiting from similarities between cases

2021 ◽  
Vol 40 (1) ◽  
pp. 947-972
Author(s):  
Samih M. Mostafa

Data preprocessing is a core step in data mining. It involves handling missing values, removing outliers and noise, normalizing data, etc. The problem with existing methods for handling missing values is that they operate on the whole dataset and ignore its characteristics (e.g., similarities and differences between cases). This paper focuses on handling missing values using machine learning methods that take these characteristics into account. The proposed preprocessing method clusters the data, then imputes the missing values in each cluster using only the data belonging to that cluster rather than the whole dataset. The author performed a comparative study of the proposed method and ten popular imputation methods, namely mean, median, mode, KNN, IterativeImputer, IterativeSVD, Softimpute, Mice, Forimp, and Missforest. The experiments were done on four datasets with different numbers of clusters, sizes, and shapes. In the empirical study, the proposed method performed better in terms of imputation time, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R2 score), i.e., in how closely the imputed values match the original removed values.
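The cluster-then-impute idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the author's implementation: the cluster labels are assumed to be already available (the abstract does not say how clustering is done on incomplete data), the per-cluster column mean stands in for whichever imputer is applied inside each cluster, and the function name `impute_by_cluster` and the toy data are hypothetical.

```python
from statistics import mean


def impute_by_cluster(rows, labels):
    """Replace None entries with the column mean of the row's own cluster.

    rows   : list of equal-length lists; None marks a missing value
    labels : hypothetical, pre-computed cluster label for each row
    """
    n_cols = len(rows[0])
    # Collect observed values per (cluster, column) and per column overall.
    by_cluster = {}
    overall = [[] for _ in range(n_cols)]
    for row, label in zip(rows, labels):
        cols = by_cluster.setdefault(label, [[] for _ in range(n_cols)])
        for j, value in enumerate(row):
            if value is not None:
                cols[j].append(value)
                overall[j].append(value)

    imputed = []
    for row, label in zip(rows, labels):
        new_row = []
        for j, value in enumerate(row):
            if value is None:
                # Prefer the cluster's own observations; fall back to the
                # whole column only if the cluster has none for this feature.
                pool = by_cluster[label][j] or overall[j]
                value = mean(pool)
            new_row.append(value)
        imputed.append(new_row)
    return imputed


# Toy data (made up): two well-separated groups, one gap in each.
rows = [[1.0, 10.0], [1.2, None], [9.0, 100.0], [None, 104.0]]
labels = [0, 0, 1, 1]
filled = impute_by_cluster(rows, labels)
```

In this toy example the gap in the second row is filled from its own cluster only, whereas a whole-dataset mean would pull the value towards the other, very different group.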

2019 ◽  
Vol 21 (9) ◽  
pp. 693-699 ◽  
Author(s):  
A. Alper Öztürk ◽  
A. Bilge Gündüz ◽  
Ozan Ozisik

Aims and Objectives: Solid Lipid Nanoparticles (SLNs) are pharmaceutical delivery systems that have advantages such as controlled drug release, long-term stability, etc. Particle Size (PS) is one of the important criteria of SLNs. Such criteria affect drug release rate, bio-distribution, etc. In this study, the formulation of SLNs using the high-speed homogenization technique has been evaluated. The main emphasis of the work is to study whether the effect of mixing time and formulation ingredients on PS can be modeled. For this purpose, different machine learning algorithms have been applied and evaluated using the mean absolute error metric. Materials and Methods: SLNs were prepared by high-speed homogenization. PS, size distribution and zeta potential measurements were performed on freshly prepared samples. To model the formulation of the particles in terms of mixing time and formulation ingredients, and to evaluate how predictable PS is from these parameters, different machine learning algorithms were applied to the prepared dataset and their performances were evaluated. Results: PS of the SLNs obtained was in the range of 263-498 nm. The results show that PS of SLNs can be best estimated by decision tree based methods, among which Random Forest has the lowest mean absolute error, 0.028. Overall, the estimates demonstrate that particle size can be estimated by both decision rule-based machine learning methods and function fitting machine learning methods. Conclusion: Our findings show that machine learning methods can be highly useful for determining formulation parameters for further research.


Symmetry ◽  
2020 ◽  
Vol 12 (10) ◽  
pp. 1594
Author(s):  
Samih M. Mostafa ◽  
Abdelrahman S. Eladimy ◽  
Safwat Hamad ◽  
Hirofumi Amano

In most scientific studies such as data analysis, the existence of missing data is a critical problem, and selecting the appropriate approach to deal with missing data is a challenge. In this paper, the authors perform a fair comparative study of some practical imputation methods used for handling missing values against two proposed imputation algorithms. The proposed algorithms depend on the Bayesian Ridge technique under two different feature selection conditions. The proposed algorithms differ from the existing approaches in that they accumulate the imputed features: each imputed feature is incorporated into the Bayesian Ridge equation used to predict the missing values in the next selected incomplete feature. The authors applied the proposed algorithms to eight datasets with different amounts of missing values created by different missingness mechanisms. The performance was measured in terms of imputation time, root-mean-square error (RMSE), coefficient of determination (R2), and mean absolute error (MAE). The results showed that the performance varies depending on the percentage of missing values, the size of the dataset, and the missingness mechanism. In addition, the performance of the proposed methods is slightly better.


2020 ◽  
Vol 5 (2) ◽  
pp. 183-186
Author(s):  
Ledisi Giok Kabari ◽  
Marcus B. Chigoziri ◽  
Joseph Eneotu

In this study, we discuss various machine learning algorithms and architectures suitable for the Nigerian Naira exchange rate forecast. Our analyses were focused on the exchange rates of the British Pounds, US Dollars and the Euro against the Naira. The exchange rate data was sourced from the Central Bank of Nigeria. The performances of the algorithms were evaluated using Mean Squared Error, Root Mean Squared Error, Mean Absolute Error and the coefficient of determination (R-Squared score). Finally, we compared the performances of these algorithms in forecasting the exchange rates.
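For reference, the four evaluation measures named above can be computed as in the minimal sketch below. It uses the standard textbook definitions; the function name `regression_metrics` and the sample values are illustrative only and do not come from the study.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                      # undefined if y_true is constant
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae, "r2": r2}


# Made-up values, not exchange-rate data from the study.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE, which is one reason studies often report both.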


2021 ◽  
Author(s):  
Hangsik Shin

BACKGROUND Arterial stiffness due to vascular aging is a major indicator for evaluating cardiovascular risk. OBJECTIVE In this study, we propose a method of estimating age by applying machine learning to the photoplethysmogram (PPG) for non-invasive vascular age assessment. METHODS The machine learning-based age estimation model, which consists of three convolutional layers and two fully connected layers, was developed using photoplethysmograms segmented by pulse from a total of 752 adults aged 19–87 years. The performance of the developed model was quantitatively evaluated using mean absolute error, root-mean-squared error, Pearson’s correlation coefficient, and coefficient of determination. Grad-Cam was used to explain the contribution of photoplethysmogram waveform characteristics to vascular age estimation. RESULTS Ten-fold cross-validation yielded a mean absolute error of 8.03, a root mean squared error of 9.96, a correlation coefficient of 0.62, and a coefficient of determination of 0.38. Grad-Cam, used to determine the weight that the input signal contributes to the result, confirmed that the contribution of the photoplethysmogram segment to the age estimation was high around the systolic peak. CONCLUSIONS The machine learning-based vascular aging analysis method using the PPG waveform showed comparable or superior performance to previous studies in evaluating vascular aging, without complex feature detection. CLINICALTRIAL 2015-0104


2019 ◽  
Author(s):  
Dimitri Abrahamsson ◽  
June-Soo Park ◽  
Marina Sirota ◽  
Tracey Woodruff

We developed two in silico quantification methods for chemicals analyzed with capillary electrophoresis electrospray ionization-mass spectrometry (CE-ESI-MS) using machine learning - a random forest (RF) and an artificial neural network (ANN). The algorithms can be used to predict chemical concentrations based on the chemicals’ relative response factors (RRFs) and their physicochemical properties. The RF and ANN predicted the measured concentrations with a mean absolute error of 0.2 log units and a coefficient of determination (R2) of about 0.85 for the testing set.


Energies ◽  
2021 ◽  
Vol 14 (9) ◽  
pp. 2486
Author(s):  
Vanesa Mateo-Pérez ◽  
Marina Corral-Bobadilla ◽  
Francisco Ortega-Fernández ◽  
Vicente Rodríguez-Montequín

One of the fundamental maintenance tasks of ports is periodic dredging. This is necessary to guarantee a minimum draft that will enable ships to access ports safely. Bathymetric surveys determine the need for dredging and permit an analysis of the behavior of the port bottom over time, in order to achieve adequate water depth. Satellite data processing is increasingly used to predict environmental parameters. Based on satellite data and using different machine learning techniques, this study has sought to estimate the seabed in ports, taking into account the fact that port areas are strongly anthropized. The algorithms used were Support Vector Machine (SVM), Random Forest (RF) and Multivariate Adaptive Regression Splines (MARS). The study was carried out in the ports of Candás and Luarca in the Principality of Asturias. In order to validate the results obtained, data were acquired in situ using a single-beam sonar. The results show that this type of methodology can be used to estimate coastal bathymetry. However, when deciding which system was best, priority was given to simplicity and robustness. The SVM and RF algorithms outperform MARS. RF performs better in Candás with a mean absolute error (MAE) of 0.27 cm, whereas SVM performs better in Luarca with a mean absolute error of 0.37 cm. It is suggested that this approach is suitable as a simpler and more cost-effective, coarser-resolution alternative for estimating the depth of turbid water in ports than single-beam sonar, which is labor-intensive and polluting.


Algorithms ◽  
2021 ◽  
Vol 14 (7) ◽  
pp. 201
Author(s):  
Charlyn Nayve Villavicencio ◽  
Julio Jerison Escudero Macrohon ◽  
Xavier Alphonse Inbaraj ◽  
Jyh-Horng Jeng ◽  
Jer-Guang Hsieh

Early diagnosis is crucial to prevent the development of a disease that may endanger human lives. COVID-19, a contagious disease that has mutated into several variants, has become a global pandemic that demands diagnosis as early as possible. With the use of technology, available information concerning COVID-19 increases each day, and useful information can be extracted from such massive data through data mining. In this study, the authors used several supervised machine learning algorithms to build a model that analyzes and predicts the presence of COVID-19 using the COVID-19 Symptoms and Presence dataset from Kaggle. J48 Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbors and Naïve Bayes algorithms were applied through the WEKA machine learning software. Each model’s performance was evaluated using 10-fold cross validation and compared according to major accuracy measures, correctly or incorrectly classified instances, kappa, mean absolute error, and time taken to build the model. The results show that Support Vector Machine using the Pearson VII universal kernel outperforms the other algorithms, attaining 98.81% accuracy and a mean absolute error of 0.012.
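As a reminder of what the kappa measure adds beyond plain accuracy, here is a minimal sketch of Cohen's kappa. It assumes that the kappa reported by the study is Cohen's kappa; the function name `cohen_kappa` and the labels are illustrative, not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(y_true)
    # Observed agreement: plain accuracy.
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: probability that the labels match if predictions
    # were drawn independently with the same class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up labels: 1 = COVID-19 present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa(y_true, y_pred)
```

Here six of eight predictions are correct (75% accuracy), but half of that agreement would be expected by chance with balanced classes, so kappa is noticeably lower than the accuracy.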


Sensors ◽  
2021 ◽  
Vol 21 (7) ◽  
pp. 2361
Author(s):  
Giovanni Delnevo ◽  
Giacomo Mancini ◽  
Marco Roccetti ◽  
Paola Salomoni ◽  
Elena Trombini ◽  
...  

This study investigates the relationship between affect-related psychological variables and Body Mass Index (BMI). We have utilized a novel method based on machine learning (ML) algorithms that forecast unobserved BMI values using psychological variables, such as depression, as predictors. We have employed various machine learning algorithms, including gradient boosting and random forest, with psychological variables from 221 subjects to predict both the BMI values and the BMI status (normal, overweight, and obese) of those subjects. We have found that the psychological variables in use allow one to predict both the BMI values (with a mean absolute error of 5.27–5.50) and the BMI status with an accuracy of over 80% (metric: F1-score). Further, our study has also confirmed that negative psychological variables, such as depression, are particularly effective compared to positive ones in achieving good BMI predictions.


Liver disease is a global health problem associated with various complications and high mortality. It is critically important that the disease be detected early so that lives can be saved. The stage of liver disease is an important consideration for targeted treatment. It is very difficult for medical researchers to predict the disease in its early stages because of its subtle symptoms. Often the symptoms become apparent only when it is too late. To overcome this issue, we address liver disease prediction. Liver disease can be detected with numerous classification techniques, which have been categorized by the combinations of features and classifiers they use for prediction. In this study, we applied five classifiers, namely Naïve Bayes, logistic regression, support vector machines, Random Forest, and K-Nearest Neighbour, to the analysis of liver disease. Classification performance is assessed with five different overall performance metrics, i.e., precision, kappa, Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and F-measure. The objective of this research is to predict liver disease with different machine learning algorithms and to pick the most efficient one.



