Dataset Splitting Techniques Comparison For Face Classification on CCTV Images

Ade Nurhopipah; Uswatun Hasanah

doi:10.22146/ijccs.58092

IJCCS (Indonesian Journal of Computing and Cybernetics Systems) ◽

10.22146/ijccs.58092 ◽

2020 ◽

Vol 14 (4) ◽

pp. 341

Author(s):

Ade Nurhopipah ◽

Uswatun Hasanah

Keyword(s):

Splitting Method ◽

Machine Learning Algorithms ◽

Support Vector ◽

Training Set ◽

Test Set ◽

Face Classification ◽

Lower Accuracy ◽

SVM Algorithm ◽

Stable Performance ◽

Validation Set

The performance of classification models in machine learning is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is carried out on face classification of CCTV images using the Convolutional Neural Network (CNN) algorithm and the Support Vector Machine (SVM) algorithm, and it is applied to two image datasets. The results are assessed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques since it yields lower accuracy. This technique also shows higher bias and variance values and builds overfitting models, especially when it is applied on the validation set.
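
As a rough illustration of three of the splitting techniques named in this abstract, the sketch below generates train/test index lists for random sub-sampling, k-fold cross validation and bootstrap validation. The function names, parameters and seeds are assumptions made for this example and are not taken from the paper; MLMV is left out because the abstract does not describe its procedure.

```python
import random


def random_subsampling(n, test_fraction, repeats, seed=0):
    """Yield (train, test) index lists from repeated random hold-out splits."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        # The first n_test shuffled indices form the test set of this repeat.
        yield sorted(idx[n_test:]), sorted(idx[:n_test])


def k_fold(n, k, seed=0):
    """Yield k (train, test) index lists; each sample is tested exactly once."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train), sorted(folds[i])


def bootstrap(n, repeats, seed=0):
    """Yield (train, test): train is sampled with replacement,
    test is the out-of-bag remainder (samples never drawn)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        yield train, [j for j in range(n) if j not in chosen]
```

For example, `list(k_fold(10, 5))` gives five splits with eight training and two test indices each. The practical difference between the schemes is how often a sample can land in a test set: exactly once for k-fold, possibly several times or never for the other two.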

Using Machine Learning for Estimating Rice Chlorophyll Content from In Situ Hyperspectral Data

Remote Sensing ◽

10.3390/rs12183104 ◽

2020 ◽

Vol 12 (18) ◽

pp. 3104

Author(s):

Gangqiang An ◽

Minfeng Xing ◽

Binbin He ◽

Chunhua Liao ◽

Xiaodong Huang ◽

...

Keyword(s):

Machine Learning ◽

Chlorophyll Content ◽

Precision Agriculture ◽

Rate Of Change ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Training Set ◽

Validation Set

Chlorophyll is an essential pigment for photosynthesis in crops, and leaf chlorophyll content can be used as an indicator of crop growth status and can help guide nitrogen fertilizer applications. Estimating crop chlorophyll content plays an important role in precision agriculture. In this study, a variable, the rate of change in reflectance between wavelengths ‘a’ and ‘b’ (RCRWa-b), derived from in situ hyperspectral remote sensing data, was combined with four advanced machine learning techniques, Gaussian process regression (GPR), random forest regression (RFR), support vector regression (SVR), and gradient boosting regression tree (GBRT), to estimate the chlorophyll content (measured by a portable soil–plant analysis development meter) of rice. The performances of the four machine learning models were assessed and compared using root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²). The results revealed that four features of RCRWa-b, RCRW551.0–565.6, RCRW739.5–743.5, RCRW684.4–687.1 and RCRW667.9–672.0, were effective in estimating the chlorophyll content of rice, and the RFR model generated the highest prediction accuracy (training set: RMSE = 1.54, MAE = 1.23 and R² = 0.95; validation set: RMSE = 2.64, MAE = 1.99 and R² = 0.80). The GPR model was found to have the strongest generalization (training set: RMSE = 2.83, MAE = 2.16 and R² = 0.77; validation set: RMSE = 2.97, MAE = 2.30 and R² = 0.76). We conclude that RCRWa-b is a useful variable for estimating the chlorophyll content of rice, and that RFR and GPR are powerful machine learning algorithms for this task.
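
The three regression metrics used in this abstract can be computed as in the following minimal sketch. It assumes the common definition of the coefficient of determination as one minus the ratio of residual to total sum of squares; the paper may use a different convention (for example the squared correlation), and the function names are chosen here for illustration only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Comparing these values between the training set and the validation set, as the abstract does, gives a quick reading of generalization: a small gap between the two suggests the model is not overfitting.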

Correlation between the structure and skin permeability of compounds

Scientific Reports ◽

10.1038/s41598-021-89587-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Ruolan Zeng ◽

Jiyong Deng ◽

Limin Dang ◽

Xinliang Yu

Keyword(s):

Large Data ◽

QSAR Model ◽

Coefficient Of Determination ◽

Support Vector ◽

Skin Permeability ◽

Data Set ◽

Test Set ◽

SVM Algorithm ◽

SVM Model ◽

Toxicity Relationship

A three-descriptor quantitative structure–activity/toxicity relationship (QSAR/QSTR) model was developed for the skin permeability of a sufficiently large data set consisting of 274 compounds, by applying a support vector machine (SVM) together with a genetic algorithm. The optimal SVM model has a coefficient of determination R² of 0.946 and a root mean square (rms) error of 0.253 for the training set of 139 compounds, and an R² of 0.872 and rms error of 0.302 for the test set of 135 compounds. Compared with other models reported in the literature, our SVM model shows better statistical performance while dealing with more samples in the test set. Therefore, a nonlinear QSAR model for skin permeability was successfully developed by applying an SVM algorithm.

Application of Multi-Scale Fusion Attention U-Net to Segment the Thyroid Gland on CT Localization Images for Radiotherapy

10.21203/rs.3.rs-949323/v1 ◽

2021 ◽

Author(s):

Xiaobo Wen ◽

Biao Zhao ◽

Meifang Yuan ◽

Jinzhi Li ◽

Mengzhen Sun ◽

...

Keyword(s):

Thyroid Gland ◽

Clinical Work ◽

Similarity Coefficient ◽

Dice Similarity Coefficient ◽

Training Set ◽

Data Set ◽

Test Set ◽

Noise Interference ◽

Multi Scale ◽

Validation Set

Objectives: To explore the performance of the Multi-scale Fusion Attention U-net (MSFA-U-net) in thyroid gland segmentation on CT localization images for radiotherapy. Methods: CT localization images for radiotherapy of 80 patients with breast cancer or head and neck tumors were selected; label images were manually delineated by experienced radiologists. The data set was randomly divided into the training set (n=60), the validation set (n=10), and the test set (n=10). Data expansion was performed in the training set, and the performance of the MSFA-U-net model was evaluated using the Dice similarity coefficient (DSC), Jaccard similarity coefficient (JSC), positive predictive value (PPV), sensitivity (SE), and Hausdorff distance (HD). Results: With the MSFA-U-net model, the DSC, JSC, PPV, SE, and HD of the segmented thyroid gland in the test set were 0.8967±0.0935, 0.8219±0.1115, 0.9065±0.0940, 0.8979±0.1104, and 2.3922±0.5423, respectively. Compared with U-net, HR-net, and Attention U-net, MSFA-U-net increased DSC by 0.052, 0.0376, and 0.0346, respectively; increased JSC by 0.0569, 0.0805, and 0.0433, respectively; increased SE by 0.0361, 0.1091, and 0.0831, respectively; and decreased HD by 0.208, 0.1952, and 0.0548, respectively. The test set images showed that the thyroid edges segmented by the MSFA-U-net model were closer to the standard thyroid delineated by the experts than those segmented by the other three models. Moreover, the edges were smoother, resistance to noise interference was stronger, and oversegmentation and undersegmentation were reduced. Conclusion: The MSFA-U-net model can meet basic clinical requirements and improve the efficiency of physicians' clinical work.
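
The overlap metrics listed in this abstract (DSC, JSC, PPV, SE) have simple set-based definitions. The sketch below assumes each segmentation mask is represented as a set of foreground pixel coordinates; that representation and the function names are choices made for this example, not details from the paper. The Hausdorff distance is omitted.

```python
def dice(pred, truth):
    """Dice similarity coefficient: 2|A ∩ B| / (|A| + |B|)."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0  # two empty masks are treated as a perfect match
    return 2 * len(pred & truth) / (len(pred) + len(truth))


def jaccard(pred, truth):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    pred, truth = set(pred), set(truth)
    if not pred and not truth:
        return 1.0
    return len(pred & truth) / len(pred | truth)


def positive_predictive_value(pred, truth):
    """Fraction of predicted foreground pixels that are truly foreground."""
    pred, truth = set(pred), set(truth)
    return len(pred & truth) / len(pred)


def sensitivity(pred, truth):
    """Fraction of true foreground pixels that were predicted as foreground."""
    pred, truth = set(pred), set(truth)
    return len(pred & truth) / len(truth)
```

The two similarity coefficients are linked by JSC = DSC / (2 − DSC), so they always rank a pair of segmentations in the same order.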

Predictive Modeling of Surgical Site Infections Using Sparse Laboratory Data

Data Analytics in Medicine ◽

10.4018/978-1-7998-1204-3.ch022 ◽

2020 ◽

pp. 410-423

Author(s):

Prabhu RV Shankar ◽

Anupama Kesari ◽

Priya Shalini ◽

N. Kamalashree ◽

Charan Bharadwaj ◽

...

Keyword(s):

Laboratory Data ◽

Healthcare Providers ◽

Surgical Site Infections ◽

Support Vector ◽

Surgical Patient ◽

Training Set ◽

Test Set ◽

New Associations ◽

Healthcare Data ◽

Laboratory Biomarkers

As part of a data mining competition, a training set and a test set of laboratory test data about patients with and without surgical site infection (SSI) were provided. The task was to develop predictive models with the training set and identify patients with SSI in the unlabeled test set. Lab test results are vital resources that help healthcare providers make decisions about all aspects of surgical patient management. Many machine learning models were developed after pre-processing and imputing the lab test data, and only the top-performing methods are discussed. Overall, Random Forest (RF) algorithms performed better than Support Vector Machine and Logistic Regression. Using a set of 74 lab tests, RF produced only 4 false positives in the training set and predicted 35 out of 50 SSI patients in the test set (Accuracy 0.86, Sensitivity 0.68, and Specificity 0.91). Optimal ways to address healthcare data quality concerns and imputation methods, as well as newer generalizable algorithms, need to be explored further to decipher new associations and knowledge among laboratory biomarkers and SSI.
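
Accuracy, sensitivity and specificity, as reported in this abstract, all derive from the four cells of a binary confusion matrix. The following sketch shows the standard definitions; the function names and the label encoding (1 for SSI, 0 for no SSI) are assumptions for illustration, and the labels in the usage note are made up rather than taken from the competition data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp, fn):
    """Fraction of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that were correctly rejected."""
    return tn / (tn + fp)
```

With hypothetical labels `y_true = [1, 1, 1, 0, 0, 0, 0, 0]` and `y_pred = [1, 1, 0, 0, 0, 0, 0, 1]`, the counts are 2 true positives, 1 false positive, 4 true negatives and 1 false negative. When positives are rare, as with SSI, accuracy alone can look good even if sensitivity is modest, which is why all three values are usually reported together.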

Combination of NIR spectroscopy and machine learning for monitoring chili sauce adulterated with ripened papaya

E3S Web of Conferences ◽

10.1051/e3sconf/202018704001 ◽

2020 ◽

Vol 187 ◽

pp. 04001

Author(s):

Ravipat Lapcharoensuk ◽

Kitticheat Danupattanin ◽

Chaowarin Kanjanapornprapa ◽

Tawin Inkawee

Keyword(s):

Machine Learning ◽

Food Industry ◽

Partial Least Squares Regression ◽

NIR Spectroscopy ◽

Machine Learning Algorithms ◽

Support Vector ◽

Learning Approaches ◽

Least Squares Regression ◽

Validation Set ◽

Global Food

This research aimed to study the combination of NIR spectroscopy and machine learning for monitoring chilli sauce adulterated with papaya smoothie. The chilli sauce was produced by a famous community enterprise for chilli sauce processing in Thailand. The ingredients of the chilli sauce consisted of 45% chilli, 25% sugar, 20% garlic, 5% vinegar, and 5% salt. The chilli sauce sample was mixed with ripened papaya (Khaek Dam variety) smoothie at 9 levels from 10 to 90% w/w. The NIR spectra of pure chilli sauce, papaya smoothie and the 9 adulterated chilli sauce samples were recorded using an FT-NIR spectrometer in the wavenumber range of 12500 to 4000 cm⁻¹. Three machine learning algorithms were applied to develop a model for monitoring adulterated chilli sauce: partial least squares regression (PLS), support vector machine (SVM), and backpropagation neural network (BPNN). All models achieved a prediction performance in the validation set of R² = 0.99, while the RMSEP of PLS, SVM and BPNN was 1.71, 2.18 and 3.27% w/w, respectively. This finding indicates that NIR spectroscopy coupled with machine learning approaches is an alternative technique for monitoring chilli sauce adulterated with papaya smoothie in the global food industry.

A Data-Analytics Tutorial: Building Predictive Models for Oil Production in an Unconventional Shale Reservoir

SPE Journal ◽

10.2118/189969-pa ◽

2018 ◽

Vol 23 (04) ◽

pp. 1075-1089 ◽

Cited By ~ 14

Author(s):

Jared Schuetter ◽

Srikanta Mishra ◽

Ming Zhong ◽

Randy LaFollette (ret.)

Keyword(s):

Predictive Models ◽

Decision Rules ◽

Regression Tree ◽

Production Performance ◽

Classification And Regression Tree ◽

Gradient Boosting ◽

Support Vector ◽

Training Set ◽

Test Set ◽

Well Completion

Considerable amounts of data are being generated during the development and operation of unconventional reservoirs. Statistical methods that can provide data-driven insights into production performance are gaining in popularity. Unfortunately, the application of advanced statistical algorithms remains somewhat of a mystery to petroleum engineers and geoscientists. The objective of this paper is to provide some clarity to this issue, focusing on how to build robust predictive models and how to develop decision rules that help identify factors separating good wells from poor performers. The data for this study come from wells completed in the Wolfcamp Shale Formation in the Permian Basin. Data categories used in the study included well location and assorted metrics capturing various aspects of well architecture, well completion, stimulation, and production. Predictive models for the production metric of interest are built using simple regression and other advanced methods such as random forests (RFs), support-vector regression (SVR), gradient-boosting machine (GBM), and multidimensional Kriging. The data-fitting process involves splitting the data into a training set and a test set, building a regression model on the training set and validating it with the test set. Repeated application of a “cross-validation” procedure yields valuable information regarding the robustness of each regression-modeling approach. Furthermore, decision rules that can identify extreme behavior in production wells (i.e., top x% of the wells vs. bottom x%, as ranked by the production metric) are generated using the classification and regression-tree algorithm. The resulting decision tree (DT) provides useful insights regarding what variables (or combinations of variables) can drive production performance into such extreme categories.
The main contributions of this paper are to provide guidelines on how to build robust predictive models, and to demonstrate the utility of DTs for identifying factors responsible for good vs. poor wells.
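
The data-fitting loop described in this summary (split, fit on the training set, validate on the test set, repeat) can be sketched as follows. To keep the example self-contained it uses a one-variable least-squares line as a stand-in for the regression models named in the summary; that substitution, the function names and the default parameters are all assumptions made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def repeated_holdout_rmse(xs, ys, test_fraction=0.25, repeats=20, seed=0):
    """Return one test-set RMSE per random train/test split."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, round(n * test_fraction))
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Fit only on the training part, then score on the held-out part.
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores
```

The spread of the returned scores, not only their mean, is what indicates robustness: a modeling approach whose test error varies widely from split to split is sensitive to which wells happen to fall in the training set.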

Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds

Molecules ◽

10.3390/molecules24102006 ◽

2019 ◽

Vol 24 (10) ◽

pp. 2006 ◽

Cited By ~ 1

Author(s):

Liadys Mora Lagares ◽

Nikola Minovski ◽

Marjana Novič

Keyword(s):

In Silico ◽

Transmembrane Protein ◽

External Validation ◽

Assessment Process ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Active Compounds ◽

P Glycoprotein ◽

Validation Set

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It is highly associated with the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs/drug candidates and contributes to decreasing toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. Therefore, in the drug discovery and toxicological assessment process it is advisable to pay attention to whether a compound under development could be transported by P-gp or not. In this study, an in silico multiclass classification model capable of predicting the probability that a compound interacts with P-gp was developed using a counter-propagation artificial neural network (CP ANN) based on a set of 2D molecular descriptors, as well as an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model provided good classification performance, producing non-error rate (NER) values of 0.93 for the training set and 0.85 for the test set, while the average precision (AvPr) was 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance. On the external validation set the NER and AvPr values were both 0.70. We believe that this in silico classifier could be effectively used as a reliable virtual screening tool for identifying potential P-gp ligands.
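
For a multiclass model such as this one, the non-error rate is commonly defined as the mean of the per-class sensitivities, and the average precision as the mean of the per-class precisions. The sketch below follows those definitions as an assumption, since the abstract does not spell them out; the function names and class labels are illustrative.

```python
def per_class_scores(y_true, y_pred):
    """Return ({class: sensitivity}, {class: precision}) over classes in y_true."""
    recall, precision = {}, {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        actual = sum(1 for t in y_true if t == c)
        predicted = sum(1 for p in y_pred if p == c)
        recall[c] = tp / actual
        # A class that is never predicted gets precision 0 by convention here.
        precision[c] = tp / predicted if predicted else 0.0
    return recall, precision


def non_error_rate(y_true, y_pred):
    """Mean of per-class sensitivities (assumed definition of NER)."""
    recall, _ = per_class_scores(y_true, y_pred)
    return sum(recall.values()) / len(recall)


def average_class_precision(y_true, y_pred):
    """Mean of per-class precisions (assumed definition of AvPr)."""
    _, precision = per_class_scores(y_true, y_pred)
    return sum(precision.values()) / len(precision)
```

Because both quantities average over classes rather than over samples, they are not dominated by the largest class, which matters for a dataset with unequal class sizes such as 1178 inhibitors, 477 substrates and 857 non-active compounds.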

Comparison of Support Vector Machine, Bayesian Logistic Regression, and Alternating Decision Tree Algorithms for Shallow Landslide Susceptibility Mapping along a Mountainous Road in the West of Iran

Applied Sciences ◽

10.3390/app10155047 ◽

2020 ◽

Vol 10 (15) ◽

pp. 5047 ◽

Cited By ~ 7

Author(s):

Viet-Ha Nhu ◽

Danesh Zandi ◽

Himan Shahabi ◽

Kamran Chapi ◽

Ataollah Shirzadi ◽

...

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Logistic Regression ◽

Decision Tree ◽

Shallow Landslide ◽

Machine Learning Algorithms ◽

Support Vector ◽

SVM Algorithm ◽

Alternating Decision Tree ◽

Bayesian Logistic Regression

This paper aims to apply and compare the performance of three machine learning algorithms–support vector machine (SVM), Bayesian logistic regression (BLR), and alternating decision tree (ADTree)–to map landslide susceptibility along the mountainous road of the Salavat Abad saddle, Kurdistan province, Iran. We identified 66 shallow landslide locations, based on field surveys, by recording the locations of the landslides with a global positioning system (GPS), Google Earth imagery and black-and-white aerial photographs (scale 1:20,000), along with 19 landslide conditioning factors, and then tested these factors using the information gain ratio (IGR) technique. We checked the validity of the models using statistical metrics, including sensitivity, specificity, accuracy, kappa, root mean square error (RMSE), and area under the receiver operating characteristic curve (AUC). We found that, although all three machine learning algorithms yielded excellent performance, the SVM algorithm (AUC = 0.984) slightly outperformed the BLR (AUC = 0.980) and ADTree (AUC = 0.977) algorithms. We observed not only that all three algorithms are useful and effective tools for identifying shallow landslide-prone areas, but also that the BLR algorithm can be used, like the SVM algorithm, as a soft computing benchmark algorithm to check the performance of models in future.
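
The AUC values compared in this abstract can be read as the probability that a randomly chosen positive case (a landslide location) receives a higher model score than a randomly chosen negative case. The sketch below computes AUC directly from that pairwise definition; the function name and the 1/0 label encoding are assumptions for this example, and the quadratic loop is meant for clarity rather than speed.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

With hypothetical inputs `labels = [1, 1, 0, 0]` and `scores = [0.9, 0.4, 0.5, 0.1]`, three of the four positive-negative pairs are ranked correctly, so the result is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.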

An Effective K-Means Clustering Based SVM Algorithm

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.333-335.1344 ◽

2013 ◽

Vol 333-335 ◽

pp. 1344-1348

Author(s):

Yu Kai Yao ◽

Yang Liu ◽

Zhao Li ◽

Xiao Yun Chen

Keyword(s):

Support Vector ◽

SVM Classifier ◽

Small Subset ◽

Training Set ◽

Data Mining Algorithms ◽

SVM Algorithm ◽

SVM Model ◽

Separating Hyperplane ◽

Regression Problems ◽

Mining Algorithms

Support Vector Machine (SVM) is one of the most popular and effective data mining algorithms. It can be used to solve classification or regression problems and has attracted much attention in recent years. SVM finds the optimal separating hyperplane between classes, which gives it outstanding generalization ability. Usually all the labeled records are used as the training set. However, the optimal separating hyperplane depends only on a few crucial samples (Support Vectors, SVs), so the SVM model need not be trained on the whole training set. In this paper a novel SVM model based on K-means clustering is presented, in which only a small subset of the original training set is selected to constitute the final training set, and the SVM classifier is built by training on these selected samples. This greatly decreases the scale of the training set and effectively reduces the training and prediction cost of SVM, while guaranteeing its generalization performance.
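
One way to picture clustering-based training set reduction is sketched below. The abstract does not state how the subset is selected, so the rule used here (run K-means within each class and keep the sample nearest to each centroid) is purely an assumption for illustration and should not be read as the authors' method. The naive initialization from the first k points is likewise a simplification, and the SVM training step itself is not shown.

```python
def _dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist2(p, centroids[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def reduce_training_set(samples, labels, k_per_class):
    """Keep, per class, the sample closest to each K-means centroid (hypothetical rule)."""
    reduced = []
    for label in sorted(set(labels)):
        pts = [s for s, l in zip(samples, labels) if l == label]
        for centroid in kmeans(pts, min(k_per_class, len(pts))):
            rep = min(pts, key=lambda p: _dist2(p, centroid))
            if (rep, label) not in reduced:
                reduced.append((rep, label))
    return reduced
```

The reduced list of (sample, label) pairs would then be passed to an SVM trainer in place of the full training set. Whether generalization is preserved depends on the selection rule and on the number of clusters, which is the question the paper addresses.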

Classification of Automobile Lubricant by Near-Infrared Spectroscopy Combined with Machine Classification

Key Engineering Materials ◽

10.4028/www.scientific.net/kem.460-461.667 ◽

2011 ◽

Vol 460-461 ◽

pp. 667-672

Author(s):

Yun Zhao ◽

Xing Xu ◽

Yong He

Keyword(s):

Near Infrared ◽

Prediction Models ◽

NIR Spectroscopy ◽

Partial Least Square ◽

Least Square ◽

Support Vector ◽

Training Set ◽

Test Set ◽

SVM Model

The main objective of this paper is to classify four kinds of automobile lubricant by near-infrared (NIR) spectral technology and to observe whether NIR spectroscopy could be used for predicting water content. Principal component analysis (PCA) was applied to reduce the information from the spectral data, and the first two PCs were used to cluster the samples. Partial least square (PLS), least square support vector machine (LS-SVM), and Gaussian processes classification (GPC) were employed to develop prediction models. There were 120 samples for the training set and test set. Two LS-SVM models, with the first five PCs and the first six PCs respectively, were built, and the accuracy of the model with five PCs was adequate with less computation. The results from the experiment indicate that the LS-SVM model outperforms the PLS model and that the GPC model outperforms the LS-SVM model.