Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed 'test' dataset and a potential solution

Author(s):  
Alexej Gossmann ◽  
Aria Pezeshk ◽  
Berkman Sahiner


2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Su Bin Lim ◽  
Swee Jin Tan ◽  
Wan-Teck Lim ◽  
Chwee Teck Lim

Abstract There are massive transcriptome profiles in the form of microarray. The challenge is that they are processed using diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset analyses. If there exists a single, integrated data source, data-reuse can be facilitated for discovery, analysis, and validation of biomarker-based clinical strategy. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained from MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine learning optimized MMD further aids to reveal immune landscape across various carcinomas critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define genomic landscape of human cancers.


2018 ◽  
Author(s):  
Su Bin Lim ◽  
Swee Jin Tan ◽  
Wan-Teck Lim ◽  
Chwee Teck Lim

Abstract Background: There exist massive transcriptome profiles in the form of microarray, enabling reuse. The challenge is that they are processed with diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset or cross-cancer analyses. If there exists a single, integrated data source consisting of thousands of samples, similar to TCGA, data-reuse will be facilitated for discovery, analysis, and validation of biomarker-based clinical strategy. Findings: We present 11 merged microarray-acquired datasets (MMDs) of major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Highly concordant MMD-derived patterns of genome-wide differential gene expression were observed with matching TCGA cohorts. Using machine learning algorithms, we show that clinical models trained from all MMDs, except breast MMD, can be directly applied to RNA-seq-acquired TCGA data with an average accuracy of 0.96 in classifying cancer. Machine learning optimized MMD further aids to reveal immune landscape of human cancers critically needed in disease management and clinical interventions. Conclusions: To facilitate large-scale meta-analysis, we generated a newly curated, unified, large-scale MMD across 11 cancer types. Besides TCGA, this single data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define genomic landscape of human cancers.


2020 ◽  
Vol 13 (4) ◽  
pp. 1693-1707 ◽  
Author(s):  
Minxing Si ◽  
Ying Xiong ◽  
Shan Du ◽  
Ke Du

Abstract. Particle sensing technology has shown great potential for monitoring particulate matter (PM) with very few temporal and spatial restrictions because of its low cost, compact size, and easy operation. However, the performance of low-cost sensors for PM monitoring in ambient conditions has not been thoroughly evaluated. Monitoring results by low-cost sensors are often questionable. In this study, a low-cost fine particle monitor (Plantower PMS 5003) was colocated with a reference instrument, the Synchronized Hybrid Ambient Real-time Particulate (SHARP) monitor, at the Calgary Varsity air monitoring station from December 2018 to April 2019. The study evaluated the performance of this low-cost PM sensor in ambient conditions and calibrated its readings using simple linear regression (SLR), multiple linear regression (MLR), and two more powerful machine-learning algorithms using random search techniques for the best model architectures. The two machine-learning algorithms are XGBoost and a feedforward neural network (NN). Field evaluation showed that the Pearson correlation (r) between the low-cost sensor and the SHARP instrument was 0.78. The Fligner and Killeen (F–K) test indicated a statistically significant difference between the variances of the PM2.5 values by the low-cost sensor and the SHARP instrument. Large overestimations by the low-cost sensor before calibration were observed in the field and were believed to be caused by the variation of ambient relative humidity. The root mean square error (RMSE) was 9.93 when comparing the low-cost sensor with the SHARP instrument. The calibration by the feedforward NN had the smallest RMSE of 3.91 in the test dataset compared to the calibrations by SLR (4.91), MLR (4.65), and XGBoost (4.19). After calibrations, the F–K test using the test dataset showed that the variances of the PM2.5 values by the NN, XGBoost, and the reference method were not statistically significantly different. 
From this study, we conclude that a feedforward NN is a promising method to address the poor performance of low-cost sensors for PM2.5 monitoring. In addition, the random search method for hyperparameters was demonstrated to be an efficient approach for selecting the best model structure.
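
As a rough illustration of the simplest calibration named in the abstract above, the following sketch fits a simple linear regression (SLR) from low-cost sensor readings to reference readings and compares RMSE before and after calibration on held-out data. The readings are made-up numbers chosen for illustration, not data from the study, and the helper names fit_slr and rmse are hypothetical.

```python
import math

def fit_slr(x, y):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(pred, ref):
    """Root mean square error between predictions and reference values."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

# Made-up PM2.5 readings (µg/m³) for illustration only; NOT data from the study.
sensor_train = [5.0, 12.0, 20.0, 31.0, 44.0]   # low-cost sensor
ref_train = [4.0, 8.0, 12.0, 18.0, 24.0]       # reference instrument
sensor_test = [8.0, 25.0, 38.0]
ref_test = [6.0, 15.0, 21.0]

# Fit on the training split only, then apply the fit to the test split.
slope, intercept = fit_slr(sensor_train, ref_train)
calibrated = [slope * s + intercept for s in sensor_test]

rmse_raw = rmse(sensor_test, ref_test)   # sensor vs. reference, uncalibrated
rmse_cal = rmse(calibrated, ref_test)    # sensor vs. reference, after SLR
```

The calibration coefficients are estimated on the training split and evaluated on a separate test split, mirroring the train/test evaluation the abstract describes. The MLR, XGBoost, and NN calibrations would replace fit_slr with a richer model but could be scored with the same RMSE function.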


2020 ◽  
Vol 35 ◽  
pp. 153331752092716
Author(s):  
Jin-Hyuck Park

Background: The mobile screening test system for mild cognitive impairment (mSTS-MCI) was developed and validated to address the low sensitivity and specificity of the Montreal Cognitive Assessment (MoCA) widely used clinically. Objective: This study aimed to evaluate the efficacy of machine learning algorithms based on the mSTS-MCI and the Korean version of the MoCA. Method: In total, 103 healthy individuals and 74 patients with MCI were randomly divided into training and test data sets. The algorithm using TensorFlow was trained based on the training data set, and then its accuracy was calculated based on the test data set. The cost was calculated via logistic regression in this case. Result: The predictive power of the algorithms was higher than that of the original tests. In particular, the algorithm based on the mSTS-MCI showed the highest positive-predictive value. Conclusion: The machine learning algorithms predicting MCI showed findings comparable to those of the conventional screening tools.


Machines ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 19
Author(s):  
Johanna Wolf ◽  
Sebastian Carsch ◽  
Clemens Troll ◽  
Jens-Peter Majschak

Operator assistance systems can help to reduce disturbance-related machine downtime in food production and packaging processes, especially when combined with machine learning algorithms. These assistance systems analyze the available sensor signals of the process control over time to help operators identify the causes of disturbances. Training such systems requires sufficient test data, which often are hardly available. Thus, this paper presents a study to investigate how test data for teaching machine learning algorithms can be generated by numerical simulation. The potential of using virtual commissioning (VC) software for simulating disturbances of discrete processes is examined, considering the example of a friction and collision-afflicted sub-process from an intermitting wrapping machine for confectionary. In this study the software industrialPhysics (iP) is analyzed regarding accuracy of static and dynamic friction and restitution. The values are verified by setting up virtual substitute tests and comparing the results with analytically determined values. Subsequently, prerecorded disturbances are classified, and seven selected elements are simulated in VC software, recording visual effects and switching the characteristics of sensors. The verification shows that VC software is generally adequate for the assigned task. Restrictions occur regarding the computing power required of the built-in physics engine and the resulting reduction of the machine to be simulated.


2019 ◽  
Author(s):  
Minxing Si ◽  
Ying Xiong ◽  
Shan Du ◽  
Ke Du

Abstract. Particle sensing technology has shown great potential for monitoring particulate matter (PM) with very few temporal and spatial restrictions because of its low cost, compact size, and easy operation. However, the performance of low-cost sensors for PM monitoring in ambient conditions has not been thoroughly evaluated. Monitoring results by low-cost sensors are often questionable. In this study, a low-cost fine particle monitor (Plantower PMS 5003) was co-located with a reference instrument, the Synchronized Hybrid Ambient Real-time Particulate (SHARP) monitor, at the Calgary Varsity air monitoring station from December 2018 to April 2019. The study evaluated the performance of this low-cost PM sensor in ambient conditions and calibrated its readings using simple linear regression (SLR), multiple linear regression (MLR), and two more powerful machine learning algorithms using random search techniques for the best model architectures. The two machine learning algorithms are XGBoost and a feedforward neural network (NN). Field evaluation showed that the Pearson r between the low-cost sensor and the SHARP instrument was 0.78. The Fligner and Killeen (F-K) test indicated a statistically significant difference between the variances of the PM2.5 values by the low-cost sensor and by the SHARP instrument. Large overestimations by the low-cost sensor before calibration were observed in the field and were believed to be caused by the variation of ambient relative humidity. The root mean square error (RMSE) was 9.93 when comparing the low-cost sensor with the SHARP instrument. The calibration by the feedforward NN had the smallest RMSE of 3.91 in the test dataset, compared to the calibrations by SLR (4.91), MLR (4.65), and XGBoost (4.19). After calibrations, the F-K test using the test dataset showed that the variances of the PM2.5 values by the NN and the XGBoost and by the reference method were not statistically significantly different. 
From this study, we conclude that feedforward NN is a promising method to address the poor performance of the low-cost sensors for PM2.5 monitoring. In addition, the random search method for hyperparameters was demonstrated to be an efficient approach for selecting the best model structure.


2019 ◽  
Vol 143 (8) ◽  
pp. 990-998 ◽  
Author(s):  
Min Yu ◽  
Lindsay A. L. Bazydlo ◽  
David E. Bruns ◽  
James H. Harrison

Context.— Turnaround time and productivity of clinical mass spectrometric (MS) testing are hampered by time-consuming manual review of the analytical quality of MS data before release of patient results. Objective.— To determine whether a classification model created by using standard machine learning algorithms can verify analytically acceptable MS results and thereby reduce manual review requirements. Design.— We obtained retrospective data from gas chromatography–MS analyses of 11-nor-9-carboxy-delta-9-tetrahydrocannabinol (THC-COOH) in 1267 urine samples. The data for each sample had been labeled previously as either analytically unacceptable or acceptable by manual review. The dataset was randomly split into training and test sets (848 and 419 samples, respectively), maintaining equal proportions of acceptable (90%) and unacceptable (10%) results in each set. We used stratified 10-fold cross-validation in assessing the abilities of 6 supervised machine learning algorithms to distinguish unacceptable from acceptable assay results in the training dataset. The classifier with the highest recall was used to build a final model, and its performance was evaluated against the test dataset. Results.— In comparison testing of the 6 classifiers, a model based on the Support Vector Machines algorithm yielded the highest recall and acceptable precision. After optimization, this model correctly identified all unacceptable results in the test dataset (100% recall) with a precision of 81%. Conclusions.— Automated data review identified all analytically unacceptable assays in the test dataset, while reducing the manual review requirement by about 87%. This automation strategy can focus manual review only on assays likely to be problematic, allowing improved throughput and turnaround time without reducing quality.
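
The reported reduction in manual review can be roughly cross-checked from the rounded figures in the abstract above (10% unacceptable results, 100% recall, 81% precision). The sketch below assumes that only results the classifier flags as unacceptable would still be reviewed manually; that reading, and the function name review_reduction, are assumptions of this example rather than details stated in the abstract.

```python
def review_reduction(prevalence, recall, precision):
    """Fraction of all results that would no longer need manual review.

    Assumption of this sketch: only results flagged as unacceptable by the
    classifier (true positives plus false positives) are reviewed manually.
    """
    true_positives = prevalence * recall     # as a fraction of all results
    flagged = true_positives / precision     # TP + FP, as a fraction of all results
    return 1.0 - flagged

# Rounded values taken from the abstract: 10% unacceptable results,
# 100% recall, 81% precision on the test dataset.
reduction = review_reduction(prevalence=0.10, recall=1.00, precision=0.81)
```

Because the inputs are rounded percentages, the result should agree only approximately with the "about 87%" stated in the abstract. Note that the test-set size cancels out of the calculation, so the estimate depends only on prevalence, recall, and precision.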

