Test data reuse for evaluation of adaptive machine learning algorithms: over-fitting to a fixed 'test' dataset and a potential solution

Compendiums of cancer transcriptomes for machine learning applications

Scientific Data ◽

10.1038/s41597-019-0207-2 ◽

2019 ◽

Vol 6 (1) ◽

Cited By ~ 2

Author(s):

Su Bin Lim ◽

Swee Jin Tan ◽

Wan-Teck Lim ◽

Chwee Teck Lim

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Data Reuse ◽

Rna Seq ◽

Genomic Landscape ◽

Source Data ◽

Machine Learning Applications ◽

Cancer Types ◽

Data Source

Abstract: Massive numbers of transcriptome profiles exist in the form of microarray data. The challenge is that they are processed using diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset analyses. If a single, integrated data source existed, data reuse could be facilitated for the discovery, analysis, and validation of biomarker-based clinical strategies. Here, we present merged microarray-acquired datasets (MMDs) across 11 major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Using machine learning algorithms, we show that diagnostic models trained on MMDs can be directly applied to RNA-seq-acquired TCGA data with high classification accuracy. Machine-learning-optimized MMDs further help to reveal the immune landscape across various carcinomas, which is critically needed in disease management and clinical interventions. This unified data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.


Compendiums of Cancer Transcriptome for Machine Learning Applications

10.1101/353698 ◽

2018 ◽

Cited By ~ 1

Author(s):

Su Bin Lim ◽

Swee Jin Tan ◽

Wan-Teck Lim ◽

Chwee Teck Lim

Keyword(s):

Machine Learning ◽

Large Scale ◽

Meta Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Data Reuse ◽

Human Cancers ◽

Cancer Transcriptome ◽

Cancer Types ◽

Data Source

Abstract. Background: Massive numbers of transcriptome profiles exist in the form of microarray data, enabling reuse. The challenge is that they are processed with diverse platforms and preprocessing tools, requiring considerable time and informatics expertise for cross-dataset or cross-cancer analyses. If a single, integrated data source consisting of thousands of samples existed, similar to TCGA, data reuse would be facilitated for the discovery, analysis, and validation of biomarker-based clinical strategies. Findings: We present 11 merged microarray-acquired datasets (MMDs) of major cancer types, curating 8,386 patient-derived tumor and tumor-free samples from 95 GEO datasets. Highly concordant MMD-derived patterns of genome-wide differential gene expression were observed with matching TCGA cohorts. Using machine learning algorithms, we show that clinical models trained on all MMDs, except the breast MMD, can be directly applied to RNA-seq-acquired TCGA data with an average accuracy of 0.96 in classifying cancer. Machine-learning-optimized MMDs further help to reveal the immune landscape of human cancers, which is critically needed in disease management and clinical interventions. Conclusions: To facilitate large-scale meta-analysis, we generated a newly curated, unified, large-scale MMD across 11 cancer types. Besides TCGA, this single data source may serve as an excellent training or test set to apply, develop, and refine machine learning algorithms that can be tapped to better define the genomic landscape of human cancers.


Evaluation and calibration of a low-cost particle sensor in ambient conditions using machine-learning methods

Atmospheric Measurement Techniques ◽

10.5194/amt-13-1693-2020 ◽

2020 ◽

Vol 13 (4) ◽

pp. 1693-1707 ◽

Cited By ~ 8

Author(s):

Minxing Si ◽

Ying Xiong ◽

Shan Du ◽

Ke Du

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Random Search ◽

Low Cost ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Ambient Conditions ◽

Test Dataset ◽

Compact Size ◽

Significant Difference

Abstract. Particle sensing technology has shown great potential for monitoring particulate matter (PM) with very few temporal and spatial restrictions because of its low cost, compact size, and easy operation. However, the performance of low-cost sensors for PM monitoring in ambient conditions has not been thoroughly evaluated. Monitoring results by low-cost sensors are often questionable. In this study, a low-cost fine particle monitor (Plantower PMS 5003) was colocated with a reference instrument, the Synchronized Hybrid Ambient Real-time Particulate (SHARP) monitor, at the Calgary Varsity air monitoring station from December 2018 to April 2019. The study evaluated the performance of this low-cost PM sensor in ambient conditions and calibrated its readings using simple linear regression (SLR), multiple linear regression (MLR), and two more powerful machine-learning algorithms using random search techniques for the best model architectures. The two machine-learning algorithms are XGBoost and a feedforward neural network (NN). Field evaluation showed that the Pearson correlation (r) between the low-cost sensor and the SHARP instrument was 0.78. The Fligner and Killeen (F–K) test indicated a statistically significant difference between the variances of the PM2.5 values by the low-cost sensor and the SHARP instrument. Large overestimations by the low-cost sensor before calibration were observed in the field and were believed to be caused by the variation of ambient relative humidity. The root mean square error (RMSE) was 9.93 when comparing the low-cost sensor with the SHARP instrument. The calibration by the feedforward NN had the smallest RMSE of 3.91 in the test dataset compared to the calibrations by SLR (4.91), MLR (4.65), and XGBoost (4.19). After calibrations, the F–K test using the test dataset showed that the variances of the PM2.5 values by the NN, XGBoost, and the reference method were not statistically significantly different. 
From this study, we conclude that a feedforward NN is a promising method to address the poor performance of low-cost sensors for PM2.5 monitoring. In addition, the random search method for hyperparameters was demonstrated to be an efficient approach for selecting the best model structure.
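
The simplest calibration named in the abstract above, simple linear regression (SLR) scored by RMSE against the reference instrument, can be sketched as follows. This is a minimal illustration and not the authors' code: the paired readings are invented placeholder values, and the function names are hypothetical.

```python
import math


def fit_simple_linear_regression(x, y):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(predicted, reference):
    """Root mean square error between two equally long sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical paired PM2.5 readings; NOT data from the study.
sensor_train = [5.0, 12.0, 20.0, 31.0, 44.0]
reference_train = [3.2, 6.3, 10.8, 15.7, 22.6]
sensor_test = [8.0, 25.0, 40.0]
reference_test = [4.4, 13.3, 20.2]

# Fit on the training pairs only, then score on the held-out test pairs.
slope, intercept = fit_simple_linear_regression(sensor_train, reference_train)
calibrated_test = [slope * s + intercept for s in sensor_test]

rmse_raw = rmse(sensor_test, reference_test)
rmse_calibrated = rmse(calibrated_test, reference_test)
print(f"RMSE before calibration: {rmse_raw:.2f}")
print(f"RMSE after SLR calibration: {rmse_calibrated:.2f}")
```

The calibration is fitted on the training pairs and scored on pairs it has not seen, which mirrors the abstract's practice of reporting RMSE on the test dataset.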


Machine-Learning Algorithms Based on Screening Tests for Mild Cognitive Impairment

American Journal of Alzheimer's Disease & Other Dementias® ◽

10.1177/1533317520927163 ◽

2020 ◽

Vol 35 ◽

pp. 153331752092716

Author(s):

Jin-Hyuck Park

Keyword(s):

Machine Learning ◽

Cognitive Impairment ◽

Mild Cognitive Impairment ◽

Test Data ◽

Learning Algorithms ◽

Test System ◽

Machine Learning Algorithms ◽

Screening Tools ◽

Screening Tests ◽

Data Set

Background: The mobile screening test system for mild cognitive impairment (mSTS-MCI) was developed and validated to address the low sensitivity and specificity of the Montreal Cognitive Assessment (MoCA), which is widely used clinically. Objective: This study evaluated the efficacy of machine learning algorithms based on the mSTS-MCI and the Korean version of the MoCA. Method: In total, 103 healthy individuals and 74 patients with MCI were randomly divided into training and test data sets. The algorithm, implemented in TensorFlow, was trained on the training data set, and its accuracy was then calculated on the test data set. The cost was calculated via logistic regression in this case. Result: The predictive power of the algorithms was higher than that of the original tests. In particular, the algorithm based on the mSTS-MCI showed the highest positive predictive value. Conclusion: The machine learning algorithms predicting MCI showed findings comparable with those of the conventional screening tools.
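
The abstract does not spell out the cost function. One common reading of "the cost was calculated via logistic regression" is the mean binary cross-entropy of a logistic model, and the sketch below assumes that reading. The screening scores, labels, weights, and function names are all hypothetical, and plain Python stands in for the TensorFlow implementation the study used.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logistic_cost(weights, bias, features, labels):
    """Mean binary cross-entropy of a logistic model on labeled samples."""
    total = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical normalized screening scores (one feature per person);
# label 1 = MCI, label 0 = healthy. NOT data from the study.
features = [[0.9], [0.8], [0.4], [0.3]]
labels = [0, 0, 1, 1]

# An uninformative model (all zeros) predicts 0.5 for everyone.
baseline_cost = logistic_cost([0.0], 0.0, features, labels)

# A hand-picked model in which lower scores raise the MCI probability.
fitted_cost = logistic_cost([-4.0], 2.4, features, labels)

print(f"baseline cost: {baseline_cost:.3f}")
print(f"fitted cost:   {fitted_cost:.3f}")
```

Training amounts to searching for the weights and bias that minimize this cost on the training data set; accuracy is then measured separately on the test data set.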


Soil Classification System from Cone Penetration Test Data Applying Distance-Based Machine Learning Algorithms

Soils and Rocks ◽

10.28927/sr.422167 ◽

2019 ◽

Vol 42 (2) ◽

pp. 167-178

Author(s):

Lucas Orbolato Carvalho ◽

Dimas Betioli Ribeiro

Keyword(s):

Machine Learning ◽

Test Data ◽

Classification System ◽

Learning Algorithms ◽

Soil Classification ◽

Cone Penetration Test ◽

Machine Learning Algorithms ◽

Penetration Test ◽

Cone Penetration


Disturbance Simulation in the Packaging Process of Confectionary Using Virtual Commissioning

Machines ◽

10.3390/machines8020019 ◽

2020 ◽

Vol 8 (2) ◽

pp. 19

Author(s):

Johanna Wolf ◽

Sebastian Carsch ◽

Clemens Troll ◽

Jens-Peter Majschak

Keyword(s):

Machine Learning ◽

Test Data ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Visual Effects ◽

Computing Power ◽

Virtual Commissioning ◽

Sensor Signals ◽

Assistance Systems ◽

Sufficient Test

Operator assistance systems can help to reduce disturbance-related machine downtime in food production and packaging processes, especially when combined with machine learning algorithms. These assistance systems analyze the available sensor signals of the process control over time to help operators identify the causes of disturbances. Training such systems requires sufficient test data, which are often hardly available. Thus, this paper presents a study investigating how test data for teaching machine learning algorithms can be generated by numerical simulation. The potential of using virtual commissioning (VC) software for simulating disturbances of discrete processes is examined, considering the example of a friction- and collision-afflicted sub-process from an intermitting wrapping machine for confectionary. In this study, the software industrialPhysics (iP) is analyzed regarding the accuracy of static and dynamic friction and restitution. The values are verified by setting up virtual substitute tests and comparing the results with analytically determined values. Subsequently, prerecorded disturbances are classified, and seven selected elements are simulated in the VC software, recording visual effects and the switching characteristics of sensors. The verification shows that VC software is generally adequate for the assigned task. Restrictions occur regarding the computing power required by the built-in physics engine and the resulting reduction of the machine to be simulated.
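
The abstract does not say which substitute tests were used to verify friction and restitution. As an illustration only, the sketch below assumes two textbook cases, a block on an inclined plane and a ball dropped onto a rigid floor, and compares a simulated value against the analytic one. All numbers and function names are hypothetical.

```python
import math


def critical_incline_angle_deg(mu_static):
    """Incline angle at which a resting block starts to slide: tan(theta) = mu_s."""
    return math.degrees(math.atan(mu_static))


def sliding_acceleration(mu_dynamic, angle_deg, g=9.81):
    """Acceleration of a block sliding down an incline with dynamic friction."""
    angle = math.radians(angle_deg)
    return g * (math.sin(angle) - mu_dynamic * math.cos(angle))


def rebound_height(drop_height, restitution):
    """Rebound height after one bounce on a rigid floor: h1 = e^2 * h0."""
    return restitution ** 2 * drop_height


def relative_error(simulated, analytic):
    """Relative deviation of a simulated value from its analytic counterpart."""
    return abs(simulated - analytic) / abs(analytic)


# Hypothetical material parameters and a hypothetical simulation result.
analytic_rebound = rebound_height(drop_height=0.20, restitution=0.6)
simulated_rebound = 0.0735  # placeholder value, NOT from the paper

print(f"critical angle for mu_s = 0.4: {critical_incline_angle_deg(0.4):.1f} deg")
print(f"sliding acceleration at 30 deg, mu_d = 0.3: {sliding_acceleration(0.3, 30.0):.2f} m/s^2")
print(f"rebound height error: {relative_error(simulated_rebound, analytic_rebound):.1%}")
```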


Evaluation and Calibration of a Low-cost Particle Sensor in Ambient Conditions Using Machine Learning Technologies

10.5194/amt-2019-393 ◽

2019 ◽

Author(s):

Minxing Si ◽

Ying Xiong ◽

Shan Du ◽

Ke Du

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Random Search ◽

Low Cost ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Ambient Conditions ◽

Test Dataset ◽

Compact Size ◽

Significant Difference

Abstract. Particle sensing technology has shown great potential for monitoring particulate matter (PM) with very few temporal and spatial restrictions because of its low cost, compact size, and easy operation. However, the performance of low-cost sensors for PM monitoring in ambient conditions has not been thoroughly evaluated. Monitoring results by low-cost sensors are often questionable. In this study, a low-cost fine particle monitor (Plantower PMS 5003) was co-located with a reference instrument, the Synchronized Hybrid Ambient Real-time Particulate (SHARP) monitor, at the Calgary Varsity air monitoring station from December 2018 to April 2019. The study evaluated the performance of this low-cost PM sensor in ambient conditions and calibrated its readings using simple linear regression (SLR), multiple linear regression (MLR), and two more powerful machine learning algorithms using random search techniques for the best model architectures. The two machine learning algorithms are XGBoost and a feedforward neural network (NN). Field evaluation showed that the Pearson r between the low-cost sensor and the SHARP instrument was 0.78. The Fligner and Killeen (F-K) test indicated a statistically significant difference between the variances of the PM2.5 values by the low-cost sensor and by the SHARP instrument. Large overestimations by the low-cost sensor before calibration were observed in the field and were believed to be caused by the variation of ambient relative humidity. The root mean square error (RMSE) was 9.93 when comparing the low-cost sensor with the SHARP instrument. The calibration by the feedforward NN had the smallest RMSE of 3.91 in the test dataset, compared to the calibrations by SLR (4.91), MLR (4.65), and XGBoost (4.19). After calibrations, the F-K test using the test dataset showed that the variances of the PM2.5 values by the NN and XGBoost and by the reference method were not statistically significantly different. 
From this study, we conclude that feedforward NN is a promising method to address the poor performance of the low-cost sensors for PM2.5 monitoring. In addition, the random search method for hyperparameters was demonstrated to be an efficient approach for selecting the best model structure.
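
The random search over hyperparameters mentioned in the abstract above can be sketched in a few lines. The search space, the toy scoring function, and all names below are assumptions made for illustration. In a real workflow the objective would train a model and return its error on a validation split, keeping the test dataset for the final evaluation only.

```python
import math
import random


def random_search(objective, search_space, n_trials, seed=0):
    """Sample random configurations and keep the one with the lowest score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in search_space.items()}
        score = objective(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space for a feedforward network; not taken from the paper.
search_space = {
    "hidden_layers": [1, 2, 3, 4],
    "units_per_layer": [8, 16, 32, 64, 128],
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
}


def toy_validation_error(params):
    """Stand-in for 'train the model, then score it on validation data'."""
    return (
        abs(params["hidden_layers"] - 2)
        + abs(params["units_per_layer"] - 32) / 32
        + abs(math.log10(params["learning_rate"]) + 3)
    )


best_params, best_score = random_search(
    toy_validation_error, search_space, n_trials=30, seed=42
)
print(best_params, round(best_score, 3))
```

Unlike an exhaustive grid over all 80 combinations, the search evaluates only `n_trials` sampled configurations, so its cost is set by the trial budget rather than by the size of the search space.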


Streamlining Quality Review of Mass Spectrometry Data in the Clinical Laboratory by Use of Machine Learning

Archives of Pathology & Laboratory Medicine ◽

10.5858/arpa.2018-0238-oa ◽

2019 ◽

Vol 143 (8) ◽

pp. 990-998 ◽

Cited By ~ 2

Author(s):

Min Yu ◽

Lindsay A. L. Bazydlo ◽

David E. Bruns ◽

James H. Harrison

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Turnaround Time ◽

Machine Learning Algorithms ◽

Classification Model ◽

Supervised Machine Learning ◽

Training Dataset ◽

Support Vector ◽

Test Dataset ◽

Manual Review

Context.— Turnaround time and productivity of clinical mass spectrometric (MS) testing are hampered by time-consuming manual review of the analytical quality of MS data before release of patient results. Objective.— To determine whether a classification model created by using standard machine learning algorithms can verify analytically acceptable MS results and thereby reduce manual review requirements. Design.— We obtained retrospective data from gas chromatography–MS analyses of 11-nor-9-carboxy-delta-9-tetrahydrocannabinol (THC-COOH) in 1267 urine samples. The data for each sample had been labeled previously as either analytically unacceptable or acceptable by manual review. The dataset was randomly split into training and test sets (848 and 419 samples, respectively), maintaining equal proportions of acceptable (90%) and unacceptable (10%) results in each set. We used stratified 10-fold cross-validation in assessing the abilities of 6 supervised machine learning algorithms to distinguish unacceptable from acceptable assay results in the training dataset. The classifier with the highest recall was used to build a final model, and its performance was evaluated against the test dataset. Results.— In comparison testing of the 6 classifiers, a model based on the Support Vector Machines algorithm yielded the highest recall and acceptable precision. After optimization, this model correctly identified all unacceptable results in the test dataset (100% recall) with a precision of 81%. Conclusions.— Automated data review identified all analytically unacceptable assays in the test dataset, while reducing the manual review requirement by about 87%. This automation strategy can focus manual review only on assays likely to be problematic, allowing improved throughput and turnaround time without reducing quality.
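
The percentages reported in the abstract above can be cross-checked with a short calculation. The abstract gives only proportions, so the sample counts below are reconstructed by rounding and are assumptions. "Unacceptable" is treated as the positive class, and every flagged assay is assumed to go to manual review.

```python
def precision(true_positives, false_positives):
    """Fraction of flagged assays that are truly unacceptable."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of truly unacceptable assays that were flagged."""
    return true_positives / (true_positives + false_negatives)


# Figures stated in the abstract.
test_size = 419
unacceptable_fraction = 0.10
reported_recall = 1.00
reported_precision = 0.81

# Assumption: counts reconstructed by rounding the reported proportions.
n_unacceptable = round(test_size * unacceptable_fraction)
true_positives = round(n_unacceptable * reported_recall)
flagged = round(true_positives / reported_precision)
false_positives = flagged - true_positives
false_negatives = n_unacceptable - true_positives

# Only flagged assays still need manual review.
review_reduction = 1 - flagged / test_size

print(f"unacceptable assays in test set: {n_unacceptable}")
print(f"assays flagged for review:       {flagged}")
print(f"recall:    {recall(true_positives, false_negatives):.2f}")
print(f"precision: {precision(true_positives, false_positives):.2f}")
print(f"manual review reduced by about {review_reduction:.1%}")
```

Under these rounding assumptions, roughly 52 of the 419 test assays would be flagged, which is in line with the reported reduction in manual review of about 87%.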


Disease Detection and Prediction Using the Liver Function Test Data: A Review of Machine Learning Algorithms

10.1007/978-981-16-2597-8_68 ◽

2021 ◽

pp. 785-800

Author(s):

Ifra Altaf ◽

Muheet Ahmed Butt ◽

Majid Zaman

Keyword(s):

Machine Learning ◽

Liver Function ◽

Liver Function Test ◽

Test Data ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Disease Detection ◽

Function Test


Creating Test Data for Market Surveillance Systems with Embedded Machine Learning Algorithms

Proceedings of the Institute for System Programming of RAS ◽

10.15514/ispras-2017-29(4)-18 ◽

2017 ◽

Vol 29 (4) ◽

pp. 269-282

Author(s):

O. Moskaleva ◽

A. Gromova

Keyword(s):

Machine Learning ◽

Test Data ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Surveillance Systems ◽

Market Surveillance
