set splitting
Recently Published Documents


TOTAL DOCUMENTS: 34 (five years: 1)

H-INDEX: 11 (five years: 0)

PLoS ONE ◽  
2021 ◽  
Vol 16 (8) ◽  
pp. e0256152
Author(s):  
Chansik An ◽  
Yae Won Park ◽  
Sung Soo Ahn ◽  
Kyunghwa Han ◽  
Hwiyoung Kim ◽  
...  

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model, and its gap from the test performance, under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “simple” task, glioblastomas [n = 109] vs. brain metastases [n = 58] and (2) “difficult” task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set and tested on the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among the different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In one training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and testing (the generalization gap) was large, none of the validation methods sufficiently reduced it. Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially with small sample sizes.
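A minimal Python sketch of this repeated-splitting experiment, for illustration only: an L1-penalized logistic regression stands in for the LASSO model, and the synthetic data, 70/30 split ratio, and penalty strength are assumptions rather than the study's settings.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the radiomics features (sample size borrowed
    # from the n = 258 meningioma task; feature counts are illustrative).
    X, y = make_classification(n_samples=258, n_features=100,
                               n_informative=10, random_state=0)

    gaps = []
    for seed in range(1000):  # 1,000 different training-test set pairs
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
        model.fit(X_tr, y_tr)
        auc_tr = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        auc_te = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        gaps.append(auc_tr - auc_te)  # per-split generalization gap

    print(f"mean AUC gap = {np.mean(gaps):.3f} (+/- {np.std(gaps):.3f})")

Plotting the distribution of the 1,000 gaps makes the split-to-split variability reported above directly visible.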


2020 ◽  
Author(s):  
Chansik An ◽  
Yae Won Park ◽  
Sung Soo Ahn ◽  
Kyunghwa Han ◽  
Hwiyoung Kim ◽  
...  

Abstract
Objective: This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model under different conditions, using real-world brain tumor radiomics data.
Materials and Methods: We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “simple” task, glioblastomas [n=109] vs. brain metastases [n=58] and (2) “difficult” task, low- [n=163] vs. high-grade [n=95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training and test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained by five-fold cross-validation (CV) or nested CV, with or without repetitions, in the training set and tested on the test set, using the area under the curve (AUC) as an evaluation metric.
Results: The AUCs in CV and testing varied widely with data composition, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between CV and testing was 0.029 (±0.022) for the simple task without undersampling and 0.108 (±0.079) for the difficult task with undersampling. In one training-test set pair, the AUC was high in CV but much lower in testing (0.840 and 0.650, respectively); in another dataset pair with the same task, however, the AUC was low in CV but much higher in testing (0.702 and 0.836, respectively). None of the CV methods helped overcome this issue.
Conclusions: Machine learning after a single random training-test set split may lead to unreliable results in radiomics studies, especially when the sample size is small.
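This version of the study evaluates five-fold and nested CV inside the training set. A minimal sketch of nested CV under assumed settings (synthetic data and an illustrative grid of penalty strengths): the inner loop tunes the L1 penalty, and the outer loop scores the tuned model on folds the tuning step never saw.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

    X, y = make_classification(n_samples=258, n_features=100,
                               n_informative=10, random_state=0)

    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

    # Inner CV: choose the L1 penalty strength C by AUC.
    search = GridSearchCV(
        LogisticRegression(penalty="l1", solver="liblinear"),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc", cv=inner)

    # Outer CV: estimate performance on data unseen during tuning.
    nested_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
    print(f"nested CV AUC = {nested_auc.mean():.3f} (+/- {nested_auc.std():.3f})")

As the abstract notes, even this more careful estimate can diverge from the test-set AUC when the test set itself is a small random sample.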


2020 ◽  
Vol 365 ◽  
pp. 112961
Author(s):  
John M. Harmon ◽  
Daniel Arthur ◽  
José E. Andrade

Automatica ◽  
2020 ◽  
Vol 111 ◽  
pp. 108602
Author(s):  
Xuhui Feng ◽  
Mario E. Villanueva ◽  
Boris Houska

2017 ◽  
Vol 9 (2) ◽  
pp. 134-143
Author(s):  
Mihai Oltean

Abstract
We describe an optical device, based on time delays, for solving the set splitting problem, a well-known NP-complete problem. The device has a graph-like structure, and light traverses it from a start node to a destination node. All possible (potential) paths in the graph are generated, and at the destination we check which ones fully satisfy the problem's constraints.
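For reference, a brute-force Python sketch of the set splitting problem itself (an illustration of the decision problem, not of the optical device): enumerating every 2-coloring of the universe plays the role of generating all possible paths, and the final check mirrors verifying the constraints at the destination node.

    from itertools import product

    def set_splitting(universe, family):
        """Return a 2-partition of universe that splits every set in family,
        or None if no such partition exists."""
        elems = sorted(universe)
        for colors in product((0, 1), repeat=len(elems)):  # one "path" each
            side = dict(zip(elems, colors))
            # A subset is split when it touches both sides of the partition.
            if all(len({side[x] for x in s}) == 2 for s in family):
                part = {x for x in elems if side[x] == 0}
                return part, set(universe) - part
        return None

    print(set_splitting({1, 2, 3, 4}, [{1, 2}, {2, 3}, {3, 4}]))

The exhaustive enumeration is exponential in the size of the universe, as expected for an NP-complete problem; it is this enumeration that the device delegates to light.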


2017 ◽  
Vol 27 (6) ◽  
Author(s):  
Alexander M. Chudnov

Abstract
We study conditions for the existence of coalition games whose result is invariant under cyclic shifts of the players' sequence numbers. Given a total number


2014 ◽  
Vol 11 (3) ◽  
pp. 899-900
Author(s):  
Zhaocai Wang ◽  
Chengpei Tang ◽  
Haifeng Liu ◽  
Renlin Pei

2013 ◽  
Vol 23 (1) ◽  
pp. 31-41 ◽  
Author(s):  
Jozef Kratica

In this paper, an electromagnetism-like approach (EM) is applied to solving the maximum set splitting problem (MSSP). A hybrid approach, consisting of movement based on attraction-repulsion mechanisms combined with the proposed scaling technique, directs EM toward promising search regions. A fast implementation of the local search procedure further improves the efficiency of the overall EM system. The performance of the proposed EM approach is evaluated on two classes of instances from the literature: minimum hitting set and Steiner triple systems. The results show that, with one exception, EM reaches optimal solutions on minimum hitting set instances with up to 500 elements and 50,000 subsets. It also reaches all optimal/best-known solutions for Steiner triple systems.
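A minimal Python sketch of the electromagnetism-like idea applied to MSSP, under assumed parameters and with a simplified force step (not Kratica's implementation): candidate partitions live as points in [0, 1]^n, points drift toward the best-scoring point, and each point is rounded to a 0/1 partition to count how many subsets it splits.

    import random

    def split_count(bits, family):
        """Objective: number of subsets with elements on both sides."""
        return sum(1 for s in family if len({bits[x] for x in s}) == 2)

    def em_mssp(n, family, pop=20, iters=200, step=0.1, seed=0):
        rng = random.Random(seed)
        points = [[rng.random() for _ in range(n)] for _ in range(pop)]
        best_bits, best_val = None, -1
        for _ in range(iters):
            scores = [split_count([round(v) for v in p], family) for p in points]
            hi = max(range(pop), key=lambda i: scores[i])
            if scores[hi] > best_val:
                best_val = scores[hi]
                best_bits = [round(v) for v in points[hi]]
            for i in range(pop):
                if i == hi:
                    continue
                # Attraction toward the best point plus a little noise: a crude
                # stand-in for the full charge-based attraction-repulsion forces.
                points[i] = [min(1.0, max(0.0, v + step * (b - v)
                                          + rng.uniform(-0.05, 0.05)))
                             for v, b in zip(points[i], points[hi])]
        return best_bits, best_val

    # Elements 0..3; the partition {0, 2} vs. {1, 3} splits all four subsets.
    print(em_mssp(4, [{0, 1}, {1, 2}, {2, 3}, {0, 3}]))

The actual algorithm adds the scaling technique and a dedicated local search on top of the force-based movement.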

