SIMON, an automated machine learning system reveals immune signatures of influenza vaccine responses

Mapping Intimacies ◽

10.1101/545186 ◽

2019 ◽

Author(s):

Adriana Tomic ◽

Ivan Tomic ◽

Yael Rosenberg-Hasson ◽

Cornelia L. Dekker ◽

Holden T. Maecker ◽

...

Keyword(s):

Machine Learning ◽

Statistical Power ◽

Missing Values ◽

Learning System ◽

T Cell Subsets ◽

Vaccine Responses ◽

Biological Insight ◽

Speed Up ◽

Automated Machine Learning ◽

Incomplete Datasets

AbstractMachine learning holds considerable promise for understanding complex biological processes such as vaccine responses. Capturing interindividual variability is essential to increase the statistical power necessary for building more accurate predictive models. However, available approaches have difficulty coping with incomplete datasets which is often the case when combining studies. Additionally, there are hundreds of algorithms available and no simple way to find the optimal one. Here, we developed Sequential Iterative Modelling “OverNight” or SIMON, an automated machine learning system that compares results from 128 different algorithms and is particularly suitable for datasets containing many missing values. We applied SIMON to data from five clinical studies of seasonal influenza vaccination. The results reveal previously unrecognized CD4+ and CD8+ T cell subsets strongly associated with a robust antibody response to influenza antigens. These results demonstrate that SIMON can greatly speed up the choice of analysis modalities. Hence, it is a highly useful approach for data-driven hypothesis generation from disparate clinical datasets. Our strategy could be used to gain biological insight from ever-expanding heterogeneous datasets that are publicly available.

Download Full-text

SIMON, an Automated Machine Learning System, Reveals Immune Signatures of Influenza Vaccine Responses

The Journal of Immunology ◽

10.4049/jimmunol.1900033 ◽

2019 ◽

Vol 203 (3) ◽

pp. 749-759 ◽

Cited By ~ 4

Author(s):

Adriana Tomic ◽

Ivan Tomic ◽

Yael Rosenberg-Hasson ◽

Cornelia L. Dekker ◽

Holden T. Maecker ◽

...

Keyword(s):

Machine Learning ◽

Influenza Vaccine ◽

Learning System ◽

Vaccine Responses ◽

Immune Signatures ◽

Automated Machine Learning

Download Full-text

A Robust Automated Machine Learning System with Pseudoinverse Learning

Cognitive Computation ◽

10.1007/s12559-021-09853-6 ◽

2021 ◽

Author(s):

Ke Wang ◽

Ping Guo

Keyword(s):

Machine Learning ◽

Learning System ◽

Automated Machine Learning

Download Full-text

Testing the Suitability of Automated Machine Learning for Weeds Identification

AI ◽

10.3390/ai2010004 ◽

2021 ◽

Vol 2 (1) ◽

pp. 34-47

Author(s):

Borja Espejo-Garcia ◽

Ioannis Malounas ◽

Eleanna Vali ◽

Spyros Fountas

Keyword(s):

Machine Learning ◽

Plant Protection ◽

Crop Protection ◽

Identification Problem ◽

Learning System ◽

Classifier Ensembles ◽

Automated Machine Learning ◽

A New Technique ◽

Plant Seedlings ◽

And Training

In the past years, several machine-learning-based techniques have arisen for providing effective crop protection. For instance, deep neural networks have been used to identify different types of weeds under different real-world conditions. However, these techniques usually require extensive involvement of experts working iteratively in the development of the most suitable machine learning system. To support this task and save resources, a new technique called Automated Machine Learning has started being studied. In this work, a complete open-source Automated Machine Learning system was evaluated with two different datasets, (i) The Early Crop Weeds dataset and (ii) the Plant Seedlings dataset, covering the weeds identification problem. Different configurations, such as the use of plant segmentation, the use of classifier ensembles instead of Softmax and training with noisy data, have been compared. The results showed promising performances of 93.8% and 90.74% F1 score depending on the dataset used. These performances were aligned with other related works in AutoML, but they are far from machine-learning-based systems manually fine-tuned by human experts. From these results, it can be concluded that finding a balance between manual expert work and Automated Machine Learning will be an interesting path to work in order to increase the efficiency in plant protection.

Download Full-text

Evaluating the impact of multivariate imputation by MICE in feature selection

PLoS ONE ◽

10.1371/journal.pone.0254720 ◽

2021 ◽

Vol 16 (7) ◽

pp. e0254720

Author(s):

Maritza Mera-Gaona ◽

Ursula Neumann ◽

Rubiel Vargas-Canas ◽

Diego M. López

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Missing Values ◽

Positive Impact ◽

Selection Process ◽

Full Data ◽

Crucial Step ◽

Incomplete Datasets ◽

Selection Algorithms ◽

The Impact

Handling missing values is a crucial step in preprocessing data in Machine Learning. Most available algorithms for analyzing datasets in the feature selection process and classification or estimation process analyze complete datasets. Consequently, in many cases, the strategy for dealing with missing values is to use only instances with full data or to replace missing values with a mean, mode, median, or a constant value. Usually, discarding missing samples or replacing missing values by means of fundamental techniques causes bias in subsequent analyzes on datasets. Aim: Demonstrate the positive impact of multivariate imputation in the feature selection process on datasets with missing values. Results: We compared the effects of the feature selection process using complete datasets, incomplete datasets with missingness rates between 5 and 50%, and imputed datasets by basic techniques and multivariate imputation. The feature selection algorithms used are well-known methods. The results showed that the datasets imputed by multivariate imputation obtained the best results in feature selection compared to datasets imputed by basic techniques or non-imputed incomplete datasets. Conclusions: Considering the results obtained in the evaluation, applying multivariate imputation by MICE reduces bias in the feature selection process.

Download Full-text

The Prototype Development Applied as a Conceptual Design of an Automated Machine Learning System

KIISE Transactions on Computing Practices ◽

10.5626/ktcp.2020.26.6.268 ◽

2020 ◽

Vol 26 (6) ◽

pp. 268-273

Author(s):

Jayoung James Goo ◽

Hyosung Kim

Keyword(s):

Machine Learning ◽

Conceptual Design ◽

Learning System ◽

Prototype Development ◽

Automated Machine Learning

Download Full-text

Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study (Preprint)

10.2196/preprints.30824 ◽

2021 ◽

Author(s):

Hansle Gwon ◽

Imjin Ahn ◽

Yunha Kim ◽

Hee Jun Kang ◽

Hyeram Seo ◽

...

Keyword(s):

Machine Learning ◽

Incomplete Data ◽

Missing Values ◽

Pearson Correlation ◽

Laboratory Data ◽

Complete Data ◽

Learning System ◽

Training Data ◽

Rank Test ◽

Missing Value

BACKGROUND When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute this missing value include statistical methods such as mean, expectation-maximization, and multiple imputations by chained equations (MICE) as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree. OBJECTIVE The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field where training data are scarce. METHODS In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the data in which the missing value is validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until the condition is satisfied. The most important part of this process is how to evaluate the accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model. RESULTS In self-training using random forest (RF), mean squared error was up to 12% lower than pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a <i>P</i> value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible <i>P</i> value, 3.05e-5, in all situations. CONCLUSIONS Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. And self-training has the potential to improve performance according to the pseudolabel evaluation method, which will be the main subject of our future research.

Download Full-text

Self–Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

JMIR Public Health and Surveillance ◽

10.2196/30824 ◽

2021 ◽

Vol 7 (10) ◽

pp. e30824

Author(s):

Hansle Gwon ◽

Imjin Ahn ◽

Yunha Kim ◽

Hee Jun Kang ◽

Hyeram Seo ◽

...

Keyword(s):

Machine Learning ◽

Incomplete Data ◽

Missing Values ◽

Pearson Correlation ◽

Laboratory Data ◽

Complete Data ◽

Learning System ◽

Rank Test ◽

P Value ◽

Missing Value

Background When using machine learning in the real world, the missing value problem is the first problem encountered. Methods to impute this missing value include statistical methods such as mean, expectation-maximization, and multiple imputations by chained equations (MICE) as well as machine learning methods such as multilayer perceptron, k-nearest neighbor, and decision tree. Objective The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to effectively impute data using a progressive method called self-training in the medical field where training data are scarce. Methods In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the data in which the missing value is validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until the condition is satisfied. The most important part of this process is how to evaluate the accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model. Results In self-training using random forest (RF), mean squared error was up to 12% lower than pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations. Conclusions Self-training showed significant results in comparing the predicted values and actual values, but it needs to be verified in an actual machine learning system. And self-training has the potential to improve performance according to the pseudolabel evaluation method, which will be the main subject of our future research.

Download Full-text

Towards Automated Semi-Supervised Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014237 ◽

2019 ◽

Vol 33 ◽

pp. 4237-4244 ◽

Cited By ~ 1

Author(s):

Yu-Feng Li ◽

Hai Wang ◽

Tong Wei ◽

Wei-Wei Tu

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Semisupervised Learning ◽

Learning System ◽

Large Margin ◽

Automated Learning ◽

Machine Learning Model ◽

Meta Learning ◽

Performance Deterioration ◽

Automated Machine Learning

Automated Machine Learning (AutoML) aims to build an appropriate machine learning model for any unseen dataset automatically, i.e., without human intervention. Great efforts have been devoted on AutoML while they typically focus on supervised learning. In many applications, however, semisupervised learning (SSL) are widespread and current AutoML systems could not well address SSL problems. In this paper, we propose to present an automated learning system for SSL (AUTO-SSL). First, meta-learning with enhanced meta-features is employed to quickly suggest some instantiations of the SSL techniques which are likely to perform quite well. Second, a large margin separation method is proposed to fine-tune the hyperparameters and more importantly, alleviate performance deterioration. The basic idea is that, if a certain hyperparameter owns a high quality, its predictive results on unlabeled data may have a large margin separation. Extensive empirical results over 200 cases demonstrate that our proposal on one side achieves highly competitive or better performance compared to the state-of-the-art AutoML system AUTO-SKLEARN and classical SSL techniques, on the other side unlike classical SSL techniques which often significantly degenerate performance, our proposal seldom suffers from such deficiency.

Download Full-text

Missing Data - Better "Not to Have Them", but What If You Do? (Part 1)

Marketing ZFP ◽

10.15358/0344-1369-2019-4-21 ◽

2019 ◽

Vol 41 (4) ◽

pp. 21-32

Author(s):

Dirk Temme ◽

Sarah Jensen

Keyword(s):

Missing Data ◽

Statistical Power ◽

Missing Values ◽

Graphical Representation ◽

Marketing Research ◽

Likelihood Estimation ◽

Parameter Estimates ◽

Full Information Maximum Likelihood ◽

Definition Of ◽

Traditional Approaches

Missing values are ubiquitous in empirical marketing research. If missing data are not dealt with properly, this can lead to a loss of statistical power and distorted parameter estimates. While traditional approaches for handling missing data (e.g., listwise deletion) are still widely used, researchers can nowadays choose among various advanced techniques such as multiple imputation analysis or full-information maximum likelihood estimation. Due to the available software, using these modern missing data methods does not pose a major obstacle. Still, their application requires a sound understanding of the prerequisites and limitations of these methods as well as a deeper understanding of the processes that have led to missing values in an empirical study. This article is Part 1 and first introduces Rubin’s classical definition of missing data mechanisms and an alternative, variable-based taxonomy, which provides a graphical representation. Secondly, a selection of visualization tools available in different R packages for the description and exploration of missing data structures is presented.

Download Full-text

Big Data to Knowledge: Application of Machine Learning to Predictive Modeling of Therapeutic Response in Cancer.

Current Genomics ◽

10.2174/1389202921999201224110101 ◽

2020 ◽

Vol 21 ◽

Author(s):

Sukanya Panja ◽

Sarra Rahem ◽

Cassandra J. Chu ◽

Antonina Mitrofanova

Keyword(s):

Machine Learning ◽

Missing Values ◽

Therapeutic Response ◽

Patient Data ◽

Machine Learning Techniques ◽

Support Vector ◽

Complex Data ◽

Human Machine Interaction ◽

Data Repositories ◽

Response Modeling

Background: In recent years, the availability of high throughput technologies, establishment of large molecular patient data repositories, and advancement in computing power and storage have allowed elucidation of complex mechanisms implicated in therapeutic response in cancer patients. The breadth and depth of such data, alongside experimental noise and missing values, requires a sophisticated human-machine interaction that would allow effective learning from complex data and accurate forecasting of future outcomes, ideally embedded in the core of machine learning design. Objective: In this review, we will discuss machine learning techniques utilized for modeling of treatment response in cancer, including Random Forests, support vector machines, neural networks, and linear and logistic regression. We will overview their mathematical foundations and discuss their limitations and alternative approaches all in light of their application to therapeutic response modeling in cancer. Conclusion: We hypothesize that the increase in the number of patient profiles and potential temporal monitoring of patient data will define even more complex techniques, such as deep learning and causal analysis, as central players in therapeutic response modeling.

Download Full-text