Analysis of Learning Influence of Training Data Selected by Distribution Consistency

Myunggwon Hwang; Yuna Jeong; Won-Kyung Sung

doi:10.3390/s21041045


Sensors ◽

10.3390/s21041045 ◽

2021 ◽

Vol 21 (4) ◽

pp. 1045

Author(s):

Myunggwon Hwang ◽

Yuna Jeong ◽

Won-Kyung Sung

Keyword(s):

Machine Learning ◽

Data Distribution ◽

Training Data ◽

Learning Performance ◽

Two Dimensional ◽

Target Class ◽

Core Data ◽

Dimensional Distribution ◽

Improved Performance

This study proposes a method for selecting core data that benefit machine learning. Specifically, we form a two-dimensional distribution based on the similarity of the training data and divide the distribution into grids with fixed ratios. In each grid, we select data based on the distribution consistency (DC) of the target class data and examine how the selection affects the classifier. We use CIFAR-10 for the experiment and vary the grid ratio from 0.5 to 0.005. The influence of these variables was analyzed using training sets of different sizes selected by high-DC, low-DC (inverse of high-DC), and random (no criteria) selection. As a result, the average accuracy improved by 0.95 (±0.65) and 1.54 (±0.59) percentage points for grid ratios of 0.008 and 0.005, respectively. These outcomes indicate improved performance compared with the existing approach (data distribution search). In this study, we confirmed that learning performance improved when the training data were selected with very small grids and high-DC settings.
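The grid-based selection described in this abstract can be sketched roughly as follows. The abstract does not define DC, so this sketch assumes that the DC of a grid cell is the share of samples in that cell belonging to the target class; the function names (`cell_of`, `distribution_consistency`, `select`) are hypothetical, and the 2-D coordinates are assumed to be already scaled to [0, 1).

```python
import random
from collections import defaultdict

def cell_of(point, ratio):
    """Map a 2-D point in [0, 1) x [0, 1) to the index of its grid cell."""
    x, y = point
    return (int(x // ratio), int(y // ratio))

def distribution_consistency(points, labels, target, ratio):
    """Per-cell share of samples labelled `target` (an assumed DC definition)."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for p, lab in zip(points, labels):
        c = cell_of(p, ratio)
        total[c] += 1
        if lab == target:
            hits[c] += 1
    return {c: hits[c] / total[c] for c in total}

def select(points, labels, target, ratio, k, mode="high", seed=0):
    """Return indices of up to k target-class samples.

    mode="high" prefers cells with high DC, mode="low" prefers cells
    with low DC, and mode="random" ignores DC entirely.
    """
    dc = distribution_consistency(points, labels, target, ratio)
    idx = [i for i, lab in enumerate(labels) if lab == target]
    if mode == "random":
        random.Random(seed).shuffle(idx)
    else:
        idx.sort(key=lambda i: dc[cell_of(points[i], ratio)],
                 reverse=(mode == "high"))
    return idx[:k]

# Toy data: two "cat" samples share a pure cell, one sits among "dog" samples.
points = [(0.1, 0.1), (0.2, 0.2), (0.6, 0.6), (0.7, 0.7), (0.8, 0.8)]
labels = ["cat", "cat", "cat", "dog", "dog"]
high = select(points, labels, "cat", ratio=0.5, k=2, mode="high")
low = select(points, labels, "cat", ratio=0.5, k=1, mode="low")
```

A smaller `ratio` gives finer cells, which corresponds to the small grid settings (0.008 and 0.005) that the abstract reports as most effective.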


Noise Removal Process from Label Classification using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.c3920.098319 ◽

2019 ◽

Vol 8 (3) ◽

pp. 172-175

Keyword(s):

Machine Learning ◽

Big Data ◽

Supervised Learning ◽

Noise Removal ◽

Error Rates ◽

Training Data ◽

Learning Performance ◽

Training Dataset ◽

Noise Filtering ◽

Label Noise

Text classification and clustering are essential in big data environments, and many classification algorithms have been proposed for supervised learning applications. In the era of big data, a large volume of training data is available for many machine learning tasks. However, some of these data may be mislabeled or unlabeled. Incorrect labels introduce label noise, which in turn degrades the learning performance of a classifier. A general approach to addressing label noise is to apply noise filtering techniques that identify and remove noise before learning. A range of noise filtering approaches have been developed to improve classifier performance. This paper proposes a noise filtering approach for text data during the training phase. Many supervised learning algorithms produce high error rates because of noise in the training dataset; our work eliminates such noise and provides an accurate classification system.
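As an illustration of the general "filter before learning" idea mentioned in this abstract, the sketch below drops any text whose label disagrees with the majority label of its most similar neighbours. This is a simple nearest-neighbour consensus filter chosen for illustration, not the method proposed in the paper; the helper names and the use of Jaccard similarity on whitespace tokens are assumptions.

```python
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between the whitespace-token sets of two texts."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def filter_label_noise(texts, labels, k=3):
    """Return indices of samples whose label matches their k nearest neighbours.

    A sample is kept only if the majority label among its k most similar
    other samples equals its own label.
    """
    keep = []
    for i, text in enumerate(texts):
        others = [j for j in range(len(texts)) if j != i]
        others.sort(key=lambda j: jaccard(text, texts[j]), reverse=True)
        votes = Counter(labels[j] for j in others[:k])
        majority = votes.most_common(1)[0][0]
        if majority == labels[i]:
            keep.append(i)
    return keep

texts = [
    "cheap pills buy now",
    "buy cheap pills online",
    "cheap pills now online",          # labelled "ham" by mistake
    "buy pills now cheap online",
    "meeting agenda for monday",
    "monday meeting agenda notes",
    "agenda notes for monday meeting",
]
labels = ["spam", "spam", "ham", "spam", "ham", "ham", "ham"]
clean_indices = filter_label_noise(texts, labels, k=3)
```

Using an odd `k` with two classes avoids tied votes; with more classes a tie-breaking rule would need to be chosen explicitly.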


Prediction of galaxy halo masses in SDSS DR7 via a machine learning approach

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2775 ◽

2019 ◽

Vol 490 (2) ◽

pp. 2367-2379 ◽

Cited By ~ 5

Author(s):

Victor F Calderon ◽

Andreas A Berlind

Keyword(s):

Machine Learning ◽

Dark Matter ◽

Sloan Digital Sky Survey ◽

Training Data ◽

Dark Matter Halo ◽

Joint Distributions ◽

Sky Survey ◽

Machine Learning Approach ◽

Improved Performance ◽

Halo Masses

We present a machine learning (ML) approach for the prediction of galaxies’ dark matter halo masses which achieves an improved performance over conventional methods. We train three ML algorithms (XGBoost, random forests, and neural network) to predict halo masses using a set of synthetic galaxy catalogues that are built by populating dark matter haloes in N-body simulations with galaxies and that match both the clustering and the joint distributions of properties of galaxies in the Sloan Digital Sky Survey (SDSS). We explore the correlation of different galaxy- and group-related properties with halo mass, and extract the set of nine features that contribute the most to the prediction of halo mass. We find that mass predictions from the ML algorithms are more accurate than those from halo abundance matching (HAM) or dynamical mass estimates (DYN). Since the danger of this approach is that our training data might not accurately represent the real Universe, we explore the effect of testing the model on synthetic catalogues built with different assumptions than the ones used in the training phase. We test a variety of models with different ways of populating dark matter haloes, such as adding velocity bias for satellite galaxies. We determine that, though training and testing on different data can lead to systematic errors in predicted masses, the ML approach still yields substantially better masses than either HAM or DYN. Finally, we apply the trained model to a galaxy and group catalogue from the SDSS DR7 and present the resulting halo masses.


Training data distribution significantly impacts the estimation of tissue microstructure with machine learning

Magnetic Resonance in Medicine ◽

10.1002/mrm.29014 ◽

2021 ◽

Author(s):

Noemi G. Gyori ◽

Marco Palombo ◽

Christopher A. Clark ◽

Hui Zhang ◽

Daniel C. Alexander

Keyword(s):

Machine Learning ◽

Data Distribution ◽

Training Data ◽

Tissue Microstructure


EEG-Based Brain-Computer Interfaces Are Vulnerable to Backdoor Attacks

10.21203/rs.3.rs-108085/v1 ◽

2021 ◽

Author(s):

Lubin Meng ◽

Jian Huang ◽

Zhigang Zeng ◽

Xue Jiang ◽

Shan Yu ◽

...

Keyword(s):

Machine Learning ◽

Test Sample ◽

Machine Learning Algorithms ◽

Training Data ◽

Learning Approaches ◽

Brain Computer Interfaces ◽

Target Class ◽

Computer Interfaces ◽

Machine Learning Model ◽

Electroencephalogram Eeg

Research and development of electroencephalogram (EEG) based brain-computer interfaces (BCIs) have advanced rapidly, partly due to the wide adoption of sophisticated machine learning approaches for decoding the EEG signals. However, recent studies have shown that machine learning algorithms are vulnerable to adversarial attacks, e.g., the attacker can add tiny adversarial perturbations to a test sample to fool the model, or poison the training data to insert a secret backdoor. Previous research has shown that adversarial attacks are also possible for EEG-based BCIs. However, only adversarial perturbations have been considered, and the approaches are theoretically sound but very difficult to implement in practice. This article proposes to use a narrow period pulse for poisoning attacks on EEG-based BCIs, which is more feasible in practice and has never been considered before. One can create dangerous backdoors in the machine learning model by injecting poisoning samples into the training set. Test samples with the backdoor key will then be classified into the target class specified by the attacker. What most distinguishes our approach from previous ones is that the backdoor key does not need to be synchronized with the EEG trials, making it very easy to implement. The effectiveness and robustness of the backdoor attack approach are demonstrated, highlighting a critical security concern for EEG-based BCIs.


Deep Data Source Fusion with Bias-Undoing for Lung Adenocarcinoma Classification

Journal of Medical Imaging and Health Informatics ◽

10.1166/jmihi.2020.3262 ◽

2020 ◽

Vol 10 (11) ◽

pp. 2620-2627

Author(s):

Yuanli Feng ◽

Pengyi Hao ◽

Wei Chen ◽

Xinguo Liu

Keyword(s):

Deep Learning ◽

Lung Adenocarcinoma ◽

Early Stage ◽

Lung Nodule ◽

Data Distribution ◽

Training Data ◽

Learning Performance ◽

Natural Image ◽

Base Parameters ◽

Data Source

Computer-aided diagnosis of early-stage lung adenocarcinoma based on deep learning is promising for assisting the prevention and treatment of lung cancer, a deadly disease; however, relevant works face the problem of limited training data. Data source fusion, in which deep models are trained on multiple relevant datasets, is a promising way to address the lack of training data, but bias in the data distribution across different data sources is a universal issue that affects learning performance. In this paper, we propose a deep learning framework based on bias-undoing data source fusion to classify early stages of lung adenocarcinoma in computed tomography (CT) images. The framework learns on integrated datasets of natural images, lung nodule CT, and lung adenocarcinoma CT, respectively, and is organized into base parameters and bias parameters so that it can adapt to biased data distributions. Experimental results demonstrate that the proposed bias-undoing framework effectively improves the performance of deep learning for lung adenocarcinoma classification and is clearly superior to general fusion frameworks in alleviating the effect of dataset bias.


Learning with uncertainty for biological discovery and design

10.1101/2020.08.11.247072 ◽

2020 ◽

Cited By ~ 1

Author(s):

Brian Hie ◽

Bryan Bryson ◽

Bonnie Berger

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Data Distribution ◽

Machine Learning Algorithms ◽

Training Data ◽

Cell Growth Inhibition ◽

Compound Library ◽

Uncertainty Prediction ◽

Transformative Potential ◽

Biological Discovery

Machine learning that generates biological hypotheses has transformative potential, but most learning algorithms are susceptible to pathological failure when exploring regimes beyond the training data distribution. A solution is to quantify prediction uncertainty so that algorithms can gracefully handle novel phenomena that confound standard methods. Here, we demonstrate the broad utility of robust uncertainty prediction in biological discovery. By leveraging Gaussian process-based uncertainty prediction on modern pretrained features, we train a model on just 72 compounds to make predictions over a 10,833-compound library, identifying and experimentally validating compounds with nanomolar affinity for diverse kinases and whole-cell growth inhibition of Mycobacterium tuberculosis. We show how uncertainty facilitates a tight iterative loop between computation and experimentation, improves the generative design of novel biochemical structures, and generalizes across disparate biological domains. More broadly, our work demonstrates that uncertainty should play a key role in the increasing adoption of machine learning algorithms into the experimental lifecycle.


Scalable Approach to High Coverages on Oxides via Iterative Training of a Machine-Learning Algorithm

10.26434/chemrxiv.10288514.v1 ◽

2019 ◽

Author(s):

Andrew Medford ◽

Shengchun Yang ◽

Fuzhu Liu

Keyword(s):

Machine Learning ◽

Chemical Potential ◽

Learning Algorithm ◽

Absolute Error ◽

Low Energy ◽

Training Data ◽

High Coverage ◽

Metal Compounds ◽

Adsorption Energies ◽

The Stability

Understanding the interaction of multiple types of adsorbate molecules on solid surfaces is crucial to establishing the stability of catalysts under various chemical environments. Computational studies on the high coverage and mixed coverages of reaction intermediates are still challenging, especially for transition-metal compounds. In this work, we present a framework to predict differential adsorption energies and identify low-energy structures under high- and mixed-adsorbate coverages on oxide materials. The approach uses Gaussian process machine-learning models with quantified uncertainty in conjunction with an iterative training algorithm to actively identify the training set. The framework is demonstrated for the mixed adsorption of CHx, NHx and OHx species on the oxygen vacancy and pristine rutile TiO2(110) surface sites. The results indicate that the proposed algorithm is highly efficient at identifying the most valuable training data, and is able to predict differential adsorption energies with a mean absolute error of ~0.3 eV based on <25% of the total DFT data. The algorithm is also used to identify 76% of the low-energy structures based on <30% of the total DFT data, enabling construction of surface phase diagrams that account for high and mixed coverage as a function of the chemical potential of C, H, O, and N. Furthermore, the computational scaling indicates the algorithm scales nearly linearly (N^1.12) as the number of adsorbates increases. This framework can be directly extended to metals, metal oxides, and other materials, providing a practical route toward the investigation of the behavior of catalysts under high-coverage conditions.
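The combination of Gaussian process uncertainty with an iterative training loop can be illustrated with a minimal sketch. It assumes scalar inputs, a squared-exponential kernel with fixed hyperparameters, and a selection rule that always adds the candidate with the largest predictive variance; none of these details come from the abstract, and all function names are hypothetical. With fixed hyperparameters the GP predictive variance depends only on the input locations, so the target values (for example DFT energies) are omitted here.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def posterior_variance(train_x, query, noise=1e-6):
    """GP predictive variance at `query` given the training inputs."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(a, query) for a in train_x]
    alpha = solve(K, k_star)
    return rbf(query, query) - sum(k * a for k, a in zip(k_star, alpha))

def active_selection(pool, start, budget):
    """Grow a training set by repeatedly adding the most uncertain candidate."""
    train = [start]
    remaining = [p for p in pool if p != start]
    while remaining and len(train) < budget:
        nxt = max(remaining, key=lambda q: posterior_variance(train, q))
        train.append(nxt)
        remaining.remove(nxt)
    return train

chosen = active_selection([0.0, 0.1, 0.2, 3.0, 6.0], start=0.0, budget=3)
```

In this toy run the loop skips the candidates close to the starting point and picks distant ones first, which is the behaviour that lets an uncertainty-driven scheme cover a space with a small fraction of the available data.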


Optimization of Diabetes Training DATA using Machine Learning Algorithms

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i2.283286 ◽

2018 ◽

Vol 6 (2) ◽

pp. 283-286

Author(s):

M. Samba Siva Rao ◽

M. Yaswanth ◽

K. Raghavendra Swamy ◽

...

Keyword(s):

Machine Learning ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Training Data


Comparison of Machine Learning Performance for Earnings Forecasting

Journal of Taxation and Accounting ◽

10.35850/kjta.20.6.01 ◽

2019 ◽

Vol 20 (6) ◽

pp. 9-34

Author(s):

Woo June Jung

Keyword(s):

Machine Learning ◽

Learning Performance ◽

Earnings Forecasting


Comparative Analysis of Machine Learning Techniques Using Predictive Modeling

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813999200904164539 ◽

2020 ◽

Vol 13 ◽

Author(s):

Ritu Khandelwal ◽

Hemlata Goyal ◽

Rajveer Singh Shekhawat

Keyword(s):

Machine Learning ◽

Comparative Analysis ◽

Data Science ◽

Training Data ◽

Machine Learning Techniques ◽

Future Trends ◽

Data Set ◽

Learning Stage ◽

Learning Techniques ◽

Different Types

Introduction: Machine learning is an intelligent technology that works as a bridge between businesses and data science. With the involvement of data science, the business goal focuses on obtaining valuable insights from available data. Bollywood, a multi-million dollar industry, makes up a large part of Indian cinema. This paper attempts to predict whether an upcoming Bollywood movie will be a Blockbuster, Superhit, Hit, Average, or Flop. For this purpose, machine learning techniques (classification and prediction) are applied. The first step in building a classifier or prediction model is the learning stage, in which the training data set is used to train the model by applying some technique or algorithm; rules are then generated that help to build a model and predict future trends in different types of organizations. Methods: Classification and prediction techniques such as Support Vector Machine (SVM), Random Forest, Decision Tree, Naïve Bayes, Logistic Regression, AdaBoost, and KNN are applied to find efficient and effective results. All of these can be applied through GUI-based workflows organized into categories such as Data, Visualize, Model, and Evaluate. Result: The first step in building a classifier or prediction model is the learning stage, in which the training data set is used to train the model by applying some technique or algorithm; rules are then generated that help to build a model and predict future trends in different types of organizations. Conclusion: This paper focuses on a comparative analysis based on parameters such as accuracy and the confusion matrix to identify the best possible model for predicting movie success. Using advertisement propaganda, production houses can plan the best time to release a movie according to the predicted success rate to gain higher benefits. Discussion: Data mining is the process of discovering patterns in large data sets; the relationships discovered in this way help to solve various business problems and to predict forthcoming trends. This prediction can help production houses with advertisement propaganda and cost planning, and by accounting for these factors they can make a movie more profitable.
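The comparison criteria named in the conclusion, accuracy and the confusion matrix, can be computed as in the short sketch below. The model names, labels, and predictions are made-up toy values used only for illustration, and the helper functions are hypothetical rather than taken from the paper or from any particular library.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def best_model(y_true, predictions):
    """Name of the model with the highest accuracy."""
    return max(predictions, key=lambda name: accuracy(y_true, predictions[name]))

# Toy labels and predictions (illustrative only).
classes = ["Hit", "Flop", "Average"]
y_true = ["Hit", "Flop", "Hit", "Average"]
predictions = {
    "model_a": ["Hit", "Flop", "Flop", "Average"],
    "model_b": ["Hit", "Hit", "Hit", "Hit"],
}
winner = best_model(y_true, predictions)
matrix_a = confusion_matrix(y_true, predictions["model_a"], classes)
```

Accuracy gives a single number for ranking models, while the confusion matrix shows which success categories a model tends to confuse with one another.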
