Improving the Accuracy of Convolutional Neural Networks by Identifying and Removing Outlier Images in Datasets Using t-SNE

Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 662 ◽  
Author(s):  
Husein Perez ◽  
Joseph H. M. Tah

In the field of supervised machine learning, the quality of a classifier model is directly correlated with the quality of the data used to train it. The presence of unwanted outliers in the data can significantly reduce the accuracy of a model or, worse, result in a biased model leading to inaccurate classification. Identifying outliers and eliminating them is, therefore, crucial for building good-quality training datasets. Pre-processing procedures for dealing with missing and outlier data, commonly known as feature engineering, are standard practice in machine learning problems. They help to make better assumptions about the data and prepare datasets in a way that best exposes the underlying problem to the machine learning algorithms. In this work, we propose a multistage method for detecting and removing outliers in high-dimensional data. Our proposed method uses t-distributed stochastic neighbour embedding (t-SNE) to reduce a high-dimensional map of features to a two-dimensional distribution, and then applies a simple descriptive statistic, the interquartile range (IQR), to identify outlier values in that distribution. t-SNE is a machine learning algorithm and a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualisation in a low-dimensional space of two or three dimensions. We applied this method to a dataset of images used to train a convolutional neural network (ConvNet) for an image classification problem. The dataset contains four classes of images: three classes of construction defects (mould, stain, and paint deterioration) and a no-defect class (normal). Using transfer learning, we modified a pre-trained VGG-16 model and used it both as a feature extractor and as a benchmark for evaluating our method. We have shown that this method can identify and remove the outlier images in the dataset. After removing the outlier images and re-training the VGG-16 model, the results show that classification accuracy improved significantly and the number of misclassified cases dropped. While many feature engineering techniques for handling missing and outlier data are common in predictive machine learning problems involving numerical or categorical data, there is little work on techniques for handling outliers in high-dimensional data that can improve the quality of machine learning involving images, such as ConvNet models for image classification and object detection.
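
A minimal sketch of this pipeline in Python, assuming VGG-16 features have already been extracted from the images; the per-axis test and 1.5×IQR fences below are illustrative choices, not necessarily the authors' exact implementation:

```python
# Sketch: t-SNE embedding followed by IQR-based outlier removal.
# Assumes `features` is an (n_images, n_features) array extracted from
# a pre-trained VGG-16 (e.g. the output of its last pooling layer).
import numpy as np
from sklearn.manifold import TSNE

def find_outlier_indices(features, random_state=0):
    # Reduce the high-dimensional feature map to 2-D with t-SNE.
    embedding = TSNE(n_components=2, random_state=random_state).fit_transform(features)

    # Flag points falling outside the IQR fences on either axis.
    outliers = np.zeros(len(embedding), dtype=bool)
    for axis in range(2):
        q1, q3 = np.percentile(embedding[:, axis], [25, 75])
        iqr = q3 - q1
        lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers |= (embedding[:, axis] < lo) | (embedding[:, axis] > hi)
    return np.where(outliers)[0]

# Usage: drop the flagged images, then re-train the classifier.
# clean_features = np.delete(features, find_outlier_indices(features), axis=0)
```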

2020 ◽  
Author(s):  
Xiao Lai ◽  
Pu Tian

Supervised machine learning, especially deep learning based on a wide variety of neural network architectures, has contributed tremendously to fields such as marketing, computer vision and natural language processing. However, the development of unsupervised machine learning algorithms has been a bottleneck of artificial intelligence. Clustering is a fundamental unsupervised task in many different subjects. Unfortunately, no present algorithm is satisfactory for clustering high-dimensional data with strong nonlinear correlations. In this work, we propose a simple and highly efficient hierarchical clustering algorithm based on encoding by composition rank vectors and a tree structure, and demonstrate its utility by clustering protein structural domains. No record comparison, an expensive step common and essential to all present clustering algorithms, is involved. Consequently, the algorithm achieves hierarchical clustering in linear time and space, and is thus applicable to arbitrarily large datasets. The key factor in this algorithm is the definition of composition, which depends on the physical nature of the target data and therefore needs to be constructed case by case. Nonetheless, the algorithm is general and applicable to any high-dimensional data with strong nonlinear correlations. We hope this algorithm will inspire a rich field of research on encoding-based clustering well beyond composition rank vector trees.
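
The abstract does not spell out the encoding, but the comparison-free idea can be sketched roughly as follows: each record is encoded once into a composition rank vector, and records sharing a rank-vector prefix fall into the same cluster, with deeper prefixes yielding finer clusters. The alphabet, composition function and prefix-depth hierarchy below are illustrative assumptions, not the authors' construction:

```python
# Illustrative sketch of encoding-based hierarchical clustering:
# records are never compared pairwise; each is encoded once into a
# composition rank vector and grouped by prefix (linear time and space).
from collections import Counter, defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # e.g. amino acids for protein domains

def composition_rank_vector(sequence):
    # Rank alphabet symbols from most to least frequent in the record.
    counts = Counter(sequence)
    return tuple(sorted(ALPHABET, key=lambda s: -counts[s]))

def hierarchical_clusters(records, depth):
    # Records sharing the first `depth` ranks land in the same cluster;
    # increasing `depth` refines clusters, giving a hierarchy.
    clusters = defaultdict(list)
    for rec_id, seq in records.items():
        clusters[composition_rank_vector(seq)[:depth]].append(rec_id)
    return clusters

# Usage: coarse = hierarchical_clusters(domains, depth=2)
#        fine   = hierarchical_clusters(domains, depth=5)
```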


Clustering plays a major role in machine learning and in data mining, and deep learning is a fast-growing domain. Adopting deep learning algorithms can improve the quality of clustering results. Many clustering algorithms process various datasets to obtain better results, but clustering high-dimensional data and obtaining quality results with existing algorithms remains an open issue. In this paper, a hybrid clustering algorithm for high-dimensional data is used, and various datasets are employed to evaluate the results.
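
The paper does not specify the hybrid algorithm, but a common way to combine deep learning with clustering of high-dimensional data is to learn a low-dimensional embedding with an autoencoder and then cluster in that space; the sketch below illustrates this generic pattern only:

```python
# Hypothetical hybrid pipeline: autoencoder embedding + k-means.
# A generic illustration, not the paper's specific algorithm.
from sklearn.cluster import KMeans
from tensorflow import keras

def hybrid_cluster(X, latent_dim=10, n_clusters=5, epochs=50):
    n_features = X.shape[1]
    # Dense autoencoder: compress to latent_dim, then reconstruct.
    encoder = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
        keras.layers.Dense(latent_dim),
    ])
    decoder = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
        keras.layers.Dense(n_features),
    ])
    autoencoder = keras.Sequential([encoder, decoder])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X, X, epochs=epochs, verbose=0)

    # Cluster in the learned low-dimensional space.
    Z = encoder.predict(X, verbose=0)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
```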


2021 ◽  
Vol 11 (2) ◽  
pp. 472
Author(s):  
Hyeongmin Cho ◽  
Sangkyun Lee

Machine learning has been proven effective in various application areas, such as object and speech recognition on mobile systems. Since a critical key to machine learning success is the availability of large training datasets, many datasets are being disclosed and published online. From a data consumer's or manager's point of view, measuring data quality is an important first step in the learning process: we need to determine which datasets to use, update, and maintain. However, not many practical ways to measure data quality are available today, especially for large-scale high-dimensional data such as images and videos. This paper proposes two data quality measures that compute class separability and in-class variability, two important aspects of data quality, for a given dataset. Classical data quality measures tend to focus only on class separability; however, we suggest that in-class variability is another important data quality factor. We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping, with statistical benefits on large-scale high-dimensional data. In experiments, we show that our measures are compatible with classical measures on small-scale data and can be computed much more efficiently on large-scale high-dimensional datasets.
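
A hedged sketch of how such measures can be computed cheaply after a random projection; the distance-based definitions of separability and in-class variability below are illustrative, not the paper's exact formulations, and the bootstrapping over subsamples is omitted:

```python
# Sketch: class separability and in-class variability after a random
# projection (which roughly preserves pairwise distances at low cost).
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def quality_measures(X, y, n_components=50, seed=0):
    # Project to a low dimension; assumes n_components <= X.shape[1].
    Z = GaussianRandomProjection(n_components=n_components,
                                 random_state=seed).fit_transform(X)
    classes = np.unique(y)
    centroids = np.array([Z[y == c].mean(axis=0) for c in classes])

    # In-class variability: mean distance of points to their class centroid.
    in_class = np.mean([np.linalg.norm(Z[y == c] - centroids[i], axis=1).mean()
                        for i, c in enumerate(classes)])

    # Class separability: mean distance between distinct class centroids.
    diffs = centroids[:, None, :] - centroids[None, :, :]
    separability = (np.linalg.norm(diffs, axis=-1).sum()
                    / (len(classes) * (len(classes) - 1)))
    return separability, in_class
```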


2021 ◽  
Vol 10 (7) ◽  
pp. 436
Author(s):  
Amerah Alghanim ◽  
Musfira Jilani ◽  
Michela Bertolotto ◽  
Gavin McArdle

Volunteered Geographic Information (VGI) is often collected by non-expert users, which raises concerns about the quality and veracity of such data. There has been much effort to understand and quantify the quality of VGI. Extrinsic measures, which compare VGI to authoritative data sources such as National Mapping Agencies, are common, but the cost and slow update frequency of such data hinder the task. On the other hand, intrinsic measures, which compare the data to heuristics or models built from the VGI data itself, are becoming increasingly popular. Supervised machine learning techniques are particularly suitable for intrinsic measures of quality because they can infer and predict the properties of spatial data. In this article we are interested in assessing the quality of semantic information, such as the road type, associated with data in OpenStreetMap (OSM). We have developed a machine learning approach which utilises new intrinsic input features collected from the VGI dataset. Using our proposed approach we obtained an average classification accuracy of 84.12%, outperforming existing techniques on the same semantic inference task. The trustworthiness of the data used for developing and training machine learning models is also important. To address this issue we have developed a new trustworthiness measure based on direct and indirect characteristics of OSM data, such as its edit history, along with an assessment of the users who contributed the data. An evaluation of the impact of data determined to be trustworthy shows that data trusted under the new measure improves the prediction accuracy of our machine learning technique: the classification accuracy of our model is 87.75% when applied to a trusted dataset and 57.98% when applied to an untrusted dataset. Consequently, such results can be used to assess the quality of OSM and suggest improvements to the dataset.
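
As a rough illustration of the supervised pipeline, the sketch below trains a classifier to predict road type from intrinsic features; the feature names are hypothetical stand-ins for the geometric and edit-history attributes the article describes:

```python
# Sketch: inferring OSM road type from intrinsic features with a
# supervised classifier. Feature names are hypothetical examples of
# geometric and edit-history attributes, not the article's exact set.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["segment_length_m", "n_nodes", "sinuosity",
            "n_edits", "n_contributors", "days_since_last_edit"]

def evaluate_road_type_model(osm_df: pd.DataFrame) -> float:
    X = osm_df[FEATURES]
    y = osm_df["road_type"]  # e.g. motorway, residential, footway, ...
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    # 5-fold cross-validated accuracy over the labelled OSM ways.
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
```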


2017 ◽  
Vol 48 (3) ◽  
pp. 608-641 ◽  
Author(s):  
Akos Rona-Tas ◽  
Antoine Cornuéjols ◽  
Sandrine Blanchemanche ◽  
Antonin Duroy ◽  
Christine Martin

Recently, both the sociology of science and policy research have shown increased interest in scientific uncertainty. To contribute to these debates and create an empirical measure of scientific uncertainty, we inductively devised two systems of classification, or ontologies, to describe scientific uncertainty in a large corpus of food safety risk assessments with the help of machine learning (ML). We ask three questions: (1) Can we use ML to assist with coding complex documents, such as food safety risk assessments, on a difficult topic like scientific uncertainty? (2) Can we use ML to assess the quality of the ontologies we devised? (3) And, finally, does the quality of our ontologies depend on social factors? We found that ML, even in its simplest form, can do surprisingly well at identifying complex meanings, and that it does not benefit from adding certain types of complexity to the analysis. Our ML experiments show that in one ontology, a simple typology, semantic opposites attract each other against expectations, and that the experiments support the taxonomic structure of the other. Finally, we found some evidence that institutional factors influence how well our taxonomy of uncertainty performs, but its ability to capture meaning does not vary greatly across the times, institutional contexts, and cultures we investigated.


2018 ◽  
Vol 210 ◽  
pp. 02016 ◽  
Author(s):  
Tomasz Rymarczyk ◽  
Grzegorz Kłosowski

The article presents four selected methods of supervised machine learning which can be successfully used in the tomography of flood embankments, walls, tanks, reactors and pipes. The following methods were compared: Artificial Neural Networks (ANN), Support Vector Machine (SVM), K-Nearest Neighbour (KNN) and Multivariate Adaptive Regression Splines (MARSplines). All analysed methods concerned regression problems. The analysis quantified and visualized the differences between the methods using indicators such as the regression coefficient and mean squared error. Moreover, an innovative method for denoising tomographic output images using convolutional auto-encoders is presented. A convolutional structure composed of two auto-encoders achieved a significant improvement in the quality of the output image from ECT tomography.
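
A minimal sketch of a convolutional auto-encoder for denoising tomographic reconstructions, assuming Keras and 64×64 single-channel images; the article's two-stage, two-auto-encoder structure and training details are not reproduced here:

```python
# Sketch: a small convolutional auto-encoder for denoising tomographic
# reconstructions. Image size (64x64, 1 channel) and layer sizes are
# assumptions; the article stacks two such auto-encoders.
from tensorflow import keras
from tensorflow.keras import layers

def build_denoising_autoencoder(shape=(64, 64, 1)):
    inputs = keras.Input(shape=shape)
    # Encoder: compress the noisy reconstruction.
    x = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    # Decoder: upsample back to the original resolution.
    x = layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same")(x)
    x = layers.Conv2DTranspose(16, 3, strides=2, activation="relu", padding="same")(x)
    outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model

# Training pairs: noisy ECT reconstructions as input, reference images as target.
# model.fit(noisy_images, clean_images, epochs=100, batch_size=32)
```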

