Improving accuracy of missing data imputation in data mining

Kurdistan Journal of Applied Research ◽

10.24017/science.2017.3.30 ◽

2017 ◽

Vol 2 (3) ◽

pp. 66-73

Author(s):

Nzar A. Ali ◽

Zhyan M. Omer

Keyword(s):

Data Mining ◽

Missing Data ◽

Real World ◽

Missing Values ◽

Large Data ◽

Data Repository ◽

Data Sets ◽

Real World Data ◽

Missing Data Imputation ◽

Improving Accuracy

Raw data in the real world is dirty. Every large data repository contains various types of anomalous values that influence the results of analysis. In data mining, good models usually need good data, yet real-world databases are not always clean and include noise, incomplete data, duplicate records, inconsistent data, and missing values. Missing data is a common drawback in many real-world data sets. In this paper, we propose an algorithm that improves on the imputation procedure of the MIGEC algorithm for dealing with missing values. We apply grey relational analysis (GRA) to attribute values instead of instance values. The missing data are initially imputed by mean imputation, and each value then estimated by our proposed algorithm (PA) is used as a complete value when imputing the next missing value. We compare our proposed algorithm with several other algorithms, such as MMS, HDI, KNNMI, FCMOCS, CRI, CMI, NIIA, and MIGEC, under different missingness mechanisms. Experimental results demonstrate that the proposed algorithm has lower RMSE values than the other algorithms under all missingness mechanisms.
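
The evaluation setup this abstract describes (start from mean imputation, then score the imputed values by RMSE against the known truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' proposed algorithm: the helper names `mean_impute` and `rmse` and the toy matrix are hypothetical, every column is assumed to have at least one observed value, and the GRA-based refinement step is omitted because the abstract does not specify it.

```python
import math


def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: each column has at least one observed value.
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def rmse(truth, imputed, mask):
    """Root mean squared error over the (row, col) cells that were missing."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in mask]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: hide two known cells, impute them, score the result.
truth = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
mask = [(0, 1), (2, 0)]
incomplete = [row[:] for row in truth]
for i, j in mask:
    incomplete[i][j] = None

imputed = mean_impute(incomplete)
score = rmse(truth, imputed, mask)
```

Any refinement method can be compared against this baseline by swapping it in for `mean_impute` and reusing the same `mask` and `rmse`.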

The Issue of Missing Values in Data Mining

Encyclopedia of Data Warehousing and Mining, Second Edition ◽

10.4018/978-1-60566-010-3.ch171 ◽

2011 ◽

pp. 1102-1109

Author(s):

Malcolm J. Beynon

Keyword(s):

Data Mining ◽

Missing Data ◽

Incomplete Data ◽

Missing Values ◽

Large Data ◽

Original Data ◽

Customer Relationship ◽

Data Sets ◽

Data Mining Technique ◽

Data Set

The essence of data mining is to search for pertinent information that may exist in data (often large data sets). The immeasurably large amount of data present in the world, due to the increasing capacity of storage media, manifests the issue of the presence of missing values (Olinsky et al., 2003; Brown and Kros, 2003). This encyclopaedia article considers the general issue of missing values in data mining and demonstrates, through the use of a data mining technique, the effect of managing or not managing their presence. The issue of missing values was first exposited over forty years ago in Afifi and Elashoff (1966). Since then it has continually been the focus of study and explanation (El-Masri and Fox-Wasylyshyn, 2005), covering issues such as the nature of their presence and management (Allison, 2000). With this in mind, the naïve consistent aspect of the missing value debate is the limited set of general strategies available for their management, the main two being either the simple deletion of cases with missing data or some form of imputation of the missing values (see Elliott and Hawthorne, 2005). Examples of the specific investigation of missing data (and data quality) include those in data warehousing (Ma et al., 2000) and customer relationship management (Berry and Linoff, 2000). An alternative strategy considered is the retention of the missing values, and their subsequent ‘ignorance’ contribution in any data mining undertaken on the associated original incomplete data set. A consequence of this retention is that full interpretability can be placed on the results found from the original incomplete data set. This strategy can be followed when using the nascent CaRBS technique for object classification (Beynon, 2005a, 2005b). CaRBS analyses are presented here to illustrate that data mining can manage the presence of missing values in a much more effective manner than the more inhibitory traditional strategies.
An example data set is considered, with a noticeable level of missing values present in the original data set. A critical increase in the number of missing values present in the data set further illustrates the benefit from ‘intelligent’ data mining (in this case using CaRBS).

Comparison of Missing Data Infilling Mechanisms for Recovering a Real-World Single Station Streamflow Observation

International Journal of Environmental Research and Public Health ◽

10.3390/ijerph18168375 ◽

2021 ◽

Vol 18 (16) ◽

pp. 8375

Author(s):

Thelma Dede Baddoo ◽

Zhijia Li ◽

Samuel Nii Odai ◽

Kenneth Rodolphe Chabi Boni ◽

Isaac Kwesi Nooni ◽

...

Keyword(s):

Missing Data ◽

Multiple Imputation ◽

Real World ◽

Missing Values ◽

Total Error ◽

Extensive Study ◽

Error Measurement ◽

Missing Data Imputation ◽

Single Station ◽

Real World Datasets

Reconstructing missing streamflow data can be challenging when additional data are not available, and studies that impute missing data in real-world datasets to investigate how to ascertain the accuracy of imputation algorithms for these datasets are lacking. This study investigated the necessary complexity of missing data reconstruction schemes to obtain the relevant results for a real-world single station streamflow observation to facilitate its further use. This investigation was implemented by applying different missing data infilling mechanisms, spanning from univariate algorithms to multiple imputation methods designed for multivariate data, taking time as an explicit variable. The performance accuracy of these schemes was assessed using the total error measurement (TEM) and a localized error measurement (LEM) recommended in this study. The results show that univariate missing value algorithms, which are specially developed to handle univariate time series, provide satisfactory results, but the ones which provide the best results are usually time and computationally intensive. Also, multiple imputation algorithms which consider the surrounding observed values and/or which can understand the characteristics of the data provide similar results to the univariate missing data algorithms and, in some cases, perform better without the added time and computational downsides when time is taken as an explicit variable. Furthermore, the LEM would be especially useful when the missing data are in specific portions of the dataset or where very large gaps of ‘missingness’ occur. Finally, proper handling of missing values of real-world hydroclimatic datasets depends on imputing and extensive study of the particular dataset to be imputed.

Predictive Analytics with Strategically Missing Data

INFORMS Journal on Computing ◽

10.1287/ijoc.2019.0947 ◽

2020 ◽

Author(s):

Juheng Zhang ◽

Xiaoping Liu ◽

Xiao-Bai Li

Keyword(s):

Missing Data ◽

Financial Reporting ◽

Real World ◽

Missing Values ◽

Predictive Analytics ◽

Support Vector ◽

Real World Data ◽

Novel Approach ◽

Strategic Behaviors ◽

Job Application

We study strategically missing data problems in predictive analytics with regression. In many real-world situations, such as financial reporting, college admission, job application, and marketing advertisement, data providers often conceal certain information on purpose in order to gain a favorable outcome. It is important for the decision-maker to have a mechanism to deal with such strategic behaviors. We propose a novel approach to handle strategically missing data in regression prediction. The proposed method derives imputation values of strategically missing data based on the Support Vector Regression models. It provides incentives for the data providers to disclose their true information. We show that with the proposed method imputation errors for the missing values are minimized under some reasonable conditions. An experimental study on real-world data demonstrates the effectiveness of the proposed approach.

Analyzing the impact of missing values and selection bias on fairness

International Journal of Data Science and Analytics ◽

10.1007/s41060-021-00259-z ◽

2021 ◽

Author(s):

Yanchen Wang ◽

Lisa Singh

Keyword(s):

Selection Bias ◽

Real World ◽

Missing Values ◽

Empirical Evaluation ◽

Data Sets ◽

Generation Process ◽

Real World Data ◽

World Data ◽

Impact Prediction ◽

The Impact

Algorithmic decision making is becoming more prevalent, increasingly impacting people’s daily lives. Recently, discussions have been emerging about the fairness of decisions made by machines. Researchers have proposed different approaches for improving the fairness of these algorithms. While these approaches can help machines make fairer decisions, they have been developed and validated on fairly clean data sets. Unfortunately, most real-world data have complexities that make them dirtier. This work considers two of these complexities by analyzing the impact of two real-world data issues on fairness—missing values and selection bias—for categorical data. After formulating this problem and showing its existence, we propose fixing algorithms for data sets containing missing values and/or selection bias that use different forms of reweighting and resampling based upon the missing value generation process. We conduct an extensive empirical evaluation on both real-world and synthetic data using various fairness metrics, and demonstrate how different missing values generated from different mechanisms and selection bias impact prediction fairness, even when prediction accuracy remains fairly constant.

SATORI: A System for Ontology-Guided Visual Exploration of Biomedical Data Repositories

10.1101/046755 ◽

2016 ◽

Author(s):

Fritz Lekschas ◽

Nils Gehlenborg

Keyword(s):

Real World ◽

Web Application ◽

Large Data ◽

Current Data ◽

Visual Exploration ◽

Data Sets ◽

Biomedical Data ◽

Structured Interviews ◽

Real World Data ◽

Data Repositories

The ever-increasing number of biomedical data sets provides tremendous opportunities for re-use but current data repositories provide limited means of exploration apart from text-based search. Ontological metadata annotations provide context by semantically relating data sets. Visualizing this rich network of relationships can improve the explorability of large data repositories and help researchers find data sets of interest. We developed SATORI—an integrative search and visual exploration interface for the exploration of biomedical data repositories. The design is informed by a requirements analysis through a series of semi-structured interviews. We evaluated the implementation of SATORI in a field study on a real-world data collection. SATORI enables researchers to seamlessly search, browse, and semantically query data repositories via two visualizations that are highly interconnected with a powerful search interface. SATORI is an open-source web application, which is freely available at http://satori.refinery-platform.org and integrated into the Refinery Platform.

Data Preparation in Large Real-World Data Mining Projects: Methods for Imputing Missing Values

Studies in Classification, Data Analysis, and Knowledge Organization - Exploratory Data Analysis in Empirical Research ◽

10.1007/978-3-642-55721-7_26 ◽

2003 ◽

pp. 248-256 ◽

Cited By ~ 2

Author(s):

Th. Liehr

Keyword(s):

Data Mining ◽

Real World ◽

Missing Values ◽

Data Preparation ◽

Real World Data ◽

World Data ◽

Mining Projects

Genomic data imputation with variational auto-encoders

GigaScience ◽

10.1093/gigascience/giaa082 ◽

2020 ◽

Vol 9 (8) ◽

Author(s):

Yeping Lina Qiu ◽

Hong Zheng ◽

Olivier Gevaert

Keyword(s):

Deep Learning ◽

Missing Data ◽

Large Scale ◽

Missing Values ◽

Genomic Data ◽

Large Data ◽

Missing At Random ◽

Data Sets ◽

Data Imputation ◽

Missing Not At Random

Background: As missing values are frequently present in genomic data, practical methods to handle missing data are necessary for downstream analyses that require complete data sets. State-of-the-art imputation techniques, including methods based on singular value decomposition and K-nearest neighbors, can be computationally expensive for large data sets and it is difficult to modify these algorithms to handle certain cases not missing at random. Results: In this work, we use a deep-learning framework based on the variational auto-encoder (VAE) for genomic missing value imputation and demonstrate its effectiveness in transcriptome and methylome data analysis. We show that in the vast majority of our testing scenarios, VAE achieves similar or better performances than the most widely used imputation standards, while having a computational advantage at evaluation time. When dealing with data missing not at random (e.g., few values are missing), we develop simple yet effective methodologies to leverage the prior knowledge about missing data. Furthermore, we investigate the effect of varying latent space regularization strength in VAE on the imputation performances and, in this context, show why VAE has a better imputation capacity compared to a regular deterministic auto-encoder. Conclusions: We describe a deep learning imputation framework for transcriptome and methylome data using a VAE and show that it can be a preferable alternative to traditional methods for data imputation, especially in the setting of large-scale data and certain missing-not-at-random scenarios.
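
For orientation, the K-nearest-neighbors baseline named in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's VAE method and not any particular library's implementation. It assumes Euclidean distance averaged over the features both rows have observed, and at least one donor row per missing cell. The function name `knn_impute` and the toy matrix are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill each None with the column mean over the k nearest donor rows."""

    def distance(a, b):
        # Compare only the features that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are other rows in which column j is observed.
            donors = [(distance(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Hypothetical toy matrix: the first row is missing its third feature.
data = [
    [1.0, 2.0, None],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 5.0],
    [9.0, 9.0, 50.0],
]
result = knn_impute(data, k=2)
```

Each missing cell requires a distance computation against every other row, which illustrates why the abstract describes this family of methods as computationally expensive on large data sets.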

Applications of clustering algorithms and self organizing maps as data mining and business intelligence tools on real world data sets

2010 International Conference on Methods and Models in Computer Science (ICM2CS-2010) ◽

10.1109/icm2cs.2010.5706714 ◽

2010 ◽

Cited By ~ 1

Author(s):

L Singh ◽

S Singh ◽

P K Dubey

Keyword(s):

Data Mining ◽

Real World ◽

Business Intelligence ◽

Clustering Algorithms ◽

Data Sets ◽

Real World Data ◽

Self Organizing Maps ◽

World Data ◽

Self Organizing

Hfinger: Malware HTTP Request Fingerprinting

Entropy ◽

10.3390/e23050507 ◽

2021 ◽

Vol 23 (5) ◽

pp. 507

Author(s):

Piotr Białczak ◽

Wojciech Mazurczyk

Keyword(s):

Real World ◽

Network Traffic ◽

Experimental Evaluation ◽

Data Sets ◽

Real World Data ◽

Malicious Software ◽

Default Mode ◽

World Data ◽

Effectiveness Analysis ◽

Http Protocol

Malicious software utilizes the HTTP protocol for communication purposes, creating network traffic that is hard to identify as it blends into the traffic generated by benign applications. To this aim, fingerprinting tools have been developed to help track and identify such traffic by providing a short representation of malicious HTTP requests. However, currently existing tools do not analyze all information included in the HTTP message or analyze it insufficiently. To address these issues, we propose Hfinger, a novel malware HTTP request fingerprinting tool. It extracts information from the parts of the request such as URI, protocol information, headers, and payload, providing a concise request representation that preserves the extracted information in a form interpretable by a human analyst. For the developed solution, we have performed an extensive experimental evaluation using real-world data sets and we also compared Hfinger with the most related and popular existing tools such as FATT, Mercury, and p0f. The conducted effectiveness analysis reveals that on average only 1.85% of requests fingerprinted by Hfinger collide between malware families, which is 8–34 times lower than for existing tools. Moreover, unlike these tools, in default mode, Hfinger does not introduce collisions between malware and benign applications and achieves this by increasing the number of fingerprints by at most 3 times. As a result, Hfinger can effectively track and hunt malware by providing more unique fingerprints than other standard tools.
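
A collision percentage like the one quoted above could be computed along the following lines. This is only a sketch under an assumed definition: a request "collides" when its fingerprint also occurs in at least one other malware family. The abstract does not give the exact metric, and the function name `collision_rate`, the family labels, and the fingerprint strings are hypothetical and not Hfinger output.

```python
from collections import defaultdict


def collision_rate(requests):
    """Fraction of requests whose fingerprint is seen in more than one family.

    `requests` is a list of (family, fingerprint) pairs.
    """
    families_by_fingerprint = defaultdict(set)
    for family, fingerprint in requests:
        families_by_fingerprint[fingerprint].add(family)
    colliding = sum(1 for _, fingerprint in requests
                    if len(families_by_fingerprint[fingerprint]) > 1)
    return colliding / len(requests)


# Hypothetical sample: fingerprint "f2" is shared by families A and B.
sample = [("A", "f1"), ("A", "f2"), ("B", "f2"), ("B", "f3")]
rate = collision_rate(sample)
```

Under this definition, a tool that produces more distinct fingerprints tends to lower the rate, which matches the trade-off between collisions and fingerprint count noted in the abstract.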

Different algorithms, different models

Quality & Quantity ◽

10.1007/s11135-021-01193-9 ◽

2021 ◽

Author(s):

Martyna Daria Swiatczak

Keyword(s):

Comparative Analysis ◽

Real World ◽

Qualitative Comparative Analysis ◽

Comparative Methods ◽

Data Sets ◽

Simulation Studies ◽

Threshold Values ◽

Real World Data ◽

Software Packages ◽

Methodological Approaches

This study assesses the extent to which the two main Configurational Comparative Methods (CCMs), i.e. Qualitative Comparative Analysis (QCA) and Coincidence Analysis (CNA), produce different models. It further explains how this non-identity is due to the different algorithms upon which both methods are based, namely QCA’s Quine–McCluskey algorithm and the CNA algorithm. I offer an overview of the fundamental differences between QCA and CNA and demonstrate both underlying algorithms on three data sets of ascending proximity to real-world data. Subsequent simulation studies in scenarios of varying sample sizes and degrees of noise in the data show high overall ratios of non-identity between the QCA parsimonious solution and the CNA atomic solution for varying analytical choices, i.e. different consistency and coverage threshold values and ways to derive QCA’s parsimonious solution. Clarity on the contrasts between the two methods is supposed to enable scholars to make more informed decisions on their methodological approaches, enhance their understanding of what is happening behind the results generated by the software packages, and better navigate the interpretation of results. Clarity on the non-identity between the underlying algorithms and their consequences for the results is supposed to provide a basis for a methodological discussion about which method and which variants thereof are more successful in deriving which search target.
