Missing-Values Adjustment for Mixed-Type Data

Journal of Probability and Statistics ◽

10.1155/2011/290380 ◽

2011 ◽

Vol 2011 ◽

pp. 1-20 ◽

Cited By ~ 1

Author(s):

Agostino Tarsitano ◽

Marianna Falcone

Keyword(s):

Missing Values ◽

Nearest Neighbor ◽

Convex Combination ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Distance Matrices ◽

Type Data ◽

Box Cox Transformation ◽

Statistical Adequacy

We propose a new method of single imputation, reconstruction, and estimation of nonreported, incorrect, implausible, or excluded values in more than one field of the record. In particular, we will be concerned with data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI) where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of proper weighting of the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. Performance of several weighting schemes is compared under a variety of settings in coordination with imputation of the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we will show that this approach is appropriate. Our main contribution has been to demonstrate that mixed data may optimally be combined to allow the accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.
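As a rough illustration of the idea in this abstract, the sketch below builds a global distance as a convex combination of per-type distance matrices and then fills each missing target value from the nearest complete record. It is a minimal sketch under assumptions not stated in the abstract: the function names, the range-scaled and simple-mismatch partial distances, and the equal weights are hypothetical, and a single nearest donor is used in place of the paper's least power mean of Box-Cox transformed donor values.

```python
import numpy as np

def global_distance(partials, weights):
    """Convex combination of partial distance matrices (weights >= 0, sum to 1)."""
    w = np.asarray(weights, dtype=float)
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(wi * Di for wi, Di in zip(w, partials))

def nearest_donor_impute(target, D):
    """Fill each missing target value with the value of the nearest donor record."""
    target = np.asarray(target, dtype=float)
    filled = target.copy()
    donors = np.flatnonzero(~np.isnan(target))      # records with an observed target
    for i in np.flatnonzero(np.isnan(target)):      # records needing imputation
        nearest = donors[np.argmin(D[i, donors])]
        filled[i] = target[nearest]
    return filled

# Toy data (hypothetical): one numeric and one categorical auxiliary variable
x_num = np.array([1.0, 1.2, 5.0, 5.1])
x_cat = np.array(["a", "a", "b", "b"])

D_num = np.abs(x_num[:, None] - x_num[None, :])
D_num = D_num / D_num.max()                                 # scaled to [0, 1]
D_cat = (x_cat[:, None] != x_cat[None, :]).astype(float)    # simple mismatch

D = global_distance([D_num, D_cat], [0.5, 0.5])             # assumed equal weights
y = np.array([10.0, np.nan, 20.0, np.nan])
y_filled = nearest_donor_impute(y, D)
```

Changing the weight vector shifts how much each variable type contributes to the choice of donor, which is the weighting question the abstract raises.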

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in the data mining field. In this paper we introduce a weighted self-organizing map for the clustering, analysis, and visualization of mixed data (continuous/binary). The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher the weight of a variable, the more the clustering algorithm takes into account the information carried by that variable. The learning of these topological maps is combined with a weighting process for the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set, and three other mixed data sets. The results show good quality of the topological ordering and homogeneous clustering.
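To make the role of per-variable weights concrete, the following sketch finds the best-matching prototype under a weighted squared Euclidean distance. It is only an illustration under assumptions: the abstract does not give the exact form of the weighting, the weights here are fixed by hand rather than learned jointly with the prototypes, and the function name and toy values are hypothetical.

```python
import numpy as np

def best_matching_unit(x, prototypes, var_weights):
    """Index of the prototype closest to x under a per-variable weighted squared distance."""
    d2 = ((prototypes - x) ** 2 * var_weights).sum(axis=1)
    return int(np.argmin(d2))

prototypes = np.array([[0.0, 0.0],
                       [1.0, 1.0]])
x = np.array([0.9, 0.2])

# Equal weights: the first variable dominates the raw differences
bmu_equal = best_matching_unit(x, prototypes, np.array([0.5, 0.5]))
# Heavier weight on the second variable changes the winning prototype
bmu_second = best_matching_unit(x, prototypes, np.array([0.1, 0.9]))
```

With equal weights the observation is assigned to the second prototype; weighting the second variable more heavily moves it to the first, which shows how a variable with a high weight steers the clustering.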

Interactive Quantification of Categorical Variables in Mixed Data Sets

2008 12th International Conference Information Visualisation ◽

10.1109/iv.2008.33 ◽

2008 ◽

Cited By ~ 9

Author(s):

Sara Johansson ◽

Mikael Jern ◽

Jimmy Johansson

Keyword(s):

Mixed Data ◽

Categorical Variables ◽

Data Sets

Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches

Political Analysis ◽

10.1093/pan/mpu007 ◽

2014 ◽

Vol 22 (4) ◽

pp. 497-519 ◽

Cited By ~ 26

Author(s):

Jonathan Kropko ◽

Ben Goodrich ◽

Andrew Gelman ◽

Jennifer Hill

Keyword(s):

Multiple Imputation ◽

Categorical Data ◽

Missing Values ◽

Missing At Random ◽

Model Fit ◽

Categorical Variables ◽

Data Sets ◽

Multivariate Normal ◽

Evaluating Methods ◽

Election Studies

We consider the relative performance of two common approaches to multiple imputation (MI): joint multivariate normal (MVN) MI, in which the data are modeled as a sample from a joint MVN distribution; and conditional MI, in which each variable is modeled conditionally on all the others. In order to use the multivariate normal distribution, implementations of joint MVN MI typically assume that categories of discrete variables are probabilistically constructed from continuous values. We use simulations to examine the implications of these assumptions. For each approach, we assess (1) the accuracy of the imputed values; and (2) the accuracy of coefficients and fitted values from a model fit to completed data sets. These simulations consider continuous, binary, ordinal, and unordered-categorical variables. One set of simulations uses multivariate normal data, and one set uses data from the 2008 American National Election Studies. We implement a less restrictive approach than is typical when evaluating methods using simulations in the missing data literature: in each case, missing values are generated by carefully following the conditions necessary for missingness to be “missing at random” (MAR). We find that in these situations conditional MI is more accurate than joint MVN MI whenever the data include categorical variables.
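The MAR condition mentioned above can be illustrated with a small simulation in which the probability that a value is missing depends only on a variable that is always observed. This is a generic sketch, not the authors' simulation design: the variable names, the logistic missingness model, the sample size, and the seed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x = rng.normal(size=n)               # fully observed covariate
y = 2.0 * x + rng.normal(size=n)     # variable that will receive missing values

# MAR: the chance that y is missing depends only on the observed x,
# not on the (possibly unobserved) value of y itself.
p_missing = 1.0 / (1.0 + np.exp(-x))
mask = rng.random(n) < p_missing

y_obs = np.where(mask, np.nan, y)    # y with MAR missingness applied
```

Because missingness is driven by `x`, the records with missing `y` have systematically larger `x`, so complete-case summaries of `y` would be biased while methods that condition on `x` can still recover it.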

Head-to-head comparison of clustering methods for heterogeneous data: a simulation-driven benchmark

Scientific Reports ◽

10.1038/s41598-021-83340-8 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Gregoire Preud’homme ◽

Kevin Duarte ◽

Kevin Dalleau ◽

Claire Lacomblez ◽

Emmanuel Bresso ◽

...

Keyword(s):

Hierarchical Clustering ◽

Latent Class ◽

Latent Class Model ◽

Real Life ◽

Heterogeneous Data ◽

Mixed Data ◽

Categorical Variables ◽

Clustering Methods ◽

Model Based ◽

Partitioning Around Medoids

The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity followed by hierarchical clustering or Partitioning Around Medoids, K-prototypes) clustering methods. Clustering performances were assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
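For readers unfamiliar with the evaluation metric, the Adjusted Rand Index can be computed from the contingency counts of two labelings, as sketched below. The benchmark itself was run with R tools; this Python function is a hypothetical stand-alone illustration of the standard ARI formula and assumes at least two observations.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same observations (n >= 2)."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table cells
    rows = Counter(labels_true)                      # cluster sizes, first labeling
    cols = Counter(labels_pred)                      # cluster sizes, second labeling

    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())

    expected = sum_a * sum_b / comb(n, 2)            # index expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                        # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabeled
ari_cross = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # partitions that disagree
```

The index equals 1 when two partitions agree up to a relabeling of clusters and is close to 0, or negative, when agreement is no better than chance.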

Symmetry Breaking and Training from Incomplete Data with Radial Basis Boltzmann Machines

International Journal of Neural Systems ◽

10.1142/s0129065797000318 ◽

1997 ◽

Vol 08 (03) ◽

pp. 301-315 ◽

Cited By ~ 8

Author(s):

Marcel J. Nijman ◽

Hilbert J. Kappen

Keyword(s):

Symmetry Breaking ◽

Incomplete Data ◽

Missing Values ◽

Nearest Neighbor ◽

Boltzmann Machine ◽

K Nearest Neighbor ◽

Data Set ◽

Input Space ◽

Learning Rules ◽

Radial Basis

A Radial Basis Boltzmann Machine (RBBM) is a specialized Boltzmann Machine architecture that combines feed-forward mapping with probability estimation in the input space, and for which very efficient learning rules exist. The hidden representation of the network displays symmetry breaking as a function of the noise in the dynamics. Thus, generalization can be studied as a function of the noise in the neuron dynamics instead of as a function of the number of hidden units. We show that the RBBM can be seen as an elegant alternative to k-nearest neighbor, leading to comparable performance without the need to store all data. We show that the RBBM has good classification performance compared to the MLP. The main advantage of the RBBM is that, simultaneously with the input-output mapping, a model of the input space is obtained which can be used for learning with missing values. We derive learning rules for the case of incomplete data, and show that they perform better on incomplete data than the traditional learning rules on a 'repaired' data set.

Multiple Regression and K-Nearest-Neighbor Based Algorithm for Estimating Missing Values within Sensor

10.1109/icnisc54316.2021.00116 ◽

2021 ◽

Author(s):

Xiantong Li ◽

Yuan Sui

Keyword(s):

Multiple Regression ◽

Missing Values ◽

Nearest Neighbor ◽

K Nearest Neighbor

A context-intensive approach to imputation of missing values in data sets from networks of environmental monitors

Journal of the Air & Waste Management Association ◽

10.1080/10962247.2015.1108251 ◽

2015 ◽

Vol 66 (1) ◽

pp. 38-52 ◽

Cited By ~ 2

Author(s):

Lawrence C. Larsen ◽

Mena Shah

Keyword(s):

Missing Values ◽

Data Sets ◽

Intensive Approach

ClusterTree: Integration of cluster representation and nearest-neighbor search for large data sets with high dimensions

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2003.1232281 ◽

2003 ◽

Vol 15 (5) ◽

pp. 1316-1337 ◽

Cited By ~ 24

Author(s):

Dantong Yu ◽

Aidong Zhang

Keyword(s):

Nearest Neighbor ◽

Large Data ◽

Nearest Neighbor Search ◽

Large Data Sets ◽

Data Sets ◽

High Dimensions ◽

Neighbor Search ◽

Cluster Representation

Distance estimation in numerical data sets with missing values

Information Sciences ◽

10.1016/j.ins.2013.03.043 ◽

2013 ◽

Vol 240 ◽

pp. 115-128 ◽

Cited By ~ 19

Author(s):

Emil Eirola ◽

Gauthier Doquire ◽

Michel Verleysen ◽

Amaury Lendasse

Keyword(s):

Missing Values ◽

Distance Estimation ◽

Numerical Data ◽

Data Sets

Determination of Reactivity Ratios from Binary Copolymerization Using the k-Nearest Neighbor Non-Parametric Regression

Polymers ◽

10.3390/polym13213811 ◽

2021 ◽

Vol 13 (21) ◽

pp. 3811

Author(s):

Iosif Sorin Fazakas-Anca ◽

Arina Modrea ◽

Sorin Vlase

Keyword(s):

Experimental Data ◽

Nearest Neighbor ◽

Optimization Method ◽

Reactivity Ratios ◽

Data Sets ◽

K Nearest Neighbor ◽

Integration Algorithm ◽

Data Set ◽

Parametric Regression ◽

Non Parametric

This paper proposes a new method for calculating the monomer reactivity ratios for binary copolymerization based on the terminal model. The original optimization method involves a numerical integration algorithm and an optimization algorithm based on k-nearest neighbour non-parametric regression. The calculation method has been tested on simulated and experimental data sets, at low (<10%), medium (10–35%) and high conversions (>40%), yielding reactivity ratios in good agreement with the usual methods such as intersection, Fineman–Ross, reverse Fineman–Ross, Kelen–Tüdös, extended Kelen–Tüdös and the error in variable method. The experimental data sets used in this comparative analysis are the copolymerization of 2-(N-phthalimido) ethyl acrylate with 1-vinyl-2-pyrrolidone for low conversion, the copolymerization of isoprene with glycidyl methacrylate for medium conversion and the copolymerization of N-isopropylacrylamide with N,N-dimethylacrylamide for high conversion. The possibility of estimating experimental errors from a single experimental data set formed by n experimental data is also shown.
