Interactive Quantification of Categorical Variables in Mixed Data Sets

Author(s):  
Sara Johansson ◽  
Mikael Jern ◽  
Jimmy Johansson
2011 ◽  
pp. 24-32


Author(s):  
Nicoleta Rogovschi ◽  
Mustapha Lebbah ◽  
Younès Bennani

Most traditional clustering algorithms are limited to handling data sets that contain either continuous or categorical variables. However, data sets with mixed types of variables are common in data mining. In this paper we introduce a weighted self-organizing map for clustering, analyzing and visualizing mixed (continuous/binary) data. The weights and prototypes are learned simultaneously, ensuring an optimized data clustering. The higher a variable's weight, the more the clustering algorithm takes into account the information conveyed by that variable. The learning of these topological maps is combined with a weighting process over the different variables, computing weights that influence the quality of the clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, the Zoo data set and three other mixed data sets. The results show a good quality of topological ordering and homogeneous clustering.
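The core mechanism the abstract describes — per-variable weights steering which features dominate the map's unit selection — can be sketched as a feature-weighted distance inside a SOM-style online update. This is a minimal illustrative sketch, not the authors' algorithm; the function and variable names are assumptions.

```python
import numpy as np

def weighted_som_step(prototypes, weights, x, lr=0.1):
    """One online update of a feature-weighted SOM (illustrative sketch).

    prototypes: (n_units, n_features) map prototypes
    weights:    (n_features,) per-variable weights (higher = more influence)
    x:          (n_features,) one mixed observation (binary variables as 0/1)
    """
    # Weighted squared distance: high-weight variables dominate unit selection.
    d = ((prototypes - x) ** 2 * weights).sum(axis=1)
    bmu = int(np.argmin(d))                         # best-matching unit
    prototypes[bmu] += lr * (x - prototypes[bmu])   # move the winner toward x
    return bmu, prototypes

rng = np.random.default_rng(0)
protos = rng.random((4, 3))
w = np.array([1.0, 1.0, 5.0])   # third variable weighted most heavily
bmu, protos = weighted_som_step(protos, w, np.array([0.0, 1.0, 1.0]))
```

In the paper the weights themselves are learned alongside the prototypes; here they are fixed only to keep the sketch short.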


2011 ◽  
Vol 2011 ◽  
pp. 1-20 ◽  
Author(s):  
Agostino Tarsitano ◽  
Marianna Falcone

We propose a new method of single imputation, reconstruction, and estimation of nonreported, incorrect, implausible, or excluded values in more than one field of the record. In particular, we will be concerned with data sets involving a mixture of numeric, ordinal, binary, and categorical variables. Our technique is a variation of the popular nearest neighbor hot deck imputation (NNHDI) where “nearest” is defined in terms of a global distance obtained as a convex combination of the distance matrices computed for the various types of variables. We address the problem of proper weighting of the partial distance matrices in order to reflect their significance, reliability, and statistical adequacy. Performance of several weighting schemes is compared under a variety of settings in coordination with imputation of the least power mean of the Box-Cox transformation applied to the values of the donors. Through analysis of simulated and actual data sets, we will show that this approach is appropriate. Our main contribution has been to demonstrate that mixed data may optimally be combined to allow the accurate reconstruction of missing values in the target variable even when some data are absent from the other fields of the record.
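The global distance underlying the NNHDI variant above — a convex combination of per-type partial distance matrices, used to rank candidate donors — can be sketched as follows. The weights and toy matrices are assumptions for illustration, not the weighting schemes evaluated in the paper.

```python
import numpy as np

def global_distance(partial_dists, alphas):
    """Convex combination of partial distance matrices, one per variable type.

    partial_dists: list of (n, n) distance matrices (numeric, ordinal, ...)
    alphas:        nonnegative weights summing to 1
    """
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0) and np.isclose(alphas.sum(), 1.0)
    return sum(a * D for a, D in zip(alphas, partial_dists))

def nearest_donors(D, target, k=3):
    """Indices of the k records closest to `target` (excluding itself)."""
    order = np.argsort(D[target])
    return [i for i in order if i != target][:k]

# Toy example: two partial matrices over 4 records.
D_num = np.array([[0, 1, 4, 9], [1, 0, 1, 4], [4, 1, 0, 1], [9, 4, 1, 0]], float)
D_cat = np.array([[0, 1, 1, 0], [1, 0, 1, 1], [1, 1, 0, 1], [0, 1, 1, 0]], float)
D = global_distance([D_num, D_cat], [0.7, 0.3])
donors = nearest_donors(D, target=0, k=2)   # donor pool for record 0
```

The missing value in the target field would then be imputed from the donors' values (in the paper, via the least power mean of a Box-Cox transformation).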


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Gregoire Preud’homme ◽  
Kevin Duarte ◽  
Kevin Dalleau ◽  
Claire Lacomblez ◽  
Emmanuel Bresso ◽  
...  

Abstract The choice of the most appropriate unsupervised machine-learning method for “heterogeneous” or “mixed” data, i.e. with both continuous and categorical variables, can be challenging. Our aim was to examine the performance of various clustering strategies for mixed data using both simulated and real-life data. We conducted a benchmark analysis of “ready-to-use” tools in R comparing 4 model-based (Kamila algorithm, Latent Class Analysis, Latent Class Model [LCM] and Clustering by Mixture Modeling) and 5 distance/dissimilarity-based (Gower distance or Unsupervised Extra Trees dissimilarity, each followed by hierarchical clustering or Partitioning Around Medoids, plus K-prototypes) clustering methods. Clustering performances were assessed by Adjusted Rand Index (ARI) on 1000 generated virtual populations consisting of mixed variables using 7 scenarios with varying population sizes, number of clusters, number of continuous and categorical variables, proportions of relevant (non-noisy) variables and degree of variable relevance (low, mild, high). Clustering methods were then applied to the EPHESUS randomized clinical trial data (a heart failure trial evaluating the effect of eplerenone) to illustrate the differences between clustering techniques. The simulations revealed the dominance of K-prototypes, Kamila and LCM models over all other methods. Overall, methods using dissimilarity matrices in classical algorithms such as Partitioning Around Medoids and Hierarchical Clustering had a lower ARI compared to model-based methods in all scenarios. When applying clustering methods to a real-life clinical dataset, LCM showed promising results with regard to differences in (1) clinical profiles across clusters, (2) prognostic performance (highest C-index) and (3) identification of patient subgroups with substantial treatment benefit.
The present findings suggest key differences in clustering performance between the tested algorithms (limited to tools readily available in R). In most of the tested scenarios, model-based methods (in particular the Kamila and LCM packages) and K-prototypes typically performed best in the setting of heterogeneous data.
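The dissimilarity-based branch of the benchmark (Gower distance fed into a classical algorithm, scored by ARI) can be sketched in a few lines. The benchmark itself used R packages; this is a hedged Python analogue with a hand-rolled minimal Gower dissimilarity and synthetic data whose parameters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.metrics import adjusted_rand_score

def gower(num, cat):
    """Minimal Gower dissimilarity: range-scaled numeric differences plus
    categorical mismatches, averaged over all variables."""
    spans = num.max(axis=0) - num.min(axis=0)
    spans[spans == 0] = 1.0
    d_num = np.abs(num[:, None, :] - num[None, :, :]) / spans     # (n, n, p1)
    d_cat = (cat[:, None, :] != cat[None, :, :]).astype(float)    # (n, n, p2)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

# Assumed toy population: 2 clusters, 2 informative numeric vars, 1 noisy categorical.
rng = np.random.default_rng(1)
true = np.repeat([0, 1], 20)
num = rng.normal(true[:, None] * 3.0, 1.0, size=(40, 2))
cat = np.where(rng.random((40, 1)) < 0.9, true[:, None], 1 - true[:, None])

D = gower(num, cat)
labels = fcluster(linkage(squareform(D, checks=False), "average"), 2, "maxclust")
ari = adjusted_rand_score(true, labels)   # benchmark's performance metric
```

Swapping the hierarchical step for Partitioning Around Medoids, or the Gower matrix for an Unsupervised Extra Trees dissimilarity, reproduces the other cells of the benchmark's distance-based grid.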


2019 ◽  
Author(s):  
Benedikt Ley ◽  
Komal Raj Rijal ◽  
Jutta Marfurt ◽  
Nabaraj Adhikari ◽  
Megha Banjara ◽  
...  

Abstract Objective: Electronic data collection (EDC) has become a suitable alternative to paper-based data collection (PBDC) in biomedical research, even in resource-poor settings. During a survey in Nepal, data were collected using both systems and data entry errors were compared between the two methods. Collected data were checked for completeness, values outside of realistic ranges, internal logic and date variables for reasonable time frames. Variables were grouped into 5 categories and the number of discordant entries was compared between both systems, overall and per variable category. Results: Data from 52 variables collected from 358 participants were available. Discrepancies between both data sets were found in 12.6% of all entries (2352/18,616). Differences between data points were identified in 18.0% (643/3,580) of continuous variables, 15.8% of time variables (113/716), 13.0% of date variables (140/1,074), 12.0% of text variables (86/716), and 10.9% of categorical variables (1,370/12,530). Overall, 64% (1,499/2,352) of all discrepancies were due to data omissions, and 76.6% (1,148/1,499) of missing entries were among categorical data. Omissions in PBDC (n=1002) were twice as frequent as in EDC (n=497, p<0.001). Data omissions, specifically among categorical variables, were identified as the greatest source of error. If designed accordingly, EDC can address this shortfall effectively.
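The reported rates follow directly from the raw counts in the abstract (18,616 entries = 52 variables × 358 participants). A quick arithmetic check:

```python
# Reproduce the abstract's discrepancy rates from its raw counts.
total_entries = 52 * 358           # 18,616 entries in each system

counts = {                         # (discordant, total) per variable category
    "continuous":  (643, 3580),
    "time":        (113, 716),
    "date":        (140, 1074),
    "text":        (86, 716),
    "categorical": (1370, 12530),
}

overall = sum(d for d, _ in counts.values()) / sum(n for _, n in counts.values())
rates = {k: round(100 * d / n, 1) for k, (d, n) in counts.items()}
# overall -> 0.1263... (the reported 12.6%); rates match 18.0/15.8/13.0/12.0/10.9
```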


Author(s):  
SUNG-GI LEE ◽  
DEOK-KYUN YUN

In this paper, we present a concept based on the similarity of categorical attribute values considering implicit relationships and propose a new and effective clustering procedure for mixed data. Our procedure obtains similarities between categorical values from careful analysis and maps the values in each categorical attribute into points in two-dimensional coordinate space using multidimensional scaling. These mapped values make it possible to interpret the relationships between attribute values and to directly apply categorical attributes to clustering algorithms using a Euclidean distance. After trivial modifications, our procedure for clustering mixed data uses the k-means algorithm, well known for its efficiency in clustering large data sets. We use the familiar soybean disease and adult data sets to demonstrate the performance of our clustering procedure. The satisfactory results that we have obtained demonstrate the effectiveness of our algorithm in discovering structure in data.
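The pipeline described above — a dissimilarity matrix over the values of a categorical attribute, a 2-D multidimensional-scaling embedding of those values, and a direct application of k-means to the recoded rows — can be sketched as follows. The value dissimilarities here are assumed toy numbers, not the similarities the authors derive from their analysis.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans

# 1. Assumed dissimilarities between the values of one categorical attribute.
values = ["red", "green", "blue"]
dissim = np.array([[0.0, 0.8, 0.9],
                   [0.8, 0.0, 0.3],
                   [0.9, 0.3, 0.0]])

# 2. MDS embeds each categorical value as a point in 2-D Euclidean space.
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
value_to_point = dict(zip(values, coords))

# 3. Recode rows with those coordinates, so k-means applies directly.
rows = ["red", "red", "green", "blue"]
X = np.array([value_to_point[v] for v in rows])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

In the full procedure each categorical attribute gets its own 2-D embedding, and the resulting numeric columns are concatenated with the original continuous variables before clustering.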


2016 ◽  
Vol 23 (4) ◽  
pp. 1009-1031 ◽  
Author(s):  
Anna Glaser ◽  
Sonia Ben Slimane ◽  
Claire Auplat ◽  
Régis Coeurderoy

Purpose
The purpose of this paper is to build a holistic theoretical framework of enabling factors contributing to the development of enterprise in nanotechnology-related industries, in a French context.

Design/methodology/approach
A systematic literature review methodology was adopted. The review used three gauges to identify enabling factors contributing to the development of enterprise in nanotechnology-related industries in a French context: first, it analysed the literature related to the development of nanotechnologies in a perspective of sustainability in a multidisciplinary stance (“Green view”). Second, it took a disciplinary stance by exploring academic journals in the field of entrepreneurship (“Entrepreneurship view”). Third, it studied the perspective of France (“French view”).

Findings
The main finding is that in spite of different approaches and sometimes seemingly conflicting stances, the three views converge on three enabling factors: the importance of knowledge sharing across boundaries, access to university scientists and facilities, and government intervention. However, each view also has its particularities: the “Green view” emphasizes the need for civil society inclusion, the “Entrepreneurship view” underlines the importance of early stage capital and entrepreneurial behaviour and the “French view” concentrates on the role of clusters.

Research limitations/implications
The paper provides a theoretical framework and a starting point for further work on entrepreneurial nanotechnology facilitation. Its findings constitute a benchmark which may be tested in empirical cases. The focus on the French context may be seen as a limitation but also as a source of interesting comparative work focussing on other national or regional contexts.

Practical implications
The paper shows that public policy is an important element in the nascent field of enterprise development for nano-based materials. It outlines how different contexts create different barriers to entrepreneurship, and it proposes recommendations to overcome some of these barriers.

Originality/value
In this paper, findings result from an exploration of the nanotechnology literature that focusses solely on nanotechnology data sets and not on mixed data sets. The use of three different gauges leads to the construction of a holistic theoretical framework that includes enabling factors as well as the types of barriers that entrepreneurs have to overcome to succeed.


2012 ◽  
Vol 2 (1) ◽  
pp. 11-20 ◽  
Author(s):  
Ritu Vijay ◽  
Prerna Mahajan ◽  
Rekha Kandwal

Cluster analysis has been extensively used in machine learning and data mining to discover distribution patterns in the data. Clustering algorithms are generally based on a distance metric in order to partition the data into small groups such that data instances in the same group are more similar than the instances belonging to different groups. In this paper the authors have extended the concept of Hamming distance to categorical data. As a data-processing step, they have transformed the data into a binary representation. The authors have used the proposed algorithm to group data points into clusters. Experiments are carried out on data sets from the UCI machine learning repository to analyze its performance. They conclude that the proposed algorithm shows promising results and can be extended to handle numeric as well as mixed data.
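The preprocessing the authors describe — binary (one-hot) encoding of categorical data, then a Hamming distance between the binary rows — is easy to make concrete. A minimal sketch, with illustrative category names:

```python
import numpy as np

def one_hot(rows, categories):
    """Binary representation: one indicator column per category."""
    return np.array([[int(r == c) for c in categories] for r in rows])

def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return int(np.sum(a != b))

categories = ["sunny", "rainy", "cloudy"]
X = one_hot(["sunny", "rainy", "sunny"], categories)
d01 = hamming(X[0], X[1])   # different categories: both one-hot bits differ
d02 = hamming(X[0], X[2])   # same category: distance 0
```

Any distance-based clustering algorithm can then partition the rows using this metric; extending to mixed data means combining it with a numeric distance on the continuous columns.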


2019 ◽  
Vol 49 (2) ◽  
pp. 153-162 ◽  
Author(s):  
Ricardo González-Quintero ◽  
María Solange Sánchez-Pinzón ◽  
Diana María Bolívar-Vergara ◽  
Ngonidzashe Chirinda ◽  
Jacobo Arango ◽  
...  

In Colombia, cattle-fattening farms account for 20.7% of the Colombian cattle herd and play an important role in terms of economic and social benefits for rural communities. However, few characterization studies have been conducted on these production systems, which limits our understanding of their production dynamics and environmental impacts. This study aimed to characterize very small, small, medium, and large cattle-fattening farms from technical and environmental perspectives. The data analyzed were obtained from the Ganadería Colombiana Sostenible and the LivestockPlus projects, which gathered information from a total of 2618 farms, classified according to their cattle production orientation. From those, 275 cattle-fattening farms were classified as being either very small (1–30 bovines), small (31–50 bovines), medium (51–250 bovines), or large farms (more than 251 bovines). Numerical and categorical variables were distributed into five components: (1) general farm information, (2) composition and management of the herd, (3) pasture management, (4) production information, and (5) environmental information. Each component was analyzed using the factorial analysis of mixed data (FAMD) method. According to FAMD, for the components general farm information, herd composition and management, pasture management, and production information, the distribution of variables led to a spatial separation of the centroids of each category of producers. For the component environmental information, there was no separation of the centroids. Better infrastructure, machinery and equipment, better pasture management, and better productive parameters and practices were observed in larger farms. This suggests that public policies aimed at improving the productive and environmental performance of the livestock sector should give priority to small- and medium-sized livestock producers, considering their farm characteristics.
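FAMD, the method applied to each component above, amounts to a PCA on a mixed table after type-specific scaling: numeric variables are z-scored, while one-hot indicator columns are scaled by 1/sqrt(category proportion) and centered. A hedged sketch on assumed toy data (not the study's farm data):

```python
import numpy as np

rng = np.random.default_rng(0)
num = rng.normal(size=(8, 2))                       # assumed numeric variables
cat = np.array(["a", "b", "a", "a", "b", "b", "a", "b"])  # assumed categorical variable

# Numeric block: standardize to unit variance.
Z_num = (num - num.mean(0)) / num.std(0)

# Categorical block: one-hot indicators, chi-square-style scaling, then centering.
levels = np.unique(cat)
ind = (cat[:, None] == levels[None, :]).astype(float)
props = ind.mean(0)                                 # category proportions
Z_cat = ind / np.sqrt(props)
Z_cat -= Z_cat.mean(0)

# PCA of the combined table via SVD; rows of `scores` position each farm
# (here, each toy record) in the factorial plane.
Z = np.hstack([Z_num, Z_cat])
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U[:, :2] * s[:2]                           # first two FAMD components
```

Plotting the category centroids of `scores` per producer group is what reveals (or fails to reveal) the spatial separation the study reports.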

