A Comparative Analysis of Data Cleaning Approaches to Dirty Data

2013 ◽  
Vol 62 (17) ◽  
pp. 30-34
Author(s):  
Sonal Porwal ◽  
Deepali Vora

Dirty data exist in many systems, and efficient, effective management of dirty data is in growing demand. Since data cleaning may discard useful data and even introduce new errors, this research attempts to manage dirty data without cleaning it, returning query results that meet the quality requirements of users. Because the entity is the natural unit for understanding real-world objects, and much dirty data arises from different descriptions of the same real-world entity, this chapter defines an entity data model for managing dirty data and then proposes EntityManager, a dirty-data management system that takes the entity as its basic unit and keeps conflicting values in the data as uncertain attributes. Although its query language is SQL, queries in the system carry different semantics over dirty data. To process queries efficiently, this research proposes a novel index, data operator implementations, and query optimization algorithms for the system.
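The core idea above, keeping conflicting values as uncertain attributes instead of cleaning them away, can be sketched as follows. This is a minimal illustration, not EntityManager's actual data model; the class and method names (`Entity`, `add_observation`, `query_attribute`) and the confidence weights are assumptions for the sketch.

```python
from collections import defaultdict

class Entity:
    """An entity whose attributes keep all conflicting candidate values."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        # attribute name -> {candidate value: confidence}
        self.attrs = defaultdict(dict)

    def add_observation(self, attr, value, confidence):
        # Keep every conflicting value rather than discarding "dirty" ones;
        # if a value recurs, remember its best confidence.
        prev = self.attrs[attr].get(value, 0.0)
        self.attrs[attr][value] = max(prev, confidence)

    def query_attribute(self, attr, min_quality):
        # Return only the candidates meeting the user's quality requirement,
        # so the same stored data answers queries at different quality levels.
        return {v: c for v, c in self.attrs[attr].items() if c >= min_quality}

e = Entity("e1")
e.add_observation("city", "New York", 0.9)
e.add_observation("city", "NYC", 0.4)
print(e.query_attribute("city", 0.5))  # {'New York': 0.9}
```

Lowering `min_quality` to 0.3 would return both candidates, which is how a quality requirement can act as a query-time filter over uncertain attributes.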


Author(s):  
Arif Hanafi ◽  
Sulaiman Harun ◽  
Sofika Enggari ◽  
Larissa Navia Rani

That email has extraordinary significance in modern business communication is certain. Every day, a bulk of emails is sent from organizations to clients and suppliers, from employees to their managers, and from one colleague to another, so data warehouses accumulate vast amounts of email data. Data cleaning is an activity performed on the data sets of a data warehouse to improve and maintain the quality and consistency of the data. This paper highlights the issues associated with dirty data and the detection of duplicates in the email column. It examines data-cleaning strategies from different points of view and provides an algorithm for discovering errors and duplicate entries in the data sets of an existing data warehouse. The paper defines alliance rules, based on the concept of mathematical association rules, to determine duplicate entries in the email column of the data sets.
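The duplicate-detection step the paper targets can be illustrated with a much simpler normalize-and-group sketch. This is not the paper's alliance-rule algorithm; the normalization choices (case folding, dropping `+` tags, Gmail dot-insensitivity) are illustrative assumptions about what makes two addresses "the same".

```python
from collections import defaultdict

def normalize_email(addr):
    # Lower-case, trim, and drop a "+tag" suffix; Gmail-style dot
    # insensitivity is assumed here purely for illustration.
    addr = addr.strip().lower()
    local, _, domain = addr.partition("@")
    local = local.split("+")[0]
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")
        domain = "gmail.com"
    return local + "@" + domain

def find_duplicate_emails(column):
    # Group row indices by normalized address; any group with more than
    # one member is a candidate set of duplicate entries.
    groups = defaultdict(list)
    for i, addr in enumerate(column):
        groups[normalize_email(addr)].append(i)
    return {k: idxs for k, idxs in groups.items() if len(idxs) > 1}

column = ["John.Doe@Gmail.com", "johndoe+news@gmail.com", "a@b.com"]
print(find_duplicate_emails(column))  # {'johndoe@gmail.com': [0, 1]}
```

An association-rule approach would go further, using co-occurring attribute values (name, phone, address) as evidence that two distinct addresses belong to one customer.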


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Samir Al-Janabi ◽  
Ryszard Janicki

Purpose – Data quality is a major challenge in data management. For organizations, the cleanliness of data is a significant problem that affects many business activities. Errors in data occur for different reasons, such as violations of business rules, and because of the huge amount of data, manual cleaning alone is infeasible. Methods are therefore required to detect, repair, and clean dirty data automatically. The purpose of this work is to extend the density-based data cleaning approach using conditional functional dependencies to achieve better data repair.
Design/methodology/approach – A set of conditional functional dependencies is introduced as an input to the density-based data cleaning algorithm, which uses this set to repair inconsistent data.
Findings – The new approach was evaluated through experiments on real-world as well as synthetic datasets, with repair quality measured by the F-measure. The results showed that both the quality and the scalability of the density-based data cleaning approach improved when conditional functional dependencies were introduced.
Originality/value – Conditional functional dependencies capture semantic errors among data values. This work demonstrates that the density-based data cleaning approach can be improved at repairing inconsistent data by using conditional functional dependencies.
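A conditional functional dependency constrains an ordinary functional dependency to rows matching a pattern, e.g. "if zip = 10001 then city must be New York". The sketch below shows only the violation-detection side that would feed a repair algorithm; the dict-based CFD encoding is an illustrative simplification, not the paper's formalism.

```python
def cfd_violations(rows, pattern, consequent):
    """Indices of rows that match `pattern` but fail `consequent`.

    pattern / consequent: dicts mapping attribute -> required constant,
    a simplified encoding of one CFD pattern tuple.
    """
    bad = []
    for i, row in enumerate(rows):
        matches = all(row.get(a) == v for a, v in pattern.items())
        if matches and any(row.get(a) != v for a, v in consequent.items()):
            bad.append(i)
    return bad

rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},   # semantic error: violates the CFD
    {"zip": "07102", "city": "Newark"},   # pattern does not apply
]
print(cfd_violations(rows, {"zip": "10001"}, {"city": "New York"}))  # [1]
```

A repair step would then choose a correction for each flagged row, which is where the density-based scoring in the paper comes in.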


Entity resolution, that is, building the correspondence between objects in dirty data and real-world entities, plays an important role in data cleaning. In bibliography information management systems, confusion between authors and their names often produces dirty data: different authors may share an identical name, and different names may refer to the identical author. The major task of entity resolution is therefore to distinguish entities sharing the same name and to recognize different names referring to the same entity. However, current research focuses on only one of these aspects and cannot solve the problem completely. To address this, the chapter proposes EIF, an entity resolution framework that considers both kinds of confusion. With effective clustering techniques, approximate string matching algorithms, and a flexible mechanism of knowledge integration, EIF can be applied to many different kinds of entity resolution problems. As an application of EIF, the authors solve the author resolution problem, and the effectiveness of the framework is verified by extensive experiments.
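A toy version of the two confusions can be built from approximate string matching plus simple side knowledge. The sketch below greedily clusters author records, merging records whose names are similar and whose coauthor circles overlap, so that similar names split when their circles are disjoint. The similarity threshold and the coauthor heuristic are assumptions for the sketch, not EIF's actual clustering or knowledge-integration rules.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.75):
    # Approximate string matching on lower-cased names.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def resolve(records):
    """records: list of (name, frozenset of coauthors). Greedy clustering."""
    clusters = []  # each: [representative name, known coauthors, member indices]
    for idx, (name, coauthors) in enumerate(records):
        placed = False
        for cluster in clusters:
            rep, known, members = cluster
            # Same entity if names are close AND coauthor circles overlap;
            # disjoint circles keep same-name records apart.
            if similar(name, rep) and (not known or known & coauthors):
                known |= coauthors
                members.append(idx)
                placed = True
                break
        if not placed:
            clusters.append([name, set(coauthors), [idx]])
    return [members for _, _, members in clusters]

recs = [
    ("J. Smith", frozenset({"A. Lee"})),
    ("John Smith", frozenset({"A. Lee", "B. Kim"})),  # same author, new spelling
    ("J. Smith", frozenset({"C. Wong"})),             # same name, different author
]
print(resolve(recs))  # [[0, 1], [2]]
```

Records 0 and 1 merge (similar names, shared coauthor A. Lee) while record 2 stays separate despite the identical name, which is exactly the pair of cases the framework must tell apart.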


Author(s):  
Kumar Rahul ◽  
Rohitash Kumar Banyal

Every business enterprise requires noise-free, clean data, yet dirty data tend to accumulate as the data warehouse continuously loads and refreshes large quantities of data from various sources. Hence, to avoid wrong conclusions, the data cleaning process becomes vital in data-centric projects. This paper introduces a novel data cleaning technique for the effective removal of dirty data. The process involves two steps: (i) dirty data detection and (ii) dirty data cleaning. Dirty data detection comprises data normalization, hashing, clustering, and finding the suspected data; in the clustering step, the optimal selection of the centroid is the key concern and is carried out using an optimization concept. Once dirty data have been detected, the subsequent cleaning step begins; it comprises a leveling process, Huffman coding, and the cleaning of the suspected data, with the cleaning again driven by optimization. To solve all of these optimization problems, a new hybrid algorithm is proposed, the so-called Firefly Update Enabled Rider Optimization Algorithm (FU-ROA), which combines the Rider Optimization Algorithm (ROA) and the Firefly (FF) algorithm. Finally, the performance of the implemented data cleaning method is scrutinized against traditional methods such as Particle Swarm Optimization (PSO), FF, Grey Wolf Optimizer (GWO), and ROA in terms of positive and negative measures. The results show that, at iteration 12 on test case 1, the proposed FU-ROA model performed 0.013%, 0.7%, 0.64%, and 0.29% better than the extant PSO, FF, GWO, and ROA models, respectively.
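The detection pipeline described above (normalize, hash, cluster, flag suspects) can be sketched in simplified form. Here the optimization-based centroid selection is replaced by a plain majority vote within each hash bucket; the prefix-hashing scheme and all names are assumptions for the sketch, not the paper's method.

```python
import hashlib
from collections import Counter, defaultdict

def normalize(value):
    # Step 1: case-fold and collapse whitespace.
    return " ".join(value.strip().lower().split())

def bucket_key(value, prefix=4):
    # Step 2: hash a coarse prefix so near-identical values land together.
    return hashlib.md5(normalize(value)[:prefix].encode()).hexdigest()

def find_suspects(values):
    # Step 3: "cluster" values into hash buckets.
    buckets = defaultdict(list)
    for i, v in enumerate(values):
        buckets[bucket_key(v)].append(i)
    # Step 4: within each bucket, flag values disagreeing with the majority
    # (a stand-in for distance from an optimally chosen centroid).
    suspects = []
    for idxs in buckets.values():
        counts = Counter(normalize(values[i]) for i in idxs)
        majority, _ = counts.most_common(1)[0]
        suspects.extend(i for i in idxs if normalize(values[i]) != majority)
    return suspects

vals = ["New York", "new  york", "New Yrok", "Boston"]
print(find_suspects(vals))  # [2]  ("New Yrok" disagrees with its bucket)
```

The flagged indices would then be handed to the cleaning step, which in the paper repairs them via leveling, Huffman coding, and the FU-ROA-driven correction.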


Author(s):  
Jesmeen M. Z. H ◽  
J. Hossen ◽  
S. Sayeed ◽  
CK Ho ◽  
Tawsif K ◽  
...  

Recently, Big Data has become one of the important new factors in the business field, and organizations need strategies to manage large volumes of structured, unstructured, and semi-structured data. Analyzing data at such a scale to extract meaning and handle uncertain outcomes is challenging. Almost all big data sets are dirty, i.e. they may contain inaccuracies, missing data, miscoding, and other issues that weaken the strength of big data analytics. One of the biggest challenges in big data analytics is to discover and repair dirty data; failure to do this can lead to inaccurate analytics and unreliable conclusions. Data cleaning is an essential part of managing and analyzing data. This survey paper first examines the data quality troubles that may occur in big data processing, to make clear why an organization requires data cleaning, followed by data quality criteria (the dimensions used to indicate data quality). It then summarizes the cleaning tools available on the market and discusses the challenges that the nature of big data poses for cleaning. Finally, it considers how machine learning algorithms can be used to analyze data, make predictions, and clean data automatically.
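The kinds of dirtiness the survey lists (inaccuracies, missing data, miscoding) are often found by profiling columns against expected patterns before analytics. The rule set below is an illustrative assumption, not a tool or criterion from the survey.

```python
import re

# Common placeholders that signal a missing value.
MISSING = {"", "na", "n/a", "null", "none", "-"}

def profile(rows, rules):
    """rows: list of dicts; rules: column -> regex a clean value must match.

    Returns (row index, column, issue) triples for missing or miscoded cells.
    """
    issues = []
    for i, row in enumerate(rows):
        for col, pattern in rules.items():
            val = str(row.get(col, "")).strip()
            if val.lower() in MISSING:
                issues.append((i, col, "missing"))
            elif not re.fullmatch(pattern, val):
                issues.append((i, col, "miscoded"))
    return issues

rows = [
    {"age": "34", "email": "a@b.com"},
    {"age": "n/a", "email": "not-an-email"},
]
rules = {"age": r"\d{1,3}", "email": r"[^@\s]+@[^@\s]+\.[^@\s]+"}
print(profile(rows, rules))  # [(1, 'age', 'missing'), (1, 'email', 'miscoded')]
```

Such rule-based profiling covers the accuracy and completeness dimensions; the machine-learning approaches the survey mentions aim to learn these rules, and the corrections, from the data itself.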


2020 ◽  
Author(s):  
Santosh Kumar Singh ◽  
Dr. Rajiv Kumar Dwivedi
