Data Mining: Dirty Data and Data Cleaning

2020 ◽  
Author(s):  
Santosh Kumar Singh ◽  
Dr. Rajiv Kumar Dwivedi
2011 ◽  
Vol 403-408 ◽  
pp. 1804-1807
Author(s):  
Ning Zhao ◽  
Shao Hua Dong ◽  
Qing Tian

In order to optimize electric resistance welded (ERW) tube scheduling, the paper introduces data cleaning, data extraction, and data transformation in detail and defines the sample-attribute datasets, based on an analysis of the ERW welded tube production process. Furthermore, a decision-tree method is adopted to perform the data mining and to summarize scheduling rules, which are validated by an example.
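The rule-summarizing step of a decision-tree approach like the one above can be sketched with the ID3 information-gain criterion: at each node, pick the attribute whose split most reduces label entropy. The attributes below (tube diameter, shift) are hypothetical stand-ins, not the paper's actual sample attributes.

```python
from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Choose the attribute with the highest information gain (ID3 criterion).
    base = entropy(labels)
    best_attr, best_gain = None, -1.0
    for attr in rows[0]:
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            subset = [l for r, l in zip(rows, labels) if r[attr] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr
```

Applied recursively to each resulting subset, this split choice yields the kind of human-readable scheduling rules ("if diameter is large, then ...") the paper validates by example.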


Data quality is a central issue in information quality management. Data quality problems occur everywhere in information systems and are addressed by Data Cleaning (DC). DC is a process used to detect inaccurate, incomplete, or unreasonable data and then improve quality by correcting the detected errors and omissions. Various DC processes have been discussed in previous studies, but there is no standard or formalized DC process. Domain Driven Data Mining (DDDM) is one KDD methodology often used for this purpose. This paper reviews and emphasizes the importance of DC in data preparation. Future work is also highlighted.
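The detect-then-correct cycle described above can be sketched as a rule-based pass over records; the schema (an `age` field with a plausible range) and the median-imputation correction are illustrative assumptions, not a prescription from the paper.

```python
def clean_records(records, lo=0, hi=120):
    """Detect missing or unreasonable ages, then correct by median imputation."""
    valid = [r["age"] for r in records
             if r.get("age") is not None and lo <= r["age"] <= hi]
    median = sorted(valid)[len(valid) // 2]  # simple median of the valid values
    cleaned = []
    for r in records:
        age = r.get("age")
        ok = age is not None and lo <= age <= hi
        # Flag detected errors/omissions and impute a plausible replacement.
        cleaned.append({**r, "age": age if ok else median, "flagged": not ok})
    return cleaned
```

Keeping the `flagged` marker alongside the corrected value preserves an audit trail, which matters when no standardized DC process exists and corrections may need review.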


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Biqiu Li ◽  
Jiabin Wang ◽  
Xueli Liu

Data is an important source of knowledge discovery, but the existence of similar duplicate data not only increases the redundancy of the database but also hampers subsequent data mining work, so cleaning similar duplicate data helps improve work efficiency. Motivated by the complexity of the Chinese language and the performance bottleneck of single-machine systems on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with a k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In converting text to vectors, a position vector is introduced to capture the contextual features of words, and the vectors are adjusted dynamically according to semantics, so that polysemous words obtain different vector representations in different contexts. A parallel implementation of this process is designed on Hadoop. A k-means clustering algorithm is then used to cluster similar duplicate data, achieving the cleaning goal. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar-duplicate cleaning, which is of great significance for subsequent data mining.
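The pipeline's core idea (embed each record as a vector, cluster, then deduplicate within clusters) can be sketched without the heavy parts: below, a toy character-frequency embedding stands in for the BERT encoder, k-means is written out directly, and the Hadoop parallelism is omitted entirely.

```python
import numpy as np

def embed(texts):
    # Toy character-frequency embedding standing in for the paper's BERT
    # encoder (which would produce context-dependent vectors instead).
    vocab = sorted({ch for t in texts for ch in t})
    idx = {ch: i for i, ch in enumerate(vocab)}
    vecs = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for ch in t:
            vecs[row, idx[ch]] += 1.0
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    # Deterministic init on distinct rows avoids duplicate initial centers.
    centers = np.unique(X, axis=0)[:k].astype(float)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```

Records sharing a cluster label are candidate similar duplicates; a cleaner would keep one representative per cluster (or compare pairs within a cluster more carefully) rather than comparing all pairs globally.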


2017 ◽  
Vol 7 (2) ◽  
pp. 121
Author(s):  
REZA ANDREA ◽  
Shinta Palupi ◽  
Siti Qomariah

The inability of students to absorb the various knowledge conveyed by a teacher is often due neither to a lack of understanding nor to the teacher's inability to teach, but rather to a mismatch of learning styles between students and teachers, so that students feel uncomfortable learning from certain teachers; this also occurred at SMKN 2 Penajam Paser Utara (PPU). This research analyzes clusters (groups) of student learning types by applying the data mining methods K-means and Fuzzy C-means (FCM). The goal is to determine the effectiveness of this learning-type clustering on the development of absorptive capacity and the improvement of student achievement. In this research, learning types are clustered through a data mining process consisting of data cleaning, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
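Unlike K-means, FCM assigns each point a graded membership in every cluster, which suits learning styles that are rarely all-or-nothing. A minimal sketch of the standard FCM update loop follows; the 2-D points are hypothetical learning-style scores, not the study's actual student data.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Standard FCM: alternate weighted-center and membership updates."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # random fuzzy memberships
    for _ in range(iters):
        um = u ** m                              # fuzzified memberships
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))         # closer center -> higher weight
        u /= u.sum(axis=1, keepdims=True)
    return u, centers
```

Taking `u.argmax(1)` recovers a hard K-means-style assignment, while the raw memberships show which students sit between learning types.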


Author(s):  
Hirak Dasgupta

In the age of information, the world abounds with data. In order to obtain an intelligent appreciation of current developments, we need to absorb and interpret substantial amounts of data, and the amount collected has grown at a phenomenal rate over the past few years. The computer age has given us both the power to rapidly process, summarize, and analyse data and the encouragement to produce and store more of it. The aim of data mining is to make sense of large amounts of mostly unsupervised data in some domain; it is used to discover patterns and relationships in data, with an emphasis on large observational databases. This chapter compares the two approaches and concludes that statisticians and data miners can profit by studying each other's methods and combining them judiciously. The chapter also discusses the data cleaning techniques involved in data mining.


Author(s):  
Ibrahim Nasir Mahmood ◽  
Hussein Ali Aliedane ◽  
Mustafa Ali Abuzaraida

<p class="0abstract">Due to the increased rate of fire accidents which cause many damages and losses to people souls, material, and property in Basra city. The necessity of analyzing and mining the data of the fire accidents became an urgent need to find a solution. The need increased for a solution that helps to mitigate and reduce the number of accidents. In this paper, data mining techniques and applications including data preprocessing, data cleaning, and data exploration have been applied. Data mining applications is performed to analyze and discover the hidden knowledge in ten years of data (fire accidents happened from 2010 – 2019) which is approximately 20k record of accidents. These data mining techniques along with the association rules algorithm is applied on the dataset. The applied approach and techniques resulted in discovering the patterns and the nature of the fire accidents in Basra city. It also helped to reach to recommendations and resolutions for mitigating the fire accidents and its occurrence rate.</p>

