Data Mining: Dirty Data and Data Cleaning

2020 ◽  
Author(s):  
Santosh Kumar Singh ◽  
Dr. Rajiv Kumar Dwivedi
2011 ◽  
Vol 403-408 ◽  
pp. 1804-1807
Author(s):  
Ning Zhao ◽  
Shao Hua Dong ◽  
Qing Tian

In order to optimize electric resistance welded (ERW) tube scheduling, the paper introduces data cleaning, data extraction, and data transformation in detail and defines the sample-attribute datasets, based on an analysis of the ERW welded tube production process. Furthermore, a decision-tree method is adopted to perform the data mining and to summarize scheduling rules, which are validated by an example.
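The rule-summarizing step of a decision-tree approach like the one above can be sketched with the ID3 information-gain criterion: at each node, pick the attribute whose split most reduces label entropy. The attributes below (tube diameter, shift) are hypothetical stand-ins, not the paper's actual sample attributes.

```python
from collections import Counter
import math

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    # Choose the attribute with the highest information gain (ID3 criterion).
    base = entropy(labels)
    best_attr, best_gain = None, -1.0
    for attr in rows[0]:
        remainder = 0.0
        for v in {r[attr] for r in rows}:
            subset = [l for r, l in zip(rows, labels) if r[attr] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_attr, best_gain = attr, gain
    return best_attr
```

Applied recursively to each resulting subset, this split choice yields the kind of human-readable scheduling rules ("if diameter is large, then ...") the paper validates by example.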


Data quality is a central issue in information quality management. Data quality problems occur everywhere in information systems and are addressed by Data Cleaning (DC). DC is a process used to detect inaccurate, incomplete, or unreasonable data and then improve quality by correcting the detected errors and omissions. Various DC processes have been discussed in previous studies, but there is no standard or formalized DC process. Domain Driven Data Mining (DDDM) is one KDD methodology often used for this purpose. This paper reviews and emphasizes the importance of DC in data preparation. Future work is also highlighted.
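The detect-then-correct cycle described above can be sketched as a rule-based pass over records; the schema (an `age` field with a plausible range) and the median-imputation correction are illustrative assumptions, not a prescription from the paper.

```python
def clean_records(records, lo=0, hi=120):
    """Detect missing or unreasonable ages, then correct by median imputation."""
    valid = [r["age"] for r in records
             if r.get("age") is not None and lo <= r["age"] <= hi]
    median = sorted(valid)[len(valid) // 2]  # simple median of the valid values
    cleaned = []
    for r in records:
        age = r.get("age")
        ok = age is not None and lo <= age <= hi
        # Flag detected errors/omissions and impute a plausible replacement.
        cleaned.append({**r, "age": age if ok else median, "flagged": not ok})
    return cleaned
```

Keeping the `flagged` marker alongside the corrected value preserves an audit trail, which matters when no standardized DC process exists and corrections may need review.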


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Biqiu Li ◽  
Jiabin Wang ◽  
Xueli Liu

Data is an important source of knowledge discovery, but the existence of similar duplicate data not only increases the redundancy of the database but also hampers subsequent data mining work, so cleaning similar duplicate data helps improve work efficiency. Motivated by the complexity of the Chinese language and the performance bottleneck of single-machine systems on large-scale data, this paper proposes a Chinese data cleaning method that combines the BERT model with a k-means clustering algorithm and gives a parallel implementation scheme for the algorithm. In converting text to vectors, a position vector is introduced to capture the contextual features of words, and the vectors are adjusted dynamically according to semantics, so that polysemous words obtain different vector representations in different contexts. A parallel implementation of this process is designed on Hadoop. A k-means clustering algorithm is then used to cluster similar duplicate data, achieving the cleaning goal. Experimental results on a variety of data sets show that the proposed parallel cleaning algorithm not only has good speedup and scalability but also improves the precision and recall of similar-duplicate cleaning, which is of great significance for subsequent data mining.
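The pipeline's core idea (embed each record as a vector, cluster, then deduplicate within clusters) can be sketched without the heavy parts: below, a toy character-frequency embedding stands in for the BERT encoder, k-means is written out directly, and the Hadoop parallelism is omitted entirely.

```python
import numpy as np

def embed(texts):
    # Toy character-frequency embedding standing in for the paper's BERT
    # encoder (which would produce context-dependent vectors instead).
    vocab = sorted({ch for t in texts for ch in t})
    idx = {ch: i for i, ch in enumerate(vocab)}
    vecs = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for ch in t:
            vecs[row, idx[ch]] += 1.0
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def kmeans(X, k, iters=20):
    # Deterministic init on distinct rows avoids duplicate initial centers.
    centers = np.unique(X, axis=0)[:k].astype(float)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels
```

Records sharing a cluster label are candidate similar duplicates; a cleaner would keep one representative per cluster (or compare pairs within a cluster more carefully) rather than comparing all pairs globally.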


2017 ◽  
Vol 7 (2) ◽  
pp. 121
Author(s):  
REZA ANDREA ◽  
Shinta Palupi ◽  
Siti Qomariah

The inability of students to absorb the various knowledge conveyed by a teacher is often due neither to a lack of understanding nor to the teacher's inability to teach, but rather to a mismatch of learning styles between students and teachers, so that students feel uncomfortable learning from certain teachers; this also occurred at SMKN 2 Penajam Paser Utara (PPU). This research analyzes clusters (groups) of student learning types by applying the data mining methods K-means and Fuzzy C-means (FCM). The goal is to determine the effectiveness of this learning-type clustering on the development of absorptive capacity and the improvement of student achievement. In this research, learning types are clustered through a data mining process consisting of data cleaning, data selection, data transformation, data mining, pattern evaluation, and knowledge presentation.
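Unlike K-means, FCM assigns each point a graded membership in every cluster, which suits learning styles that are rarely all-or-nothing. A minimal sketch of the standard FCM update loop follows; the 2-D points are hypothetical learning-style scores, not the study's actual student data.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, iters=100, seed=0):
    """Standard FCM: alternate weighted-center and membership updates."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # random fuzzy memberships
    for _ in range(iters):
        um = u ** m                              # fuzzified memberships
        centers = (um.T @ X) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-10
        u = 1.0 / d ** (2.0 / (m - 1.0))         # closer center -> higher weight
        u /= u.sum(axis=1, keepdims=True)
    return u, centers
```

Taking `u.argmax(1)` recovers a hard K-means-style assignment, while the raw memberships show which students sit between learning types.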


Author(s):  
Hirak Dasgupta

In the age of information, the world abounds with data. In order to obtain an intelligent appreciation of current developments, we need to absorb and interpret substantial amounts of data, and the amount collected has grown at a phenomenal rate over the past few years. The computer age has given us both the power to rapidly process, summarize, and analyse data and the encouragement to produce and store more of it. The aim of data mining is to make sense of large amounts of mostly unsupervised data in some domain; it is used to discover patterns and relationships in data, with an emphasis on large observational databases. This chapter compares the two approaches and concludes that statisticians and data miners can profit by studying each other's methods and combining them judiciously. The chapter also discusses the data cleaning techniques involved in data mining.


Author(s):  
Ibrahim Nasir Mahmood ◽  
Hussein Ali Aliedane ◽  
Mustafa Ali Abuzaraida

<p class="0abstract">Due to the increased rate of fire accidents which cause many damages and losses to people souls, material, and property in Basra city. The necessity of analyzing and mining the data of the fire accidents became an urgent need to find a solution. The need increased for a solution that helps to mitigate and reduce the number of accidents. In this paper, data mining techniques and applications including data preprocessing, data cleaning, and data exploration have been applied. Data mining applications is performed to analyze and discover the hidden knowledge in ten years of data (fire accidents happened from 2010 – 2019) which is approximately 20k record of accidents. These data mining techniques along with the association rules algorithm is applied on the dataset. The applied approach and techniques resulted in discovering the patterns and the nature of the fire accidents in Basra city. It also helped to reach to recommendations and resolutions for mitigating the fire accidents and its occurrence rate.</p>

