An Enhanced Hybrid Clustering Approach for Privacy Preservation (ECPS) in Big Data using Apache Spark Framework

Ensuring the privacy of big data stored in a cloud system is a demanding and critical task. Big data comprises a huge volume of records, so security measures and rules are required to assure confidentiality. Traditional works have therefore developed different techniques that aim to guarantee the privacy of big data through key generation, encryption, and anonymization mechanisms. However, these techniques suffer from increased time consumption, computational complexity, and error rate. The proposed work therefore aims to design an enhanced mechanism for secure big data storage. A user bank dataset is taken as the input and is protected from unauthorized users by guaranteeing both the privacy and the secrecy of the data. First, the raw dataset is preprocessed to improve data quality and correctness. Then, security policies (i.e., rules) that allow restricted access to the data are generated using an Improved FP-Growth (IFP-G) algorithm. Next, the data attributes are classified as sensitive or non-sensitive, based on the extracted features, using an Enhanced Random Forest (ERF) classification technique. Finally, the privacy of the users' personal information and other details is protected using a Modified Incognito Anonymization based Privacy Preservation (MIA-PP) algorithm. These enhanced mechanisms guarantee the security and confidentiality of the big data with reduced time consumption and increased accuracy. In the experimental evaluation, the results of the proposed privacy mechanism are analyzed and compared using different measures. Several existing anonymization and classification techniques are also considered to demonstrate the improvement achieved by the proposed technique.
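The anonymization step can be illustrated with a small sketch. The abstract does not describe the internals of MIA-PP, so the code below is not that algorithm; it only shows the general idea behind Incognito-style anonymization, namely searching a lattice of generalization levels for the least generalized combination that satisfies k-anonymity. The toy records, the choice of age and ZIP code as quasi-identifiers, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records: (age, zip_code, disease). Age and ZIP code are
# treated as quasi-identifiers; disease is the sensitive value and is untouched.
RECORDS = [
    (23, "47601", "flu"),
    (27, "47602", "cold"),
    (25, "47605", "flu"),
    (41, "47901", "asthma"),
    (45, "47902", "flu"),
    (48, "47909", "cold"),
]


def generalize_age(age, level):
    """Level 0: exact age, level 1: 10-year band, level 2: suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    """Level n masks the last n digits of the ZIP code with '*'."""
    keep = len(zip_code) - level
    return zip_code[:keep] + "*" * level


def is_k_anonymous(records, age_level, zip_level, k):
    """True if every quasi-identifier group holds at least k records."""
    groups = Counter(
        (generalize_age(age, age_level), generalize_zip(zip_code, zip_level))
        for age, zip_code, _ in records
    )
    return min(groups.values()) >= k


def minimal_generalization(records, k, max_age_level=2, max_zip_level=5):
    """Scan the level lattice from least to most generalized and return the
    first (age_level, zip_level) pair that is k-anonymous, or None."""
    lattice = sorted(
        product(range(max_age_level + 1), range(max_zip_level + 1)),
        key=lambda levels: (sum(levels), levels),
    )
    for age_level, zip_level in lattice:
        if is_k_anonymous(records, age_level, zip_level, k):
            return age_level, zip_level
    return None


print(minimal_generalization(RECORDS, k=3))  # (1, 1)
```

With k = 3, the toy data needs 10-year age bands and one masked ZIP digit. A full Incognito implementation additionally prunes the lattice using subsets of the quasi-identifiers, which this exhaustive scan omits.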

A Robust Distributed Big Data Clustering-based on Adaptive Density Partitioning using Apache Spark

Symmetry ◽ 10.3390/sym10080342 ◽ 2018 ◽ Vol 10 (8) ◽ pp. 342 ◽ Cited By ~ 3

Author(s): Behrooz Hosseini ◽ Kourosh Kiani

Keyword(s): Big Data ◽ Data Clustering ◽ Local Density ◽ Apache Spark ◽ Locality Sensitive Hashing ◽ Weighted Averaging ◽ Noise Robustness ◽ Locality Preservation ◽ Density Peaks ◽ Clustering Approach

Unsupervised machine learning and knowledge discovery from large-scale datasets have recently attracted a lot of research interest. The present paper proposes a distributed big data clustering approach based on adaptive density estimation. The proposed method is developed on the Apache Spark framework and tested on several widely used datasets. In the first step of the algorithm, the input data is divided into partitions using a Bayesian type of Locality Sensitive Hashing (LSH). Partitioning makes the processing fully parallel and much simpler by avoiding unneeded calculations. Each step of the proposed algorithm is completely independent of the others, and no serial bottleneck exists anywhere in the clustering procedure. Locality preservation also filters out outliers and enhances the robustness of the proposed approach. Density is defined on the basis of the Ordered Weighted Averaging (OWA) distance, which makes clusters more homogeneous. According to the density of each node, the local density peaks are detected adaptively. The final cluster centers are obtained by merging the local peaks, and every other data point becomes a member of the cluster with the nearest center. The proposed method has been implemented and compared with similar recently published methods. The cluster validity indexes achieved by the proposed method show its superiority in precision and noise robustness over recent work. Comparison with similar approaches also shows the superiority of the proposed method in scalability, performance, and computation cost. The proposed method is a general clustering approach, and it has been applied to gene expression clustering as a sample application.
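The pipeline described above (partition with LSH, find local density peaks, merge peaks into centers, assign points to the nearest center) can be sketched in a few lines. This is a simplified single-machine illustration, not the authors' method: it swaps in plain random-hyperplane LSH for the Bayesian LSH, and a neighbor count within a fixed radius for the OWA-based density. The sample points, the `radius` and `merge_radius` parameters, and all function names are assumptions.

```python
import math
import random
from collections import defaultdict

# Hypothetical sample data: two well-separated groups of 2-D points.
POINTS = [
    (0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.2, 0.3),
    (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5), (10.2, 10.3),
]


def lsh_partition(points, num_planes=2, seed=0):
    """Bucket points by the sign pattern of random hyperplane projections."""
    rng = random.Random(seed)
    dim = len(points[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    buckets = defaultdict(list)
    for p in points:
        signature = tuple(
            sum(w * x for w, x in zip(plane, p)) >= 0 for plane in planes
        )
        buckets[signature].append(p)
    return list(buckets.values())


def local_peaks(partition, radius):
    """Return (density, point) for points with no strictly denser neighbor.

    Density is the number of partition members within `radius` (a stand-in
    for the OWA-based density of the paper)."""
    density = [
        sum(1 for q in partition if math.dist(p, q) <= radius) for p in partition
    ]
    peaks = []
    for i, p in enumerate(partition):
        has_denser_neighbor = any(
            density[j] > density[i] and math.dist(p, q) <= radius
            for j, q in enumerate(partition)
        )
        if not has_denser_neighbor:
            peaks.append((density[i], p))
    return peaks


def merge_peaks(peaks, merge_radius):
    """Keep the densest peaks first; drop peaks close to a kept center."""
    centers = []
    for _, p in sorted(peaks, key=lambda item: -item[0]):
        if all(math.dist(p, c) > merge_radius for c in centers):
            centers.append(p)
    return centers


def cluster(points, radius=1.5, merge_radius=3.0):
    """Partition, find local peaks per partition, merge, assign labels."""
    peaks = []
    for partition in lsh_partition(points):
        peaks.extend(local_peaks(partition, radius))
    centers = merge_peaks(peaks, merge_radius)
    labels = [
        min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        for p in points
    ]
    return centers, labels


centers, labels = cluster(POINTS)
print(len(centers), labels)
```

Because `local_peaks` only looks at one partition at a time, that step has no dependency between partitions, which is the property that would let a framework such as Spark run it in parallel (for example with `RDD.mapPartitions`).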

Big Data Privacy Preservation Using Two Phase Top-Down Specialization Algorithm with Multidimensional Map Reduce Framework on Hadoop

International Journal of Distributed and Cloud Computing ◽ 10.21863/ijdcc/2015.3.2.009 ◽ 2015 ◽ Vol 3 (2)

Author(s): Shalin Eliabeth S. ◽ Sarju S.

Keyword(s): Big Data ◽ Data Privacy ◽ Privacy Preservation ◽ Experimental Result ◽ Map Reduce ◽ Distributed Environment ◽ Top Down ◽ Two Phase ◽ Data Anonymization ◽ Big Data Privacy

Big data privacy preservation is one of the most pressing issues in industry today. Data privacy problems sometimes go unidentified when input data is published in a cloud environment. Data privacy preservation in Hadoop deals with hiding the input dataset and publishing it to the distributed environment. This paper investigates the problem of big data anonymization for privacy preservation from the perspectives of scalability and execution time, among others. At present, many cloud applications that use big data anonymization face the same kinds of problems. To overcome these problems, a data anonymization algorithm called Two Phase Top-Down Specialization (TPTDS) is introduced and implemented in Hadoop. For the data anonymization, 45,222 records of adult information with 15 attribute values were taken as the input big data. The proposed TPTDS algorithm is implemented in Hadoop using multidimensional anonymization in the MapReduce framework, which increases the efficiency of the big data processing system. Experiments with the TPTDS algorithm on Hadoop in both one-dimensional and multidimensional MapReduce frameworks show better results for multidimensional anonymization on the input adult dataset. Data sets are generalized in a top-down manner, and the better result in the multidimensional MapReduce framework is shown by the better IGPL values generated by the algorithm. The anonymization is performed with specialization operations on a taxonomy tree. The experiments show that the solution improves the IGPL values and the anonymity parameter and decreases the execution time of big data privacy preservation compared to the existing algorithm. These experimental results are expected to be of great use in distributed environments.
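The specialization operation on a taxonomy tree can be sketched as follows. This is a minimal single-attribute illustration, not the TPTDS implementation: it runs on one machine with no MapReduce phases, and it accepts the first specialization that keeps the data k-anonymous instead of ranking candidates by an IGPL score (commonly expanded as information gain per privacy loss). The taxonomy, the sample values, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical taxonomy tree for an "education" attribute: parent -> children.
TAXONOMY = {
    "Any": ["Secondary", "University"],
    "Secondary": ["9th", "10th", "11th"],
    "University": ["Bachelors", "Masters"],
}
PARENT = {child: parent for parent, kids in TAXONOMY.items() for child in kids}

# Hypothetical attribute values, one per record.
VALUES = ["9th", "10th", "11th", "Bachelors", "Bachelors", "Masters", "Masters"]


def generalize(value, cut):
    """Walk up the taxonomy until reaching a node that is in the current cut."""
    while value not in cut:
        value = PARENT[value]
    return value


def is_k_anonymous(values, cut, k):
    """True if every generalized value is shared by at least k records."""
    counts = Counter(generalize(v, cut) for v in values)
    return min(counts.values()) >= k


def top_down_specialize(values, k):
    """Start from the root and specialize nodes while k-anonymity holds."""
    cut = {"Any"}
    changed = True
    while changed:
        changed = False
        for node in sorted(cut):
            children = TAXONOMY.get(node)
            if not children:
                continue  # leaf node: nothing left to specialize
            candidate = (cut - {node}) | set(children)
            if is_k_anonymous(values, candidate, k):
                cut = candidate
                changed = True
                break
    return cut


cut = top_down_specialize(VALUES, k=2)
print(sorted(cut))                         # ['Bachelors', 'Masters', 'Secondary']
print([generalize(v, cut) for v in VALUES])
```

With k = 2 the three secondary-school values stay generalized to "Secondary" because each occurs only once, while "Bachelors" and "Masters" can be published as-is.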
