A Global Survey on Data Deduplication

2018 ◽  
Vol 10 (4) ◽  
pp. 43-66 ◽  
Author(s):  
Shubhanshi Singhal ◽  
Pooja Sharma ◽  
Rajesh Kumar Aggarwal ◽  
Vishal Passricha

This article describes how data deduplication efficiently eliminates redundant data by selecting and storing only a single instance of it, a technique that is becoming popular in storage systems. Digital data is growing much faster than storage volumes, which underscores the importance of data deduplication to scientists and researchers. Data deduplication is considered the most successful and efficient data reduction technique because it is computationally efficient and offers lossless data reduction. It is applicable to various storage systems, e.g. local storage, distributed storage, and cloud storage. This article discusses the background, components, and key features of data deduplication, helping the reader understand the design issues and challenges in this field.
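To make the single-instance idea concrete, here is a minimal Python sketch of a deduplicating store that keeps each distinct block exactly once, keyed by its content hash. The fixed block size and store layout are illustrative assumptions, not details from the article.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

class DedupStore:
    """Toy single-instance store: identical blocks are stored once."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (one copy each)
        self.files = {}   # file name -> ordered list of fingerprints

    def put(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)  # duplicate blocks add nothing
            recipe.append(fp)
        self.files[name] = recipe

    def get(self, name):
        return b"".join(self.blocks[fp] for fp in self.files[name])
```

Storing the same file twice adds no new blocks, and reconstruction from the recipe is exact, which is the lossless, computationally cheap reduction the survey describes.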

2018 ◽  
Vol 7 (2.8) ◽  
pp. 13
Author(s):  
B Tirapathi Reddy ◽  
M V. P. Chandra Sekhara Rao

Storing data in the cloud has become a necessity, as users accumulate abundant data every day and are running out of physical storage devices. However, the majority of the data in cloud storage is redundant. Data deduplication using convergent key encryption has been the mechanism popularly used to eliminate redundant data items in cloud storage, but convergent key encryption suffers from various drawbacks. For instance, if data items are deduplicated based on a convergent key, any unauthorized user can compromise the cloud storage simply by presenting a guessed hash of the file. Ensuring ownership of the data items is therefore essential to protect them. As the cuckoo filter offers a minimal false positive rate with minimal space overhead, our mechanism uses it to provide proof of ownership.
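To see why a guessed hash is dangerous, consider this minimal Python sketch of convergent-key deduplication; the tag derivation and the NaiveServer class are illustrative stand-ins, not the paper's construction.

```python
import hashlib

def convergent_key(data):
    # The key is derived from the content itself, so every holder
    # of the same file derives the same key and ciphertext.
    return hashlib.sha256(data).digest()

def dedup_tag(data):
    # Index tag under which the single ciphertext copy is stored.
    return hashlib.sha256(convergent_key(data)).digest()

class NaiveServer:
    """Deduplicating server that trusts a client-supplied tag --
    exactly the weakness described in the abstract."""

    def __init__(self):
        self.store = {}  # tag -> ciphertext

    def upload(self, tag, ciphertext=None):
        if tag in self.store:      # duplicate: skip upload, grant access
            return self.store[tag]
        self.store[tag] = ciphertext
        return ciphertext
```

An attacker who merely learned or guessed the tag of someone else's file can be linked to its stored copy without ever possessing the content, which is why the server must first demand a proof of ownership over the content itself; the paper realizes that membership check with a cuckoo filter because of its low false positive rate and small space overhead.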


Author(s):  
Sumit Kumar Mahana ◽  
Rajesh Kumar Aggarwal

In the present digital scenario, data is of prime significance for individuals and, moreover, for organizations. The amount of data being produced increases exponentially with time, which poses a serious concern: the huge volume of redundant data stored on the cloud imposes a severe load on cloud storage systems that cannot be accepted. Therefore, a storage optimization strategy is a fundamental prerequisite for cloud storage systems. Data deduplication is a storage optimization strategy that deletes identical copies of redundant data, optimizes bandwidth, improves utilization of storage space, and hence minimizes storage cost. To guarantee security, data stored on the cloud must be kept in encrypted form. Consequently, performing deduplication safely over encrypted information in the cloud is a challenging job. This chapter discusses various existing data deduplication techniques, with a notion of securing data on the cloud, that address this challenge.
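A short sketch of the core tension, using a toy XOR keystream in place of a real cipher (illustrative only, not secure): conventional encryption with per-user keys and random nonces maps identical plaintexts to different ciphertexts, so the server can no longer detect duplicates, whereas deriving the key and nonce from the content restores equality.

```python
import hashlib, os

def keyed_encrypt(data, key, nonce):
    """Toy randomized encryption (XOR keystream; illustrative only)."""
    stream = b""
    c = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + c.to_bytes(8, "big")).digest()
        c += 1
    return nonce + bytes(a ^ b for a, b in zip(data, stream))

plaintext = b"identical report stored by two users"

# Per-user keys and random nonces: ciphertexts differ, dedup fails.
u1 = keyed_encrypt(plaintext, b"user-1-key", os.urandom(16))
u2 = keyed_encrypt(plaintext, b"user-2-key", os.urandom(16))
assert u1 != u2

# Content-derived key and nonce (convergent style): ciphertexts
# match, so the server can deduplicate without seeing the plaintext.
k = hashlib.sha256(plaintext).digest()
assert keyed_encrypt(plaintext, k, k[:16]) == keyed_encrypt(plaintext, k, k[:16])
```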


The enormous growth of digital data, especially data in unstructured formats, has brought a tremendous challenge to data analysis as well as to data storage systems, substantially increasing the cost and degrading the performance of backup systems. Traditional systems do not provide any optimization technique to keep duplicated data from being backed up. Data deduplication has become an essential and economical capacity optimization technique that replaces redundant data. The following paper reviews the deduplication process, the types of deduplication, and the techniques available for data deduplication. In addition, many approaches proposed by various researchers on deduplication in big data storage systems are studied and compared.
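As a sense of scale (figures illustrative, not from the paper): if a 10 TB backup set contains only 2 TB of unique chunks, deduplication stores the 2 TB and records references for the rest, giving a deduplication ratio of 10/2 = 5:1, i.e. an 80% reduction in consumed capacity.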


2018 ◽  
Vol 7 (2.4) ◽  
pp. 46 ◽  
Author(s):  
Shubhanshi Singhal ◽  
Akanksha Kaushik ◽  
Pooja Sharma

Due to the drastic growth of digital data, data deduplication has become a standard component of modern backup systems. It reduces data redundancy, saves storage space, and simplifies the management of data chunks. The process is performed in three steps: chunking, fingerprinting, and indexing of fingerprints. In chunking, data files are divided into chunks, and the chunk boundary is decided by the value of the divisor. For each chunk, a unique identifying value, known as a fingerprint, is computed using a hash function (e.g. MD5, SHA-1, SHA-256). Finally, these fingerprints are stored in the index to detect redundant chunks, i.e. chunks having the same fingerprint values. In chunking, the chunk size is an important factor that should be optimal for good performance of the deduplication system. The genetic algorithm (GA), which is gaining much popularity, can be applied to find the best value of the divisor. Secondly, indexing also enhances the performance of the system by reducing the search time: binary search tree (BST) based indexing has a time complexity of O(log n), which is minimal among searching algorithms. A new model is proposed that applies a GA to find the value of the divisor; this is the first attempt to apply a GA in the field of data deduplication. The second improvement in the proposed system is that a BST index tree is applied to index the fingerprints. The performance of the proposed system is evaluated on VMDK, Linux, and Quanto datasets, and a good improvement in deduplication ratio is achieved.
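The three-step pipeline can be sketched compactly in Python; the byte-sum rolling hash, window size, and chunk-size bounds below are illustrative stand-ins (the paper tunes the divisor with a GA, and a Python set replaces the BST index here while keeping the same membership semantics).

```python
import hashlib

def chunk(data, divisor=2048, window=48, min_size=1024, max_size=8192):
    """Content-defined chunking: declare a boundary when a rolling
    hash of the last `window` bytes is divisible by `divisor`
    (the parameter the paper optimizes with a genetic algorithm)."""
    chunks, start, rolling = [], 0, 0
    for i in range(len(data)):
        rolling += data[i]                   # toy byte-sum rolling hash
        if i - start >= window:
            rolling -= data[i - window]      # slide the window forward
        size = i - start + 1
        if (size >= min_size and rolling % divisor == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # final partial chunk
    return chunks

def deduplicate(data):
    """Fingerprint each chunk and keep only the first occurrence."""
    index, unique = set(), []   # a balanced BST gives the O(log n) lookups
    for c in chunk(data):
        fp = hashlib.sha256(c).hexdigest()   # fingerprint each chunk
        if fp not in index:
            index.add(fp)                    # new fingerprint: keep chunk
            unique.append(c)
    return unique
```

A GA would evaluate candidate divisor values by running exactly this pipeline and scoring the resulting deduplication ratio.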


Author(s):  
Huan Liu

The amount of data has become increasingly large in recent years as the capacity of digital data storage worldwide has significantly increased. As the size of data grows, the demand for data reduction for effective data mining increases. Instance selection is one effective means of data reduction. This article introduces the basic concepts of instance selection, its context, necessity, and functionality, and briefly reviews the state-of-the-art methods for instance selection. Selection is a necessity in the world surrounding us; it stems from the sheer fact of limited resources, and data mining is no exception. Many factors give rise to data selection: data is not collected purely for data mining or for one particular application; there are missing data, redundant data, and errors introduced during collection and storage; and data can be too overwhelming to handle. Instance selection is one effective approach to data selection: the process of choosing a subset of data that achieves the original purpose of a data mining application. The ideal outcome of instance selection is a model-independent, minimal sample of data that can accomplish the tasks with little or no performance deterioration.
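As one minimal, model-independent example of instance selection, the following sketch draws a stratified random sample that roughly preserves class proportions; it is a simple baseline for illustration, not a method from the article.

```python
import random
from collections import defaultdict

def stratified_sample(instances, labels, fraction=0.1, seed=0):
    """Select a subset of (instance, label) pairs while keeping
    approximately the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    selected = []
    for y, xs in by_class.items():
        k = max(1, round(fraction * len(xs)))  # keep at least one per class
        selected.extend((x, y) for x in rng.sample(xs, k))
    return selected
```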


Author(s):  
Vishal Passricha ◽  
Ashish Chopra ◽  
Shubhanshi Singhal

Cloud storage (CS) is gaining much popularity nowadays because it offers low-cost and convenient network storage services. In this big data era, the explosive growth of digital data moves users toward CS, but this places considerable storage pressure on CS systems because a large volume of this data is redundant. Data deduplication is an effective data reduction technique. The dynamic nature of data makes security and ownership of data a very important issue. Proof-of-ownership schemes are a robust way to verify the ownership claimed by any owner. However, this method affects the deduplication process because encryption methods have varying characteristics. A convergent encryption (CE) scheme is widely used for secure data deduplication. The problem with CE-based schemes is that a user can still decrypt the cloud data after losing ownership. This article addresses the problem of ownership revocation by proposing a secure deduplication scheme for encrypted data. The proposed scheme enhances security against unauthorized encryption and poison attacks on the predicted set of data.
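A generic challenge-response proof of ownership can be sketched as follows; the chunk size, challenge width, and function names are illustrative assumptions, not the article's scheme. The point is that a claimant must hold the actual content, not merely its hash.

```python
import hashlib, random

CHUNK = 4096  # illustrative chunk size

def chunk_hashes(data):
    """Server side, at first upload: fingerprint every chunk."""
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]

def challenge(num_chunks, k=3):
    """Server picks random chunk indices the claimant must answer."""
    return random.sample(range(num_chunks), min(k, num_chunks))

def respond(data, indices):
    """Claimant: answerable only with the full content in hand."""
    return [hashlib.sha256(data[i * CHUNK:(i + 1) * CHUNK]).digest()
            for i in indices]

def verify(stored_hashes, indices, response):
    return all(stored_hashes[i] == r for i, r in zip(indices, response))
```

Because a convergent key is recomputable from the content, possession checks like this one, together with the explicit revocation handling the article proposes, are needed to keep former owners from retaining access.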


2014 ◽  
Vol 2014 ◽  
pp. 1-10
Author(s):  
Sun-Ho Lee ◽  
Im-Yeong Lee

Data outsourcing services have emerged with the increasing use of digital information. They can be used to store data from various devices via networks that are easy to access. Unlike existing removable storage systems, storage outsourcing is available to many users because it has no storage limit and does not require a local storage medium. However, the reliability of storage outsourcing has become an important topic because many users employ it to store large volumes of data. To protect against unethical administrators and attackers, a variety of cryptographic systems are used, such as searchable encryption and proxy re-encryption. However, existing searchable encryption technology is inconvenient for storage outsourcing environments where users upload their data to be shared with others as necessary. In addition, some existing schemes are vulnerable to collusion attacks and have inefficient computing costs. In this paper, we analyze existing proxy re-encryption with keyword search schemes.


Author(s):  
Spyridon V. Gogouvitis ◽  
Athanasios Voulodimos ◽  
Dimosthenis Kyriazis

Distributed storage systems are becoming the method of data storage for the new generation of applications, as they appear to be a promising solution for handling the immense volume of data produced in today's rich and ubiquitous digital environment. In this chapter, the authors first present the requirements end users pose on Cloud Storage solutions. They then compare some of the most prominent commercial distributed storage systems against these requirements. Lastly, the authors present the innovations the VISION Cloud project brings to the field of Storage Clouds.

