scholarly journals Network-Aware Data Placement Strategy in Storage Cluster System

2020 ◽  
Vol 2020 ◽  
pp. 1-16
Author(s):  
Bilin Shao ◽  
Dan Song ◽  
Genqing Bian ◽  
Yu Zhao

The dramatic increase of storage devices in distributed storage cluster system and the inherent characteristics of distributed deployment mode make network resources become one of the bottlenecks of data storage process. By analyzing the functional characteristics of data flow transmission of the components of the storage system, the network topology structure of the storage system is constructed, and the evaluation index of node flow load is put forward based on the degree centrality and the betweenness centrality theory of the network to explore the network topology and real-time flow characteristics. According to the evaluation index of node flow load, a network-aware data layout scheme is proposed. By balancing the flow load of bottleneck link, congestion and transmission delay can be reduced to further shorten the total task execution time and improve the efficiency of data writing.

2013 ◽  
Vol 5 (1) ◽  
pp. 53-69
Author(s):  
Jacques Jorda ◽  
Aurélien Ortiz ◽  
Abdelaziz M’zoughi ◽  
Salam Traboulsi

Grid computing is commonly used for large scale application requiring huge computation capabilities. In such distributed architectures, the data storage on the distributed storage resources must be handled by a dedicated storage system to ensure the required quality of service. In order to simplify the data placement on nodes and to increase the performance of applications, a storage virtualization layer can be used. This layer can be a single parallel filesystem (like GPFS) or a more complex middleware. The latter is preferred as it allows the data placement on the nodes to be tuned to increase both the reliability and the performance of data access. Thus, in such a middleware, a dedicated monitoring system must be used to ensure optimal performance. In this paper, the authors briefly introduce the Visage middleware – a middleware for storage virtualization. They present the most broadly used grid monitoring systems, and explain why they are not adequate for virtualized storage monitoring. The authors then present the architecture of their monitoring system dedicated to storage virtualization. We introduce the workload prediction model used to define the best node for data placement, and show on a simple experiment its accuracy.


2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Xiangli Chang ◽  
Hailang Cui

With the increasing popularity of a large number of Internet-based services and a large number of services hosted on cloud platforms, a more powerful back-end storage system is needed to support these services. At present, it is very difficult or impossible to implement a distributed storage to meet all the above assumptions. Therefore, the focus of research is to limit different characteristics to design different distributed storage solutions to meet different usage scenarios. Economic big data should have the basic requirements of high storage efficiency and fast retrieval speed. The large number of small files and the diversity of file types make the storage and retrieval of economic big data face severe challenges. This paper is oriented to the application requirements of cross-modal analysis of economic big data. According to the source and characteristics of economic big data, the data types are analyzed and the database storage architecture and data storage structure of economic big data are designed. Taking into account the spatial, temporal, and semantic characteristics of economic big data, this paper proposes a unified coding method based on the spatiotemporal data multilevel division strategy combined with Geohash and Hilbert and spatiotemporal semantic constraints. A prototype system was constructed based on Mongo DB, and the performance of the multilevel partition algorithm proposed in this paper was verified by the prototype system based on the realization of data storage management functions. The Wiener distributed memory based on the principle of Wiener filter is used to store the workload of each workload distributed storage window in a distributed manner. For distributed storage workloads, this article adopts specific types of workloads. According to its periodicity, the workload is divided into distributed storage windows of specific duration. At the beginning of each distributed storage window, distributed storage is distributed to the next distributed storage window. Experiments and tests have verified the distributed storage strategy proposed in this article, which proves that the Wiener distributed storage solution can save platform resources and configuration costs while ensuring Service Level Agreement (SLA).


2020 ◽  
Vol 245 ◽  
pp. 04037
Author(s):  
Xiaowei Aaron Chu ◽  
Jeff LeFevre ◽  
Aldrin Montana ◽  
Dana Robinson ◽  
Quincey Koziol ◽  
...  

Access libraries such as ROOT[1] and HDF5[2] allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layout that assume that large, single-threaded sequential access patterns are causing less overall latency than small parallel random access: while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph’s extensible object model, avoiding re-implementation or even modifications of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset mapping techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems and 3) fully leveraging of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct storage server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations.


2013 ◽  
Vol 4 (1) ◽  
pp. 106-110
Author(s):  
Rajasekaran S ◽  
Kalifulla. Y ◽  
Murugesan. S ◽  
Ezhilvendan. M ◽  
Gunasekaran. J

cloud storage system, consisting of a collection of storage servers, provides long-term storage services over the Internet.  Storing data in a third party’s cloud system causes serious concern over data confidentiality. General encryption schemes protect data confidentiality, but also limit the functionality of the storage system because a few operations are supported over encrypted data. Constructing a secure storage system that supports multiple functions is challenging when the storage system is distributed and has no central authority. We propose a threshold proxy re-encryption scheme and integrate it with a decentralized erasure code such that a secure distributed storage system is formulated. The distributed storage system not only supports secure and robust data storage and retrieval, but also lets a user forward his data in the storage servers to another user without retrieving the data back. The main technical contribution is that the proxy re-encryption scheme supports encoding operations over encrypted messages as well as forwarding operations over encoded and encrypted messages. Our method fully integrates encrypting, encoding, and forwarding. We analyze and suggest suitable parameters for the number of copies of a message dispatched to storage servers and the number of storage servers queried by a key server. These parameters allow more flexible adjustment between the number of storage servers and robustness.


2016 ◽  
Vol 4 (1) ◽  
Author(s):  
Agus Maman Abadi ◽  
Musthofa Musthofa ◽  
Emut Emut

The increasing need in techniques of storing big data presents a new challenge. One way to address this challenge is the use of distributed storage systems. One strategy that implemented in distributed data storage systems is the use of Erasure Code which applied to network coding. The code used in this technique is based on the algebraic structure which is called as vector space. Some studies have also been carried out to create code that is based on other algebraic structures such as module.  In this study, we are going to try to set up a code based on the algebraic structure which is a generalization of the module that is semimodule by utilizing the max operations and sum operations at max plus algebra. The results of this study indicate that the max operation and the addition operation on max plus algebra cannot be used to establish a semimodule code, but by modifying the operation "+" as "min", we get a code based on semimodule. Keywords:   code, distributed storage systems, network coding, semimodule, max plus algebra


IoT ◽  
2021 ◽  
Vol 2 (4) ◽  
pp. 610-632
Author(s):  
Oluwashina Joseph Ajayi ◽  
Joseph Rafferty ◽  
Jose Santos ◽  
Matias Garcia-Constantino ◽  
Zhan Cui

The scale of Internet of Things (IoT) systems has expanded in recent times and, in tandem with this, IoT solutions have developed symbiotic relationships with technologies, such as edge Computing. IoT has leveraged edge computing capabilities to improve the capabilities of IoT solutions, such as facilitating quick data retrieval, low latency response, and advanced computation, among others. However, in contrast with the benefits offered by edge computing capabilities, there are several detractors, such as centralized data storage, data ownership, privacy, data auditability, and security, which concern the IoT community. This study leveraged blockchain’s inherent capabilities, including distributed storage system, non-repudiation, privacy, security, and immutability, to provide a novel, advanced edge computing architecture for IoT systems. Specifically, this blockchain-based edge computing architecture addressed centralized data storage, data auditability, privacy, data ownership, and security. Following implementation, the performance of this solution was evaluated to quantify performance in terms of response time and resource utilization. The results show the viability of the proposed and implemented architecture, characterized by improved privacy, device data ownership, security, and data auditability while implementing decentralized storage.


Big data applications play an important role in real time data processing. Apache Spark is a data processing framework with in-memory data engine that quickly processes large data sets. It can also distribute data processing tasks across multiple computers, either on its own or in tandem with other distributed computing tools. Spark’s in-memory processing cannot share data between the applications and hence, the RAM memory will be insufficient for storing petabytes of data. Alluxio is a virtual distributed storage system that leverages memory for data storage and provides faster access to data in different storage systems. Alluxio helps to speed up data intensive Spark applications, with various storage systems. In this work, the performance of applications on Spark as well as Spark running over Alluxio have been studied with respect to several storage formats such as Parquet, ORC, CSV, and JSON; and four types of queries from Star Schema Benchmark (SSB). A benchmark is evolved to suggest the suitability of Spark Alluxio combination for big data applications. It is found that Alluxio is suitable for applications that use databases of size more than 2.6 GB storing data in JSON and CSV formats. Spark is found suitable for applications that use storage formats such as parquet and ORC with database sizes less than 2.6GB.


2017 ◽  
Vol 5 (1) ◽  
pp. 60
Author(s):  
Agus Maman Abadi ◽  
Karyati Karyati ◽  
Musthofa Musthofa ◽  
Emut Emut

Abstract The Increasing need of storing large amounts of data presents a new challenge. One way to address this challenge is to use distributed data storage system. One of the strategies implemented in the distributed data storage system is using the technique of regenerating code. The code used in this technique is based on the algebraic structure of fields. Some studies have also been carried out to create code that is based on the other algebraic structure namely module. In this study, we attempted to assess the implementation of the code module at regenerating technique code. The study showed there is a potential properties code based on module that can be used in regenerating code technique. Keywords: Distributed storage, regenerating code technique, module code


2011 ◽  
Vol 19 (1) ◽  
pp. 27-43
Author(s):  
Tevfik Kosar ◽  
Ismail Akturk ◽  
Mehmet Balman ◽  
Xinqi Wang

Modern collaborative science has placed increasing burden on data management infrastructure to handle the increasingly large data archives generated. Beside functionality, reliability and availability are also key factors in delivering a data management system that can efficiently and effectively meet the challenges posed and compounded by the unbounded increase in the size of data generated by scientific applications. We have developed a reliable and efficient distributed data storage system, PetaShare, which spans multiple institutions across the state of Louisiana. At the back-end, PetaShare provides a unified name space and efficient data movement across geographically distributed storage sites. At the front-end, it provides light-weight clients the enable easy, transparent and scalable access. In PetaShare, we have designed and implemented an asynchronously replicated multi-master metadata system for enhanced reliability and availability, and an advanced buffering system for improved data transfer performance. In this paper, we present the details of our design and implementation, show performance results, and describe our experience in developing a reliable and efficient distributed data management system for data-intensive science.


Sign in / Sign up

Export Citation Format

Share Document