Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Jun-Yeong Lee; Moon-Hyun Kim; Syed Asif Raza Shah; Sang-Un Ahn; Heejun Yoon; Seo-Young Noh

doi:10.3390/electronics10121471

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. This growth in research data requires data storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology that combines multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, building a RAID-enabled disk array requires RAID-capable hardware or software, and RAID-based storage is difficult to scale up. To mitigate these problems, many distributed file systems have been developed and are actively used in various environments, especially in data-intensive computing facilities, where tremendous amounts of data have to be handled. In this study, we investigated and benchmarked several distributed file systems for data-intensive environments: Ceph, GlusterFS, Lustre and EOS. In our experiment, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, which have to be considered in data-intensive computing environments.

A Comprehensive Survey on Data-Intensive Computing and MapReduce Paradigm in Cloud Computing Environments

Informatics and Communication Technologies for Societal Development ◽

10.1007/978-81-322-1916-3_9 ◽

2014 ◽

pp. 85-93

Author(s):

Girish Neelakanta Iyer ◽

Salaja Silas

Keyword(s):

Cloud Computing ◽

Data Intensive Computing ◽

Data Intensive ◽

Comprehensive Survey ◽

Computing Environments ◽

Mapreduce Paradigm

SepStore: Data Storage Accelerator for Distributed File Systems by Separating Small Files from Large Files

Lecture Notes in Computer Science - Internet of Vehicles – Technologies and Services ◽

10.1007/978-3-319-11167-4_27 ◽

2014 ◽

pp. 272-281

Author(s):

Zhenzhao Wang ◽

Kang Chen ◽

Yongwei Wu ◽

Weimin Zheng

Keyword(s):

Data Storage ◽

File Systems ◽

Distributed File Systems

A New Data Classification Algorithm for Data-Intensive Computing Environments

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.3318 ◽

2013 ◽

Vol 756-759 ◽

pp. 3318-3323

Author(s):

Qi Zhi Deng ◽

Long Bo Zhang ◽

Xin Qian ◽

Ya Li Chen ◽

Feng Ying Wang

Keyword(s):

Data Mining ◽

Large Datasets ◽

Data Availability ◽

Learning Method ◽

Data Intensive Computing ◽

Data Intensive ◽

Distributed Computations ◽

Split Point ◽

Computing Environments ◽

Mapreduce Model

To address the scalability of data processing and the data availability problems that data mining techniques encounter in data-intensive computing, this paper presents a new tree learning method. By introducing MapReduce, the SPRINT-based tree learning method scales well when handling large datasets. Moreover, we define the split-point process as a series of distributed computations, each of which is implemented with the MapReduce model. A new data structure, called the class distribution table, is introduced to assist the calculation of histograms. Experiments and analysis of the results show that the algorithm has strong data mining processing capabilities for data-intensive computing environments.

Modeling of distributed file system in big data storage by Event-B

MATEC Web of Conferences ◽

10.1051/matecconf/201821004042 ◽

2018 ◽

Vol 210 ◽

pp. 04042

Author(s):

Ammar Alhaj Ali ◽

Pavel Varacha ◽

Said Krayem ◽

Roman Jasek ◽

Petr Zacek ◽

...

Keyword(s):

Big Data ◽

Data Storage ◽

High Performance ◽

File System ◽

Formal Method ◽

File Systems ◽

Distributed File System ◽

Distributed File Systems ◽

Data Systems ◽

Big Data Systems

Nowadays, a wide range of systems and applications, especially in high-performance computing, depend on distributed environments to process and analyze huge amounts of data. As the amount of data increases enormously, providing and developing efficient, scalable and reliable storage solutions has become one of the major issues for scientific computing. The storage solution used by big data systems is the Distributed File System (DFS), which builds a hierarchical and unified view of multiple file servers and shares on the network. In this paper, we present the Hadoop Distributed File System (HDFS) as the DFS in big data systems, and Event-B as a formal method that can be used to model it. Event-B is a mature formal method that has been widely used in industry projects across a number of domains, such as automotive, transportation, space, business information and medical devices. We also propose Rodin as the modeling tool for Event-B: it integrates modeling and proving, and because the Rodin platform is open source, it supports a large number of plug-in tools.

Data classification algorithm for data-intensive computing environments

EURASIP Journal on Wireless Communications and Networking ◽

10.1186/s13638-017-1002-4 ◽

2017 ◽

Vol 2017 (1) ◽

Cited By ~ 1

Author(s):

Tiedong Chen ◽

Shifeng Liu ◽

Daqing Gong ◽

Honghu Gao

Keyword(s):

Data Classification ◽

Classification Algorithm ◽

Data Intensive Computing ◽

Data Intensive ◽

Computing Environments

Overview of Big Data-Intensive Storage and its Technologies for Cloud and Fog Computing

Research Anthology on Privatizing and Securing Data ◽

10.4018/978-1-7998-8954-0.ch005 ◽

2021 ◽

pp. 112-153

Author(s):

Richard S. Segall ◽

Jeffrey S Cook ◽

Gao Niu

Keyword(s):

Big Data ◽

Data Storage ◽

High Performance ◽

Storage Systems ◽

Fog Computing ◽

Storage Management ◽

Data Intensive Computing ◽

Computing Systems ◽

Application Performance ◽

Data Intensive

Computing systems are becoming increasingly data-intensive because of the explosion of data and the need to process it, and storage management is therefore critical to application performance in such data-intensive computing systems. However, when existing resource management frameworks in these systems lack support for storage management, applications under input/output (I/O) contention suffer unpredictable performance degradation. Storage management of data-intensive systems is a challenge, and Big Data plays a major role in storage systems for data-intensive computing. This article deals with these difficulties and discusses High Performance Computing (HPC) systems, the background of storage systems for data-intensive applications, storage patterns and storage mechanisms for Big Data, the Top 10 Cloud Storage Systems for data-intensive computing in today's world, and the interface between Big Data Intensive Storage and Cloud/Fog Computing. Big Data storage and its server statistics and usage distributions for the Top 500 Supercomputers in the world are also presented graphically and discussed as data-intensive storage components that can be interfaced with Fog-to-cloud interactions and enabling protocols.

Improving Efficiency and Performance of Distributed File-Systems

10.1109/nca.2008.55 ◽

2008 ◽

Author(s):

Micah Galizia ◽

Hanan Lutfiyya

Keyword(s):

File Systems ◽

Distributed File Systems ◽

And Performance

An Inter-framework Cache for Diverse Data-Intensive Computing Environments

2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity) ◽

10.1109/smartcity.2015.192 ◽

2015 ◽

Author(s):

Chun-Yu Wang ◽

Tzu-En Huang ◽

Yu-Tang Huang ◽

Jyh-Biau Chang ◽

Ce-Kuen Shieh

Keyword(s):

Data Intensive Computing ◽

Data Intensive ◽

Computing Environments ◽

Diverse Data

Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators

Scalable Computing Practice and Experience ◽

10.12694/scpe.v22i4.1945 ◽

2021 ◽

Vol 22 (4) ◽

pp. 401-412

Author(s):

Hrachya Astsatryan ◽

Arthur Lalayan ◽

Aram Kocharyan ◽

Daniel Hagimont

Keyword(s):

Big Data ◽

Data Compression ◽

Data Storage ◽

File Systems ◽

Large Datasets ◽

Data Sets ◽

Mapreduce Framework ◽

Data Intensive ◽

Parallel Data ◽

Data Intensive Applications

The MapReduce framework manages Big Data sets by splitting large datasets into a set of distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used in Big Data processing to reduce resource-intensive I/O operations and correspondingly improve the I/O rate. The article presents a performance-efficient, modular, configurable and robust decision-making service that relies on data compression and in-memory data storage indicators. The service consists of Recommendation and Prediction modules: it predicts the execution time of a given job based on metrics and recommends the best configuration parameters to improve the performance of the Hadoop and Spark frameworks. Several CPU-intensive and data-intensive applications and micro-benchmarks have been evaluated to improve the performance, including Log Analyzer, WordCount, and K-Means.

Role of Open Source Software in Big Data Storage

Advances in Data Mining and Database Management - Handbook of Research on Big Data Storage and Visualization Techniques ◽

10.4018/978-1-5225-3142-5.ch005 ◽

2018 ◽

pp. 123-150 ◽

Cited By ~ 1

Author(s):

Rupali Ahuja ◽

Jigyasa Malik ◽

Ronak Tyagi ◽

R. Brinda

Keyword(s):

Big Data ◽

Open Source ◽

Data Storage ◽

Open Source Software ◽

File Systems ◽

Distributed File Systems ◽

The World ◽

Storage Technologies ◽

Big Data Storage

Today, the world revolves around Big Data. Every organization is trying hard to find ways of deriving value from the huge pile of data generated each moment. Open Source Software is being widely adopted by academics, researchers and industrialists to handle various Big Data needs because of its easy availability, flexibility, affordability and interoperability. As a result, several open source Big Data tools have been developed. This chapter discusses the role of Open Source Software in Big Data Storage and how various organizations have benefited from its use. It provides an overview of the popular Open Source Big Data Storage technologies that exist today. Distributed File Systems and NoSQL databases meant for storing Big Data are discussed along with their features and applications, and are compared.
