Deister: A light-weight autonomous block management in data-intensive file systems using deterministic declustering distribution

2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity) ◽

10.1109/smartcity.2015.135 ◽

2015 ◽

Author(s):

Xuhong Zhang ◽

Jiangling Yin ◽

Jun Wang ◽

Ruijun Wang ◽

Dan Huang

Keyword(s):

File Systems ◽

Light Weight ◽

Data Intensive ◽

Autonomous Block

Performance Evaluations of Distributed File Systems for Scientific Big Data in FUSE Environment

Electronics ◽

10.3390/electronics10121471 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1471

Author(s):

Jun-Yeong Lee ◽

Moon-Hyun Kim ◽

Syed Asif Raza Shah ◽

Sang-Un Ahn ◽

Heejun Yoon ◽

...

Keyword(s):

Data Storage ◽

Scale Up ◽

File Systems ◽

Performance Evaluations ◽

Distributed File Systems ◽

Data Intensive Computing ◽

Data Intensive ◽

Tremendous Amount ◽

Computing Environments ◽

And Performance

Data are important and ever growing in data-intensive scientific environments. This growth in research data requires storage systems that play pivotal roles in data management and analysis for scientific discoveries. Redundant Array of Independent Disks (RAID), a well-known storage technology that combines multiple disks into a single large logical volume, has been widely used for data redundancy and performance improvement. However, it requires RAID-capable hardware or software to build a RAID-enabled disk array, and RAID-based storage is difficult to scale up. To mitigate these problems, many distributed file systems have been developed and are actively used in various environments, especially in data-intensive computing facilities, where a tremendous amount of data has to be handled. In this study, we investigated and benchmarked various distributed file systems, such as Ceph, GlusterFS, Lustre and EOS, for data-intensive environments. In our experiments, we configured the distributed file systems under a Reliable Array of Independent Nodes (RAIN) structure and a Filesystem in Userspace (FUSE) environment. Our results identify the characteristics of each file system that affect read and write performance depending on the features of the data, which have to be considered in data-intensive computing environments.

Light-weight Cloud Job Management System for Data Intensive Science

2011 Fourth IEEE International Conference on Utility and Cloud Computing ◽

10.1109/ucc.2011.63 ◽

2011 ◽

Author(s):

Haehyun Kim ◽

Jaegyoon Hahm

Keyword(s):

Management System ◽

Light Weight ◽

Job Management ◽

Data Intensive ◽

Job Management System

CHAIO: Enabling HPC Applications on Data-Intensive File Systems

2012 41st International Conference on Parallel Processing ◽

10.1109/icpp.2012.1 ◽

2012 ◽

Cited By ~ 7

Author(s):

Hui Jin ◽

Jiayu Ji ◽

Xian-He Sun ◽

Yong Chen ◽

Rajeev Thakur

Keyword(s):

File Systems ◽

Data Intensive

Can Applications Recover from fsync Failures?

ACM Transactions on Storage ◽

10.1145/3450338 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-30

Author(s):

Anthony Rebello ◽

Yuvraj Patel ◽

Ramnatthan Alagappan ◽

Andrea C. Arpaci-Dusseau ◽

Remzi H. Arpaci-Dusseau

Keyword(s):

File Systems ◽

Data Loss ◽

Data Intensive ◽

Failure Handling ◽

Data Intensive Applications ◽

Failure Reporting

We analyze how file systems and modern data-intensive applications react to fsync failures. First, we characterize how three Linux file systems (ext4, XFS, Btrfs) behave in the presence of failures. We find commonalities across file systems (pages are always marked clean, certain block writes always lead to unavailability) as well as differences (page content and failure reporting vary). Next, we study how five widely used applications (PostgreSQL, LMDB, LevelDB, SQLite, Redis) handle fsync failures. Our findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption. Our findings have strong implications for the design of file systems and applications that intend to provide strong durability guarantees.
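
To make the failure mode in this abstract concrete, the following minimal Python sketch shows one conservative way an application might react to an fsync error: it treats the error as fatal for that write and discards the temporary file instead of retrying. The abstract states that pages are marked clean after a failure, which is why this sketch does not assume a later fsync call would report the problem again. The function name `durable_write` and the `.tmp` suffix are hypothetical, and this is not a strategy taken from the paper; the abstract reports that none of the strategies it studied were sufficient.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to a temp file, fsync it, then rename it over path.

    Hypothetical helper: any OSError (including one raised by fsync)
    aborts the write and removes the temp file rather than retrying.
    """
    tmp_path = path + ".tmp"  # assumed naming convention, for illustration
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # os.fsync raises OSError if the kernel reports a failure.
        os.fsync(fd)
    except OSError:
        os.close(fd)
        os.unlink(tmp_path)
        raise
    else:
        os.close(fd)
    # Replace the target only after the data was flushed without error.
    os.replace(tmp_path, path)
```

A fuller version would also fsync the parent directory after the rename so that the new directory entry itself is persisted; that step is omitted here to keep the sketch short.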

Performance-efficient Recommendation and Prediction Service for Big Data frameworks focusing on Data Compression and In-memory Data Storage Indicators

Scalable Computing Practice and Experience ◽

10.12694/scpe.v22i4.1945 ◽

2021 ◽

Vol 22 (4) ◽

pp. 401-412

Author(s):

Hrachya Astsatryan ◽

Arthur Lalayan ◽

Aram Kocharyan ◽

Daniel Hagimont

Keyword(s):

Big Data ◽

Data Compression ◽

Data Storage ◽

File Systems ◽

Large Datasets ◽

Data Sets ◽

Mapreduce Framework ◽

Data Intensive ◽

Parallel Data ◽

Data Intensive Applications

The MapReduce framework manages Big Data sets by splitting large datasets into a set of distributed blocks and processing them in parallel. Data compression and in-memory file systems are widely used in Big Data processing to reduce resource-intensive I/O operations and correspondingly improve I/O rates. The article presents a performance-efficient, modular, configurable and robust decision-making service that relies on data compression and in-memory data storage indicators. The service consists of Recommendation and Prediction modules; it predicts the execution time of a given job based on metrics and recommends the best configuration parameters to improve the performance of the Hadoop and Spark frameworks. Several CPU- and data-intensive applications and micro-benchmarks, including Log Analyzer, WordCount, and K-Means, have been used to evaluate the performance improvement.
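
The split-then-process-in-parallel idea described at the start of this abstract can be illustrated with a small WordCount sketch in plain Python. It is only an illustration under stated assumptions: a thread pool stands in for distributed workers, the function names (`split_into_blocks`, `map_block`, `word_count`) are hypothetical, and nothing here uses the Hadoop or Spark APIs or the service the article describes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Split the input into fixed-size blocks of lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words within a single block."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def word_count(lines, block_size=2, workers=4):
    """Run the map step on each block in parallel, then reduce by merging."""
    blocks = split_into_blocks(lines, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total
```

For example, `word_count(["a b", "b c", "c c"])` splits the three lines into two blocks, counts each block separately, and merges the partial counts into one result.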

Towards Data Intensive Many-Task Computing

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Data Intensive Distributed Computing ◽

10.4018/978-1-61520-971-2.ch002 ◽

2012 ◽

pp. 28-73 ◽

Cited By ~ 8

Author(s):

Ioan Raicu ◽

Ian Foster ◽

Yong Zhao ◽

Alex Szalay ◽

Philip Little ◽

...

Keyword(s):

High Performance ◽

File Systems ◽

Data Locality ◽

Resource Provisioning ◽

Parallel File Systems ◽

Data Intensive ◽

Dynamic Resource Provisioning ◽

Rate Of Increase ◽

Parallel File ◽

Data Intensive Applications

Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Traditional techniques to support many-task computing commonly found in scientific computing (i.e. the reliance on parallel file systems with static configurations) do not scale to today’s largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. In this chapter, the authors argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. They propose a “data diffusion” approach to enable data-intensive many-task computing. They define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real world performance, and develop a competitive online caching eviction policy. They also present many empirical experiments that explore the benefits of data diffusion, both under static and dynamic resource provisioning, demonstrating approaches that improve both performance and scalability.

GLIDE: A Grid-Based Light-Weight Infrastructure for Data-Intensive Environments

Advances in Grid Computing - EGC 2005 - Lecture Notes in Computer Science ◽

10.1007/11508380_9 ◽

2005 ◽

pp. 68-77 ◽

Cited By ~ 6

Author(s):

Chris A. Mattmann ◽

Sam Malek ◽

Nels Beckman ◽

Marija Mikic-Rakic ◽

Nenad Medvidovic ◽

...

Keyword(s):

Light Weight ◽

Data Intensive ◽

Grid Based

I/O and File Systems for Data-Intensive Applications

Handbook on Data Centers ◽

10.1007/978-1-4939-2092-1_18 ◽

2015 ◽

pp. 561-582

Author(s):

Yanlong Yin ◽

Hui Jin ◽

Xian-He Sun

Keyword(s):

File Systems ◽

Data Intensive ◽

Data Intensive Applications

Understanding performance of distributed data-intensive applications

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2010.0168 ◽

2010 ◽

Vol 368 (1926) ◽

pp. 4089-4102 ◽

Cited By ~ 4

Author(s):

Christopher Miceli ◽

Michael Miceli ◽

Bety Rodriguez-Milla ◽

Shantenu Jha

Keyword(s):

File Systems ◽

Data Placement ◽

Distributed Data ◽

Distributed File Systems ◽

Sequence Matching ◽

Data Intensive ◽

Relative Placement ◽

Level Data ◽

A Genome ◽

Data Intensive Applications

Grids, clouds and cloud-like infrastructures are capable of supporting a broad range of data-intensive applications. There are interesting and unique performance issues that appear as the volume of data and degree of distribution increase. New scalable data-placement and management techniques, as well as novel approaches to determine the relative placement of data and computational workload, are required. We develop and study a genome sequence matching application that is simple to control and deploy, yet serves as a prototype of a data-intensive application. The application uses a SAGA-based implementation of the All-Pairs pattern. This paper aims to understand some of the factors that influence the performance of this application and the interplay of those factors. We also demonstrate how the SAGA approach can enable data-intensive applications to be extensible and interoperable over a range of infrastructures. This capability enables us to compare and contrast two different approaches for executing distributed data-intensive applications: simple application-level data-placement heuristics versus distributed file systems.
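
The All-Pairs pattern named in this abstract applies a comparison function to every element of one set against every element of another. The sketch below shows the pattern in plain, single-machine Python. It does not use SAGA, and the scoring function `match_score` is a hypothetical stand-in for a real sequence-matching routine, not the one used in the paper.

```python
from itertools import product


def match_score(a: str, b: str) -> int:
    """Hypothetical similarity: number of aligned positions that agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


def all_pairs(set_a, set_b, compare):
    """Apply compare to every (a, b) pair and index results by position."""
    return {
        (i, j): compare(a, b)
        for (i, a), (j, b) in product(enumerate(set_a), enumerate(set_b))
    }
```

In a distributed setting, each `(i, j)` comparison is an independent task, so where the input sequences are placed relative to the workers largely determines how much data must be moved.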
