A Replication-Based Mechanism for Fault Tolerance in MapReduce Framework

2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Yang Liu ◽  
Wei Wei

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. In cloud environments, node and task failures are no longer exceptional but a common feature of large-scale systems. The current rescheduling-based fault-tolerance method in the MapReduce framework fails to fully consider the location of distributed data and the computation and storage overhead of rescheduling failed tasks; as a result, a single node failure can increase the completion time dramatically. In this paper, a replication-based mechanism is proposed which takes both task and node failures into consideration. Experimental results show that, compared with the default mechanism in Hadoop, our mechanism can significantly improve performance under failure, reducing execution time by more than 30%.
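
As a rough illustration of the trade-off the abstract describes, the following toy Python model (not the authors' Hadoop implementation; fail_prob and sync_overhead are made-up illustrative values) contrasts re-executing failed tasks from scratch with failing over to a replica kept in sync:

```python
import random

# Toy model (not the authors' Hadoop code): compare expected job time when a
# failed task is rescheduled from scratch versus when a synchronized replica
# takes over.

def job_time_rescheduling(n_tasks, task_time, fail_prob):
    """Each failed attempt pays the full task re-execution cost."""
    total = 0.0
    for _ in range(n_tasks):
        t = task_time
        while random.random() < fail_prob:   # attempt fails, start over
            t += task_time
        total += t
    return total

def job_time_replication(n_tasks, task_time, fail_prob, sync_overhead=0.1):
    """A replica tracks task progress; failover loses only the sync interval."""
    total = 0.0
    for _ in range(n_tasks):
        t = task_time * (1 + sync_overhead)  # constant cost of keeping a replica
        if random.random() < fail_prob:
            t += task_time * sync_overhead   # resume from the replica's state
        total += t
    return total

random.seed(42)
print(job_time_rescheduling(100, 10.0, 0.1))  # pays full restarts on failure
print(job_time_replication(100, 10.0, 0.1))   # pays a small constant tax instead
```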

2013 ◽  
pp. 294-321
Author(s):  
Alexandru Costan

To accommodate the needs of large-scale distributed systems, scalable data storage and management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This chapter addresses the key issues of data handling in grid environments, focusing on storing, accessing, managing, and processing data. We start by providing the background for the data storage issue in grid environments. We outline the main challenges addressed by distributed storage systems: high availability, which translates into high resilience and consistency; corruption handling for arbitrary faults; fault tolerance; asynchrony; fairness; access control; and transparency. The core part of the chapter presents how existing solutions cope with these high requirements. The most important research results are organized along several themes: grid data storage, distributed file systems, data transfer and retrieval, and data management. Important characteristics such as performance, efficient use of resources, fault tolerance, and security are strongly determined by the adopted system architectures and the technologies behind them. For each topic, we briefly present previous work, describe the most recent achievements, highlight their advantages and limitations, and indicate future research trends in distributed data storage and management.


2018 ◽  
Vol 14 (05) ◽  
pp. 118 ◽  
Author(s):  
Yang Xiao

To address node cascading failure (CF) in wireless sensor networks (WSNs), this paper establishes a WSN dynamic fault-tolerant topology model based on node cascading failure, considering factors such as node load and maximum capacity in a scale-free topology. It analyses the relationships between node load, topology, and dynamic fault tolerance, and demonstrates the proposed model through simulation tests. The paper studies the effects of topology parameters and load when random node failures trigger cascading failures, and theoretically derives the structural features of the scale-free topology and the capacity limit for large-scale cascading failure in WSNs, effectively enhancing the cascading fault tolerance of traditional WSNs. The simulation results show that as the degree distribution parameter C increases, the minimum node degree increases accordingly, and the more intensive the topology, the better the dynamic fault tolerance; as the parameter λ increases, the maximum node degree gradually decreases and the degree distribution of the topology tends to be uniform, so large-scale cascading failures caused by node failures have less influence on the WSN, further improving the dynamic fault-tolerance performance of the system.
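
The load/capacity dynamics described here can be sketched with a standard cascading-failure model on a scale-free graph. The following Python snippet is an illustrative stand-in (a Motter-Lai-style model built with networkx, not the paper's model; the load measure, capacity rule, and parameters are assumptions):

```python
import networkx as nx

# Illustrative Motter-Lai-style cascade on a scale-free graph: load is
# approximated by betweenness centrality and each node's capacity is
# (1 + alpha) times its initial load.

def cascade_size(G, alpha, seed_node):
    load = nx.betweenness_centrality(G)
    capacity = {v: (1 + alpha) * load[v] for v in G}
    H = G.copy()
    H.remove_node(seed_node)
    failed = {seed_node}
    while True:
        load = nx.betweenness_centrality(H)          # redistributed load
        overloaded = [v for v in H if load[v] > capacity[v]]
        if not overloaded:
            break
        H.remove_nodes_from(overloaded)              # the cascade spreads
        failed.update(overloaded)
    return len(failed)

G = nx.barabasi_albert_graph(200, 2, seed=1)         # scale-free WSN stand-in
hub = max(G.degree, key=lambda kv: kv[1])[0]         # fail the largest hub
print(cascade_size(G, 0.2, hub))   # more spare capacity (alpha) shrinks the cascade
```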


Author(s):  
ROBERT STEWART ◽  
PATRICK MAIER ◽  
PHIL TRINDER

Abstract Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model, which provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault-tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a variant of the Haskell distributed parallel Haskell (HdpH) DSL with Reliable Scheduling. Extending the distributed work-stealing protocol of HdpH for task supervision and recovery is challenging. To eliminate elusive concurrency bugs, we validate the HdpH-RS work-stealing protocol using the SPIN model checker. HdpH-RS differs from the actor model in that its principal entities are tasks, i.e. independent stateless computations, rather than isolated stateful actors. Thanks to statelessness, fault recovery can be performed automatically and entirely hidden in the HdpH-RS runtime system. Statelessness is also key for proving a crucial property of the semantics of HdpH-RS: fault recovery does not change the result of the program, akin to deterministic parallelism. HdpH-RS provides a simple distributed fork/join-style programming model, with minimal exposure of fault tolerance at the language level, and a library of higher-level abstractions such as algorithmic skeletons. In fact, the HdpH-RS DSL is exactly the same as the HdpH DSL, hence users can opt in or out of fault-tolerant execution without any refactoring. Computations in HdpH-RS are always as reliable as the root node, no matter how many nodes and cores are actually used. We benchmark HdpH-RS on conventional clusters and a High Performance Computing (HPC) platform: all benchmarks survive Chaos Monkey random fault injection; the system scales well, e.g. up to 1,400 cores on the HPC platform; and reliability and recovery overheads are consistently low even at scale.
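
HdpH-RS itself is a Haskell DSL, but the core idea, i.e. that stateless tasks can be transparently re-executed by a supervisor without changing the program's result, can be sketched in a few lines of Python (an illustrative analogy, not the HdpH-RS scheduler):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

# Python analogy for the HdpH-RS idea: tasks are stateless and deterministic,
# so a supervisor may simply re-run any task whose worker failed without
# changing the program's result.

def supervised_map(fn, inputs, max_retries=3):
    results = {}
    retries = {i: 0 for i in range(len(inputs))}     # task index -> retry count
    while retries:
        with ProcessPoolExecutor() as pool:
            futures = {pool.submit(fn, inputs[i]): i for i in retries}
            for fut in as_completed(futures):
                i = futures[fut]
                try:
                    results[i] = fut.result()        # task finished normally
                    del retries[i]
                except Exception:
                    retries[i] += 1                  # worker failed: reschedule
                    if retries[i] > max_retries:
                        raise
    return [results[i] for i in range(len(inputs))]

if __name__ == "__main__":   # guard required for process pools on some platforms
    print(supervised_map(abs, [-1, -2, -3]))   # [1, 2, 3]
```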


Author(s):  
Gourav Bathla ◽  
Himanshu Aggarwal ◽  
Rinkle Rani

Clustering is one of the most important applications of data mining and has attracted the attention of researchers in statistics and machine learning. It is used in many applications such as information retrieval, image processing, and social network analytics. It helps users understand the similarity and dissimilarity between objects, and cluster analysis makes complex and large data sets easier to understand. Various clustering algorithms have been analysed by researchers. K-means is the most popular partitioning-based algorithm, as it provides good results through accurate calculation on numerical data; however, K-means works well for numerical data only, whereas big data is a combination of numerical and categorical data. The K-prototype algorithm handles both numerical and categorical data by combining the distances computed on the numeric and categorical attributes. With the growth of data from social networking websites, business transactions, scientific calculation, and so on, there are vast collections of structured, semi-structured, and unstructured data, so K-prototype needs to be optimized for these varieties of data to be analysed efficiently. In this work, the K-prototype algorithm is implemented on MapReduce. Experiments show that K-prototype implemented on MapReduce gives better performance on multiple nodes than on a single node; CPU execution time and speedup are used as evaluation metrics. An intelligent splitter is also proposed, which splits mixed big data into numerical and categorical parts. Comparison with traditional algorithms shows that the proposed algorithm works better for large-scale data.
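
For concreteness, here is a minimal Python sketch of the combined dissimilarity (following Huang's k-prototypes formulation) and of a splitter in the spirit of the proposed intelligent splitter (the type-based routing is a hypothetical stand-in, not the paper's implementation):

```python
import numpy as np

# Sketch of the k-prototypes dissimilarity (Huang's formulation, assumed
# here): squared Euclidean distance on numeric attributes plus gamma times
# the number of mismatched categorical attributes.

def kprototype_distance(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    numeric = np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2)
    categorical = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric + gamma * categorical

# A splitter in the spirit of the proposed intelligent splitter: route each
# attribute of a mixed record to the numeric or categorical part.
def split_record(record):
    nums = [v for v in record if isinstance(v, (int, float))]
    cats = [v for v in record if isinstance(v, str)]
    return nums, cats

nums, cats = split_record([5.1, "red", 3.4, "yes"])
print(kprototype_distance(nums, cats, [5.0, 3.5], ["red", "no"]))  # 0.02 + gamma
```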


IEEE Access ◽  
2019 ◽  
Vol 7 ◽  
pp. 81574-81592 ◽  
Author(s):  
Yinan Wu ◽  
Gongzhuang Peng ◽  
Hongwei Wang ◽  
Heming Zhang

2016 ◽  
Vol 6 (1) ◽  
pp. 59-87 ◽  
Author(s):  
Amer Al-Badarneh ◽  
Amr Mohammad ◽  
Salah Harb

MapReduce, a distinguished and successful platform for parallel data processing, is attracting significant momentum from both academia and industry as the volume of data to capture, transform, and analyse grows rapidly. Although MapReduce is used in many applications to analyse large-scale data sets, there is still much debate among scientists and researchers about its efficiency, performance, and usability for supporting more classes of applications. This survey presents a comprehensive review of various implementations of the MapReduce framework. The authors first give an overview of the MapReduce programming model. They then present a broad description of the technical aspects of the most successful implementations of the MapReduce framework reported in the literature and discuss their main strengths and weaknesses. Finally, the authors conclude with a comparison between MapReduce implementations and discuss open issues and challenges for enhancing MapReduce.
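
The programming model the survey reviews is easiest to see in the canonical word-count example, sketched here as plain Python (a real framework would distribute the map and reduce calls across a cluster and shuffle intermediate pairs by key):

```python
from itertools import groupby
from operator import itemgetter

# Canonical word-count example of the MapReduce programming model.

def map_fn(document):
    for word in document.split():
        yield word, 1                      # emit intermediate key/value pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # merge all values for one key

def run_mapreduce(documents):
    intermediate = [kv for doc in documents for kv in map_fn(doc)]
    intermediate.sort(key=itemgetter(0))   # "shuffle": group pairs by key
    return dict(
        kv
        for key, group in groupby(intermediate, key=itemgetter(0))
        for kv in reduce_fn(key, (v for _, v in group))
    )

print(run_mapreduce(["big data big cluster", "big data"]))
# {'big': 3, 'cluster': 1, 'data': 2}
```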


2013 ◽  
Vol 427-429 ◽  
pp. 2618-2621 ◽  
Author(s):  
Ling Shen ◽  
Qing Xi Peng

As emerging data-intensive applications receive more and more attention from researchers, near-duplicate text detection for large-scale data poses a severe challenge. This paper presents an algorithm based on MapReduce and ontology for near-duplicate text detection, which computes pairwise document similarity in large-scale document collections. We map the words in each document to their synonyms and then calculate the similarity between documents. MapReduce is a programming model and an associated implementation for processing and generating large data sets: users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. In large-scale tests, experimental results demonstrate that this approach outperforms other state-of-the-art solutions. Advantages such as linear time and accuracy make the algorithm valuable in practice.
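
A minimal sketch of the two steps follows (Python; the synonym table and the Jaccard coefficient are illustrative stand-ins for the paper's ontology and similarity measure):

```python
# Canonicalize words via a synonym map, then score pairwise document
# similarity with the Jaccard coefficient over the canonical word sets.

SYNONYMS = {"car": "automobile", "auto": "automobile", "fast": "quick"}

def canonical_words(text):
    return {SYNONYMS.get(w, w) for w in text.lower().split()}

def similarity(doc_a, doc_b):
    a, b = canonical_words(doc_a), canonical_words(doc_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Synonym mapping makes these two documents compare as exact duplicates:
print(similarity("a fast car", "a quick auto"))   # 1.0
```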




2014 ◽  
Vol 10 (3) ◽  
pp. 19-35 ◽  
Author(s):  
K. Amshakala ◽  
R. Nedunchezhian ◽  
M. Rajalakshmi

Over the last few years, data have been generated in large volumes at ever faster rates, and there has been remarkable growth in the need for large-scale data processing systems. As data grow larger in size, data quality is compromised. Functional dependencies, which represent semantic constraints in data, are important for data quality assessment. Executing functional dependency discovery algorithms on a single computer is hard and laborious with large data sets. MapReduce provides an enabling technology for large-scale data processing, and the open-source Hadoop implementation of MapReduce has given researchers a powerful tool for tackling large-data problems in a distributed manner. The objective of this study is to extract functional dependencies between attributes from large datasets using the MapReduce programming model. Attribute entropy is used to measure inter-attribute correlations and is exploited to discover functional dependencies hidden in the data.
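
The entropy test behind this approach rests on a standard information-theoretic fact: a functional dependency X → Y holds in a relation exactly when H(X, Y) = H(X), i.e. knowing X leaves no residual uncertainty about Y. A minimal single-machine sketch follows (not the authors' MapReduce job):

```python
import math
from collections import Counter

# FD test via entropy: X -> Y holds exactly when H(X, Y) = H(X).

def entropy(values):
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def fd_holds(rows, x_idx, y_idx):
    h_x = entropy([r[x_idx] for r in rows])
    h_xy = entropy([(r[x_idx], r[y_idx]) for r in rows])
    return math.isclose(h_x, h_xy)

rows = [("alice", "nyc"), ("bob", "sfo"), ("alice", "nyc")]
print(fd_holds(rows, 0, 1))   # True: name -> city holds in this tiny relation
```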


Author(s):  
Lior Shamir

Abstract Several recent observations using large data sets of galaxies have shown a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to the asymmetry of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each was annotated using a different method. The results show that both data sets exhibit a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to a cosine dependence yields a dipole axis with probabilities of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$, well within the $1\sigma$ error range of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$.
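
Fitting spin asymmetry to a cosine dependence can be sketched as a scan over candidate dipole axes, scoring each by the correlation between spin sign and the cosine of the angular distance to the axis. The following Python sketch uses synthetic data and is not the paper's pipeline:

```python
import numpy as np

# Scan candidate dipole axes; the best-scoring axis is the fitted dipole.

def unit_vec(alpha_deg, delta_deg):
    a, d = np.radians(alpha_deg), np.radians(delta_deg)
    return np.array([np.cos(d) * np.cos(a), np.cos(d) * np.sin(a), np.sin(d)])

def dipole_fit(alphas, deltas, spins):
    dirs = np.stack([unit_vec(a, d) for a, d in zip(alphas, deltas)])
    best = None
    for ax_a in range(0, 360, 5):                  # coarse grid over the sky
        for ax_d in range(-90, 91, 5):
            cosines = dirs @ unit_vec(ax_a, ax_d)  # cos(angle to axis)
            score = abs(np.dot(spins, cosines))
            if best is None or score > best[0]:
                best = (score, ax_a, ax_d)
    return best[1], best[2]

rng = np.random.default_rng(0)
alphas = rng.uniform(0, 360, 2000)
deltas = np.degrees(np.arcsin(rng.uniform(-1, 1, 2000)))   # uniform on sphere
spins = np.where(rng.uniform(size=2000) < 0.5, 1, -1)      # random +/- spins
print(dipole_fit(alphas, deltas, spins))   # best axis (spurious for random data)
```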

