Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Author(s):  
Camille Coti ◽  
Thomas Herault ◽  
Pierre Lemarinier ◽  
Laurence Pilard ◽  
Ala Rezmerita ◽  
...  
2008 ◽  
Vol 24 (1) ◽  
pp. 73-84 ◽  
Author(s):  
Darius Buntinas ◽  
Camille Coti ◽  
Thomas Herault ◽  
Pierre Lemarinier ◽  
Laurence Pilard ◽  
...  

2014 ◽  
Vol 933 ◽  
pp. 584-589
Author(s):  
Zhi Chun Zhang ◽  
Song Wei Li ◽  
Wei Ren Wang ◽  
Wei Zhang ◽  
Li Jun Qi

This paper presents a system in which cluster devices are controlled by single-chip microcomputers (SCMs), with emphasis on the techniques used to manage the SCM cluster. Each device in a cluster is controlled by an SCM that collects sample data, sends it to the cluster management computer, and drives the device using driving data received from that same computer, all through COMs. The cluster management system running on the cluster management computer handles initial SCM identification, run-time slice management, communication resource utilization, fault tolerance, and error correction for the SCMs. Initial SCM identification is achieved through signal responses between the SCMs and the cluster management computer. Port priorities and parallelized serial communication maximize the system's real-time performance. Real-time performance can be tuned by changing the number of COMs and the number of ports linked to each COM, and raised further by configuring more cluster management computers. Fault-tolerant control occurs in both the initialization phase and the operational phase. In the initialization phase, the cluster management system incorporates unidentified SCMs into the system based on history information recorded on external storage media. In the operational phase, if the number of read and write errors on an SCM reaches a predetermined threshold, that SCM is treated as seriously faulty or absent. The cluster management system keeps an accuracy-maintenance database on an external storage medium to handle nonlinear control of specific devices and loss of accuracy due to wear. A unified driving framework, designed with an object-oriented method, makes the implementation of the cluster management system simpler, standardized, and easy to port. The system has been applied in a large-scale simulation system of 230 single-chip microcomputers, where it proved reliable, real-time, and easy to maintain.
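The operational-phase rule in this abstract (an SCM is written off once its read/write errors reach a predetermined threshold) could be sketched roughly as follows. This is only an illustration under assumptions: the class name `ScmMonitor`, its methods, the default threshold of 3, and the choice to count errors cumulatively rather than consecutively are all hypothetical, since the abstract gives none of these details.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ScmMonitor:
    """Hypothetical tracker of read/write errors per single-chip microcomputer."""

    # Assumed value; the abstract only says "a predetermined threshold".
    error_threshold: int = 3
    error_counts: Dict[str, int] = field(default_factory=dict)
    faulted: Set[str] = field(default_factory=set)

    def record_io(self, scm_id: str, ok: bool) -> None:
        """Record the outcome of one read or write on the given SCM."""
        if ok or scm_id in self.faulted:
            return  # successes and already-faulted SCMs change nothing
        # Assumption: errors accumulate and are never reset by a success.
        count = self.error_counts.get(scm_id, 0) + 1
        self.error_counts[scm_id] = count
        if count >= self.error_threshold:
            # Treat the SCM as seriously faulty or absent from now on.
            self.faulted.add(scm_id)

    def is_available(self, scm_id: str) -> bool:
        """Return True while the SCM has not been marked as faulted."""
        return scm_id not in self.faulted


monitor = ScmMonitor(error_threshold=2)
monitor.record_io("scm-17", ok=False)
monitor.record_io("scm-17", ok=False)
print(monitor.is_available("scm-17"))
```

A real system would presumably also persist the faulted set, in line with the history information the abstract says is kept on external storage, but that part is not sketched here.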


Author(s):  
L. Wang ◽  
K. Pattabiraman ◽  
Z. Kalbarczyk ◽  
R.K. Iyer ◽  
L. Votta ◽  
...  

2018 ◽  
Vol 3 (1) ◽  
pp. 1 ◽  
Author(s):  
Mounir Hafsa ◽  
Farah Jemili

Cybersecurity Ventures expects that cyber-attack damage costs will rise to $11.5 billion in 2019 and that a business will fall victim to a cyber-attack every 14 seconds. Note that the time frame for such an event is measured in seconds. With petabytes of data generated each day, detecting attacks at this pace is a challenging task for traditional intrusion detection systems (IDSs). Protecting sensitive information is a major concern for both businesses and governments, so a real-time, large-scale, and effective IDS is a must. In this work, we present a cloud-based, fault-tolerant, scalable, and distributed IDS that uses Apache Spark Structured Streaming and its machine learning library (MLlib) to detect intrusions in real time. To demonstrate its effectiveness, we implement the proposed system within Microsoft Azure Cloud, which provides both processing power and storage capabilities. A decision tree algorithm is used to predict the nature of incoming data. The MAWILab dataset serves as the data source, giving better insight into the system's capabilities against cyber-attacks. Experimental results showed 99.95% accuracy, with more than 55,175 events per second processed by the proposed system on a small cluster.
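To make the shape of such a pipeline concrete, here is a plain-Python stand-in that groups incoming events into micro-batches and labels each one with a small decision tree. It is not Spark code and not the authors' model: the feature names (`packets_per_second`, `distinct_ports`), the thresholds, the label names, and every function below are hypothetical, and a real deployment would train the tree with MLlib rather than hand-write it.

```python
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, float]


def classify(event: Event) -> str:
    """Hand-written two-level decision tree with illustrative thresholds."""
    if event["packets_per_second"] > 1000.0:
        if event["distinct_ports"] > 50.0:
            return "anomalous"
        return "suspicious"
    return "benign"


def micro_batches(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group a stream of events into fixed-size batches; the last may be short."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def label_stream(events: Iterable[Event], batch_size: int = 2) -> Iterator[List[str]]:
    """Yield one list of predicted labels per micro-batch."""
    for batch in micro_batches(events, batch_size):
        yield [classify(event) for event in batch]


sample_events = [
    {"packets_per_second": 5000.0, "distinct_ports": 100.0},
    {"packets_per_second": 10.0, "distinct_ports": 1.0},
    {"packets_per_second": 2000.0, "distinct_ports": 3.0},
]
for labels in label_stream(sample_events):
    print(labels)
```

The batching step mirrors the micro-batch style of stream processing only loosely; it has none of the fault tolerance or distribution that the abstract attributes to the actual system.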

