Performance Analysis of Large-Scale Distributed Stream Processing Systems on the Cloud

Author(s):  
Tri Minh Truong ◽  
Aaron Harwood ◽  
Richard O. Sinnott ◽  
Shiping Chen
2017 ◽  
Author(s):  
Francesco Versaci ◽  
Luca Pireddu ◽  
Gianluigi Zanetti

Personalized medicine is in great part enabled by progress in data acquisition technologies for modern biology, such as next-generation sequencing (NGS). Conventional NGS processing workflows are composed of independent, shared-memory-parallel tools that communicate through intermediate files. As data sizes grow, this approach shows limited scalability and robustness – problems that make it unsuitable for large-scale, population-wide personalized medicine applications. In this work we propose adopting a stream computing architecture to make the genomics pipeline more scalable and fault-tolerant. We implemented the first processing phases for Illumina sequencing data – from raw data to alignment – using the Apache Flink distributed stream processing framework and Apache Kafka. The new pipeline was tested by processing the raw output of an Illumina HiSeq3000 sequencer and producing aligned reads in CRAM format. The results show near-optimal scalability in experiments from 1 to 12 computing nodes, with a speed-up of 9.5x over the conventional solution (which cannot automatically run on multiple nodes). This result is particularly positive given that the very short runtime of the experiment – less than 15 minutes – makes the constant overheads imposed by the frameworks significant.
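The key architectural shift described in this abstract – records flowing continuously between stages rather than being materialized as intermediate files between tools – can be illustrated with a toy, in-memory sketch. The stage names, the quality filter, and the "alignment" step below are illustrative stand-ins, not the authors' implementation; in the real pipeline these stages are Flink operators connected through Kafka topics.

```python
def parse_raw(lines):
    """Group raw FASTQ-like lines into (read_id, sequence) records."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        yield header.lstrip("@"), seq

def quality_filter(records, min_len=4):
    """Drop reads shorter than min_len (a stand-in for quality control)."""
    for read_id, seq in records:
        if len(seq) >= min_len:
            yield read_id, seq

def align(records):
    """Stand-in for the alignment stage producing CRAM-like records."""
    for read_id, seq in records:
        yield {"id": read_id, "seq": seq, "mapped": True}

# Records stream through the stages one at a time; no stage waits for a
# complete intermediate file from its predecessor.
raw = ["@r1", "ACGTACGT", "@r2", "ACG", "@r3", "TTTTGGGG"]
aligned = list(align(quality_filter(parse_raw(raw))))
print([rec["id"] for rec in aligned])
```

Because each generator pulls records lazily, the pipeline's memory footprint is bounded by the in-flight records rather than the dataset size – the same property that lets the Flink/Kafka version scale out across nodes.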


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Zhimin Lin ◽  
Chao Huang ◽  
Mingwei Lin

Distributed stream processing frameworks (DSPFs) are the vital engines that handle real-time data processing and analytics for IoT applications. How to prioritize DSPFs and select the most suitable one for a specific IoT application is an open issue. To help developers of IoT applications solve this complex problem, a novel probabilistic hesitant fuzzy multicriteria decision making (MCDM) model is put forward in this paper. To characterize the requirements of large-scale IoT data stream processing, a novel evaluation criteria system including qualitative and quantitative criteria is established. To accurately model the collective opinions of skilled developers and account for their psychological distance, the definition of probabilistic hesitant fuzzy sets (PHFSs) is used. To derive the importance degrees of the criteria, a novel probabilistic hesitant fuzzy best-worst (PHFBW) method based on the score value is proposed. To prioritize the DSPFs and choose the most suitable one, a novel probabilistic hesitant fuzzy MULTIMOORA method is put forward. Finally, a practical case comprising four Apache stream processing frameworks – Storm, Flink, Spark, and Samza – is studied. The results indicate that throughput, latency, and reliability are the three most important criteria, and that Flink is the most suitable stream processing framework.
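The MULTIMOORA method underlying this paper's model combines three subordinate rankings: a ratio system, a reference-point comparison, and a full multiplicative form. The following sketch shows a crisp (non-fuzzy) MULTIMOORA over the paper's four frameworks and its three top criteria; the scores are hypothetical, equal criterion weights are assumed, and rank-sum aggregation stands in for the dominance theory used in the full method – this is not the paper's probabilistic hesitant fuzzy variant.

```python
import math

alternatives = ["Storm", "Flink", "Spark", "Samza"]
# Hypothetical scores; columns: [throughput, latency, reliability].
scores = [
    [70.0, 15.0, 0.90],   # Storm
    [95.0,  5.0, 0.99],   # Flink
    [80.0, 25.0, 0.95],   # Spark
    [60.0, 20.0, 0.92],   # Samza
]
benefit = [True, False, True]  # latency is a cost criterion

# Vector normalization: x*_ij = x_ij / sqrt(sum_i x_ij^2)
norms = [math.sqrt(sum(row[j] ** 2 for row in scores)) for j in range(3)]
x = [[row[j] / norms[j] for j in range(3)] for row in scores]

# 1) Ratio system: benefits added, costs subtracted; higher is better.
y = [sum(xi[j] if benefit[j] else -xi[j] for j in range(3)) for xi in x]

# 2) Reference point: max deviation from the per-criterion ideal; lower is better.
ref = [max(xi[j] for xi in x) if benefit[j] else min(xi[j] for xi in x)
       for j in range(3)]
d = [max(abs(ref[j] - xi[j]) for j in range(3)) for xi in x]

# 3) Full multiplicative form: product of benefits / product of costs.
u = []
for xi in x:
    num = den = 1.0
    for j in range(3):
        if benefit[j]:
            num *= xi[j]
        else:
            den *= xi[j]
    u.append(num / den)

def ranks(values, reverse):
    """Rank positions (1 = best) of each alternative under one criterion."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
    r = [0] * len(values)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

# Aggregate the three subordinate rankings by rank sum.
r1, r2, r3 = ranks(y, True), ranks(d, False), ranks(u, True)
total = [r1[i] + r2[i] + r3[i] for i in range(4)]
ranking = [alternatives[i] for i in sorted(range(4), key=lambda i: total[i])]
print(ranking)
```

With these illustrative scores Flink dominates on every criterion, so it ranks first in all three subordinate orderings and therefore first overall – mirroring the paper's conclusion, though a real application would use elicited probabilistic hesitant fuzzy evaluations and PHFBW-derived weights rather than crisp numbers.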


2008 ◽  
Vol 3 (4) ◽  
pp. 1-28 ◽  
Author(s):  
Kirsten Hildrum ◽  
Fred Douglis ◽  
Joel L. Wolf ◽  
Philip S. Yu ◽  
Lisa Fleischer ◽  
...  


2012 ◽  
Vol 23 (2) ◽  
pp. 323-334
Author(s):  
Guo-Feng YAN ◽  
Jian-Xin WANG ◽  
Shu-Hong CHEN
