Towards optimized scheduling for data-intensive scientific workflow in multiple datacenter environment

Jinghui Zhang; Mingjun Wang; Junzhou Luo; Fang Dong; Junxue Zhang

doi:10.1002/cpe.3601

A Survey of Data-Intensive Scientific Workflow Management

Journal of Grid Computing

2015, Vol 13 (4), pp. 457-493

doi:10.1007/s10723-015-9329-8

Cited By ~ 95

Author(s): Ji Liu; Esther Pacitti; Patrick Valduriez; Marta Mattoso

Keyword(s): Workflow Management; Scientific Workflow; Data Intensive

A novel time computation model based on algorithm complexity for data intensive scientific workflow design and scheduling

Concurrency and Computation: Practice and Experience

2009, Vol 21 (16), pp. 2070-2083

doi:10.1002/cpe.1445

Cited By ~ 1

Author(s): Jing He; Yanchun Zhang; Guangyan Huang; Chaoyi Pang

Keyword(s): Scientific Workflow; Algorithm Complexity; Computation Model; Data Intensive; Model Based

The PBase Scientific Workflow Provenance Repository

International Journal of Digital Curation

2014, Vol 9 (2), pp. 28-38

doi:10.2218/ijdc.v9i2.332

Cited By ~ 16

Author(s): Víctor Cuevas-Vicenttín; Parisa Kianmajd; Bertram Ludäscher; Paolo Missier; Fernando Chirigati; ...

Keyword(s): Scientific Workflow; Scientific Workflows; Data Reuse; Data Intensive; Research Collaborations; Provenance Data; Scientific Experiments; History Of; Scientific Results; User Friendly

Scientific workflows and their supporting systems are becoming increasingly popular for compute-intensive and data-intensive scientific experiments. The advantages scientific workflows offer include rapid and easy workflow design, software and data reuse, scalable execution, sharing and collaboration, and other advantages that altogether facilitate “reproducible science”. In this context, provenance – information about the origin, context, derivation, ownership, or history of some artifact – plays a key role, since scientists are interested in examining and auditing the results of scientific experiments. However, in order to perform such analyses on scientific results as part of extended research collaborations, an adequate environment and tools are required. Concretely, the need arises for a repository that will facilitate the sharing of scientific workflows and their associated execution traces in an interoperable manner, also enabling querying and visualization. Furthermore, such functionality should be supported while taking performance and scalability into account. With this purpose in mind, we introduce PBase: a scientific workflow provenance repository implementing the ProvONE proposed standard, which extends the emerging W3C PROV standard for provenance data with workflow specific concepts. PBase is built on the Neo4j graph database, thus offering capabilities such as declarative and efficient querying. Our experiences demonstrate the power gained by supporting various types of queries for provenance data. In addition, PBase is equipped with a user friendly interface tailored for the visualization of scientific workflow provenance data, making the specification of queries and the interpretation of their results easier and more effective.

Fault-Tolerant and Data-Intensive Resource Scheduling and Management for Scientific Applications in Cloud Computing

Sensors

2021, Vol 21 (21), pp. 7238

doi:10.3390/s21217238

Author(s): Zulfiqar Ahmad; Ali Imran Jehangiri; Mohammed Alaa Ala’anzy; Mohamed Othman; Arif Iqbal Umar

Keyword(s): Cloud Computing; Fault Tolerant; Research Work; Resource Scheduling; Scientific Workflow; Scientific Workflows; Scientific Applications; Data Intensive; Computing Paradigm; Cost Constraints

Cloud computing is a fully fledged, mature, and flexible computing paradigm that provides services to scientific and business applications in a subscription-based environment. Scientific applications such as Montage and CyberShake are organized as scientific workflows with data-intensive and compute-intensive tasks and have some special characteristics. In particular, the tasks of scientific workflows are executed in terms of integration, disintegration, pipeline, and parallelism, and thus require special attention to task management and data-oriented resource scheduling and management. The tasks executed during pipeline are considered bottleneck executions: the failure of any one of them renders the whole execution futile, so they require fault-tolerance-aware execution. The tasks executed during parallelism require similar instances of cloud resources, and thus cluster-based execution may improve system performance in terms of make-span and execution cost. Therefore, this research work presents a cluster-based, fault-tolerant and data-intensive (CFD) scheduling strategy for scientific applications in cloud environments. The CFD strategy addresses the data intensiveness of tasks of scientific workflows with cluster-based, fault-tolerant mechanisms. The Montage scientific workflow was used in the simulation, and the results of the CFD strategy were compared with three well-known heuristic scheduling policies: (a) MCT, (b) Max-min, and (c) Min-min. The simulation results showed that the CFD strategy reduced the make-span by 14.28%, 20.37%, and 11.77%, respectively, compared with the three existing policies. Similarly, the CFD strategy reduced the execution cost by 1.27%, 5.3%, and 2.21%, respectively. With the CFD strategy, the SLA is not violated with regard to time and cost constraints, whereas it is violated numerous times by the existing policies.

Computation semantics of the functional scientific workflow language Cuneiform

Journal of Functional Programming

2017, Vol 27

doi:10.1017/s0956796817000119

Cited By ~ 8

Author(s): Jörgen Brandt; Wolfgang Reisig; Ulf Leser

Keyword(s): Functional Programming; Large Scale; Type System; Black Box; Scientific Data; Scientific Workflow; Simple Type; Flexible Assembly; Data Intensive; Research Areas

Cuneiform is a minimal functional programming language for large-scale scientific data analysis. Implementing a strict black-box view on external operators and data, it allows the direct embedding of code in a variety of external languages like Python or R, provides data-parallel higher order operators for processing large partitioned data sets, allows conditionals and general recursion, and has a naturally parallelizable evaluation strategy suitable for multi-core servers and distributed execution environments like Hadoop, HTCondor, or distributed Erlang. Cuneiform has been applied in several data-intensive research areas including remote sensing, machine learning, and bioinformatics, all of which critically depend on the flexible assembly of pre-existing tools and libraries written in different languages into complex pipelines. This paper introduces the computation semantics for Cuneiform. It presents Cuneiform's abstract syntax, a simple type system, and the semantics of evaluation. Providing an unambiguous specification of the behavior of Cuneiform eases the implementation of interpreters which we showcase by providing a concise reference implementation in Erlang. The similarity of Cuneiform's syntax to the simply typed lambda calculus puts Cuneiform in perspective and allows a straightforward discussion of its design in the context of functional programming. Moreover, the simple type system allows the deduction of the language's safety up to black-box operators. Last, the formulation of the semantics also permits the verification of compilers to and from other workflow languages.

Distributed Scientific Workflow Management for Data-Intensive Applications

2008 12th IEEE International Workshop on Future Trends of Distributed Computing Systems

2008

doi:10.1109/ftdcs.2008.39

Cited By ~ 5

Author(s): S. Shumilov; Y. Leng; M. El-Gayyar; A.B. Cremers

Keyword(s): Workflow Management; Scientific Workflow; Data Intensive; Data Intensive Applications

Integrating Policy with Scientific Workflow Management for Data-Intensive Applications

2012 SC Companion: High Performance Computing, Networking Storage and Analysis

2012

doi:10.1109/sc.companion.2012.29

Cited By ~ 6

Author(s): Ann L. Chervenak; David E. Smith; Weiwei Chen; Ewa Deelman

Keyword(s): Workflow Management; Scientific Workflow; Data Intensive; Data Intensive Applications

Services + Components = Data Intensive Scientific Workflow Applications with MeDICi

Component-Based Software Engineering - Lecture Notes in Computer Science

2009, pp. 227-241

doi:10.1007/978-3-642-02414-6_14

Cited By ~ 5

Author(s): Ian Gorton; Jared Chase; Adam Wynne; Justin Almquist; Alan Chappell

Keyword(s): Scientific Workflow; Data Intensive

Scientific workflows applied to the coupling of a continuum (Elmer v8.3) and a discrete element (HiDEM v1.0) ice dynamic model

Geoscientific Model Development

2019, Vol 12 (7), pp. 3001-3015

doi:10.5194/gmd-12-3001-2019

Cited By ~ 2

Author(s): Shahbaz Memon; Dorothée Vallot; Thomas Zwinger; Jan Åström; Helmut Neukirchen; ...

Keyword(s): Management System; High Performance; Heterogeneous Computing; Workflow Management; Scientific Workflow; Workflow Management System; Data Intensive; CPU Utilization; Computing Environments; High Level

Scientific computing applications involving complex simulations and data-intensive processing are often composed of multiple tasks forming a workflow of computing jobs. Scientific communities running such applications on computing resources often find it cumbersome to manage and monitor the execution of these tasks and their associated data. These workflow implementations usually add overhead by introducing unnecessary input/output (I/O) for coupling the models and can lead to sub-optimal CPU utilization. Furthermore, running these workflow implementations in different environments requires significant adaptation efforts, which can hinder the reproducibility of the underlying science. High-level scientific workflow management systems (WMS) can be used to automate and simplify complex task structures by providing tooling for the composition and execution of workflows – even across distributed and heterogeneous computing environments. The WMS approach allows users to focus on the underlying high-level workflow and avoid low-level pitfalls that would lead to non-optimal resource usage while still allowing the workflow to remain portable between different computing environments. As a case study, we apply the UNICORE workflow management system to enable the coupling of a glacier flow model and calving model which contain many tasks and dependencies, ranging from pre-processing and data management to repetitive executions in heterogeneous high-performance computing (HPC) resource environments. Using the UNICORE workflow management system, the composition, management, and execution of the glacier modelling workflow becomes easier with respect to usage, monitoring, maintenance, reusability, portability, and reproducibility in different environments and by different user groups. Last but not least, the workflow helps to speed the runs up by reducing model coupling I/O overhead and it optimizes CPU utilization by avoiding idle CPU cores and running the models in a distributed way on the HPC cluster that best fits the characteristics of each model.

Contemporary challenges for data-intensive scientific workflow management systems

Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science - WORKS '15

2015

doi:10.1145/2822332.2822336

Cited By ~ 4

Author(s): Ryan Mork; Paul Martin; Zhiming Zhao

Keyword(s): Workflow Management; Scientific Workflow; Management Systems; Workflow Management Systems; Data Intensive
