Leveraging High Performance Computing for Managing Large and Evolving Data Collections

Ritu Arora; Maria Esteva; Jessica Trelogan

doi:10.2218/ijdc.v9i2.331

International Journal of Digital Curation ◽

10.2218/ijdc.v9i2.331 ◽

2014 ◽

Vol 9 (2) ◽

pp. 17-27 ◽

Cited By ~ 6

Author(s):

Ritu Arora ◽

Maria Esteva ◽

Jessica Trelogan

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Research Process ◽

Open Science ◽

Test Case ◽

Growth Data ◽

Data Types ◽

Performance Computing

The process of developing a digital collection in the context of a research project often follows a pipeline pattern in which data growth, data types, and data authenticity need to be assessed iteratively in relation to the different research steps and in the interest of archiving. Throughout a project’s lifecycle, curators organize newly generated data, clean and integrate legacy data when it exists, and decide what data will be preserved for the long term. Although these actions should be part of a well-oiled data management workflow, they pose practical challenges if the collection is very large and heterogeneous, or is accessed by several researchers concurrently. There is a need for data management solutions that can help curators with efficient and on-demand analyses of their collection so that they remain well-informed about its evolving characteristics. In this paper, we describe our efforts towards developing a workflow to leverage open science High Performance Computing (HPC) resources for routinely and efficiently conducting data management tasks on large collections. We demonstrate that HPC resources and techniques can significantly reduce the time for accomplishing critical data management tasks, and enable dynamic archiving throughout the research process. We use a large archaeological data collection with a long and complex formation history as our test case. We share our experiences in adopting open science HPC resources for large-scale data management, which entails understanding usage of the open source HPC environment and training users. These experiences can be generalized to meet the needs of other data curators working with large collections.

High-performance computing in water resources hydrodynamics

Journal of Hydroinformatics ◽

10.2166/hydro.2020.163 ◽

2020 ◽

Vol 22 (5) ◽

pp. 1217-1235 ◽

Cited By ~ 3

Author(s):

M. Morales-Hernández ◽

M. B. Sharif ◽

S. Gangrade ◽

T. T. Dullo ◽

S.-C. Kao ◽

...

Keyword(s):

Water Resources ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Test Case ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing ◽

Performance Computing

This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). Advances in computing power, formerly driven by the improvement of central processing units, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data and even changing the nature of the algorithm that solves the system of equations. These concepts, along with other features such as the precision of the computations, dry regions management, and input/output data, are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.

Work in progress — Integration of the scientific workflow paradigm into high performance computing and large scale data management curricula

2010 IEEE Frontiers in Education Conference (FIE) ◽

10.1109/fie.2010.5673235 ◽

2010 ◽

Author(s):

Brandeis Marshall ◽

John Springer ◽

Thomas Hacker

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Scientific Workflow ◽

Work In Progress ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Accelerating Large-Scale Data Analysis by Offloading to High-Performance Computing Libraries using Alchemist

Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining ◽

10.1145/3219819.3219927 ◽

2018 ◽

Cited By ~ 2

Author(s):

Alex Gittens ◽

Kai Rothauge ◽

Shusen Wang ◽

Michael W. Mahoney ◽

Lisa Gerhardt ◽

...

Keyword(s):

Data Analysis ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Large Scale Data ◽

Performance Computing ◽

Scale Data

Challenges of Research Data Management for High Performance Computing

Research and Advanced Technology for Digital Libraries - Lecture Notes in Computer Science ◽

10.1007/978-3-319-67008-9_12 ◽

2017 ◽

pp. 140-151 ◽

Cited By ~ 4

Author(s):

Björn Schembera ◽

Thomas Bönisch

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

Research Data ◽

Research Data Management ◽

Performance Computing

High-Performance Computing Framework Based on Distributed Systems for Large-Scale Neurophysiological Data

10.21203/rs.3.rs-136986/v1 ◽

2021 ◽

Author(s):

Mohsen Hadianpour ◽

Ehsan Rezayat ◽

Mohammad-Reza Dehaqani

Keyword(s):

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Electrophysiological Recording ◽

Neural Data ◽

Data Framework ◽

Neurophysiological Data ◽

Computing Framework ◽

Performance Computing ◽

Neuroscience Community

Owing to rapid progress in neurophysiological recording technologies, neuroscientists face various complexities in dealing with unstructured large-scale neural data. In the neuroscience community, these complexities can create serious bottlenecks in storing, sharing, and processing neural datasets. In this article, we develop a distributed high-performance computing (HPC) framework called the "Big neuronal data framework" (BNDF) to overcome these complexities. BNDF is based on the open-source big data frameworks Hadoop and Spark, providing a flexible and scalable structure. We evaluated BNDF on three different large-scale electrophysiological recording datasets from nonhuman primate brains. Our results showed faster runtimes with scalability due to the distributed nature of BNDF. We compared BNDF results with those of a widely used platform, MATLAB, on equivalent computational resources. Compared with other similar methods, BNDF is more than five times faster in spike sorting, a common neuroscience application.

Measuring and tuning energy efficiency on large scale high performance computing platforms

10.2172/1035312 ◽

2011 ◽

Cited By ~ 1

Author(s):

James H. Laros III

Keyword(s):

Energy Efficiency ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Computing Platforms ◽

Performance Computing

Taxonomic assignment for large-scale metagenomic data on high-performance systems

Journal of Computer Science and Cybernetics ◽

10.15625/1813-9663/33/2/10753 ◽

2017 ◽

Vol 33 (2) ◽

pp. 119-130

Author(s):

Vinh Van Le ◽

Hoai Van Tran ◽

Hieu Ngoc Duong ◽

Giang Xuan Bui ◽

Lang Van Tran

Keyword(s):

High Performance Computing ◽

Assignment Problem ◽

High Performance ◽

Large Scale ◽

Computing System ◽

Metagenomic Data ◽

Taxonomic Assignment ◽

High Performance Computing System ◽

Powerful Approach ◽

Performance Computing

Metagenomics is a powerful approach to studying environmental samples that does not require the isolation and cultivation of individual organisms. One of the essential tasks in a metagenomic project is to identify the origin of reads, referred to as taxonomic assignment. Because each metagenomic project has to analyze large-scale datasets, taxonomic assignment is highly computation-intensive. This study proposes a parallel algorithm for the taxonomic assignment problem, called SeMetaPL, which aims to deal with the computational challenge. The proposed algorithm is evaluated with both simulated and real datasets on a high performance computing system. Experimental results demonstrate that the algorithm is able to achieve good performance and utilize resources of the system efficiently. The software implementing the algorithm and all test datasets can be downloaded at http://it.hcmute.edu.vn/bioinfo/metapro/SeMetaPL.html.

Cloud Computing for Scientific Simulation and High Performance Computing

Principles, Methodologies, and Service-Oriented Approaches for Cloud Computing ◽

10.4018/978-1-4666-2854-0.ch003 ◽

2013 ◽

pp. 51-70

Author(s):

Adrian Jackson ◽

Michèle Weiland

Keyword(s):

Cloud Computing ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Parallel Programs ◽

Small Scale ◽

Cloud Infrastructure ◽

Scientific Simulations ◽

Cloud Infrastructures ◽

Performance Computing

This chapter describes experiences using Cloud infrastructures for scientific computing, both for serial and parallel computing. Amazon’s High Performance Computing (HPC) Cloud computing resources were compared to traditional HPC resources to quantify performance and to assess the complexity and cost of using the Cloud. Furthermore, a shared Cloud infrastructure is compared to standard desktop resources for scientific simulations. Whilst this is only a small scale evaluation of these Cloud offerings, it does allow some conclusions to be drawn, particularly that the Cloud cannot currently match the parallel performance of dedicated HPC machines for large scale parallel programs but can match the serial performance of standard computing resources for serial and small scale parallel programs. Also, the shared Cloud infrastructure cannot match dedicated computing resources for low level benchmarks, although for an actual scientific code, performance is comparable.

Green Computing

Pervasive Cloud Computing Technologies - Advances in Systems Analysis, Software Engineering, and High Performance Computing ◽

10.4018/978-1-4666-4683-4.ch012 ◽

2014 ◽

pp. 248-260

Keyword(s):

Climate Change ◽

Cloud Computing ◽

High Performance Computing ◽

High Performance ◽

Large Scale ◽

Green Computing ◽

Research Topic ◽

The Other ◽

Cloud Infrastructures ◽

Performance Computing

Green computing is a contemporary research topic to address climate and energy challenges. In this chapter, the authors envision the duality of green computing with technological trends in other fields of computing such as High Performance Computing (HPC) and cloud computing on one hand and economy and business on the other hand. For instance, in order to provide electricity for large-scale cloud infrastructures and to reach exascale computing, we need huge amounts of energy. Thus, green computing is a challenge for the future of cloud computing and HPC. Alternatively, clouds and HPC provide solutions for green computing and climate change. In this chapter, the authors discuss this proposition by looking at the technology in detail.

Architecture for the Integration of High Performance Computing Applications in PLM

Volume 2: 27th Computers and Information in Engineering Conference, Parts A and B ◽

10.1115/detc2007-35185 ◽

2007 ◽

Author(s):

Reiner Anderl ◽

Orkun Yaman

Keyword(s):

Data Management ◽

High Performance Computing ◽

High Performance ◽

State Of The Art ◽

Reference Information ◽

Simulation Domain ◽

Architectural Framework ◽

Industrial Context ◽

Performance Computing ◽

Integrate Data

High Performance Computing (HPC) has become ubiquitous for simulations in the industrial context. To identify the requirements for integration of HPC-relevant data and processes, a survey has been conducted among German car manufacturers and service and component suppliers. This contribution presents the results of the evaluation and suggests an architecture concept to integrate data and workflows related to CAE and HPC facilities in PLM. It describes the state of the art of HPC applications within the simulation domain. Intensive efforts are currently invested in CAE data management. However, an approach to systematic data management of HPC does not exist. This study states the importance of an integrating approach for data management of HPC applications and develops an architectural framework to implement HPC data management into the existing PLM landscape. Requirements on key functionalities and interfaces are defined, and a framework for a reference information model is conceptualized.