An agenda for research in large-scale distributed data repositories

Author(s):  
M. Satyanarayanan
Author(s):  
Muhammad Fadhil Ginting ◽  
Kyohei Otsu ◽  
Jeffrey Edlund ◽  
Jay Gao ◽  
Ali-akbar Agha-mohammadi

2018 ◽  
Vol 228 ◽  
pp. 01011
Author(s):  
Haifeng Zhong ◽  
Jianying Xiong

The wide-area (WAN) Internet storage system based on a Distributed Hash Table (DHT) uses fully distributed data and metadata management to build an extensible, efficient mass storage system for Internet-based applications. However, such systems operate in highly dynamic environments, where frequent node joins and departures incur large communication costs. This paper therefore proposes a new hierarchical metadata routing management mechanism based on DHT, which makes full use of the node stabilization point to reduce the maintenance overhead of the overlay. Analysis shows that the algorithm can effectively improve efficiency and enhance stability.
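As a rough illustration of why node churn is costly in a DHT, the sketch below implements plain consistent hashing, where each key is owned by its clockwise successor on a hash ring. It is not the hierarchical mechanism the paper proposes; the `HashRing` class, the node names, and the `meta-N` keys are hypothetical, and hash collisions between nodes are ignored.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a point on a 2**32 ring (SHA-1, truncated to 4 bytes)."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:4], "big")


class HashRing:
    """Toy consistent-hash ring: a key is owned by its clockwise successor node."""

    def __init__(self, nodes=()):
        self._points = []  # sorted ring positions of live nodes
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.join(node)

    def join(self, node: str) -> None:
        point = ring_hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def leave(self, node: str) -> None:
        point = ring_hash(node)
        self._points.remove(point)
        del self._owner[point]

    def lookup(self, key: str) -> str:
        # First node position at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical usage: see which metadata keys change owner when a node joins.
ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"meta-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.join("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

Only keys that fall in the new node's arc change owner, but every join or departure still triggers key transfers and routing updates, which is the maintenance overhead the abstract aims to reduce.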


2018 ◽  
Vol 25 (4) ◽  
pp. 1398-1411 ◽  
Author(s):  
Vishal Patel

The electronic sharing of medical imaging data is an important element of modern healthcare systems, but current infrastructure for cross-site image transfer depends on trust in third-party intermediaries. In this work, we examine the blockchain concept, which enables parties to establish consensus without relying on a central authority. We develop a framework for cross-domain image sharing that uses a blockchain as a distributed data store to establish a ledger of radiological studies and patient-defined access permissions. The blockchain framework is shown to eliminate third-party access to protected health information, satisfy many criteria of an interoperable health system, and readily generalize to domains beyond medical imaging. Relative drawbacks of the framework include the complexity of the privacy and security models and an unclear regulatory environment. Ultimately, the large-scale feasibility of such an approach remains to be demonstrated and will depend on a number of factors which we discuss in detail.
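To make the idea of a ledger of studies and patient-defined permissions concrete, here is a minimal single-process sketch of a hash-chained, append-only log. It is an assumption-laden illustration rather than the paper's framework: there is no consensus protocol, networking, or cryptographic signing, and the `Ledger` class, record fields, and identifiers such as `study-001` are hypothetical.

```python
import hashlib
import json


def block_hash(body: dict) -> str:
    """Deterministic SHA-256 digest of a block body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class Ledger:
    """Append-only chain: each block stores the hash of its predecessor."""

    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, record: dict) -> dict:
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        body = {"index": len(self.blocks), "prev": prev, "record": record}
        block = dict(body, hash=block_hash(body))
        self.blocks.append(block)
        return block

    def is_valid(self) -> bool:
        """Recompute every hash and check that the chain links are intact."""
        prev = self.GENESIS
        for block in self.blocks:
            body = {k: block[k] for k in ("index", "prev", "record")}
            if block["prev"] != prev or block["hash"] != block_hash(body):
                return False
            prev = block["hash"]
        return True

    def may_access(self, study_id: str, requester: str) -> bool:
        """Replay grant/revoke records; the latest matching record wins."""
        allowed = False
        for block in self.blocks:
            record = block["record"]
            if record.get("study") == study_id and record.get("grantee") == requester:
                allowed = record["type"] == "grant"
        return allowed


# Hypothetical usage: register a study, then grant access to one site.
ledger = Ledger()
ledger.append({"type": "study", "study": "study-001", "patient": "patient-x"})
ledger.append({"type": "grant", "study": "study-001", "grantee": "clinic-b"})
print(ledger.may_access("study-001", "clinic-b"), ledger.is_valid())
```

Because each block commits to the hash of the previous one, altering an earlier permission record invalidates the chain, which is what lets participants audit access rules without a trusted intermediary.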


2021 ◽  
Author(s):  
Edzer Pebesma ◽  
Patrick Griffiths ◽  
Christian Briese ◽  
Alexander Jacob ◽  
Anze Skerlevaj ◽  
...  

The openEO API allows the analysis of large amounts of Earth Observation data using a high-level abstraction of data and processes. Rather than focusing on the management of virtual machines and millions of imagery files, it allows users to create jobs that take a spatio-temporal section of an image collection (such as Sentinel L2A) and treat it as a data cube. Processes iterate or aggregate over pixels, spatial areas, spectral bands, or time series, while working at arbitrary spatial resolution. This pattern, pioneered by Google Earth Engine™ (GEE), lets the user focus on the science rather than on data management.

The openEO H2020 project (2017-2020) has developed the API as well as an ecosystem of software around it, including clients (JavaScript, Python, R, QGIS, browser-based), back-ends that translate API calls into existing image analysis or GIS software or services (for Sentinel Hub, WCPS, Open Data Cube, GRASS GIS, GeoTrellis/GeoPySpark, and GEE), as well as a hub that allows querying and searching openEO providers for their capabilities and datasets. The project demonstrated this software in a number of use cases, where identical processing instructions were sent to different implementations, allowing comparison of returned results.

A follow-up, ESA-funded project, “openEO Platform”, realizes the API and progresses the software ecosystem into operational services and applications that are accessible to everyone, that involve federated deployment (using the clouds managed by EODC, Terrascope, CreoDIAS and EuroDataCube), that will provide payment models (“pay per compute job”) conceived and implemented following the user community's needs, and that will use the EOSC (European Open Science Cloud) marketplace for dissemination and authentication. A wide range of large-scale case studies will demonstrate the ability of the openEO Platform to scale to large data volumes. The case studies to be addressed include on-demand ARD generation for SAR and multi-spectral data; agricultural demonstrators such as crop type and condition monitoring; forestry services such as near-real-time forest damage assessment and canopy cover mapping; environmental hazard monitoring of floods and air pollution; and security applications in the form of vessel detection in the Mediterranean Sea.

While the landscape of cloud-based EO platforms and services has matured and diversified over the past decade, we believe there are strong advantages for scientists and government agencies in adopting the openEO approach. Beyond the absence of vendor/platform lock-in or EULAs, we mention the abilities to (i) run arbitrary user code (e.g. written in R or Python) close to the data, (ii) carry out scientific computations on an entirely open source software stack, (iii) integrate different platforms (e.g., different cloud providers offering different datasets), and (iv) help create and extend this software ecosystem. openEO uses the OpenAPI standard, aligns with modern OGC API standards, and uses the STAC (SpatioTemporal Asset Catalog) to describe image collections and image tiles.


2013 ◽  
pp. 294-321
Author(s):  
Alexandru Costan

To accommodate the needs of large-scale distributed systems, scalable data storage and management strategies are required, allowing applications to efficiently cope with continuously growing, highly distributed data. This chapter addresses the key issues of data handling in grid environments, focusing on storing, accessing, managing and processing data. We start by providing the background for the data storage issue in grid environments. We outline the main challenges addressed by distributed storage systems: high availability, which translates into high resilience and consistency; corruption handling regarding arbitrary faults; fault tolerance; asynchrony; fairness; access control; and transparency. The core part of the chapter presents how existing solutions cope with these demanding requirements. The most important research results are organized along several themes: grid data storage, distributed file systems, data transfer and retrieval, and data management. Important characteristics such as performance, efficient use of resources, fault tolerance, and security are strongly determined by the adopted system architectures and the technologies behind them. For each topic, we briefly present previous work, describe the most recent achievements, highlight their advantages and limitations, and indicate future research trends in distributed data storage and management.


Author(s):  
Mohammad Zubair Khan ◽  
Yasser M. Alginahi

Big Data research is playing a leading role in investigating a wide group of issues emerging in Database, Data Warehousing, and Data Mining research. Analytics research aims to develop complex procedures that run over large-scale data repositories with the objective of extracting useful knowledge hidden in such repositories. One of the most significant application scenarios in which Big Data arise is, without doubt, scientific computing. Here, researchers and analysts produce immense amounts of data every day through experiments (e.g., in disciplines like high-energy physics, space science, bioinformatics, etc.). Nevertheless, extracting useful knowledge for decision-making purposes from these enormous, large-scale data repositories is practically impossible for current analysis tools inspired by Database Management Systems (DBMS).


Author(s):  
Amir Basirat ◽  
Asad I. Khan ◽  
Heinz W. Schmidt

One of the main challenges for large-scale computer clouds dealing with massive real-time data is coping with the rate at which unprocessed data accumulates. Transforming big data into valuable information requires a fundamental rethink of the way in which future data management models will need to be developed on the Internet. Unlike existing relational schemes, pattern-matching approaches can analyze data in ways similar to how the brain links information. Such interactions, when implemented in voluminous data clouds, can assist in finding overarching relations in complex and highly distributed data sets. In this chapter, a different perspective on data recognition is considered. Rather than looking at conventional approaches, such as statistical computations and deterministic learning schemes, this chapter focuses on a distributed processing approach for scalable data recognition and processing.

