Geoscience Cyberinfrastructure in the Cloud: Data-Proximate Computing to Address Big Data and Open Science Challenges

Author(s):  
Mohan Ramamurthy
Keyword(s):  
Big Data ◽  
Author(s):  
J. Boehm ◽  
K. Liu ◽  
C. Alis

In the geospatial domain we have now reached the point where the data volumes we handle have clearly grown beyond the capacity of most desktop computers. This is particularly true in the area of point cloud processing. It is therefore natural to explore established big data frameworks for big geospatial data. The very first hurdle is the import of geospatial data into big data frameworks, commonly referred to as data ingestion. Geospatial data is typically encoded in specialised binary file formats, which are not natively supported by existing big data frameworks. Instead, such file formats are supported by software libraries that are restricted to single-CPU execution. We present an approach that allows the use of existing point cloud file format libraries on the Apache Spark big data framework. We demonstrate the ingestion of large volumes of point cloud data into a compute cluster. The approach uses a map function to distribute the data ingestion across the nodes of a cluster. We test the capabilities of the proposed method by loading billions of points into a commodity hardware compute cluster and discuss the implications for scalability and performance. The performance is benchmarked against an existing native Apache Spark data import implementation.
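The pattern the abstract describes can be sketched as follows: a sequential, single-CPU format reader is wrapped in a function that Spark would apply per file on each executor. This is a minimal illustration, not the authors' implementation; the fixed-size binary point layout and the reader function are assumptions, and Python's built-in iteration stands in for Spark's distributed map (in PySpark the last two lines would become `sc.parallelize(blobs).flatMap(read_points)`).

```python
import struct

# Assumed toy format: each point is three little-endian doubles (x, y, z).
POINT = struct.Struct("<ddd")

def read_points(blob):
    """Decode one binary point-cloud 'file' into (x, y, z) tuples.
    Stands in for an existing single-CPU format library invoked per file."""
    return [POINT.unpack_from(blob, i * POINT.size)
            for i in range(len(blob) // POINT.size)]

# Two in-memory 'files'; in a cluster these would be paths handed to executors,
# each of which runs the sequential reader on its own subset of files.
blobs = [POINT.pack(1.0, 2.0, 3.0) + POINT.pack(4.0, 5.0, 6.0),
         POINT.pack(7.0, 8.0, 9.0)]
points = [p for blob in blobs for p in read_points(blob)]
```

The key design point is that parallelism comes from mapping over files, so the per-file reader library never needs to be parallel itself.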


Author(s):  
Robert Vrbić

Cloud computing provides a powerful, scalable and flexible infrastructure into which previously established techniques and methods of Data Mining can be integrated. The result of such integration should be a robust, high-capacity platform able to cope with the ever-increasing production of data, that is, one that creates the conditions for efficiently mining massive amounts of data from various data warehouses with the aim of generating useful information and new knowledge. This paper discusses such a technology: big data mining in the cloud, known as Cloud Data Mining (CDM).


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Lin Yang

In recent years, people have paid more and more attention to cloud data. However, because users do not have absolute control over data stored on a cloud server, the cloud storage server must provide evidence that the data are stored intact if users are to retain control over their data. Cloud platforms grant users full management rights: users can independently install operating systems and applications and can choose self-service platforms and various remote management tools to manage and control their hosts according to personal habits. This paper introduces a cloud data integrity verification algorithm for sustainable accounting informatization, studies the advantages and disadvantages of existing data integrity proof mechanisms, and examines the new requirements of the cloud storage environment. An LBT-based big data integrity proof mechanism is proposed, which introduces a multibranch path tree as the data structure used in the integrity proof mechanism, along with a rank-annotated multibranch path structure and a corresponding integrity detection algorithm. The proposed verification algorithm and two other integrity verification algorithms are compared in simulation experiments. The results show that the proposed scheme is about 10% faster than scheme 1 and about 5% faster than scheme 2 in computing time for 500 data blocks. As the number of operated data blocks grows, the execution time of schemes 1 and 2 increases accordingly, while the execution time of the proposed scheme remains unchanged; its computational cost is likewise lower than that of schemes 1 and 2. The scheme can not only verify the integrity of cloud storage data but also offers clear verification advantages, which gives it practical significance for big data integrity verification.
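The general idea of a rank-annotated multibranch hash tree can be sketched in a few lines. This is a generic illustration of the data structure family, not the paper's LBT construction: the fanout, the SHA-256 hash, and the way ranks are folded into parent digests are all assumptions. Each internal node hashes up to `fanout` children together with its rank (the number of leaf blocks beneath it), so both a block's content and its position are covered by the root digest.

```python
import hashlib

def h(*parts):
    """SHA-256 over the concatenation of byte strings."""
    m = hashlib.sha256()
    for p in parts:
        m.update(p)
    return m.digest()

class Node:
    def __init__(self, digest, rank, children=()):
        self.digest, self.rank, self.children = digest, rank, children

def build_tree(blocks, fanout=4):
    """Multibranch hash tree: leaves hash data blocks; each internal node
    hashes up to `fanout` child digests plus its rank (leaf count)."""
    level = [Node(h(b), 1) for b in blocks]
    while len(level) > 1:
        groups = [level[i:i + fanout] for i in range(0, len(level), fanout)]
        level = [Node(h(*(c.digest for c in g),
                        str(sum(c.rank for c in g)).encode()),
                      sum(c.rank for c in g), g)
                 for g in groups]
    return level[0]

root = build_tree([b"block-a"] * 16)
```

A wider fanout shortens the tree, which is the usual motivation for multibranch trees over binary Merkle trees: shorter verification paths per challenged block.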


2020 ◽  
Vol 39 (4) ◽  
pp. 5027-5036
Author(s):  
You Lu ◽  
Qiming Fu ◽  
Xuefeng Xi ◽  
Zhenping Chen

Data outsourcing has gradually become a mainstream solution, but once data is outsourced, data owners no longer control the underlying hardware, so the integrity of the data may, objectively, be compromised. Many current studies achieve low-network-overhead verification of cloud data sets by designing algorithmic structures (e.g., hashing, Merkle verification trees); however, cloud service providers may refuse to acknowledge the incompleteness of cloud data, whether to avoid liability or for business reasons. There is therefore a need to build a secure, reliable, tamper-proof, and unforgeable verification system for accountability. Blockchain is a chain-like data structure constructed from data signatures, timestamps, hash functions, and proof-of-work mechanisms, and using blockchain technology to build an integrity verification system can achieve fault accountability. This paper uses the Hadoop framework to implement data collection and storage in an HBase system based on a big data architecture. In summary, building on research into blockchain-based cloud data collection and storage technology and on existing big data storage middleware, a high-throughput, highly concurrent, and highly available data collection and processing system has been realized.
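The chain-like structure the abstract describes can be illustrated with a minimal hash-chained verification log. This is a sketch of the general blockchain idea only, not the paper's system: the record fields, the JSON serialization, and the fixed timestamp (used here for reproducibility) are assumptions, and proof-of-work and signatures are omitted. Each record commits to a data digest and to the previous record's hash, so retroactive tampering with any record invalidates every later link.

```python
import hashlib
import json

def record(data, prev_hash):
    """Append-only verification record chained on the previous record's hash."""
    body = {"data_digest": hashlib.sha256(data).hexdigest(),
            "prev": prev_hash,
            "ts": 0}  # a real system would use a trusted timestamp here
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body

def verify(chain):
    """Recompute every link; any tampering breaks the chain."""
    prev = "0" * 64  # genesis sentinel
    for r in chain:
        body = {k: r[k] for k in ("data_digest", "prev", "ts")}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if r["prev"] != prev or r["hash"] != expected:
            return False
        prev = r["hash"]
    return True

chain = [record(b"block0", "0" * 64)]
chain.append(record(b"block1", chain[0]["hash"]))
```

Because verification only recomputes hashes, any party holding the chain can audit it, which is what makes the provider's denial of incompleteness detectable.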


2019 ◽  
Vol 11 (3) ◽  
pp. 255-273 ◽  
Author(s):  
Vicki Xafis ◽  
Markus K. Labude

Abstract
There is a growing expectation, or even requirement, for researchers to deposit a variety of research data in data repositories as a condition of funding or publication. This expectation recognizes the enormous benefits of data collected and created for research purposes being made available for secondary uses, as open science gains increasing support. This is particularly so in the context of big data, especially where health data is involved. There are, however, also challenges relating to the collection, storage, and re-use of research data. This paper gives a brief overview of the landscape of data sharing via data repositories and discusses some of the key ethical issues raised by the sharing of health-related research data, including expectations of privacy and confidentiality, the transparency of repository governance structures, access restrictions, as well as data ownership and the fair attribution of credit. To consider these issues and the values that are pertinent, the paper applies the deliberative balancing approach articulated in the Ethics Framework for Big Data in Health and Research (Xafis et al. 2019) to the domain of Openness in Big Data and Data Repositories. Please refer to that article for more information on how this framework is to be used, including a full explanation of the key values involved and the balancing approach used in the case study at the end.


2020 ◽  
Vol 54 (4) ◽  
pp. 409-435
Author(s):  
Paolo Manghi ◽  
Claudio Atzori ◽  
Michele De Bonis ◽  
Alessia Bardi

Purpose
Several online services offer functionalities to access information from "big research graphs" (e.g. Google Scholar, OpenAIRE, Microsoft Academic Graph), which correlate scholarly/scientific communication entities such as publications, authors, datasets, organizations, projects, funders, etc. Depending on the target users, access can vary from searching and browsing content to consuming statistics for monitoring and the provision of feedback. Such graphs are populated over time as aggregations of multiple sources and therefore suffer from major entity-duplication problems. Although deduplication of graphs is a known and pressing problem, existing solutions are dedicated to specific scenarios, operate on flat collections, address local topology-driven challenges, and cannot therefore be re-used in other contexts.
Design/methodology/approach
This work presents GDup, an integrated, scalable, general-purpose system that can be customized to address deduplication over arbitrarily large information graphs. The paper presents its high-level architecture, its implementation as a service used within the OpenAIRE infrastructure system, and reports figures from real-case experiments.
Findings
GDup provides the functionalities required to deliver a fully fledged entity deduplication workflow over a generic input graph. The system offers out-of-the-box Ground Truth management, acquisition of feedback from data curators, and algorithms for identifying and merging duplicates to obtain a disambiguated output graph.
Originality/value
To our knowledge, GDup is the only system in the literature that offers an integrated and general-purpose solution for the deduplication of graphs while targeting big data scalability issues. GDup is today one of the key modules of the OpenAIRE infrastructure production system, which monitors Open Science trends on behalf of the European Commission, national funders and institutions.
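The generic shape of such an entity-deduplication workflow (blocking, pairwise matching, merging into representatives) can be sketched as follows. This is an illustration of the workflow pattern only, not GDup's actual algorithms: the blocking key (title prefix), the string-similarity matcher, the threshold, and the record fields are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def dedup(records, threshold=0.9):
    """Blocking + pairwise matching + greedy merge into representatives."""
    # 1. Blocking: only compare records that share a cheap key,
    #    so the quadratic matching step stays tractable at scale.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["title"][:3].lower()].append(r)
    # 2./3. Match and merge within each block.
    merged = []
    for group in blocks.values():
        reps = []
        for r in group:
            for rep in reps:
                score = SequenceMatcher(None, r["title"].lower(),
                                        rep["title"].lower()).ratio()
                if score >= threshold:
                    rep["ids"].update(r["ids"])  # absorb the duplicate
                    break
            else:
                reps.append({"title": r["title"], "ids": set(r["ids"])})
        merged.extend(reps)
    return merged

out = dedup([{"title": "Open Science Graph", "ids": {"a"}},
             {"title": "Open Science Graph.", "ids": {"b"}},
             {"title": "Another Paper", "ids": {"c"}}])
```

In a system like the one described, each stage would additionally be distributed, consult Ground Truth assertions, and record curator feedback; the sketch only shows the control flow they share.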


Author(s):  
Kiritkumar J. Modi ◽  
Prachi Devangbhai Shah ◽  
Zalak Prajapati

The rapid growth of digitization in the present era has led to an exponential increase in information, which demands a Big Data paradigm. Big Data denotes complex, unstructured, massive, heterogeneous data. Big Data is essential to success in many applications; however, it faces a major setback in the form of security and privacy issues. These issues arise because Big Data is scattered across distributed systems by various users. The security of Big Data encompasses all the solutions and measures that protect the data from threats and malicious activities. Privacy concerns the processing of personal data, while security means protecting information assets from unauthorized access. Cloud computing and cloud data storage have been both predecessors of and enablers for the emergence of Big Data computing. This article highlights open issues in traditional techniques for Big Data privacy and security. Moreover, it provides a comprehensive overview of possible security techniques and future directions for addressing Big Data privacy and security issues.


2012 ◽  
Vol 12 (3) ◽  
pp. 173-181 ◽  
Author(s):  
Harry Enke ◽  
Adrian Partl ◽  
Alexander Reinefeld ◽  
Florian Schintke