Knowledge discovery in large model datasets in the marine environment: the THREDDS Data Server example

2012 ◽  
Vol 3 (1) ◽  
pp. 41 ◽  
Author(s):  
A. Bergamasco ◽  
A. Benetazzo ◽  
S. Carniel ◽  
F.M. Falcieri ◽  
T. Minuzzo ◽  
...  

In order to monitor, describe and understand the marine environment, many research institutions are involved in the acquisition and distribution of ocean data, both from observations and models. Scientists at these institutions spend too much time looking for, accessing, and reformatting data: they need better tools and procedures to make their science more efficient. The U.S. Integrated Ocean Observing System (US-IOOS) is working on making large amounts of distributed data usable in an easy and efficient way. It is essentially a network of scientists, technicians and technologies designed to acquire, collect and disseminate observational and modelled data from investigations of coastal and oceanic marine regions to researchers, stakeholders and policy makers. To be successful, this effort requires standard data protocols, web services and standards-based tools. Starting from the US-IOOS approach, which is being adopted throughout much of the oceanographic and meteorological sectors, we describe here the CNR-ISMAR Venice experience in setting up a national Italian IOOS framework using the THREDDS (THematic Real-time Environmental Distributed Data Services) Data Server (TDS), a middleware designed to fill the gap between data providers and data users. The TDS provides services that allow data users to find the data sets pertaining to their scientific needs and to access, visualize and use them easily, without downloading files to the local workspace. To achieve this, data providers must make their data available in a standard form that the TDS understands, with sufficient metadata to allow the data to be read and searched in a standard way. The core idea is to use a Common Data Model (CDM), a unified conceptual model that describes the different datatypes within each dataset. More specifically, Unidata (www.unidata.ucar.edu) has developed CDM specifications for many of the different kinds of data used by the scientific community, such as grids, profiles, time series and swath data. These datatypes are aligned with the NetCDF Climate and Forecast (CF) Metadata Conventions and with the Climate Science Modelling Language (CSML); CF-compliant NetCDF files and GRIB files can be read directly with no modification, while non-compliant files can be modified to meet the appropriate metadata requirements. Once datasets are standardized in the CDM, the TDS makes them available through a series of web services such as OPeNDAP or the Open Geospatial Consortium Web Coverage Service (WCS), allowing data users to easily obtain small subsets of large datasets and to quickly visualize their content with tools such as GODIVA2 or the Integrated Data Viewer (IDV). In addition, an ISO metadata service available through the TDS can be harvested by catalogue broker services (e.g. GI-cat) to enable distributed search across federated data servers. Examples of TDS datasets can be accessed at the CNR-ISMAR Venice site: http://tds.ve.ismar.cnr.it:8080/thredds/catalog.html.
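As a minimal sketch of the data-user side of this workflow, the following Python snippet opens a remote TDS dataset over OPeNDAP and reads only a small subset across the network (assuming the netCDF4 library is built with OPeNDAP support, which is the common case). The URL and variable name are hypothetical placeholders, not an actual CNR-ISMAR endpoint.

```python
# Minimal sketch: subsetting a remote THREDDS dataset over OPeNDAP.
# The URL and variable name are hypothetical placeholders; substitute a
# real OPeNDAP endpoint taken from a TDS catalog page.
from netCDF4 import Dataset

OPENDAP_URL = "http://tds.example.org/thredds/dodsC/model/adriatic_temp.nc"  # placeholder

ds = Dataset(OPENDAP_URL)                       # opens lazily; no file download
sst = ds.variables["sea_water_temperature"]     # hypothetical CF-named variable
print(sst.dimensions, sst.shape)

# Only this small slice is transferred over the network:
subset = sst[0, 0:10, 0:10]                     # first time step, 10x10 window
print(subset.mean())
ds.close()
```

This is exactly the access pattern the abstract describes: the user never downloads the full file, and the CF metadata makes the variable discoverable by standard names.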

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yance Feng ◽  
Lei M. Li

Abstract
Background: Normalization of RNA-seq data aims at identifying biological expression differentiation between samples by removing the effects of unwanted confounding factors. Explicitly or implicitly, the justification of normalization requires a set of housekeeping genes. However, the existence of housekeeping genes common to a very large collection of samples, especially under a wide range of conditions, is questionable.
Results: We propose to carry out pairwise normalization with respect to multiple references, selected from representative samples. The pairwise intermediates are then integrated based on a linear model that adjusts for the reference effects. Motivated by the notion of housekeeping genes and their statistical counterparts, we adopt robust least trimmed squares regression in the pairwise normalization. The proposed method (MUREN) is compared with other existing tools on some standard data sets. Our assessment of normalization quality emphasizes preserving possible asymmetric differentiation, whose biological significance is exemplified by single-cell cell-cycle data. MUREN is implemented as an R package. The code, under the GPL-3 license, is available on GitHub (github.com/hippo-yf/MUREN) and on conda (anaconda.org/hippo-yf/r-muren).
Conclusions: MUREN performs RNA-seq normalization using a two-step statistical regression induced from a general principle. We propose that the densities of pairwise differentiations be used to evaluate the goodness of normalization. MUREN adjusts the mode of differentiation toward zero while preserving the skewness due to biological asymmetric differentiation. Moreover, by robustly integrating pre-normalized counts with respect to multiple references, MUREN is immune to individual outlier samples.
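The multi-reference pairwise idea can be pictured with a short numpy sketch: for each reference sample, a robust location estimate of the per-gene log-ratios gives a scaling factor, and the factors are then combined across references. This is a conceptual sketch only; it substitutes a trimmed mean of log-ratios for MUREN's actual least trimmed squares regression, and all names below are illustrative.

```python
# Conceptual sketch of pairwise normalization against multiple references.
# NOTE: a trimmed mean of log-ratios stands in for MUREN's robust least
# trimmed squares regression; this is not the package's algorithm.
import numpy as np
from scipy.stats import trim_mean

def pairwise_scale_factors(counts, ref_idx, trim=0.2):
    """counts: genes x samples matrix of raw counts."""
    logc = np.log2(counts + 1.0)
    factors = np.zeros((len(ref_idx), counts.shape[1]))
    for i, r in enumerate(ref_idx):
        # per-gene log-ratio of each sample against reference r
        ratios = logc - logc[:, [r]]
        # robust per-sample location estimate (trimmed mean over genes)
        factors[i] = trim_mean(ratios, proportiontocut=trim, axis=0)
    # integrate across references: the median downweights any single
    # outlier reference sample
    return np.median(factors, axis=0)

rng = np.random.default_rng(0)
counts = rng.poisson(50, size=(1000, 6)).astype(float)
counts[:, 3] *= 2.0                           # simulate a library-size effect
scale = pairwise_scale_factors(counts, ref_idx=[0, 1, 2])
normalized = np.log2(counts + 1.0) - scale    # shift each sample's log counts
```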


Sensors ◽  
2020 ◽  
Vol 20 (9) ◽  
pp. 2737
Author(s):  
Leandro Ordonez-Ante ◽  
Gregory Van Seghbroeck ◽  
Tim Wauters ◽  
Bruno Volckaert ◽  
Filip De Turck

Citizen engagement is one of the key factors for smart city initiatives to remain sustainable over time. This in turn entails providing citizens and other relevant stakeholders with up-to-date data and tools that enable them to derive insights that add value to their day-to-day life. The massive volume of data constantly produced in these smart city environments makes satisfying this requirement particularly challenging. This paper introduces Explora, a generic framework for serving the interactive low-latency requests typical of visual exploratory applications on spatiotemporal data. Explora leverages stream processing to derive, at ingestion time, synopsis data structures that concisely capture the spatial and temporal trends and dynamics of the sensed variables, and that serve as compacted data sets for providing fast (approximate) answers to visual queries on smart city data. The experimental evaluation, conducted on proof-of-concept implementations of Explora based on traditional database and distributed data processing setups, shows a decrease of up to two orders of magnitude in query latency compared with queries running on the raw base data, at the cost of less than 10% in query accuracy and a 30% data footprint. The implementation of the framework on real smart city data, along with the obtained experimental results, demonstrates the feasibility of the proposed approach.
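To make the synopsis idea concrete, here is a minimal Python sketch of an ingestion-time synopsis: per (grid cell, time bucket) running count/sum aggregates that answer approximate average queries without scanning the raw readings. The grid resolution, bucket width and function names are illustrative assumptions, not Explora's actual design.

```python
# Minimal sketch of an ingestion-time synopsis for spatiotemporal data:
# per (grid cell, time bucket) count/sum aggregates answering approximate
# average queries. Cell size and bucket width are illustrative choices.
from collections import defaultdict

CELL_DEG = 0.01          # grid cell size in degrees (illustrative)
BUCKET_S = 3600          # 1-hour time buckets (illustrative)

synopsis = defaultdict(lambda: [0, 0.0])   # key -> [count, sum]

def ingest(lat, lon, ts, value):
    """Update the synopsis as each sensor reading streams in."""
    key = (int(lat / CELL_DEG), int(lon / CELL_DEG), int(ts // BUCKET_S))
    agg = synopsis[key]
    agg[0] += 1
    agg[1] += value

def approx_avg(cell_keys, t0, t1):
    """Approximate average over a set of cells and a time range."""
    n, s = 0, 0.0
    for (cx, cy, b), (cnt, total) in synopsis.items():
        if (cx, cy) in cell_keys and t0 // BUCKET_S <= b <= t1 // BUCKET_S:
            n += cnt
            s += total
    return s / n if n else None
```

The accuracy/footprint trade-off reported above comes from exactly this kind of design choice: coarser cells and buckets shrink the synopsis but blur the answers.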


Author(s):  
Dr. Manish L Jivtode

Web services are applications that allow communication between devices over the internet independently of the underlying technology. The devices use standardized eXtensible Markup Language (XML) for information exchange. A client invokes a web service by sending an XML message and then receives an XML response message. A number of communication protocols for web services use the XML format, such as Web Services Flow Language (WSFL) and the Blocks Extensible Exchange Protocol (BEEP). Simple Object Access Protocol (SOAP) and Representational State Transfer (REST) are widely used options for accessing web services. The two are not directly comparable: SOAP is a communications protocol, while REST is a set of architectural principles for data transmission. In this paper, data sizes of 1KB, 2KB, 4KB, 8KB and 16KB were tested for both audio and video, and results were obtained for the CRUD methods. The encryption and decryption timings, in milliseconds/seconds, were recorded by programming extensibility points of a WCF REST web service in the Azure cloud.
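In the spirit of the measurements described above, the following Python sketch times CRUD calls against a REST service. The endpoint, resource id and payload are hypothetical placeholders; the paper's actual tests were run against a WCF REST service hosted in Azure.

```python
# Sketch: timing CRUD round trips against a REST service.
# The base URL, resource id and payload are hypothetical placeholders.
import time
import requests

BASE = "https://example.azurewebsites.net/api/media"   # placeholder URL

def timed(method, url, **kwargs):
    t0 = time.perf_counter()
    resp = requests.request(method, url, **kwargs)
    ms = (time.perf_counter() - t0) * 1000.0
    print(f"{method:6s} {url} -> {resp.status_code} in {ms:.1f} ms")
    return resp

payload = {"name": "clip.mp4", "data": "...payload bytes..."}  # placeholder
timed("POST", BASE, json=payload)        # Create
item_url = f"{BASE}/1"                   # hypothetical resource id
timed("GET", item_url)                   # Read
timed("PUT", item_url, json=payload)     # Update
timed("DELETE", item_url)                # Delete
```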


2021 ◽  
Author(s):  
Jan Michalek ◽  
Kuvvet Atakan ◽  
Christian Rønnevik ◽  
Helga Indrøy ◽  
Lars Ottemøller ◽  
...  

The European Plate Observing System (EPOS) is a European project for building a pan-European infrastructure for accessing solid Earth science data, now governed by EPOS ERIC (European Research Infrastructure Consortium). The EPOS-Norway project (EPOS-N; RCN Infrastructure Programme, project no. 245763) is a Norwegian project funded by the Research Council of Norway. The aim of the Norwegian EPOS e-infrastructure is to integrate data from the seismological and geodetic networks, as well as data from the geological and geophysical data repositories. Among the six EPOS-N project partners, four institutions provide data: the University of Bergen (UIB), the Norwegian Mapping Authority (NMA), the Geological Survey of Norway (NGU) and NORSAR.

In this contribution, we present the EPOS-Norway Portal as an online, open-access, interactive tool allowing visual analysis of multidimensional data. It supports maps and 2D plots with linked visualizations. Access is currently provided to more than 300 datasets (18 web services, 288 map layers and 14 static datasets) from four subdomains of Earth science in Norway, and new datasets are planned for future integration. The EPOS-N Portal can access remote datasets via web services such as FDSNWS for seismological data and OGC services (e.g. WMS) for geological and geophysical data. Standalone datasets are available through preloaded data files. Users can also add another WMS server or upload their own dataset for visualization and comparison with other datasets. The portal provides a unique way, the first of its kind in Norway, to explore various geoscientific datasets in one common interface. A key aspect is the quick, simultaneous visual inspection of data from various disciplines and the testing of scientific or geohazard-related hypotheses. One example is the spatio-temporal correlation of earthquakes (from 1980 to the present) with critical infrastructure (e.g. pipelines), geological structures, submarine landslides or unstable slopes.

The EPOS-N Portal is implemented by adapting Enlighten-web, a server-client program developed by NORCE. Enlighten-web facilitates interactive visual analysis of large multidimensional data sets and supports interactive mapping of millions of points. The Enlighten-web client runs inside a web browser. An important element of the Enlighten-web functionality is brushing and linking, which is useful for exploring complex data sets to discover correlations and interesting properties hidden in the data. The views are linked to each other, so that highlighting a subset in one view automatically leads to the corresponding subsets being highlighted in all other linked views.
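As an illustration of the kind of web-service request the portal issues, the following Python sketch performs an FDSN event query using standard fdsnws-event 1.x parameters. The base URL is a placeholder, not the node that actually serves the EPOS-N data, and the bounding box is only a rough approximation of Norway.

```python
# Sketch: an FDSN event-service query (fdsnws-event 1.x standard parameters).
# The base URL is a placeholder; substitute the FDSN node serving the data.
import requests

BASE = "https://eida.example.org"        # placeholder FDSN node

params = {
    "starttime": "1980-01-01",
    "endtime": "2021-01-01",
    "minlatitude": 57.0, "maxlatitude": 72.0,    # rough Norway bounding box
    "minlongitude": 4.0, "maxlongitude": 32.0,
    "minmagnitude": 3.0,
    "format": "text",
}
resp = requests.get(f"{BASE}/fdsnws/event/1/query", params=params)
resp.raise_for_status()
for line in resp.text.splitlines()[1:6]:     # skip header, show a few rows
    print(line)
```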


2021 ◽  
Author(s):  
Benjamin Moreno-Torres ◽  
Christoph Völker ◽  
Sabine Kruschwitz

Non-destructive testing (NDT) data in civil engineering is regularly used for scientific analysis. However, there is as yet no uniform representation of the data, so analyses across distributed data sets from different test objects are in most cases too difficult to carry out.

To overcome this, we present an approach for integrated management of distributed data sets based on Semantic Web technologies. The cornerstone of this approach is an ontology, a semantic knowledge representation of our domain. This NDT-CE ontology is then populated with the data sources. Using the properties and relationships between concepts that the ontology contains, we make these data sets meaningful to machines as well. Furthermore, the ontology can be used as a central interface for database access. Non-domain data sources can be integrated by linking them with the NDT ontology, making them directly available for generic use in terms of digitization. Based on extensive literature research, we outline the resulting possibilities for NDT in civil engineering, such as computer-aided sorting and analysis of measurement data, and the recognition and explanation of correlations.

A common knowledge representation and common data access allow the scientific exploitation of existing data sources with data-based methods (such as image recognition, measurement uncertainty calculations, factor analysis or material characterization) and simplify bidirectional knowledge and data transfer between engineers and NDT specialists.
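The ontology-as-interface idea can be sketched in a few lines with rdflib: populate a graph with a measurement and query it through SPARQL. The namespace, class and property names below are hypothetical illustrations, not the actual NDT-CE ontology vocabulary.

```python
# Minimal sketch of ontology-backed data access using rdflib.
# Namespace, class and property names are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

NDT = Namespace("http://example.org/ndt-ce#")    # placeholder namespace

g = Graph()
g.bind("ndt", NDT)

# Populate the graph with one measurement from a (hypothetical) data source.
g.add((NDT.m1, RDF.type, NDT.UltrasonicMeasurement))
g.add((NDT.m1, NDT.onTestObject, NDT.bridgeDeck42))
g.add((NDT.m1, NDT.pulseVelocity, Literal(4120.0, datatype=XSD.double)))

# A generic SPARQL query then serves as the central database interface.
q = """
SELECT ?m ?v WHERE {
  ?m a ndt:UltrasonicMeasurement ;
     ndt:pulseVelocity ?v .
}
"""
for m, v in g.query(q, initNs={"ndt": NDT}):
    print(m, float(v))
```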


Author(s):  
Ondrej Habala ◽  
Martin Šeleng ◽  
Viet Tran ◽  
Branislav Šimo ◽  
Ladislav Hluchý

The project Advanced Data Mining and Integration Research for Europe (ADMIRE) is designing new methods and tools for convenient mining and integration of large, distributed data sets. One prospective application domain for such methods and tools is environmental science, which often uses data sets from different vendors and where data mining is becoming increasingly popular as more computing power becomes available. The authors present a set of experimental environmental scenarios and the application of ADMIRE technology in these scenarios. The scenarios aim to predict meteorological and hydrological phenomena that currently cannot be, or are not, predicted, by applying data mining to distributed data sets from several providers in Slovakia. The scenarios were designed by environmental experts, and apart from serving as testing grounds for the ADMIRE technology, their results are of particular interest to the experts who designed them.


Author(s):  
Amir Basirat ◽  
Asad I. Khan ◽  
Heinz W. Schmidt

One of the main challenges for large-scale computer clouds dealing with massive real-time data is coping with the rate at which unprocessed data accumulates. Transforming big data into valuable information requires a fundamental rethink of the way future data management models will need to be developed on the Internet. Unlike existing relational schemes, pattern-matching approaches can analyze data in ways similar to how our brain links information. Such interactions, when implemented in voluminous data clouds, can assist in finding overarching relations in complex and highly distributed data sets. In this chapter, a different perspective on data recognition is considered. Rather than looking at conventional approaches such as statistical computations and deterministic learning schemes, this chapter focuses on a distributed processing approach for scalable data recognition and processing.

