DATS: the data tag suite to enable discoverability of datasets

2017
Author(s):
Susanna-Assunta Sansone
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
George Alter
Jeffrey S Grethe
...  

Today's science increasingly requires effective ways to find and access existing datasets that are distributed across a range of repositories. For researchers in the life sciences, discoverability of datasets may soon become as essential as identifying the latest publications via PubMed. Through an international collaborative effort funded by the National Institutes of Health (NIH)'s Big Data to Knowledge (BD2K) initiative, we have designed and implemented the DAta Tag Suite (DATS) model to support the DataMed data discovery index. DataMed's goal is to be for data what PubMed has been for the scientific literature. Akin to the Journal Article Tag Suite (JATS) used in PubMed, the DATS model enables submission of metadata on datasets to DataMed. DATS has a core set of elements, which are generic and applicable to any type of dataset, and an extended set that can accommodate more specialized data types. DATS is a platform-independent model, also available as a Schema.org-annotated serialization to be used beyond DataMed, for example in projects like DataCite.
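To make the idea of a Schema.org-annotated dataset description concrete, the following is a minimal sketch in Python. The field names follow the public schema.org Dataset vocabulary rather than the exact DATS element set, and the identifier, names, and URL are hypothetical placeholders.

```python
import json

# Minimal schema.org "Dataset" record of the kind DATS serializes for
# web discoverability. All values below are illustrative placeholders.
dataset = {
    "@context": "http://schema.org",
    "@type": "Dataset",
    "name": "Example gene expression study",
    "description": "Illustrative dataset record for a discovery index.",
    "identifier": "doi:10.0000/example",  # hypothetical identifier
    "creator": [{"@type": "Person", "name": "A. Researcher"}],
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",
        "contentUrl": "https://example.org/data.csv",  # placeholder URL
    }],
}

# Serialize to JSON-LD, the form search engines can crawl and index.
serialized = json.dumps(dataset, indent=2)
print(serialized)
```

Embedding such a JSON-LD block in a dataset's landing page is what lets general-purpose search engines, and indexes like DataMed, discover it.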

2017
Vol 25 (1)
pp. 13-16
Author(s):
Alejandra N Gonzalez-Beltran
John Campbell
Patrick Dunn
Diana Guijarro
Sanda Ionescu
...

Abstract The DAta Tag Suite (DATS) is a model supporting dataset description, indexing, and discovery. It is available as an annotated serialization with schema.org, a vocabulary used by major search engines, thus making the datasets discoverable on the web. DATS underlies DataMed, the National Institutes of Health Big Data to Knowledge Data Discovery Index prototype, which aims to provide a “PubMed for datasets.” The experience gained while indexing a heterogeneous range of >60 repositories in DataMed helped in evaluating DATS’s entities, attributes, and scope. In this work, 3 additional exemplary and diverse data sources were mapped to DATS by their representatives or experts, offering a deep scan of DATS’s fitness against a new set of existing data. The procedure, including feedback from users and implementers, resulted in DATS implementation guidelines and best practices, and in the identification of a path for evolving and optimizing the model. Finally, the work exposed additional needs when defining datasets for indexing, especially in the context of clinical and observational information.


Author(s):  
Ying Wang
Yiding Liu
Minna Xia

Big data is characterized by multiple sources and heterogeneity. Based on the Hadoop and Spark big data platforms, a hybrid forest fire analysis system is built in this study. This platform combines big data analysis and processing technology and draws on research results from different technical fields, such as forest fire monitoring. In this system, Hadoop's HDFS is used to store all kinds of data, the Spark module is used to provide various big data analysis methods, and visualization tools such as ECharts, ArcGIS, and Unity3D are used to visualize the analysis results. Finally, an experiment on forest fire point detection is designed to corroborate the feasibility and effectiveness of the platform and to provide meaningful guidance for follow-up research and for the establishment of a big data platform for forest fire monitoring and visualized early warning. However, this experiment has two shortcomings: more data types should be selected, and compatibility would be better if the original data were converted to XML format. It is expected that these problems can be solved in follow-up research.
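As an illustration of the kind of fire-point detection rule such a platform's Spark jobs might apply over sensor data in HDFS, here is a minimal plain-Python sketch. The record fields and threshold values are hypothetical, not taken from the study.

```python
# Hypothetical thresholds for flagging a reading as a potential fire point.
FIRE_TEMP_C = 60.0        # surface-temperature threshold (assumption)
FIRE_MAX_HUMIDITY = 30.0  # relative-humidity ceiling in percent (assumption)

def detect_fire_points(records):
    """Return the records whose readings cross the fire thresholds.

    Each record is a dict with 'lat', 'lon', 'temp_c', and 'humidity'.
    In the real platform this filter would run as a distributed Spark
    job over data stored in HDFS rather than over an in-memory list.
    """
    return [
        r for r in records
        if r["temp_c"] >= FIRE_TEMP_C and r["humidity"] <= FIRE_MAX_HUMIDITY
    ]

readings = [
    {"lat": 30.1, "lon": 114.2, "temp_c": 72.5, "humidity": 18.0},  # hot, dry
    {"lat": 30.2, "lon": 114.3, "temp_c": 24.0, "humidity": 55.0},  # normal
]
hits = detect_fire_points(readings)
```

The flagged coordinates are what a visualization layer (e.g., ECharts or ArcGIS) would then plot for early warning.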


2020
Vol 30 (Supplement_5)
Author(s):
I Mircheva
M Mirchev

Abstract Background Ownership of patient information in the context of Big Data is a relatively new problem, apparently not yet fully understood, and there are not enough publications on the subject. Since the topic is interdisciplinary, incorporating legal, ethical, and medical aspects as well as aspects of information and communication technologies, a somewhat more sophisticated analysis of the issue is needed. Aim To determine how the medical academic community perceives the issue of ownership of patient information in the context of Big Data. Methods A literature search for full-text publications indexed in PubMed, Springer, ScienceDirect, and Scopus identified only 27 appropriate articles authored by academicians and corresponding to three focus areas: problem (ownership); area (healthcare); context (Big Data). Three major aspects were studied: the scientific area of the publications, the aspects of ownership discussed, and academicians' perception of ownership in the context of Big Data. Results Publications span the period 2014-2019: 37% were published in health and medical informatics journals, 30% in medicine and public health, and 19% in law and ethics; 78% were authored by American and British academicians and are highly cited. The majority (63%) are in the area of scientific research: clinical studies, access to and use of patient data for medical research, secondary use of medical data, and ethical challenges of Big Data in healthcare. The majority (70%) of the publications discuss ownership in its ethical and legal aspects, and 67% see ownership as a challenge, mostly to medical research, access control, ethics, politics, and business. Conclusions Ownership of medical data is seen first and foremost as a challenge. Addressing this challenge requires the combined efforts of politicians, lawyers, ethicists, and computer and medical professionals, as well as academicians sharing these efforts, experiences, and suggestions. However, this issue is neglected in the scientific literature.
Publishing may help open debates and lead to adequate policy solutions. Key messages Ownership of patient information in the context of Big Data is a problem that should not be marginalized but needs a comprehensive attitude, consideration, and combined efforts from all stakeholders. Overcoming the challenge of ownership may help improve healthcare services, medical and public health research, and the health of the population as a whole.


2020
Vol 4 (2)
pp. 5
Author(s):
Ioannis C. Drivas
Damianos P. Sakas
Georgios A. Giannakopoulos
Daphne Kyriaki-Manessi

In the Big Data era, search engine optimization deals with the encapsulation of datasets related to website performance in terms of architecture, content curation, and user behavior, with the purpose of converting them into actionable insights that improve visibility and findability on the Web. In this respect, big data analytics expands the opportunities for developing new methodological frameworks composed of valid, reliable, and consistent analytics that are practically useful for developing well-informed strategies for organic traffic optimization. In this paper, a novel methodology is implemented to increase organic search engine visits based on the impact of multiple SEO factors. To achieve this, the authors examined 171 cultural heritage websites and the retrieved analytics about their performance and the user experience within them. Massive Web-based collections are included and presented by cultural heritage organizations through their websites. Users interact with these collections, producing behavioral analytics in a variety of data types that come from multiple devices, at high velocity, and in large volumes. Nevertheless, prior research efforts indicate that these massive cultural collections are difficult to browse and exhibit low visibility and findability in the semantic Web era. Against this backdrop, this paper proposes the computational development of a search engine optimization (SEO) strategy that utilizes the generated big cultural data analytics to improve the visibility of cultural heritage websites. Going one step further, the statistical results of the study are integrated into a predictive model composed of two stages: first, a fuzzy cognitive mapping process is generated as an aggregated macro-level descriptive model; second, a micro-level data-driven agent-based model follows.
The purpose of the model is to predict the most effective combinations of factors that achieve enhanced visibility and organic traffic on cultural heritage organizations' websites. To this end, the study contributes to expanding the knowledge of researchers and practitioners in the big cultural analytics sector, with the purpose of implementing potential strategies for greater visibility and findability of cultural collections on the Web.
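The macro-level stage of such a model rests on fuzzy cognitive mapping, in which concepts hold activation levels in [0, 1] and signed weights encode causal influence between them. A minimal sketch of one common FCM update rule follows; the three SEO concepts and the weight values are hypothetical illustrations, not the factors or coefficients from the study.

```python
import math

def sigmoid(x):
    """Squash an activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def fcm_step(activations, weights):
    """One update of a fuzzy cognitive map.

    activations: current concept activation levels in [0, 1].
    weights[i][j]: causal influence of concept i on concept j.
    Applies the common rule A_j(t+1) = f(A_j(t) + sum_i A_i(t) * w_ij).
    """
    n = len(activations)
    return [
        sigmoid(activations[j] + sum(activations[i] * weights[i][j]
                                     for i in range(n)))
        for j in range(n)
    ]

# Hypothetical concepts: [page speed, content freshness, organic visits]
weights = [
    [0.0, 0.0, 0.6],  # faster pages -> more organic visits
    [0.0, 0.0, 0.4],  # fresher content -> more organic visits
    [0.0, 0.0, 0.0],  # visits feed back into nothing in this sketch
]
state = [0.8, 0.5, 0.2]
for _ in range(5):  # iterate until activations settle
    state = fcm_step(state, weights)
```

Iterating the map and reading off the "organic visits" activation under different starting states is how such a descriptive model compares candidate factor combinations before the agent-based stage refines them.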


2017
Vol 113 (2)
pp. 1037-1057
Author(s):
Saeed-Ul Hassan
Mubashir Imran
Uzair Gillani
Naif Radi Aljohani
Timothy D. Bowman
...  

2021
Vol 251
pp. 01030
Author(s):
Qinqi Kang
Zhao Kang

With the rapid development of artificial intelligence in the current era of big data, the construction of translation corpora has become a key factor in effectively achieving highly intelligent translation. In the era of big data, the data sources and data types of translation corpora are becoming more and more diversified, which will inevitably bring about a new revolution in corpus construction. The construction of translation corpora in the era of big data can fully rely on multiple modes, such as third-party open-source data, crowdsourced translation, closed-loop machine translation, and human-machine collaboration, to comprehensively improve the quality of corpus construction and better serve translation practice.


2015
Vol 22 (6)
pp. 1115-1119
Author(s):
Saurabh Sinha
Jun Song
Richard Weinshilboum
Victor Jongeneel
Jiawei Han

Abstract We describe here the vision, motivations, and research plans of the National Institutes of Health Center for Excellence in Big Data Computing at the University of Illinois, Urbana-Champaign. The Center is organized around the construction of “Knowledge Engine for Genomics” (KnowEnG), an E-science framework for genomics where biomedical scientists will have access to powerful methods of data mining, network mining, and machine learning to extract knowledge out of genomics data. The scientist will come to KnowEnG with their own data sets in the form of spreadsheets and ask KnowEnG to analyze those data sets in the light of a massive knowledge base of community data sets called the “Knowledge Network” that will be at the heart of the system. The Center is undertaking discovery projects aimed at testing the utility of KnowEnG for transforming big data to knowledge. These projects span a broad range of biological enquiry, from pharmacogenomics (in collaboration with Mayo Clinic) to transcriptomics of human behavior.


Author(s):  
Nada M. Alhakkak

BigGIS is a new product that resulted from developing GIS in the Big Data area; it is used to store and process big geographical data and helps to solve its issues. This chapter describes an optimized Big GIS framework in a MapReduce environment, M2BG. The suggested framework has been integrated into the MapReduce environment in order to solve the storage issues and benefit from the Hadoop environment. M2BG includes two steps: a Big GIS warehouse and Big GIS MapReduce. The first step contains three main layers: the Data Source and Storage Layer (DSSL), the Data Processing Layer (DPL), and the Data Analysis Layer (DAL). The second step is responsible for clustering, using swarms as inputs for the Hadoop phase. Work is then scheduled in the mapping part with a preemptive priority scheduling algorithm: some data types are classified as critical while others are ordinary, and the reduce part uses a merge sort algorithm. M2BG should also address security and be implemented with real data, first in a simulated environment and later in the real world.
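The priority scheduling idea described for the mapping phase, serving tasks on critical data types before ordinary ones, can be sketched with a heap-backed priority queue. The task names, the two-class priority scheme, and the FIFO tiebreak are illustrative assumptions, not details from the chapter.

```python
import heapq

CRITICAL, ORDINARY = 0, 1  # lower number = higher priority

def schedule(tasks):
    """Yield task names in priority order: critical before ordinary.

    tasks: iterable of (name, kind) pairs. A running sequence number
    is used as a tiebreak so tasks of equal priority keep FIFO order.
    """
    heap = []
    for seq, (name, kind) in enumerate(tasks):
        heapq.heappush(heap, (kind, seq, name))
    while heap:
        _, _, name = heapq.heappop(heap)
        yield name

# Hypothetical geographic data-type tasks queued for the mapping phase.
order = list(schedule([
    ("road-network", ORDINARY),
    ("flood-zone", CRITICAL),
    ("land-use", ORDINARY),
    ("evacuation-route", CRITICAL),
]))
```

In a true preemptive scheduler a newly arrived critical task would also interrupt a running ordinary one; this sketch only captures the priority ordering at dispatch time.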

