GBIF Integration of Open Data 

Author(s):  
Tim Robertson ◽  
Federico Mendez ◽  
Matthew Blissett ◽  
Morten Høfft ◽  
Thomas Stjernegaard Jeppesen ◽  
...  

The Global Biodiversity Information Facility (GBIF) runs a global data infrastructure that integrates data from more than 1,700 institutions. Combining data at this scale has been achieved by deploying open Application Programming Interfaces (APIs) that adhere to the open data standards provided by Biodiversity Information Standards (TDWG). In this presentation, we will provide an overview of the GBIF infrastructure and APIs and share lessons learned while operating and evolving these systems, such as long-term API stability, ease of use, and efficiency. This will include the following topics.
The registry component provides RESTful APIs for managing the organizations, repositories and datasets that comprise the network, and controls access permissions. Stability and ease of use have been critical to its being embedded in many systems. Changes within the registry trigger data crawling processes, which connect to external systems through their APIs and deposit datasets into GBIF's central data warehouse. One challenge here relates to the consistency of data across a distributed network.
Once a dataset is crawled, the data processing infrastructure organizes and enriches the data using reference catalogues accessed through open APIs, such as the vocabulary server and the taxonomic backbone. Being able to process data quickly as source data and reference catalogues change is a challenge for this component.
The data access APIs provide search and download services. Asynchronous APIs are required for some of these aspects, and long-term stability is a requirement for widespread adoption. Here we will discuss policies for schema evolution that avoid incompatible changes, which would cause failures in client systems.
The APIs that drive the user interface have specific needs, such as efficient use of network bandwidth. We will present how we approached this, and how we are currently adopting GraphQL as the next generation of these APIs.
Finally, there are several APIs that we believe are of use to the data publishing community. These include APIs that help with data quality, and APIs that surface new data of interest thanks to the data clustering algorithms GBIF deploys.
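As an illustration of the registry's RESTful style, the sketch below builds a paged query against GBIF's public dataset endpoint and extracts titles from a response of the documented envelope shape. The endpoint path and the `limit`/`offset` paging parameters follow GBIF's public v1 API; the sample response body is a minimal stand-in for illustration, not real registry output.

```python
from urllib.parse import urlencode
import json

API_ROOT = "https://api.gbif.org/v1"

def dataset_search_url(limit=20, offset=0, **filters):
    """Build a paged URL for the GBIF registry's dataset endpoint."""
    params = {"limit": limit, "offset": offset, **filters}
    return f"{API_ROOT}/dataset?{urlencode(params)}"

def dataset_titles(response_body):
    """Pull dataset titles out of a registry search response.

    Registry list responses wrap records in a 'results' array and
    report paging via 'count', 'limit' and 'offset'."""
    page = json.loads(response_body)
    return [d["title"] for d in page.get("results", [])]

# A minimal stand-in response with the documented envelope shape.
sample = json.dumps({
    "offset": 0, "limit": 2, "count": 2,
    "results": [{"title": "Dataset A"}, {"title": "Dataset B"}],
})

url = dataset_search_url(limit=2, type="OCCURRENCE")
titles = dataset_titles(sample)
```

In practice one would fetch `url` with any HTTP client and advance `offset` until `count` records have been seen.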

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data through the system. This cost is directly related to the distance of the processing engines from the data, which is the key motivation for the emergence of distributed processing platforms such as Hadoop, which move processing closer to the data. Computational storage devices (CSDs) push the "move process to data" paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data in place. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment for in-place data processing. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in place, without any modifications to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to 2.2× improvement in performance and 4.3× reduction in energy consumption for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved by up to 5.4× and 8.9×, respectively.
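The motivating argument above — that moving computation to the data avoids the cost of shipping data to application servers — can be sketched with a toy cost model. The linear model and all the numbers below are illustrative assumptions for intuition only, not figures or a model from the paper.

```python
def job_time_seconds(data_gb, link_gbps, host_gflops, csd_gflops,
                     flops_per_byte):
    """Toy comparison: fetch-then-process on a host vs. in-place
    processing on a computational storage device (CSD).

    The host pays a transfer cost over the storage link; the CSD
    pays none but has a slower embedded processor."""
    flops = data_gb * 1e9 * flops_per_byte
    transfer = data_gb * 8 / link_gbps          # seconds on the wire
    host = transfer + flops / (host_gflops * 1e9)
    csd = flops / (csd_gflops * 1e9)
    return host, csd

# A lightweight scan (0.5 FLOP/byte) over 100 GB: the link
# dominates, so even a much weaker in-storage processor wins.
host_t, csd_t = job_time_seconds(data_gb=100, link_gbps=8,
                                 host_gflops=200, csd_gflops=20,
                                 flops_per_byte=0.5)
```

For compute-heavy workloads (high FLOPs per byte) the balance shifts back toward the faster host processor, which is why in-storage processing pays off most for scan- and filter-style jobs.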


2021 ◽  
Vol 1 (2) ◽  
pp. 89
Author(s):  
Lutfia Rizkyatul Akbar ◽  
Gunadi Gunadi

This study aims to assess the implementation of banking data access openness policies for improving tax compliance in Indonesia. The issue arises because tax collection uses a self-assessment system, which requires taxpayer data and information from financial institutions, including banks. The researchers used qualitative descriptive methods. The results of this study are as follows. First, implementation of the policy on open access to banking data to increase tax compliance in Indonesia is supported by the issuance of Law Number 9 of 2017 concerning Access to Financial Information. Second, implementation of the banking data disclosure policy includes the willingness of target groups to comply with policy outputs, in this case the reporting of customer data by banks to the DGT. Third, the open banking data access policy has not impeded or reduced the number of bank accounts and deposits. Fourth, there were technical obstacles for both the DGT and the banking sector, especially in the first year. Furthermore, several factors inhibited implementation of this policy: IT factors; resistance from some circles when the regulations first appeared; limited financial resources to process data quickly, so implementation had to proceed gradually; and shortfalls in the quantity and quality of human resources.


Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere:
- Data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details.
- Data deposited in trusted repositories and/or supplementary files are described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata.
- Integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article.
- Data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines.
- Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5).
These approaches represent different aspects of the prospective scholarly publishing of biodiversity data which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the ground for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, and to various end users.
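The EML-to-manuscript conversion mentioned above can be illustrated with a minimal sketch: parse an EML metadata file and lift its core fields into data paper front matter. The EML fragment and the field mapping here are simplified assumptions for illustration; real EML documents are namespaced and far richer.

```python
import xml.etree.ElementTree as ET

# A deliberately simplified EML fragment (real EML uses XML
# namespaces and many more elements).
EML = """<eml>
  <dataset>
    <title>Moth records of Denmark</title>
    <creator><individualName>
      <givenName>Jane</givenName><surName>Doe</surName>
    </individualName></creator>
    <abstract>Occurrence records of moths, 2010-2020.</abstract>
  </dataset>
</eml>"""

def eml_to_front_matter(eml_text):
    """Map core EML dataset fields onto data paper front matter."""
    ds = ET.fromstring(eml_text).find("dataset")
    given = ds.findtext("creator/individualName/givenName")
    sur = ds.findtext("creator/individualName/surName")
    return {
        "title": ds.findtext("title"),
        "authors": [f"{given} {sur}"],
        "abstract": ds.findtext("abstract"),
    }

paper = eml_to_front_matter(EML)
```

A production converter would also carry over keywords, geographic and temporal coverage, and licence information, which EML records in dedicated elements.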


Author(s):  
Julián Rojas ◽  
Bert Marcelis ◽  
Eveline Vlassenroot ◽  
Mathias Van Compernolle ◽  
Pieter Colpaert ◽  
...  

Chapter 8 in the edited volume Situating Open Data.


2018 ◽  
Vol 13 (1) ◽  
pp. 160-168
Author(s):  
Nandalal Rana ◽  
Krishna P Bhandari ◽  
Surendra Shrestha

Bandwidth requirement prediction is an important part of network design and service planning. The natural way of predicting the bandwidth requirement for an existing network is to analyse past trends and apply an appropriate mathematical model to predict the future. For this research, historical usage data from FWDR network nodes of Nepal Telecom were fitted with a univariate linear time series ARIMA model after logit transformation to predict future bandwidth requirements. The predicted data were compared to real data obtained from the same network and found to be within 10% MAPE. This model reduces the MAPE by 11.71% and 15.42%, respectively, compared to the non-logit-transformed ARIMA model at a 99% CI. The results imply that the logit-transformed ARIMA model performs better than the non-logit-transformed ARIMA model. For more accurate and longer-term predictions, a larger dataset can be used, along with seasonal adjustments and consideration of long-term variations.
Journal of the Institute of Engineering, 2017, 13(1): 160-168
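The core idea above — constrain utilisation forecasts to the (0, 1) interval by modelling the logit of utilisation rather than the raw value — can be sketched without the full ARIMA machinery. Below, a plain AR(1) is fitted by least squares on logit-transformed link utilisation and forecasts are mapped back through the inverse logit; the data and the AR(1) simplification are illustrative assumptions, not the paper's model or dataset.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def fit_ar1(series):
    """Least-squares fit of y[t] = a + b * y[t-1]."""
    x, y = series[:-1], series[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    return a, b

def forecast(util, steps):
    """Forecast link utilisation via a logit-space AR(1); the
    inverse logit guarantees predictions stay inside (0, 1)."""
    z = [logit(u) for u in util]
    a, b = fit_ar1(z)
    out, last = [], z[-1]
    for _ in range(steps):
        last = a + b * last
        out.append(inv_logit(last))
    return out

# Monthly utilisation of a hypothetical link, slowly growing.
util = [0.30, 0.33, 0.37, 0.40, 0.44, 0.48, 0.52, 0.55]
preds = forecast(util, steps=3)
```

A raw-scale linear model extrapolated far enough would eventually predict utilisation above 100%; the logit transform is what rules that out by construction.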


2021 ◽  
Vol 1 ◽  
pp. 80
Author(s):  
Thijs Devriendt ◽  
Clemens Ammann ◽  
Folkert W. Asselbergs ◽  
Alexander Bernier ◽  
Rodrigo Costas ◽  
...  

Various data sharing platforms are being developed to enhance the sharing of cohort data by addressing the fragmented state of data storage and access systems. However, policy challenges in several domains remain unresolved. The euCanSHare workshop was organized to identify and discuss these challenges and to set the future research agenda. Concerns over the multiplicity and long-term sustainability of platforms, lack of resources, access of commercial parties to medical data, credit and recognition mechanisms in academia and the organization of data access committees are outlined. Within these areas, solutions need to be devised to ensure an optimal functioning of platforms.


Author(s):  
Денис Валерьевич Сикулер

This work presents a review of 10 Internet resources that can be used to find data for various tasks related to machine learning and artificial intelligence. Both widely known sites (such as Kaggle and the Registry of Open Data on AWS) and less popular or highly specialised resources (such as The Big Bad NLP Database and Common Crawl) are examined. All of the resources provide free access to data, and in most cases registration is not even required. For each resource, the characteristics and features concerning dataset search and retrieval are specified. The following sites are included in the review: Kaggle, Google Research, Microsoft Research Open Data, Registry of Open Data on AWS, Harvard Dataverse Repository, Zenodo, the Open Data portal of the Russian Federation, World Bank, The Big Bad NLP Database, Common Crawl.


2019 ◽  
Vol 2 ◽  
Author(s):  
Lyubomir Penev

"Data ownership" is actually an oxymoron, because there could not be a copyright (ownership) on facts or ideas, hence no data onwership rights and law exist. The term refers to various kinds of data protection instruments: Intellectual Property Rights (IPR) (mostly copyright) asserted to indicate some kind of data ownership, confidentiality clauses/rules, database right protection (in the European Union only), or personal data protection (GDPR) (Scassa 2018). Data protection is often realised via different mechanisms of "data hoarding", that is witholding access to data for various reasons (Sieber 1989). Data hoarding, however, does not put the data into someone's ownership. Nonetheless, the access to and the re-use of data, and biodiversuty data in particular, is hampered by technical, economic, sociological, legal and other factors, although there should be no formal legal provisions related to copyright that may prevent anyone who needs to use them (Egloff et al. 2014, Egloff et al. 2017, see also the Bouchout Declaration). One of the best ways to provide access to data is to publish these so that the data creators and holders are credited for their efforts. As one of the pioneers in biodiversity data publishing, Pensoft has adopted a multiple-approach data publishing model, resulting in the ARPHA-BioDiv toolbox and in extensive Strategies and Guidelines for Publishing of Biodiversity Data (Penev et al. 2017a, Penev et al. 2017b). ARPHA-BioDiv consists of several data publishing workflows: Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). 
Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph Deposition of underlying data in an external repository and/or its publication as supplementary file(s) to the related article which are then linked and/or cited in-tex. Supplementary files are published under their own DOIs to increase citability). Description of data in data papers after they have been deposited in trusted repositories and/or as supplementary files; the systme allows for data papers to be submitted both as plain text or converted into manuscripts from Ecological Metadata Language (EML) metadata. Import of structured data into the article text from tables or via web services and their susequent download/distribution from the published article as part of the integrated narrative and data publishing workflow realised by the Biodiversity Data Journal. Publication of data in structured, semanticaly enriched, full-text XMLs where data elements are machine-readable and easy-to-harvest. Extraction of Linked Open Data (LOD) from literature, which is then converted into interoperable RDF triples (in accordance with the OpenBiodiv-O ontology) (Senderov et al. 
2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph In combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, these approaches show different angles to the future of biodiversity data publishing and, lay the foundations of an entire data publishing ecosystem in the field, while also supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as Global Biodiversity Information Facility (GBIF), Biodiversity Literature Repository (BLR), Plazi TreatmentBank, OpenBiodiv, as well as to various end users.
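As a minimal illustration of the LOD extraction step, the sketch below serialises a taxonomic statement as Turtle triples. The `ex:` prefix, the `extractedFrom` property and the identifiers are invented placeholders for illustration only; the actual OpenBiodiv-O terms differ, though `dwc:scientificName` is a genuine Darwin Core term.

```python
# Minimal Turtle serialisation of illustrative triples; the ex:
# vocabulary below is a placeholder, not the OpenBiodiv-O ontology.
PREFIXES = {
    "ex": "http://example.org/biodiv/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

def to_turtle(triples):
    """Emit prefix declarations followed by one line per triple."""
    lines = [f"@prefix {p}: <{iri}> ." for p, iri in PREFIXES.items()]
    lines += [f"{s} {p} {o} ." for s, p, o in triples]
    return "\n".join(lines)

triples = [
    ("ex:treatment42", "dwc:scientificName", '"Harmonia axyridis"'),
    ("ex:treatment42", "ex:extractedFrom", "ex:article7"),
]
doc = to_turtle(triples)
```

Once in a triple store, statements of this shape can be joined across articles — the property that makes a knowledge graph more useful than the individual papers it was extracted from.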


Author(s):  
Roel During ◽  
Marcel Pleijte ◽  
Rosalie I. van Dam ◽  
Irini E. Salverda

Open data and citizen-led initiatives can be both friends and foes. Where it is available and 'open', official data not only encourages increased public participation but can also generate the production and scrutiny of new material, potentially of benefit to the original provider and others, official or otherwise. In this way, official open data can be seen to improve democracy or, more accurately, so-called 'participative democracy'. On the other hand, the public is not always eager to share their personal information in the most open ways. Private and sometimes sensitive information, however, is required to initiate projects of societal benefit in difficult times. Many citizens appear content to channel personal information exchange via social media instead of putting it on public websites. The perceived benefits of sharing and complete openness do not outweigh the disadvantages or fear of regulation. This is caused by various sources of contingency, such as the different appeals made to citizens in discourses on the participation society and representative democracy, which call for social openness in the former and privacy protection in the latter. Moreover, the discourse on open data is an economic argument fighting the rules of privacy, rather than a promotion of open data as one of the prerequisites for social action. Civil servants acknowledge that access to open data via all sorts of apps could contribute to the mushrooming of public initiatives, but they are reluctant to release person-related sensitive information. The authors describe and discuss this dilemma in the context of recent case studies from the Netherlands concerning governmental programmes on open data and citizens' initiatives, highlighting governance constraints and uncertainties as well as citizens' concerns about data access and data sharing. It is shown that openness has a different meaning and understanding in the participation society and in representative democracy: namely, the tension between sharing private social information and transparency. Looking at openness from both sides reveals a double contingency: understandings of and intentions about this openness invoke mutually reinforcing uncertainties. This double contingency hampers citizens' eagerness to participate. The paper concludes with a practical recommendation for improving data governance.


BMJ Open ◽  
2019 ◽  
Vol 9 (2) ◽  
pp. e024152 ◽  
Author(s):  
Aileen Grant ◽  
Sarah Dean ◽  
Jean Hay-Smith ◽  
Suzanne Hagen ◽  
Doreen McClurg ◽  
...  

Introduction: Female urinary incontinence (UI) is common, affecting up to 45% of women. Pelvic floor muscle training (PFMT) is the first-line treatment, but there is uncertainty whether intensive PFMT is better than basic PFMT for long-term symptomatic improvement. It is also unclear which factors influence women's ability to perform PFMT long term, and whether this impacts long-term outcomes. The OPAL (optimising PFMT to achieve long-term benefits) trial examines the effectiveness and cost-effectiveness of basic PFMT versus biofeedback-mediated PFMT, and this evaluation explores women's experiences of treatment and the factors which influence effectiveness. This will provide data aiding interpretation of the trial findings; make recommendations for optimising the treatment protocol; support implementation in practice; and address gaps in the literature around long-term adherence to PFMT for women with stress or mixed UI.
Methods and analysis: This evaluation comprises a longitudinal qualitative case study and a process evaluation (PE). The case study aims to explore women's experiences of treatment and adherence, and the PE will explore factors influencing intervention effectiveness. The case study has a two-tailed design and will recruit 40 women, 20 from each trial group; they will be interviewed four times over 2 years. Process data will be collected from women through questionnaires at four time points, from health professionals through checklists and interviews, and by sampling 100 audio recordings of appointments. Qualitative analysis will use case study methodology (qualitative study) and the framework technique (PE) and will interrogate for similarities and differences between the trial groups regarding barriers and facilitators to adherence. Process data analyses will examine fidelity, engagement and mediating factors using descriptive and interpretative statistics.
Ethics and dissemination: Approval from West of Scotland Research Ethics Committee 4 (16/LO/0990). Findings will be published in journals and disseminated at conferences and through the final report.
Trial registration number: ISRCTN57746448.

