Epi Archive: Automated Synthesis of Global Notifiable Disease Data

2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Hari S. Khalsa ◽  
Sergio Cordova ◽  
Nicholas Generous ◽  
Prabhu S. Khalsa ◽  
Byron Tasseff ◽  
...  

Objective
LANL has built software that automatically collects global notifiable disease data, synthesizes the data, and makes it available to humans and computers within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.

Introduction
Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health.

While most nations likely store incident data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [1]. A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language. The structure and format of PDF reports change often; this adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and it is common to find typos and clerical errors.

LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and to make it uniform and readily accessible.

Methods
We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported and in what formats. We determined the minimal metadata required to contextualize incident counts properly, as well as optional metadata that is commonly found. The development of software to regularly ingest notifiable disease data and make it available involves three to four main steps: scraping, detecting, parsing and persisting.

Scraping: We examine website design and determine reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then design and write code to automate the downloading of the data for each country. We store all artifacts presented as files (PDF, XLSX, etc.) in their original form, along with appropriate metadata for parsing and data provenance.

Detecting: This step is required when parsing structured non-machine-readable data, such as tabular data in PDF files. We combine the Nurminen methodology of PDF table detection with in-house heuristics to find the desired data within PDF reports [2].

Parsing: We determine what to extract from each dataset and parse these data into uniform data structures, correctly accommodating the variations in metadata (e.g., time interval definitions) and the various human languages.

Persisting: We store the data in the Epi Archive database and make it available on the internet and through the BSVE. The data are persisted into a structured and normalized SQL database.

Results
Epi Archive currently contains national and/or subnational notifiable disease data from twenty nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, subregion, disease of interest, and date range. Upon form submission, a time series is generated from the user's specifications. The generated graph can then be downloaded as a CSV file if a user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through a REST API (Representational State Transfer Application Programming Interface).

Conclusions
LANL, as part of a currently funded DTRA effort, is automatically and continually collecting global notifiable disease data. While 20 nations are in production, more are being brought online in the near future. These data are already being utilized and will have many applications, including improving the prediction and early warning of disease events.

References
[1] van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14:1144. doi:10.1186/1471-2458-14-1144
[2] Nurminen, Anssi. "Algorithmic extraction of data in tables in PDF documents." (2013).
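The scrape/detect/parse/persist pipeline described in the Methods can be sketched roughly as below. This is a minimal illustration only: the class name, field names, and table schema are assumptions for the sketch, not the actual LANL implementation, and SQLite stands in for the production SQL database.

```python
# Illustrative sketch: parse heterogeneous report rows into a uniform
# structure, then persist them into a normalized SQL table.
import sqlite3
from dataclasses import dataclass

@dataclass
class CaseCount:
    country: str
    disease: str
    period_start: str  # ISO date of the reporting interval
    period_days: int   # e.g. 7 for weekly reports
    count: int

def parse_rows(raw_rows):
    """Parse extracted table rows into uniform CaseCount records."""
    records = []
    for row in raw_rows:
        records.append(CaseCount(
            country=row["country"].strip(),
            disease=row["disease"].strip().lower(),
            period_start=row["week_start"],
            period_days=7,
            count=int(str(row["cases"]).replace(",", "")),  # tolerate "1,204"
        ))
    return records

def persist(records, conn):
    """Store records in a normalized table (SQLite stands in here)."""
    conn.execute("""CREATE TABLE IF NOT EXISTS case_counts (
        country TEXT, disease TEXT, period_start TEXT,
        period_days INTEGER, count INTEGER)""")
    conn.executemany(
        "INSERT INTO case_counts VALUES (?, ?, ?, ?, ?)",
        [(r.country, r.disease, r.period_start, r.period_days, r.count)
         for r in records])
    conn.commit()

raw = [{"country": "Brazil", "disease": "Dengue",
        "week_start": "2018-01-01", "cases": "1,204"}]
conn = sqlite3.connect(":memory:")
persist(parse_rows(raw), conn)
total = conn.execute("SELECT SUM(count) FROM case_counts").fetchone()[0]
```

Keeping the original artifacts alongside records like these preserves data provenance: any normalized count can be traced back to the PDF it came from.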

2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Hari S. Khalsa ◽  
Sergio Rene Cordova ◽  
Nicholas Generous

Objective
Automatically collect and synthesize global notifiable disease data and make it available to humans and computers. Provide the data on the web and within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.

Introduction
Government reporting of notifiable disease data is common and widespread, though most countries do not report in a machine-readable format. This is despite the WHO International Health Regulations stating that "[e]ach State Party shall notify WHO, by the most efficient means of communication available" [1]. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health. While most nations likely store incident data in a machine-readable format, governments can be hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [2].

A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language or graphical charts. The structure and format of PDF reports change often; this adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and it is common to find typos and clerical errors.

LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and to make it uniform and readily accessible.

Methods
A survey of the national notifiable disease reporting systems is periodically conducted, noting how the data are reported and in what formats. We determined the minimal metadata required to contextualize incident counts properly, as well as optional metadata that is commonly found. The development of software to regularly ingest notifiable disease data and make it available involves three to four main steps: scraping, detecting, parsing and persisting.

Scraping: We examine website design and determine reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then design and write code to automate the downloading of data for each country. We store all artifacts presented as files (PDF, XLSX, etc.) in their original form, along with appropriate metadata for parsing and data provenance.

Detecting: This step is required when parsing structured non-machine-readable data, such as tabular data in PDF files. We combine the Nurminen methodology of PDF table detection with in-house heuristics to find the desired data within PDF reports [3].

Parsing: We determine what to extract from each dataset and parse these data into uniform data structures, correctly accommodating the variations in metadata (e.g., time interval definitions) and the various human languages.

Persisting: We store the data in the Epi Archive database and make it available on the internet and through the BSVE. The data are persisted into a structured and normalized SQL database.

Results
Epi Archive currently contains national and/or subnational notifiable disease data from thirty-nine nations. When a user accesses the Epi Archive site, they are able to peruse, chart and download data by country, subregion, disease and time interval. A cached version of the original artifacts (e.g., PDF files), a link to the source and additional metadata are also available through the user interface. Finally, to ensure machine readability, the data from Epi Archive can be reached through a REST API: http://epiarchive.bsvgateway.org/

Conclusions
LANL, as part of a currently funded DTRA effort, is automatically and continually collecting global notifiable disease data. While thirty-nine nations are in production, more are being brought online in the near future. These data are already being utilized and have many applications, including improving the prediction and early warning of disease events.

References
[1] WHO International Health Regulations, 3rd edition. http://apps.who.int/iris/bitstream/10665/246107/1/9789241580496-eng.pdf
[2] van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14:1144. doi:10.1186/1471-2458-14-1144
[3] Nurminen, Anssi. "Algorithmic extraction of data in tables in PDF documents." (2013).
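A client query against the REST API mentioned in the Results might be assembled as below. Only the base URL comes from the abstract; the endpoint path (`/api/timeseries`) and parameter names are illustrative assumptions, since the abstract does not document the API's routes.

```python
# Sketch of building a time-series query URL for the Epi Archive REST API.
# Endpoint path and parameter names are hypothetical.
from urllib.parse import urlencode

BASE = "http://epiarchive.bsvgateway.org"

def build_query(country, disease, start, end):
    """Build a hypothetical time-series query URL."""
    params = {"country": country, "disease": disease,
              "start": start, "end": end}
    return f"{BASE}/api/timeseries?{urlencode(params)}"

url = build_query("mexico", "measles", "2018-01-01", "2018-12-31")
```

A machine-readable endpoint like this is what lets other BSVE analytics consume the data without scraping the web interface.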


2021 ◽  
Vol 22 (14) ◽  
pp. 7590
Author(s):  
Liza Vinhoven ◽  
Frauke Stanke ◽  
Sylvia Hafkemeyer ◽  
Manuel Manfred Nietert

Several causative therapeutics for cystic fibrosis (CF) patients have been developed; however, there are still no mutation-specific therapeutics for some patients, especially those with rare CFTR mutations. For this purpose, high-throughput screens have been performed, which yield various candidate compounds with mostly unclear modes of action. To elucidate the mechanism of action of promising candidate substances and to predict possible synergistic effects of substance combinations, we used a systems biology approach to create a model of the CFTR maturation pathway in cells in a standardized, human- and machine-readable format. It is composed of a core map, manually curated from small-scale experiments in human cells, and a coarse map that includes interactors identified in large-scale efforts. The manually curated core map comprises 170 different molecular entities and 156 reactions from 221 publications. The coarse map encompasses 1384 unique proteins from four publications. The overlap between the two data sources amounts to 46 proteins. The CFTR Lifecycle Map can be used to support the identification of potential targets inside the cell and to elucidate the mode of action of candidate substances. It thereby provides a backbone for structuring available data as well as a tool for developing hypotheses regarding novel therapeutics.
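Because the map is machine-readable, comparisons like the core/coarse overlap reported above reduce to set operations. A toy sketch, with invented placeholder identifiers rather than actual CFTR Lifecycle Map data:

```python
# Toy illustration of comparing a curated core map with a large-scale coarse
# map; the protein identifiers are invented placeholders.
core_map = {"CFTR", "HSPA8", "DNAJA1", "CANX"}           # curated entities
coarse_map = {"CFTR", "CANX", "STX6", "RAB5A", "HSPA8"}  # large-scale interactors

overlap = core_map & coarse_map    # proteins present in both maps
core_only = core_map - coarse_map  # entities seen only in small-scale work
```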


2021 ◽  
pp. 79-90
Author(s):  
Christian Zinke-Wehlmann ◽  
Amit Kirschenbaum ◽  
Raul Palma ◽  
Soumya Brahma ◽  
Karel Charvát ◽  
...  

Abstract
Data is the basis for creating information and knowledge. Having data in a structured and machine-readable format facilitates the processing and analysis of the data. Moreover, metadata (data about the data) can help in discovering data based on features such as by whom they were created, when, or for which purpose. These associated features make the data more interpretable and assist in turning it into useful information. This chapter briefly introduces the concepts of metadata and Linked Data (highly structured and interlinked data), their standards and their usages, with some elaboration on the role of Linked Data in bioeconomy.
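As a small illustration of the chapter's point, a dataset can be described with machine-readable metadata and serialized for exchange. The field names below follow the widely used Dublin Core element set; the dataset itself is an invented example, not one from the chapter.

```python
# Dublin Core-style metadata for a hypothetical dataset, serialized to JSON.
import json

metadata = {
    "dc:title": "Field sensor readings 2020",
    "dc:creator": "Example Farm Cooperative",
    "dc:date": "2020-06-01",
    "dc:format": "text/csv",
    "dc:description": "Soil moisture measurements, one row per sensor per hour",
}

serialized = json.dumps(metadata, indent=2)  # machine-readable exchange form
restored = json.loads(serialized)            # round-trips without loss
```

Linked Data goes a step further by replacing plain strings with resolvable URIs, so that descriptions from different sources can be interlinked.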


2021 ◽  
Author(s):  
Theo Araujo ◽  
Jef Ausloos ◽  
Wouter van Atteveldt ◽  
Felicia Loecherbach ◽  
Judith Moeller ◽  
...  

The digital traces that people leave through their use of various online platforms provide tremendous opportunities for studying human behavior. However, the collection of these data is hampered by legal, ethical and technical challenges. We present a framework and tool for collecting these data through a data donation platform where consenting participants can securely submit their digital traces. This approach leverages recent developments in data rights that have given people more control over their own data, such as legislation that now mandates companies to make digital trace data available on request in a machine-readable format. By transparently requesting access to specific parts of these data for clearly communicated academic purposes, the data ownership and privacy of participants are respected, and researchers are less dependent on commercial organizations that store these data in proprietary archives. In this paper we outline the general design principles, the current state of the tool, and future development goals.
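One processing step such a platform typically needs is minimizing and pseudonymizing a donated export before it reaches researchers. The sketch below shows the idea under invented assumptions: the export structure and field names are made up for illustration and do not describe the authors' actual tool.

```python
# Sketch: keep only research-relevant fields from a donated record and
# replace the raw identifier with a salted hash. Export format is invented.
import hashlib
import json

def pseudonymize(record, salt="study-xyz"):
    """Drop unneeded fields; replace the user id with a salted hash."""
    digest = hashlib.sha256((salt + record["user_id"]).encode()).hexdigest()
    return {"user": digest[:12],          # stable pseudonym within the study
            "timestamp": record["timestamp"],
            "action": record["action"]}   # "location" is deliberately dropped

donated = json.loads('{"user_id": "alice@example.com", '
                     '"timestamp": "2021-03-01T10:00:00", '
                     '"action": "view", "location": "secret"}')
clean = pseudonymize(donated)
```

Salting the hash prevents trivial re-identification by hashing candidate e-mail addresses, though real deployments need a fuller threat analysis.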


Author(s):  
M. Thangamani ◽  
P. Thangaraj

The increase in the number of documents has aggravated the difficulty of classifying those documents according to specific needs. Clustering analysis in a distributed environment is a thrust area in artificial intelligence and data mining. Its fundamental task is to use document features to compute the degree of relatedness between objects and to accomplish automatic classification without prior knowledge. Document clustering uses clustering techniques to group highly similar documents together by computing document similarity. Recent studies have shown that ontologies are useful in improving the performance of document clustering. Ontology is concerned with the conceptualization of a domain into an individually identifiable, machine-readable format containing entities, attributes, relationships, and axioms. By analyzing the available techniques for document clustering, a better clustering technique based on the Genetic Algorithm (GA) is determined. The Non-Dominated Ranked Genetic Algorithm (NRGA) is used in this paper for clustering, which has the capability of providing a better classification result. An experiment is conducted on the 20 Newsgroups data set to evaluate the proposed technique. The results show that the proposed approach is very effective in clustering documents in a distributed environment.
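The similarity computation at the core of document clustering can be illustrated with cosine similarity over term-frequency vectors. This sketch shows only the resemblance measure; it does not reproduce the paper's GA/NRGA optimization or its ontology enrichment.

```python
# Cosine similarity between two documents represented as token lists.
import math
from collections import Counter

def cosine_similarity(doc_a, doc_b):
    """Return the cosine of the angle between two term-frequency vectors."""
    a, b = Counter(doc_a), Counter(doc_b)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

sim = cosine_similarity("the cat sat".split(), "the cat ran".split())
```

A clustering algorithm, genetic or otherwise, then groups documents so that within-cluster similarity is high and between-cluster similarity is low.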


1977 ◽  
Vol 35 ◽  
pp. 104-119
Author(s):  
Anne B. Underhill ◽  
Jaylee M. Mead

Abstract
Many catalogues of astronomical data appear in book form as well as in a machine-readable format. The latter form is popular because of the convenience of handling large bodies of data by machine and because it is an efficient way in which to transmit and make accessible data in books which are now out of print or very difficult to obtain. Some new catalogues are prepared entirely in a machine-readable form, and the book form, if it exists at all, is of secondary importance for the preservation of the data.

In this paper, comments are given about the importance of prefaces for transmitting the results of a critical evaluation of a body of data, and it is noted that it is essential that this type of documentation be transferred with any machine-readable catalogue. The types of error sometimes encountered in handling machine-readable catalogues are noted. The procedures followed in developing the Goddard Cross Index of eleven star catalogues are outlined as one example of how star catalogues can be compared using computers. The classical approach to evaluating data critically is reviewed, and the types of question one should ask and answer for particular types of data are listed. Finally, a specific application of these precepts to the problem of line identifications is given.


2020 ◽  
Author(s):  
Tim Cernak ◽  
Babak Mahjour

High-throughput experimentation (HTE) is an increasingly important tool in the study of chemical synthesis. While the hardware for running HTE in the synthesis lab has evolved significantly in recent years, there remains a need for software solutions to navigate data-rich experiments. We have developed the software phactor™ to facilitate the performance and analysis of HTE in a chemical laboratory. phactor™ allows experimentalists to rapidly design arrays of chemical reactions in 24-, 96-, 384-, or 1,536-well plates. Users can access online reagent data, such as a lab inventory, to populate wells with experiments and produce instructions to perform the screen manually or with the assistance of a liquid handling robot. After completion of the screen, analytical results can be uploaded for facile evaluation and to guide the next series of experiments. All chemical data, metadata, and results are stored in a machine-readable format.
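The plate formats mentioned in the abstract follow standard row/column geometries (a 96-well plate is 8 rows by 12 columns, labeled A1 through H12). The sketch below generates those well labels; it reflects standard plate layout conventions, not phactor™'s internal representation.

```python
# Generate well labels for the standard plate sizes (24, 96, 384, 1536).
import string

PLATE_DIMS = {24: (4, 6), 96: (8, 12), 384: (16, 24), 1536: (32, 48)}

def well_names(n_wells):
    """Return well labels, e.g. A1..H12 for a 96-well plate."""
    rows, cols = PLATE_DIMS[n_wells]
    labels = []
    for r in range(rows):
        # Rows beyond Z continue as AA, AB, ... (needed for 1536-well plates)
        letter = (string.ascii_uppercase[r] if r < 26
                  else "A" + string.ascii_uppercase[r - 26])
        labels.extend(f"{letter}{c}" for c in range(1, cols + 1))
    return labels

wells96 = well_names(96)
```

Mapping every reaction in a screen to such a coordinate is what keeps the design, the robot instructions, and the uploaded analytical results machine-joinable.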


Author(s):  
Nicholas Generous ◽  
Geoffrey Fairchild ◽  
Hari Khalsa ◽  
Byron Tasseff ◽  
James Arnold

Objective
LANL has built a software program that automatically collects global notifiable disease data, particularly data stored in files, and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will improve the prediction and early warning of disease events and other applications.

Introduction
Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance and public health, as exemplified by the Biosurveillance Ecosystem (BSVE). While most nations do likely store their data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [1]. For example, an attempt by LANL to obtain a weekly version of openly available monthly data reported by the Australian government resulted in an onerous bureaucratic reply. The obstacles to obtaining data included paperwork to request data from each of the Australian states and territories, a long delay to obtain data (up to 3 months), and extensive limitations on the data's use that prohibit collaboration and sharing. This type of experience when attempting to contact public health departments or ministries of health for data is not uncommon.

A survey conducted by LANL of notifiable disease data reporting in 52 countries identified only 10 as being machine-readable, with 42 being reported in PDF files on a regular basis. Within the 42 nations that report in PDF files, 32 report in a structured, tabular format and 10 in a non-structured way. As a result, LANL has developed a tool, Epi Archive (formerly known as EPIC), to automatically and continuously collect global notifiable disease data and make it readily accessible.

Methods
We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported along two important dimensions: date standards and case definitions. The development of software to regularly ingest notifiable disease data and make it available involved four main steps: scraping, extracting, parsing and persisting.

Scraping: We examined website designs and determined reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then designed and wrote code to automate the downloading of report PDF files for each country. We stored the report PDFs along with appropriate metadata for extracting and parsing.

Extracting: We developed software that can extract notifiable disease data presented in tabular form from a PDF file. We combined the methodology of figure placement detection with in-house-developed table extraction and annotation heuristics.

Parsing: We determined what to extract from each PDF data set based on the survey conducted. We then parsed the extracted data into uniform data structures, correctly accommodating the dimensions surveyed and the various human languages. This task involved ingesting notifiable disease data in many disparate formats extracted from PDF files and coalescing the data into a standardized format.

Persisting: We store the data in the Epi Archive PostgreSQL database and make it available through the BSVE.

Results
The Epi Archive tool currently contains subnational notifiable disease data from 10 nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, region, disease, and date range. These fields allow the user to specify the location (down to the state level), the disease of interest, and the duration of interest. Upon form submission, a time series is generated from the user's specifications. The generated time series can then be downloaded as a CSV file if a user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through an API.

Conclusions
LANL, as part of a currently funded DTRA effort, automatically and continuously collects global notifiable disease data, particularly data stored in PDF files, and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will provide data to analytics and users that will improve the prediction and early warning of disease events and other applications.
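The CSV download described in the Results amounts to serializing the generated time series. A minimal sketch, in which the column names are illustrative assumptions rather than Epi Archive's actual export format:

```python
# Turn a queried (date, count) time series into CSV text for download.
import csv
import io

def timeseries_to_csv(series):
    """Write (date, count) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["date", "case_count"])
    writer.writerows(series)
    return buf.getvalue()

csv_text = timeseries_to_csv([("2016-01-04", 12), ("2016-01-11", 9)])
```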

