Epi Archive: automated data collection of notifiable disease data

Author(s):  
Nicholas Generous ◽  
Geoffrey Fairchild ◽  
Hari Khalsa ◽  
Byron Tasseff ◽  
James Arnold

Objective
LANL has built a software program that automatically collects global notifiable disease data (particularly data stored in files) and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will improve the prediction and early warning of disease events, among other applications.

Introduction
Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables, and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance, and public health, as exemplified by the Biosurveillance Ecosystem (BSVE).
While most nations likely store their data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [1]. For example, an attempt by LANL to obtain a weekly version of openly available monthly data reported by the Australian government resulted in an onerous bureaucratic reply. The obstacles to obtaining the data included paperwork to request data from each of the Australian states and territories, a long delay to obtain data (up to 3 months), and extensive limitations on the data's use that prohibit collaboration and sharing. This type of experience when attempting to contact public health departments or ministries of health for data is not uncommon.
A survey conducted by LANL of notifiable disease data reporting in 52 countries identified only 10 as being machine-readable, with 42 reporting in PDF files on a regular basis. Within the 42 nations that report in PDF files, 32 report in a structured, tabular format and 10 in an unstructured way.
As a result, LANL has developed a tool, Epi Archive (formerly known as EPIC), to automatically and continuously collect global notifiable disease data and make it readily accessible.

Methods
We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported along two important dimensions: date standards and case definitions.
The development of software that regularly ingests notifiable disease data and makes it available involved four main steps: scraping, extracting, parsing, and persisting.
Scraping: we examined website designs and determined the reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then designed and wrote code to automate the downloading of report PDF files for each country. We stored the report PDFs along with appropriate metadata for extracting and parsing.
Extracting: we developed software that can extract notifiable disease data presented in tabular form from a PDF file. We combined a figure placement detection methodology with in-house table extraction and annotation heuristics.
Parsing: we determined what to extract from each PDF dataset based on the survey conducted. We then parsed the extracted data into uniform data structures, correctly accommodating the dimensions surveyed and the various human languages. This task involved ingesting notifiable disease data in many disparate formats extracted from PDF files and coalescing the data into a standardized format.
Persisting: we store the data in the Epi Archive PostgreSQL database and make it available through the BSVE.

Results
The Epi Archive tool currently contains subnational notifiable disease data from 10 nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, region, disease, and date duration. These fields allow the user to specify the location (down to the state level), the disease of interest, and the duration of interest. Upon form submission, a time series is generated from the user's specifications. The generated time series can then be downloaded as a CSV file if a user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through an API.

Conclusions
LANL, as part of a currently funded DTRA effort, has built software that automatically and continuously collects global notifiable disease data (particularly data stored in PDF files) and makes it available and shareable within the Biosurveillance Ecosystem (BSVE) as a new data source. This will provide data to analytics and users that will improve the prediction and early warning of disease events, among other applications.
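The parsing step described above, coalescing data extracted in disparate formats into a standardized structure, can be sketched roughly as follows. The record fields and the [disease, week, count] row layout are hypothetical illustrations, not Epi Archive's actual schema:

```python
import re
from dataclasses import dataclass

@dataclass
class DiseaseRecord:
    """A uniform record for one disease count (hypothetical schema)."""
    country: str
    region: str
    disease: str
    week: int
    count: int

def parse_table_row(country: str, region: str, row: list[str]) -> DiseaseRecord:
    """Normalize one extracted table row into a uniform record.

    Assumes a hypothetical [disease, week, count] layout, with stray
    whitespace and thousands separators left over from PDF extraction.
    """
    disease, week, count = (cell.strip() for cell in row)
    return DiseaseRecord(
        country=country,
        region=region,
        disease=disease,
        week=int(week),
        count=int(re.sub(r"[ ,.]", "", count)),  # "1 204" or "1,204" -> 1204
    )
```

A row extracted as `["Dengue ", " 32", "1,204"]` would normalize to a record with `week=32` and `count=1204`, regardless of which country's report it came from.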

2021 ◽  
Vol 12 ◽  
pp. 215013272199545
Author(s):  
Areej Khokhar ◽  
Aaron Spaulding ◽  
Zuhair Niazi ◽  
Sikander Ailawadhi ◽  
Rami Manochakian ◽  
...  

Importance: Social media is widely used by various segments of society. Its role as a tool of communication by the Public Health Departments in the U.S. remains unknown. Objective: To determine the impact of the COVID-19 pandemic on social media following of the Public Health Departments of the 50 States of the U.S. Design, Setting, and Participants: Data were collected by visiting the Public Health Department web page for each social media platform. State-level demographics were collected from the U.S. Census Bureau. The Centers for Disease Control and Prevention was utilized to collect information regarding the governance of each State's Public Health Department. Health rankings were collected from "America's Health Rankings" 2019 Annual Report from the United Health Foundation. The U.S. News and World Report Education Rankings were utilized to provide information regarding the public education of each State. Exposure: Data were pulled on 3 separate dates: first on March 5th (baseline, pre-national emergency declaration (NED) for COVID-19), March 18th (week following NED), and March 25th (2 weeks after NED). In addition, a variable identifying the total change across platforms was also created. All data were collected at the State level. Main Outcome: Overall, the social media following of the state Public Health Departments was very low. There was a significant increase in the public interest in following the Public Health Departments during the early phase of the COVID-19 pandemic. Results: With the declaration of National Emergency, there was a 150% increase in overall public following of the State Public Health Departments in the U.S. The increase was most noted in the Midwest and South regions of the U.S. The overall following in the pandemic "hotspots," such as New York, California, and Florida, was significantly lower. Interesting correlations were noted between various demographic variables, the health and education rankings of the States, and the social media following of their Health Departments. Conclusion and Relevance: Social media following of Public Health Departments across all States of the U.S. was very low. Though the social media following significantly increased during the early course of the COVID-19 pandemic, it still remains low. Significant opportunity exists for Public Health Departments to improve social media use to engage the public better.


Author(s):  
Jeremiah Rounds ◽  
Lauren Charles-Smith ◽  
Courtney D. Corley

Objective
To introduce Soda Pop, an R/Shiny application designed to be a disease-agnostic time-series clustering, alarming, and forecasting tool to assist in disease surveillance "triage, analysis and reporting" workflows within the Biosurveillance Ecosystem (BSVE) [1]. In this poster, we highlight the new capabilities that Soda Pop brings to the BSVE, with an emphasis on the impact of methodological decisions.

Introduction
The Biosurveillance Ecosystem (BSVE) is a biological and chemical threat surveillance system sponsored by the Defense Threat Reduction Agency (DTRA). BSVE is intended to be a user-friendly, multi-agency, cooperative, modular, and threat-agnostic platform for biosurveillance [2]. In BSVE, a web-based workbench presents the analyst with applications (apps) developed by various DTRA-funded researchers, which are deployed on demand in the cloud (e.g., Amazon Web Services). These apps aim to address emerging needs and refine capabilities to enable early warning of chemical and biological threats for multiple users across local, state, and federal agencies.
Soda Pop is an app developed by Pacific Northwest National Laboratory (PNNL) to meet the current needs of the BSVE for early warning and detection of disease outbreaks. Aimed at use by a diverse set of analysts, the application is agnostic to data source and spatial scale, enabling it to generalize across many diseases and locations. To achieve this, we placed particular emphasis on clustering and alerting of disease signals within Soda Pop without strong prior assumptions about the nature of the observed disease counts.

Methods
Although designed to be agnostic to the data source, Soda Pop was initially developed and tested on data summarizing Influenza-Like Illness in military hospitals, from a collaboration with the Armed Forces Health Surveillance Branch. Currently, the data incorporated also include the CDC's National Notifiable Diseases Surveillance System (NNDSS) tables [3] and the WHO's Influenza A/B data (FluNet) [4]. These data sources are now present in BSVE's Postgres data storage for direct access.
Soda Pop is designed to automate the time-series tasks of data summarization, exploration, clustering, alarming, and forecasting. Built as an R/Shiny application, Soda Pop is founded on the powerful statistical tool R [5]. Where applicable, Soda Pop facilitates non-parametric seasonal decomposition of time series; hierarchical agglomerative clustering across reporting areas and between diseases within reporting areas; and a variety of alarming techniques, including Exponentially Weighted Moving Average alarms and Early Aberration Detection [6].
Soda Pop embeds these techniques within a user interface designed to enhance an analyst's understanding of emerging trends in their data and enables the inclusion of its graphical elements into their dossier for further tracking and reporting. The ultimate goal of this software is to facilitate the discovery of unknown disease signals and to increase the speed of detection of unusual patterns within these signals.

Conclusions
Soda Pop organizes common statistical disease surveillance tasks in a manner integrated with BSVE data source inputs and outputs. The app analyzes time-series disease data and supports a robust set of clustering and alarming routines that avoid strong assumptions about the nature of the observed disease counts. This attribute allows for flexibility in data source, spatial scale, and disease type, making Soda Pop useful to a wide range of analysts within the BSVE.

Keywords
BSVE; Biosurveillance; R/Shiny; Clustering; Alarming

Acknowledgments
This work was supported by the Defense Threat Reduction Agency under contract CB10082 with Pacific Northwest National Laboratory.

References
1. Dasey, Timothy, et al. "Biosurveillance Ecosystem (BSVE) Workflow Analysis." Online Journal of Public Health Informatics 5.1 (2013).
2. http://www.defense.gov/News/Article/Article/681832/dtra-scientists-develop-cloud-based-biosurveillance-ecosystem. Accessed 9/6/2016.
3. Centers for Disease Control and Prevention. "National Notifiable Diseases Surveillance System (NNDSS)."
4. World Health Organization. "FluNet." Global Influenza Surveillance and Response System (GISRS).
5. R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
6. Salmon, Maëlle, et al. "Monitoring Count Time Series in R: Aberration Detection in Public Health Surveillance." Journal of Statistical Software 70.10 (2016): 1-35.
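One of the alarming techniques named in the Soda Pop abstract, the Exponentially Weighted Moving Average alarm, can be sketched in a few lines. Soda Pop itself is built in R; this is a language-neutral illustration in Python, and the smoothing weight, threshold multiplier, and baseline length are illustrative defaults rather than Soda Pop's actual settings:

```python
import statistics

def ewma_alarm(counts, lam=0.3, k=3.0, baseline=8):
    """Flag time points whose exponentially weighted moving average
    exceeds the baseline mean by k (steady-state) standard deviations.

    Returns (index, alarmed) pairs for every point after the baseline.
    """
    base = counts[:baseline]
    mu, sigma = statistics.mean(base), statistics.stdev(base)
    # steady-state EWMA standard deviation is sigma * sqrt(lam / (2 - lam))
    limit = mu + k * sigma * (lam / (2 - lam)) ** 0.5
    z, alarms = mu, []
    for t, y in enumerate(counts[baseline:], start=baseline):
        z = lam * y + (1 - lam) * z  # update the smoothed value
        alarms.append((t, z > limit))
    return alarms
```

On a flat series of weekly counts around 10, a sudden jump to 30 pushes the smoothed value past the control limit and raises an alarm at that time point.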


2012 ◽  
Vol 9 (3) ◽  
pp. 57-68 ◽  
Author(s):  
Ana Margarida Sousa ◽  
Andreia Ferreira ◽  
Nuno F. Azevedo ◽  
Maria Olivia Pereira ◽  
Anália Lourenço

Summary The study of microorganism consortia, also known as biofilms, is associated with a number of applications in biotechnology, ecotechnology, and clinical domains. Nowadays, biofilm studies are heterogeneous and data-intensive, encompassing different levels of analysis. Computational modelling of biofilm studies has thus become a requirement to make sense of these vast and ever-expanding volumes of biofilm data. The rationale of the present work is a machine-readable format for representing biofilm studies and supporting biofilm data interchange and integration. This format is supported by the Biofilm Science Ontology (BSO), the first ontology on biofilm information. The ontology is decomposed into a number of areas of interest, namely: the Experimental Procedure Ontology (EPO), which describes biofilm experimental procedures; the Colony Morphology Ontology (CMO), which morphologically characterises microorganism colonies; and other modules concerning biofilm phenotype, antimicrobial susceptibility, and virulence traits. The overall objective behind BSO is to develop semantic resources to capture, represent, and share data on biofilms and related experiments in a regularized fashion. Furthermore, the present work also introduces a framework to assist biofilm data interchange and analysis, BiofOmics (http://biofomics.org), and a public repository of colony morphology signatures, MorphoCol (http://stardust.deb.uminho.pt/morphocol).


2020 ◽  
Vol 7 (1) ◽  
pp. 81-86
Author(s):  
Amit Mishra ◽  
Sundeep Sahay

Adoption of information technology in healthcare has in recent years improved the process of information collection, analysis, and use in the Indian public health system. However, it has also led to a multiplicity of information systems. Currently, a good amount of data is being generated by various health management information systems (HMISs); however, the usability of these data sets is limited owing to a lack of technical and institutional ability to share data with other systems. The lack of an effective standard list of health facilities is one of the major impediments to building interoperability among these multiple systems. To overcome this challenge, the Indian Ministry of Health and Family Welfare has initiated a programme to build a master facility list (MFL) known as the National Identification Number to Health Facilities. Facility data from two leading national public health information systems, which have routinely reported health data since 2008, were selected for this purpose. Common facilities were placed on an online portal for verification by state-level and district-level officers. Currently, this portal holds more than 200 000 verified public health facilities. Use of facility data from existing systems has helped to quickly populate the MFL in India. However, design limitations of the existing systems were also carried over to the facility portal. Key challenges to sustaining and evolving this portal in the future include (1) integration of other HMISs holding facility data with the MFL, (2) public notification of standards for the MFL, (3) a comprehensive data quality audit of existing MFL facility data and (4) establishment of robust governance mechanisms. We discuss how the benefits from this exercise in technical innovation can be realised more effectively in practice.


2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Hari S. Khalsa ◽  
Sergio Cordova ◽  
Nicholas Generous ◽  
Prabhu S. Khalsa ◽  
Byron Tasseff ◽  
...  

Objective
LANL has built software that automatically collects global notifiable disease data, synthesizes the data, and makes it available to humans and computers within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.

Introduction
Most countries do not report national notifiable disease data in a machine-readable format. Data are often in the form of a file that contains text, tables, and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance, and public health.
While most nations likely store incident data in a machine-readable format, governments are often hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [1].
A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language.
The structure and format of PDF reports change often; this adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and it is common to find typos and clerical errors.
LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and make it uniform and readily accessible.

Methods
We conducted a survey of the national notifiable disease reporting systems, noting how the data are reported and in what formats. We determined the minimal metadata that is required to contextualize incident counts properly, as well as optional metadata that is commonly found.
The development of software to regularly ingest notifiable disease data and make it available involves three or four main steps: scraping, detecting, parsing, and persisting.
Scraping: we examine website design and determine the reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then design and write code to automate the downloading of the data for each country. We store all artifacts presented as files (PDF, XLSX, etc.) in their original form, along with appropriate metadata for parsing and data provenance.
Detecting: this step is required when parsing structured non-machine-readable data, such as tabular data in PDF files. We combine the Nurminen methodology of PDF table detection with in-house heuristics to find the desired data within PDF reports [2].
Parsing: we determine what to extract from each dataset and parse these data into uniform data structures, correctly accommodating the variations in metadata (e.g., time interval definitions) and the various human languages.
Persisting: we store the data in the Epi Archive database and make it available on the internet and through the BSVE. The data are persisted into a structured and normalized SQL database.

Results
The Epi Archive tool currently contains national and/or subnational notifiable disease data from twenty nations. When a user accesses the Epi Archive site, they are prompted with four fields: country, subregion, disease of interest, and date duration. Upon form submission, a time series is generated from the user's specifications. The generated series can then be downloaded as a CSV file if a user is interested in performing their own analysis. Additionally, the data from Epi Archive can be reached through a REST API (Representational State Transfer Application Programming Interface).

Conclusions
LANL, as part of a currently funded DTRA effort, is automatically and continually collecting global notifiable disease data. While 20 nations are in production, more are being brought online in the near future. These data are already being utilized and will have many applications, including improving the prediction and early warning of disease events.

References
[1] van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14:1144. doi:10.1186/1471-2458-14-1144
[2] Nurminen, Anssi. "Algorithmic extraction of data in tables in PDF documents." (2013).
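The detecting step can be illustrated with a much-simplified heuristic: scan text extracted from a PDF for runs of consecutive lines that split into the same number of columns. The actual system combines Nurminen's table-detection methodology with in-house heuristics; this sketch only conveys the flavor of the problem, and the two-space column delimiter is an assumption about the extractor's output:

```python
import re

def find_table_blocks(lines, min_rows=3):
    """Return (start, end) index pairs of candidate table regions, where a
    region is a run of >= min_rows consecutive lines that each split into
    the same number (>= 2) of columns on gaps of two or more spaces."""
    blocks, start, ncols = [], None, None
    for i, line in enumerate(lines + [""]):  # sentinel flushes the last run
        cols = len(re.split(r"\s{2,}", line.strip())) if line.strip() else 0
        if cols >= 2 and (ncols is None or cols == ncols):
            if start is None:
                start, ncols = i, cols
        else:
            if start is not None and i - start >= min_rows:
                blocks.append((start, i))  # half-open interval [start, i)
            start, ncols = (i, cols) if cols >= 2 else (None, None)
    return blocks
```

Given a page with a three-column table sandwiched between free-text paragraphs, the function returns the index range of the table rows, which a parser could then hand off for extraction.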


2019 ◽  
Vol 11 (1) ◽  
Author(s):  
Hari S. Khalsa ◽  
Sergio Rene Cordova ◽  
Nicholas Generous

Objective
Automatically collect and synthesize global notifiable disease data and make it available to humans and computers. Provide the data on the web and within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.

Introduction
Government reporting of notifiable disease data is common and widespread, though most countries do not report in a machine-readable format. This is despite the WHO International Health Regulations stating that "[e]ach State Party shall notify WHO, by the most efficient means of communication available." [1]
Data are often in the form of a file that contains text, tables, and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance, and public health. While most nations likely store incident data in a machine-readable format, governments can be hesitant to share data openly for a variety of reasons that include technical, political, economic, and motivational issues [2].
A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language or graphical charts.
The structure and format of PDF reports change often; this adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and it is common to find typos and clerical errors.
LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and make it uniform and readily accessible.

Methods
A survey of the national notifiable disease reporting systems is periodically conducted, noting how the data are reported and in what formats. We determined the minimal metadata that is required to contextualize incident counts properly, as well as optional metadata that is commonly found.
The development of software to regularly ingest notifiable disease data and make it available involves three or four main steps: scraping, detecting, parsing, and persisting.
Scraping: we examine website design and determine the reporting mechanisms for each country/website, as well as what varies across the reporting mechanisms. We then design and write code to automate the downloading of data for each country. We store all artifacts presented as files (PDF, XLSX, etc.) in their original form, along with appropriate metadata for parsing and data provenance.
Detecting: this step is required when parsing structured non-machine-readable data, such as tabular data in PDF files. We combine the Nurminen methodology of PDF table detection with in-house heuristics to find the desired data within PDF reports [3].
Parsing: we determine what to extract from each dataset and parse these data into uniform data structures, correctly accommodating the variations in metadata (e.g., time interval definitions) and the various human languages.
Persisting: we store the data in the Epi Archive database and make it available on the internet and through the BSVE. The data are persisted into a structured and normalized SQL database.

Results
Epi Archive currently contains national and/or subnational notifiable disease data from thirty-nine nations. When users access the Epi Archive site, they are able to peruse, chart, and download data by country, subregion, disease, and time interval. Access to a cached version of the original artifacts (e.g., PDF files), a link to the source, and additional metadata is also available through the user interface. Finally, to ensure machine-readability, the data from Epi Archive can be reached through a REST API: http://epiarchive.bsvgateway.org/

Conclusions
LANL, as part of a currently funded DTRA effort, is automatically and continually collecting global notifiable disease data. While thirty-nine nations are in production, more are being brought online in the near future. These data are already being utilized and have many applications, including improving the prediction and early warning of disease events.

References
[1] WHO International Health Regulations, 3rd edition. http://apps.who.int/iris/bitstream/10665/246107/1/9789241580496-eng.pdf
[2] van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14:1144. doi:10.1186/1471-2458-14-1144
[3] Nurminen, Anssi. "Algorithmic extraction of data in tables in PDF documents." (2013).
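Programmatic access along the lines described in the abstract might look like the following sketch. The base URL is the one given in the text, but the /api/timeseries path, the query parameter names, and the returned CSV columns are assumptions made for illustration; consult the service itself for the actual REST interface:

```python
import csv
import io
from urllib.parse import urlencode

BASE = "http://epiarchive.bsvgateway.org"  # site URL given in the abstract

def build_query(country, disease, start, end):
    """Build a time-series query URL. The /api/timeseries path and the
    parameter names are hypothetical, for illustration only."""
    params = urlencode({"country": country, "disease": disease,
                        "start": start, "end": end})
    return f"{BASE}/api/timeseries?{params}"

def parse_timeseries_csv(text):
    """Parse a downloaded CSV (assumed columns: date, count) into pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["date"], int(row["count"])) for row in reader]
```

Fetching the built URL (for example with `urllib.request.urlopen`) and feeding the response body to `parse_timeseries_csv` would yield machine-readable (date, count) pairs ready for analysis.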


2020 ◽  
Author(s):  
Ruoyan Sun ◽  
Henna Budhwani

BACKGROUND Though public health systems are responding rapidly to the COVID-19 pandemic, outcomes from publicly available, crowd-sourced big data may help identify hot spots and prioritize equipment allocation and staffing, while also informing health policy related to "shelter in place" and social distancing recommendations. OBJECTIVE To assess whether the rising state-level prevalence of COVID-19-related posts on Twitter (tweets) is predictive of state-level cumulative COVID-19 incidence after controlling for socio-economic characteristics. METHODS We extracted COVID-19-related tweets from January 21st to March 7th, 2020, across all 50 states (N = 7,427,057). Tweets were combined with state-level characteristics and confirmed COVID-19 cases to determine the association between public commentary and cumulative incidence. RESULTS The cumulative incidence of COVID-19 cases varied significantly across states. Ratio of tweet increase (p=0.03), number of physicians per 1,000 population (p=0.01), educational attainment (p=0.006), income per capita (p=0.002), and percentage of adult population (p=0.003) were positively associated with cumulative incidence. Ratio of tweet increase was associated with the logarithm of cumulative incidence (p=0.06) with a coefficient of 0.26. CONCLUSIONS An increase in the prevalence of state-level tweets was predictive of an increase in COVID-19 diagnoses, providing evidence that Twitter can be a valuable surveillance tool for public health.
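The reported log-linear association (a coefficient of 0.26 on the logarithm of cumulative incidence) corresponds to regressing log incidence on the tweet-increase ratio. A minimal sketch on synthetic data follows; the study itself controlled for additional state-level covariates, which this two-variable version omits:

```python
import math

def ols_slope(x, y):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def log_linear_fit(tweet_ratio, cum_incidence):
    """Regress log(cumulative incidence) on the tweet-increase ratio.

    A slope of b means each unit increase in the ratio multiplies
    expected cumulative incidence by exp(b)."""
    return ols_slope(tweet_ratio, [math.log(c) for c in cum_incidence])
```

On synthetic data generated with a true coefficient of 0.26, the fit recovers that slope exactly; on real state-level data the residual scatter and the omitted covariates would of course matter.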


Energies ◽  
2020 ◽  
Vol 14 (1) ◽  
pp. 141
Author(s):  
Jacob Hale ◽  
Suzanna Long

Energy portfolios are overwhelmingly dependent on fossil fuel resources that perpetuate the consequences associated with climate change. Therefore, it is imperative to transition to more renewable alternatives to limit further harm to the environment. This study presents a univariate time series prediction model that evaluates sustainability outcomes of partial energy transitions. Future electricity generation at the state level is predicted using exponential smoothing and autoregressive integrated moving average (ARIMA) models. The best prediction results are then used as an input for a sustainability assessment of a proposed transition by calculating carbon, water, land, and cost footprints. Missouri, USA, was selected as a model testbed due to its dependence on coal. Of the time series methods, ARIMA exhibited the best performance and was used to predict annual electricity generation over a 10-year period. The proposed transition consisted of a one-percent annual decrease in coal's portfolio share, replaced with an equal share of solar and wind supply. The sustainability outcomes of the transition demonstrate decreases in carbon and water footprints but increases in land and cost footprints. Decision makers can use the results presented here to better inform strategic provisioning of critical resources in the context of proposed energy transitions.
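Of the two forecasting methods compared above, exponential smoothing is the simpler; a minimal sketch of simple exponential smoothing with a flat h-step-ahead forecast follows. The smoothing weight is an illustrative choice, and the study ultimately favored ARIMA after comparing performance:

```python
def ses_forecast(series, alpha=0.5, horizon=10):
    """Simple exponential smoothing: level = alpha*y + (1-alpha)*level,
    updated through the history; the h-step-ahead forecast is flat at
    the final level. alpha is an illustrative default, not fitted."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon
```

In practice alpha would be chosen by minimizing in-sample forecast error, and a trend or damped-trend variant (or ARIMA, as in the study) would be preferred when generation drifts over the 10-year horizon.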


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Margaret M. Padek ◽  
Stephanie Mazzucca ◽  
Peg Allen ◽  
Emily Rodriguez Weno ◽  
Edward Tsai ◽  
...  

Abstract Background Much of the disease burden in the United States is preventable through application of existing knowledge. State-level public health practitioners are in ideal positions to affect programs and policies related to chronic disease, but the extent to which mis-implementation occurs with these programs is largely unknown. Mis-implementation refers to ending effective programs and policies prematurely or continuing ineffective ones. Methods A comprehensive 2018 survey assessing the extent of mis-implementation and multi-level influences on mis-implementation was completed by state health departments (SHDs). Questions were developed from previous literature. Surveys were emailed to randomly selected SHD employees across the United States. Spearman's correlation and multinomial logistic regression were used to assess factors in mis-implementation. Results Half (50.7%) of respondents were chronic disease program managers or unit directors. Forty-nine percent reported that their SHD sometimes, often, or always continued ineffective programs. Over 50% also reported that their SHD sometimes or often ended effective programs. The data suggest the strongest correlates and predictors of mis-implementation were at the organizational level. For example, the number of organizational layers impeding decision-making was significant for both continuing ineffective programs (OR=4.70; 95% CI=2.20, 10.04) and ending effective programs (OR=3.23; 95% CI=1.61, 7.40). Conclusion The data suggest that changing certain agency practices may help minimize the occurrence of mis-implementation. Further research should focus on adding context to these issues and helping agencies engage in appropriate decision-making. Greater attention to mis-implementation should lead to greater use of effective interventions and more efficient expenditure of resources, ultimately improving health outcomes.
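Odds ratios like those reported above (e.g., OR=4.70) are obtained by exponentiating multinomial logistic regression coefficients; the Wald confidence interval exponentiates the coefficient plus or minus 1.96 standard errors. A minimal sketch with synthetic inputs:

```python
import math

def odds_ratio(beta, se=None, z=1.96):
    """Convert a logistic regression coefficient to an odds ratio,
    optionally with a Wald 95% confidence interval. Inputs here are
    synthetic; the study's coefficients are not reproduced."""
    point = math.exp(beta)
    if se is None:
        return point
    return point, (math.exp(beta - z * se), math.exp(beta + z * se))
```

A coefficient of 0 maps to OR=1 (no association), and the asymmetry of the reported intervals (e.g., 2.20 to 10.04 around 4.70) reflects exactly this exponentiation of a symmetric interval on the log-odds scale.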

