Extracting Data from Legacy Taxonomic Literature: Applications for planning field work

Author(s):  
Francisco Andres Rivera-Quiroz ◽  
Jeremy Miller

Traditional taxonomic publications have served as a biological data repository, accumulating vast amounts of data on species diversity, geographical and temporal distributions, ecological interactions, and taxonomic relationships, among many other types of information. However, the fragmented nature of taxonomic literature has made these data difficult to access and use to their full potential. Current anthropogenic impact on biodiversity demands not only faster knowledge generation but also better use of what we already have; this could help us make better-informed decisions about conservation and resource management. In recent years, several efforts have been made to mobilize taxonomic literature and make it more accessible. These include online publications, open access journals, the digitization of old paper literature, and improved availability through specialized online repositories such as the Biodiversity Heritage Library (BHL) and the World Spider Catalog (WSC), among others. Although easy to share, PDF publications still have most of their biodiversity data embedded in strings of text, making them less dynamic and difficult or impossible to read and analyze without a human interpreter. Recently developed tools such as GoldenGATE-Imagine (GGI) allow transforming PDFs into XML files that extract and categorize taxonomically relevant data. These data can then be aggregated in databases such as Plazi TreatmentBank, where they can be re-explored, queried, and analyzed. Here we combined several of these cybertaxonomic tools to test the data extraction process for one potential application: the design and planning of an expedition to collect fresh material in the field. We targeted the ground spider Teutamus politus and other related species from the Teutamus group (TG) (Araneae; Liocranidae). These spiders are known from South East Asia and have been cataloged in the family Liocranidae; however, their relationships, biology, and evolution are still poorly understood. We marked up 56 publications that contained taxonomic treatments with specimen records for the Liocranidae. Of these publications, 20 contained information on members of the TG. Geographical distributions and occurrences of 90 TG species were analyzed based on 1,309 specimen records. These data were used to design our field collection in a way that allowed us to optimize the collection of adult specimens of our target taxa. The TG genera were most common in Indonesia, Thailand, and Malaysia. Of these, Thailand was the second richest but had the most records of T. politus. The seasonal distribution of TG specimens in Thailand suggested June and July as the best time for collecting adults. Based on these analyses, we decided to sample from mid-July to mid-August 2018 in the three Thai provinces that together held the most records of TG species and T. politus. Relying on the results of our literature analyses and using standard collection methods for ground spiders, we captured at least one specimen of every TG genus reported for Thailand. Our one-month expedition captured 231 TG spiders; of these, T. politus was the most abundant species, with 188 specimens (95 adults). By comparison, a total of 196 specimens of the TG and 66 of T. politus had been reported for the same provinces in the last 40 years. Our sampling greatly increased the number of available specimens, especially for the genera Teutamus and Oedignatha. We also extended the known distribution of Oedignatha and Sesieutes within Thailand.
These results illustrate the relevance of making the biodiversity data contained within taxonomic treatments accessible and reusable. They also exemplify one potential use of taxonomic legacy data: using existing biodiversity data more efficiently to fill knowledge gaps. A similar approach can be used to study neglected or interesting taxa and geographic areas, generating better biodiversity documentation that could aid decision making, management, and conservation.
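As a concrete illustration of this kind of literature-derived planning, the sketch below tallies occurrence records by month to suggest a collecting window. It queries the public GBIF occurrence API as an illustrative stand-in; the authors' actual records came from marked-up treatments aggregated in Plazi TreatmentBank, which this sketch does not reproduce.

```python
# Minimal sketch (not the authors' pipeline): pull occurrence records for
# a species from the public GBIF API and tally them by month to suggest
# a collecting window. Endpoint and fields are standard GBIF v1.
import requests
from collections import Counter

def monthly_occurrences(species, limit=300):
    """Count occurrence records per month for a species name."""
    resp = requests.get(
        "https://api.gbif.org/v1/occurrence/search",
        params={"scientificName": species, "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    records = resp.json()["results"]
    # 'month' is part of the GBIF occurrence schema; skip records without it.
    return Counter(r["month"] for r in records if r.get("month"))

if __name__ == "__main__":
    counts = monthly_occurrences("Teutamus politus")
    for month in sorted(counts):
        print(f"month {month:2d}: {counts[month]} records")
```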

Pharmaceutics ◽  
2021 ◽  
Vol 13 (3) ◽  
pp. 358 ◽  
Author(s):  
Chiara R. M. Brambilla ◽  
Ogochukwu Lilian Okafor-Muo ◽  
Hany Hassanin ◽  
Amr ElShaer

Three-dimensional (3D) printing is a recent technology that makes it possible to manufacture personalised dosage forms, and it has a broad range of applications. One of the most developed applications is the manufacture of oral solid dosage forms, and the four 3DP techniques most used for this purpose are FDM, inkjet 3DP, SLA and SLS. This systematic review was carried out to statistically analyse the current 3DP techniques employed in manufacturing oral solid formulations and to assess the recent trends of this new technology. The work was organised into four steps: (1) screening of the articles, definition of the inclusion and exclusion criteria, and classification of the articles into the two main groups (included/excluded); (2) quantification and characterisation of the included articles; (3) evaluation of the validity of data and the data extraction process; (4) data analysis, discussion, and conclusions to define which technique offers the best properties for the manufacture of oral solid formulations. It was observed that with the SLS 3DP technique, all the characterisation tests required by the BP (drug content, drug dissolution profile, hardness, friability, disintegration time and uniformity of weight) were performed in the majority of articles, except for the friability test. However, it is not possible to define which of the four 3DP techniques is the most suitable for the manufacture of oral solid formulations, because the selection is affected by different parameters, such as the type of formulation and the physical-mechanical properties to be achieved. Moreover, each technique has its specific advantages and disadvantages: for FDM, the biggest challenge is degradation of the drug due to the high printing temperatures, while for SLA it is the toxicity and carcinogenic risk of the photopolymerising material.


2018 ◽  
Author(s):  
Hamid Bagher ◽  
Usha Muppiral ◽  
Andrew J Severin ◽  
Hridesh Rajan

Abstract. Background: Creating a computational infrastructure to analyze the wealth of information contained in data repositories in a way that scales well is difficult, due to significant barriers in organizing, extracting, and analyzing relevant data. Shared Data Science Infrastructures like Boa can be used to more efficiently process and parse data contained in large data repositories. The main features of Boa are inspired by existing languages for data-intensive computing, and it can easily integrate data from biological data repositories. Results: Here, we present an implementation of Boa for Genomic research (BoaG) on a relatively small data repository: RefSeq's 97,716 annotation (GFF) and assembly (FASTA) files and metadata. We used BoaG to query the entire RefSeq dataset, gain insight into the RefSeq genome assemblies and gene model annotations, and show that assembly quality using the same assembler varies depending on species. Conclusions: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure BoaG can give researchers greater access to efficiently explore data in ways previously not possible for anyone but the most well-funded research groups. We demonstrate the efficiency of BoaG in exploring the RefSeq database of genome assemblies and annotations to identify interesting features of gene annotation, as a proof of concept for much larger datasets.
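The abstract does not name the assembly-quality metrics BoaG computes; as one plausible proxy, the sketch below calculates the standard N50 statistic from contig lengths in a FASTA file, purely for illustration.

```python
# Minimal sketch of one common assembly-quality proxy, the N50 statistic,
# computed from contig lengths in a plain-text FASTA file. N50 is used
# here only as an illustration; the paper does not specify its metrics.
def n50(lengths):
    """Smallest contig length such that contigs at least that long
    cover half of the total assembly size."""
    lengths = sorted(lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0

def contig_lengths(fasta_path):
    """Yield sequence lengths from a FASTA file."""
    length = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                if length:
                    yield length
                length = 0
            else:
                length += len(line.strip())
    if length:
        yield length

# Example usage (hypothetical file name):
# print(n50(list(contig_lengths("assembly.fna"))))
```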


Author(s):  
Dong Xu ◽  
Zhuchou Lu ◽  
Kangming Jin ◽  
Wenmin Qiu ◽  
Guirong Qiao ◽  
...  

Abstract: Efficiently extracting information from biological big data can be a huge challenge, especially for people who lack programming skills. We developed Sequence Processing and Data Extraction (SPDE) as an integrated tool for sequence processing and data extraction for gene family and omics analyses. Currently, SPDE has seven modules comprising 100 basic functions that range from single-gene processing (e.g., translation, reverse complement, and primer design) to genome information extraction. All SPDE functions can be used without the need for programming or command lines, and the SPDE interface provides enough prompt information to help users run SPDE without barriers. In addition to its own functions, SPDE also incorporates publicly available analysis tools (such as NCBI BLAST, HMMER, Primer3 and SAMtools), making SPDE a comprehensive bioinformatics platform for big biological data analysis. Availability: SPDE was built using Python and can be run on 32-bit and 64-bit Windows and macOS systems. It is open-source software that can be downloaded from https://github.com/simon19891216/ (contact: [email protected]).
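As an illustration of the kind of single-gene operations SPDE bundles, here is a minimal Python sketch of two of the listed functions, reverse complement and translation. It is not SPDE's own code, and the codon table is truncated to the codons used in the demo.

```python
# Minimal sketch of two single-gene operations of the kind SPDE offers;
# illustrative only, not SPDE's implementation.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code, abbreviated to the codons used in the demo;
# a real implementation would include all 64 codons.
CODON_TABLE = {"ATG": "M", "TTT": "F", "AAA": "K", "TAA": "*"}

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate an in-frame DNA sequence, stopping at a stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(reverse_complement("ATGAAATTT"))  # AAATTTCAT
print(translate("ATGAAATTTTAA"))        # MKF
```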


Author(s):  
Peter S. Curtis ◽  
Kerrie Mengersen ◽  
Marc J. Lajeunesse ◽  
Hannah R. Rothstein ◽  
Gavin B. Stewart

This chapter discusses the data extraction process, the meta-analysis database, and critical appraisal of data. The efficient and accurate extraction of data from primary studies is an important component of successful research reviews. It is one of the most time-consuming parts of a research review and should be approached with the goals of repeatability and transparency of results. Careful definition of the research question and identification of the effect size metric(s) to be used are prerequisites to efficient data extraction. The extraction spreadsheet may simply be appended to a growing database stored in a single spreadsheet (also known as a "flat file database") (e.g., Microsoft Excel, Lotus, Quattro Pro), but it may be advantageous to develop a relational database (e.g., using Microsoft Access, Paradox or dBase software), particularly for large or complex data. During the process of data extraction, the investigator also has an opportunity for critical appraisal of data quality. One approach to quantitative assessment of study quality has been the use of numerical scales in which points are assigned to specific elements of the study and summed to produce an overall quality score.
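To make the flat-file versus relational distinction concrete, here is a minimal sketch of a hypothetical two-table SQLite schema in which one study row can own many extracted effect sizes; the table and column names are illustrative, not from the chapter.

```python
# Minimal sketch of the relational alternative the chapter mentions:
# a hypothetical two-table SQLite schema separating studies from the
# (possibly multiple) effect sizes extracted from each study.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE studies (
        study_id INTEGER PRIMARY KEY,
        citation TEXT NOT NULL,
        quality_score REAL          -- summed points from a quality checklist
    );
    CREATE TABLE effect_sizes (
        effect_id INTEGER PRIMARY KEY,
        study_id INTEGER REFERENCES studies(study_id),
        metric TEXT,                -- e.g. log response ratio, Hedges' d
        value REAL,
        variance REAL
    );
""")
conn.execute("INSERT INTO studies VALUES (1, 'Smith et al. 2001', 7.5)")
conn.execute("INSERT INTO effect_sizes VALUES (1, 1, 'lnRR', 0.42, 0.03)")

# One study, many effect sizes: a join recovers the flat-file view.
for row in conn.execute("""
        SELECT s.citation, e.metric, e.value
        FROM effect_sizes e JOIN studies s USING (study_id)"""):
    print(row)
```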


Circulation ◽  
2015 ◽  
Vol 131 (suppl_2) ◽  
Author(s):  
Maryam Piram ◽  
Martha Darce Bello ◽  
Stéphanie Tellier ◽  
Etienne Merlin ◽  
Elise Launay ◽  
...  

Kawasaki disease (KD) is the main vasculitis affecting children under 5 years and the leading cause of acquired heart disease in children. Its epidemiology is seldom reported in France. Although IVIG is still the standard treatment, the management of patients at risk for cardiac complications may shift toward reinforced (and new) therapeutic approaches. Kawanet is a clinical and biological data repository aimed at defining the epidemiological characteristics of KD in France. Methods: Institutional physicians received information on a national registry for KD. All patients with suspected KD seen since 2011 were eligible to enter the study. An eCRF was implemented in a web database. Included patients who did not meet the AHA international criteria were reviewed by an experts' committee. Results: 468 cases were entered by 84 physicians from 65 centers. The AHA classification yielded 280 complete KD and 73 incomplete KD. An expert consensus classified 48 other patients as probable, leading to 401 patients considered as KD (M229/F72); 67 were excluded (incomplete data or doubtful). The median age at diagnosis was 3.1 years (range 2 months to 14 years). The ethnic backgrounds were: European Caucasian 67%, Eastern Caucasian/North African 15%, Afro-Caribbean 13%, Asian 4% and mixed ancestry 1%. The clinical symptoms were (%): conjunctivitis 84, cheilitis 82, diffuse exanthema 74, modification of the extremities 73, oral erythema 66, cervical adenopathy 52, raspberry tongue 49, seat erythema 26, perineal desquamation 18 and BCG erythema 5. The cardiac complications were: coronary dilatation 30%, pericarditis 15%, coronary aneurysm 4%, and myocarditis 3% (1 death). 392/401 (98%) patients received IVIG; 64 (21%) and 5 required 2 and 3 courses, respectively. The mean treatment delay was 6 days. The factors associated with coronary abnormalities were: male gender (p=0.01), young age at KD onset (p=0.03), and resistance to IVIG (p=0.03). Conclusion: KD diagnosis remains challenging, and overdiagnosis represents at least 10% of cases in this registry. Incomplete forms of KD account for 37% and are associated with coronary dilatation/aneurysm (34%; p<0.01) and a high rate of IVIG resistance. Unlike previous studies, our population is very mixed, with 28% of children from the Middle East and Africa, in whom KD is still seldom reported.


Author(s):  
Waseem Ahmed ◽  
Lisa Fan

The Physical Design (PD) Data tool is designed mainly to help ASIC design engineers achieve chip design process quality, optimization, and performance goals. The tool uses data mining techniques to handle an existing unstructured data repository: it extracts the relevant data and loads it into a well-structured database. A data archive mechanism initially creates and then keeps updating an archive repository on a daily basis. The log information provided to the PD tool is in a completely unstructured format, which is parsed by a regular-expression (regex) based data extraction methodology that converts the input data into structured tables. These undergo a data cleansing process before being fed into the operational database. The PD tool also ensures data integrity and data validity. It helps design engineers compare, correlate, and inter-relate the results of their current work with work done in the past, giving them a clear picture of the progress made and deviations that occurred. Data analysis can be done using various features offered by the tool, such as graphical and statistical representation.
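A minimal sketch of regex-based log extraction of the kind described is shown below; the log format, field names, and the wns (worst negative slack) field are hypothetical, not the tool's actual input.

```python
# Minimal sketch of regex-based extraction of the kind the PD tool
# performs; the log format here is invented for illustration.
import re

LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"block=(?P<block>\w+)\s+"
    r"stage=(?P<stage>\w+)\s+"
    r"wns=(?P<wns>-?\d+\.\d+)"      # worst negative slack (assumed field)
)

def parse_log(lines):
    """Turn unstructured log lines into dictionaries ready for DB load."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:                    # cleansing step: drop unparsable lines
            row = match.groupdict()
            row["wns"] = float(row["wns"])
            yield row

sample = ["2021-03-04 block=cpu_core stage=route wns=-0.125",
          "malformed line that is filtered out"]
for record in parse_log(sample):
    print(record)
```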


CJEM ◽  
2018 ◽  
Vol 20 (S1) ◽  
pp. S48-S49
Author(s):  
H. C. Lindsay ◽  
J. Gallaher ◽  
C. Wright ◽  
L. Korchinski ◽  
C. Kim Sing

Introduction: For patients with chest pain, the target time from first medical contact to obtaining an electrocardiogram (ECG) is 10 minutes, as reperfusion within 120 minutes can reduce the risk of death and adverse outcomes in patients with ST elevation myocardial infarction (STEMI). In 2007, Vancouver Coastal Health (VCH) began tracking key indicators, including time to first ECG. The Vancouver General Hospital (VGH) Emergency Department (ED) has had the longest door-to-ECG times in the region since 2014. In 2016, the VGH ED Quality Council developed a strategy to address this issue, with an aim of obtaining ECGs on 95% of patients presenting to the VGH ED with active chest pain within 10 minutes of presentation, within a 6-month period. Methods: The VGH ED Quality Council brought together frontline clinicians, ECG technicians, and other stakeholders and completed a process map. We obtained baseline data on the median time to ECG both in patients with STEMI and in all patients presenting with chest pain. Root cause analysis determined two main barriers: access to designated space to obtain ECGs, and the need for patients to be registered in the computer system before an ECG could be ordered. The team identified strategies to eliminate these barriers, designating a dedicated space and undergoing multiple PDSA cycles to change the workflow so that patients are streamed to this space before registration. Results: Our median times in patients with STEMI went from 33 minutes to 8 minutes as of June 2017. In all patients presenting with chest pain, we improved from a median of 36 to 17 minutes. As of April 2017, we are obtaining an ECG within 10 minutes in 27% of our patients, compared to 3% in 2016. Given the limitations of our data extraction process, we were not able to differentiate between patients with active chest pain and those whose chest pain had resolved. Conclusion: By involving frontline staff and having frontline champions provide real-time support, we were able to make significant changes to the culture at triage. We cultivated sustainability by changing the workflow and physical space rather than relying on education alone. While we have improved the times for our walk-in patients, we have not perfected the process when a patient moves immediately to a bed or presents via ambulance. Implementing small changes and incorporating feedback has allowed us to identify these new challenges early.


2013 ◽  
Vol 04 (02) ◽  
pp. 153-169 ◽  
Author(s):  
R. Gildersleeve ◽  
P. Cooper

Summary. Background: The Centers for Medicare and Medicaid Services' Readmissions Reduction Program adjusts payments to hospitals based on 30-day readmission rates for patients with acute myocardial infarction, heart failure, and pneumonia. This holds hospitals accountable for a complex phenomenon about which there is little evidence regarding effective interventions. Further study may benefit from a method for efficiently and inexpensively identifying patients at risk of readmission. Several models have been developed to assess this risk, many of which may not translate to a U.S. community hospital setting. Objective: To develop a real-time, automated tool to stratify risk of 30-day readmission at a semi-rural community hospital. Methods: A derivation cohort was created by extracting demographic and clinical variables from the data repository for adult discharges from calendar year 2010. Multivariate logistic regression identified variables that were significantly associated with 30-day hospital readmission. Those variables were incorporated into a formula to produce a Risk of Readmission Score (RRS). A validation cohort from 2011 assessed the predictive value of the RRS. A SQL stored procedure was created to calculate the RRS for any patient and publish its value, along with an estimate of readmission risk and other factors, to a secure intranet site. Results: Eleven variables were significantly associated with readmission in the multivariate analysis of each cohort. The RRS had an area under the receiver operating characteristic curve (c-statistic) of 0.74 (95% CI 0.73-0.75) in the derivation cohort and 0.70 (95% CI 0.69-0.71) in the validation cohort. Conclusion: Clinical and administrative data available in a typical community hospital database can be used to create a validated, predictive scoring system that automatically assigns a probability of 30-day readmission to hospitalized patients. This does not require manual data extraction or manipulation and uses commonly available systems. Additional study is needed to refine and confirm the findings. Citation: Gildersleeve R, Cooper P. Development of an automated, real time surveillance tool for predicting readmissions at a community hospital. Appl Clin Inf 2013; 4: 153-169. http://dx.doi.org/10.4338/ACI-2012-12-RA-0058
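A minimal sketch of the modelling approach described (multivariate logistic regression whose predicted probability serves as the risk score, evaluated by the c-statistic) follows; the data and the eleven predictors are synthetic stand-ins, not the paper's actual variables or coefficients.

```python
# Minimal sketch of a logistic-regression readmission risk score scored
# by the c-statistic; all data here are synthetic, for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic derivation cohort: rows are discharges, columns stand in for
# predictors (age, prior admissions, lab values in a real model).
X = rng.normal(size=(5000, 11))
y = (X @ rng.normal(size=11) + rng.normal(size=5000) > 1).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)
risk = model.predict_proba(X)[:, 1]   # Risk of Readmission Score analogue
print(f"c-statistic: {roc_auc_score(y, risk):.2f}")
```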


Complexity ◽  
2018 ◽  
Vol 2018 ◽  
pp. 1-12 ◽  
Author(s):  
Jingfeng Yang ◽  
Nanfeng Zhang ◽  
Ming Li ◽  
Yanwei Zheng ◽  
Li Wang ◽  
...  

Owing to continuous progress in vehicle hardware, the constraint that a vehicle cannot run a complex algorithm no longer exists. At the same time, with this progress, a number of studies have reported exponential growth in the volume of data generated during actual operation. To address the problem of large data transmissions in actual operation, a wireless transmission scheme is proposed for text information (including position information), based on the principles of maximum entropy probability and a neural network prediction model combined with an optimized Huffman encoding algorithm, covering the whole pipeline from data exchange to data extraction. The test results showed that the optimized compression and transmission algorithm effectively compressed text-type vehicle information, achieved a higher compression rate, preserved data transmission integrity, and guaranteed distortion-free decompression. This makes it possible to improve the efficiency of vehicle information transmission, ensure the integrity of information, realize vehicle monitoring and control, and grasp the traffic situation in real time.
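The Huffman step of such a pipeline is standard; below is a minimal sketch of building a Huffman code table and measuring the size reduction on a text-type vehicle message. The paper's maximum-entropy and neural-network prediction refinements are not reproduced, and the sample message format is invented.

```python
# Minimal sketch of standard Huffman coding for a text-type vehicle message.
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix-free code table from symbol frequencies."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in
            enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)                  # tiebreaker so dicts are never compared
    while len(heap) > 1:
        lo = heapq.heappop(heap)
        hi = heapq.heappop(heap)
        # Prefix '0' onto the low-frequency subtree, '1' onto the high one.
        merged = {s: "0" + c for s, c in lo[2].items()}
        merged.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
        tie += 1
    return heap[0][2]

message = "lat=23.1291,lon=113.2644,speed=42"   # invented sample message
codes = huffman_codes(message)
encoded = "".join(codes[ch] for ch in message)
print(f"{len(message) * 8} bits raw -> {len(encoded)} bits encoded")
```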


2020 ◽  
Vol 13 (1) ◽  
Author(s):  
Vít Dvořák ◽  
Nikolaos Tsirigotakis ◽  
Christoforos Pavlou ◽  
Emmanouil Dokianakis ◽  
Mohammad Akhoundi ◽  
...  

Abstract. Background: The Greek island of Crete is endemic for both visceral leishmaniasis (VL) and recently increasing cutaneous leishmaniasis (CL). This study summarizes published data on the sand fly fauna of Crete, presents the results of new sand fly samplings, and describes a new sand fly species. Methods: All published and recent samplings were carried out using CDC light traps, sticky traps or mouth aspirators. The specific status of Phlebotomus (Adlerius) creticus n. sp. was assessed by morphological analysis, cytochrome b (cytb) sequencing and MALDI-TOF protein profiling. Results: Published data revealed the presence of 10 Phlebotomus spp. and 2 Sergentomyia spp. During the present field work, 608 specimens of 8 species of Phlebotomus and one species of Sergentomyia were collected. Both the published data and the present samplings revealed that the two most common and abundant species were Phlebotomus neglectus, a proven vector of Leishmania infantum causing VL, and Ph. similis, a suspected vector of L. tropica causing CL. In addition, the field surveys revealed the presence of a new species, Ph. (Adlerius) creticus n. sp. Conclusions: The identification of the newly described species is based on both molecular and morphological criteria: distinct characters of the male genitalia that differentiate it from related species of the subgenus Adlerius, as well as a species-specific cytb sequence and protein spectra generated by MALDI-TOF mass spectrometry.

