An Evaluation of In-house versus Out-sourced Data Capture at the Meise Botanic Garden (BR)

2018 ◽  
Vol 2 ◽  
pp. e26514 ◽  
Author(s):  
Henry Engledow ◽  
Sofie De Smedt ◽  
Ann Bogaerts ◽  
Quentin Groom

There are many ways to capture data from herbarium specimen labels. Here we compare the results of in-house versus out-sourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same. In 2014 Meise Botanic Garden (BR) embarked on a mass digitization project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian Herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was out-sourced to a commercial company. The minimal data comprised the fields: specimen’s herbarium location, barcode, filing name, family, collector, collector number, country code and phytoregion (for the Democratic Republic of Congo, Rwanda & Burundi). The out-sourced data capture consisted of three types: additional label information for central African specimens having minimal data; complete data for the remaining African specimens; and species filing name information for African and Belgian specimens without minimal data. As part of the preparation for out-sourcing, a strict protocol had to be established setting the criteria for acceptable data quality levels. The creation of several lookup tables for data entry was also necessary to improve data quality. During the start-up phase all the data were checked, feedback given, compromises made and the protocol amended. After this phase, an agreed-upon subsample was quality controlled. If the error score exceeded the agreed level, the batch was returned for retyping. The data underwent three quality control checks during the process: by the data capturers, by the contractor’s project managers and by ourselves. Data quality was analysed and compared between the in-house and out-sourced modes of data capture. The error rates of our staff and the external company were comparable. The types of error that occurred were often linked to the specific field in question. 
These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error < 1%). By improving the workflow and field definitions, a notable improvement could be made in the “data cleaning” phase. The initial motivation for capturing some data in-house was financial. However, after analysis, this may not have been the most cost-effective approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
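The batch acceptance rule described above (quality-control a subsample of each batch and return the batch for retyping if the error score exceeds the agreed level) can be sketched as follows. The field name, subsample size and the <1% threshold are illustrative assumptions, not the project's actual protocol.

```python
import random

# Hypothetical acceptance threshold, modelled on the "error < 1%" level.
ERROR_THRESHOLD = 0.01

def check_batch(records, subsample_size, is_erroneous, threshold=ERROR_THRESHOLD):
    """Quality-control a transcription batch by scoring a random subsample.

    Returns True if the batch is accepted, False if it should be
    returned for retyping (subsample error rate exceeds the threshold).
    """
    sample = random.sample(records, min(subsample_size, len(records)))
    errors = sum(1 for record in sample if is_erroneous(record))
    return errors / len(sample) <= threshold

# Toy example: a record counts as erroneous if its barcode field is empty.
batch = [{"barcode": "BR000123"}, {"barcode": ""}, {"barcode": "BR000124"}]
accepted = check_batch(batch, subsample_size=3,
                       is_erroneous=lambda r: not r["barcode"])
# one error in three sampled records (33%) -> batch rejected
```

In practice the error predicate would encode the protocol's per-field rules, and the subsample size would be negotiated with the contractor.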

2018 ◽  
Vol 2 ◽  
pp. e25912
Author(s):  
Henry Engledow ◽  
Sofie De Smedt ◽  
Quentin Groom ◽  
Ann Bogaerts ◽  
Piet Stoffelen ◽  
...  

Mass digitization is a large undertaking for a collection. It is disruptive of routine and can challenge long-held practices. Having been through the procedure and survived, we feel we have a lot of experience to share with other institutions that are considering taking on this challenge. The changes that digitization has made to our institution are positive and the digitization has been a success, but that is not to say that we would not have done some things differently, were we to repeat the exercise. In 2015 Meise Botanic Garden received a grant from the Flemish Government to upgrade its digitization infrastructure and mass digitize 1.2 million specimens from its African and Belgian Herbaria. The new infrastructure improved our workflow significantly, enabling us to digitize specimens five to ten times faster while also improving their quality. The mass digitization part of the project was split into two parts: imaging and transcription. The contract was awarded and out-sourced to Picturae, who started imaging in May 2016 using a conveyor belt installation. Prior to starting, a significant amount of preparation was required at the herbarium. Within one year, 1.2 million specimens were imaged. The images were captured as TIFF files and stored in triplicate at the Flemish Institute for Archiving (VIAA), while smaller derived JPEG 2000 and JPEG files were generated for day-to-day use. The second part of the project was label transcription. A third of the specimens were transcribed in-house to capture minimal data (barcode, filing name, collector, collector number & country of origin). This was partly done to reduce costs, but also allowed us to compare in-house to out-sourced transcription. Some 500,000 specimens were transcribed, either completely or partially, by Alembo (subcontracted by Picturae). The remaining 200,000 specimens from our Belgian Herbarium are being transcribed using crowdsourcing. 
The latter is being realized through the citizen science platform DoeDat (www.doedat.be), which was launched in November 2017. Many lessons have been learnt with respect to implementing mass digitization, both practically and sociologically. Many of the problems encountered during the project could have been avoided by changing the workflow. The addition of extra control points during the process could have reduced problems encountered later in the data capture process; solving these problems at a later stage was time consuming. Trying to “save money” can result in a disruptive workflow, which may lead to a number of costly errors. Mass digitization has fundamentally changed the workflow in our collections and the way in which our herbarium is managed. All images for the African and Belgian collections may now be found on our new virtual herbarium www.botanicalcollections.be.


Author(s):  
Sofie De Smedt ◽  
Ann Bogaerts ◽  
Henry Engledow ◽  
Quentin Groom

The Herbarium of Meise Botanic Garden is among the top 15 herbaria worldwide. The collection comprises some four million specimens, which are important for scientific research. Digitisation of specimens includes imaging, transcription of label information, linking data and making the results publicly accessible online. In addition to facilitating researchers’ access to specimens, digitisation also brings new possibilities for analysis and discovery of new data, such as the vast amount of information on handwritten labels. In the DOE! project (Digitale Ontsluiting Erfgoedcollecties, “digital unlocking of heritage collections”), funded by the Flemish Government, 1.2 million herbarium sheets from the African and Belgian collections were digitised. We have received additional funding to digitise a further 1.4 million specimens from the remaining vascular plant and macro-algae collections by October 2021. These include the historic collections of Von Martius and Van Heurck. Carl Friedrich Philipp von Martius (1794–1868) was a pioneering explorer whose expeditions led to the discovery of many species. He amassed over 300,000 specimens, some of which were used to compile the first Flora of Brazil. Henri Van Heurck (1838–1909) also gathered herbarium specimens from all over the world, including a specimen originally from the collection of Linnaeus. Despite this being our second mass digitisation project, there are significant differences in our approach, partly due to lessons learned from the first project and partly due to the nature of the collections themselves. The differences in the tendering process, specimen preparation, workflow and data capture will be explained. Making these specimens openly available online through www.botanicalcollections.be is valuable to scientific research as well as a valorisation of our collections. Currently, the site attracts 7,000 users a year, accounting for 15,000 sessions a year with an average session length of more than 8 minutes. 
This means that people are actively using our website and these numbers can be expected to grow as we add more specimens and functionality.


Author(s):  
Sally King ◽  
Juliette Pinon ◽  
Robyn Drinkwater

Digitisation of specimens at the Royal Botanic Garden Edinburgh (RBGE) has created nearly half a million imaged specimens. With data entry from the specimen labels on herbarium sheets identified as the rate-limiting step in the digitisation workflow, the majority of specimens are databased with minimal data (filing name and geographical region), leaving a need to add further label data (collector, collecting locality, collection date etc.) to make the specimens research ready. We are exploring a number of different ways to complete data entry for specimens that have been imaged. These have included Optical Character Recognition (OCR), to identify meaningful specimen groupings to increase the speed of data entry, and more recently citizen science platforms to provide accurate crowd-sourced transcriptions of specimen label data. We sent specimen images of the Australian flowering plants held at the RBGE herbarium to DigiVol (https://volunteer.ala.org.au/institution/index/21309224), the citizen science platform developed alongside the Atlas of Living Australia. In 29 expeditions, 156 citizen scientists completed collection label data entry for RBGE’s 41,000 specimens of Australian flowering plants. We found that 95% of the transcriptions were completed by fewer than a third (27%) of the volunteers. Of the four volunteer experience levels in DigiVol, we found that the middle two, Collection Managers and Scientists, transcribed fewer specimens but also made fewer mistakes. We found that removing the filing name from the information provided with the expedition decreased the number of errors in the Museum Details section of the transcription, as the filing name was often entered as the label name regardless of whether it actually matched the label. The feedback we provided for each expedition was used to highlight common errors to try to reduce their occurrence, as well as to inform the volunteers of what their transcriptions had revealed about this part of the collection. 
We explore the citizen science transcription workflow, its rate-limiting steps and how we have worked to include the citizen science and OCR data on our online herbarium catalogue.
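The skew reported above (95% of transcriptions completed by 27% of volunteers) is a cumulative-share calculation over per-volunteer transcription counts. A minimal sketch, using made-up counts rather than DigiVol data:

```python
def contributor_share(counts, coverage=0.95):
    """Smallest fraction of volunteers whose transcriptions together
    account for at least `coverage` of all transcriptions."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for i, count in enumerate(ordered, start=1):
        running += count
        if running >= coverage * total:
            return i / len(ordered)
    return 1.0

# Hypothetical per-volunteer transcription counts.
counts = [50, 30, 15, 2, 1, 1, 1]
share = contributor_share(counts)
print(f"{share:.0%} of volunteers produced 95% of transcriptions")
```

This kind of summary makes it easy to track, expedition by expedition, whether the workload is spreading out or concentrating further.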


Author(s):  
Natacha Frachon ◽  
Martin Gardner ◽  
David Rae

Botanic gardens, with their large holdings of living plants collected from around the world, are important guardians of plant biodiversity, but acquiring and curating these genetic resources is enormously expensive. For these reasons it is crucial that botanic gardens document and curate their collections in order to gain the greatest benefit from the plants in their care. Great priority is given to making detailed field notes, and the process of documentation is often continued during the plants' formative years when they are being propagated. However, for the large majority of plants this process often stops once the material is planted in its final garden location. The Data Capture Project at the Royal Botanic Garden Edinburgh is an attempt to document specific aspects of the plant collections so that the information captured can be of use to the research community even after the plants have died.


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Lisa Lindner ◽  
Anja Weiß ◽  
Andreas Reich ◽  
Siegfried Kindler ◽  
Frank Behrens ◽  
...  

Abstract Background Clinical data collection requires correct and complete data sets in order to perform correct statistical analyses and draw valid conclusions. While in randomized clinical trials much effort concentrates on data monitoring, this is rarely the case in observational studies, due to high numbers of cases and often restricted resources. We have developed a valid and cost-effective monitoring tool, which can substantially contribute to increased data quality in observational research. Methods An automated digital monitoring system for cohort studies developed by the German Rheumatism Research Centre (DRFZ) was tested within the disease register RABBIT-SpA, a longitudinal observational study including patients with axial spondyloarthritis and psoriatic arthritis. Physicians and patients complete electronic case report forms (eCRFs) twice a year for up to 10 years. Automatic plausibility checks were implemented to verify all data after entry into the eCRF. To identify conflicts that cannot be found by this approach, all possible conflicts were compiled into a catalog. This “conflict catalog” was used to create queries, which are displayed as part of the eCRF. The proportions of queried eCRFs and of responses were analyzed by descriptive methods. For the analysis of responses, the type of conflict was assigned either to a single conflict only (affecting individual items) or to a conflict that required the entire eCRF to be queried. Results Data from 1883 patients were analyzed. A total of n = 3145 eCRFs submitted between baseline (T0) and T3 (12 months) had conflicts (40–64%). Fifty-six to 100% of the queries regarding eCRFs that were completely missing were answered. A mean of 1.4 to 2.4 single conflicts occurred per eCRF, of which 59–69% were answered. The most common missing values were CRP, ESR, Schober’s test, data on systemic glucocorticoid therapy, and presence of enthesitis. 
Conclusion Providing high data quality in large observational cohort studies is a major challenge, which requires careful monitoring. An automated monitoring process was successfully implemented and well accepted by the study centers. Two thirds of the queries were answered with new data. While conventional manual monitoring is resource-intensive and may itself create new sources of errors, automated processes are a convenient way to augment data quality.
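The plausibility-check stage described in the Methods can be illustrated with a minimal sketch. The field names, value ranges and query texts below are hypothetical stand-ins, not the actual RABBIT-SpA conflict catalog.

```python
# Each rule: (eCRF field, plausibility predicate, query text shown to the site).
RULES = [
    ("crp", lambda v: v is not None and 0 <= v <= 500,
     "CRP missing or outside plausible range (0-500 mg/l)"),
    ("schober_cm", lambda v: v is not None and 0 <= v <= 15,
     "Schober's test missing or outside plausible range (0-15 cm)"),
]

def check_ecrf(ecrf):
    """Run every plausibility rule against one eCRF (a dict of field
    values) and return the queries to display alongside the form."""
    return [{"field": field, "query": message}
            for field, plausible, message in RULES
            if not plausible(ecrf.get(field))]

queries = check_ecrf({"crp": 12.0})  # Schober's value is missing -> one query
```

The same structure extends naturally to cross-field conflicts (e.g. a therapy end date before its start date) by letting a rule inspect the whole eCRF rather than one field.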


2015 ◽  
Vol 44 (3) ◽  
pp. 7-12 ◽  
Author(s):  
Elia Vecellio ◽  
Michael W. Maley ◽  
George Toouli ◽  
Andrew Georgiou ◽  
Johanna Westbrook

2021 ◽  
pp. 442-449
Author(s):  
Nichole A. Martin ◽  
Elizabeth S. Harlos ◽  
Kathryn D. Cook ◽  
Jennifer M. O'Connor ◽  
Andrew Dodge ◽  
...  

PURPOSE New technology might pose problems for older patients with cancer. This study sought to understand how a trial in older patients with cancer (Alliance A171603) was successful in capturing electronic patient-reported data. METHODS Study personnel were invited via e-mail to participate in semistructured phone interviews, which were audio-recorded and qualitatively analyzed. RESULTS Twenty-four study personnel from the 10 sites were interviewed; three themes emerged. The first was that successful patient-reported electronic data capture shifted work toward patients and toward study personnel at the beginning of the study. One interviewee explained, “I mean it kind of lost all advantages…by being extremely laborious.” Study personnel described how they ensured electronic devices were charged, wireless internet access was up and running, and login codes were available. The second theme was related to the first and dealt with data filtering. Study personnel described high involvement in data gathering; for example, one interviewee described, “I answered on the iPad, whatever they said. They didn't even want to use it at all.” A third theme dealt with advantages of electronic data entry, such as prompt data availability at study completion. Surprisingly, some remarks described how electronic devices brought people together, “Some of the patients, you know, it just gave them a chance to kinda talk about, you know, what was going on.” CONCLUSION High rates of capture of patient-reported electronic data were viewed favorably but occurred in exchange for increased effort from patients and study personnel and in exchange for data that were not always patient-reported in the strictest sense.


Data quality is a central issue in information management. Data quality problems can occur anywhere in an information system and are addressed by Data Cleaning (DC), a process that detects inaccurate, incomplete or unreasonable data and then improves quality by correcting the detected errors and omissions. Various DC processes have been discussed in previous studies, but there is no standard or formalized DC process. Domain Driven Data Mining (DDDM) is one of the KDD methodologies often used for this purpose. This paper reviews and emphasizes the importance of DC in data preparation, and also highlights directions for future work.
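As a concrete illustration of the DC steps just described (detect inaccurate, incomplete or unreasonable data, then correct what is correctable and set the rest aside), a minimal sketch over record dictionaries; the fields and rules are invented for the example:

```python
def clean_records(records, valid_countries):
    """Split records into cleaned and rejected lists.

    Correctable errors are fixed in place; incomplete or
    unreasonable records are set aside for manual review.
    """
    cleaned, rejected = [], []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's data
        # Correctable formatting error: stray whitespace in a name field.
        rec["collector"] = rec.get("collector", "").strip()
        # Incomplete: no identifier means the record cannot be linked.
        if not rec.get("barcode"):
            rejected.append(rec)
            continue
        # Unreasonable: value outside the controlled vocabulary.
        if rec.get("country") not in valid_countries:
            rejected.append(rec)
            continue
        cleaned.append(rec)
    return cleaned, rejected
```

Run on two records, one with a padded collector name (corrected and kept) and one without a barcode (rejected), the function returns one cleaned and one rejected record.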


Author(s):  
Gunnar Ovstebo

Spores sourced from historic herbarium specimens have been used to introduce wild-collected material to the Royal Botanic Garden Edinburgh (RBGE) living plant collection. The ability of dry-habitat ferns to maintain spore viability for prolonged periods makes it possible to grow plants from the historically important RBGE herbarium collections. The factors that affect the ability of spores from herbarium collections to germinate are described. Three fern species from the Pteridaceae – Actiniopteris semiflabellata, Anogramma leptophylla and Aleuritopteris scioana – which were not previously in cultivation at RBGE, were germinated from herbarium material of different ages. Germination was observed in all three species. Plants produced in this experiment were accessioned into the RBGE living plant collection for future horticultural research and germination trials.

