An Evaluation of In-house versus Out-sourced Data Capture at the Meise Botanic Garden (BR)

2018 ◽  
Vol 2 ◽  
pp. e26514 ◽  
Author(s):  
Henry Engledow ◽  
Sofie De Smedt ◽  
Ann Bogaerts ◽  
Quentin Groom

There are many ways to capture data from herbarium specimen labels. Here we compare the results of in-house versus out-sourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same. In 2014 Meise Botanic Garden (BR) embarked on a mass digitization project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian Herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was out-sourced to a commercial company. The minimal data comprised the fields: specimen’s herbarium location, barcode, filing name, family, collector, collector number, country code and phytoregion (for the Democratic Republic of Congo, Rwanda & Burundi). The out-sourced data capture consisted of three types: additional label information for central African specimens having minimal data; complete data for the remaining African specimens; and species filing name information for African and Belgian specimens without minimal data. As part of the preparation for out-sourcing, a strict protocol had to be established setting the criteria for acceptable data quality levels. The creation of several lookup tables for data entry was also necessary to improve data quality. During the start-up phase all the data were checked, feedback given, compromises made and the protocol amended. After this phase, an agreed-upon subsample was quality controlled. If the error score exceeded the agreed level, the batch was returned for retyping. The data underwent three quality control checks during the process: by the data capturers, by the contractor’s project managers and by ourselves. Data quality was analysed and compared between the in-house and out-sourced modes of data capture. The error rates of our staff and the external company were comparable. The types of error that occurred were often linked to the specific field in question. 
These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error < 1%). By improving the workflow and field definitions, a notable improvement could be made in the “data cleaning” phase. The initial motivation for capturing some data in-house was financial. However, after analysis, this may not have been the most cost-effective approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
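The batch acceptance rule described above (quality-control a subsample of each batch and return the batch for retyping if the error score exceeds the agreed level) can be sketched as follows. The field name, subsample size and the <1% threshold are illustrative assumptions, not the project's actual protocol.

```python
import random

# Hypothetical acceptance threshold, modelled on the "error < 1%" level.
ERROR_THRESHOLD = 0.01

def check_batch(records, subsample_size, is_erroneous, threshold=ERROR_THRESHOLD):
    """Quality-control a transcription batch by scoring a random subsample.

    Returns True if the batch is accepted, False if it should be
    returned for retyping (subsample error rate exceeds the threshold).
    """
    sample = random.sample(records, min(subsample_size, len(records)))
    errors = sum(1 for record in sample if is_erroneous(record))
    return errors / len(sample) <= threshold

# Toy example: a record counts as erroneous if its barcode field is empty.
batch = [{"barcode": "BR000123"}, {"barcode": ""}, {"barcode": "BR000124"}]
accepted = check_batch(batch, subsample_size=3,
                       is_erroneous=lambda r: not r["barcode"])
# one error in three sampled records (33%) -> batch rejected
```

In practice the error predicate would encode the protocol's per-field rules, and the subsample size would be negotiated with the contractor.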

2018 ◽  
Vol 2 ◽  
pp. e25912
Author(s):  
Henry Engledow ◽  
Sofie De Smedt ◽  
Quentin Groom ◽  
Ann Bogaerts ◽  
Piet Stoffelen ◽  
...  

Mass digitization is a large undertaking for a collection. It is disruptive of routine and can challenge long-held practices. Having been through the procedure and survived, we feel we have a lot of experience to share with other institutions that are considering taking on this challenge. The changes that digitization has made to our institution are positive and the digitization has been a success, but that is not to say that we would not have done some things differently, were we to repeat the exercise. In 2015 Meise Botanic Garden received a grant from the Flemish Government to upgrade its digitization infrastructure and mass digitize 1.2 million specimens from its African and Belgian Herbaria. The new infrastructure improved our workflow significantly, enabling us to digitize specimens five to ten times faster while also improving their quality. The mass digitization part of the project was split into two parts: imaging and transcription. The contract was awarded and out-sourced to Picturae, who started imaging in May 2016 using a conveyor belt installation. Prior to starting, a significant amount of preparation was required at the herbarium. Within one year, 1.2 million specimens were imaged. The images were captured as TIFF files and stored in triplicate at the Flemish Institute for Archiving (VIAA), while smaller derived JPEG 2000 and JPEG files were generated for day-to-day use. The second part of the project was label transcription. A third of the specimens were transcribed in-house to capture minimal data (barcode, filing name, collector, collector number & country of origin). This was partly done to reduce costs, but also allowed us to compare in-house to out-sourced transcription. Some 500,000 specimens were transcribed, either completely or partially, by Alembo (subcontracted by Picturae). The remaining 200,000 specimens from our Belgian Herbarium are being transcribed using crowdsourcing. 
The latter is being realized through the citizen science platform DoeDat (www.doedat.be), which was launched in November 2017. Many lessons have been learnt with respect to implementing mass digitization, both practically and sociologically. Many of the problems encountered during the project could have been avoided by changing the workflow. The addition of extra control points during the process could have reduced problems encountered later in the data capture process; solving these problems at a later stage was time consuming. Trying to “save money” can result in a disruptive workflow, which may lead to a number of costly errors. Mass digitization has fundamentally changed the workflow in our collections and the way in which our herbarium is managed. All images for the African and Belgian collections may now be found on our new virtual herbarium www.botanicalcollections.be.


Author(s):  
Sofie De Smedt ◽  
Ann Bogaerts ◽  
Henry Engledow ◽  
Quentin Groom

The Herbarium of Meise Botanic Garden is among the top 15 herbaria worldwide. The collection comprises some four million specimens, which are important for scientific research. Digitisation of specimens includes imaging, transcription of label information, linking data and making the results publicly accessible online. In addition to facilitating researchers’ access to specimens, digitisation also brings new possibilities for analysis and discovery of new data, such as the vast amount of information on handwritten labels. In the DOE! project (Digitale Ontsluiting Erfgoedcollecties, “digital unlocking of heritage collections”), funded by the Flemish Government, 1.2 million herbarium sheets from the African and Belgian collections were digitised. We have received additional funding to digitise a further 1.4 million specimens from the remaining vascular plant and macro-algae collections by October 2021. These include the historic collections of Von Martius and Van Heurck. Carl Friedrich Philipp von Martius (1794–1868) was a pioneering explorer whose expeditions led to the discovery of many species. He amassed over 300,000 specimens, some of which were used to compile the first Flora of Brazil. Henri Van Heurck (1838–1909) also gathered herbarium specimens from all over the world, including a specimen originally from the collection of Linnaeus. Despite this being our second mass digitisation project, there are significant differences in our approach, partly due to lessons learned from the first project and partly due to the nature of the collections themselves. The differences in the tendering process, specimen preparation, workflow and data capture will be explained. Making these specimens openly available online through www.botanicalcollections.be is valuable to scientific research as well as a valorisation of our collections. Currently, the site attracts 7,000 users a year, accounting for 15,000 sessions a year with an average session length of more than 8 minutes. 
This means that people are actively using our website and these numbers can be expected to grow as we add more specimens and functionality.


Author(s):  
Sally King ◽  
Juliette Pinon ◽  
Robyn Drinkwater

Digitisation of specimens at the Royal Botanic Garden Edinburgh (RBGE) has created nearly half a million imaged specimens. With data entry from the specimen labels on herbarium sheets identified as the rate-limiting step in the digitisation workflow, the majority of specimens are databased with minimal data (filing name and geographical region), leaving a need to add further label data (collector, collecting locality, collection date etc.) to make the specimens research ready. We are exploring a number of different ways to complete data entry for specimens that have been imaged. These have included Optical Character Recognition (OCR), to identify meaningful specimen groupings to increase the speed of data entry, and more recently citizen science platforms to provide accurate crowd-sourced transcriptions of specimen label data. We sent specimen images of the Australian flowering plants held at the RBGE herbarium to DigiVol (https://volunteer.ala.org.au/institution/index/21309224), the citizen science platform developed alongside the Atlas of Living Australia. In 29 expeditions, 156 citizen scientists completed collection label data entry for RBGE’s 41,000 specimens of Australian flowering plants. We found that 95% of the transcriptions were completed by fewer than a third (27%) of the volunteers. Of the four volunteer experience levels in DigiVol, we found that the middle two, Collection Managers and Scientists, transcribed fewer specimens but also made fewer mistakes. We found that removing the filing name from the information provided with the expedition decreased the number of errors in the Museum Details section of the transcription, as the filing name was often entered as the label name regardless of whether it actually matched the label. The feedback we provided for each expedition was used to highlight common errors to try to reduce their occurrence, as well as to inform the volunteers of what their transcriptions had revealed about this part of the collection. 
We explore the citizen science transcription workflow, its rate-limiting steps and how we have worked to include the citizen science and OCR data on our online herbarium catalogue.
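The skew reported above (95% of transcriptions completed by 27% of volunteers) is a cumulative-share calculation over per-volunteer transcription counts. A minimal sketch, using made-up counts rather than DigiVol data:

```python
def contributor_share(counts, coverage=0.95):
    """Smallest fraction of volunteers whose transcriptions together
    account for at least `coverage` of all transcriptions."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for i, count in enumerate(ordered, start=1):
        running += count
        if running >= coverage * total:
            return i / len(ordered)
    return 1.0

# Hypothetical per-volunteer transcription counts.
counts = [50, 30, 15, 2, 1, 1, 1]
share = contributor_share(counts)
print(f"{share:.0%} of volunteers produced 95% of transcriptions")
```

This kind of summary makes it easy to track, expedition by expedition, whether the workload is spreading out or concentrating further.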


Author(s):  
Natacha Frachon ◽  
Martin Gardner ◽  
David Rae

Botanic gardens, with their large holdings of living plants collected from around the world, are important guardians of plant biodiversity, but acquiring and curating these genetic resources is enormously expensive. For these reasons it is crucial that botanic gardens document and curate their collections in order to gain the greatest benefit from the plants in their care. Great priority is given to making detailed field notes, and the process of documentation is often continued during the plants' formative years when they are being propagated. However, for the large majority of plants this process often stops once the material is planted in its final garden location. The Data Capture Project at the Royal Botanic Garden Edinburgh is an attempt to document specific aspects of the plant collections so that the information captured can be of use to the research community even after the plants have died.


2021 ◽  
Vol 23 (1) ◽  
Author(s):  
Lisa Lindner ◽  
Anja Weiß ◽  
Andreas Reich ◽  
Siegfried Kindler ◽  
Frank Behrens ◽  
...  

Abstract Background Clinical data collection requires correct and complete data sets in order to perform correct statistical analyses and draw valid conclusions. While in randomized clinical trials much effort concentrates on data monitoring, this is rarely the case in observational studies, due to high numbers of cases and often restricted resources. We have developed a valid and cost-effective monitoring tool, which can substantially contribute to increased data quality in observational research. Methods An automated digital monitoring system for cohort studies developed by the German Rheumatism Research Centre (DRFZ) was tested within the disease register RABBIT-SpA, a longitudinal observational study including patients with axial spondyloarthritis and psoriatic arthritis. Physicians and patients complete electronic case report forms (eCRFs) twice a year for up to 10 years. Automatic plausibility checks were implemented to verify all data after entry into the eCRF. To identify conflicts that cannot be found by this approach, all possible conflicts were compiled into a catalog. This “conflict catalog” was used to create queries, which are displayed as part of the eCRF. The proportions of queried eCRFs and of responses were analyzed by descriptive methods. For the analysis of responses, the type of conflict was assigned either to a single conflict only (affecting individual items) or to a conflict that required the entire eCRF to be queried. Results Data from 1883 patients were analyzed. A total of n = 3145 eCRFs submitted between baseline (T0) and T3 (12 months) had conflicts (40–64%). Fifty-six to 100% of the queries regarding eCRFs that were completely missing were answered. A mean of 1.4 to 2.4 single conflicts occurred per eCRF, of which 59–69% were answered. The most common missing values were CRP, ESR, Schober’s test, data on systemic glucocorticoid therapy, and presence of enthesitis. 
Conclusion Providing high data quality in large observational cohort studies is a major challenge, which requires careful monitoring. An automated monitoring process was successfully implemented and well accepted by the study centers. Two thirds of the queries were answered with new data. While conventional manual monitoring is resource-intensive and may itself create new sources of errors, automated processes are a convenient way to augment data quality.
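The plausibility-check stage described in the Methods can be illustrated with a minimal sketch. The field names, value ranges and query texts below are hypothetical stand-ins, not the actual RABBIT-SpA conflict catalog.

```python
# Each rule: (eCRF field, plausibility predicate, query text shown to the site).
RULES = [
    ("crp", lambda v: v is not None and 0 <= v <= 500,
     "CRP missing or outside plausible range (0-500 mg/l)"),
    ("schober_cm", lambda v: v is not None and 0 <= v <= 15,
     "Schober's test missing or outside plausible range (0-15 cm)"),
]

def check_ecrf(ecrf):
    """Run every plausibility rule against one eCRF (a dict of field
    values) and return the queries to display alongside the form."""
    return [{"field": field, "query": message}
            for field, plausible, message in RULES
            if not plausible(ecrf.get(field))]

queries = check_ecrf({"crp": 12.0})  # Schober's value is missing -> one query
```

The same structure extends naturally to cross-field conflicts (e.g. a therapy end date before its start date) by letting a rule inspect the whole eCRF rather than one field.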


2015 ◽  
Vol 44 (3) ◽  
pp. 7-12 ◽  
Author(s):  
Elia Vecellio ◽  
Michael W. Maley ◽  
George Toouli ◽  
Andrew Georgiou ◽  
Johanna Westbrook

2021 ◽  
pp. 442-449
Author(s):  
Nichole A. Martin ◽  
Elizabeth S. Harlos ◽  
Kathryn D. Cook ◽  
Jennifer M. O'Connor ◽  
Andrew Dodge ◽  
...  

PURPOSE New technology might pose problems for older patients with cancer. This study sought to understand how a trial in older patients with cancer (Alliance A171603) was successful in capturing electronic patient-reported data. METHODS Study personnel were invited via e-mail to participate in semistructured phone interviews, which were audio-recorded and qualitatively analyzed. RESULTS Twenty-four study personnel from the 10 sites were interviewed; three themes emerged. The first was that successful patient-reported electronic data capture shifted work toward patients and toward study personnel at the beginning of the study. One interviewee explained, “I mean it kind of lost all advantages…by being extremely laborious.” Study personnel described how they ensured electronic devices were charged, wireless internet access was up and running, and login codes were available. The second theme was related to the first and dealt with data filtering. Study personnel described high involvement in data gathering; for example, one interviewee described, “I answered on the iPad, whatever they said. They didn't even want to use it at all.” A third theme dealt with advantages of electronic data entry, such as prompt data availability at study completion. Surprisingly, some remarks described how electronic devices brought people together, “Some of the patients, you know, it just gave them a chance to kinda talk about, you know, what was going on.” CONCLUSION High rates of capture of patient-reported electronic data were viewed favorably but occurred in exchange for increased effort from patients and study personnel and in exchange for data that were not always patient-reported in the strictest sense.


Data quality is a central issue in information management. Data quality problems can occur anywhere in an information system and are addressed by Data Cleaning (DC), a process that detects inaccurate, incomplete or unreasonable data and then improves quality by correcting the detected errors and omissions. Various DC processes have been discussed in previous studies, but there is no standard or formalized DC process. Domain Driven Data Mining (DDDM) is one of the KDD methodologies often used for this purpose. This paper reviews and emphasizes the importance of DC in data preparation, and also highlights directions for future work.
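As a concrete illustration of the DC steps just described (detect inaccurate, incomplete or unreasonable data, then correct what is correctable and set the rest aside), a minimal sketch over record dictionaries; the fields and rules are invented for the example:

```python
def clean_records(records, valid_countries):
    """Split records into cleaned and rejected lists.

    Correctable errors are fixed in place; incomplete or
    unreasonable records are set aside for manual review.
    """
    cleaned, rejected = [], []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's data
        # Correctable formatting error: stray whitespace in a name field.
        rec["collector"] = rec.get("collector", "").strip()
        # Incomplete: no identifier means the record cannot be linked.
        if not rec.get("barcode"):
            rejected.append(rec)
            continue
        # Unreasonable: value outside the controlled vocabulary.
        if rec.get("country") not in valid_countries:
            rejected.append(rec)
            continue
        cleaned.append(rec)
    return cleaned, rejected
```

Run on two records, one with a padded collector name (corrected and kept) and one without a barcode (rejected), the function returns one cleaned and one rejected record.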


Author(s):  
Gunnar Ovstebo

Spores sourced from historic herbarium specimens have been used to introduce wild-collected material to the Royal Botanic Garden Edinburgh (RBGE) living plant collection. The ability of dry-habitat ferns to maintain spore viability for prolonged periods makes it possible to grow plants from the historically important RBGE herbarium collections. The factors that affect the ability of spores from herbarium collections to germinate are described. Three fern species from the Pteridaceae – Actiniopteris semiflabellata, Anogramma leptophylla and Aleuritopteris scioana – which were not previously in cultivation at RBGE, were germinated from herbarium material of different ages. Germination was observed in all three species. Plants produced in this experiment were accessioned into the RBGE living plant collection for future horticultural research and germination trials.

