Introducing ‘The bdverse’: a family of R packages for biodiversity data

Author(s):  
Tomer Gueta ◽  
Vijay Barve ◽  
Thiloshon Nagarajah ◽  
Povilas Gibas ◽  
Yohay Carmel

The bdverse is a collection of packages that form a general framework for facilitating biodiversity science in R. We built it to serve as a sustainable and agile infrastructure that enhances the value of biodiversity data by allowing users to conveniently employ R for data exploration, quality assessment, data cleaning, and standardization. The bdverse supports users with and without programming capabilities. It includes six unique packages in a hierarchical structure, representing different functionality levels (Fig. 1). Major features of three core packages will be highlighted and demonstrated: (i) bdDwC provides an interactive Shiny app and a set of functions for standardizing field names in compliance with the Darwin Core (DwC) format; (ii) bdchecks is an infrastructure for performing, filtering, and managing various biodiversity data checks; (iii) bdclean is a user-friendly data-cleaning Shiny app for the inexperienced R user. It provides features to manage the complete workflow for biodiversity data cleaning: data upload; user input to adjust cleaning procedures; data cleaning; and, finally, generation of various reports and versions of the data. We are now working on submitting the bdverse packages to rOpenSci software review, and as soon as the packages meet the core requirements, we will officially release the bdverse. The bdverse project won second prize in the 2018 Ebbe Nielsen Challenge.
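The field-name standardization that bdDwC wraps in a Shiny app can be pictured with a minimal base-R sketch. This is not the bdDwC API; the lookup table and local column names are invented for illustration, while the target names (decimalLatitude, decimalLongitude, scientificName) are real Darwin Core terms.

```r
# Hypothetical occurrence table with non-standard field names
occ <- data.frame(
  lat     = c(32.78, 31.25),
  lon     = c(35.02, 34.79),
  species = c("Gazella gazella", "Hyaena hyaena")
)

# User-verified mapping from local field names to Darwin Core terms
dwc_map <- c(lat     = "decimalLatitude",
             lon     = "decimalLongitude",
             species = "scientificName")

# Rename mapped columns, leave any others untouched
names(occ) <- ifelse(names(occ) %in% names(dwc_map),
                     dwc_map[names(occ)], names(occ))
names(occ)
#> [1] "decimalLatitude"  "decimalLongitude" "scientificName"
```

In bdDwC the mapping itself is suggested interactively and confirmed by the user; the sketch only shows the final renaming step.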

2018 ◽  
Vol 2 ◽  
pp. e25564
Author(s):  
Tomer Gueta ◽  
Vijay Barve ◽  
Thiloshon Nagarajah ◽  
Ashwin Agrawal ◽  
Yohay Carmel

A new R package for biodiversity data cleaning, 'bdclean', was initiated in the Google Summer of Code (GSoC) 2017 and is available on GitHub. Several R packages have excellent data validation and cleaning functions, but 'bdclean' provides features to manage a complete pipeline for biodiversity data cleaning, from data quality exploration to cleaning procedures and reporting. Users are able to go through the quality control process in a structured, intuitive, and effective way. A modular approach to the data cleaning functionality should make this package extensible for many biodiversity data cleaning needs. Under GSoC 2018, 'bdclean' will go through a comprehensive upgrade. New features will be highlighted in the demonstration.
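The check–clean–report pipeline described above can be sketched in package-free base R. This is not bdclean's actual API; the checks and thresholds are illustrative assumptions that show the modular shape of such a pipeline.

```r
# Hypothetical occurrence records, including two flawed ones
occ <- data.frame(
  scientificName   = c("Parus major", "Parus major", NA),
  decimalLatitude  = c(48.2, 95.0, 50.1),   # 95 is out of range
  decimalLongitude = c(16.4, 20.0, 14.3)
)

# Step 1: quality exploration -- one logical column per check
checks <- data.frame(
  has_name  = !is.na(occ$scientificName),
  lat_valid = !is.na(occ$decimalLatitude)  & abs(occ$decimalLatitude)  <= 90,
  lon_valid = !is.na(occ$decimalLongitude) & abs(occ$decimalLongitude) <= 180
)

# Step 2: cleaning -- keep records that pass every check
pass  <- rowSums(!checks) == 0
clean <- occ[pass, ]

# Step 3: minimal cleaning report
cat(sprintf("Removed %d of %d records\n", sum(!pass), nrow(occ)))
colSums(!checks)   # failures per check
```

Because each check is just another logical column, new checks can be appended without touching the filtering or reporting steps, which is the modularity the abstract refers to.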


2020 ◽  
Vol 8 ◽  
Author(s):  
Miloš Popović ◽  
Nikola Vasić ◽  
Toni Koren ◽  
Ivona Burić ◽  
Nenad Živanović ◽  
...  

We have developed a new platform named "Biologer" intended for recording species observations in the field (but also from literature sources and collections). The platform is created as user-friendly, open-source, multilingual software that is compatible with the Darwin Core standard and accompanied by a simple Android application. It is designed from the user’s perspective, allowing everyone to choose how they share their data. Project team members are delegated by the involved organisations. The team is responsible for development of the platform, while local Biologer communities are engaged in data collection and verification. Biologer has been online and available for use in Serbia since 2018 and was soon adopted in Croatia and Bosnia and Herzegovina. In total, we have assembled 536 users, who have collected 163,843 species observation records in the field and digitised 33,458 literature records. The number of active users and their records is growing daily. Of the gathered data, 89% have been made open access by the users, 10% are accessible at a 10×10 km resolution, and only 1% are closed. In the future, we plan to provide a taxonomic data portal that could be used by local and national initiatives in Eastern Europe, aggregate all data into a single web location, create detailed data overviews, and enable fluent communication between users.


2021 ◽  
Vol 22 (S6) ◽  
Author(s):  
Yasmine Mansour ◽  
Annie Chateau ◽  
Anna-Sophie Fiston-Lavier

Abstract Background Meiotic recombination is a vital biological process playing an essential role in genome structural and functional dynamics. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates necessary to address evolutionary questions. Results Here, we propose an automated computational tool, based on the Marey map method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles various marker density and distribution issues. Conclusions BREC's heterochromatin boundaries have been validated against cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent values. BREC's recombination rates have also been compared with previously reported estimates. Based on these promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We distribute BREC as an R package and a Shiny web-based, user-friendly application, yielding a fast, easy-to-use, and broadly accessible resource. The BREC R package is available at the GitHub repository https://github.com/GenomeStructureOrganization.
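The core Marey map idea that BREC builds on can be illustrated in a few lines of base R: fit a smooth curve of genetic position (cM) against physical position (Mb), then take its slope as the local recombination rate (cM/Mb). This is a sketch, not BREC's implementation; the marker data below are simulated and the loess smoother and span are illustrative choices.

```r
set.seed(1)
# Simulated Marey map: physical position (Mb) vs. genetic position (cM)
phys <- sort(runif(200, 0, 25))
gen  <- 3 * phys + 10 * sin(phys / 8) + rnorm(200, sd = 1.5)

# Smooth the Marey map
fit    <- loess(gen ~ phys, span = 0.3)
grid   <- seq(min(phys), max(phys), length.out = 100)
smooth <- predict(fit, newdata = data.frame(phys = grid))

# Local recombination rate = slope of the smoothed map (cM/Mb)
rate <- diff(smooth) / diff(grid)
summary(rate)
```

Regions where the estimated rate flattens towards zero (e.g., near centromeric heterochromatin in real data) are the kind of signal BREC exploits to place eu-heterochromatin boundaries.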


2021 ◽  
pp. 096228022110130
Author(s):  
Wei Wei ◽  
Denise Esserman ◽  
Michael Kane ◽  
Daniel Zelterman

Adaptive designs are gaining popularity in early-phase clinical trials because they enable investigators to change the course of a study in response to accumulating data. We propose a novel design to simultaneously monitor several endpoints, including efficacy, futility, toxicity, and other outcomes, in early-phase, single-arm studies. We construct a recursive relationship to compute the exact probabilities of stopping for any combination of endpoints, given pre-specified decision rules, without the need for simulation. The proposed design is flexible in the number and timing of interim analyses. An R Shiny app with a user-friendly web interface has been created to facilitate implementation of the proposed design.
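The flavour of such a recursion, computing exact stopping probabilities without simulation, can be sketched for the simplest case: a single binary endpoint with futility stopping. This is a hedged illustration, not the authors' algorithm; the cohort sizes and boundaries are invented.

```r
# Exact stopping probabilities for a single-arm binary-endpoint design.
# cohort[i]:   patients enrolled at stage i
# futility[i]: stop at stage i if cumulative responses <= futility[i]
exact_stop_probs <- function(p, cohort, futility) {
  state <- 1                      # P(responses = 0..k) among continuing paths
  stop_prob <- numeric(length(cohort))
  for (i in seq_along(cohort)) {
    new <- dbinom(0:cohort[i], cohort[i], p)
    # Recursion: convolve continuing-path distribution with the new cohort
    state <- convolve(state, rev(new), type = "open")
    k <- seq_along(state) - 1
    stop_prob[i] <- sum(state[k <= futility[i]])
    state[k <= futility[i]] <- 0  # those paths have stopped
  }
  c(stage = stop_prob, continue = sum(state))
}

# Two looks of 10 patients each; stop early if <= 2 responses at look 1
exact_stop_probs(p = 0.3, cohort = c(10, 10), futility = c(2, 8))
```

Monitoring several endpoints jointly, as in the proposed design, replaces the scalar response count with a multivariate state, but the recursion over interim looks has the same structure.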


Author(s):  
Lauren Weatherdon

Ensuring that we have the data and information necessary to make informed decisions is a core requirement in an era of increasing complexity and anthropogenic impact. With cumulative challenges such as the decline in biodiversity and accelerating climate change, the need for spatially-explicit and methodologically-consistent data that can be compiled to produce useful and reliable indicators of biological change and ecosystem health is growing. Technological advances—including satellite imagery—are beginning to make this a reality, yet uptake of biodiversity information standards and scaling of data to ensure its applicability at multiple levels of decision-making are still in progress. The complementary Essential Biodiversity Variables (EBVs) and Essential Ocean Variables (EOVs), combined with Darwin Core and other data and metadata standards, provide the underpinnings necessary to produce data that can inform indicators. However, perhaps the largest challenge in developing global, biological change indicators is achieving consistent and holistic coverage over time, with recognition of biodiversity data as global assets that are critical to tracking progress toward the UN Sustainable Development Goals and Targets set by the international community (see Jensen and Campbell (2019) for discussion). Through this talk, I will describe some of the efforts towards producing and collating effective biodiversity indicators, such as those based on authoritative datasets like the World Database on Protected Areas (https://www.protectedplanet.net/), and work achieved through the Biodiversity Indicators Partnership (https://www.bipindicators.net/). I will also highlight some of the characteristics of effective indicators, and global biodiversity reporting and communication needs as we approach 2020 and beyond.


Author(s):  
Naveen K. Bansal ◽  
Mehdi Maadooliat ◽  
Steven J. Schrodi

Abstract We consider a multiple hypotheses problem with directional alternatives in a decision-theoretic framework. We obtain an empirical Bayes rule subject to a constraint on the mixed directional false discovery rate (mdFDR ≤ α) under a semiparametric setting in which the distribution of the test statistic is parametric but the prior distribution is nonparametric. We propose separate priors for the left-tail and right-tail alternatives, as may be required in many applications. The proposed Bayes rule is compared through simulation against the rules proposed by Benjamini and Yekutieli and by Efron. We illustrate the proposed methodology on two sets of data from biological experiments: HIV-transfected cell-line mRNA expression data and a quantitative trait genome-wide SNP data set. We have developed a user-friendly web-based Shiny app for the proposed method, available at https://npseb.shinyapps.io/npseb/. The HIV and SNP data can be accessed directly, and the results presented in the paper can be reproduced.
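For context, the Benjamini–Yekutieli-style comparator mentioned above is, in its simplest form, ordinary Benjamini–Hochberg applied to two-sided p-values with the direction declared by the sign of the test statistic. The sketch below shows that baseline procedure only, not the authors' empirical Bayes rule; the simulated z-scores are illustrative.

```r
set.seed(42)
# 90 nulls plus 5 right-tail and 5 left-tail alternatives
z <- c(rnorm(90), rnorm(5, mean = 4), rnorm(5, mean = -4))
p <- 2 * pnorm(-abs(z))                 # two-sided p-values

# BH adjustment, then directional calls by the sign of z
p_adj <- p.adjust(p, method = "BH")
decision <- ifelse(p_adj > 0.05, "no call",
                   ifelse(z > 0, "right", "left"))
table(decision)
```

The empirical Bayes rule in the paper instead estimates separate nonparametric priors for the two tails and makes decisions subject to the mdFDR ≤ α constraint.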


2020 ◽  
Author(s):  
Yasmine Mansour ◽  
Annie Chateau ◽  
Anna-Sophie Fiston-Lavier

Abstract Motivation Meiotic recombination is a vital biological process playing an essential role in genome structural and functional dynamics. Genomes exhibit highly variable recombination profiles along chromosomes, associated with several chromatin states. However, eu-heterochromatin boundaries are neither available nor easily obtained for non-model organisms, especially newly sequenced ones. Hence, we lack the accurate local recombination rates necessary to address evolutionary questions. Results Here, we propose an automated computational tool, based on the Marey map method, that identifies heterochromatin boundaries along chromosomes and estimates local recombination rates. Our method, called BREC (heterochromatin Boundaries and RECombination rate estimates), is non-genome-specific, running even on non-model genomes as long as genetic and physical maps are available. BREC is based on pure statistics and is data-driven, implying that good input data quality remains a strong requirement. Therefore, a data pre-processing module (data quality control and cleaning) is provided. Experiments show that BREC handles various marker density and distribution issues. BREC's heterochromatin boundaries have been validated against cytological equivalents experimentally generated on the fruit fly Drosophila melanogaster genome, for which BREC returns congruent values. BREC's recombination rates have also been compared with previously reported estimates. Based on these promising results, we believe our tool has the potential to help bring data science into the service of genome biology and evolution. We introduce BREC as an R package and a Shiny web-based, user-friendly application yielding a fast, easy-to-use, and broadly accessible resource. Availability The BREC R package is available at the GitHub repository https://github.com/ymansour21/BREC.


2020 ◽  
Author(s):  
Ginette Lafit ◽  
Janne Adolf ◽  
Egon Dejonckheere ◽  
Inez Myin-Germeys ◽  
Wolfgang Viechtbauer ◽  
...  

In recent years the popularity of procedures to collect intensive longitudinal data, such as the Experience Sampling Method, has increased immensely. The data collected with such designs allow researchers to study the dynamics of psychological functioning and how these dynamics differ across individuals. To this end, the data are often modeled with multilevel regression models. An important question that arises when designing intensive longitudinal studies is how to determine the number of participants needed to test specific hypotheses regarding the parameters of these models with sufficient power. Power calculations for intensive longitudinal studies are challenging because of the hierarchical data structure, in which repeated observations are nested within individuals, and because of the serial dependence that is typically present in these data. We therefore present a user-friendly application and step-by-step tutorial for performing simulation-based power analyses for a set of models that are popular in intensive longitudinal research. Since many studies use the same sampling protocol (i.e., a fixed number of at least approximately equidistant observations) within individuals, we take this protocol as fixed and focus on the number of participants. All included models explicitly account for the temporal dependencies in the data by assuming serially correlated errors or including autoregressive effects.
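The simulate-fit-count logic behind such a power analysis can be sketched for one deliberately simple case: a fixed effect of a time-varying predictor with AR(1) errors within persons. This is not the application's model set; the analysis model here is naive OLS with person fixed effects (serial dependence is present in the generated data but ignored by the test), and all parameter values are illustrative assumptions.

```r
power_sim <- function(n_persons, n_obs, beta, phi, nsim = 500) {
  pvals <- replicate(nsim, {
    # Simulate each person's time series with AR(1) errors
    d <- do.call(rbind, lapply(seq_len(n_persons), function(i) {
      x <- rnorm(n_obs)
      e <- as.numeric(arima.sim(list(ar = phi), n_obs))
      data.frame(id = i, x = x, y = beta * x + e)
    }))
    # Fit the analysis model and extract the p-value for the effect of x
    summary(lm(y ~ x + factor(id), data = d))$coefficients["x", 4]
  })
  mean(pvals < 0.05)   # proportion of significant replications = power
}

power_sim(n_persons = 20, n_obs = 30, beta = 0.3, phi = 0.4, nsim = 200)
```

Repeating the call over a range of `n_persons` values traces out the power curve from which the required sample size is read off; the tutorial's models additionally model the serial correlation rather than ignoring it.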


2015 ◽  
Author(s):  
Alexander Zizka ◽  
Alexandre Antonelli

1. Large-scale species occurrence data from geo-referenced observations and collected specimens are crucial for analyses in ecology, evolution, and biogeography. Despite the rapidly growing availability of such data, their use in evolutionary analyses is often hampered by tedious manual classification of point occurrences into operational areas, leading to a lack of reproducibility and concerns regarding data quality. 2. Here we present speciesgeocodeR, a user-friendly R package for data cleaning, exploration, and visualization of species point occurrences using discrete operational areas, and for linking them to analyses involving phylogenetic trees. 3. The three core functions of the package are 1) automated and reproducible data cleaning, 2) rapid and reproducible classification of point occurrences into discrete operational areas in a format suitable for subsequent biogeographic analyses, and 3) comprehensive summary and visualization of species distributions to explore large datasets and ensure data quality. In addition, speciesgeocodeR facilitates access to and analysis of publicly available species occurrence data, widely used operational areas, and elevation ranges. Other functionalities include the implementation of minimum occurrence thresholds and the visualization of coexistence patterns and range sizes. SpeciesgeocodeR is accompanied by a richly illustrated, easy-to-follow tutorial and help functions.
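The classification step that speciesgeocodeR automates can be pictured with a toy base-R sketch. This is not the package's API; real operational areas are polygons, whereas the areas below are simple bounding boxes invented for illustration.

```r
# Hypothetical operational areas defined as bounding boxes
areas <- data.frame(
  area = c("A", "B"),
  xmin = c(0, 10), xmax = c(10, 20),
  ymin = c(0, 0),  ymax = c(10, 10)
)
# Point occurrences; the third falls outside every area
pts <- data.frame(lon = c(3, 14, 25), lat = c(5, 2, 5))

# Assign each point to the first area whose box contains it
classify <- function(lon, lat) {
  hit <- which(lon >= areas$xmin & lon <= areas$xmax &
               lat >= areas$ymin & lat <= areas$ymax)
  if (length(hit)) areas$area[hit[1]] else NA_character_
}
pts$area <- mapply(classify, pts$lon, pts$lat)
pts$area
#> [1] "A" "B" NA
```

Doing this assignment in code rather than by hand is exactly what makes the classification reproducible, which is the package's central motivation.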


Author(s):  
Yanina Sica ◽  
Paula Zermoglio

Biodiversity inventories, i.e., recording multiple species at a specific place and time, are routinely performed and offer high-quality data for characterizing biodiversity and its change. Digitization, sharing, and reuse of incidental point records (i.e., records that are not readily associated with systematic sampling or monitoring, typically museum specimens and many observations from citizen science projects) have been the focus of the biodiversity data community for many years. Only more recently has attention been directed towards mobilizing data from both new and long-standing inventories and monitoring efforts. These kinds of studies provide very rich data that can enable inferences about species absence, but their reliability depends on the methodology implemented, the survey effort, and completeness. The information about these elements has often been regarded as metadata and captured in an unstructured manner, making its full use very challenging. Unlocking and integrating inventory data requires data standards that facilitate capture and sharing of data with the appropriate depth. The Darwin Core standard (Wieczorek et al. 2012) currently enables reporting some of the information contained in inventories, particularly using Darwin Core Event terms such as samplingProtocol, sampleSizeValue, sampleSizeUnit, and samplingEffort. However, it is limited in its ability to accommodate spatial, temporal, and taxonomic scopes, and other key aspects of the inventory sampling process, such as direct or inferred measures of sampling effort and completeness. The lack of a standardized way to share inventory data has hindered their mobilization, integration, and broad reuse. In an effort to overcome these limitations, a framework was developed to standardize inventory data reporting: Humboldt Core (Guralnick et al. 2018).
Humboldt Core identified three types of inventories (single, elementary, and summary inventories) and proposed a series of terms to report their content. These terms were organized in six categories: dataset and identification; geospatial and habitat scope; temporal scope; taxonomic scope; methodology description; and completeness and effort. While originally planned as a new TDWG standard, and although it is currently implemented in Map of Life (https://mol.org/humboldtcore/), ratification was not pursued at the time, limiting broader community adoption. In 2021 the TDWG Humboldt Core Task Group was established to review how best to integrate the terms proposed in the original publication with existing standards and implementation schemas. The first goal of the task group was to determine whether a new, separate standard was needed or whether an extension to Darwin Core could accommodate the terms necessary to describe the relevant information elements. Since the different types of inventories can be thought of as Events with different nesting levels (events within events, e.g., plots within sites), and after an initial mapping to existing Darwin Core terms, it was deemed appropriate to start from a Darwin Core Event Core and build an extension to include the Humboldt Core terms. The task group members are currently revising all original Humboldt Core terms, reformulating definitions, comments, and examples, and discarding or adding terms where needed. We are also gathering real datasets to test the use of the extension once an initial list of revised terms is ready, before undergoing the public review period established by the TDWG process. Through the ratification of Humboldt Core as a TDWG extension, we expect to provide the community with a solution for sharing and using inventory data that improves biodiversity data discoverability, interoperability, and reuse while lowering the reporting burden at different levels (data collection, integration, and sharing).
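A concrete Event record helps fix ideas. The term names below (eventID, eventDate, and the four Darwin Core Event terms cited in the abstract) are real Darwin Core terms; the values are invented for demonstration, and the record is shown as a one-row R data frame purely for convenience.

```r
# Illustrative Darwin Core Event record for an inventory sampling event
event <- data.frame(
  eventID          = "site1-plot3-2021",
  eventDate        = "2021-05-14",
  samplingProtocol = "point count",
  sampleSizeValue  = 0.25,
  sampleSizeUnit   = "hectare",
  samplingEffort   = "2 observers * 30 minutes"
)
str(event)
```

What the Humboldt Core extension adds, per the abstract, are terms for the pieces this record cannot express, such as the inventory's taxonomic scope and inferred measures of sampling completeness.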

