Distributed, but Global in Reach: Outline of a de-centralized paradigm for biodiversity data intelligence

Author(s):  
Nico Franz ◽  
Edward Gilbert ◽  
Beckett Sterner

We provide an overview and update on initiatives and approaches to add taxonomic data intelligence to distributed biodiversity knowledge networks. "Taxonomic intelligence" for biodiversity data is defined here as the ability to identify and reconcile source-contextualized taxonomic name-to-meaning relationships (Remsen 2016). We review the scientific opportunities, as well as information-technological and socio-economic pathways - both existing and envisioned - to embed de-centralized taxonomic data intelligence into the biodiversity data publication and knowledge integration processes. We predict that the success of this project will ultimately rest on our ability to up-value the roles and recognition of systematic expertise and experts in large, aggregated data environments. We will argue that these environments will need to adhere to criteria for responsible data science and the interests of coherent communities of practice (Wenger 2000, Stoyanovich et al. 2017). This means allowing for fair, accountable, and transparent representation and propagation of evolving systematic knowledge and of enduring or newly apparent conflict in systematic perspective (Sterner and Franz 2017, Franz and Sterner 2018, Sterner et al. 2019). We will demonstrate, in principle and through concrete use cases, how to de-centralize systematic knowledge while maintaining alignments between congruent or conflicting taxonomic concept labels (Franz et al. 2016a, Franz et al. 2016b, Franz et al. 2019). The suggested approach uses custom-configured logic representation and reasoning methods, based on the Region Connection Calculus (RCC-5) alignment language. The approach offers syntactic consistency and semantic applicability or scalability across a wide range of biodiversity data products, ranging from occurrence records to phylogenomic trees. We will also illustrate how this kind of taxonomic data intelligence can be captured and propagated through existing or envisioned metadata conventions and standards (e.g., Senderov et al. 2018). Having established an intellectual opportunity, as well as a technical solution pathway, we turn to the issue of developing an implementation and adoption strategy. Which biodiversity data environments are currently the most taxonomically intelligent, and why? How is this level of taxonomic data intelligence created, maintained, and propagated outward? How are taxonomic data intelligence services motivated or incentivized, both at the level of individuals and organizations? Which "concerned entities" within the greater biodiversity data publication enterprise are best positioned to promote such services? Are the most valuable lessons for biodiversity data science "hidden" in successful social media applications? What are good, feasible, incremental steps towards improving taxonomic data intelligence for a diversity of data publishers?
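To give a flavor of what an RCC-5 articulation between taxonomic concept labels expresses, the following is a minimal sketch, not the authors' alignment toolkit: where two concept labels ("name sec. source") can be resolved to sets of lowest-level constituent concepts, the five RCC-5 relations (congruence, proper inclusion in either direction, partial overlap, disjointness) follow from simple set comparisons. All names and circumscriptions below are hypothetical.

```python
# A minimal sketch (not the authors' toolkit) of deriving RCC-5 articulations
# between taxonomic concepts, where each concept label ("name sec. source")
# has been resolved to the set of lowest-level child concepts it circumscribes.

def rcc5(a: set, b: set) -> str:
    """Return the RCC-5 relation between two concept circumscriptions."""
    if a == b:
        return "== (congruent)"
    if a < b:
        return "< (proper part of)"
    if a > b:
        return "> (inverse proper part of)"
    if a & b:
        return ">< (partially overlapping)"
    return "| (disjoint)"

# Hypothetical circumscriptions of the same genus name under two treatments.
aus_sec_2010 = {"Aus bus", "Aus cus"}
aus_sec_2020 = {"Aus bus", "Aus cus", "Aus dus"}  # a newly recognized species

print(rcc5(aus_sec_2010, aus_sec_2020))  # -> "< (proper part of)"
```

In practice the interesting cases are those where circumscriptions are only partially known, which is where the custom-configured logic reasoning mentioned above comes in; the set-based view is just the simplest grounding of the five relations.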

Author(s):  
Gaurav Vaidya ◽  
Hilmar Lapp ◽  
Nico Cellinese

Most biological data and knowledge are directly or indirectly linked to biological taxa via taxon names. Using taxon names is one of the most fundamental and ubiquitous ways in which a wide range of biological data are integrated, aggregated, and indexed, from genomic and microbial diversity to macro-ecological data. To this day, the names used, as well as most methods and resources developed for this purpose, are drawn from Linnaean nomenclature. This leads to numerous problems when applied to data-intensive science that depends on computation to take full advantage of the vast – and rapidly increasing – amount of available digital biodiversity data. The theoretical and practical complexities of reconciling taxon names and concepts have plagued the systematics community for decades. Now more than ever, Linnaean names based in Linnaean taxonomy, by far the most prevalent means of linking data to taxa, are unfit for the age of computation-driven data science, due to fundamental theoretical and practical shortfalls that cannot be cured. We propose an alternate approach based on the use of phylogenetic clade definitions, a well-developed method for unambiguously defining the semantics of a clade concept in terms of shared evolutionary ancestry (de Queiroz and Gauthier 1990, de Queiroz and Gauthier 1994). These semantics allow locating the defined clade on any phylogeny, or showing that a clade is inconsistent with the topology of a given phylogeny and hence cannot be present on it at all. We have built a workflow for expressing clade definitions in terms of shared-ancestor and excluded-lineage properties, and for locating these definitions on any input phylogeny. Once these definitions have been located, we can use the list of species found within that clade on that phylogeny to aggregate occurrence data from the Global Biodiversity Information Facility (GBIF). Thus, our approach uses clade definitions with machine-understandable semantics to programmatically and reproducibly aggregate biodiversity data by higher-level taxonomic concepts. This approach has several advantages over the use of taxonomic hierarchies. Unlike taxa, the semantics of clade definitions can be expressed in unambiguous, machine-understandable and reproducible terms and language. The resolution of a given clade definition will depend on the phylogeny being used; thus, if the phylogeny of groups of interest is updated in light of new evolutionary knowledge, the clade definition can be applied to the new phylogeny to obtain an updated list of clade members consistent with the updated evolutionary knowledge. Machine reproducibility of analyses is possible simply by archiving the machine-readable representations of the clade definition and the phylogeny being used.
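The locate-and-list step can be illustrated with a minimal sketch, not the Phyloreferencing implementation: a clade definition given by shared-ancestor specifiers and excluded lineages is resolved to the most recent common ancestor of the specifiers on a toy phylogeny, and the species under that node are returned, or the definition is flagged as inconsistent if an excluded lineage falls inside it. All names below are hypothetical.

```python
# A minimal sketch, not the Phyloreferencing codebase: resolving a clade
# definition (shared-ancestor specifiers plus excluded lineages) on a small
# phylogeny encoded as a child -> parent mapping. Names are hypothetical.

PARENT = {
    "Aus": "n1", "Bus": "n1", "n1": "n2", "Cus": "n2", "n2": "root", "Dus": "root",
}

def ancestors(node):
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def mrca(taxa):
    """Most recent common ancestor: first node shared by all ancestor paths."""
    shared = set(ancestors(taxa[0]))
    for t in taxa[1:]:
        shared &= set(ancestors(t))
    return next(n for n in ancestors(taxa[0]) if n in shared)

def leaves_under(node):
    return [t for t in PARENT
            if t not in set(PARENT.values()) and node in ancestors(t)]

def resolve(internal, external):
    """Locate the clade; return None if an excluded lineage falls inside it."""
    node = mrca(internal)
    members = leaves_under(node)
    if any(x in members for x in external):
        return None  # definition inconsistent with this topology
    return members

print(resolve(internal=["Aus", "Bus"], external=["Dus"]))  # -> ['Aus', 'Bus']
```

The returned species list is what would then be handed to an occurrence service such as the GBIF API for aggregation; applying the same definition to a revised tree simply re-runs the resolution step.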
Clade definitions can be created by biologists as needed or can be reused from those published in peer-reviewed journals. In addition, nearly 300 peer-reviewed clade definitions were recently published as part of the Phylonym volume of the PhyloCode (de Queiroz et al. 2020) and are now available on the RegNum website. As part of the Phyloreferencing Project, we are digitizing this collection as a machine-readable ontology, where each clade is represented as a class defined by logical conjunctions for class membership, corresponding to a set of necessary and sufficient conditions of shared or divergent evolutionary ancestry. We call these classes phyloreferences, and have created a fully automated workflow for digitizing the RegNum database content into an OWL ontology (W3C OWL Working Group 2012) that we call the Clade Ontology. This ontology includes reference phylogenies and additional metadata about the verbatim clade definitions. Once complete, the Clade Ontology will include all clade definitions from RegNum, both those included in Phylonym after passing peer review and those contributed by the community, whether or not governed by the PhyloCode. As an openly available community resource, the Clade Ontology will allow researchers to aggregate biodiversity data for comparative biology with grouping semantics that are transparent, machine-processable, and reproducible. In our presentation, we will demonstrate the use of phyloreferences to locate clades on the Open Tree of Life synthetic tree (Hinchliff et al. 2015), to retrieve lists of species in each clade, and to use them to find and aggregate occurrence records in GBIF. We will also describe the workflow we are currently using to build and test the Clade Ontology, and describe our plans for publishing this resource. Finally, we will discuss the advantages and disadvantages of this approach as compared to taxonomic checklists.


Author(s):  
Nico Franz ◽  
Beckett Sterner ◽  
Nathan Upham ◽  
Kevin Cortés Hernández

Translating information between the domains of systematics and conservation requires novel information management designs. Such designs should improve interactions across the trading zone between the domains, herein understood as the model according to which knowledge and uncertainty are productively translated in both directions (cf. Collins et al. 2019). Two commonly held attitudes stand in the way of designing a well-functioning systematics-to-conservation trading zone. On one side, there are calls to unify the knowledge signal produced by systematics, underpinned by the argument that such unification is a necessary precondition for conservation policy to be reliably expressed and enacted (e.g., Garnett et al. 2020). As a matter of legal scholarship, the argument for systematic unity by legislative necessity is principally false (Weiss 2003, MacNeil 2009, Chromá 2011), but perhaps effective enough as a strategy to win over audiences unsure about robust law-making practices in light of variable and uncertain knowledge. On the other side, there is an attitude that conservation cannot ever restrict the academic freedom of systematics as a scientific discipline (e.g., Raposo et al. 2017). This otherwise sound argument misses the mark in the context of designing a productive trading zone with conservation. The central interactional challenge is not whether the systematic knowledge can vary at a given time and/or evolve over time, but whether these signal dynamics are tractable in ways that actors can translate into robust maxims for conservation. Redesigning the trading zone should rest on the (historically validated) projection that systematics will continue to attract generations of inspired, productive researchers and broad-based societal support, frequently leading to protracted conflicts and dramatic shifts in how practitioners in the field organize and identify organismal lineages subject to conservation. This confident outlook for systematics' future, in turn, should refocus the challenge of designing the trading zone as one of building better information services to model the concurrent conflicts and longer-term evolution of systematic knowledge. It would seem unreasonable to expect the International Union for Conservation of Nature (IUCN) Red List Index to develop better data science models for the dynamics of systematic knowledge (cf. Hoffmann et al. 2011) than are operational in the most reputable information systems designed and used by domain experts (Burgin et al. 2018). The reasonable challenge from conservation to systematics is not to stop being a science but to become a better data science. In this paper, we will review advances in biodiversity data science in relation to representing and reasoning over changes in systematic knowledge with computational logic, i.e., modeling systematic intelligence (Franz et al. 2016). We stress-test this approach with a use case where rapid systematic signal change and high stakes for conservation action intersect, i.e., the Malagasy mouse lemurs (Microcebus É. Geoffroy, 1834 sec. Schüßler et al. 2020), where the number of recognized species-level concepts has risen from 2 to 25 in the span of 38 years (1982–2020). To the extent scientifically defensible, we extend our modeling approach to the level of individual published occurrence records, where the inability to do so sometimes reflects substandard practice but more importantly reveals systemic inadequacies in biodiversity data science or informational modeling.
In the absence of shared, sound theoretical foundations to assess taxonomic congruence or incongruence across treatments, and in the absence of biodiversity data platforms capable of propagating logic-enabled, scalable occurrence-to-concept identification events to produce alternative and succeeding distribution maps, there is no robust way to provide a knowledge signal from systematics to conservation that is both consistent in its syntax and accurate in its semantics, in the sense of accurately reflecting the variation and uncertainty that exists across multiple systematic perspectives. Translating this diagnosis into new designs for the trading zone is only one "half" of the solution, i.e., a technical advancement that would then need to be socially endorsed and incentivized by systematic and conservation communities motivated to elevate their collaborative interactions and to trade robustly in inherently variable and uncertain information.
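As one illustration of what an occurrence-to-concept identification event could look like in a data model, the sketch below uses assumed field names, not a published schema: an occurrence is identified against a concept label ("name sec. treatment") rather than a bare name, and an RCC-5 articulation between treatments lets a platform relate that identification to an earlier or competing concept. The treatment labels are illustrative only.

```python
# A minimal sketch (assumed field names, not a published schema) of an
# occurrence-to-concept identification event, with a cross-treatment
# articulation table used to propagate the identification.

from dataclasses import dataclass

@dataclass
class Identification:
    occurrence_id: str      # e.g., a GBIF occurrence key
    concept_label: str      # "name sec. treatment"
    identified_by: str
    date: str

# Illustrative RCC-5 articulations between concepts from two treatments.
ALIGNMENTS = {
    ("Microcebus murinus sec. 2020 treatment",
     "Microcebus murinus sec. 1982 treatment"): "< (proper part of)",
}

rec = Identification("occ-123", "Microcebus murinus sec. 2020 treatment",
                     "A. Curator", "2021-05-01")
print(ALIGNMENTS[(rec.concept_label, "Microcebus murinus sec. 1982 treatment")])
```

The point of such a record structure is that distribution maps can be recomputed under either treatment without re-identifying specimens, which is exactly the propagation capability the paragraph above argues current platforms lack.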


2020 ◽  
Vol 8 ◽  
Author(s):  
Devasis Bassu ◽  
Peter W. Jones ◽  
Linda Ness ◽  
David Shallcross

Abstract In this paper, we present a theoretical foundation for a representation of a data set as a measure in a very large hierarchically parametrized family of positive measures, whose parameters can be computed explicitly (rather than estimated by optimization), and illustrate its applicability to a wide range of data types. The preprocessing step then consists of representing data sets as simple measures. The theoretical foundation consists of a dyadic product formula representation lemma, and a visualization theorem. We also define an additive multiscale noise model that can be used to sample from dyadic measures and a more general multiplicative multiscale noise model that can be used to perturb continuous functions, Borel measures, and dyadic measures. The first two results are based on theorems in [15, 3, 1]. The representation uses the very simple concept of a dyadic tree and hence is widely applicable, easily understood, and easily computed. Since the data sample is represented as a measure, subsequent analysis can exploit statistical and measure theoretic concepts and theories. Because the representation uses the very simple concept of a dyadic tree defined on the universe of a data set, and the parameters are simply and explicitly computable and easily interpretable and visualizable, we hope that this approach will be broadly useful to mathematicians, statisticians, and computer scientists who are intrigued by or involved in data science, including its mathematical foundations.
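To make the explicitly computable parameters concrete, here is a minimal sketch under assumed notation, not the paper's implementation: an empirical sample on [0, 1) is summarized by one parameter per dyadic interval, a_I = (mass of left half of I minus mass of right half of I) / (mass of I), which determines the induced dyadic measure down to the chosen depth without any optimization.

```python
# A minimal sketch, under assumed notation: representing a data sample on
# [0, 1) as a dyadic measure whose product-formula parameters a_I are
# computed explicitly from interval masses (no fitting or optimization).

import numpy as np

def dyadic_parameters(x, depth):
    """Return {(level, k): a_I} for dyadic intervals I = [k/2^level, (k+1)/2^level)."""
    x = np.asarray(x)
    params = {}
    for level in range(depth):
        for k in range(2 ** level):
            lo, hi = k / 2 ** level, (k + 1) / 2 ** level
            mid = (lo + hi) / 2
            mass = np.sum((x >= lo) & (x < hi))
            if mass == 0:
                continue  # convention: parameter left undefined on empty intervals
            left = np.sum((x >= lo) & (x < mid))
            right = mass - left
            params[(level, k)] = (left - right) / mass
    return params

rng = np.random.default_rng(0)
sample = rng.beta(2, 5, size=1000)      # any data rescaled into [0, 1)
print(dyadic_parameters(sample, depth=3))
```

The parameters are directly interpretable: a value near zero means the sample splits evenly across the two halves of the interval, while values near ±1 indicate strongly skewed mass, which is what makes this representation easy to visualize level by level.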


2020 ◽  
Vol 10 (1) ◽  
Author(s):  
Ali Rohani ◽  
Jennifer A. Kashatus ◽  
Dane T. Sessions ◽  
Salma Sharmin ◽  
David F. Kashatus

Abstract Mitochondria are highly dynamic organelles that can exhibit a wide range of morphologies. Mitochondrial morphology can differ significantly across cell types, reflecting different physiological needs, but can also change rapidly in response to stress or the activation of signaling pathways. Understanding both the cause and consequences of these morphological changes is critical to fully understanding how mitochondrial function contributes to both normal and pathological physiology. However, while robust and quantitative analysis of mitochondrial morphology has become increasingly accessible, there is a need for new tools to generate and analyze large data sets of mitochondrial images in high throughput. The generation of such datasets is critical to fully benefit from rapidly evolving methods in data science, such as neural networks, that have shown tremendous value in extracting novel biological insights and generating new hypotheses. Here we describe a set of three computational tools, Cell Catcher, Mito Catcher and MiA, that we have developed to extract extensive mitochondrial network data on a single-cell level from multi-cell fluorescence images. Cell Catcher automatically separates and isolates individual cells from multi-cell images; Mito Catcher uses the statistical distribution of pixel intensities across the mitochondrial network to detect and remove background noise from the cell and segment the mitochondrial network; MiA uses the binarized mitochondrial network to perform more than 100 mitochondria-level and cell-level morphometric measurements. To validate the utility of this set of tools, we generated a database of morphological features for 630 individual cells that encode 0, 1 or 2 alleles of the mitochondrial fission GTPase Drp1 and demonstrate that these mitochondrial data could be used to predict Drp1 genotype with 87% accuracy. Together, this suite of tools enables the high-throughput and automated collection of detailed and quantitative mitochondrial structural information at a single-cell level. Furthermore, the data generated with these tools, when combined with advanced data science approaches, can be used to generate novel biological insights.
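As a rough illustration of the kind of single-cell measurements such a pipeline produces, the snippet below is a sketch with scikit-image, not the published Cell Catcher, Mito Catcher, or MiA code: it thresholds one cell's mitochondrial channel and reports a few network-level morphometrics. Threshold and size parameters are illustrative only.

```python
# A minimal sketch (not the published tools): binarize a single-cell
# mitochondrial image and compute a few network-level morphometrics.

import numpy as np
from skimage import filters, measure, morphology

def mito_morphometrics(cell_img):
    """cell_img: 2-D float array, mitochondrial fluorescence of one cell."""
    # crude background removal + binarization via a global Otsu threshold
    mask = cell_img > filters.threshold_otsu(cell_img)
    mask = morphology.remove_small_objects(mask, min_size=5)

    labels = measure.label(mask)             # connected mitochondrial fragments
    props = measure.regionprops(labels)
    skeleton = morphology.skeletonize(mask)  # 1-px-wide network backbone

    return {
        "fragment_count": int(labels.max()),
        "total_area_px": int(mask.sum()),
        "mean_fragment_area_px": float(np.mean([p.area for p in props])) if props else 0.0,
        "network_length_px": int(skeleton.sum()),
    }

# Toy image: two bright blobs on a dark background.
img = np.zeros((64, 64)); img[10:14, 10:30] = 1.0; img[40:44, 20:26] = 1.0
print(mito_morphometrics(img))
```

Feature dictionaries of this shape, computed per cell across hundreds of cells, are the kind of tabular input that downstream classifiers (such as the Drp1 genotype prediction described above) can be trained on.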


2021 ◽  
Vol 1 (2) ◽  
pp. 27-33
Author(s):  
M.V. Lyashenko ◽

V.V. Shekhovtsov ◽  
P.V. Potapov ◽  
A.I. Iskaliyev ◽  
...  

The pneumatic seat suspension is one of the most important, and in some situations one of the key, components of the vibration protection system for the human operator of a vehicle. At the present stage of scientific and technical activity, most developers place great emphasis on controlled seat suspension systems as the most promising systems. This article analyzes the methods of controlling the elastic damping characteristics of the air suspension of a vehicle seat. Ten different and fairly well-known methods of changing the shape and parameters of elastic damping characteristics by means of electro-pneumatic valves, throttles, motors, additional cavities, auxiliary mechanisms and other actuators were considered, and the advantages, application limits and disadvantages of each method were analyzed. Based on the results of the performed analytical procedure, as well as the recommendations known in the scientific and technical literature on improving the vibration-protective properties of suspension systems, the authors proposed and developed a new method for controlling the elastic-damping characteristic, which is implemented in the proposed technical solution for the air suspension of a vehicle seat. The method differs in that it implements a cyclic controlled exchange of the working fluid between the cavities of the pneumatic elastic element and the additional volume of the receiver on the compression and rebound strokes, forming an almost symmetric elastic damping characteristic, with partial recuperation of vibrational energy by a pneumatic drive in the form of a rotary-type pneumatic motor. In addition, the method does not require an unregulated hydraulic shock absorber, while still offering improved vibration-protective properties of the air suspension of a vehicle seat over a wide range of operating influences.


Author(s):  
Lyubomir Penev ◽  
Teodor Georgiev ◽  
Viktor Senderov ◽  
Mariya Dimitrova ◽  
Pavel Stoev

As one of the first advocates of open access and open data in the field of biodiversity publishing, Pensoft has adopted a multiple data publishing model, resulting in the ARPHA-BioDiv toolbox (Penev et al. 2017). ARPHA-BioDiv consists of several data publishing workflows and tools described in the Strategies and Guidelines for Publishing of Biodiversity Data and elsewhere: (1) data underlying research results are deposited in an external repository and/or published as supplementary file(s) to the article and then linked/cited in the article text; supplementary files are published under their own DOIs and bear their own citation details; (2) data are deposited in trusted repositories and/or supplementary files and described in data papers; data papers may be submitted in text format or converted into manuscripts from Ecological Metadata Language (EML) metadata; (3) integrated narrative and data publishing is realised by the Biodiversity Data Journal, where structured data are imported into the article text from tables or via web services and downloaded/distributed from the published article; (4) data are published in structured, semantically enriched, full-text XMLs, so that several data elements can thereafter easily be harvested by machines; (5) Linked Open Data (LOD) are extracted from literature, converted into interoperable RDF triples in accordance with the OpenBiodiv-O ontology (Senderov et al. 2018) and stored in the OpenBiodiv Biodiversity Knowledge Graph.
The above-mentioned approaches are supported by a whole ecosystem of additional workflows and tools, for example: (1) pre-publication data auditing, involving both human and machine data quality checks (workflow 2); (2) web-service integration with data repositories and data centres, such as the Global Biodiversity Information Facility (GBIF), Barcode of Life Data Systems (BOLD), Integrated Digitized Biocollections (iDigBio), Data Observation Network for Earth (DataONE), Long Term Ecological Research (LTER), PlutoF, Dryad, and others (workflows 1, 2); (3) semantic markup of the article texts in the TaxPub format, facilitating further extraction, distribution and re-use of sub-article elements and data (workflows 3, 4); (4) server-to-server import of specimen data from GBIF, BOLD, iDigBio and PlutoF into manuscript text (workflow 3); (5) automated conversion of EML metadata into data paper manuscripts (workflow 2); (6) export of Darwin Core Archives and automated deposition in GBIF (workflow 3); (7) submission of individual images and supplementary data under their own DOIs to the Biodiversity Literature Repository, BLR (workflows 1-3); (8) conversion of key data elements from TaxPub articles and taxonomic treatments extracted by Plazi into RDF handled by OpenBiodiv (workflow 5). These approaches represent different aspects of the prospective scholarly publishing of biodiversity data, which, in combination with text and data mining (TDM) technologies for legacy literature (PDF) developed by Plazi, lay the groundwork for an entire data publishing ecosystem for biodiversity, supplying FAIR (Findable, Accessible, Interoperable and Reusable) data to several interoperable overarching infrastructures, such as GBIF, BLR, Plazi TreatmentBank and OpenBiodiv, and to various end users.
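As a toy illustration of the literature-to-Linked-Open-Data step (workflow 5 above), the sketch below converts one Darwin Core occurrence row into RDF triples with rdflib; it uses generic dwc: terms only and does not reproduce the OpenBiodiv-O model or Pensoft's actual conversion pipeline. The occurrence values are invented.

```python
# A minimal sketch: one Darwin Core occurrence row expressed as RDF triples
# with rdflib, using the dwc: term namespace. Values are illustrative.

from rdflib import Graph, Literal, Namespace, URIRef

DWC = Namespace("http://rs.tdwg.org/dwc/terms/")

row = {
    "occurrenceID": "urn:example:occ:1",
    "scientificName": "Aus bus",
    "eventDate": "2019-07-14",
    "decimalLatitude": "42.1",
}

g = Graph()
g.bind("dwc", DWC)
occ = URIRef(row["occurrenceID"])
for term, value in row.items():
    if term != "occurrenceID":
        g.add((occ, DWC[term], Literal(value)))  # one triple per Darwin Core term

print(g.serialize(format="turtle"))
```

Once data elements are available as triples of this kind, they can be queried and linked across articles in a knowledge graph, which is the motivation for the OpenBiodiv component described above.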


2013 ◽  
Vol 8 (1) ◽  
pp. 193-203 ◽  
Author(s):  
Sarah Callaghan ◽  
Fiona Murphy ◽  
Jonathan Tedds ◽  
Rob Allan ◽  
John Kunze ◽  
...  

The Peer REview for Publication and Accreditation of Research Data in the Earth sciences (PREPARDE) project is a JISC- and NERC-funded project that aims to investigate the policies and procedures required for the formal publication of research data, ranging from ingestion into a data repository through to formal publication in a data journal. It also addresses key issues arising in the data publication paradigm, including, but not limited to, how one peer reviews a dataset, what criteria are needed for a repository to be considered objectively trustworthy, and how datasets and journal publications can be effectively cross-linked for the benefit of the wider research community. PREPARDE brings together a wide range of experts in the research, academic publishing and data management fields, both within the Earth sciences and in the broader life sciences, with the aim of producing general guidelines applicable to a wide range of scientific disciplines and data publication types. This paper provides details of the work done in the first half of the project; the project itself will be completed in June 2013.


2015 ◽  
Vol 5 (2) ◽  
pp. 279
Author(s):  
MA. Fisnik Sadiku ◽  
MA. Besnik Lokaj

Intelligence services are an important factor of national security. Their main role is to collect, process, analyze, and disseminate information on threats to the state and its population. Because of their "dark" activity, intelligence services are, for many ordinary citizens, synonymous with violence, fear and intimidation. This is especially the case in the Republic of Kosovo, due to the murderous activities of the Serbian secret service in the past. Therefore, we will treat the work of intelligence services under democratic conditions, so that the reader can understand what is legitimate and legal in the work of these services. In different countries of the world, security challenges continue to evolve and progress every day, and to meet these challenges the state needs new ways of coordinating and developing the capability to shape the national security environment. However, the growth of intelligence activities in many countries has raised debates about legal and ethical issues regarding those activities. Therefore, this paper will include a clear explanation of the term and meaning of intelligence, its processes, transparency and secrecy, and the role that intelligence services have in analyzing potential threats to national security. The study is based on a wide range of print and electronic literature, including academic and scientific literature, and other documents of various intelligence agencies of developed countries.


Author(s):  
Belén Rubio Ballester ◽  
Fabrizio Antenucci ◽  
Martina Maier ◽  
Anthony C. C. Coolen ◽  
Paul F. M. J. Verschure

Abstract Introduction After a stroke, a wide range of deficits can occur with varying onset latencies. As a result, assessing impairment and recovery are enormous challenges in neurorehabilitation. Although several clinical scales are generally accepted, they are time-consuming, show high inter-rater variability, have low ecological validity, and are vulnerable to biases introduced by compensatory movements and action modifications. Alternative methods need to be developed for efficient and objective assessment. In this study, we explore the potential of computer-based body tracking systems and classification tools to estimate the motor impairment of the more affected arm in stroke patients. Methods We present a method for estimating clinical scores from movement parameters that are extracted from kinematic data recorded during unsupervised computer-based rehabilitation sessions. We identify a number of kinematic descriptors that characterise the patients' hemiparesis (e.g., movement smoothness, work area), implement a double-noise model, and perform a multivariate regression using clinical data from 98 stroke patients who completed a total of 191 sessions with RGS. Results Our results reveal a new digital biomarker of arm function, the Total Goal-Directed Movement (TGDM), which relates to the patients' work area during the execution of goal-oriented reaching movements. The model's performance in estimating FM-UE scores reaches an accuracy of R² = 0.38, with an error of σ = 12.8. Next, we evaluate its reliability (r = 0.89 for test-retest), longitudinal external validity (95% true positive rate), sensitivity, and generalisation to other tasks that involve planar reaching movements (R² = 0.39). The model achieves comparable accuracy for the Chedoke Arm and Hand Activity Inventory (R² = 0.40) and the Barthel Index (R² = 0.35). Conclusions Our results highlight the clinical value of kinematic data collected during unsupervised goal-oriented motor training with the RGS combined with data science techniques, and provide new insight into factors underlying recovery and its biomarkers.
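For readers unfamiliar with this type of analysis, the snippet below shows the general shape of regressing a clinical score on per-session kinematic descriptors; it uses synthetic data and ordinary least squares, not the authors' double-noise model, and the feature and score names are illustrative.

```python
# A minimal sketch, on synthetic data: multivariate regression of a clinical
# score on kinematic descriptors (e.g., movement smoothness, work area).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 191                                     # sessions, as in the study size above
smoothness = rng.normal(0, 1, n)
work_area = rng.normal(0, 1, n)
X = np.column_stack([smoothness, work_area])
score = 30 + 6 * smoothness + 4 * work_area + rng.normal(0, 8, n)  # synthetic scores

model = LinearRegression().fit(X, score)
print("cross-validated R^2:",
      cross_val_score(LinearRegression(), X, score, cv=5, scoring="r2").mean())
```

Reporting cross-validated R² rather than in-sample fit is what makes such an estimate comparable to the external-validity and test-retest figures quoted above.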


2018 ◽  
Vol 2 ◽  
pp. e26539 ◽  
Author(s):  
Paul J. Morris ◽  
James Hanken ◽  
David Lowery ◽  
Bertram Ludäscher ◽  
James Macklin ◽  
...  

As curators of biodiversity data in natural science collections, we are deeply concerned with data quality, but quality is an elusive concept. An effective way to think about data quality is in terms of fitness for use (Veiga 2016). To use data to manage physical collections, the data must be able to accurately answer questions such as what objects are in the collections, where they are, and where they are from. Some research makes use of data aggregated across collections, which involves exchange of data using standard vocabularies. Some research uses require accurate georeferences, collecting dates, and current identifications. It is well understood that the costs of data capture and data quality improvement increase with increasing time from the original observation. These factors point towards two engineering principles for software that is intended to maintain or enhance data quality: build small, modular data quality tests that can be easily assembled into suites to assess the fitness for use of data for some particular need; and produce tools that can be applied by users with a wide range of technical skill levels at different points in the data life cycle. In the Kurator project, we have produced code (e.g. Wieczorek et al. 2017, Morris 2016) which consists of small modules that can be incorporated into data management processes as small libraries addressing particular data quality tests. These modules can be combined into customizable data quality scripts, which can be run on single computers or scalable architecture and can be incorporated into other software, run as command-line programs, or run as suites of canned workflows through a web interface. Kurator modules can be integrated into early-stage data capture applications, run to help prepare data for aggregation by matching it to standard vocabularies, run for quality control or quality assurance on data sets, and used to report on data quality in terms of a fitness-for-use framework (Veiga et al. 2017). One of our goals is to provide simple tests usable by anyone, anywhere.
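A minimal sketch of the modular pattern described above, not the Kurator libraries themselves: each data quality test is a small function returning a structured result, and a suite is simply a list of such functions run over each record. Field names and status labels are illustrative.

```python
# A minimal sketch (not Kurator) of small, composable data quality tests
# assembled into a suite and run over Darwin Core-style records.

def has_collecting_date(record):
    ok = bool(record.get("eventDate"))
    return {"test": "has_collecting_date",
            "status": "COMPLIANT" if ok else "NOT_COMPLIANT"}

def georeference_in_range(record):
    try:
        lat = float(record["decimalLatitude"])
        lon = float(record["decimalLongitude"])
        ok = -90 <= lat <= 90 and -180 <= lon <= 180
    except (KeyError, ValueError):
        ok = False
    return {"test": "georeference_in_range",
            "status": "COMPLIANT" if ok else "NOT_COMPLIANT"}

def run_suite(records, tests):
    """Apply every test to every record; the result is a fitness report."""
    return [[test(record) for test in tests] for record in records]

records = [{"eventDate": "1994-03-02",
            "decimalLatitude": "12.5", "decimalLongitude": "191.0"}]
print(run_suite(records, [has_collecting_date, georeference_in_range]))
```

Because each test is independent and returns a uniform result structure, the same functions can back a command-line check, a web workflow, or an early-stage capture form, which is the reuse-across-skill-levels goal stated above.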

