Biodiversity Observations Miner: A web application to unlock primary biodiversity data from published literature

Biodiversity Data Journal ◽

10.3897/bdj.7.e28737 ◽

2019 ◽

Vol 7 ◽

Author(s):

Gabriel Muñoz ◽

W. Daniel Kissling ◽

E. Emiel van Loon

Keyword(s):

Text Mining ◽

Knowledge Discovery ◽

Web Application ◽

Large Scale ◽

Biotic Interactions ◽

Rapid Screening ◽

Biodiversity Data ◽

Biodiversity Science ◽

Automated Discovery ◽

Machine Readable

A considerable portion of primary biodiversity data is digitally locked inside published literature, which is often stored as PDF files. Large-scale approaches to biodiversity science could benefit from retrieving this information and making it digitally accessible and machine-readable. Nonetheless, the amount and diversity of digitally published literature pose many challenges for knowledge discovery and retrieval. Text mining has been extensively used for data discovery tasks in large quantities of documents. However, text mining approaches for knowledge discovery and retrieval have been limited in biodiversity science compared to other disciplines. Here, we present a novel, open source text mining tool, the Biodiversity Observations Miner (BOM). This web application, written in R, allows the semi-automated discovery of punctual biodiversity observations (e.g. biotic interactions, functional or behavioural traits and natural history descriptions) associated with the scientific names present inside a corpus of scientific literature. Furthermore, BOM enables users to rapidly screen large quantities of literature based on word co-occurrences that match custom biodiversity dictionaries. This tool aims to increase the digital mobilisation of primary biodiversity data and is freely accessible via GitHub or through a web server.
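
The abstract describes screening based on co-occurrences between scientific names and dictionary terms, but not how BOM implements it. As a rough illustration only, the Python sketch below flags sentences in which a candidate scientific name appears alongside a dictionary term. The dictionary contents, the regular expression used as a crude stand-in for name recognition, and the function name `screen_sentences` are all assumptions made for this example and are not taken from BOM.

```python
import re

# Hypothetical dictionary of interaction terms; not taken from BOM.
INTERACTION_TERMS = {"pollinates", "feeds", "disperses", "preys"}

# Crude stand-in for scientific-name recognition: a capitalised word
# followed by a lowercase word of three or more letters. It also matches
# ordinary sentence openings, so real tools rely on taxonomic name finders.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")


def screen_sentences(text, terms=INTERACTION_TERMS):
    """Return sentences where a candidate name co-occurs with a dictionary term."""
    hits = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        names = BINOMIAL.findall(sentence)
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        matched = sorted(words & terms)
        if names and matched:
            hits.append({"sentence": sentence, "names": names, "terms": matched})
    return hits
```

Calling `screen_sentences` on a paragraph returns one dictionary per flagged sentence, which a reviewer could then inspect by hand; that manual step is what makes this kind of workflow semi-automated.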


Inconsistent XML as a barrier to reuse of Open Access Content

Proceedings of the Impromptu JATS User Group Meeting ◽

10.4242/balisagevol12.mietchen01 ◽

2014 ◽

Author(s):

Daniel Mietchen ◽

Chris Maloney ◽

Nils Dagsson Moskopp

Keyword(s):

Open Access ◽

Text Mining ◽

Large Scale ◽

Specific Content ◽

Current State ◽

The Media ◽

Media Types ◽

Machine Readable

In this paper, we will describe the current state of some of the tagging of articles within the PMC Open Access subset. As a case study, we will use our experiences developing the Open Access Media Importer, a tool to harvest content from the OA subset for automated upload to Wikimedia Commons. Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the media types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements had the greatest impact, requiring us to implement text mining-like algorithms in order to accurately determine whether or not specific content was compatible with reuse on Wikimedia Commons. Besides presenting examples of incorrectly tagged XML from a range of publishers, we will also explore past and current efforts towards standardization of license tagging, and we will describe a set of recommendations related to tagging practices of certain data, to ensure that it is both compatible with existing standards, and consistent and machine-readable.


Implicit Knowledge Discovery in Design Semantic Network by Applying Pythagorean Means on Shortest Path Searching

Volume 1: 37th Computers and Information in Engineering Conference ◽

10.1115/detc2017-67230 ◽

2017 ◽

Cited By ~ 1

Author(s):

Feng Shi ◽

Liuqing Chen ◽

Ji Han ◽

Peter Childs

Keyword(s):

Text Mining ◽

Knowledge Discovery ◽

Language Processing ◽

Shortest Path ◽

Large Scale ◽

Semantic Network ◽

Semantic Networks ◽

Implicit Knowledge ◽

Implicit Associations ◽

Correlation Degree

With the advent of the big-data era, the massive textual information stored in electronic and digital documents has become a valuable resource for knowledge discovery in the fields of design and engineering. Ontology technologies and semantic networks have been widely applied with text mining techniques, including Natural Language Processing (NLP), to extract structured knowledge associations from large-scale unstructured textual data. However, most existing works focus on how to construct semantic networks by developing various text mining methods, such as statistical and semantic approaches, while few studies focus on how to subsequently analyze and fully utilize the already well-established semantic networks. In this paper, a specific network analysis method is proposed to discover implicit knowledge associations in an existing semantic network, with the aim of improving knowledge discovery and design innovation. Pythagorean means are applied with Dijkstra’s shortest path algorithm to discover the implicit knowledge associations either around a single knowledge concept or between two concepts. Six criteria are established to evaluate and rank the correlation degree of the implicit associations. Two engineering case studies were conducted to illustrate the proposed knowledge discovery process, and the results showed the effectiveness of the retrieved implicit knowledge associations in helping to provide relevant knowledge from various aspects and in provoking creative ideas for engineering innovation.
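
The abstract does not specify how the Pythagorean means are combined with Dijkstra's algorithm. As a minimal sketch under that caveat, the Python below runs a standard Dijkstra search on a toy weighted graph and then summarises the edge weights along the resulting path with the arithmetic, geometric and harmonic means. The graph, its node names, its weights and the function names are hypothetical, and the six ranking criteria mentioned in the abstract are not reproduced.

```python
import heapq
from statistics import fmean, geometric_mean, harmonic_mean


def dijkstra_path(graph, source, target):
    """Return the minimum-total-weight path as a list of nodes, or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    queue = [(0.0, source)]
    visited = set()
    while queue:
        d, node = heapq.heappop(queue)
        if node in visited:
            continue  # stale queue entry for an already settled node
        visited.add(node)
        if node == target:
            break
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                prev[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))
    if target not in dist:
        return None
    # Walk the predecessor links back from target to source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def pythagorean_means(graph, path):
    """Summarise the edge weights on a path (needs at least one edge, positive weights)."""
    weights = [graph[a][b] for a, b in zip(path, path[1:])]
    return {
        "arithmetic": fmean(weights),
        "geometric": geometric_mean(weights),
        "harmonic": harmonic_mean(weights),
    }


# Hypothetical semantic network: smaller weight = stronger association.
toy_graph = {
    "gear": {"torque": 1.0, "motor": 4.0},
    "torque": {"motor": 2.0},
    "motor": {},
}
```

For positive weights the three means always satisfy harmonic ≤ geometric ≤ arithmetic, so the spread between them gives a quick indication of how uneven the associations along a path are.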


Developing a Data-Literate Workforce through BLUE: Biodiversity Literacy in Undergraduate Education

Biodiversity Information Science and Standards ◽

10.3897/biss.3.37339 ◽

2019 ◽

Vol 3 ◽

Cited By ~ 2

Author(s):

Elizabeth R. Ellwood ◽

Anna Monfils ◽

Lisa White ◽

Debra Linton ◽

Natalie Douglas ◽

...

Keyword(s):

Undergraduate Education ◽

Disease Transmission ◽

Large Scale ◽

Science Curriculum ◽

Ecological Data ◽

Biodiversity Data ◽

Undergraduate Biology ◽

Data Literacy ◽

Biodiversity Science ◽

Science Community

The biodiversity sciences have experienced a rapid mobilization of data that has increased our capacity to investigate large-scale issues of critical importance (e.g., climate change and its impacts, zoonotic disease transmission, sustainable resource management, impacts of invasive species, and biodiversity loss). Several initiatives are underway to aggregate and mobilize these biodiversity, environmental, and ecological data resources (iDigBio, NEON, GBIF, iNaturalist, etc.). This requires a new set of skills for the 21st century biodiversity scientist, who is required to be fluent in integrative fields spanning evolutionary biology, systematics, ecology, geology, and environmental science and to possess the quantitative, computational, and data skills to conduct research using large and complex datasets. The biodiversity science community has recognized a need to unite biodiversity and data sciences and improve data literacy in the emerging science workforce. The NSF-funded Biodiversity Literacy in Undergraduate Education (BLUE; biodiversityliteracy.com) is working to bridge a gap that currently exists between efforts to promote data literacy pre-college and professional development for those pursuing careers in biodiversity science. The BLUE network is developing strategies and materials to infuse biodiversity data into the core of the undergraduate science curriculum, facilitating broad-scale adoption of biodiversity data literacy competencies, and improving undergraduate biology training to meet increasing workforce demands in data and biodiversity sciences.
The BLUE network has four major goals: 1) cultivate a diverse and inclusive network of biodiversity researchers, data scientists, and biology educators focused on undergraduate data-centric biodiversity education; 2) build community consensus on core biodiversity data literacy competencies; 3) develop strategies and exemplar materials to guide the integration of biodiversity data literacy competencies into introductory undergraduate biology curricula; and 4) extend the network to engage a broader community of undergraduate educators in biodiversity data literacy efforts. The BLUE community continues to grow and build new partnerships and initiatives across the biodiversity science community. In year two of the BLUE network we have been focusing our efforts on building the community, developing and disseminating exemplar educational materials, and defining core biodiversity data literacy skills and competencies. We will present our current and ongoing work and ways in which members of the biodiversity_next community can be involved in shaping the biodiversity science of the future, while addressing the needs of a changing planet.


A compendium of monocyte transcriptome datasets to foster biomedical knowledge discovery

F1000Research ◽

10.12688/f1000research.8182.1 ◽

2016 ◽

Vol 5 ◽

pp. 291 ◽

Cited By ~ 1

Author(s):

Darawan Rinchai ◽

Sabri Boughorbel ◽

Scott Presnell ◽

Charlie Quinn ◽

Damien Chaussabel

Keyword(s):

Knowledge Discovery ◽

Web Application ◽

Large Scale ◽

Contextual Information ◽

Human Monocyte ◽

Biomedical Knowledge ◽

Relevant Group ◽

Public Repositories ◽

Data Browsing ◽

Public Datasets

Systems-scale profiling approaches have become widely used in translational research settings. The resulting accumulation of large-scale datasets in public repositories represents a critical opportunity to promote insight and foster knowledge discovery. However, resources that can serve as an interface between biomedical researchers and such vast and heterogeneous dataset collections are needed in order to fulfill this potential. Recently, we have developed an interactive data browsing and visualization web application, the Gene Expression Browser (GXB). This tool can be used to overlay deep molecular phenotyping data with rich contextual information about analytes, samples and studies along with ancillary clinical or immunological profiling data. In this note, we describe a curated compendium of 93 public datasets generated in the context of human monocyte immunological studies, representing a total of 4,516 transcriptome profiles. Datasets were uploaded to an instance of GXB along with study description and sample annotations. Study samples were arranged in different groups. Ranked gene lists were generated based on relevant group comparisons. This resource is publicly available online at http://monocyte.gxbsidra.org/dm3/landing.gsp.


Measurement and analysis of interspecific spatial associations as a facet of biodiversity

10.32942/osf.io/7z8f9 ◽

2019 ◽

Author(s):

Petr Keil ◽

Thorsten Wiegand ◽

Anikó B. Tóth ◽

Daniel J. McGlinn ◽

Jonathan Chase

Keyword(s):

Large Scale ◽

Interspecific Interactions ◽

Simulated Data ◽

Biotic Interactions ◽

Anthropogenic Pressures ◽

Spatial Associations ◽

Biodiversity Science ◽

The Face ◽

Beta And Gamma Diversity ◽

Implicit And Explicit

Interspecific spatial associations (ISA), which include co-occurrences, segregations, or attractions among two or more species, can provide important insights into the spatial structuring of communities. However, ISA has primarily been examined in the context of understanding interspecific interactions, while other aspects of ISA, including its relations to other biodiversity facets and how it changes in the face of anthropogenic pressures, have been largely neglected. This is likely because it is unclear what makes ISA useful in a biodiversity context, little is known about the theoretical connections between ISA and other biodiversity facets, and there is a confusing variety of approaches to measuring ISA. Here, we first review the metrics of ISA. These include both spatially implicit and explicit indices of association for both binary and abundance data. We test and compare these approaches on empirical and simulated data, and we provide specific recommendations for how to use and interpret them in biodiversity science. We argue that measurements of ISA are more informative when they are spatially explicit (i.e. distance dependent). We then review links of ISA to other classical biodiversity facets, such as alpha, beta, and gamma diversity, and show that they mostly fail to reflect changes/variation in ISA, with the exception of average pair-wise beta diversity. This underscores the need for a specific focus on ISA in large-scale biodiversity assessments. Finally, we argue that there are important, and underappreciated, reasons to study ISA that are unrelated to its link to biotic interactions. Specifically, ISA can provide strong tests of biodiversity theories that require multiple patterns to benchmark against, and it can be explored for potentially predictive macroecological patterns.


Construction Disputes and Associated Contractual Knowledge Discovery Using Unstructured Text-Heavy Data: Legal Cases in the United Kingdom

Sustainability ◽

10.3390/su13169403 ◽

2021 ◽

Vol 13 (16) ◽

pp. 9403

Author(s):

JeeHee Lee ◽

Youngjib Ham ◽

June-Seong Yi

Keyword(s):

Text Mining ◽

Knowledge Discovery ◽

Construction Projects ◽

Large Scale ◽

Lessons Learned ◽

Final Decision ◽

Legal Cases ◽

Construction Disputes ◽

The United Kingdom ◽

Pairwise Correlations

Construction disputes are one of the main challenges to successful construction projects. Most construction parties experience claims—and even worse, disputes—which are costly and time-consuming to resolve. Lessons learned from past failure cases can help reduce potential future risk factors that are likely to lead to disputes. In particular, accumulated case law is valuable information, providing useful insights to prepare for future disputes. However, few efforts have been made to discover legal knowledge from large collections of case law in the construction field. The aim of this paper is to enhance understanding of the multifaceted legal issues surrounding construction adjudication using large amounts of accumulated construction legal cases. This goal is achieved by exploring dispute-related contract terms and conditions that affect judicial decisions based on their verdicts. This study builds on text mining methods to examine which types of contract conditions are frequently referenced in the final decision of each dispute. Various text mining techniques are leveraged for knowledge discovery (i.e., analyzing frequent terms, discovering pairwise correlations, and identifying potential topics) in text-heavy data. The findings show that (1) similar patterns of disputes have occurred repeatedly in construction-related legal cases and (2) the discovered dispute topics indicate that mutually agreed upon contract terms and conditions are important in dispute resolution.
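
The abstract names frequent-term analysis and pairwise correlations without saying which statistics were used. One common choice, shown here purely as an assumption, is to count term occurrences and to measure how strongly two terms co-occur across documents with the phi coefficient. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(doc):
    """Lowercase a document and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


def term_frequencies(docs):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def phi_coefficient(docs, term_a, term_b):
    """Phi coefficient between the presence of two terms across documents."""
    n11 = n10 = n01 = n00 = 0
    for doc in docs:
        words = set(tokenize(doc))
        a, b = term_a in words, term_b in words
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Undefined when a term is present in all or none of the documents; report 0.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

A value near 1 means the two terms tend to appear in the same documents, a value near -1 means they tend to exclude each other, and a value near 0 means no linear association.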


The Use of Medical Record Linkage for Population and Genetic Studies

Methods of Information in Medicine ◽

10.1055/s-0038-1635962 ◽

1969 ◽

Vol 08 (01) ◽

pp. 07-11 ◽

Cited By ~ 9

Author(s):

H. B. Newcombe

Keyword(s):

Record Linkage ◽

Large Scale ◽

Medical Record Linkage ◽

Canadian Province ◽

Genetic Studies ◽

Parental Characteristics ◽

Family Histories ◽

The Family ◽

Large Populations ◽

Machine Readable

Methods are described for deriving personal and family histories of birth, marriage, procreation, ill health and death, for large populations, from existing civil registrations of vital events and the routine records of ill health. Computers have been used to group together and »link« the separately derived records pertaining to successive events in the lives of the same individuals and families, rapidly and on a large scale. Most of the records employed are already available as machine readable punchcards and magnetic tapes, for statistical and administrative purposes, and only minor modifications have been made to the manner in which these are produced. As applied to the population of the Canadian province of British Columbia (currently about 2 million people) these methods have already yielded substantial information on the risks of disease: a) in the population, b) in relation to various parental characteristics, and c) as correlated with previous occurrences in the family histories.
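
The idea of grouping records that pertain to the same individual can be illustrated with a deliberately simplified Python sketch. It links records deterministically on a normalised key and orders each person's events by year. The field names (`surname`, `given`, `birth_year`, `event_year`) and the exact-match rule are assumptions for this example; the abstract does not describe the matching procedure the author used, and it is not reproduced here.

```python
from collections import defaultdict


def link_key(record):
    """Build a normalised key; exact matching on it is a simplifying assumption."""
    return (
        record["surname"].strip().lower(),
        record["given"].strip().lower(),
        record["birth_year"],
    )


def link_records(records):
    """Group event records by person key and sort each history chronologically."""
    histories = defaultdict(list)
    for record in records:
        histories[link_key(record)].append(record)
    for events in histories.values():
        events.sort(key=lambda r: r["event_year"])
    return dict(histories)
```

Exact-key matching breaks down when names are misspelled or dates are misrecorded, so treat this only as a picture of the grouping step.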


Investigating Diseases and Chemicals in COVID-19 Literature with Text Mining (Preprint)

10.2196/preprints.21503 ◽

2020 ◽

Author(s):

Amir Karami ◽

Brandon Bookstaver ◽

Melissa Nolan

Keyword(s):

Text Mining ◽

Literature Review ◽

Topic Modeling ◽

Large Scale ◽

Clinical Manifestations ◽

International Health ◽

Research Papers ◽

Strategic Plans ◽

Funding Agencies ◽

The Relationship

BACKGROUND The COVID-19 pandemic has impacted nearly all aspects of life and has posed significant threats to international health and the economy. Given the rapidly unfolding nature of the current pandemic, there is an urgent need to streamline literature synthesis of the growing scientific research to elucidate targeted solutions. While traditional systematic literature review studies provide valuable insights, these studies have restrictions, including analyzing a limited number of papers, having various biases, being time-consuming and labor-intensive, focusing on a few topics, being incapable of trend analysis, and lacking data-driven tools. OBJECTIVE This study addresses these restrictions in the literature and practice by analyzing two biomedical concepts, clinical manifestations of disease and therapeutic chemical compounds, with text mining methods in a corpus containing COVID-19 research papers, and by finding associations between the two biomedical concepts. METHODS This research has collected papers representing COVID-19 pre-prints and peer-reviewed research published in 2020. We used frequency analysis to find highly frequent manifestations and therapeutic chemicals, representing the importance of the two biomedical concepts. This study also applied topic modeling to find the relationship between the two biomedical concepts. RESULTS We analyzed 9,298 research papers published through May 5, 2020 and found 3,645 disease-related and 2,434 chemical-related articles. The most frequent clinical manifestations of disease terminology included COVID-19, SARS, cancer, pneumonia, fever, and cough. The most frequent chemical-related terminology included Lopinavir, Ritonavir, Oxygen, Chloroquine, Remdesivir, and water. Topic modeling provided 25 categories showing relationships between our two overarching categories. These categories represent statistically significant associations between multiple aspects of each category, some connections of which were novel and not previously identified by the scientific community. CONCLUSIONS Appreciation of this context is vital due to the lack of a systematic large-scale literature review survey and the importance of fast literature review during the current COVID-19 pandemic for developing treatments. This study is beneficial to researchers for obtaining a macro-level picture of the literature, to educators for knowing the scope of the literature, to journals for exploring the most discussed disease symptoms and pharmaceutical targets, and to policymakers and funding agencies for creating scientific strategic plans regarding COVID-19.


CFTR Lifecycle Map—A Systems Medicine Model of CFTR Maturation to Predict Possible Active Compound Combinations

International Journal of Molecular Sciences ◽

10.3390/ijms22147590 ◽

2021 ◽

Vol 22 (14) ◽

pp. 7590

Author(s):

Liza Vinhoven ◽

Frauke Stanke ◽

Sylvia Hafkemeyer ◽

Manuel Manfred Nietert

Keyword(s):

Large Scale ◽

Synergistic Effects ◽

Small Scale ◽

Systems Medicine ◽

Promising Candidate ◽

Cftr Mutations ◽

Machine Readable ◽

High Throughput Screens ◽

Readable Format ◽

Machine Readable Format

Different causative therapeutics for CF patients have been developed. There are still no mutation-specific therapeutics for some patients, especially those with rare CFTR mutations. For this purpose, high-throughput screens have been performed which result in various candidate compounds, with mostly unclear modes of action. In order to elucidate the mechanism of action for promising candidate substances and to be able to predict possible synergistic effects of substance combinations, we used a systems biology approach to create a model of the CFTR maturation pathway in cells in a standardized, human- and machine-readable format. It is composed of a core map, manually curated from small-scale experiments in human cells, and a coarse map including interactors identified in large-scale efforts. The manually curated core map includes 170 different molecular entities and 156 reactions from 221 publications. The coarse map encompasses 1384 unique proteins from four publications. The overlap between the two data sources amounts to 46 proteins. The CFTR Lifecycle Map can be used to support the identification of potential targets inside the cell and elucidate the mode of action for candidate substances. It thereby provides a backbone to structure available data as well as a tool to develop hypotheses regarding novel therapeutics.


Modelling Tree Growth in Monospecific Forests from Forest Inventory Data

Forests ◽

10.3390/f12060753 ◽

2021 ◽

Vol 12 (6) ◽

pp. 753

Author(s):

Guadalupe Sáez-Cano ◽

Marcos Marvá ◽

Paloma Ruiz-Benito ◽

Miguel A. Zavala

Keyword(s):

Tree Growth ◽

Large Scale ◽

Temporal Dynamics ◽

Carbon Sink ◽

Size Variation ◽

Biotic Interactions ◽

Richards Equation ◽

Tree Level ◽

Forest Inventories ◽

Large Scale Data

The prediction of tree growth is key to further understanding the carbon sink role of forests and the short-term capacity of forests for climate change mitigation. In this work, we used large-scale data available from three consecutive forest inventories in a Euro-Mediterranean region and the Bertalanffy–Chapman–Richards equation to model up to a decade’s tree size variation in monospecific forests in the growing stages. We showed that a tree-level fitting with ordinary differential equations can be used to forecast tree diameter growth across time and space as a function of environmental characteristics and initial size. This modelling approximation was applied at different aggregation levels to monospecific regions with forest inventories to predict trends in aboveground tree biomass stocks. Furthermore, we showed that this model accurately forecasts tree growth temporal dynamics as a function of size and environmental conditions. Further research to provide longer-term predictions of forest stock dynamics in a wide variety of forests should model regeneration and mortality processes and biotic interactions.
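
The abstract does not give the parametrisation the authors fitted. A widely used general form of the Bertalanffy–Chapman–Richards growth model is the ordinary differential equation dD/dt = a·D^m − b·D with 0 < m < 1, which has a closed-form solution via the substitution u = D^(1−m). The Python sketch below assumes this form, with hypothetical parameter values and function names, and compares the closed-form diameter with a simple Euler integration of the same equation. The dependence of the parameters on environmental conditions described in the abstract is not modelled.

```python
import math


def growth_rate(d, a, b, m):
    """Right-hand side of dD/dt = a * D**m - b * D (assumed general form)."""
    return a * d**m - b * d


def diameter_closed_form(d0, t, a, b, m):
    """Closed-form diameter at time t, starting from diameter d0 at t = 0."""
    asymptote = a / b  # limiting value of u = D**(1 - m)
    u = asymptote - (asymptote - d0 ** (1 - m)) * math.exp(-(1 - m) * b * t)
    return u ** (1 / (1 - m))


def diameter_euler(d0, t, a, b, m, steps=10_000):
    """Forward-Euler integration of the same equation, for cross-checking."""
    dt = t / steps
    d = d0
    for _ in range(steps):
        d += dt * growth_rate(d, a, b, m)
    return d


# Hypothetical parameters: asymptotic diameter is (a / b) ** (1 / (1 - m)) = 36.
A, B, M = 0.6, 0.1, 0.5
```

With these illustrative values, a tree starting at diameter 10 grows towards the asymptote of 36, and the two functions should agree closely for a ten-year horizon.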
