Estimating the scale of biomedical data generation using text mining

2017 · Author(s): Gabriel Rosenfeld, Dawei Lin

Abstract
While the impact of biomedical research has traditionally been measured using bibliographic metrics such as citations or journal impact factor, the data itself is an output that can be directly measured to provide additional context about a publication’s impact. Data are a resource that can be repurposed and reused, providing dividends on the original investment that supported the primary work. Moreover, data are the cornerstone upon which a tested hypothesis is rejected or accepted and specific scientific conclusions are reached. Understanding how and where data are being produced enhances the transparency and reproducibility of the biomedical research enterprise. Most biomedical data are not deposited directly in data repositories; instead, they appear within the publication itself as figures or attachments, making them hard to measure. We attempted to address this challenge by using recent advances in word embedding to identify the technical and methodological features of terms used in the free text of articles’ methods sections. We created term usage signatures for five types of biomedical research data, which were used in univariate clustering to correctly identify a large fraction of positive control articles and a set of manually annotated articles for which generation of the data types could be validated. The approach was then used to estimate the fraction of PLOS articles generating each biomedical data type over time. Of all PLOS articles analyzed (n = 129,918), ~7%, 19%, 12%, 18%, and 6% generated flow cytometry, immunoassay, genomic microarray, microscopy, and high-throughput sequencing data, respectively. The estimate portends a vast amount of biomedical data being produced: if other publishers generated a similar amount of data, the roughly 40,000 NIH-funded research articles published in 2016 would have produced ~56,000 datasets of the five data types we analyzed.
One Sentence Summary
Application of a word-embedding model trained on the methods sections of research articles allows the production of diverse biomedical data types to be estimated using text mining.
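
As an illustration of the approach described above, the following sketch builds a term-usage "signature" as the average embedding of a few seed terms and scores an article's methods section against it by cosine similarity. It is a minimal reconstruction using the gensim library; the corpus, seed terms, and scoring are toy assumptions, not the authors' actual pipeline or parameters.

```python
# Minimal sketch: term-usage signatures from word embeddings (gensim).
# Corpus and seed terms are illustrative placeholders.
import numpy as np
from gensim.models import Word2Vec

methods_sections = [
    "cells were stained and analyzed by flow cytometry".split(),
    "rna was hybridized to a genomic microarray chip".split(),
]  # stand-in for PLOS methods sections

model = Word2Vec(methods_sections, vector_size=100, window=5, min_count=1)

def mean_vector(terms):
    """Average the embeddings of the terms present in the vocabulary."""
    vecs = [model.wv[t] for t in terms if t in model.wv]
    return np.mean(vecs, axis=0)

flow_signature = mean_vector(["flow", "cytometry", "stained"])  # hypothetical seeds

def score(tokens, signature):
    """Cosine similarity between an article's mean vector and a signature."""
    v = mean_vector(tokens)
    return float(np.dot(v, signature) / (np.linalg.norm(v) * np.linalg.norm(signature)))

print(score(methods_sections[0], flow_signature))
```

In the study proper, per-article scores of this kind would then feed the univariate clustering step that decides which articles generated each data type.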

2020 · Author(s): Halima Alachram, Hryhorii Chereda, Tim Beißbarth, Edgar Wingender, Philip Stegmaier

Abstract
Biomedical and life science literature is an essential medium for publishing experimental results. With the rapid growth in the number of new publications, the amount of scientific knowledge represented in free text is increasing remarkably. There has been much interest in developing techniques that can extract this knowledge and make it accessible, to aid scientists in discovering new relationships between biological entities and answering biological questions. Using the word2vec approach, we generated word vector representations based on a corpus of over 16 million PubMed abstracts. We developed a text mining pipeline to produce word2vec embeddings with different properties and performed validation experiments to assess their utility for biomedical analysis. An important pre-processing step was the substitution of synonymous terms with their preferred terms from biomedical databases. Furthermore, we extracted gene-gene networks from two embedding versions and used them as prior knowledge to train graph convolutional neural networks (Graph-CNNs) on breast cancer gene expression data to predict the occurrence of metastatic events. The performance of the resulting models was compared to that of Graph-CNNs trained with protein-protein interaction (PPI) networks or with networks derived using other word embedding algorithms. We also assessed the effect of corpus size on the variability of word representations. Finally, we created a web service with a graphical and a RESTful interface to extract and explore relations between biomedical terms using annotated embeddings. Comparisons to biological databases showed that relations between entities such as known PPIs, signaling pathways and cellular functions, or narrower disease ontology groups correlated with higher cosine similarity. Graph-CNNs trained with word2vec-embedding-derived networks performed best on the metastatic event prediction task compared to the other networks. Word representations produced by text mining algorithms like word2vec therefore capture biologically meaningful relations between entities.
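
The pre-processing step the authors highlight, replacing synonyms with preferred terms before training, can be sketched as below. The synonym map and two-abstract corpus are toy stand-ins (the real pipeline used biomedical databases and over 16 million PubMed abstracts); gensim's word2vec implementation is assumed.

```python
# Sketch: synonym substitution before word2vec training, then a cosine lookup.
import re
from gensim.models import Word2Vec

synonyms = {"erbb2": "her2", "neu": "her2"}  # hypothetical preferred-term map

def normalize(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [synonyms.get(t, t) for t in tokens]  # map each token to its preferred term

abstracts = [
    "ERBB2 amplification drives HER2-positive breast cancer",
    "Trastuzumab targets HER2 in metastatic tumors",
]
corpus = [normalize(a) for a in abstracts]

model = Word2Vec(corpus, vector_size=50, window=5, min_count=1)
print(model.wv.similarity("her2", "trastuzumab"))  # cosine similarity between entities
```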


2019 · Author(s): Hsien-Liang Huang, Yun-Cheng Tsai, Shi-Hao Hong, Ya-Mei Hsueh

BACKGROUND Smoking is a complex behavior associated with multiple factors such as personality, environment, genetics, and emotions. Free-text data is a rich source of information; however, extracting and applying the information it contains requires substantial human effort and time, so many details go undiscovered and unused. OBJECTIVE This study proposes a novel text mining workflow to capture the tobacco-quitting behavior of smokers from their free-text medical records and, more importantly, to explore the impact of these behavioral changes on smokers, with the goal of helping smokers quit. To that end, the paper develops an algorithm for analyzing smoking cessation treatment plans documented in free-text medical records. METHODS The approach centers on an information extraction workflow that combines several data mining techniques, including text mining. It can be applied not only to smoking cessation but also to other medical records with similar data elements. RESULTS The most visible areas for medical application of text mining are the integration and transfer of advances made in the basic sciences, as well as a better understanding of the processes involved in smoking cessation. CONCLUSIONS Text mining may also be useful for supporting decision-making processes associated with smoking cessation.
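
The abstract does not specify the extraction rules, so the following is only a generic sketch of pulling structured smoking-cessation details out of a free-text note with keyword and regular-expression patterns; the note, field names, and patterns are all invented for illustration.

```python
# Generic sketch: rule-based extraction from a free-text clinical note.
import re

note = "Patient smokes 10 cigarettes/day; started varenicline, quit date 2019-03-01."

patterns = {  # hypothetical fields and patterns
    "cigarettes_per_day": r"(\d+)\s*cigarettes?/day",
    "medication": r"(varenicline|bupropion|nicotine patch)",
    "quit_date": r"quit date\s*(\d{4}-\d{2}-\d{2})",
}

record = {field: (m.group(1) if (m := re.search(rx, note, re.I)) else None)
          for field, rx in patterns.items()}
print(record)
```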


2017 · Vol 3 · pp. e121 · Author(s): Bahar Sateli, Felicitas Löffler, Birgitta König-Ries, René Witte

Motivation Scientists increasingly rely on intelligent information systems to help them in their daily tasks, in particular for managing research objects, like publications or datasets. The relatively young research field of Semantic Publishing has been addressing the question of how scientific applications can be improved through semantically rich representations of research objects, in order to facilitate their discovery and re-use. To complement the efforts in this area, we propose an automatic workflow to construct semantic user profiles of scholars, so that scholarly applications, like digital libraries or data repositories, can better understand their users’ interests, tasks, and competences by incorporating these user profiles in their design. To make the user profiles sharable across applications, we propose to build them on standard semantic web technologies, in particular the Resource Description Framework (RDF) for representing user profiles and Linked Open Data (LOD) sources for representing competence topics. To avoid the cold start problem, we suggest automatically populating these profiles by analyzing the publications (co-)authored by users, which we hypothesize reflect their research competences. Results We developed a novel approach, ScholarLens, which can automatically generate semantic user profiles for authors of scholarly literature. For modeling the competences of scholarly users and groups, we surveyed a number of existing linked open data vocabularies. In accordance with LOD best practices, we propose an RDF Schema (RDFS) based model for competence records that reuses existing vocabularies where appropriate. To automate the creation of semantic user profiles, we developed a complete, automated workflow that can generate semantic user profiles by analyzing full-text research articles through various natural language processing (NLP) techniques. In our method, we start by processing a set of research articles for a given user. Competences are derived by text mining the articles through syntactic, semantic, and LOD entity linking steps. We then populate a knowledge base in RDF format with user profiles containing the extracted competences. We implemented our approach as an open source library and evaluated our system through two user studies, resulting in a mean average precision (MAP) of up to 95%. As part of the evaluation, we also analyze the impact of the semantic zoning of research articles on the accuracy of the resulting profiles. Finally, we demonstrate how these semantic user profiles can be applied in a number of use cases, including article ranking for personalized search and finding scientists competent in a topic (e.g., to find reviewers for a paper). Availability All software and datasets presented in this paper are available under open source licenses in the supplements and documented at http://www.semanticsoftware.info/semantic-user-profiling-peerj-2016-supplements. Additionally, development releases of ScholarLens are available on our GitHub page: https://github.com/SemanticSoftwareLab/ScholarLens.
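
A competence record of the kind described can be sketched in RDF with the rdflib library. The namespace and property names below are illustrative assumptions, not the RDFS model proposed in the paper; only the pattern of linking a competence topic to an LOD entity follows the text.

```python
# Sketch: an RDF competence record linking a scholar to an LOD topic (rdflib).
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/profile/")    # hypothetical schema namespace
DBR = Namespace("http://dbpedia.org/resource/")  # LOD source for competence topics

g = Graph()
user = EX["scholar42"]
competence = EX["competence1"]

g.add((user, RDF.type, EX.Scholar))
g.add((user, EX.hasCompetence, competence))
g.add((competence, EX.topic, DBR["Text_mining"]))  # topic as an LOD entity
g.add((competence, EX.confidence, Literal(0.95)))  # e.g., derived from NLP evidence

print(g.serialize(format="turtle"))
```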


2016 · Vol 55 (04) · pp. 347-355 · Author(s): Klaus Kuhn, Fabian Prasser, Florian Kohlmayer

Summary Background: Data sharing is a central aspect of modern biomedical research. It is accompanied by significant privacy concerns, and data often needs to be protected from re-identification. With de-identification methods, datasets can be transformed in such a way that it becomes extremely difficult to link their records to identified individuals. The most important challenge in this process is to find an adequate balance between the increase in privacy and the decrease in data quality. Objectives: Accurately measuring the risk of re-identification in a specific data sharing scenario is an important aspect of data de-identification. Overestimating risks will significantly deteriorate data quality, while underestimating them will leave data prone to attacks on privacy. Several models have been proposed for measuring risks, but there is a lack of generic methods for risk-based data de-identification. The aim of the work described in this article was to bridge this gap and to show how the quality of de-identified datasets can be improved by using risk models to tailor the process of de-identification to a concrete context. Methods: We implemented a generic de-identification process and several models for measuring re-identification risks in the ARX de-identification tool for biomedical data. By integrating the methods into an existing framework, we were able to automatically transform datasets in such a way that information loss is minimized while ensuring that re-identification risks meet a user-defined threshold. We performed an extensive experimental evaluation to analyze how different risk models, and different assumptions about the goals and background knowledge of an attacker, affect the quality of de-identified data. Results: The results of our experiments show that data quality can be improved significantly by using risk models for data de-identification. On a scale where 100% represents the original input dataset and 0% represents a dataset from which all information has been removed, the loss of information content could be reduced by up to 10% when protecting datasets against strong adversaries and by up to 24% when protecting datasets against weaker adversaries. Conclusions: The methods studied in this article are well suited for protecting sensitive biomedical data, and our implementation is available as open-source software. Our results can be used by data custodians to increase the information content of de-identified data by tailoring the process to a specific data sharing scenario. Improving data quality is important for fostering the adoption of de-identification methods in biomedical research.
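
One simple risk model behind such methods is the prosecutor model, in which each record's re-identification risk is the reciprocal of the size of its equivalence class over the quasi-identifiers. The pandas sketch below illustrates the idea conceptually; it is not ARX's actual (Java) API, and the data and threshold are invented.

```python
# Sketch: prosecutor re-identification risk = 1 / equivalence-class size.
import pandas as pd

df = pd.DataFrame({
    "age_group": ["30-39", "30-39", "40-49", "40-49", "40-49"],
    "zip3":      ["123",   "123",   "456",   "456",   "789"],
    "diagnosis": ["A", "B", "A", "C", "B"],  # sensitive attribute, not a quasi-identifier
})

quasi_identifiers = ["age_group", "zip3"]
class_size = df.groupby(quasi_identifiers)["zip3"].transform("size")
df["risk"] = 1.0 / class_size

threshold = 0.5  # user-defined maximum acceptable risk
print(df[df["risk"] > threshold])  # records still needing generalization or suppression
```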


GigaScience · 2021 · Vol 10 (9) · Author(s): Jaclyn Smith, Yao Shi, Michael Benedikt, Milos Nikolic

Abstract Background Targeted diagnosis and treatment options depend on insights drawn from the multi-modal analysis of large-scale biomedical datasets. Advances in genomics sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough view of the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex data types. Solution To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficult parts of designing distributed analyses over complex biomedical data types. Performance We outline research and clinical applications of the platform, including data integration support for building feature sets for classification. We show that the system outperforms the common alternative, based on “flattening” complex data structures, and runs efficiently in cases where alternative approaches fail to run at all.
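
To make the "flattening" alternative concrete, the toy example below shows a nested, multi-modal patient record, the duplication that flattening introduces, and a feature map built directly from the nested form. The record structure is invented and unrelated to TraNCE's actual compilation framework.

```python
# Toy contrast: flattening a nested record vs. processing it natively.
patients = [
    {"id": "p1",
     "variants": [{"gene": "TP53", "vaf": 0.31}, {"gene": "BRCA1", "vaf": 0.12}],
     "images": [{"modality": "MRI", "tumor_volume": 4.2}]},
]

# Flattening duplicates the parent once per nested element (and grows with each join):
flat = [{"id": p["id"], "gene": v["gene"], "vaf": v["vaf"]}
        for p in patients for v in p["variants"]]

# A feature vector built directly from the nested structure:
features = {p["id"]: {v["gene"]: v["vaf"] for v in p["variants"]} for p in patients}

print(flat)
print(features)
```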


2016 · Author(s): L Ohno-Machado, SA Sansone, G Alter, I Fore, J Grethe, ...

Abstract
The value of broadening searches for data across multiple repositories has been recognized by the biomedical research community. As part of the NIH Big Data to Knowledge initiative, we work with an international community of researchers, service providers, and knowledge experts to develop and test a data index and search engine, which are based on metadata extracted from datasets in a range of repositories. DataMed is designed to be, for data, what PubMed has been for the scientific literature. DataMed supports the Findability and Accessibility of datasets. These characteristics, along with Interoperability and Reusability, make up the four FAIR principles that facilitate knowledge discovery in today’s big-data-intensive science landscape.
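
The Findability/Accessibility point can be illustrated with a minimal metadata index: keywords map to dataset identifiers, and each hit carries the repository needed to retrieve the data. The record fields are invented for illustration and do not reflect DataMed's metadata schema.

```python
# Sketch: a keyword-to-dataset inverted index over repository metadata.
datasets = [
    {"id": "ds1", "title": "RNA-seq of human liver tissue",
     "repository": "GEO", "keywords": ["rna-seq", "liver", "human"]},
    {"id": "ds2", "title": "Flow cytometry of murine T cells",
     "repository": "FlowRepository", "keywords": ["flow cytometry", "t cell"]},
]

index = {}
for ds in datasets:
    for kw in ds["keywords"]:
        index.setdefault(kw, []).append(ds["id"])  # Findability

by_id = {ds["id"]: ds for ds in datasets}
for hit in index.get("liver", []):
    print(by_id[hit]["title"], "->", by_id[hit]["repository"])  # Accessibility
```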


F1000Research · 2020 · Vol 8 · pp. 1430 · Author(s): Vivek Navale, Michele Ji, Olga Vovk, Leonie Misquitta, Tsega Gebremichael, ...

The Biomedical Research Informatics Computing System (BRICS) was developed to support multiple disease-focused research programs. Seven service modules are integrated to provide a collaborative and extensible web-based environment. The modules (Data Dictionary, Account Management, Query Tool, Protocol and Form Research Management System, Meta Study, Data Repository, and Globally Unique Identifier) facilitate the management of research protocols and the submission, processing, curation, access, and storage of clinical, imaging, and derived genomics data within the associated data repositories. Multiple instances of BRICS are deployed to support various biomedical research communities focused on accelerating discoveries for rare diseases, Traumatic Brain Injury, Parkinson’s Disease, inherited eye diseases, and symptom science research. No Personally Identifiable Information is stored within the data repositories. Digital Object Identifiers are associated with the research studies. The reusability of biomedical data is enhanced by Common Data Elements (CDEs), which enable the systematic collection, analysis, and sharing of data. The use of CDEs with a service-oriented informatics architecture enabled the development of disease-specific repositories that support hypothesis-based biomedical research.
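
Two of the ingredients named above, a globally unique identifier per subject and Common Data Elements that constrain collected values, can be sketched as follows. The CDE definition and field names are invented; BRICS's actual GUID service and data dictionary are more elaborate.

```python
# Sketch: a GUID plus validation of a value against a Common Data Element.
import uuid

cde_sex = {  # hypothetical CDE, not from a BRICS data dictionary
    "name": "SexAssignedAtBirth",
    "type": str,
    "permissible_values": {"Male", "Female", "Unknown"},
}

def validate(cde, value):
    return isinstance(value, cde["type"]) and value in cde["permissible_values"]

subject_guid = str(uuid.uuid4())  # stands in for the BRICS GUID service
record = {"guid": subject_guid, "SexAssignedAtBirth": "Female"}

assert validate(cde_sex, record["SexAssignedAtBirth"])
print(record)
```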


2014 · Author(s): Àlex Bravo, Janet Piñero, Núria Queralt, Michael Rautschka, Laura I. Furlong

Background Current biomedical research needs to leverage the large amount of information reported in publications. Automated text mining approaches, in particular those aimed at finding relationships between entities, are key to identifying actionable knowledge in free-text repositories. We present the BeFree system, aimed at identifying relationships between biomedical entities, with a special focus on genes and their associated diseases. Results By exploiting morpho-syntactic information in the text, BeFree identifies gene-disease, drug-disease, and drug-target associations with state-of-the-art performance. The application of BeFree to real-case scenarios shows its effectiveness in extracting information relevant to translational research. We show the value of the gene-disease associations extracted by BeFree through a number of analyses and through integration with other data sources. BeFree succeeds in identifying genes associated with depression, a major cause of morbidity worldwide, that are not present in other public resources. Moreover, large-scale extraction and analysis of gene-disease associations, and their integration with current biomedical knowledge, provided interesting insights into the kind of information that can be found in the literature and raised challenges regarding data prioritization and curation. We found that only a small proportion of the gene-disease associations discovered using BeFree are collected in expert-curated databases. Thus, there is a pressing need for alternative strategies to manual curation in order to review, prioritize, and curate text-mining data and incorporate it into domain-specific databases. We present our strategy for data prioritization and discuss its implications for supporting biomedical research and applications. Conclusions BeFree is a novel text mining system that performs competitively in the identification of gene-disease, drug-disease, and drug-target associations. Our analyses show that mining only a small fraction of MEDLINE yields a large dataset of gene-disease associations, and that only a small proportion of this dataset is actually recorded in curated resources, raising several issues of data prioritization and curation. We propose joint analysis of text-mined data with expert-curated data as a suitable approach both to assess data quality and to highlight novel and interesting information.
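
BeFree itself exploits morpho-syntactic patterns; as a simpler stand-in, the sketch below generates candidate gene-disease pairs by dictionary-based co-occurrence within a sentence, the usual baseline such systems improve upon. The dictionaries and the sentence are toy examples.

```python
# Baseline sketch: candidate gene-disease pairs by sentence-level co-occurrence.
import itertools
import re

genes = {"SLC6A4", "BDNF"}            # toy gene dictionary
diseases = {"depression", "anxiety"}  # toy disease dictionary

sentence = "Polymorphisms in SLC6A4 and BDNF have been associated with depression."
tokens = {t.lower() for t in re.findall(r"[A-Za-z0-9]+", sentence)}

pairs = [(g, d) for g, d in itertools.product(genes, diseases)
         if g.lower() in tokens and d.lower() in tokens]
print(pairs)  # candidates for downstream morpho-syntactic filtering and curation
```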


2020 · Vol 5 (3) · pp. 161-177 · Author(s): Jessica Cox, Darin McBeath, Corey Harper, Ron Daniel

Abstract
Purpose: The use of in vitro cell culture and experimentation is a cornerstone of biomedical research; however, more attention has recently been given to the potential consequences of using artificial basal media and undefined supplements. As a first step towards better understanding and measuring the impact these systems have on experimental results, we use text mining to capture typical research practices and trends around cell culture.
Design/methodology/approach: To measure the scale of in vitro cell culture use, we analyzed a corpus of 94,695 research articles that appeared in biomedical research journals published on ScienceDirect from 2000 to 2018. Central to our investigation is the observation that studies using cell culture describe conditions with a typical sentence structure of cell line, basal medium, and supplemented compounds. We tagged our corpus with a curated list of basal media and the Cellosaurus ontology using the Aho-Corasick algorithm. We also processed the corpus with Stanford CoreNLP to find nouns that follow the basal medium, in an attempt to identify the supplements used.
Findings: Interestingly, we find that researchers frequently use DMEM even when a cell line’s vendor recommends a less concentrated medium. We see long-tailed distributions for the usage of media and cell lines, with DMEM and RPMI dominating the media, and HEK293, HEK293T, and HeLa dominating the cell lines used.
Research limitations: Our analysis was restricted to documents in ScienceDirect, and our text mining method achieved high recall but low precision, necessitating manual inspection of many tokens.
Practical implications: Our findings document current cell culture practices in the biomedical research community, which can serve as a resource for future experimental design.
Originality/value: No other work has taken a text mining approach to surveying cell culture practices in biomedical research.
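
The dictionary-tagging step can be sketched with the pyahocorasick library; the term list below is a toy stand-in for the curated media list and the Cellosaurus ontology.

```python
# Sketch: Aho-Corasick tagging of media and cell-line mentions (pyahocorasick).
import ahocorasick

terms = ["DMEM", "RPMI", "HEK293T", "HeLa"]  # toy dictionary

automaton = ahocorasick.Automaton()
for i, term in enumerate(terms):
    automaton.add_word(term, (i, term))
automaton.make_automaton()

text = "HEK293T cells were maintained in DMEM supplemented with 10% FBS."
hits = [(end - len(term) + 1, term) for end, (i, term) in automaton.iter(text)]
print(hits)  # [(0, 'HEK293T'), (33, 'DMEM')]
```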


2019 · Author(s): Laura Miron, Rafael S. Gonçalves, Mark A. Musen

Abstract
Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The largest source of metadata describing the experimental protocol, funding, and scientific leadership of clinical studies is ClinicalTrials.gov. We evaluated whether values in 302,091 trial records adhere to expected data types and use terms from biomedical ontologies, whether records contain the fields required by government regulations, and whether structured elements could replace free-text elements. Contact information, outcome measures, and study design are frequently missing or underspecified. Important fields for search, such as condition and intervention, are not restricted to ontology terms, and almost half of the conditions are not denoted by MeSH terms, as recommended. Eligibility criteria are stored as semi-structured free text. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility-criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and the matching of eligible patients to trials.
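
The kinds of checks reported above can be sketched as simple field-level validation of a trial record; the record, the required-field list, and the stand-in MeSH set are invented for illustration.

```python
# Sketch: field-level validation of a (toy) trial record.
trial = {
    "nct_id": "NCT00000000",
    "condition": "breast cancer",
    "intervention": None,         # missing required field
    "enrollment": "two hundred",  # wrong data type
}

required_fields = ["nct_id", "condition", "intervention", "enrollment"]
mesh_terms = {"breast neoplasms", "diabetes mellitus"}  # stand-in for MeSH

problems = [f"{f}: missing" for f in required_fields if trial.get(f) is None]
if not isinstance(trial["enrollment"], int):
    problems.append("enrollment: expected an integer")
if trial["condition"] not in mesh_terms:
    problems.append("condition: not a MeSH term")

print(problems)
```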

