GeDex: A consensus Gene-disease Event Extraction System based on frequency patterns and supervised learning

2019 ◽  
Author(s):  
Larisa M. Soto ◽  
Roberto Olayo-Alarcón ◽  
David Alberto Velázquez-Ramírez ◽  
Adrián Munguía-Reyes ◽  
Yalbi Itzel Balderas-Martínez ◽  
...  

Abstract Motivation: The genetic mechanisms involved in human diseases are fundamental to biomedical research. Several databases of curated associations between genes and diseases have emerged in recent decades. However, because manual curation of the literature is demanding and time-consuming, they still lack large amounts of information. Current automatic approaches extract associations by considering each abstract or sentence independently, which can lead to contradictions between individual cases. There is therefore a need for automatic strategies that provide a literature consensus of gene-disease associations and are not prone to making contradictory predictions. Results: Here, we present GeDex, an effective and freely available automatic approach to extract consensus gene-disease associations from the biomedical literature, based on a predictive model trained with four simple features. As far as we know, it is the only system that reports a single consensus prediction from multiple sentences supporting the same association. We tested our approach on the curated fraction of DisGeNet (f-score 0.77) and validated it on a manually curated dataset, obtaining performance competitive with pre-existing methods (f-score 0.74). In addition, we effectively recovered associations from an article collection on chronic pulmonary diseases and discovered that a large proportion is not reported in current databases. Our results demonstrate that GeDex, despite its simplicity, is a competitive tool that can successfully assist the curation of existing databases. Availability: GeDex is available at https://bitbucket.org/laigen/gedex/src/master/ and can be used as a docker image from https://hub.docker.com/r/laigen/. Contact: [email protected] Supplementary information: Supplementary materials are available at bioRxiv online.
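GeDex's actual model and features are not reproduced here, but the core idea the abstract describes, collapsing many sentence-level predictions into a single consensus call per gene-disease pair, can be sketched with a simple majority vote (a hypothetical illustration; the function name, the input format and the tie-breaking rule are assumptions, not GeDex's implementation):

```python
from collections import defaultdict

def consensus_associations(sentence_predictions):
    """Collapse per-sentence predictions into one call per gene-disease
    pair by majority vote (ties resolved in favor of association)."""
    votes = defaultdict(list)
    for gene, disease, is_associated in sentence_predictions:
        votes[(gene, disease)].append(is_associated)
    return {pair: sum(v) >= len(v) - sum(v) for pair, v in votes.items()}

# Two sentences support the first pair, one contradicts it.
predictions = [
    ("BRCA1", "breast cancer", True),
    ("BRCA1", "breast cancer", True),
    ("BRCA1", "breast cancer", False),
    ("TP53", "glioma", False),
]
consensus = consensus_associations(predictions)
```

Whatever the underlying classifier, reporting only the aggregated call is what prevents the contradictory per-sentence predictions the abstract warns about.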

2017 ◽  
Author(s):  
Andrew Palmer ◽  
Prasad Phapale ◽  
Dominik Fay ◽  
Theodore Alexandrov

Abstract Motivation: Identification in metabolomics mass spectrometry experiments requires comparison of fragmentation spectra from experimental samples with spectra from analytical standards. As the quality of identification depends directly on the quality of the reference spectra, manual curation is routine when selecting reference spectra to include in a spectral library. Whilst building our own in-house spectral library, we realised that there is currently no vendor-neutral, open-access tool for facilitating manual curation of spectra from raw LC-MS data into a custom spectral library. Results: We developed curatr, a web application for the rapid generation of high-quality mass spectral fragmentation libraries for liquid chromatography mass spectrometry analysis. Curatr handles datasets from single or multiplexed standards, automatically extracting chromatographic profiles and potential fragmentation spectra for multiple adducts. These are presented through an intuitive interface for manual curation before being documented in a custom spectral library. Searchable molecular information and the provenance of each standard are stored along with metadata on the experimental protocol. Curatr supports the export of spectral libraries in several standard formats for easy use with third-party software or submission to community databases, maximising the return on investment for these costly measurements. We demonstrate the use of curatr to generate the EMBL Metabolomics Core Facility spectral library, which is publicly available at http://curatr.mcf.embl.de. Availability: The source code is freely available at http://github.com/alexandrovteam/curatr/ along with example data. Supplementary information: A step-by-step user manual is available in the supplementary information.
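Curatr's internal adduct handling is not described in the abstract; as a rough illustration of what extracting spectra "for multiple adducts" involves, the expected m/z of a standard's common singly charged positive-mode adducts can be computed from its neutral monoisotopic mass (the cation masses are standard physical constants; the function itself is a hypothetical sketch, not curatr's code):

```python
# Monoisotopic cation masses in daltons (standard physical constants).
CATION_MASS = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}

def adduct_mz(neutral_mass, adduct):
    """Expected m/z of a singly charged positive-mode adduct."""
    if adduct not in CATION_MASS:
        raise ValueError(f"unsupported adduct: {adduct}")
    return neutral_mass + CATION_MASS[adduct]

# Glucose has a neutral monoisotopic mass of 180.0634 Da.
print(round(adduct_mz(180.0634, "[M+H]+"), 4))   # 181.0707
```

A curation tool searches the raw LC-MS data for chromatographic peaks at each of these expected m/z values, one extracted profile per adduct.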


2019 ◽  
Author(s):  
Charles Tapley Hoyt ◽  
Daniel Domingo-Fernández ◽  
Rana Aldisi ◽  
Lingling Xu ◽  
Kristian Kolpeja ◽  
...  

Abstract The rapid accumulation of new biomedical literature not only causes curated knowledge graphs to become outdated and incomplete, but also makes manual curation an impractical and unsustainable solution. Automated or semi-automated workflows are necessary to assist in prioritizing and curating the literature to update and enrich knowledge graphs. We have developed two workflows: one for re-curating a given knowledge graph to assure its syntactic and semantic quality, and another for rationally enriching it by manually revising automatically extracted relations for nodes with low information density. We applied these workflows to the knowledge graphs encoded in Biological Expression Language from the NeuroMMSig database, using content pre-extracted from MEDLINE abstracts and PubMed Central full-text articles with text-mining output integrated by INDRA. We have made this workflow freely available at https://github.com/bel-enrichment/bel-enrichment. Database URL: https://github.com/bel-enrichment/results
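The abstract does not define "information density"; one simple proxy, sketched below, is the number of relations that mention a node, so that sparsely annotated nodes are queued for enrichment curation first (the node names, relation triples and the degree-based ranking are all illustrative assumptions, not the bel-enrichment workflow's actual metric):

```python
from collections import Counter

def rank_for_enrichment(edges, nodes):
    """Order nodes by how many relations mention them, ascending, so the
    most sparsely annotated nodes are queued for curation first."""
    degree = Counter()
    for source, _relation, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(nodes, key=lambda n: degree[n])

edges = [
    ("APP", "increases", "amyloid beta"),
    ("APP", "association", "Alzheimer disease"),
    ("amyloid beta", "increases", "Alzheimer disease"),
]
nodes = ["APP", "amyloid beta", "Alzheimer disease", "MAPT"]
queue = rank_for_enrichment(edges, nodes)   # "MAPT" comes first: no relations yet
```

Curators then review automatically extracted relations for the head of the queue, which concentrates manual effort where the graph is thinnest.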


2018 ◽  
Author(s):  
John M Giorgi ◽  
Gary D Bader

Abstract Motivation: The explosive increase of biomedical literature has made information extraction an increasingly important tool for biomedical research. A fundamental task is the recognition of biomedical named entities in text (BNER), such as genes/proteins, diseases and species. Recently, a domain-independent method based on deep learning and statistical word embeddings, called long short-term memory network-conditional random field (LSTM-CRF), has been shown to outperform state-of-the-art entity-specific BNER tools. However, this method is dependent on gold-standard corpora (GSCs) consisting of hand-labeled entities, which tend to be small but highly reliable. An alternative to GSCs is silver-standard corpora (SSCs), which are generated by harmonizing the annotations made by several automatic annotation systems. SSCs typically contain more noise than GSCs but have the advantage of containing many more training examples. Ideally, these corpora could be combined to achieve the benefits of both, which is an opportunity for transfer learning. In this work, we analyze to what extent transfer learning improves upon state-of-the-art results for BNER. Results: We demonstrate that transferring a deep neural network (DNN) trained on a large, noisy SSC to a smaller but more reliable GSC significantly improves upon state-of-the-art results for BNER. Compared to a state-of-the-art baseline evaluated on 23 GSCs covering four different entity classes, transfer learning results in an average reduction in error of approximately 11%. We found transfer learning to be especially beneficial for target datasets with a small number of labels (approximately 6000 or fewer). Availability and implementation: Source code for the LSTM-CRF is available at https://github.com/Franck-Dernoncourt/NeuroNER/ and links to the corpora are available at https://github.com/BaderLab/Transfer-Learning-BNER-Bioinformatics-2018/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
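The actual system transfers an LSTM-CRF between corpora; the pretrain-on-SSC-then-fine-tune-on-GSC schedule itself can be illustrated with a toy logistic-regression analogue in pure Python (the synthetic data, the 20% label-flip rate standing in for SSC noise, and the learning rates are all assumptions made for the sketch):

```python
import math
import random

def train(w, b, data, epochs, lr):
    """Plain logistic-regression SGD; returns updated weights and bias."""
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

random.seed(0)
true_label = lambda x: 1 if x[0] + x[1] > 0 else 0

# Large, noisy "silver-standard" set: 20% of labels are flipped.
ssc = []
for _ in range(500):
    x = [random.uniform(-1, 1), random.uniform(-1, 1)]
    y = true_label(x)
    if random.random() < 0.2:
        y = 1 - y
    ssc.append((x, y))

# Small, clean "gold-standard" set.
gsc = [(x, true_label(x)) for x in
       ([random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20))]

w, b = train([0.0, 0.0], 0.0, ssc, epochs=5, lr=0.1)   # pre-train on the SSC
w, b = train(w, b, gsc, epochs=20, lr=0.05)            # fine-tune on the GSC
```

The pre-training stage gives the model a reasonable decision boundary from abundant noisy labels; fine-tuning then refines it on the small reliable set, which is exactly the regime where the paper reports the largest gains.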


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Carlos-Francisco Méndez-Cruz ◽  
Antonio Blanchet ◽  
Alan Godínez ◽  
Ignacio Arroyo-Fernández ◽  
Socorro Gama-Castro ◽  
...  

Abstract Transcription factors (TFs) play a central role in the transcriptional regulation of bacteria, as they regulate the transcription of the genetic information encoded in DNA. Thus, curation of the properties of these regulatory proteins is essential for a better understanding of transcriptional regulation. However, traditional manual curation of article collections to compile descriptions of TF properties takes significant time and effort, owing to the overwhelming amount of biomedical literature, which increases every day. The development of automatic approaches for knowledge extraction to assist curation is therefore critical. Here, we show an effective knowledge-extraction approach to assist the curation of summaries describing bacterial TF properties, based on an automatic text summarization strategy. We automatically recovered a median of 77% of the knowledge contained in manual summaries describing the properties of 177 TFs of Escherichia coli K-12 by processing 5961 scientific articles. For 71% of the TFs, our approach extracted new knowledge that can be used to expand manual descriptions. Furthermore, as we trained our predictive model on manual summaries of E. coli, we also generated summaries for 185 TFs of Salmonella enterica serovar Typhimurium from 3498 articles. According to manual curation of 10 of these Salmonella Typhimurium summaries, 96% of their sentences contained relevant knowledge. Our results demonstrate the feasibility of assisting manual curation by expanding manual summaries with automatically extracted knowledge and of creating new summaries for bacteria for which such curation efforts do not exist. Database URL: The automatic summaries of the TFs of E. coli and Salmonella and the automatic summarizer are available on GitHub (https://github.com/laigen-unam/tf-properties-summarizer.git).
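The paper's summarizer is a trained predictive model; the general shape of extractive summarization it builds on can be sketched with naive word-frequency sentence scoring (a generic textbook illustration with invented example sentences, not the authors' method):

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Toy extractive summarizer: score each sentence by the summed
    corpus frequency of its words, keep the top n in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    score = lambda s: sum(freq[w] for w in re.findall(r"[a-z]+", s.lower()))
    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    keep = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in keep)

text = ("FNR regulates anaerobic respiration genes. FNR senses oxygen. "
        "It was cloudy that day.")
print(summarize(text, 1))
```

A trained model replaces the frequency score with learned relevance, but the pipeline is the same: score candidate sentences from the article collection, then keep the top-ranked ones as the TF's summary.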


2010 ◽  
Vol 26 (9) ◽  
pp. 1219-1224 ◽  
Author(s):  
Yongjin Li ◽  
Jagdish C. Patra

Abstract Motivation: Clinical diseases are characterized by distinct phenotypes, and identifying disease genes means elucidating gene–phenotype relationships. Mutations in functionally related genes may result in similar phenotypes, so it is reasonable to predict disease-causing genes by integrating phenotypic and genomic data. Some genetic diseases are genetically or phenotypically similar and may share common pathogenetic mechanisms; identifying the relationships between diseases will facilitate a better understanding of those mechanisms. Results: In this article, we constructed a heterogeneous network by connecting the gene network and the phenotype network using phenotype–gene relationship information from the OMIM database. We extended the random walk with restart algorithm to the heterogeneous network; the algorithm prioritizes genes and phenotypes simultaneously. We used leave-one-out cross-validation to evaluate the ability to recover gene–phenotype relationships, and the results showed improved performance over previous works. We also used the algorithm to disclose hidden disease associations that cannot be found from the gene network or the phenotype network alone. We identified 18 hidden disease associations, most of which were supported by literature evidence. Availability: The MATLAB code of the program is available at http://www3.ntu.edu.sg/home/aspatra/research/Yongjin_BI2010.zip Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
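The paper's contribution is extending random walk with restart (RWR) to a heterogeneous gene–phenotype network with cross-links; the basic iteration it builds on can be shown on a single toy network (the node names and the restart probability of 0.7 are illustrative choices, not values from the paper):

```python
def random_walk_with_restart(adj, seeds, restart=0.7, tol=1e-10):
    """Random walk with restart on an undirected graph given as an
    adjacency dict; the stationary vector scores proximity to seeds."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    while True:
        q = {n: restart * p0[n] for n in nodes}   # restart mass to seeds
        for n in nodes:
            if adj[n]:
                share = (1.0 - restart) * p[n] / len(adj[n])
                for m in adj[n]:
                    q[m] += share                 # diffuse the rest
        if sum(abs(q[n] - p[n]) for n in nodes) < tol:
            return q
        p = q

# Toy gene network seeded at gene "A".
adj = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
scores = random_walk_with_restart(adj, {"A"})
```

In the heterogeneous setting, the walker additionally jumps between the gene and phenotype networks through the OMIM phenotype–gene links, so genes and phenotypes are ranked by the same stationary vector.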


2016 ◽  
Author(s):  
Stephen G. Gaffney ◽  
Jeffrey P. Townsend

Abstract Summary: PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphical interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects. Availability and implementation: Web application available at pathscore.publichealth.yale.edu. Site implemented in Python and MySQL, with all major browsers supported. Source code available at github.com/sggaffney/pathscore under a GPLv3 license. Contact: [email protected] Supplementary information: Additional documentation can be found at http://pathscore.publichealth.yale.edu/faq.
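PathScore's across-patient enrichment model is its own contribution; for contrast, the conventional per-gene-set baseline it improves on is a hypergeometric (one-sided Fisher) test, which can be computed exactly with the standard library's math.comb (the gene counts in the example are invented):

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N with K pathway members:
    the one-sided enrichment p-value for k observed pathway hits."""
    hi = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, hi + 1)) / comb(N, n)

# 20,000 genes, a 50-gene pathway, 100 mutated genes, 5 hits in the pathway.
p_value = hypergeom_sf(5, 20000, 50, 100)
```

With an expected count of only 100 × 50 / 20000 = 0.25 hits, observing 5 is strong evidence of enrichment, and the p-value is correspondingly tiny.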


2021 ◽  
Author(s):  
Benbo Gao ◽  
Jing Zhu ◽  
Soumya Negi ◽  
Xinmin Zhang ◽  
Stefka Gyoneva ◽  
...  

Abstract Summary: We developed Quickomics, a feature-rich R Shiny-powered tool that enables biologists to fully explore complex omics data and perform advanced analyses in an easy-to-use interactive interface. It covers a broad range of secondary and tertiary analytical tasks after primary analysis of omics data is completed. Each functional module is equipped with customized configurations and generates both interactive and publication-ready high-resolution plots to uncover biological insights from data. The modular design makes the tool easily extensible. Availability: Researchers can explore the functionality with their own data, or with demo RNA-Seq and proteomics data sets, using the app hosted at http://quickomics.bxgenomics.com and following the tutorial at https://bit.ly/3rXIyhL. The source code, under the GPLv3 license, is provided at https://github.com/interactivereport/. Contact: [email protected], [email protected] Supplementary information: Supplementary materials are available at https://bit.ly/37HP17g.


2019 ◽  
Author(s):  
Anthony Federico ◽  
Stefano Monti

Abstract Summary: Geneset enrichment is a popular method for annotating high-throughput sequencing data. Existing tools fall short in providing the flexibility to tackle the varied challenges researchers face in such analyses, particularly when analyzing many signatures across multiple experiments. We present a comprehensive R package for geneset enrichment workflows that offers multiple enrichment, visualization and sharing methods, in addition to novel features such as hierarchical geneset analysis and built-in markdown reporting. hypeR is a one-stop solution for performing geneset enrichment for a wide audience and range of use cases. Availability and implementation: The most recent version of the package is available at https://github.com/montilab/hypeR. Supplementary information: Comprehensive documentation and tutorials are available at https://montilab.github.io/hypeR-docs.


Author(s):  
Frédéric Lemoine ◽  
Luc Blassel ◽  
Jakub Voznica ◽  
Olivier Gascuel

Abstract Motivation: The first cases of the COVID-19 pandemic emerged in December 2019. Until the end of February 2020, the number of available genomes was below 1,000, and their multiple alignment was easily achieved using standard approaches. Subsequently, the availability of genomes has grown dramatically. Moreover, some genomes are of low quality, with sequencing/assembly errors, making accurate re-alignment of all genomes nearly impossible on a daily basis. A more efficient, yet accurate, approach was clearly required to pursue all subsequent bioinformatics analyses of this crucial data. Results: hCoV-19 genomes are highly conserved, with very few indels and no recombination. This makes the profile HMM approach particularly well suited to aligning new genomes, adding them to an existing alignment and filtering problematic ones. Using a core of ∼2,500 high-quality genomes, we estimated a profile using HMMER and implemented this profile in COVID-Align, a user-friendly interface to be used online or standalone via Docker. The alignment of 1,000 genomes requires less than 20 min on our cluster. Moreover, COVID-Align provides summary statistics that can be used to determine the sequencing quality and evolutionary novelty of input genomes (e.g. the number of new mutations and indels). Availability: https://covalign.pasteur.cloud, hub.docker.com/r/evolbioinfo/ Contact: [email protected], [email protected] Supplementary information: Supplementary information is available at Bioinformatics online.
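COVID-Align's statistics come from its HMMER-based alignment; once a genome is aligned against a reference, tallying "new mutations and indels" reduces to a column scan like the following (the sequences are made up; here each gap run is counted as one indel event, which is one reasonable convention among several):

```python
def alignment_stats(ref, query):
    """Count substitutions and indel events between two aligned,
    equal-length sequences ('-' denotes a gap)."""
    assert len(ref) == len(query), "sequences must be aligned"
    subs, indels, in_gap = 0, 0, False
    for r, q in zip(ref, query):
        if r == "-" or q == "-":
            if not in_gap:
                indels += 1          # count each gap run as one event
            in_gap = True
        else:
            in_gap = False
            if r != q:
                subs += 1            # aligned columns that disagree
    return subs, indels

ref   = "ACGTACGTAC"
query = "ACGAAC--AC"
subs, indels = alignment_stats(ref, query)   # 1 substitution, 1 indel run
```

Unusually high counts from a scan like this flag a genome as low quality or evolutionarily novel, which is how such statistics support filtering.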


2020 ◽  
Author(s):  
N Goonasekera ◽  
A Mahmoud ◽  
J Chilton ◽  
E Afgan

Abstract Summary: The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for increased availability of compute resources for Galaxy. GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources according to a set of configurable rules, and the resources can be dynamically acquired from any of four popular cloud providers (AWS, Azure, GCP or OpenStack) in an automated fashion. Availability and implementation: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT-licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/. Contact: Enis Afgan ([email protected]) Supplementary information: None
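GalaxyCloudRunner's actual rule syntax lives in its configuration; the first-match routing idea behind "configurable rules" can be sketched as an ordered list of (predicate, destination) pairs (the rule contents, job fields and destination names below are invented for illustration):

```python
def route_job(job, rules, default="local"):
    """Return the first destination whose rule accepts the job;
    rules are (predicate, destination) pairs checked in order."""
    for predicate, destination in rules:
        if predicate(job):
            return destination
    return default            # fall back to the local cluster

rules = [
    # Big alignment jobs go to a high-memory cloud pool.
    (lambda j: j["tool"] == "bwa" and j["size_gb"] > 10, "aws-highmem"),
    # Anything over 1 GB goes to a standard cloud pool.
    (lambda j: j["size_gb"] > 1, "gcp-standard"),
]
dest = route_job({"tool": "bwa", "size_gb": 50}, rules)   # -> "aws-highmem"
```

Ordering the rules from most to least specific, with a local default, keeps small jobs on the server while bursting only the demanding ones to acquired cloud nodes.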

