Extracting Evidence Fragments for Distant Supervision of Molecular Interactions

2017 ◽  
Author(s):  
Gully A Burns ◽  
Pradeep Dasigi ◽  
Eduard H. Hovy

Abstract. We describe a methodology for automatically extracting ‘evidence fragments’ from a set of biomedical experimental research articles. These fragments provide the primary description of evidence that is presented in the papers’ figures. They elucidate the goals, methods, results and interpretations of experiments that support the original scientific contributions of the study being reported. Within this paper, we describe our methodology and showcase an example data set based on the European Bioinformatics Institute’s INTACT database (http://www.ebi.ac.uk/intact/). Using figure codes as anchors, we linked evidence fragments to INTACT data records as an example of distant supervision, so that we could use INTACT’s preexisting, manually curated structured interaction data as a gold standard for machine reading experiments. We report preliminary baseline event extraction measures from this collection based on a publicly available machine reading system (REACH). We use semantic web standards for our data and provide open access to all source code.

2019 ◽  
Author(s):  
Charlotte A. Darby ◽  
Ravi Gaddipati ◽  
Michael C. Schatz ◽  
Ben Langmead

Abstract. Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.
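As a rough illustration of what an exhaustive, heuristic-free local alignment with affine gap penalties computes, the sketch below fills the full dynamic-programming matrix for one read against a linear reference using the standard Gotoh recurrences. It is not the Vargas implementation: the function name `local_align_affine`, the scoring values, and the example sequences are assumptions chosen for illustration, and graph genomes, quality-scaled mismatch penalties, and SIMD vectorization are not modeled.

```python
def local_align_affine(read, ref, match=2, mismatch=-4, gap_open=-5, gap_extend=-3):
    """Best local alignment score with affine gaps (Gotoh recurrences).

    A gap of length k costs gap_open + k * gap_extend.
    All scoring values are illustrative assumptions, not any aligner's defaults.
    """
    neg_inf = float("-inf")
    cols = len(ref) + 1
    h_prev = [0] * cols        # best score of an alignment ending at (i-1, j)
    e_prev = [neg_inf] * cols  # same, but ending in a gap that consumes read bases
    best = 0
    for i in range(1, len(read) + 1):
        h_cur = [0] * cols
        e_cur = [neg_inf] * cols
        f = neg_inf            # ending in a gap that consumes reference bases
        for j in range(1, cols):
            # Open a new gap or extend an existing one, in each direction.
            e_cur[j] = max(h_prev[j] + gap_open + gap_extend, e_prev[j] + gap_extend)
            f = max(h_cur[j - 1] + gap_open + gap_extend, f + gap_extend)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            # Local alignment: a cell may also start a fresh alignment (score 0).
            h_cur[j] = max(0, h_prev[j - 1] + sub, e_cur[j], f)
            best = max(best, h_cur[j])
        h_prev, e_prev = h_cur, e_cur
    return best


# Hypothetical example: the reference carries one extra base in the middle of the read.
read = "ACGTGCATCA" + "GGACCTAGTC"
ref = "ACGTGCATCA" + "T" + "GGACCTAGTC"
print(local_align_affine(read, ref))
```

Every cell of the matrix is evaluated, so the returned score is optimal under the chosen scoring function; the cost is proportional to read length times reference length, which is the kind of workload the abstract counts in cell updates per second.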


2019 ◽  
Vol 29 (11n12) ◽  
pp. 1801-1818
Author(s):  
Yixiao Yang ◽  
Xiang Chen ◽  
Jiaguang Sun

In the last few years, applying language models to source code has become the state-of-the-art method for code completion. However, compared with natural language, code is markedly more repetitive. For example, a variable can be used many times in subsequent code, so variables in source code have a high chance of being repeated. Cloned code and templates also exhibit token repetition. Capturing the token repetition of source code is therefore important. In different projects, variables and types are usually named differently. This means that a model trained on a finite data set will encounter many unseen variables or types in another data set. How to model the semantics of the unseen data and how to predict the unseen data based on the patterns of token repetition are two challenges in code completion. Hence, in this paper, we model token repetition as a graph and propose a novel REP model, based on a deep neural graph network, to learn code token repetition. The REP model identifies the edge connections of a graph to recognize token repetition. To predict the repetition of token [Formula: see text], the information of all the previous tokens needs to be considered. We use a memory neural network (MNN) to model the semantics of each distinct token to make the framework of the REP model more targeted. The experiments indicate that the REP model performs better than an LSTM model. Comparing with the Attention-Pointer network, we also discover that the attention mechanism does not work in all situations. The proposed REP model achieves similar or slightly better prediction accuracy than the Attention-Pointer network while consuming less training time. We also identify another attention mechanism that could further improve the prediction accuracy.


2019 ◽  
Vol 10 (04) ◽  
pp. 655-669
Author(s):  
Gaurav Trivedi ◽  
Esmaeel R. Dadashzadeh ◽  
Robert M. Handzel ◽  
Wendy W. Chapman ◽  
Shyam Visweswaran ◽  
...  

Abstract. Background: Despite advances in natural language processing (NLP), extracting information from clinical text is expensive. Interactive tools that are capable of easing the construction, review, and revision of NLP models can reduce this cost and improve the utility of clinical reports for clinical and secondary use. Objectives: We present the design and implementation of an interactive NLP tool for identifying incidental findings in radiology reports, along with a user study evaluating the performance and usability of the tool. Methods: Expert reviewers provided gold standard annotations for 130 patient encounters (694 reports) at sentence, section, and report levels. We performed a user study with 15 physicians to evaluate the accuracy and usability of our tool. Participants reviewed encounters split into intervention (with predictions) and control (no predictions) conditions. We measured changes in model performance, the time spent, and the number of user actions needed. The System Usability Scale (SUS) and an open-ended questionnaire were used to assess usability. Results: Starting from bootstrapped models trained on 6 patient encounters, we observed an average increase in F1 score from 0.31 to 0.75 for reports, from 0.32 to 0.68 for sections, and from 0.22 to 0.60 for sentences on a held-out test data set, over an hour-long study session. We found that the tool helped significantly reduce the time spent in reviewing encounters (134.30 vs. 148.44 seconds in intervention and control, respectively), while maintaining overall quality of labels as measured against the gold standard. The tool was well received by the study participants, with a very good overall SUS score of 78.67. Conclusion: The user study demonstrated successful use of the tool by physicians for identifying incidental findings. These results support the viability of adopting interactive NLP tools in clinical care settings for a wider range of clinical applications.


2020 ◽  
Vol 53 (4) ◽  
pp. 1060-1072 ◽  
Author(s):  
Edward L. Pang ◽  
Peter M. Larsen ◽  
Christopher A. Schuh

Resolving pseudosymmetry has long presented a challenge for electron backscatter diffraction and has been notoriously challenging in the case of tetragonal ZrO2 in particular. In this work, a method is proposed to resolve pseudosymmetry by building upon the dictionary indexing method and augmenting it with the application of global optimization to fit accurate pattern centers, clustering of the Hough-indexed orientations to focus the dictionary in orientation space and interpolation to improve the accuracy of the indexed solution. The proposed method is demonstrated to resolve pseudosymmetry with 100% accuracy in simulated patterns of tetragonal ZrO2, even with high degrees of binning and noise. The method is then used to index an experimental data set, which confirms its ability to efficiently and accurately resolve pseudosymmetry in these materials. The present method can be applied to resolve pseudosymmetry in a wide range of materials, possibly even some more challenging than tetragonal ZrO2. Source code for this implementation is available online.


Ecology ◽  
2017 ◽  
Vol 98 (6) ◽  
pp. 1729-1729 ◽  
Author(s):  
Carolina Bello ◽  
Mauro Galetti ◽  
Denise Montan ◽  
Marco A. Pizo ◽  
Tatiane C. Mariguela ◽  
...  

2015 ◽  
Vol 32 (6) ◽  
pp. 801-807 ◽  
Author(s):  
Mookyung Cheon ◽  
Choongrak Kim ◽  
Iksoo Chang

Abstract. Motivation: The loci-ordering, based on two-point recombination fractions for a pair of loci, is the most important step in constructing a reliable and fine genetic map. Results: Using the concept from complex graph theory, here we propose a Laplacian ordering approach which uncovers the loci-ordering of multiloci simultaneously. The algebraic property for a Fiedler vector of a Laplacian matrix, constructed from the recombination fraction of the loci-ordering for 26 loci of barley chromosome IV, 846 loci of Arabidopsis thaliana and 1903 loci of Malus domestica, together with the variable threshold uncovers their loci-orders. It offers an alternative yet robust approach for ordering multiloci. Availability and implementation: Source code program with data set is available as supplementary data and also in a software category of the website (http://biophysics.dgist.ac.kr). Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
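The general idea of ordering items by the Fiedler vector of a graph Laplacian can be sketched as follows. This is a minimal illustration, not the authors' program: the similarity weight 1 - 2r, the use of Haldane's map function to generate toy recombination fractions, the marker names and positions, and the function names are all assumptions, and the variable threshold mentioned in the abstract is not modeled. The Fiedler vector (the eigenvector of the second-smallest eigenvalue of L = D - W) is found here by power iteration on a shifted matrix so that only the standard library is needed.

```python
import math
import random


def haldane_similarity(distance_morgans):
    """Similarity 1 - 2r, with r from Haldane's map function r = (1 - exp(-2d)) / 2."""
    r = 0.5 * (1.0 - math.exp(-2.0 * distance_morgans))
    return 1.0 - 2.0 * r


def fiedler_order(weights, iterations=5000, seed=0):
    """Return item indices sorted by the Fiedler vector of L = D - W."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2.0 * max(degree)  # Gershgorin bound on the largest eigenvalue of L
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iterations):
        # Project out the constant vector (the eigenvector for eigenvalue 0).
        mean = sum(v) / n
        v = [x - mean for x in v]
        # Multiply by (shift * I - L); the second-smallest eigenvalue of L then dominates.
        w = []
        for i in range(n):
            lv = degree[i] * v[i] - sum(weights[i][j] * v[j] for j in range(n))
            w.append(shift * v[i] - lv)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return sorted(range(n), key=lambda i: v[i])


# Hypothetical toy data: marker name -> true map position in Morgans, listed out of order.
positions = {"m4": 0.31, "m1": 0.05, "m6": 0.52, "m0": 0.00,
             "m3": 0.20, "m7": 0.60, "m2": 0.12, "m5": 0.40}
names = list(positions)
weights = [[0.0 if a == b else haldane_similarity(abs(positions[a] - positions[b]))
            for b in names] for a in names]
recovered = [names[i] for i in fiedler_order(weights)]
true_order = sorted(names, key=positions.get)
print(recovered)
```

The sign of an eigenvector is arbitrary, so the recovered order may be the true order or its reverse; a genetic map has the same ambiguity about which end of the chromosome comes first.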


2009 ◽  
Vol 07 (04) ◽  
pp. 701-716 ◽  
Author(s):  
HENRY CHI-MING LEUNG ◽  
MAN-HUNG SIU ◽  
SIU-MING YIU ◽  
FRANCIS YUK-LUN CHIN ◽  
KEN WING-KIN SUNG

Predicting motif pairs from a set of protein sequences based on protein–protein interaction data is an important but difficult computational problem. Tan et al. proposed a solution to this problem. However, the scoring function (using χ² testing) used in their approach is not adequate, and their approach is not scalable: it may take days to process a set of 5000 protein sequences with about 20,000 interactions. Later, Leung et al. proposed an improved scoring function and faster algorithms for solving the same problem. However, the model used by Leung et al. is complicated. The exact value of the scoring function is not easy to compute, and an estimated value is used in practice. In this paper, we derive a better model to capture the significance of a given motif pair based on a clustering notion. We develop a fast heuristic algorithm to solve the problem. The algorithm is able to locate the correct motif pair in the yeast data set in about 45 minutes for 5000 protein sequences and 20,000 interactions. Moreover, we derive a lower bound result for the p-value of a motif pair in order for it to be distinguishable from random motif pairs. The lower bound result has been verified using simulated data sets. Availability:


2011 ◽  
Vol 38 (3) ◽  
pp. 1491-1502 ◽  
Author(s):  
Christelle Gendrin ◽  
Primož Markelj ◽  
Supriyanto Ardjo Pawiro ◽  
Jakob Spoerk ◽  
Christoph Bloch ◽  
...  

2011 ◽  
Vol 11 (2) ◽  
pp. 151-167 ◽  
Author(s):  
Mikkel Rønne Jakobsen ◽  
Kasper Hornbæk

Transient use of information visualization may support specific tasks without permanently changing the user interface. Transient visualizations provide immediate and transient use of information visualization close to and in the context of the user’s focus of attention. Little is known, however, about the benefits and limitations of transient visualizations. We describe an experiment that compares the usability of a fisheye view that participants could call up temporarily, a permanent fisheye view, and a linear view: all interfaces gave access to source code in the editor of a widespread programming environment. Fourteen participants performed varied tasks involving navigation and understanding of source code. Participants used the three interfaces for between four and six hours in all. Time and accuracy measures were inconclusive, but subjective data showed a preference for the permanent fisheye view. We analyse interaction data to compare how participants used the interfaces and to understand why the transient interface was not preferred. We conclude by discussing seamless integration of fisheye views in existing user interfaces and future work on transient visualizations.


2021 ◽  
Author(s):  
Qi Jia ◽  
Dezheng Zhang ◽  
Haifeng Xu ◽  
Yonghong Xie

BACKGROUND: Traditional Chinese medicine (TCM) clinical records contain the symptoms of patients, diagnoses, and the subsequent treatment by doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. OBJECTIVE: Training a medical entity extraction model requires a large annotated corpus. The cost of annotating a corpus is very high, and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we utilized distantly supervised named entity recognition (NER) to respond to the challenge. METHODS: We propose a span-level distantly supervised NER approach to extract TCM medical entities. It utilizes a pretrained language model and a simple multilayer neural network as a classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and periodically filters out possible false-negative samples, which reduces their harmful influence on training. RESULTS: We compare our method with other baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, and it markedly outperforms the other baselines. CONCLUSIONS: We developed a distantly supervised NER approach to extract medical entities from TCM clinical records. We evaluated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than the other baselines.
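The negative sampling step described under METHODS might look roughly like the sketch below. It only illustrates the general idea of drawing fresh negative spans each epoch while skipping spans suspected of being unlabeled entities; the function names, the maximum span length, the example tokens, and the way suspected false negatives are supplied are all assumptions, since the abstract does not specify them.

```python
import random


def enumerate_spans(tokens, max_len=4):
    """All (start, end) spans, end exclusive, up to max_len tokens long."""
    return [(start, end)
            for start in range(len(tokens))
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1)]


def sample_negatives(tokens, positive_spans, suspected_false_negatives, k, rng, max_len=4):
    """Draw up to k negative spans, excluding positives and suspected false negatives."""
    excluded = set(positive_spans) | set(suspected_false_negatives)
    candidates = [span for span in enumerate_spans(tokens, max_len) if span not in excluded]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical example: one distantly labeled entity span and one span that a
# periodic filtering pass (not shown) has flagged as a likely false negative.
tokens = ["patient", "reports", "dry", "cough", "and", "night", "sweats"]
positive_spans = [(2, 4)]             # "dry cough", matched by a dictionary
suspected_false_negatives = [(5, 7)]  # "night sweats", unlabeled but entity-like
rng = random.Random(13)
for epoch in range(2):
    negatives = sample_negatives(tokens, positive_spans, suspected_false_negatives, 5, rng)
    print(epoch, negatives)
```

Resampling in every epoch means that no single mislabeled span is seen as a negative in all epochs, and the exclusion set gives the periodic filter a place to record spans it no longer trusts as negatives.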

