Extracting Evidence Fragments for Distant Supervision of Molecular Interactions

Mapping Intimacies ◽

10.1101/192856 ◽

2017 ◽

Author(s):

Gully A Burns ◽

Pradeep Dasigi ◽

Eduard H. Hovy

Keyword(s):

Gold Standard ◽

Source Code ◽

Event Extraction ◽

Interaction Data ◽

Data Set ◽

Distant Supervision ◽

Web Standards ◽

Reading System ◽

Linked Evidence ◽

Machine Reading

Abstract. We describe a methodology for automatically extracting ‘evidence fragments’ from a set of biomedical experimental research articles. These fragments provide the primary description of evidence that is presented in the papers’ figures. They elucidate the goals, methods, results and interpretations of experiments that support the original scientific contributions of the study being reported. Within this paper, we describe our methodology and showcase an example data set based on the European Bioinformatics Institute’s INTACT database (http://www.ebi.ac.uk/intact/). Using figure codes as anchors, we linked evidence fragments to INTACT data records as an example of distant supervision so that we could use INTACT’s preexisting, manually-curated structured interaction data to act as a gold standard for machine reading experiments. We report preliminary baseline event extraction measures from this collection based on a publicly available machine reading system (REACH). We use semantic web standards for our data and provide open access to all source code.
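The linking step described in this abstract (figure codes used as anchors between text fragments and curated records) can be pictured as a simple keyed join. The sketch below is only an illustration under stated assumptions: the field names `pmid`, `figure`, `text` and `interaction`, and both function names, are hypothetical and are not taken from the authors' code or from the INTACT schema.

```python
import re
from collections import defaultdict


def normalize_figure_code(label):
    """Reduce a short figure label such as 'Fig. 1A' to a code such as '1a'.

    Assumes the label holds one figure number with an optional panel letter.
    """
    m = re.search(r"(\d+)\s*([A-Za-z])?\b", label)
    if not m:
        return None
    return m.group(1) + (m.group(2) or "").lower()


def link_fragments(fragments, records):
    """Pair each evidence fragment with curated records that share its paper and figure code."""
    index = defaultdict(list)
    for rec in records:
        index[(rec["pmid"], normalize_figure_code(rec["figure"]))].append(rec)

    linked = []
    for frag in fragments:
        key = (frag["pmid"], normalize_figure_code(frag["figure"]))
        for rec in index.get(key, []):
            # The curated record serves as a distant (noisy) label for the fragment.
            linked.append((frag["text"], rec["interaction"]))
    return linked
```

Labels obtained this way are noisy by construction, since a fragment may discuss several experiments shown in the same figure panel.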

Vargas: heuristic-free alignment for assessing linear and graph read aligners

10.1101/2019.12.20.884676 ◽

2019 ◽

Author(s):

Charlotte A. Darby ◽

Ravi Gaddipati ◽

Michael C. Schatz ◽

Ben Langmead

Keyword(s):

Gold Standard ◽

Source Code ◽

Alignment Accuracy ◽

Local Alignment ◽

Maximum Speed ◽

Command Line ◽

Scoring Functions ◽

Large Numbers ◽

Computationally Intensive ◽

Optimal Alignments

Abstract. Read alignment is central to many aspects of modern genomics. Most aligners use heuristics to accelerate processing, but these heuristics can fail to find the optimal alignments of reads. Alignment accuracy is typically measured through simulated reads; however, the simulated location may not be the (only) location with the optimal alignment score. Vargas implements a heuristic-free algorithm guaranteed to find the highest-scoring alignment for real sequencing reads to a linear or graph genome. With semiglobal and local alignment modes and affine gap and quality-scaled mismatch penalties, it can implement the scoring functions of commonly used aligners to calculate optimal alignments. While this is computationally intensive, Vargas uses multi-core parallelization and vectorized (SIMD) instructions to make it practical to optimally align large numbers of reads, achieving a maximum speed of 456 billion cell updates per second. We demonstrate how these “gold standard” Vargas alignments can be used to improve heuristic alignment accuracy by optimizing command-line parameters in Bowtie 2, BWA-MEM, and vg to align more reads correctly. Source code implemented in C++ and compiled binary releases are available at https://github.com/langmead-lab/vargas under the MIT license.
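To make "heuristic-free" concrete, the sketch below fills every cell of a local-alignment dynamic-programming matrix with affine gap penalties (the Gotoh recurrence) for a linear reference. It is a plain Python illustration, not Vargas's vectorized C++ implementation, and it omits graph genomes and quality-scaled mismatches. The function name and the default scores are assumptions chosen for the example; here `gap_open` is the cost of the first gap position and `gap_extend` the cost of each further one.

```python
def local_affine_score(read, ref, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Best local alignment score of read against ref, visiting every DP cell."""
    neg_inf = float("-inf")
    n = len(ref)
    prev_h = [0] * (n + 1)        # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * (n + 1)  # same, but ending with a gap that consumes a read base
    best = 0
    for i in range(1, len(read) + 1):
        cur_h = [0] * (n + 1)
        cur_f = [neg_inf] * (n + 1)
        e = neg_inf               # ending with a gap that consumes a reference base
        for j in range(1, n + 1):
            e = max(e + gap_extend, cur_h[j - 1] + gap_open)
            cur_f[j] = max(prev_f[j] + gap_extend, prev_h[j] + gap_open)
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            # Local mode: a score never drops below 0 (start a fresh alignment instead).
            cur_h[j] = max(0, prev_h[j - 1] + sub, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Because no cell is skipped, the returned score is optimal for the chosen scoring function, at a cost proportional to the product of the read and reference lengths; that cost is what the parallelization and SIMD instructions mentioned in the abstract address.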

Improve Language Modeling for Code Completion Through Learning General Token Repetition of Source Code with Optimized Memory

International Journal of Software Engineering and Knowledge Engineering ◽

10.1142/s0218194019400229 ◽

2019 ◽

Vol 29 (11n12) ◽

pp. 1801-1818

Author(s):

Yixiao Yang ◽

Xiang Chen ◽

Jiaguang Sun

Keyword(s):

Prediction Accuracy ◽

Language Model ◽

Source Code ◽

Attention Mechanism ◽

Data Set ◽

Training Time ◽

Unseen Data ◽

Code Completion ◽

High Chance ◽

Better Than

In the last few years, applying language models to source code has been the state-of-the-art method for solving the problem of code completion. However, compared with natural language, code has more obvious repetition characteristics. For example, a variable can be used many times in the following code. Variables in source code have a high chance of being repeated. Cloned code and templates also have the property of token repetition. Capturing the token repetition of source code is important. In different projects, variables or types are usually named differently. This means that a model trained on a finite data set will encounter many unseen variables or types in another data set. How to model the semantics of the unseen data and how to predict the unseen data based on the patterns of token repetition are two challenges in code completion. Hence, in this paper, token repetition is modelled as a graph, and we propose a novel REP model, based on a deep neural graph network, to learn code token repetition. The REP model identifies the edge connections of a graph to recognize token repetition. For predicting the token repetition of token [Formula: see text], the information of all the previous tokens needs to be considered. We use a memory neural network (MNN) to model the semantics of each distinct token to make the framework of the REP model more targeted. The experiments indicate that the REP model performs better than the LSTM model. Compared with the Attention-Pointer network, we also discover that the attention mechanism does not work in all situations. The proposed REP model could achieve similar or slightly better prediction accuracy compared to the Attention-Pointer network and consume less training time. We also find another attention mechanism which could further improve the prediction accuracy.

Interactive NLP in Clinical Care: Identifying Incidental Findings in Radiology Reports

Applied Clinical Informatics ◽

10.1055/s-0039-1695791 ◽

2019 ◽

Vol 10 (04) ◽

pp. 655-669

Author(s):

Gaurav Trivedi ◽

Esmaeel R. Dadashzadeh ◽

Robert M. Handzel ◽

Wendy W. Chapman ◽

Shyam Visweswaran ◽

...

Keyword(s):

Language Processing ◽

Gold Standard ◽

Incidental Findings ◽

User Study ◽

Clinical Care ◽

Model Performance ◽

Data Set ◽

Radiology Reports ◽

System Usability Scale ◽

And Control

Abstract. Background: Despite advances in natural language processing (NLP), extracting information from clinical text is expensive. Interactive tools that are capable of easing the construction, review, and revision of NLP models can reduce this cost and improve the utility of clinical reports for clinical and secondary use. Objectives: We present the design and implementation of an interactive NLP tool for identifying incidental findings in radiology reports, along with a user study evaluating the performance and usability of the tool. Methods: Expert reviewers provided gold standard annotations for 130 patient encounters (694 reports) at sentence, section, and report levels. We performed a user study with 15 physicians to evaluate the accuracy and usability of our tool. Participants reviewed encounters split into intervention (with predictions) and control conditions (no predictions). We measured changes in model performance, the time spent, and the number of user actions needed. The System Usability Scale (SUS) and an open-ended questionnaire were used to assess usability. Results: Starting from bootstrapped models trained on 6 patient encounters, we observed an average increase in F1 score from 0.31 to 0.75 for reports, from 0.32 to 0.68 for sections, and from 0.22 to 0.60 for sentences on a held-out test data set, over an hour-long study session. We found that the tool helped significantly reduce the time spent reviewing encounters (134.30 vs. 148.44 seconds in intervention and control, respectively), while maintaining overall quality of labels as measured against the gold standard. The tool was well received by the study participants with a very good overall SUS score of 78.67. Conclusion: The user study demonstrated successful use of the tool by physicians for identifying incidental findings. These results support the viability of adopting interactive NLP tools in clinical care settings for a wider range of clinical applications.
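For readers unfamiliar with the two measures reported here, the sketch below shows how an F1 score and a System Usability Scale score are conventionally computed. It uses the standard definitions (F1 from true positives, false positives and false negatives; SUS from ten items rated 1 to 5) and is not the study's own evaluation code. The function names are chosen for illustration.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of raw counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sus_score(responses):
    """Standard SUS scoring for ten responses on a 1-5 scale, giving a 0-100 score."""
    if len(responses) != 10:
        raise ValueError("SUS expects exactly ten item responses")
    total = 0
    for item, rating in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (rating - 1) if item % 2 == 1 else (5 - rating)
    return total * 2.5
```

A study-level SUS figure such as the one quoted above is typically the mean of the per-participant scores.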

Resolving pseudosymmetry in tetragonal ZrO2 using electron backscatter diffraction with a modified dictionary indexing approach

Journal of Applied Crystallography ◽

10.1107/s160057672000864x ◽

2020 ◽

Vol 53 (4) ◽

pp. 1060-1072 ◽

Cited By ~ 1

Author(s):

Edward L. Pang ◽

Peter M. Larsen ◽

Christopher A. Schuh

Keyword(s):

Present Method ◽

Electron Backscatter Diffraction ◽

Source Code ◽

Tetragonal ZrO2 ◽

Data Set ◽

Orientation Space ◽

Indexing Method ◽

Wide Range ◽

Electron Backscatter ◽

Backscatter Diffraction

Resolving pseudosymmetry has long presented a challenge for electron backscatter diffraction and has been notoriously challenging in the case of tetragonal ZrO2 in particular. In this work, a method is proposed to resolve pseudosymmetry by building upon the dictionary indexing method and augmenting it with the application of global optimization to fit accurate pattern centers, clustering of the Hough-indexed orientations to focus the dictionary in orientation space and interpolation to improve the accuracy of the indexed solution. The proposed method is demonstrated to resolve pseudosymmetry with 100% accuracy in simulated patterns of tetragonal ZrO2, even with high degrees of binning and noise. The method is then used to index an experimental data set, which confirms its ability to efficiently and accurately resolve pseudosymmetry in these materials. The present method can be applied to resolve pseudosymmetry in a wide range of materials, possibly even some more challenging than tetragonal ZrO2. Source code for this implementation is available online.

Atlantic frugivory: a plant-frugivore interaction data set for the Atlantic Forest

Ecology ◽

10.1002/ecy.1818 ◽

2017 ◽

Vol 98 (6) ◽

pp. 1729-1729 ◽

Cited By ~ 41

Author(s):

Carolina Bello ◽

Mauro Galetti ◽

Denise Montan ◽

Marco A. Pizo ◽

Tatiane C. Mariguela ◽

...

Keyword(s):

Atlantic Forest ◽

Interaction Data ◽

Data Set

Uncovering multiloci-ordering by algebraic property of Laplacian matrix and its Fiedler vector

Bioinformatics ◽

10.1093/bioinformatics/btv669 ◽

2015 ◽

Vol 32 (6) ◽

pp. 801-807 ◽

Cited By ~ 2

Author(s):

Mookyung Cheon ◽

Choongrak Kim ◽

Iksoo Chang

Keyword(s):

Source Code ◽

Algebraic Property ◽

Laplacian Matrix ◽

Recombination Fraction ◽

Supplementary Information ◽

Supplementary Data ◽

Data Set ◽

Variable Threshold ◽

Fiedler Vector ◽

Complex Graph

Abstract. Motivation: The loci-ordering, based on two-point recombination fractions for a pair of loci, is the most important step in constructing a reliable and fine genetic map. Results: Using the concept from complex graph theory, here we propose a Laplacian ordering approach which uncovers the loci-ordering of multiloci simultaneously. The algebraic property for a Fiedler vector of a Laplacian matrix, constructed from the recombination fraction of the loci-ordering for 26 loci of barley chromosome IV, 846 loci of Arabidopsis thaliana and 1903 loci of Malus domestica, together with the variable threshold uncovers their loci-orders. It offers an alternative yet robust approach for ordering multiloci. Availability and implementation: Source code program with data set is available as supplementary data and also in a software category of the website (http://biophysics.dgist.ac.kr). Contact: [email protected] or [email protected]. Supplementary information: Supplementary data are available at Bioinformatics online.
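The idea of ordering loci by a Fiedler vector can be sketched in a few lines. The version below is a simplified illustration, not the authors' program: it assumes that two loci are linked in the graph whenever their recombination fraction is below a single fixed `threshold` (the paper speaks of a variable threshold), that the resulting graph is connected, and it approximates the Fiedler vector by shifted power iteration in pure Python. In practice an eigensolver such as `numpy.linalg.eigh` would be the usual choice. All names are hypothetical.

```python
import random


def fiedler_order(recomb, threshold, iterations=2000, seed=0):
    """Order loci by the Fiedler vector of a thresholded recombination graph."""
    n = len(recomb)
    adj = [[1.0 if i != j and recomb[i][j] < threshold else 0.0 for j in range(n)]
           for i in range(n)]
    deg = [sum(row) for row in adj]
    # All Laplacian eigenvalues are at most 2 * max degree, so shift*I - L is positive definite.
    shift = 2.0 * max(deg) + 1.0

    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant eigenvector (eigenvalue 0)
        # y = (shift*I - L) x with L = D - A
        y = [(shift - deg[i]) * x[i] + sum(adj[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    # Sorting loci by their Fiedler-vector entry gives the order (up to reversal).
    return sorted(range(n), key=lambda i: x[i])
```

The recovered order is only defined up to reversal, since an eigenvector and its negation are equally valid.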

CLUSTERING-BASED APPROACH FOR PREDICTING MOTIF PAIRS FROM PROTEIN INTERACTION DATA

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720009004266 ◽

2009 ◽

Vol 07 (04) ◽

pp. 701-716 ◽

Cited By ~ 3

Author(s):

HENRY CHI-MING LEUNG ◽

MAN-HUNG SIU ◽

SIU-MING YIU ◽

FRANCIS YUK-LUN CHIN ◽

KEN WING-KIN SUNG

Keyword(s):

Lower Bound ◽

Protein Interaction ◽

Scoring Function ◽

Simulated Data ◽

Protein Sequences ◽

Protein Interaction Data ◽

P Value ◽

Interaction Data ◽

Motif Pair ◽

Data Set

Predicting motif pairs from a set of protein sequences based on protein–protein interaction data is an important but difficult computational problem. Tan et al. proposed a solution to this problem. However, the scoring function (using χ² testing) used in their approach is not adequate, and their approach is also not scalable. It may take days to process a set of 5000 protein sequences with about 20,000 interactions. Later, Leung et al. proposed an improved scoring function and faster algorithms for solving the same problem. However, the model used in Leung et al. is complicated. The exact value of the scoring function is not easy to compute and an estimated value is used in practice. In this paper, we derive a better model to capture the significance of a given motif pair based on a clustering notion. We develop a fast heuristic algorithm to solve the problem. The algorithm is able to locate the correct motif pair in the yeast data set in about 45 minutes for 5000 protein sequences and 20,000 interactions. Moreover, we derive a lower bound result for the p-value of a motif pair in order for it to be distinguishable from random motif pairs. The lower bound result has been verified using simulated data sets. Availability:

Validation for 2D/3D registration II: The comparison of intensity- and gradient-based merit functions using a new gold standard data set

Medical Physics ◽

10.1118/1.3553403 ◽

2011 ◽

Vol 38 (3) ◽

pp. 1491-1502 ◽

Cited By ~ 27

Author(s):

Christelle Gendrin ◽

Primož Markelj ◽

Supriyanto Ardjo Pawiro ◽

Jakob Spoerk ◽

Christoph Bloch ◽

...

Keyword(s):

Gold Standard ◽

Merit Functions ◽

3D Registration ◽

Data Set ◽

Standard Data ◽

Gradient Based

Transient or permanent fisheye views: A comparative evaluation of source code interfaces

Information Visualization ◽

10.1177/1473871611405643 ◽

2011 ◽

Vol 11 (2) ◽

pp. 151-167 ◽

Cited By ~ 2

Author(s):

Mikkel Rønne Jakobsen ◽

Kasper Hornbæk

Keyword(s):

Information Visualization ◽

User Interfaces ◽

Comparative Evaluation ◽

Source Code ◽

Programming Environment ◽

Interaction Data ◽

Seamless Integration ◽

Subjective Data ◽

Accuracy Measures ◽

Future Work

Transient use of information visualization may support specific tasks without permanently changing the user interface. Transient visualizations provide immediate and transient use of information visualization close to and in the context of the user’s focus of attention. Little is known, however, about the benefits and limitations of transient visualizations. We describe an experiment that compares the usability of a fisheye view that participants could call up temporarily, a permanent fisheye view, and a linear view: all interfaces gave access to source code in the editor of a widespread programming environment. Fourteen participants performed varied tasks involving navigation and understanding of source code. Participants used the three interfaces for between four and six hours in all. Time and accuracy measures were inconclusive, but subjective data showed a preference for the permanent fisheye view. We analyse interaction data to compare how participants used the interfaces and to understand why the transient interface was not preferred. We conclude by discussing seamless integration of fisheye views in existing user interfaces and future work on transient visualizations.

Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision (Preprint)

10.2196/preprints.28219 ◽

2021 ◽

Author(s):

Qi Jia ◽

Dezheng Zhang ◽

Haifeng Xu ◽

Yonghong Xie

Keyword(s):

Chinese Medicine ◽

Gold Standard ◽

False Negative ◽

Named Entity Recognition ◽

Entity Recognition ◽

Data Set ◽

Standard Data ◽

Named Entity ◽

Medical Entity ◽

Clinical Records

BACKGROUND: Traditional Chinese medicine (TCM) clinical records contain the symptoms of patients, diagnoses, and subsequent treatment of doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. OBJECTIVE: Training a medical entity extraction model needs a large annotated corpus. The cost of annotating a corpus is very high, and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we utilized distantly supervised named entity recognition (NER) to respond to the challenge. METHODS: We propose a span-level distantly supervised NER approach to extract TCM medical entities. It utilizes a pretrained language model and a simple multilayer neural network as a classifier to detect and classify entities. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and periodically filters out possible false-negative samples. It reduces the bad influence of the false-negative samples. RESULTS: We compare our method with other baseline methods on a gold-standard data set to illustrate its effectiveness. The F1 score of our method is 77.34, and it remarkably outperforms the other baselines. CONCLUSIONS: We developed a distantly supervised NER approach to extract medical entities from TCM clinical records. We evaluated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves better performance than other baselines.
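The negative sampling strategy in the METHODS part can be pictured as follows: enumerate candidate spans, treat dictionary-matched spans as positives, and in each epoch draw a fresh random set of negatives while skipping spans currently suspected to be false negatives. The sketch below is an assumption-laden illustration rather than the authors' code: spans are `(start, end)` token offsets with an exclusive end, and the function names, the `max_len` limit and the `suspected` set (which a real system would refresh periodically from model predictions) are all hypothetical.

```python
import random


def enumerate_spans(tokens, max_len):
    """All (start, end) spans of at most max_len tokens, with an exclusive end."""
    n = len(tokens)
    return [(i, j) for i in range(n) for j in range(i + 1, min(i + max_len, n) + 1)]


def sample_negatives(tokens, positives, suspected, k, max_len=4, rng=random):
    """Draw up to k negative spans for one epoch.

    positives: spans labeled as entities by distant supervision.
    suspected: spans flagged as probable false negatives, excluded from sampling.
    """
    candidates = [s for s in enumerate_spans(tokens, max_len)
                  if s not in positives and s not in suspected]
    return rng.sample(candidates, min(k, len(candidates)))
```

Calling `sample_negatives` once per epoch with a new random state gives each epoch a different negative set, which limits how often any single unlabeled entity is presented as a negative example.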
