Language workbench user interfaces for data analysis

2015 ◽
Author(s):  
Victoria M Benson ◽  
Fabien Campagne

Biological data analysis is frequently performed with command-line software. While this practice provides considerable flexibility for computationally savvy individuals, such as investigators trained in bioinformatics, it also creates a barrier to the widespread use of data analysis software by investigators trained as biologists and/or clinicians. Workflow systems such as Galaxy and Taverna have been developed to provide generic user interfaces that can wrap command-line analysis software. These solutions are useful for problems that can be solved with workflows and that do not require specialized user interfaces. However, some types of analyses can benefit from custom user interfaces. For instance, developing biomarker models from high-throughput data is a type of analysis that can be expressed more succinctly with specialized user interfaces. Here, we show how Language Workbench (LW) technology can be used to model the biomarker development and validation process. We developed a language that models the concepts of Dataset, Endpoint, Feature Selection Method and Classifier. These high-level language concepts map directly to abstractions that analysts who develop biomarker models are familiar with. We found that user interfaces developed in the Meta-Programming System (MPS) LW provide convenient means to configure a biomarker development project, train models and view validation statistics. We discuss several advantages of developing user interfaces for data analysis with an LW, including increased interface consistency, portability and extension by language composition. The language developed during this experiment is distributed as an MPS plugin (available at http://campagnelab.org/software/bdval-for-mps/).

2014 ◽  
Author(s):  
Victoria M Benson ◽  
Fabien Campagne

Biological data analysis is frequently performed with command-line software. While this practice provides considerable flexibility for computationally savvy individuals, such as investigators trained in bioinformatics, it also creates a barrier to the widespread use of data analysis software by investigators trained as biologists and/or clinicians. Dataflow systems such as Galaxy and Taverna have been developed to provide generic user interfaces that can wrap command-line analysis software. These solutions are useful for problems that can be solved with the dataflow abstraction and that do not require specialized user interfaces. For instance, developing biomarker models from high-throughput data is a type of analysis that cannot be directly expressed with the dataflow model. In contrast, we show here that Language Workbench (LW) technology can be used to model the biomarker development and validation process. We developed a language that models the concepts of Dataset, Endpoint, Feature Selection Method and Classifier. These high-level language concepts map directly to abstractions that analysts who develop biomarker models are familiar with. We found that user interfaces developed in the Meta-Programming System (MPS) LW provide convenient means to configure a biomarker development project, train models and view validation statistics. We discuss several advantages of developing user interfaces for data analysis with an LW, including increased interface consistency, portability and extension by language composition. The language developed during this experiment is distributed as an MPS plugin (available at http://campagnelab.org/software/bdval-for-mps/).
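The Dataset, Endpoint, Feature Selection Method and Classifier concepts in the abstract map naturally onto code. As an illustrative sketch only (plain Python, not the MPS-based language the paper describes; the selector, classifier and data below are hypothetical stand-ins), a biomarker development run built from those four abstractions could look like:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Dataset:
    """Samples as feature vectors, plus an endpoint label per sample."""
    features: list   # list of lists: one feature vector per sample
    labels: list     # endpoint value per sample (e.g. 0/1 outcome)

def select_top_variance(dataset, k):
    """Feature Selection Method: keep the k highest-variance features."""
    n_features = len(dataset.features[0])
    def variance(j):
        col = [row[j] for row in dataset.features]
        m = mean(col)
        return sum((x - m) ** 2 for x in col) / len(col)
    ranked = sorted(range(n_features), key=variance, reverse=True)
    return ranked[:k]

def train_nearest_centroid(dataset, selected):
    """Classifier: one centroid per class over the selected features."""
    centroids = {}
    for label in set(dataset.labels):
        rows = [[row[j] for j in selected]
                for row, y in zip(dataset.features, dataset.labels)
                if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def predict(sample):
        vec = [sample[j] for j in selected]
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(vec, c))
        return min(centroids, key=lambda lab: dist(centroids[lab]))
    return predict

# Toy data: feature 0 separates the classes; feature 1 is near-constant.
data = Dataset(features=[[0.1, 5.0], [0.2, 5.1], [0.9, 5.0], [1.0, 5.1]],
               labels=[0, 0, 1, 1])
selected = select_top_variance(data, k=1)
model = train_nearest_centroid(data, selected)
```

The point of the sketch is the separation of concerns: swapping in a different feature selection method or classifier leaves the rest of the configuration untouched, which is the property the language concepts exploit.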


2015 ◽  
Author(s):  
Fabien Campagne ◽  
William ER Digan ◽  
Manuele Simi

Abstract Data analysis tools have become essential to the study of biology. Here, we applied language workbench technology (LWT) to create data analysis languages tailored for biologists with a diverse range of experience: from beginners with no programming experience to expert bioinformaticians and statisticians. A key novelty of our approach is its ability to blend user interface with scripting in a single platform. This feature helps beginners and experts alike analyze data more productively. This new approach has several advantages over the state-of-the-art approaches currently popular for data analysis: experts can design simplified data analysis languages that require no programming experience and behave like graphical user interfaces, yet have the advantages of scripting. We report on such a simple language, called MetaR, which we have used to teach complete beginners how to call differentially expressed genes and build heatmaps. We found that beginners can complete this task in less than 2 hours with MetaR, when more traditional teaching with R and its packages would require several training sessions (6–24 hours). Furthermore, MetaR seamlessly integrates with Docker to enable reproducibility of analyses and simplified R package installations during training sessions. We used the same approach to develop the first composable R language. A composable language is a language that can be extended with micro-languages. We illustrate this capability with a Biomart micro-language designed to compose with R and help R programmers interactively assemble specific queries to retrieve data from Biomart. (The same micro-language also composes with MetaR to help beginners query Biomart.) Our teaching experience suggests that language design with LWT can be a compelling approach for developing intelligent data analysis tools and can accelerate training for common data analysis tasks. LWT offers an interactive environment with the potential to promote exchanges between beginner and expert data analysts.
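The core task the abstract describes teaching, calling differentially expressed genes, reduces to ranking genes by a between-group test statistic. As a rough plain-Python illustration of that underlying computation (this is not MetaR or R code; the gene names and expression values are made up):

```python
from statistics import mean, variance

def t_statistic(a, b):
    """Welch two-sample t statistic between replicate groups a and b
    (illustration only: no p-value or multiple-testing correction)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

# Hypothetical expression values: two conditions, three replicates each.
expression = {
    "geneA": ([10.1, 10.3, 9.9], [15.0, 15.2, 14.8]),  # changes strongly
    "geneB": ([8.0, 8.2, 7.9],   [8.1, 8.0, 8.2]),     # barely changes
}

# Rank genes by the magnitude of their test statistic.
ranked = sorted(expression,
                key=lambda g: abs(t_statistic(*expression[g])),
                reverse=True)
```

A real analysis would use a dedicated package (e.g. limma or edgeR in R) with proper error modeling; the sketch only shows the shape of the task a beginner has to express, which MetaR wraps in a simplified language.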


Author(s):  
Taylor Reiter ◽  
Phillip T. Brooks ◽  
Luiz Irber ◽  
Shannon E.K. Joslin ◽  
Charles M. Reid ◽  
...  

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis, and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of practices and strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these strategies in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.

Author Summary We present a guide for workflow-enabled biological sequence data analysis, developed through our own teaching, training and analysis projects. We recognize that this is based on our own use cases and experiences, but we hope that our guide will contribute to a larger discussion within the open source and open science communities and lead to more comprehensive resources. Our main goal is to accelerate the research of scientists conducting sequence analyses by introducing them to organized workflow practices that not only benefit their own research but also facilitate open and reproducible science.
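Workflow systems such as Snakemake or Nextflow express the pattern in the abstract declaratively. Its essence, running each step only when its inputs exist and skipping steps whose outputs were already produced (conditional execution), can be sketched in a few lines of plain Python. This is an illustration of the idea only, not any workflow engine's API; the rules and data are invented:

```python
def run_workflow(rules, store):
    """Run each rule once its inputs exist in `store`; skip rules whose
    output is already present (conditional re-execution)."""
    executed = []
    pending = dict(rules)
    while pending:
        progressed = False
        for name, (inputs, output, fn) in list(pending.items()):
            if output in store:                       # already produced
                del pending[name]
                progressed = True
            elif all(i in store for i in inputs):     # inputs ready: run
                store[output] = fn(*(store[i] for i in inputs))
                executed.append(name)
                del pending[name]
                progressed = True
        if not progressed:
            raise RuntimeError(f"unsatisfiable inputs: {sorted(pending)}")
    return executed

# Toy three-step analysis: trim -> align -> count.
rules = {
    "trim":  (["reads"],   "trimmed", lambda r: r.strip()),
    "align": (["trimmed"], "aligned", lambda t: t.upper()),
    "count": (["aligned"], "counts",  lambda a: len(a)),
}
store = {"reads": "  acgt  "}
order = run_workflow(rules, store)
```

Calling `run_workflow` a second time on the same store executes nothing, which is the property that makes incremental development cheap: only steps invalidated by new inputs re-run.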


GigaScience ◽  
2021 ◽  
Vol 10 (1) ◽  
Author(s):  
Taylor Reiter ◽  
Phillip T Brooks† ◽  
Luiz Irber† ◽  
Shannon E K Joslin† ◽  
Charles M Reid† ◽  
...  

Abstract As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.


2010 ◽  
Vol 43 (4) ◽  
pp. 669-676 ◽  
Author(s):  
Pavel V. Afonine ◽  
Ralf W. Grosse-Kunstleve ◽  
Vincent B. Chen ◽  
Jeffrey J. Headd ◽  
Nigel W. Moriarty ◽  
...  

phenix.model_vs_data is a high-level command-line tool for the computation of crystallographic model and data statistics, and the evaluation of the fit of the model to data. Analysis of all Protein Data Bank structures that have experimental data available shows that in most cases the reported statistics, in particular R factors, can be reproduced within a few percentage points. However, there are a number of outliers where the recomputed R values are significantly different from those originally reported. The reasons for these discrepancies are discussed.
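The R factor being recomputed here is, in its simplest form, the normalized discrepancy between observed and calculated structure-factor amplitudes, R = Σ| |Fobs| − |Fcalc| | / Σ|Fobs|. A minimal sketch of that formula (the amplitudes below are toy numbers, not real crystallographic data, and phenix.model_vs_data applies additional conventions such as resolution cutoffs and scaling):

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R factor:
    R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|)."""
    num = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    return num / sum(abs(o) for o in f_obs)

# Toy amplitudes: a uniform 10% discrepancy gives R = 0.10.
f_obs  = [100.0, 200.0, 300.0]
f_calc = [ 90.0, 180.0, 270.0]
```

Sensitivity to exactly which reflections are included and how Fcalc is scaled is one reason recomputed R values can drift by a few percentage points from the originally reported ones.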


2020 ◽  
Vol 27 (38) ◽  
pp. 6523-6535 ◽  
Author(s):  
Antreas Afantitis ◽  
Andreas Tsoumanis ◽  
Georgia Melagraki

Drug discovery as well as (nano)material design projects demand the in silico analysis of large datasets of compounds with their corresponding properties/activities, as well as the retrieval and virtual screening of more structures in an effort to identify new potent hits. This is a demanding procedure for which various tools must be combined with different input and output formats. To automate the required data analysis, we have developed tools that facilitate a variety of important tasks and allow the construction of workflows that simplify the handling, processing and modeling of cheminformatics data, providing time- and cost-efficient solutions that are reproducible and easier to maintain. We therefore developed and present a toolbox of >25 processing modules, the Enalos+ nodes, which provide very useful operations within the KNIME platform for users interested in the nanoinformatics and cheminformatics analysis of chemical and biological data. With a user-friendly interface, the Enalos+ nodes provide a broad range of important functionalities, including data mining and retrieval from large available databases and tools for robust and predictive model development and validation. The Enalos+ nodes are available through KNIME as add-ins and offer valuable tools for extracting useful information and analyzing experimental and virtual screening results in a chem- or nano-informatics framework. On top of that, in an effort to (i) allow big data analysis through the Enalos+ KNIME nodes, (ii) accelerate time-demanding computations performed within the Enalos+ KNIME nodes and (iii) propose new time- and cost-efficient nodes integrated within the Enalos+ toolbox, we have investigated and verified the advantage of GPU calculations within the Enalos+ nodes. Demonstration data sets, a tutorial and educational videos allow the user to easily apprehend the functions of the nodes, which can be applied for in silico analysis of data.
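KNIME nodes of the kind described here are, at bottom, composable processing steps that take a table in and pass a table on. A minimal sketch of that pattern in plain Python (this is not the KNIME or Enalos+ API; the compounds and descriptor values are made up, and the filter uses the standard Lipinski-style thresholds MW ≤ 500 and logP ≤ 5 purely as an example):

```python
def node(fn):
    """Wrap a row-wise operation as a 'node' mapping table -> table.
    Rows for which fn returns None are dropped (a filter node)."""
    def run(table):
        out = []
        for row in table:
            result = fn(row)
            if result is not None:
                out.append(result)
        return out
    return run

@node
def druglike_filter(row):
    """Filter node: Lipinski-like drug-likeness rules."""
    return row if row["mw"] <= 500 and row["logp"] <= 5 else None

@node
def annotate(row):
    """Transform node: tag each surviving compound."""
    return {**row, "pass": True}

table = [
    {"id": "cmpd1", "mw": 320.0, "logp": 2.1},
    {"id": "cmpd2", "mw": 780.0, "logp": 6.4},   # fails both rules
]
result = annotate(druglike_filter(table))
```

Because every node shares the same table-in/table-out contract, nodes compose freely, which is what lets a KNIME workflow mix retrieval, filtering and modeling steps without format glue.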


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yi Chen ◽  
Fons J. Verbeek ◽  
Katherine Wolstencroft

Abstract

Background: The hallmarks of cancer provide a highly cited and well-used conceptual framework for describing the processes involved in cancer cell development and tumourigenesis. However, methods for translating these high-level concepts into data-level associations between hallmarks and genes (for high-throughput analysis) vary widely between studies. The examination of different strategies to associate and map cancer hallmarks reveals significant differences, but also consensus.

Results: Here we present the results of a comparative analysis of cancer hallmark mapping strategies, based on Gene Ontology and biological pathway annotation, from different studies. By analysing the semantic similarity between annotations and the resulting gene set overlap, we identify emerging consensus knowledge. In addition, we analyse the differences between hallmark and gene set associations using Weighted Gene Co-expression Network Analysis and enrichment analysis.

Conclusions: Reaching a community-wide consensus on how to identify cancer hallmark activity from research data would enable more systematic data integration and comparison between studies. These results highlight the current state of the consensus and offer a starting point for further convergence. In addition, we show how a lack of consensus can lead to large differences in the biological interpretation of downstream analyses, and we discuss the challenges of annotating changing and accumulating biological data using intermediate knowledge resources that are also changing over time.
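A standard way to quantify the gene-set overlap between two hallmark mapping strategies, as compared in this study, is the Jaccard index |A ∩ B| / |A ∪ B|. A minimal sketch (the gene sets below are hypothetical, not taken from the studies analysed):

```python
def jaccard(set_a, set_b):
    """Jaccard index of two gene sets: 1.0 for identical sets,
    0.0 for disjoint ones."""
    if not set_a and not set_b:
        return 1.0   # convention: two empty sets count as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical hallmark-to-gene mappings from two different studies.
study1 = {"TP53", "MYC", "VEGFA", "CDK4"}
study2 = {"TP53", "MYC", "KRAS"}
```

Here `jaccard(study1, study2)` is 2/5 = 0.4: a simple number that makes the degree of consensus, or lack of it, between mapping strategies directly comparable across hallmarks.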

