SciPipe - A workflow library for agile development of complex and dynamic bioinformatics pipelines

Mapping Intimacies ◽

10.1101/380808 ◽

2018 ◽

Cited By ~ 2

Author(s):

Samuel Lampa ◽

Martin Dahlö ◽

Jonathan Alvarsson ◽

Ola Spjuth

Keyword(s):

Machine Learning ◽

Dynamic Scheduling ◽

Workflow Management ◽

Scientific Workflow ◽

Biological Data ◽

Agile Development ◽

Complex Nature ◽

Reusable Components ◽

Programming Library

Abstract. Background: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation and aid reproducibility of analyses. Many contemporary workflow tools are specialized and not designed for highly complex workflows, such as those with nested loops, dynamic scheduling and parametrization, which are common in e.g. machine learning. Findings: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on Flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development, and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline and a transcriptomics pipeline. Conclusions: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible programming API suitable for scientists used to programming or scripting.

The role of machine learning in scientific workflows

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019852127 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1128-1139 ◽

Cited By ~ 3

Author(s):

Ewa Deelman ◽

Anirban Mandal ◽

Ming Jiang ◽

Rizos Sakellariou

Keyword(s):

Machine Learning ◽

Language Processing ◽

Autonomous Vehicles ◽

Workflow Management ◽

Scientific Productivity ◽

Scientific Workflow ◽

Scientific Workflows ◽

Materials Development ◽

Smart Agriculture

Machine learning (ML) is being applied in a number of everyday contexts from image recognition, to natural language processing, to autonomous vehicles, to product recommendation. In the science realm, ML is being used for medical diagnosis, new materials development, smart agriculture, DNA classification, and many others. In this article, we describe the opportunities of using ML in the area of scientific workflow management. Scientific workflows are key to today’s computational science, enabling the definition and execution of complex applications in heterogeneous and often distributed environments. We describe the challenges of composing and executing scientific workflows and identify opportunities for applying ML techniques to meet these challenges by enhancing the current workflow management system capabilities. We foresee that as the ML field progresses, the automation provided by workflow management systems will greatly increase and result in significant improvements in scientific productivity.

GeNNet: An Integrated Platform for Unifying Scientific Workflow Management and Graph Databases for Transcriptome Data Analysis

10.1101/095257 ◽

2016 ◽

Cited By ~ 1

Author(s):

Raquel L. Costa ◽

Luiz M. R. Gadelha ◽

Marcelo Ribeiro-Alves ◽

Fabio Porto

Keyword(s):

Regulatory Networks ◽

Workflow Management ◽

Scientific Workflow ◽

Biological Data ◽

Scientific Workflows ◽

Graph Database ◽

Graph Databases ◽

Biological Databases ◽

Transcriptome Data ◽

Daunting Task

Abstract. Background: There are many steps in analyzing transcriptome data, from the acquisition of raw data to the selection of a subset of representative genes that explain a scientific hypothesis. The data produced may additionally be integrated with other biological databases, such as Protein-Protein Interactions and annotations. However, the results of these analyses remain fragmented, imposing difficulties, either for posterior inspection of results, or for meta-analysis by the incorporation of new related data. Integrating databases and tools into scientific workflows, orchestrating their execution, and managing the resulting data and its respective metadata are challenging tasks. Running in-silico experiments to structure and compose the information as needed for analysis is a daunting task. Different programs may need to be applied and different files are produced during the experiment cycle. In this context, the availability of a platform supporting experiment execution is paramount. Results: We present GeNNet, an integrated transcriptome analysis platform that unifies scientific workflows with graph databases for selecting relevant genes according to the evaluated biological systems. GeNNet includes pre-loaded biological data, pre-processes raw microarray data and conducts a series of analyses including normalization, differential expression inference, clustering and gene set enrichment analysis. To demonstrate the features of GeNNet, we performed case studies with data retrieved from GEO, particularly using a single-factor experiment. As a result, we obtained differentially expressed genes for which biological functions were analyzed. The results are integrated into GeNNet-DB, a database about genes, clusters, experiments and their properties and relationships. The resulting graph database is explored with queries that demonstrate the expressiveness of this data model for reasoning about gene regulatory networks. Conclusions: GeNNet is the first platform to integrate the analytical process of transcriptome data with a graph database. It provides a comprehensive set of tools that would otherwise be challenging for non-expert users to install and use. Developers can also add new functionality to each component of GeNNet. The resulting data allows for testing previous hypotheses about an experiment as well as exploring new ones through the interactive graph database environment. It enables the analysis of different data on humans, rhesus, mice and rats coming from Affymetrix platforms.

COVID-19 Outbreak Prediction with Machine Learning

10.34055/osf.io/xr4js ◽

2020 ◽

Author(s):

Sina Faizollahzadeh Ardabili ◽

Amir Mosavi ◽

Pedram Ghamisi ◽

Filip Ferdinand ◽

Annamaria R. Varkonyi-Koczy ◽

...

Keyword(s):

Machine Learning ◽

Prediction Models ◽

Fuzzy Inference ◽

Control Measures ◽

Future Research ◽

Complex Nature ◽

Inference System ◽

Wide Range ◽

Standard Models ◽

High Level

Several outbreak prediction models for COVID-19 are being used by officials around the world to make informed decisions and enforce relevant control measures. Among the standard models for COVID-19 global pandemic prediction, simple epidemiological and statistical models have received more attention from authorities, and they are popular in the media. Due to a high level of uncertainty and a lack of essential data, standard models have shown low accuracy for long-term prediction. Although the literature includes several attempts to address this issue, the essential generalization and robustness abilities of existing models need to be improved. This paper presents a comparative analysis of machine learning and soft computing models to predict the COVID-19 outbreak as an alternative to SIR and SEIR models. Among a wide range of machine learning models investigated, two models showed promising results (i.e., multi-layered perceptron, MLP, and adaptive network-based fuzzy inference system, ANFIS). Based on the results reported here, and due to the highly complex nature of the COVID-19 outbreak and variation in its behavior from nation to nation, this study suggests machine learning as an effective tool to model the outbreak. This paper provides an initial benchmarking to demonstrate the potential of machine learning for future research. The paper further suggests that real novelty in outbreak prediction can be realized through integrating machine learning and SEIR models.

Insights into therapeutic targets and biomarkers using integrated multi-‘omics’ approaches for dilated and ischemic cardiomyopathies

Integrative Biology ◽

10.1093/intbio/zyab007 ◽

2021 ◽

Author(s):

Austė Kanapeckaitė ◽

Neringa Burokienė

Keyword(s):

Machine Learning ◽

Single Cell ◽

Learning Algorithm ◽

Expression Profiles ◽

Therapeutic Targets ◽

Development Stage ◽

Biological Data ◽

Specific Gene ◽

Tissue Remodelling ◽

Pharmacological Management

Abstract. At present, heart failure (HF) treatment only targets the symptoms based on the left ventricle dysfunction severity; however, the lack of systemic ‘omics’ studies and available biological data to uncover the heterogeneous underlying mechanisms signifies the need to shift the analytical paradigm towards network-centric and data mining approaches. This study, for the first time, aimed to investigate how bulk and single cell RNA-sequencing as well as the proteomics analysis of the human heart tissue can be integrated to uncover HF-specific networks and potential therapeutic targets or biomarkers. We also aimed to address the issue of dealing with a limited number of samples and to show how appropriate statistical models, enrichment with other datasets as well as machine learning-guided analysis can aid in such cases. Furthermore, we elucidated specific gene expression profiles using transcriptomic and mined data from public databases. This was achieved using the two-step machine learning algorithm to predict the likelihood of the therapeutic target or biomarker tractability based on a novel scoring system, which has also been introduced in this study. The described methodology could be very useful for the target or biomarker selection and evaluation during the pre-clinical therapeutics development stage as well as disease progression monitoring. In addition, the present study sheds new light on the complex aetiology of HF, differentiating between subtle changes in dilated cardiomyopathies (DCs) and ischemic cardiomyopathies (ICs) on the single cell, proteome and whole transcriptome levels, demonstrating that HF might depend on the involvement of not only the cardiomyocytes but also other cell populations. Identified tissue remodelling and inflammatory processes can be beneficial when selecting targeted pharmacological management for DCs or ICs, respectively.

“Guilt by association” is not competitive with genetic association for identifying autism risk genes

Scientific Reports ◽

10.1038/s41598-021-95321-y ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Margot Gunning ◽

Paul Pavlidis

Keyword(s):

Machine Learning ◽

Genetic Association ◽

Gene Networks ◽

Rare Variants ◽

Association Studies ◽

Genetic Disorders ◽

Autism Spectrum ◽

Biological Data ◽

Disease Genes ◽

Risk Genes

Abstract. Discovering genes involved in complex human genetic disorders is a major challenge. Many have suggested that machine learning (ML) algorithms using gene networks can be used to supplement traditional genetic association-based approaches to predict or prioritize disease genes. However, questions have been raised about the utility of ML methods for this type of task due to biases within the data, and poor real-world performance. Using autism spectrum disorder (ASD) as a test case, we sought to investigate the question: can machine learning aid in the discovery of disease genes? We collected 13 published ASD gene prioritization studies and evaluated their performance using known and novel high-confidence ASD genes. We also investigated their biases towards generic gene annotations, like number of association publications. We found that ML methods which do not incorporate genetics information have limited utility for prioritization of ASD risk genes. These studies perform at a comparable level to generic measures of likelihood for the involvement of genes in any condition, and do not out-perform genetic association studies. Future efforts to discover disease genes should be focused on developing and validating statistical models for genetic association, specifically for association between rare variants and disease, rather than developing complex machine learning methods using complex heterogeneous biological data with unknown reliability.

Scientific Workflow Management -- For Whom?

2014 IEEE 10th International Conference on e-Science ◽

10.1109/escience.2014.8 ◽

2014 ◽

Cited By ~ 3

Author(s):

Silvia Olabarriaga ◽

Gabrielle Pierantoni ◽

Giuliano Taffoni ◽

Eva Sciacca ◽

Mahdi Jaghoori ◽

...

Keyword(s):

Workflow Management ◽

Scientific Workflow

A scientific workflow management system architecture and its scheduling based on cloud service platform for manufacturing big data analytics

The International Journal of Advanced Manufacturing Technology ◽

10.1007/s00170-015-7804-9 ◽

2015 ◽

Vol 84 (1-4) ◽

pp. 119-131 ◽

Cited By ~ 39

Author(s):

Xiu Li ◽

Jingdong Song ◽

Biqing Huang

Keyword(s):

Big Data ◽

System Architecture ◽

Management System ◽

Data Analytics ◽

Workflow Management ◽

Big Data Analytics ◽

Cloud Service ◽

Scientific Workflow ◽

Workflow Management System ◽

Cloud Service Platform

A new optimization phase for scientific workflow management systems

Future Generation Computer Systems ◽

10.1016/j.future.2013.09.005 ◽

2014 ◽

Vol 36 ◽

pp. 352-362 ◽

Cited By ~ 15

Author(s):

Sonja Holl ◽

Olav Zimmermann ◽

Magnus Palmblad ◽

Yassene Mohammed ◽

Martin Hofmann-Apitius

Keyword(s):

Workflow Management ◽

Scientific Workflow ◽

Management Systems ◽

Workflow Management Systems

A User-Defined Exception Handling Framework in the VIEW Scientific Workflow Management System

2012 IEEE Ninth International Conference on Services Computing ◽

10.1109/scc.2012.71 ◽

2012 ◽

Cited By ~ 2

Author(s):

Dong Ruan ◽

Shiyong Lu ◽

Aravind Mohan ◽

Xubo Fei ◽

Jia Zhang

Keyword(s):

Management System ◽

Workflow Management ◽

Exception Handling ◽

Scientific Workflow ◽

Workflow Management System

Swarm Intelligence Optimization: An Exploration and Application of Machine Learning Technology

Journal of Intelligent Systems ◽

10.1515/jisys-2020-0084 ◽

2021 ◽

Vol 30 (1) ◽

pp. 460-469

Author(s):

Yinying Cai ◽

Amit Sharma

Keyword(s):

Machine Learning ◽

Swarm Intelligence ◽

Research Result ◽

Machine Learning Algorithms ◽

Learning Technology ◽

Data Set ◽

Rice Pests ◽

Smart Agriculture ◽

Swarm Intelligence Optimization

Abstract. Efficient machinery and equipment play an important role in agricultural development and growth. Various research studies and patents aim to aid smart agriculture, and authors and reviewers report that machine learning technologies are providing the best support for this growth. To explore machine learning technology and machine learning algorithms, most of the applications are studied on the basis of swarm intelligence optimization. An optimized V3CFOA-RF model is built through V3CFOA. The algorithm is tested on a data set collected on rice pests, then analyzed and compared in detail with other existing algorithms. The results show that the proposed model and algorithm are not only more accurate in recognition and prediction, but also solve the time-lag problem to a degree. The model and algorithm helped achieve a higher accuracy in crop pest prediction, which ensures a more stable and higher output of rice. Thus they can be employed as an important decision-making instrument in the agricultural production sector.
