KibioR & Kibio: a new architecture for next-generation data querying and sharing in big biology

Bioinformatics ◽ 10.1093/bioinformatics/btab157 ◽ 2021

Author(s): Régis Ongaro-Carcy ◽ Marie-Pier Scott-Boyer ◽ Adrien Dessemond ◽ François Belleau ◽ Mickael Leclercq ◽ ...

Keyword(s): Data Management ◽ Data Storage ◽ Data Exchange ◽ Simple Structure ◽ R Package ◽ Biological Data ◽ Ease Of Use ◽ Supplementary Information ◽ Data Querying ◽ Provider Organization

Abstract:

Motivation: The growing production of massive heterogeneous biological data offers opportunities for new discoveries. However, performing multi-omics data analysis is challenging, and researchers are forced to handle the ever-increasing complexity of both data management and evolution of our biological understanding. Substantial efforts have been made to unify biological datasets into integrated systems. Unfortunately, they are not easily scalable, deployable and searchable, locally or globally. Results: This publication presents two tools with a simple structure that can help any data provider, organization or researcher requiring a reliable data search and analysis base. The first tool is Kibio, a scalable and adaptable data storage based on the Elasticsearch search engine. The second tool is KibioR, an R package to pull, push and search Kibio datasets or any accessible Elasticsearch-based databases. These tools apply a uniform data exchange model and minimize the burden of data management by organizing data into a decentralized, versatile, searchable and shareable structure. Several case studies are presented using multiple databases, from drug characterization to miRNAs and pathways identification, emphasizing the ease of use and versatility of the Kibio/KibioR framework. Availability: Both KibioR and Elasticsearch are open source. The KibioR package source is available at https://github.com/regisoc/kibior and the library on CRAN at https://cran.r-project.org/package=kibior. Supplementary information: Supplementary data are available at Bioinformatics online.

DiscoRhythm: an easy-to-use web application and R package for discovering rhythmicity

Bioinformatics ◽ 10.1093/bioinformatics/btz834 ◽ 2019 ◽ Cited By ~ 2

Author(s): Matthew Carlucci ◽ Algimantas Kriščiūnas ◽ Haohan Li ◽ Povilas Gibas ◽ Karolis Koncevičius ◽ ...

Keyword(s): Web Application ◽ Statistical Significance ◽ R Package ◽ Biological Data ◽ Supplementary Information ◽ Statistical Knowledge ◽ Health And Disease ◽ Phase Amplitude ◽ Almost All ◽ User Friendly

Abstract:

Motivation: Biological rhythmicity is fundamental to almost all organisms on Earth and plays a key role in health and disease. Identification of oscillating signals could lead to novel biological insights, yet its investigation is impeded by the extensive computational and statistical knowledge required to perform such analysis. Results: To address this issue, we present DiscoRhythm (Discovering Rhythmicity), a user-friendly application for characterizing rhythmicity in temporal biological data. DiscoRhythm is available as a web application or an R/Bioconductor package for estimating phase, amplitude, and statistical significance using four popular approaches to rhythm detection (Cosinor, JTK Cycle, ARSER, and Lomb-Scargle). We optimized these algorithms for speed, improving their execution times up to 30-fold to enable rapid analysis of -omic-scale datasets in real-time. Informative visualizations, interactive modules for quality control, dimensionality reduction, periodicity profiling, and incorporation of experimental replicates make DiscoRhythm a thorough toolkit for analyzing rhythmicity. Availability and implementation: The DiscoRhythm R package is available on Bioconductor (https://bioconductor.org/packages/DiscoRhythm), with source code available on GitHub (https://github.com/matthewcarlucci/DiscoRhythm) under a GPL-3 license. The web application is securely deployed over HTTPS (https://disco.camh.ca) and is freely available for use worldwide. Local instances of the DiscoRhythm web application can be created using the R package or by deploying the publicly available Docker container (https://hub.docker.com/r/mcarlucci/discorhythm). Supplementary information: Supplementary data are available at Bioinformatics online.
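As a rough illustration of what estimating phase and amplitude with a cosinor model involves, the sketch below fits a single cosine of known period by ordinary least squares. It is not DiscoRhythm's code: the function names `solve_linear` and `cosinor_fit`, the 24-hour period, and the synthetic data are all assumptions made for this example, and no significance test is included.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cosinor_fit(times, values, period):
    """Least-squares fit of y = mesor + amplitude * cos(2*pi*(t - peak_time) / period)."""
    omega = 2 * math.pi / period
    # Linearised model: y = mesor + b_cos * cos(omega t) + b_sin * sin(omega t).
    design = [[1.0, math.cos(omega * t), math.sin(omega * t)] for t in times]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    xty = [sum(row[i] * y for row, y in zip(design, values)) for i in range(3)]
    mesor, b_cos, b_sin = solve_linear(xtx, xty)
    amplitude = math.hypot(b_cos, b_sin)
    peak_time = (math.atan2(b_sin, b_cos) / omega) % period
    return mesor, amplitude, peak_time


# Hypothetical series: sampled every 2 h over one 24 h cycle, peaking at t = 6 h.
times = list(range(0, 24, 2))
values = [10 + 3 * math.cos(2 * math.pi * (t - 6) / 24) for t in times]
mesor, amplitude, peak_time = cosinor_fit(times, values, period=24)
print(round(mesor, 3), round(amplitude, 3), round(peak_time, 3))
```

The period is treated as known here; methods such as Lomb-Scargle are typically used when the period itself has to be estimated.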

Inferring cellular heterogeneity of associations from single cell genomics

Bioinformatics ◽ 10.1093/bioinformatics/btaa151 ◽ 2020 ◽ Vol 36 (11) ◽ pp. 3466-3473

Author(s): Maya Levy ◽ Amit Frishberg ◽ Irit Gat-Viks

Keyword(s): Simulated Data ◽ R Package ◽ Biological Data ◽ Cellular Heterogeneity ◽ Supplementary Information ◽ Dynamic Changes ◽ Entire Cell ◽ Complete Set ◽ Cellular Phenotypes ◽ Cell Variation

Abstract:

Motivation: Cell-to-cell variation has uncovered associations between cellular phenotypes. However, it remains challenging to address the cellular diversity of such associations. Results: Here, we do not rely on the conventional assumption that the same association holds throughout the entire cell population. Instead, we assume that associations may exist in a certain subset of the cells. We developed CEllular Niche Association (CENA) to reliably predict pairwise associations together with the cell subsets in which the associations are detected. CENA does not rely on predefined subsets but only requires that the cells of each predicted subset share a certain characteristic state. CENA may therefore reveal dynamic modulation of dependencies along cellular trajectories of temporally evolving states. Using simulated data, we show the advantage of CENA over existing methods and its scalability to a large number of cells. Application of CENA to real biological data demonstrates dynamic changes in associations that would be otherwise masked. Availability and implementation: CENA is available as an R package on GitHub: https://github.com/mayalevy/CENA and is accompanied by a complete set of documentation and instructions. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

dittoSeq: universal user-friendly single-cell and bulk RNA sequencing visualization toolkit

Bioinformatics ◽ 10.1093/bioinformatics/btaa1011 ◽ 2020

Author(s): Daniel G Bunis ◽ Jared Andrews ◽ Gabriela K Fragiadakis ◽ Trevor D Burt ◽ Marina Sirota

Keyword(s): Single Cell ◽ R Package ◽ Color Blindness ◽ Ease Of Use ◽ Supplementary Information ◽ Supplementary Data ◽ Rnaseq Data ◽ Visualization Toolkit ◽ User Friendly ◽ Publication Quality

Abstract:

Summary: A visualization suite for major forms of bulk and single-cell RNAseq data in R. dittoSeq is color blindness-friendly by default, robustly documented to power ease of use, and allows highly customizable generation of both daily-use and publication-quality figures. Availability and implementation: dittoSeq is an R package available through Bioconductor via an open source MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.

Identification of disease-associated loci using machine learning for genotype and network data integration

Bioinformatics ◽ 10.1093/bioinformatics/btz310 ◽ 2019 ◽ Vol 35 (24) ◽ pp. 5182-5190 ◽ Cited By ~ 4

Author(s): Luis G Leal ◽ Alessia David ◽ Marjo-Riita Jarvelin ◽ Sylvain Sebert ◽ Minna Männikkö ◽ ...

Keyword(s): Machine Learning ◽ Gene Networks ◽ Association Studies ◽ R Package ◽ Biological Data ◽ Machine Learning Algorithms ◽ Supplementary Information ◽ Genome Wide Association Studies ◽ Omics Data ◽ Missing Heritability

Abstract:

Motivation: Integration of different omics data could markedly help to identify biological signatures, understand the missing heritability of complex diseases and ultimately achieve personalized medicine. Standard regression models used in Genome-Wide Association Studies (GWAS) identify loci with a strong effect size, whereas GWAS meta-analyses are often needed to capture weak loci contributing to the missing heritability. Development of novel machine learning algorithms for merging genotype data with other omics data is highly needed as it could enhance the prioritization of weak loci. Results: We developed cNMTF (corrected non-negative matrix tri-factorization), an integrative algorithm based on clustering techniques of biological data. This method assesses the inter-relatedness between genotypes, phenotypes, the damaging effect of the variants and gene networks in order to identify loci-trait associations. cNMTF was used to prioritize genes associated with lipid traits in two population cohorts. We replicated 129 genes reported in GWAS world-wide and provided evidence that supports 85% of our findings (226 out of 265 genes), including recent associations in literature (NLGN1), regulators of lipid metabolism (DAB1) and pleiotropic genes for lipid traits (CARM1). Moreover, cNMTF performed efficiently against strong population structures by accounting for the individuals’ ancestry. As the method is flexible in the incorporation of diverse omics data sources, it can be easily adapted to the user’s research needs. Availability and implementation: An R package (cnmtf) is available at https://lgl15.github.io/cnmtf_web/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.

Presto scales Wilcoxon and auROC analyses to millions of observations

10.1101/653253 ◽ 2019 ◽ Cited By ~ 6

Author(s): Ilya Korsunsky ◽ Aparna Nathan ◽ Nghia Millard ◽ Soumya Raychaudhuri

Keyword(s): Single Cell ◽ Single Cell Analysis ◽ Sparse Matrices ◽ R Package ◽ Biological Data ◽ Supplementary Information ◽ Wilcoxon Rank Sum Test ◽ Biological Data Analysis ◽ Simple Interface ◽ Operator Curve

Abstract:

Summary: The related Wilcoxon rank sum test and area under the receiver operator curve are ubiquitous in high dimensional biological data analysis. Current implementations do not scale readily to the increasingly large datasets generated by novel high-throughput technologies, such as single cell RNAseq. We introduce a simple and scalable implementation of both analyses, available through the R package Presto. Presto scales to big datasets, with functions optimized for both dense and sparse matrices. On a sparse dataset of 1 million observations, 10 groups, and 1,000 features, Presto performed both rank-sum and auROC analyses in only 17 seconds, compared to 6.4 hours with base R functions. Presto also includes functions to seamlessly integrate with the Seurat single cell analysis pipeline and the Bioconductor SingleCellExperiment class. Presto enables the use of robust classical analyses on big data with a simple interface and optimized implementation. Availability and implementation: Presto is available as an R package at https://github.com/immunogenomics/[email protected] Supplementary information: Vignettes are available with the Presto package.
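The abstract calls the Wilcoxon rank sum test and auROC "related"; the connection is that both can be read off the same ranks, since auROC equals the Mann-Whitney U statistic divided by the product of the two group sizes. The sketch below shows that identity on a toy vector. It is plain, unoptimized Python and is not Presto's implementation; the function names `midranks` and `rank_sum_and_auroc` and the example values are hypothetical.

```python
def midranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def rank_sum_and_auroc(scores, in_group):
    """Mann-Whitney U for the flagged group and the matching auROC.

    Assumes both groups are non-empty.
    """
    ranks = midranks(scores)
    n_in = sum(1 for flag in in_group if flag)
    n_out = len(scores) - n_in
    rank_sum = sum(r for r, flag in zip(ranks, in_group) if flag)
    u_stat = rank_sum - n_in * (n_in + 1) / 2
    return u_stat, u_stat / (n_in * n_out)


# Hypothetical expression values for one gene across six cells,
# with a flag marking the cells that belong to the cluster of interest.
expression = [0.1, 0.4, 0.35, 0.8, 0.4, 0.9]
is_cluster = [False, False, True, True, False, True]
u_stat, auroc = rank_sum_and_auroc(expression, is_cluster)
print(u_stat, round(auroc, 4))
```

Because one ranking pass yields both quantities, a scalable implementation mainly needs an efficient way to rank each feature, which is where sparse-matrix handling matters.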

ClusterMine: a Knowledge-integrated Clustering Approach based on Expression Profiles of Gene Sets

10.1101/255711 ◽ 2018

Author(s): Hong-Dong Li ◽ Yunpei Xu ◽ Xiaoshu Zhu ◽ Quan Liu ◽ Gilbert S. Omenn ◽ ...

Keyword(s): Expression Profiles ◽ R Package ◽ Biological Data ◽ Supplementary Information ◽ Consensus Clustering ◽ Cluster Membership ◽ Link Type ◽ Novel Approach ◽ Gene Sets ◽ Biological Interpretation

Abstract:

Motivation: Clustering analysis is essential for understanding complex biological data. In widely used methods such as hierarchical clustering (HC) and consensus clustering (CC), expression profiles of all genes are often used to assess similarity between samples for clustering. These methods output sample clusters, but are not able to provide information about which gene sets (functions) contribute most to the clustering. So interpretability of their results is limited. We hypothesized that integrating prior knowledge of annotated biological processes would not only achieve satisfying clustering performance but also, more importantly, enable potential biological interpretation of clusters. Results: Here we report ClusterMine, a novel approach that identifies clusters by assessing functional similarity between samples through integrating known annotated gene sets, e.g., in Gene Ontology. In addition to outputting cluster membership of each sample as conventional approaches do, it outputs gene sets that are most likely to contribute to the clustering, a feature facilitating biological interpretation. Using three cancer datasets, two single cell RNA-sequencing based cell differentiation datasets, one cell cycle dataset and two datasets of cells of different tissue origins, we found that ClusterMine achieved similar or better clustering performance and that top-scored gene sets prioritized by ClusterMine are biologically relevant. Implementation and availability: ClusterMine is implemented as an R package and is freely available at: www.genemine.org/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Data management pipeline for plant phenotyping in a multisite project

Functional Plant Biology ◽ 10.1071/fp12009 ◽ 2012 ◽ Vol 39 (11) ◽ pp. 948 ◽ Cited By ~ 16

Author(s): Kenny Billiau ◽ Heike Sprenger ◽ Christian Schudoma ◽ Dirk Walther ◽ Karin I. Köhl

Keyword(s): Data Management ◽ Data Storage ◽ Data Exchange ◽ Data Access ◽ Data Management System ◽ Plant Phenotyping ◽ Data Types ◽ File Server ◽ Management Concepts ◽ Term Basis

In plant breeding, plants have to be characterised precisely, consistently and rapidly by different people at several field sites within defined time spans. For a meaningful data evaluation and statistical analysis, standardised data storage is required. Data access must be provided on a long-term basis and be independent of organisational barriers without endangering data integrity or intellectual property rights. We discuss the associated technical challenges and demonstrate adequate solutions exemplified in a data management pipeline for a project to identify markers for drought tolerance in potato. This project involves 11 groups from academia and breeding companies, 11 sites and four analytical platforms. Our data warehouse concept combines central data storage in databases and a file server and integrates existing and specialised database solutions for particular data types with new, project-specific databases. The strict use of controlled vocabularies and the application of web-access technologies proved vital to the successful data exchange between diverse institutes and data management concepts and infrastructures. By presenting our data management system and making the software available, we aim to support related phenotyping projects.

MetaCycle: an integrated R package to evaluate periodicity in large scale data

10.1101/040345 ◽ 2016 ◽ Cited By ~ 6

Author(s): Gang Wu ◽ Ron C Anafi ◽ Michael E Hughes ◽ Karl Kornacker ◽ John B Hogenesch

Keyword(s): Statistical Power ◽ Large Scale ◽ Time Series Data ◽ R Package ◽ Ease Of Use ◽ Data Availability ◽ Supplementary Information ◽ Series Data ◽ Large Scale Data ◽ Scale Data

Summary: Detecting periodicity in large scale data remains a challenge. Different algorithms offer strengths and weaknesses in statistical power, sensitivity to outliers, ease of use, and sampling requirements. While efforts have been made to identify best of breed algorithms, relatively little research has gone into integrating these methods in a generalizable method. Here we present MetaCycle, an R package that incorporates ARSER, JTK_CYCLE, and Lomb-Scargle to conveniently evaluate periodicity in time-series data. Availability and implementation: MetaCycle package is available on the CRAN repository (https://cran.r-project.org/web/packages/MetaCycle/index.html) and GitHub (https://github.com/gangwug/MetaCycle). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

GNET2: an R package for constructing gene regulatory networks from transcriptomic data

Bioinformatics ◽ 10.1093/bioinformatics/btaa902 ◽ 2020

Author(s): Chen Chen ◽ Jie Hou ◽ Xiaowen Shi ◽ Hua Yang ◽ James A Birchler ◽ ...

Keyword(s): Gene Regulatory Networks ◽ Regulatory Networks ◽ Data Exchange ◽ Graphical Model ◽ R Package ◽ Supplementary Information ◽ Original Algorithm ◽ Transcriptomic Data ◽ Regulatory Module ◽ Gene Regulatory

Abstract:

Motivation: The Gene Network Estimation Tool (GNET) is designed to build gene regulatory networks (GRNs) from transcriptomic gene expression data with a probabilistic graphical model. The data preprocessing, model construction and visualization modules of the original GNET software were developed on different programming platforms, which were inconvenient for users to deploy and use. Results: Here, we present GNET2, an improved implementation of GNET as an integrated R package. GNET2 provides more flexibility for parameter initialization and regulatory module construction based on the core iterative modeling process of the original algorithm. The data exchange interface of GNET2 is handled within an R session automatically. Given the growing demand for regulatory network reconstruction from transcriptomic data, GNET2 offers a convenient option for GRN inference on large datasets. Availability and implementation: The source code of GNET2 is available at https://github.com/jianlin-cheng/GNET2. Supplementary information: Supplementary data are available at Bioinformatics online.

Automatic identification of relevant genes from low-dimensional embeddings of single-cell RNA-seq data

Bioinformatics ◽ 10.1093/bioinformatics/btaa198 ◽ 2020 ◽ Vol 36 (15) ◽ pp. 4291-4295

Author(s): Philipp Angerer ◽ David S Fischer ◽ Fabian J Theis ◽ Antonio Scialdone ◽ Carsten Marr

Keyword(s): Single Cell ◽ Principal Component ◽ R Package ◽ Ease Of Use ◽ Supplementary Information ◽ Automatic Identification ◽ Biological Processes ◽ Rna Seq ◽ Sequencing Data ◽ Low Dimensional

Abstract:

Motivation: Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell’s position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes. Results: In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method’s versatility to identify key genes for a variety of biological processes. Availability and implementation: To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor. Supplementary information: Supplementary data are available at Bioinformatics online.
