The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and annotations

Mapping Intimacies ◽

10.1101/227041 ◽

2017 ◽

Cited By ~ 6

Author(s):

Oana M. Enache ◽

David L. Lahr ◽

Ted E. Natoli ◽

Lev Litichevskiy ◽

David Wadden ◽

...

Keyword(s):

Data Storage ◽

Supplementary Information ◽

Data Sets ◽

Connectivity Map ◽

Functional Relationships ◽

Software Packages ◽

Supplementary Material ◽

Dense Matrices ◽

Diverse Data ◽

Technological Improvements

Abstract

Motivation: Computational analysis of datasets generated by treating cells with pharmacological and genetic perturbagens has proven useful for the discovery of functional relationships. Facilitated by technological improvements, perturbational datasets have grown in recent years to include millions of experiments. While initial studies, such as our work on Connectivity Map, used gene expression readouts, recent studies from the NIH LINCS consortium have expanded to a more diverse set of molecular readouts, including proteomic and cell morphological signatures. Sharing these diverse data creates many opportunities for research and discovery, but the unprecedented size of data generated and the complex metadata associated with experiments have also created fundamental technical challenges regarding data storage and cross-assay integration.

Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization, and analysis of dense two-dimensional matrices. The utility of this format is not just theoretical; we have extensively used the format in the Connectivity Map to assemble and share massive data sets comprising 1.7 million experiments. We anticipate that the generalizability of the GCTx format, paired with code libraries that we provide, will stimulate wider adoption and lower barriers for integrated cross-assay analysis and algorithm development.

Availability: Software packages (available in MATLAB, Python, and R) are freely available at https://github.com/cmap

Supplementary information: Supplementary information is available at clue.io/

Contact: [email protected]


ACDtool: a web-server extending the original Audic-Claverie statistical test to the comparison of large data sets of counts

10.1101/304568 ◽

2018 ◽

Author(s):

Jean-Michel Claverie ◽

TA Thi Ngan

Keyword(s):

Pairwise Comparison ◽

Large Data ◽

Statistical Test ◽

Large Data Sets ◽

Supplementary Information ◽

Pairwise Comparisons ◽

Data Sets ◽

Multiple Datasets ◽

Software Packages ◽

Supplementary Material

Abstract

Motivation: More than 20 years ago, our laboratory published an original statistical test (referred to as the Audic-Claverie (AC) test in the literature) to identify differentially expressed genes from the pairwise comparison of counts of cognate RNA-seq reads (then called “expressed sequence tags”) determined in different conditions. Despite its antiquity and the publication of more sophisticated software packages, this original article continued to gather more than 200 citations per year, indicating the persistent usefulness of the simple AC test for the community. This prompted us to propose a fully revamped version of the AC test with a user interface adapted to the diverse and much larger datasets produced by contemporary omics techniques.

Results: We implemented ACDtool as an interactive, freely accessible web service proposing three types of analyses: 1) the pairwise comparison of individual counts, 2) pairwise comparisons of arbitrarily large lists of counts, 3) the all-at-once pairwise comparisons of multiple datasets. Statistical computations are implemented using standard R functions and mathematically reformulated so as to accommodate all practical ranges of count values. ACDtool can thus analyze datasets from transcriptomics, proteomics, metagenomics, barcoding, ChIP-seq, population genetics, etc., using the same mathematical approach. ACDtool is particularly well suited for comparisons of large datasets without replicates.

Availability: ACDtool is at URL: www.igs.cnrs-mrs.fr/acdtool/

Contact: [email protected]

Supplementary information: none.
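The pairwise comparison of individual counts mentioned in the abstract can be illustrated with the published Audic-Claverie formula, p(y | x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), where x and y are the counts observed in two conditions and N1 and N2 are the total counts in each. The sketch below is in Python rather than R and is not ACDtool's code; the function names are hypothetical. It evaluates the formula in log space, which is one plausible way to cope with large counts and is an assumption about what "mathematically reformulated" could mean.

```python
import math


def ac_log_prob(x: int, y: int, n1: int, n2: int) -> float:
    """Log of the Audic-Claverie probability p(y | x).

    x, y   : counts of the same tag in condition 1 and condition 2
    n1, n2 : total counts sampled in condition 1 and condition 2
    """
    ratio = n2 / n1
    return (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)   # log (x + y)!
        - math.lgamma(x + 1)       # log x!
        - math.lgamma(y + 1)       # log y!
        - (x + y + 1) * math.log1p(ratio)
    )


def ac_prob(x: int, y: int, n1: int, n2: int) -> float:
    """Audic-Claverie probability p(y | x), computed via the log form."""
    return math.exp(ac_log_prob(x, y, n1, n2))


def ac_upper_tail(x: int, y: int, n1: int, n2: int) -> float:
    """P(Y >= y | x): chance of seeing at least y in condition 2 given x."""
    below = sum(ac_prob(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)
```

Working with `math.lgamma` avoids overflowing factorials, so counts in the millions still yield finite log-probabilities.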


Single cell network analysis with a mixture of Nested Effects Models

10.1101/258202 ◽

2018 ◽

Author(s):

Martin Pirkl ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

New Technologies ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Cell Network ◽

A Cell ◽

Supplementary Material ◽

Cell Data

Abstract

Motivation: New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous.

Results: We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq.

Availability: The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/

Contact: [email protected], [email protected]

Supplementary information: Supplementary data are available online.
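The soft assignment of cells to networks described in the abstract follows the usual Expectation Maximization pattern for mixture models. The sketch below shows only the generic part, in Python rather than R: given a log-likelihood for each cell under each component, compute the probability that each cell belongs to each component, then re-estimate the mixture weights. It is not taken from the mnem package; the function names are hypothetical, and the network-update step, which is specific to Nested Effects Models, is deliberately left out.

```python
import math


def e_step(log_lik, weights):
    """Compute responsibilities for a mixture model.

    log_lik : list of K lists; log_lik[c][i] is the log-likelihood of
              cell i under component (network) c
    weights : list of K mixture weights, all > 0 and summing to 1
    Returns a K x n list where column i sums to 1.
    """
    k = len(weights)
    n = len(log_lik[0])
    resp = [[0.0] * n for _ in range(k)]
    for i in range(n):
        scores = [math.log(weights[c]) + log_lik[c][i] for c in range(k)]
        top = max(scores)  # subtract the maximum for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        for c in range(k):
            resp[c][i] = unnorm[c] / total
    return resp


def update_weights(resp):
    """Weight part of the M-step: average responsibility per component."""
    return [sum(row) / len(row) for row in resp]
```

In a full EM loop these two steps would alternate with a re-fit of each component's network until the likelihood stops improving.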


emeraLD: Rapid Linkage Disequilibrium Estimation with Massive Data Sets

10.1101/301366 ◽

2018 ◽

Cited By ~ 1

Author(s):

Corbin Quick ◽

Christian Fuchsberger ◽

Daniel Taliun ◽

Gonçalo Abecasis ◽

Michael Boehnke ◽

...

Keyword(s):

Linkage Disequilibrium ◽

Association Studies ◽

Random Access ◽

Supplementary Information ◽

Data Sets ◽

Genome Wide Association Studies ◽

Summary Statistics ◽

Genome Wide ◽

Wide Range ◽

Supplementary Material

Abstract

Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools.

Availability and Implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.
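To make the sparsity idea in the summary concrete, the sketch below computes the standard LD statistic r² between two biallelic variants from sparse carrier sets, that is, the indices of haplotypes carrying the minor allele at each variant. The cost then depends on the number of carriers rather than on the total number of haplotypes. This is a minimal Python illustration of the general principle, not emeraLD's C++ algorithm; the function name and the set-based representation are assumptions.

```python
def ld_r2(carriers_a: set, carriers_b: set, n_haplotypes: int) -> float:
    """Squared correlation (r^2) between two biallelic variants.

    carriers_a, carriers_b : indices of haplotypes carrying the minor
                             allele at variant A and variant B
    n_haplotypes           : total number of haplotypes in the sample
    Returns NaN when either variant is monomorphic.
    """
    p_a = len(carriers_a) / n_haplotypes
    p_b = len(carriers_b) / n_haplotypes
    # Haplotypes carrying both minor alleles: a sparse set intersection.
    p_ab = len(carriers_a & carriers_b) / n_haplotypes
    d = p_ab - p_a * p_b  # the LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")
    return d * d / denom
```

For rare variants the carrier sets are small, so the intersection touches only a handful of indices even when the sample contains many thousands of haplotypes.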


Methods for Automatic Reference Trees and Multilevel Phylogenetic Placement

10.1101/299792 ◽

2018 ◽

Cited By ~ 1

Author(s):

Lucas Czech ◽

Alexandros Stamatakis

Keyword(s):

Large Scale ◽

Sequence Data ◽

Sequence Similarity ◽

Computational Effort ◽

Supplementary Information ◽

Data Sets ◽

Metagenomic Sequencing ◽

Sequencing Studies ◽

Manual Selection ◽

Supplementary Material

Abstract

Motivation: In most metagenomic sequencing studies, the initial analysis step consists of assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do, however, face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results.

Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence data sets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results.

Implementation: Freely available under GPLv3 at http://github.com/lczech/

Contact: [email protected]

Supplementary Information: Supplementary data are available at Bioinformatics online.


MTQuant: “Seeing” Beyond the Diffraction Limit in Fluorescence Images to Quantify Neuronal Microtubule Organization

10.1101/074047 ◽

2016 ◽

Cited By ~ 1

Author(s):

Roshni Cooper ◽

Shaul Yogev ◽

Kang Shen ◽

Mark Horowitz

Keyword(s):

Average Length ◽

Cell Structure ◽

Ground Truth ◽

Supplementary Information ◽

Data Sets ◽

Microtubule Organization ◽

Ground Truth Data ◽

C Elegans ◽

Supplementary Material

Abstract

Motivation: Microtubules (MTs) are polarized polymers that are critical for cell structure and axonal transport. They form a bundle in neurons, but beyond that, their organization is relatively unstudied.

Results: We present MTQuant, a method for quantifying MT organization using light microscopy, which distills three parameters from MT images: the spacing of MT minus-ends, their average length, and the average number of MTs in a cross-section of the bundle. This method allows for robust and rapid in vivo analysis of MTs, rendering it more practical and more widely applicable than commonly used electron microscopy reconstructions. MTQuant was successfully validated with three ground truth data sets and applied to over 3000 images of MTs in a C. elegans motor neuron.

Availability: MATLAB code is available at http://roscoope.github.io/MTQuant

Contact: [email protected]

Supplementary information: Supplementary data are available at Bioinformatics online.


MOSim: Multi-Omics Simulation in R

10.1101/421834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Carlos Martínez-Mira ◽

Ana Conesa ◽

Sonia Tarazona

Keyword(s):

Time Series Data ◽

Simulated Data ◽

R Package ◽

Experimental Designs ◽

Supplementary Information ◽

Series Data ◽

Data Sets ◽

Expression Data ◽

Supplementary Material ◽

Omic Data

Abstract

Motivation: As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets.

Results: MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs including time series data.

Availability: The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/).

Contact: [email protected]

Supplementary information: Supplementary material is available at bioRxiv online.


ascend: R package for analysis of single cell RNA-seq data

10.1101/207704 ◽

2017 ◽

Cited By ~ 11

Author(s):

Anne Senabouth ◽

Samuel W Lukowski ◽

Jose Alquicira Hernandez ◽

Stacey Andersen ◽

Xin Mei ◽

...

Keyword(s):

Single Cell ◽

R Package ◽

Computational Genomics ◽

Supplementary Information ◽

Rna Seq ◽

Software Packages ◽

Wide Range ◽

Flexible Framework ◽

Supplementary Material ◽

Data Objects

Abstract

Summary: ascend is an R package comprising fast, streamlined analysis functions optimized to address the statistical challenges of single cell RNA-seq. The package incorporates novel and established methods to provide a flexible framework to perform filtering, quality control, normalization, dimension reduction, clustering, differential expression and a wide range of plotting functions. ascend is designed to work with scRNA-seq data generated by any high-throughput platform, and includes functions to convert data objects between software packages.

Availability: The R package and associated vignettes are freely available at https://github.com/IMB-Computational-Genomics-Lab/

Contact: [email protected]

Supplementary information: An example dataset is available at ArrayExpress, accession number E-MTAB-6108.


CellTool: an open source software combining bio-image analysis and mathematical modeling

10.1101/454256 ◽

2018 ◽

Author(s):

Georgi Danovski ◽

Teodora Dyankova ◽

Stoyno Stoynov

Keyword(s):

Mathematical Modeling ◽

Image Analysis ◽

Open Source ◽

Open Source Software ◽

Time Lapse ◽

Supplementary Information ◽

Data Sets ◽

Link Type ◽

Sample Data ◽

Supplementary Material

Abstract

Summary: We present CellTool, stand-alone open-source software with a Graphical User Interface for image analysis, optimized for measurement of time-lapse microscopy images. It combines data management, image processing, mathematical modeling and graphical presentation of data in a single package. Multiple image filters, segmentation and particle tracking algorithms, combined with direct visualization of the obtained results, make CellTool an ideal application for rapid execution of complex tasks. In addition, the software allows for the fitting of the obtained results to predefined or custom mathematical models. Importantly, CellTool provides a platform for easy implementation of custom image analysis packages written in a variety of programming languages.

Availability and Implementation: CellTool is free software available for MS Windows OS under the terms of the GNU General Public License. Executables and source files, supplementary information and sample data sets are freely available for download at URL: https://dnarepair.bas.bg/software/CellTool/

Contact: [email protected]; [email protected]

Supplementary information: Supplementary data are available at URL: https://dnarepair.bas.bg/software/CellTool/Program/CellTool_UserGuide.pdf


Falco: A quick and flexible single-cell RNA-seq processing framework on the cloud

10.1101/064006 ◽

2016 ◽

Cited By ~ 1

Author(s):

Andrian Yang ◽

Michael Troup ◽

Peijie Lin ◽

Joshua W. K. Ho

Keyword(s):

Single Cell ◽

Large Scale ◽

Low Cost ◽

Supplementary Information ◽

Data Sets ◽

Rna Seq ◽

Single Node ◽

Big Data Technologies ◽

Amazon Web Services ◽

Supplementary Material

Abstract

Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework to enable parallelisation of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark for performing massively parallel analysis of large-scale transcriptomic data. Using two public scRNA-seq data sets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6 to 145.4 times faster using Falco than running on a highly optimised single node analysis. Falco also allows users to utilise low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis.

Availability: Falco is available via a GNU General Public License at https://github.com/VCCRI/Falco/

Contact: [email protected]

Supplementary information: Supplementary data are available at bioRxiv online.


UniBioDicts: Unified access to Biological Dictionaries

Bioinformatics ◽

10.1093/bioinformatics/btaa1065 ◽

2020 ◽

Author(s):

John Zobolas ◽

Vasundra Touré ◽

Martin Kuiper ◽

Steven Vercruysse

Keyword(s):

User Interface ◽

Life Science ◽

Biological Data ◽

Supplementary Information ◽

Supplementary Data ◽

Query Interface ◽

Controlled Vocabularies ◽

Search String ◽

Software Packages ◽

The Right

Abstract

Summary: We present a set of software packages that provide uniform access to diverse biological vocabulary resources that are instrumental for current biocuration efforts and tools. The Unified Biological Dictionaries (UniBioDicts or UBDs) provide a single query-interface for accessing the online API services of leading biological data providers. Given a search string, UBDs return a list of matching term, identifier and metadata units from databases (e.g. UniProt), controlled vocabularies (e.g. PSI-MI) and ontologies (e.g. GO, via BioPortal). This functionality can be connected to input fields (user-interface components) that offer autocomplete lookup for these dictionaries. UBDs create a unified gateway for accessing life science concepts, helping curators find annotation terms across resources (based on descriptive metadata and unambiguous identifiers), and helping data users search and retrieve the right query terms.

Availability and implementation: The UBDs are available through npm and the code is available in the GitHub organisation UniBioDicts (https://github.com/UniBioDicts) under the Affero GPL license.

Supplementary information: Supplementary data are available at Bioinformatics online.
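The single query-interface described in the summary, one search string in and a merged list of term, identifier and metadata units out, can be sketched as follows. The real UBDs are npm (JavaScript) packages that call online APIs; this Python sketch uses in-memory stand-ins instead, and every class name, function name and identifier in it is hypothetical, chosen only to show the shape of such an interface.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """One search hit: a term, its identifier, and where it came from."""
    term: str
    identifier: str
    source: str
    metadata: dict = field(default_factory=dict)


class InMemoryDictionary:
    """Stand-in for one vocabulary provider (hypothetical, no network)."""

    def __init__(self, source: str, entries: list):
        self.source = source
        self.entries = entries  # list of (identifier, term) pairs

    def search(self, query: str) -> list:
        """Return entries whose term starts with the query (case-insensitive)."""
        q = query.lower()
        return [
            Match(term, ident, self.source)
            for ident, term in self.entries
            if term.lower().startswith(q)
        ]


def unified_search(dictionaries: list, query: str) -> list:
    """Query every dictionary and merge the hits into one sorted list."""
    results = []
    for dictionary in dictionaries:
        results.extend(dictionary.search(query))
    return sorted(results, key=lambda m: (m.term.lower(), m.source))
```

An autocomplete field would call `unified_search` on each keystroke and display the returned terms alongside their source, so that identical terms from different resources remain distinguishable by identifier.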
