The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and annotations

2017
Author(s):
Oana M. Enache
David L. Lahr
Ted E. Natoli
Lev Litichevskiy
David Wadden
...

Abstract
Motivation: Computational analysis of datasets generated by treating cells with pharmacological and genetic perturbagens has proven useful for the discovery of functional relationships. Facilitated by technological improvements, perturbational datasets have grown in recent years to include millions of experiments. While initial studies, such as our work on Connectivity Map, used gene expression readouts, recent studies from the NIH LINCS consortium have expanded to a more diverse set of molecular readouts, including proteomic and cell morphological signatures. Sharing these diverse data creates many opportunities for research and discovery, but the unprecedented size of data generated and the complex metadata associated with experiments have also created fundamental technical challenges regarding data storage and cross-assay integration.
Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization, and analysis of dense two-dimensional matrices. The utility of this format is not just theoretical; we have extensively used the format in the Connectivity Map to assemble and share massive data sets comprising 1.7 million experiments. We anticipate that the generalizability of the GCTx format, paired with code libraries that we provide, will stimulate wider adoption and lower barriers for integrated cross-assay analysis and algorithm development.
Availability: Software packages (available in Matlab, Python, and R) are freely available at https://github.com/cmap
Supplementary information: Supplementary information is available at clue.io/[email protected]

2018
Author(s):
Jean-Michel Claverie
TA Thi Ngan

Abstract
Motivation: More than 20 years ago, our laboratory published an original statistical test (referred to as the Audic-Claverie (AC) test in the literature) to identify differentially expressed genes from the pairwise comparison of counts of cognate RNA-seq reads (then called “expressed sequence tags”) determined in different conditions. Despite its antiquity and the publication of more sophisticated software packages, this original article continued to gather more than 200 citations per year, indicating the persistent usefulness of the simple AC test for the community. This prompted us to propose a fully revamped version of the AC test with a user interface adapted to the diverse and much larger datasets produced by contemporary omics techniques.
Results: We implemented ACDtool as an interactive, freely accessible web service proposing 3 types of analyses: 1) the pairwise comparison of individual counts, 2) pairwise comparisons of arbitrarily large lists of counts, 3) the all-at-once pairwise comparisons of multiple datasets. Statistical computations are implemented using standard R functions and mathematically reformulated so as to accommodate all practical ranges of count values. ACDtool can thus analyze datasets from transcriptomics, proteomics, metagenomics, barcoding, ChIP-seq, population genetics, etc., using the same mathematical approach. ACDtool is particularly well suited for comparisons of large datasets without replicates.
Availability: ACDtool is at URL: www.igs.cnrs-mrs.fr/acdtool/[email protected]
Supplementary information: none.
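
The remark about reformulating the computations for all practical ranges of count values can be illustrated with a small sketch. The abstract does not give the formula. The code below assumes the commonly cited form of the Audic-Claverie distribution, p(y | x) = (N2/N1)^y (x+y)! / (x! y! (1 + N2/N1)^(x+y+1)), where x and y are the counts for one gene in two conditions and N1, N2 are the total counts in each condition. It is written in Python rather than R, the function names are hypothetical, and it is not ACDtool's code. Working in log space with the log-gamma function avoids overflow in the factorials.

```python
import math


def ac_log_prob(x: int, y: int, n1: int, n2: int) -> float:
    """Log of p(y | x) under the assumed Audic-Claverie form.

    x, y   : counts observed for one gene in conditions 1 and 2
    n1, n2 : total counts in conditions 1 and 2
    """
    ratio = n2 / n1
    return (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)   # log (x + y)!
        - math.lgamma(x + 1)       # log x!
        - math.lgamma(y + 1)       # log y!
        - (x + y + 1) * math.log1p(ratio)
    )


def ac_prob(x: int, y: int, n1: int, n2: int) -> float:
    """p(y | x), obtained by exponentiating the log-space value."""
    return math.exp(ac_log_prob(x, y, n1, n2))


def ac_cdf(x: int, y: int, n1: int, n2: int) -> float:
    """Cumulative probability P(Y <= y | x), summed term by term."""
    return sum(ac_prob(x, k, n1, n2) for k in range(y + 1))


if __name__ == "__main__":
    # Equal library sizes, one read in each condition.
    print(ac_prob(1, 1, 100, 100))
    # Large counts stay finite in log space.
    print(ac_log_prob(10**6, 10**6, 10**7, 10**7))
```

How ACDtool turns these probabilities into reported significance values is not described in the abstract, so the sketch stops at the probability and its cumulative sum.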


2018
Author(s):
Martin Pirkl
Niko Beerenwinkel

Abstract
Motivation: New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous.
Results: We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq.
Availability: The mixture Nested Effects Model (M&NEM) is available as the R-package mnem at https://github.com/cbgethz/mnem/[email protected], [email protected]
Supplementary information: Supplementary data are available online.
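
The inference step described above, assigning each cell to a network with a certain probability, corresponds to the expectation step of a generic mixture model. The following is a minimal sketch of that step only, in Python rather than R. It assumes the log-likelihood of each cell under each component network has already been computed, and all names are hypothetical. The network update, which is the model-specific part of M&NEM, is not shown and is not implied by this code.

```python
import math


def e_step(log_lik, weights):
    """Return responsibilities resp[i][k] = P(component k | cell i).

    log_lik[i][k] : log-likelihood of cell i under component k (assumed given)
    weights[k]    : current mixture weight of component k
    """
    resp = []
    for row in log_lik:
        scores = [ll + math.log(w) for ll, w in zip(row, weights)]
        top = max(scores)  # subtract the maximum to avoid underflow
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        resp.append([e / total for e in exps])
    return resp


def update_weights(resp):
    """Mixture weights as the average responsibility per component."""
    n_cells = len(resp)
    n_components = len(resp[0])
    return [sum(r[k] for r in resp) / n_cells for k in range(n_components)]


if __name__ == "__main__":
    # Three cells, two candidate networks (toy log-likelihoods).
    log_lik = [[-10.0, -12.0], [-1000.0, -1001.0], [-5.0, -5.0]]
    resp = e_step(log_lik, [0.5, 0.5])
    print(resp)
    print(update_weights(resp))
```

In a full scheme these two functions would alternate with a step that re-estimates each component network from the cells weighted by their responsibilities.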


2018
Author(s):
Corbin Quick
Christian Fuchsberger
Daniel Taliun
Gonçalo Abecasis
Michael Boehnke
...

Abstract
Summary: Estimating linkage disequilibrium (LD) is essential for a wide range of summary statistics-based association methods for genome-wide association studies (GWAS). Large genetic data sets, e.g. the TOPMed WGS project and UK Biobank, enable more accurate and comprehensive LD estimates, but increase the computational burden of LD estimation. Here, we describe emeraLD (Efficient Methods for Estimation and Random Access of LD), a computational tool that leverages sparsity and haplotype structure to estimate LD orders of magnitude faster than existing tools.
Availability and Implementation: emeraLD is implemented in C++, and is open source under GPLv3. Source code, documentation, an R interface, and utilities for analysis of summary statistics are freely available at http://github.com/statgen/[email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
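
To illustrate how sparsity can reduce the cost of LD estimation, here is a minimal sketch in Python (not the C++ of emeraLD, and not its algorithm). It assumes phased haplotypes and biallelic variants, with each variant stored as the set of haplotype indices carrying the minor allele. The squared correlation r² then needs only the two carrier counts and the size of their intersection, so the work scales with the number of carriers rather than the number of haplotypes. The function name and data layout are hypothetical.

```python
def ld_r2(carriers_a: set, carriers_b: set, n_haplotypes: int) -> float:
    """Squared correlation between two biallelic variants.

    carriers_a, carriers_b : indices of haplotypes carrying the minor allele
    n_haplotypes           : total number of haplotypes in the panel
    Returns NaN when either variant is monomorphic.
    """
    n_a, n_b = len(carriers_a), len(carriers_b)
    if n_a in (0, n_haplotypes) or n_b in (0, n_haplotypes):
        return float("nan")

    # Iterate over the smaller set and probe the larger one.
    small, large = sorted((carriers_a, carriers_b), key=len)
    n_ab = sum(1 for h in small if h in large)

    # r = (n * n_ab - n_a * n_b) / sqrt(n_a (n - n_a) n_b (n - n_b))
    num = n_haplotypes * n_ab - n_a * n_b
    den = n_a * (n_haplotypes - n_a) * n_b * (n_haplotypes - n_b)
    return num * num / den


if __name__ == "__main__":
    print(ld_r2({0, 1}, {0, 1}, 10))   # identical carrier sets
    print(ld_r2({0, 1}, {0, 2}, 4))    # statistically independent
```

For rare variants the carrier sets are small, which is where a representation like this pays off most.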


2018
Author(s):
Lucas Czech
Alexandros Stamatakis

Abstract
Motivation: In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results.
Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence data sets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results.
Implementation: Freely available under GPLv3 at http://github.com/lczech/[email protected]
Supplementary Information: Supplementary data are available at Bioinformatics online.


2016
Author(s):
Roshni Cooper
Shaul Yogev
Kang Shen
Mark Horowitz

Abstract
Motivation: Microtubules (MTs) are polarized polymers that are critical for cell structure and axonal transport. They form a bundle in neurons, but beyond that, their organization is relatively unstudied.
Results: We present MTQuant, a method for quantifying MT organization using light microscopy, which distills three parameters from MT images: the spacing of MT minus-ends, their average length, and the average number of MTs in a cross-section of the bundle. This method allows for robust and rapid in vivo analysis of MTs, rendering it more practical and more widely applicable than commonly-used electron microscopy reconstructions. MTQuant was successfully validated with three ground truth data sets and applied to over 3000 images of MTs in a C. elegans motor neuron.
Availability: MATLAB code is available at http://roscoope.github.io/MTQuant
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.


2018
Author(s):
Carlos Martínez-Mira
Ana Conesa
Sonia Tarazona

Abstract
Motivation: As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets.
Results: MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs including time series data.
Availability: The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/)[email protected]
Supplementary information: Supplementary material is available at bioRxiv online.


2017
Author(s):
Anne Senabouth
Samuel W Lukowski
Jose Alquicira Hernandez
Stacey Andersen
Xin Mei
...

Abstract
Summary: ascend is an R package comprised of fast, streamlined analysis functions optimized to address the statistical challenges of single cell RNA-seq. The package incorporates novel and established methods to provide a flexible framework to perform filtering, quality control, normalization, dimension reduction, clustering, differential expression and a wide range of plotting. ascend is designed to work with scRNA-seq data generated by any high-throughput platform, and includes functions to convert data objects between software packages.
Availability: The R package and associated vignettes are freely available at https://github.com/IMB-Computational-Genomics-Lab/[email protected]
Supplementary information: An example dataset is available at ArrayExpress, accession number E-MTAB-6108.


2018
Author(s):
Georgi Danovski
Teodora Dyankova
Stoyno Stoynov

Abstract
Summary: We present CellTool, a stand-alone open-source software package with a Graphical User Interface for image analysis, optimized for measurement of time-lapse microscopy images. It combines data management, image processing, mathematical modeling and graphical presentation of data in a single package. Multiple image filters, segmentation and particle tracking algorithms, combined with direct visualization of the obtained results, make CellTool an ideal application for rapid execution of complex tasks. In addition, the software allows for the fitting of the obtained results to predefined or custom mathematical models. Importantly, CellTool provides a platform for easy implementation of custom image analysis packages written in a variety of programming languages.
Availability and Implementation: CellTool is free software available for MS Windows OS under the terms of the GNU General Public License. Executables and source files, supplementary information and sample data sets are freely available for download at URL: https://dnarepair.bas.bg/software/CellTool/[email protected]; [email protected];
Supplementary information: Supplementary data are available at URL: https://dnarepair.bas.bg/software/CellTool/Program/CellTool_UserGuide.pdf


2016
Author(s):
Andrian Yang
Michael Troup
Peijie Lin
Joshua W. K. Ho

Abstract
Summary: Single-cell RNA-seq (scRNA-seq) is increasingly used in a range of biomedical studies. Nonetheless, current RNA-seq analysis tools are not specifically designed to efficiently process scRNA-seq data due to their limited scalability. Here we introduce Falco, a cloud-based framework that enables parallelisation of existing RNA-seq processing pipelines using the big data technologies Apache Hadoop and Apache Spark to perform massively parallel analysis of large-scale transcriptomic data. Using two public scRNA-seq data sets and two popular RNA-seq alignment/feature quantification pipelines, we show that the same processing pipeline runs 2.6 to 145.4 times faster using Falco than running on a highly optimised single node analysis. Falco also allows users to utilise the low-cost spot instances of Amazon Web Services (AWS), providing a 65% reduction in cost of analysis.
Availability: Falco is available via a GNU General Public License at https://github.com/VCCRI/Falco/[email protected]
Supplementary information: Supplementary data are available at bioRxiv online.


Author(s):  
John Zobolas ◽  
Vasundra Touré ◽  
Martin Kuiper ◽  
Steven Vercruysse

Abstract
Summary: We present a set of software packages that provide uniform access to diverse biological vocabulary resources that are instrumental for current biocuration efforts and tools. The Unified Biological Dictionaries (UniBioDicts or UBDs) provide a single query-interface for accessing the online API services of leading biological data providers. Given a search string, UBDs return a list of matching term, identifier and metadata units from databases (e.g. UniProt), controlled vocabularies (e.g. PSI-MI) and ontologies (e.g. GO, via BioPortal). This functionality can be connected to input fields (user-interface components) that offer autocomplete lookup for these dictionaries. UBDs create a unified gateway for accessing life science concepts, helping curators find annotation terms across resources (based on descriptive metadata and unambiguous identifiers), and helping data users search and retrieve the right query terms.
Availability and implementation: The UBDs are available through npm and the code is available in the GitHub organisation UniBioDicts (https://github.com/UniBioDicts) under the Affero GPL license.
Supplementary information: Supplementary data are available at Bioinformatics online.

