plyranges: A grammar of genomic data transformation

Mapping Intimacies ◽

10.1101/327841 ◽

2018 ◽

Author(s):

Stuart Lee ◽

Dianne Cook ◽

Michael Lawrence

Keyword(s):

Data Structure ◽

High Throughput ◽

Genomic Data ◽

Data Transformation ◽

Interval Data ◽

Bioconductor Package ◽

Coherent Interface ◽

Bioconductor Project ◽

Genomic Interval ◽

Integrative Level

The Bioconductor project provides many interoperable data abstractions for analyzing high-throughput genomics experiments; however, implementing a typical genomic workflow with Bioconductor requires learning these abstractions and understanding how they fit together. This places a large cognitive burden on users, especially non-programmers. To reduce this burden, we have created a grammar of genomic data transformation that operates on a single, central Bioconductor data structure, GRanges, which naturally represents genomic intervals and their associated measurements. The grammar defines verbs for performing actions on and between genomic interval data through a simplified, coherent interface to existing Bioconductor infrastructure, resulting in fluent analysis workflows. We have implemented this grammar as an R/Bioconductor package called plyranges.

Augmented Interval List: a novel data structure for efficient genomic interval search

Bioinformatics ◽

10.1093/bioinformatics/btz407 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4907-4911 ◽

Cited By ~ 8

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Supplementary Information ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Scalable Methods

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is $O(\log_2 N + n + m)$, where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org.

Supplementary information: Supplementary data are available at Bioinformatics online.
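
The construction and query steps described in this abstract can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the class name `AIList`, the `lookahead` and `max_cover` parameters, and the rule used to move long intervals into a later sublist are all assumptions made for the example, and intervals are treated as half-open integer pairs `(start, end)`.

```python
from bisect import bisect_left
from itertools import accumulate


class AIList:
    """Simplified augmented interval list over half-open (start, end) pairs."""

    def __init__(self, intervals, lookahead=20, max_cover=10):
        # lookahead and max_cover are illustrative values, not taken from the paper.
        self.components = []
        rest = sorted(intervals)  # step 1: sort by start coordinate
        while rest:
            keep, moved = [], []
            for i, iv in enumerate(rest):
                followers = rest[i + 1 : i + 1 + lookahead]
                contained = sum(1 for f in followers if f[1] <= iv[1])
                # step 2: an interval that contains many of its followers is
                # moved to a later sublist, keeping this sublist nearly flat.
                (moved if contained >= max_cover else keep).append(iv)
            starts = [s for s, _ in keep]
            ends = [e for _, e in keep]
            # step 3: augment the sublist with the running maximum end.
            max_ends = list(accumulate(ends, max))
            self.components.append((starts, ends, max_ends))
            rest = moved

    def query(self, q_start, q_end):
        """Return every stored interval that overlaps [q_start, q_end)."""
        hits = []
        for starts, ends, max_ends in self.components:
            # Binary search for the last interval whose start is below q_end.
            i = bisect_left(starts, q_end) - 1
            # Scan backwards; once the running maximum end is <= q_start,
            # nothing further to the left can overlap the query.
            while i >= 0 and max_ends[i] > q_start:
                if ends[i] > q_start:
                    hits.append((starts[i], ends[i]))
                i -= 1
        return hits


index = AIList([(1, 5), (3, 8), (10, 12), (0, 100)])
print(sorted(index.query(4, 11)))
```

The binary search accounts for the logarithmic term in the query time, reported overlaps for n, and intervals visited during the backward scan that turn out not to overlap for m. This sketch does not cap the number of sublists, so heavily nested input can produce more than "a few" components.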

Augmented Interval List: a novel data structure for efficient genomic interval search

10.1101/593657 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C. Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Genomic Data Analysis ◽

Scalable Methods

Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary.

Results: We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is $O(\log_2 N + n + m)$, where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis.

Availability: An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.

DupChecker: a bioconductor package for checking high-throughput genomic data redundancy in meta-analysis

BMC Bioinformatics ◽

10.1186/1471-2105-15-323 ◽

2014 ◽

Vol 15 (1) ◽

pp. 323 ◽

Cited By ~ 1

Author(s):

Quanhu Sheng ◽

Yu Shyr ◽

Xi Chen

Keyword(s):

High Throughput ◽

Meta Analysis ◽

Genomic Data ◽

Bioconductor Package ◽

Data Redundancy

IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions.

Availability: https://github.com/databio/IGD
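
The abstract names a linear binning method without describing it, so the following Python sketch only illustrates the general idea of binning intervals by position. Everything in it is an assumption made for the example: the class name `BinnedIndex`, the fixed `bin_size`, the in-memory dictionary, and the rule for counting an interval once when it spans several bins. It should not be read as IGD's actual layout or algorithm.

```python
from collections import defaultdict


class BinnedIndex:
    """Toy fixed-width binning index over half-open integer intervals."""

    def __init__(self, bin_size=16384):
        # bin_size is an arbitrary illustrative value.
        self.bin_size = bin_size
        self.bins = defaultdict(list)

    def add(self, chrom, start, end, dataset_id):
        # Store the interval in every bin it touches.
        first = start // self.bin_size
        last = (end - 1) // self.bin_size
        for b in range(first, last + 1):
            self.bins[(chrom, b)].append((start, end, dataset_id))

    def count_overlaps(self, chrom, q_start, q_end):
        """Count, per dataset, the intervals overlapping [q_start, q_end)."""
        counts = defaultdict(int)
        first = q_start // self.bin_size
        last = (q_end - 1) // self.bin_size
        for b in range(first, last + 1):
            for start, end, dataset_id in self.bins.get((chrom, b), ()):
                if start < q_end and end > q_start:
                    # An interval stored in several bins is counted only in
                    # the first bin it shares with the query.
                    if b == max(first, start // self.bin_size):
                        counts[dataset_id] += 1
        return dict(counts)


index = BinnedIndex(bin_size=100)
index.add("chr1", 50, 250, dataset_id=0)
index.add("chr1", 400, 450, dataset_id=1)
print(index.count_overlaps("chr1", 90, 410))
```

Because a query only inspects the bins it touches, lookup cost depends on the query width and local interval density, not on the total number of stored regions, which is the property that makes binning attractive at this scale.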

SSFinder: High Throughput CRISPR-Cas Target Sites Prediction Tool

BioMed Research International ◽

10.1155/2014/742482 ◽

2014 ◽

Vol 2014 ◽

pp. 1-4 ◽

Cited By ~ 22

Author(s):

Santosh Kumar Upadhyay ◽

Shailesh Sharma

Keyword(s):

Operating Systems ◽

High Throughput ◽

Genomic Data ◽

Prediction Tool ◽

High Demand ◽

Specific Target ◽

Target Sites ◽

High Throughput Detection ◽

User Friendly

The clustered regularly interspaced short palindromic repeats (CRISPR) and CRISPR-associated protein (Cas) system facilitates targeted genome editing in organisms. Despite high demand for this system, finding a reliable tool for determining specific target sites in large genomic datasets has remained challenging. Here, we report SSFinder, a Python script that performs high-throughput detection of specific target sites in large nucleotide datasets. SSFinder is a user-friendly tool, compatible with Windows, Mac OS, and Linux operating systems, and freely available online.

ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data

Bioinformatics ◽

10.1093/bioinformatics/btp450 ◽

2009 ◽

Vol 25 (19) ◽

pp. 2607-2608 ◽

Cited By ~ 278

Author(s):

M. Morgan ◽

S. Anders ◽

M. Lawrence ◽

P. Aboyoun ◽

H. Pages ◽

...

Keyword(s):

Quality Assessment ◽

High Throughput ◽

Sequence Data ◽

Bioconductor Package ◽

Input Quality

karyoploteR: an R/Bioconductor package to plot customizable linear genomes displaying arbitrary data

10.1101/122838 ◽

2017 ◽

Cited By ~ 1

Author(s):

Bernat Gel ◽

Eduard Serra

Keyword(s):

Experimental Data ◽

Source Code ◽

Genomic Data ◽

Data Exploration ◽

Main Function ◽

Bioconductor Package ◽

End User ◽

Creation Process ◽

Whole Genomes ◽

Linear Genomes

Motivation: Data visualization is a crucial tool for data exploration, analysis and interpretation. For the visualization of genomic data, there is a lack of tools to create customizable non-circular plots of whole genomes from any species.

Results: We have developed karyoploteR, an R/Bioconductor package to create linear chromosomal representations of any genome with genomic annotations and experimental data plotted along them. The plot creation process is inspired by R base graphics, with a main function creating karyoplots with no data and multiple additional functions, including custom functions written by the end user, adding data and other graphical elements. This approach allows the creation of highly customizable plots from arbitrary data with complete freedom over data positioning and representation.

Availability: karyoploteR is released under the Artistic-2.0 License. Source code and documentation are freely available through Bioconductor (http://www.bioconductor.org/packages/karyoploteR).

RTNduals: an R/Bioconductor package for analysis of co-regulation and inference of dual regulons

Bioinformatics ◽

10.1093/bioinformatics/btz534 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5357-5358

Author(s):

Vinicius S Chagas ◽

Clarice S Groeneveld ◽

Kelin G Oliveira ◽

Sheyla Trefflich ◽

Rodrigo C de Almeida ◽

...

Keyword(s):

Regulatory Networks ◽

Target Genes ◽

Supplementary Information ◽

Transcriptional Networks ◽

Bioconductor Package ◽

Multiple Target ◽

R Language ◽

Downstream Effects ◽

Bioconductor Project ◽

General Method

Motivation: Transcription factors (TFs) are key regulators of gene expression, and can activate or repress multiple target genes, forming regulatory units, or regulons. Understanding downstream effects of these regulators includes evaluating how TFs cooperate or compete within regulatory networks. Here we present RTNduals, an R/Bioconductor package that implements a general method for analyzing pairs of regulons.

Results: RTNduals identifies a dual regulon when the number of targets shared between a pair of regulators is statistically significant. The package extends the RTN (Reconstruction of Transcriptional Networks) package, and uses RTN transcriptional networks to identify significant co-regulatory associations between regulons. The Supplementary Information reports two case studies for TFs using the METABRIC and TCGA breast cancer cohorts.

Availability and implementation: RTNduals is written in the R language, and is available from the Bioconductor project at http://bioconductor.org/packages/RTNduals/.

Supplementary information: Supplementary data are available at Bioinformatics online.
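
One common way to judge whether two regulons share more targets than chance would predict is a one-sided hypergeometric test. The Python sketch below illustrates that idea only; the abstract does not say which statistic RTNduals uses, so the choice of test, the function name `overlap_pvalue`, and the toy gene names are assumptions for the example, not a description of the package.

```python
from math import comb


def overlap_pvalue(universe_size, regulon_a, regulon_b):
    """P(shared targets >= observed) if regulon_b were drawn at random.

    universe_size is the number of genes that could be targets; both
    regulons are assumed to be subsets of that universe.
    """
    a, b = set(regulon_a), set(regulon_b)
    observed = len(a & b)
    n_a, n_b = len(a), len(b)
    total = comb(universe_size, n_b)
    # Sum the hypergeometric probabilities for every overlap size that is
    # at least as large as the observed one.
    tail = sum(
        comb(n_a, k) * comb(universe_size - n_a, n_b - k)
        for k in range(observed, min(n_a, n_b) + 1)
    )
    return tail / total


# Hypothetical regulons over a universe of 1000 genes.
tf1_targets = {f"gene{i}" for i in range(0, 50)}
tf2_targets = {f"gene{i}" for i in range(30, 90)}
print(overlap_pvalue(1000, tf1_targets, tf2_targets))
```

A small p-value here would flag the pair as a candidate for closer inspection; in practice, p-values across many regulator pairs would also need correction for multiple testing.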

34 Gene fusions in 1,015 human cancer cell lines: integrating large-scale genomic data, high-throughput drug and CRISPR/Cas9 screens to assess functional relevance and therapeutic potential

10.1136/esmoopen-2018-eacr25.34 ◽

2018 ◽

Author(s):

E Chen ◽

G Picco ◽

L Garcia-Alonso ◽

G Bignell ◽

F Behan ◽

...

Keyword(s):

Cell Lines ◽

High Throughput ◽

Large Scale ◽

Therapeutic Potential ◽

Human Cancer ◽

Genomic Data ◽

Gene Fusions ◽

Human Cancer Cell ◽

Human Cancer Cell Lines ◽

Functional Relevance

A comprehensive database of high-throughput sequencing-based RNA secondary structure probing data (Structure Surfer)

BMC Bioinformatics ◽

10.1186/s12859-016-1071-0 ◽

2016 ◽

Vol 17 (1) ◽

Cited By ~ 15

Author(s):

Nathan D. Berkowitz ◽

Ian M. Silverman ◽

Daniel M. Childress ◽

Hilal Kazan ◽

Li-San Wang ◽

...

Keyword(s):

Data Structure ◽

Secondary Structure ◽

High Throughput ◽

RNA Secondary Structure ◽

High Throughput Sequencing ◽

Comprehensive Database ◽

Structure Probing
