Augmented Interval List: a novel data structure for efficient genomic interval search

10.1101/593657 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C. Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Genomic Data Analysis ◽

Scalable Methods

Abstract Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results: We present a new data structure, the augmented interval list (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R, and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees (AITree), nested containment lists (NCList), or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4%–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability: An implementation of the AIList data structure with both construction and search algorithms is available at code.databio.org/AIList.
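The construction and query steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the half-open [start, end) convention, and the decomposition parameters (`max_components`, `look_ahead`, `cover_limit`) are assumptions chosen for illustration, and the rule used to decide which intervals move to the next sublist is a simplified stand-in.

```python
from bisect import bisect_left
from itertools import accumulate


def build_ailist(intervals, max_components=10, look_ahead=20, cover_limit=10):
    """Build a simplified AIList from half-open (start, end) intervals.

    Returns a list of components, each a tuple
    (starts, ends, running_max_end). All parameters are hypothetical.
    """
    remaining = sorted(intervals)  # step 1: sort by start coordinate
    components = []
    while remaining:
        keep, extracted = [], []
        last_round = len(components) == max_components - 1
        for i, (start, end) in enumerate(remaining):
            # Count how many of the next few intervals end inside this one.
            window = remaining[i + 1 : i + 1 + look_ahead]
            covered = sum(1 for _, e in window if e <= end)
            if not last_round and covered > cover_limit:
                extracted.append((start, end))  # step 2: defer to next sublist
            else:
                keep.append((start, end))
        starts = [s for s, _ in keep]
        ends = [e for _, e in keep]
        # Step 3: augment the sublist with the running maximum of the ends.
        components.append((starts, ends, list(accumulate(ends, max))))
        remaining = extracted
    return components


def query_ailist(components, q_start, q_end):
    """Return every stored interval overlapping [q_start, q_end)."""
    hits = []
    for starts, ends, max_ends in components:
        # Binary search: index of the last interval with start < q_end.
        i = bisect_left(starts, q_end) - 1
        # Scan left; stop once no earlier interval can reach q_start.
        while i >= 0 and max_ends[i] > q_start:
            if ends[i] > q_start:
                hits.append((starts[i], ends[i]))
            i -= 1
    return hits


demo = [(1, 5), (3, 8), (10, 12), (2, 20), (15, 18)]
ailist = build_ailist(demo, look_ahead=3, cover_limit=1)
print(sorted(query_ailist(ailist, 6, 11)))
```

In this sketch the binary search accounts for the logarithmic term in the query time, and the backward scan stops as soon as the running maximum end drops to or below the query start, so the extra comparisons come only from long intervals left in a sublist. Moving such intervals into their own sublist is what keeps that scan short.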

Augmented Interval List: a novel data structure for efficient genomic interval search

Bioinformatics ◽

10.1093/bioinformatics/btz407 ◽

2019 ◽

Vol 35 (23) ◽

pp. 4907-4911 ◽

Cited By ~ 8

Author(s):

Jianglin Feng ◽

Aakrosh Ratan ◽

Nathan C Sheffield

Keyword(s):

Data Structure ◽

High Performance ◽

Genomic Analysis ◽

Genomic Data ◽

Interval Data ◽

Supplementary Information ◽

Genomic Interval ◽

Interval Trees ◽

Running Maximum ◽

Scalable Methods

Abstract Motivation: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. Results: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5–18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory usage for AIList is 4–60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. Availability and implementation: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. Supplementary information: Supplementary data are available at Bioinformatics online.

plyranges: A grammar of genomic data transformation

10.1101/327841 ◽

2018 ◽

Author(s):

Stuart Lee ◽

Dianne Cook ◽

Michael Lawrence

Keyword(s):

Data Structure ◽

High Throughput ◽

Genomic Data ◽

Data Transformation ◽

Interval Data ◽

Bioconductor Package ◽

Coherent Interface ◽

Bioconductor Project ◽

Genomic Interval ◽

Integrative Level

The Bioconductor project provides many interoperable data abstractions for analyzing high-throughput genomics experiments; however, implementing a typical genomic workflow with Bioconductor requires learning these abstractions and understanding them at an integrative level. This places a large cognitive burden on the user, especially for non-programmers. To reduce this burden, we have created a grammar of genomic data transformation that operates on a single, central Bioconductor data structure, GRanges, which naturally represents genomic intervals and their associated measurements. The grammar defines verbs for performing actions on and between genomic interval data through a simplified, coherent interface to existing Bioconductor infrastructure, resulting in fluent analysis workflows. We have implemented this grammar as an R/Bioconductor package called plyranges.

IGD: high-performance search for large-scale genomic interval datasets

Bioinformatics ◽

10.1093/bioinformatics/btaa1062 ◽

2020 ◽

Author(s):

Jianglin Feng ◽

Nathan C Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Abstract Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD
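The abstract does not spell out how the linear binning works, so the following Python sketch shows only the generic idea of fixed-width binning for interval search, not IGD's actual on-disk format or algorithm. The bin width, the function names, and the half-open [start, end) convention are all assumptions made for illustration.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width in base pairs


def build_bins(intervals, bin_size=BIN_SIZE):
    """Place each half-open (start, end) interval in every bin it touches."""
    bins = defaultdict(list)
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[b].append((start, end))
    return bins


def query_bins(bins, q_start, q_end, bin_size=BIN_SIZE):
    """Return the intervals overlapping [q_start, q_end), checking only
    the bins that the query itself touches."""
    hits = set()  # an interval spanning several bins is reported once
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        for start, end in bins.get(b, ()):
            if start < q_end and end > q_start:
                hits.add((start, end))
    return sorted(hits)


bins = build_bins([(0, 5), (8, 25), (30, 40)], bin_size=10)
print(query_bins(bins, 4, 9, bin_size=10))
```

Because the bin index is computed directly from the coordinate, a query visits only the bins it spans rather than the whole dataset. One limitation of this sketch is that identical intervals collapse into a single hit, since results are collected in a set.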

HiGene: A high-performance platform for genomic data analysis

2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) ◽

10.1109/bibm.2016.7822584 ◽

2016 ◽

Cited By ~ 6

Author(s):

Liqun Deng ◽

Guowei Huang ◽

Yuzheng Zhuang ◽

Jiansheng Wei ◽

Youliang Yan

Keyword(s):

Data Analysis ◽

High Performance ◽

Genomic Data ◽

Genomic Data Analysis

IGD: high-performance search for large-scale genomic interval datasets

10.1101/2020.06.08.139758 ◽

2020 ◽

Cited By ~ 1

Author(s):

Jianglin Feng ◽

Nathan C. Sheffield

Keyword(s):

High Performance ◽

Large Scale ◽

Interval Data ◽

Scale Analysis ◽

Genome Database ◽

Link Type ◽

Genomic Interval ◽

Critical Resource ◽

Genomic Regions ◽

Genome Projects

Summary: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. Availability: https://github.com/databio/IGD

The parallelism motifs of genomic data analysis

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences ◽

10.1098/rsta.2019.0394 ◽

2020 ◽

Vol 378 (2166) ◽

pp. 20190394

Author(s):

Katherine Yelick ◽

Aydın Buluç ◽

Muaaz Awan ◽

Ariful Azad ◽

Benjamin Brock ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Architectural Design ◽

Large Scale ◽

Numerical Algorithms ◽

Genomic Data ◽

Scientific Simulations ◽

Genomic Data Analysis ◽

The Cost ◽

Support Software

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Characterizing Performance Variation of Genomic Data Analysis Workflows on the Public Cloud

2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech) ◽

10.1109/dasc-picom-cbdcom-cyberscitech49142.2020.00116 ◽

2020 ◽

Author(s):

David Perez ◽

Ling-Hong Hung ◽

Sonia Xu ◽

Ka Yee Yeung ◽

Wes Lloyd

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Public Cloud ◽

The Public ◽

Performance Variation ◽

Genomic Data Analysis

Statistical Genetics for Genomic Data Analysis

Springer Handbook of Engineering Statistics ◽

10.1007/978-1-84628-288-1_32 ◽

2006 ◽

pp. 591-605

Author(s):

Jae Lee

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Statistical Genetics ◽

Genomic Data Analysis

Introduction to R for Genomic Data Analysis

Computational Genomics with R ◽

10.1201/9780429084317-2 ◽

2020 ◽

pp. 23-66

Author(s):

Altuna Akalin

Keyword(s):

Data Analysis ◽

Genomic Data ◽

Genomic Data Analysis

A Chip for a Routing Table Based on a Novel Modified Trie Algorithm

VLSI Design ◽

10.1155/2000/81057 ◽

2000 ◽

Vol 11 (4) ◽

pp. 405-415

Author(s):

D. Torres ◽

A. Larios ◽

M. Guzmán

Keyword(s):

Data Structure ◽

High Performance ◽

Object Oriented ◽

Pci Bus ◽

Memory Space ◽

General Behavior ◽

Routing Table ◽

Starting Point ◽

Associated Data

The design of a routing table circuit for Ethernet, IP and ATM applications is presented. The starting point for the design was an object-oriented description of the general behavior of the routing table. The data structure selected for the routing table is a modification of the structure known as a trie, which saves one search level and memory space. The architecture for searching and sorting data, implemented in hardware, is explained. The modified trie stores 64 K addresses and their associated data while also achieving high performance. The circuit, which can support a flow of 500,000 frames/s, is connected to the PCI bus. A FLEX10K100 device from Altera was used for the implementation.
