external sorting

2021 ◽  
Vol 116 ◽  
pp. 333-348
Author(s):  
Chih-Hsuan Chen ◽  
Shuo-Han Chen ◽  
Yu-Pei Liang ◽  
Tseng-Yi Chen ◽  
Tsan-sheng Hsu ◽  
...  

2021 ◽  
Vol 20 (4) ◽  
pp. 1-21
Author(s):  
Riley Jackson ◽  
Jonathan Gresl ◽  
Ramon Lawrence

Embedded devices are ubiquitous in industrial and environmental monitoring, health and safety, and consumer appliances. A common use case is collecting data, processing it, and acting on the results of data analysis. Although many Internet of Things (IoT) applications use the embedded device simply for data collection, there are benefits to doing more data processing closer to the point of collection: fewer network transmissions, lower power usage, and faster response. This work implements and evaluates algorithms for sorting data on embedded devices, with a specific focus on the devices with the least memory. On devices with less than 4 KB of available RAM, the standard external merge sort algorithm has limited application because it requires a minimum of three memory buffers and is not flash-aware. The contribution is a memory-optimized external sorting algorithm called no output buffer sort (NOBsort) that reduces the minimum memory required for sorting, has excellent performance for sorted or near-sorted data, and sorts on external memory such as SD cards or raw flash chips. When sorting large datasets, no output buffer sort reduces I/O and execution time by 20% to 35% compared to standard external merge sort.
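For orientation, the standard external merge sort that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration in Python, not NOBsort and not the authors' embedded implementation; the function name `external_merge_sort` and the `run_size` memory budget are assumptions made for this example.

```python
import heapq
import tempfile


def external_merge_sort(values, run_size):
    """Yield the ints in `values` in sorted order using runs spilled to disk.

    run_size is a hypothetical memory budget: the maximum number of values
    held in RAM while a sorted run is being formed.
    """
    runs = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort()
        run = tempfile.TemporaryFile(mode="w+")
        run.writelines(f"{v}\n" for v in buffer)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    # Phase 1: run generation.
    for v in values:
        buffer.append(v)
        if len(buffer) >= run_size:
            spill()
    if buffer:
        spill()

    # Phase 2: k-way merge. heapq.merge pulls from each run lazily,
    # so only one pending value per run is held at a time.
    try:
        streams = [(int(line) for line in run) for run in runs]
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()


result = list(external_merge_sort([5, 3, 9, 1, 7, 2, 8], run_size=3))
```

The merge phase reads from every run and writes to an output stream, which is why the textbook algorithm needs at least two input buffers and one output buffer.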


Author(s):  
Wenhan Chen ◽  
Yang Liu ◽  
Zhiguang Chen ◽  
Fang Liu ◽  
Nong Xiao

2020 ◽  
Vol 36 (9) ◽  
pp. 2705-2711 ◽  
Author(s):  
Gianvito Urgese ◽  
Emanuele Parisi ◽  
Orazio Scicolone ◽  
Santa Di Cataldo ◽  
Elisa Ficarra

Abstract. Motivation: High-throughput next-generation sequencing can generate huge sequence files, whose analysis requires alignment algorithms that are typically very demanding in terms of memory and computational resources. This is a significant issue, especially for machines with limited hardware capabilities. As the redundancy of the sequences typically increases with coverage, collapsing such files into compact sets of non-redundant reads has the twofold advantage of reducing file size and speeding up the alignment, since the same sequence is not mapped multiple times. Method: BioSeqZip generates compact and sorted lists of alignment-ready non-redundant sequences, keeping track of their occurrences in the raw files as well as of their quality score information. By exploiting a memory-constrained external sorting algorithm, it can be executed on either single- or multi-sample datasets, even on computers with medium computational capabilities. On request, it can also re-expand the compacted files to their original state. Results: Our extensive experiments on RNA-Seq data show that BioSeqZip considerably reduces the computational costs of a standard sequence analysis pipeline, with particular benefits for the alignment procedures, which typically have the highest requirements in terms of memory and execution time. In our tests, BioSeqZip was able to compact 2.7 billion reads into 963 million unique tags, reducing the size of sequence files by up to 70% and speeding up the alignment by at least 50%. Availability and implementation: BioSeqZip is available at https://github.com/bioinformatics-polito/BioSeqZip. Supplementary information: Supplementary data are available at Bioinformatics online.
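The collapsing step described in the abstract relies on the fact that identical reads become adjacent once the input is sorted, so they can be counted in a single streaming pass. The sketch below illustrates that idea only; it is not BioSeqZip's code, it ignores quality scores, and the name `collapse_reads` and the sample reads are hypothetical.

```python
from itertools import groupby


def collapse_reads(sorted_reads):
    """Collapse an already-sorted stream of reads into (sequence, count) pairs."""
    # groupby only merges adjacent equal items, which is exactly what a
    # sorted stream guarantees for duplicates.
    for seq, group in groupby(sorted_reads):
        yield seq, sum(1 for _ in group)


# Toy input; in a real pipeline the sorted stream would come from an
# external sort because the reads do not fit in RAM.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
collapsed = list(collapse_reads(sorted(reads)))
```

Because each group is consumed as it is produced, memory use in this pass does not grow with the number of reads.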


Author(s):  
Asaduzzaman Nur Shuvo ◽  
Apurba Adhikary ◽  
Md. Bipul Hossain ◽  
Sultana Jahan Soheli

Data sets in large applications are often too large to fit completely inside the computer's internal memory. The resulting input/output (I/O) communication between fast internal memory and slower external memory (such as disks) can be a major performance bottleneck. Sorting such a data set requires external sorting. This paper is concerned with a new in-place external sorting algorithm. Our proposed algorithm combines the concepts of Quicksort and divide-and-conquer, resulting in a faster sorting algorithm that avoids any additional disk space. In addition, we show that the average time complexity can be reduced compared to existing external sorting approaches.
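To make the idea of an in-place, Quicksort-style external sort concrete, here is a minimal sketch that partitions fixed-width records directly inside a file and falls back to an in-memory sort once a range fits a small RAM budget. It is not the algorithm proposed in the paper; the record layout (little-endian unsigned 32-bit integers), the function names, and the `mem_records` parameter are all assumptions for illustration.

```python
import io
import struct

RECORD = struct.Struct("<I")  # assumed layout: one little-endian uint32 per record


def _read(f, i):
    f.seek(i * RECORD.size)
    return RECORD.unpack(f.read(RECORD.size))[0]


def _write(f, i, value):
    f.seek(i * RECORD.size)
    f.write(RECORD.pack(value))


def inplace_file_quicksort(f, count, mem_records=4):
    """Sort `count` records in binary file object `f` without extra disk space.

    mem_records is a hypothetical RAM budget: ranges this small are loaded,
    sorted in memory, and written back over the same file region.
    """
    stack = [(0, count - 1)]  # explicit stack instead of recursion
    while stack:
        lo, hi = stack.pop()
        n = hi - lo + 1
        if n <= 1:
            continue
        if n <= mem_records:
            f.seek(lo * RECORD.size)
            chunk = [r[0] for r in RECORD.iter_unpack(f.read(n * RECORD.size))]
            chunk.sort()
            f.seek(lo * RECORD.size)
            f.write(b"".join(RECORD.pack(v) for v in chunk))
            continue
        # Lomuto partition around the last record, swapping records on disk.
        pivot = _read(f, hi)
        store = lo
        for i in range(lo, hi):
            v = _read(f, i)
            if v < pivot:
                _write(f, i, _read(f, store))
                _write(f, store, v)
                store += 1
        _write(f, hi, _read(f, store))
        _write(f, store, pivot)
        stack.append((lo, store - 1))
        stack.append((store + 1, hi))


# Usage on an in-memory stand-in for a disk file.
data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0, 4]
f = io.BytesIO(b"".join(RECORD.pack(v) for v in data))
inplace_file_quicksort(f, len(data))
f.seek(0)
sorted_records = [r[0] for r in RECORD.iter_unpack(f.read())]
```

The sketch reads and writes one record per call for clarity; a practical version would work on whole blocks to keep the number of I/O operations down.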


2018 ◽  
Author(s):  
Hongzhe Guo ◽  
Yilei Fu ◽  
Yan Gao ◽  
Junyi Li ◽  
Yadong Wang ◽  
...  

Abstract. Motivation: The de Bruijn graph, a fundamental data structure for representing and organizing genome sequences, plays an important role in many sequence analysis tasks, such as de novo assembly, high-throughput sequencing (HTS) read alignment, pan-genome analysis, metagenomics analysis, and HTS read correction. With the rapid growth of HTS data and the ever-increasing number of assembled genomes, there is high demand for constructing de Bruijn graphs for sequences up to the tera-base-pair level. This is non-trivial because the graph to be constructed can be very large, consisting of hundreds of billions of vertices and edges. Existing approaches may have unaffordable memory footprints when handling such a large de Bruijn graph. Moreover, the construction approach must handle very large datasets efficiently, even in a relatively small RAM space. Results: We propose a lightweight parallel de Bruijn graph construction approach, de Bruijn Graph Constructor in Scalable Memory (deGSM). The main idea of deGSM is to efficiently construct the Burrows-Wheeler Transformation (BWT) of the unipaths of the de Bruijn graph in constant RAM space and transform the BWT into the original unitigs. It is mainly implemented by a fast parallel external sorting of k-mers, in which a novel organization of the k-mers allows only a part of them to be kept in RAM. The experimental results demonstrate that, with just a commonly used machine, deGSM is able to handle very large genome sequence(s), e.g., the contigs (305 Gbp) and scaffolds (1.1 Tbp) recorded in the GenBank database and the Picea abies HTS dataset (9.7 Tbp). Moreover, deGSM has faster or comparable construction speed compared with state-of-the-art approaches. With its high scalability and efficiency, deGSM has enormous potential in many large-scale genomics studies. Availability: https://github.com/hitbc/[email protected] (YW) and [email protected] (BL). Supplementary information: Supplementary data are available online.
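One common way to sort k-mers while keeping only part of them in RAM is to route them into partitions by prefix and then sort each partition on its own. The sketch below shows that general technique under simplifying assumptions; it is not deGSM's k-mer organization, and the function name `sorted_kmers_by_partition` and the `prefix_len` parameter are hypothetical.

```python
from collections import defaultdict


def sorted_kmers_by_partition(sequence, k, prefix_len=1):
    """Yield the distinct k-mers of `sequence` in sorted order, partition by partition.

    Assumes prefix_len <= k.
    """
    # Pass 1: route each k-mer to a bucket keyed by its prefix. In a real
    # external sort each bucket would be a file on disk; lists stand in here.
    buckets = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        buckets[kmer[:prefix_len]].append(kmer)

    # Pass 2: visit buckets in prefix order and sort each independently.
    # Only one bucket needs to be resident at a time, and concatenating
    # the sorted buckets gives a globally sorted stream.
    for prefix in sorted(buckets):
        yield from sorted(set(buckets[prefix]))


kmers = list(sorted_kmers_by_partition("ACGTACGA", k=3))
```

Since the buckets are independent of one another, the second pass could also be run in parallel, one bucket per worker.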


2017 ◽  
Vol 66 (10) ◽  
pp. 1689-1702 ◽  
Author(s):  
Arezki Laga ◽  
Jalil Boukhobza ◽  
Frank Singhoff ◽  
Michel Koskas

2016 ◽  
Vol 65 ◽  
pp. 76-89 ◽  
Author(s):  
Young-Sik Lee ◽  
Luis Cavazos Quero ◽  
Sang-Hoon Kim ◽  
Jin-Soo Kim ◽  
Seungryoul Maeng
