ArrowSAM: In-Memory Genomics Data Processing Using Apache Arrow

2019 ◽  
Author(s):  
Tanveer Ahmad ◽  
Nauman Ahmed ◽  
Johan Peltenburg ◽  
Zaid Al-Ars

Abstract: The rapidly growing size of genomics databases, driven by advances in sequencing technologies, demands fast and cost-effective processing. However, processing this data creates many challenges, particularly in selecting appropriate algorithms and computing platforms. Computing systems need data closer to the processor for fast processing. Traditionally, due to cost, volatility and other physical constraints of DRAM, it was not feasible to place large working data sets in memory. However, emerging storage-class memories allow storing and processing big data closer to the processor. In this work, we show how the commonly used genomics data format, Sequence Alignment/Map (SAM), can be represented in the Apache Arrow in-memory data format to benefit from in-memory processing and to ensure better scalability through shared-memory objects, avoiding large (de)serialization overheads in cross-language interoperability. To demonstrate the benefits of such a system, we propose ArrowSAM, an in-memory SAM format that uses the Apache Arrow framework, and integrate it into genome pre-processing pipelines including BWA-MEM, Picard and Sambamba. Results show 15x and 2.4x speedups compared to Picard and Sambamba, respectively. The code and scripts for running all workflows are freely available at https://github.com/abs-tudelft/ArrowSAM.
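As a rough illustration of the idea, the sketch below builds an Arrow table over the mandatory SAM columns in Python with pyarrow; the field names follow the SAM specification, but the exact schema and shared-memory plumbing used by ArrowSAM may differ.

```python
# Minimal sketch: representing SAM alignment records as an Apache Arrow table
# in Python with pyarrow. The column set mirrors the mandatory SAM fields;
# the actual ArrowSAM schema may differ.
import pyarrow as pa

sam_schema = pa.schema([
    ("QNAME", pa.string()),   # query (read) name
    ("FLAG",  pa.int32()),    # bitwise alignment flags
    ("RNAME", pa.string()),   # reference sequence name
    ("POS",   pa.int32()),    # 1-based leftmost mapping position
    ("MAPQ",  pa.int32()),    # mapping quality
    ("CIGAR", pa.string()),   # CIGAR string
    ("RNEXT", pa.string()),   # reference name of the mate read
    ("PNEXT", pa.int32()),    # position of the mate read
    ("TLEN",  pa.int32()),    # observed template length
    ("SEQ",   pa.string()),   # read sequence
    ("QUAL",  pa.string()),   # base quality string
])

# One toy record; in practice the columns would be filled by the aligner.
table = pa.table(
    {
        "QNAME": ["read1"], "FLAG": [99], "RNAME": ["chr1"], "POS": [10468],
        "MAPQ": [60], "CIGAR": ["100M"], "RNEXT": ["="], "PNEXT": [10600],
        "TLEN": [232], "SEQ": ["ACGT" * 25], "QUAL": ["I" * 100],
    },
    schema=sam_schema,
)

# Arrow tables can then be placed in shared memory (e.g. via Arrow IPC) so
# downstream tools can read the columns without (de)serialization.
print(table.schema)
```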

2019 ◽  
Author(s):  
Charles Blatti ◽  
Amin Emad ◽  
Matthew J. Berry ◽  
Lisa Gatzke ◽  
Milt Epstein ◽  
...  

Abstract: We present KnowEnG, a free-to-use computational system for analysis of genomics data sets, designed to accelerate biomedical discovery. It includes tools for popular bioinformatics tasks such as gene prioritization, sample clustering, gene set analysis and expression signature analysis. The system offers ‘knowledge-guided’ data-mining and machine learning algorithms, where user-provided data are analyzed in light of prior information about genes, aggregated from numerous knowledge bases and encoded in a massive ‘Knowledge Network’. KnowEnG adheres to ‘FAIR’ principles: its tools are easily portable to diverse computing environments, run on the cloud for scalable and cost-effective execution of compute-intensive and data-intensive algorithms, and are interoperable with other computing platforms. They are made available through multiple access modes including a web portal, and include specialized visualization modules. We present use cases and re-analysis of published cancer data sets using KnowEnG tools and demonstrate its potential value in democratization of advanced tools for the modern genomics era.
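To illustrate the general flavor of ‘knowledge-guided’ analysis (and not KnowEnG's actual algorithms), the sketch below prioritizes genes by a random walk with restart over a tiny, hypothetical knowledge network, using user-supplied seed genes as the prior.

```python
# Illustrative sketch (not KnowEnG's code): 'knowledge-guided' gene
# prioritization via random walk with restart on a toy gene network.
# The network stands in for the prior knowledge; the restart vector
# encodes the user's genes of interest.
import numpy as np

genes = ["TP53", "BRCA1", "EGFR", "MYC", "GAPDH"]
# Symmetric adjacency matrix of a hypothetical knowledge network.
A = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=float)

# Column-normalize to get a transition matrix (isolated nodes get zero columns).
col_sums = A.sum(axis=0)
W = np.divide(A, col_sums, out=np.zeros_like(A), where=col_sums > 0)

# Restart vector: user-provided seed genes (e.g. hits from an experiment).
seeds = {"TP53"}
p0 = np.array([1.0 if g in seeds else 0.0 for g in genes])
p0 /= p0.sum()

restart = 0.5
p = p0.copy()
for _ in range(100):  # iterate to (approximate) convergence
    p = (1 - restart) * W @ p + restart * p0

for gene, score in sorted(zip(genes, p), key=lambda x: -x[1]):
    print(f"{gene}\t{score:.3f}")
```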


F1000Research ◽  
2015 ◽  
Vol 4 ◽  
pp. 20 ◽  
Author(s):  
Priti Kumari ◽  
Raja Mazumder ◽  
Vahan Simonyan ◽  
Konstantinos Krampis

Background: The transition to Next-Generation Sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade. However, NGS sequencing trades high read throughput for shorter read length, increasing the difficulty of genome assembly. This research presents a comparison of traditional versus cloud-computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism. Results: The first phase of the analysis involved a subset of the zebrafish data set (2x coverage); the best results were obtained using a k-mer size of 65, and Velvet was observed to take less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full data set of 192x read coverage; Velvet failed to complete on a 256 GB memory compute server, while Contrail completed but required 240 hours of computation. Conclusion: This research concludes that, when deciding which assembler software to use, the size of the data set and the available computing hardware should be taken into consideration. For a relatively small sequencing data set, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. However, for larger data sets Velvet requires large-memory compute servers on the order of 1000 GB or more. On the other hand, Contrail is implemented using Hadoop, which performs the assembly in parallel across the nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from cloud computing providers, so Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
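As a toy illustration of why the k-mer size matters for de Bruijn graph assemblers such as Velvet and Contrail (this is not either assembler's code), the sketch below counts k-mers across a handful of reads for two values of k.

```python
# Toy sketch of the k-mer counting step that underlies de Bruijn graph
# assemblers; it only illustrates how the choice of k changes the k-mer
# spectrum, it is not Velvet's or Contrail's implementation.
from collections import Counter

def kmer_counts(reads, k):
    """Count all k-mers of length k across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTACGTGGA", "CGTACGTGGAT", "GTACGTGGATC"]
for k in (4, 7):
    counts = kmer_counts(reads, k)
    singletons = sum(1 for c in counts.values() if c == 1)
    print(f"k={k}: {len(counts)} distinct k-mers, {singletons} seen only once")
```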


2019 ◽  
Vol 14 (2) ◽  
pp. 157-163
Author(s):  
Majid Hajibaba ◽  
Mohsen Sharifi ◽  
Saeid Gorgin

Background: One of the pivotal challenges in today's genomics research domain is the fast processing of voluminous data, such as the data generated by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in bioinformatics, has proven to be very slow in this regard. Objective: To improve the performance of BLAST in the processing of voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing. Method: We use a master-worker model together with a memory-aware technique in which the master partitions the whole database into equal chunks, one chunk per worker, and each worker then further splits and formats its allocated chunk according to the size of its memory. Each worker searches its splits one by one against a list of queries. Results: We chose a list of queries of different lengths to run intensive searches against a huge database, UniProtKB/TrEMBL. Our experiments show a 20 percent performance improvement when workers used our proposed memory-aware technique compared to when they were not memory-aware. Experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory awareness in formatting a bulky database when running BLAST can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to reduce search time by partitioning and distributing database portions, our memory-aware technique also alleviates the negative effects of page faults on performance.
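A minimal sketch of the partitioning scheme described in the Method (not the authors' implementation): the master hands each worker an equal chunk, and each worker splits its chunk into memory-sized pieces that it searches one by one; the record size, memory budget, and search step below are made-up placeholders.

```python
# Sketch of the memory-aware master-worker partitioning described above
# (not the authors' code): the master divides the database into one chunk
# per worker, each worker splits its chunk to fit its own memory, then
# searches the splits one by one against the query list.
def master_partition(records, n_workers):
    """Split the database records into n_workers chunks of (near) equal size."""
    chunk_size = -(-len(records) // n_workers)  # ceiling division
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def worker_split(chunk, record_bytes, memory_budget_bytes):
    """Further split a worker's chunk so each split fits in its memory budget."""
    per_split = max(1, memory_budget_bytes // record_bytes)
    return [chunk[i:i + per_split] for i in range(0, len(chunk), per_split)]

def worker_search(chunk, queries, record_bytes, memory_budget_bytes):
    hits = []
    for split in worker_split(chunk, record_bytes, memory_budget_bytes):
        # Placeholder for the actual BLAST search against this formatted split.
        hits.extend(q for q in queries for record in split if q in record)
    return hits

database = [f"SEQUENCE_{i}_ACGT" for i in range(10)]   # toy database records
chunks = master_partition(database, n_workers=2)
print(worker_search(chunks[0], ["ACGT"], record_bytes=64, memory_budget_bytes=192))
```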


2021 ◽  
pp. 1-13
Author(s):  
Yikai Zhang ◽  
Yong Peng ◽  
Hongyu Bian ◽  
Yuan Ge ◽  
Feiwei Qin ◽  
...  

Concept factorization (CF) is an effective matrix factorization model that has been widely used in many applications. In CF, a linear combination of the data points serves as the dictionary, so CF can be performed both in the original feature space and in a reproducing kernel Hilbert space (RKHS). Conventional CF treats each dimension of the feature vector equally during data reconstruction, which contradicts the common understanding that different features have different discriminative abilities and therefore contribute differently to pattern recognition. In this paper, we introduce an auto-weighting variable into the conventional CF objective function to adaptively learn the contributions of different features, and propose a new model termed Auto-Weighted Concept Factorization (AWCF). In AWCF, on the one hand, feature importance can be quantitatively measured by the auto-weighting variable, with features that have better discriminative abilities assigned larger weights; on the other hand, we obtain a more effective data representation that better depicts the semantic information. A detailed optimization procedure for the AWCF objective function is derived, and its complexity and convergence are analyzed. Experiments are conducted on both synthetic and representative benchmark data sets, and the clustering results demonstrate the effectiveness of AWCF in comparison with related models.
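For orientation, the sketch below implements conventional CF, X ≈ XWVᵀ, with the standard multiplicative updates; it deliberately omits the auto-weighting variable that distinguishes AWCF, and the dimensions and data are arbitrary.

```python
# Minimal sketch of conventional concept factorization (CF), X ~ X W V^T,
# using the standard multiplicative updates. It does NOT include the
# auto-weighting variable that the AWCF model introduces.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 50))          # 20 features x 50 samples (toy data)
n, k = X.shape[1], 3              # number of samples, number of concepts

W = rng.random((n, k))            # coefficients: concepts are X @ W
V = rng.random((n, k))            # representation of samples over concepts
K = X.T @ X                       # only inner products of data points are needed
eps = 1e-9                        # guard against division by zero

for _ in range(200):
    W *= (K @ V) / (K @ W @ (V.T @ V) + eps)
    V *= (K @ W) / (V @ (W.T @ K @ W) + eps)

reconstruction_error = np.linalg.norm(X - X @ W @ V.T)
print(f"reconstruction error: {reconstruction_error:.4f}")
```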


Materials ◽  
2021 ◽  
Vol 14 (2) ◽  
pp. 471
Author(s):  
Constantino Grau Turuelo ◽  
Sebastian Pinnau ◽  
Cornelia Breitkopf

Modeling of thermodynamic properties, such as the heat capacities of stoichiometric solids, involves the treatment of different sources of data which may be inconsistent and diverse. In this work, an approach based on the covariance matrix adaptation evolution strategy (CMA-ES) is proposed and described as an alternative method for data treatment and fitting, with support for data-source-dependent weight factors and physical constraints. This is applied to a Gibbs free energy stoichiometric model for different magnesium sulfate hydrates by means of the NASA9 polynomial. Its behavior is demonstrated by: (i) comparing the model to other standard methods for different heat capacity data, yielding a more plausible curve at high temperature ranges; (ii) comparing the fitted heat capacity values of MgSO4·7H2O against DSC measurements, resulting in a mean relative error of 0.7% and a normalized root mean square deviation of 1.1%; and (iii) comparing the van't Hoff and proposed stoichiometric model vapor-solid equilibrium curves to different literature data for MgSO4·7H2O, MgSO4·6H2O, and MgSO4·1H2O, resulting in similar equilibrium values, especially for MgSO4·7H2O and MgSO4·6H2O. The results show good agreement with the employed data and confirm this method as a viable alternative for fitting complex physically constrained data sets, as well as a potential approach for automatic data fitting of substance data.
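A toy sketch of the fitting idea (not the authors' constrained Gibbs-energy model): the NASA9 heat-capacity form is fitted to synthetic data with source-dependent weights by minimizing a weighted squared error with CMA-ES, assuming the third-party cma package is installed; the data, weights, and settings are placeholders.

```python
# Toy sketch (not the authors' code): fitting the NASA9 heat-capacity form
#   Cp(T)/R = a1*T^-2 + a2*T^-1 + a3 + a4*T + a5*T^2 + a6*T^3 + a7*T^4
# to heat-capacity data with source-dependent weights using CMA-ES.
# Assumes the third-party 'cma' package (pip install cma).
import numpy as np
import cma

R = 8.314462618  # gas constant, J/(mol*K)

def cp_nasa9(a, T):
    return R * (a[0] / T**2 + a[1] / T + a[2] + a[3] * T
                + a[4] * T**2 + a[5] * T**3 + a[6] * T**4)

# Synthetic "measurements"; the weights mimic trusting one data source more.
T_data = np.linspace(250.0, 400.0, 30)
cp_data = 300.0 + 0.5 * T_data + np.random.default_rng(1).normal(0, 2, T_data.size)
weights = np.where(T_data < 320.0, 1.0, 0.5)

def objective(a):
    residual = cp_nasa9(np.asarray(a), T_data) - cp_data
    return float(np.sum(weights * residual**2))

es = cma.CMAEvolutionStrategy(7 * [0.0], 1.0)  # 7 coefficients, initial step 1.0
es.optimize(objective)
print("fitted NASA9 coefficients:", es.result.xbest)
```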


2010 ◽  
Vol 28 (16) ◽  
pp. 2777-2783 ◽  
Author(s):  
Ana Maria Gonzalez-Angulo ◽  
Bryan T.J. Hennessy ◽  
Gordon B. Mills

The development of cost-effective technologies able to comprehensively assess DNA, RNA, protein, and metabolites in patient tumors has fueled efforts to tailor medical care. Indeed, validated molecular tests assessing tumor tissue or patient germline DNA already drive therapeutic decision making. However, many theoretical and regulatory challenges must still be overcome before the promise of personalized molecular medicine is fully realized. The masses of data generated by high-throughput technologies are challenging to manage, visualize, and convert into the knowledge required to improve patient outcomes. Systems biology integrates engineering, physics, and mathematical approaches with biologic and medical insights in an iterative process to visualize the interconnected events within a cell and to determine how inputs from the environment, together with the network rewiring caused by the genomic aberrations acquired by patient tumors, determine cellular behavior and patient outcomes. A cross-disciplinary systems biology effort will be necessary to convert the information contained in multidimensional data sets into useful biomarkers that can classify patient tumors by prognosis and response to therapeutic modalities, and to identify the drivers of tumor behavior that are optimal targets for therapy. An understanding of the effects of targeted therapeutics on signaling networks and homeostatic regulatory loops will be necessary to prevent inadvertent effects as well as to develop rational combinatorial therapies. Systems biology approaches to identifying molecular drivers and biomarkers will lead to smaller, shorter, cheaper, and individualized clinical trials that will increase the success rate and hasten the adoption of effective therapies into the clinical armamentarium.


Author(s):  
L Mohana Tirumala ◽  
S. Srinivasa Rao

Privacy preservation in data mining and publishing plays a major role in today's networked world. It is important to preserve the privacy of the vital information contained in a data set. This can be achieved through a k-anonymization solution for classification. Along with preserving privacy through anonymization, producing optimized data sets in a cost-effective manner is of equal importance. In this paper, a Top-Down Refinement algorithm is proposed that yields optimal results in a cost-effective manner. Bayesian classification is also employed to predict class membership probabilities for data tuples whose class labels are unknown.
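As a minimal illustration of the k-anonymity property targeted by such anonymization (not the Top-Down Refinement algorithm itself), the sketch below checks whether every quasi-identifier combination in a toy, generalized table occurs at least k times.

```python
# Minimal sketch: checking whether a released table satisfies k-anonymity,
# i.e. every combination of quasi-identifier values occurs in at least k
# records. This is only the property check, not a refinement algorithm.
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Toy, already-generalized data: ages bucketed into ranges, ZIP codes masked.
records = [
    {"age": "30-39", "zip": "537**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "537**", "diagnosis": "cold"},
    {"age": "40-49", "zip": "538**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "538**", "diagnosis": "asthma"},
]

print(is_k_anonymous(records, quasi_identifiers=("age", "zip"), k=2))  # True
```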


2021 ◽  
Vol 111 (1) ◽  
pp. 8-11
Author(s):  
Remco Stam ◽  
Pierre Gladieux ◽  
Boris A. Vinatzer ◽  
Erica M. Goss ◽  
Neha Potnis ◽  
...  

Population genetics has been a key discipline in phytopathology for many years. The recent rise of cost-effective, high-throughput DNA sequencing technologies allows sequencing of dozens, if not hundreds, of specimens, turning population genetics into population genomics and opening up new, exciting opportunities as described in this Focus Issue. Freed from the limitations of genetic markers, and with whole or near-whole-genome data available, population genomics can give new insights into the biology, evolution and adaptation, and dissemination patterns of plant-associated microbes.


2018 ◽  
Author(s):  
Adrian Fritz ◽  
Peter Hofmann ◽  
Stephan Majda ◽  
Eik Dahms ◽  
Johannes Dröge ◽  
...  

Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, effectively assessing the performance of metagenomic analysis software requires a wide range of benchmark data sets. Here, we describe CAMISIM, a microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies; includes real and simulated strain-level diversity; and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence with the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, using several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets, together with truth standards for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.
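To convey the basic idea of metagenome simulation (this is a toy sketch, not CAMISIM), the example below draws a community abundance profile, samples reads from member genomes in proportion to abundance, and injects simple substitution errors; the genomes and parameters are placeholders.

```python
# Toy sketch of metagenome read simulation (not CAMISIM): draw an abundance
# profile, sample reads from member genomes in proportion to abundance, and
# add random substitution errors.
import random

random.seed(42)
genomes = {
    "taxonA": "ACGT" * 500,      # stand-ins for real genome sequences
    "taxonB": "GGCCTTAA" * 250,
}

# Lognormal-like abundance profile, normalized to sum to 1.
raw = {name: random.lognormvariate(0, 1) for name in genomes}
total = sum(raw.values())
abundance = {name: value / total for name, value in raw.items()}

def simulate_read(genome, read_len=100, error_rate=0.01):
    start = random.randrange(len(genome) - read_len)
    read = list(genome[start:start + read_len])
    for i in range(read_len):                 # simple substitution errors
        if random.random() < error_rate:
            read[i] = random.choice("ACGT")
    return "".join(read)

reads = []
for _ in range(1000):
    taxon = random.choices(list(genomes), weights=[abundance[n] for n in genomes])[0]
    reads.append(simulate_read(genomes[taxon]))

print(f"simulated {len(reads)} reads from {len(genomes)} genomes")
```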

