Ten simple rules for managing high-throughput nucleotide sequencing data

2016 ◽  
Author(s):  
Rutger A. Vos

Abstract
The challenges posed by the large data volumes produced by high-throughput nucleotide sequencing (HTS) technologies are well known. This document establishes ten simple rules for coping with these challenges. At the level of master data management, (1) data triage reduces data volumes; (2) some lossless data representations are much more compact than others; (3) careful management of data replication reduces wasted storage space. At the level of data analysis, (4) automated analysis pipelines obviate the need for storing work files; (5) virtualization reduces the need for data movement and bandwidth consumption; (6) tracking of data and analysis provenance generates a paper trail for better understanding how results were produced. At the level of data access and sharing, (7) careful modeling of data movement patterns reduces bandwidth consumption and haphazard copying; (8) persistent, resolvable identifiers for data reduce the ambiguity caused by data movement; (9) sufficient metadata enables more effective collaboration. Finally, because of rapid developments in HTS technologies, (10) agile practices that combine loosely coupled modules operating on standards-compliant data are the best approach for avoiding lock-in. A generalized scenario is presented for data management from initial raw data generation to publication of result data.
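Rule 2 (some lossless representations are much more compact than others) can be made concrete with a small sketch. The encoding below is an illustrative example, not part of the paper: it packs an A/C/G/T sequence into 2 bits per base, a lossless representation roughly four times smaller than one-byte-per-base text.

```python
# Illustrative 2-bit nucleotide packing (a hypothetical example, not a tool
# from the paper): each base maps to 2 bits, so 4 bases fit in one byte.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    """Encode a nucleotide string into a compact 2-bit-per-base byte string."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    # Prepend a sentinel 1-bit so leading 'A's (zero bits) survive round-trip.
    bits |= 1 << (2 * len(seq))
    return bits.to_bytes((bits.bit_length() + 7) // 8, "big")

def unpack(data: bytes) -> str:
    """Recover the original string exactly (the encoding is lossless)."""
    bits = int.from_bytes(data, "big")
    n = (bits.bit_length() - 1) // 2          # sentinel bit marks the length
    return "".join(BASE[(bits >> (2 * (n - 1 - i))) & 0b11] for i in range(n))
```

Real formats (e.g. 2-bit reference encodings) add headers and handle ambiguity codes, but the four-fold size reduction over plain text comes from the same idea.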

Author(s):  
Subhra Prosun Paul ◽  
Dr. Shruti Aggarwal

In today’s world, sensor networks offer many opportunities for data management applications because of their low cost, reliability, scalability, and high-speed data processing. Organizing data effectively, and retrieving the appropriate data from the large volume of heterogeneous data sets in ad-hoc network databases, mobile databases, and the like, is a great challenge. The sensor network is responsible for routing data, analyzing the performance of data management activities, and integrating data for the right application. Data management involves intranet and extranet query handling, data access mechanisms, data modeling, data movement algorithms, data warehousing, and data mining of network databases. Additionally, connectivity, design, and lifetime are important issues if sensor networks are to perform all data management activities smoothly. In this paper, we survey research trends in sensor network data management over the last two decades, considering the challenges and issues of both sensor network databases and data management functions, using the Scopus and Web of Science databases. The analysis considers parameters such as author, time, number of publications and citations, place, source, and document type, separately for Web of Science and Scopus, from a global perspective. We observe significant growth of research in data management for sensor networks, reflecting the popularity of this topic.


2021 ◽  
Vol 17 (1) ◽  
pp. 1-32
Author(s):  
Anastasios Papagiannis ◽  
Giorgos Saloustros ◽  
Giorgos Xanthakis ◽  
Giorgos Kalaentzis ◽  
Pilar Gonzalez-Ferez ◽  
...  

Persistent key-value stores have emerged as a main component in the data access path of modern data processing systems. However, they exhibit high CPU and I/O overhead. Nowadays, due to power limitations, it is important to reduce CPU overheads for data processing. In this article, we propose Kreon, a key-value store that targets servers with flash-based storage, where CPU overhead and I/O amplification are more significant bottlenecks than I/O randomness. We first observe that two significant sources of overhead in key-value stores are: (a) the use of compaction in Log-Structured Merge-Trees (LSM-Trees), which constantly merges and sorts large data segments, and (b) the use of an I/O cache to access devices, which incurs overhead even for data that reside in memory. To avoid these overheads, Kreon performs data movement from level to level by partial reorganization, rather than full data reorganization, through the use of a full index per level. Kreon uses memory-mapped I/O via a custom kernel path to avoid a user-space cache. For a large dataset, Kreon reduces CPU cycles/op by up to 5.8×, reduces I/O amplification for inserts by up to 4.61×, and increases insert ops/s by up to 5.3×, compared to RocksDB.
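The contrast between full compaction and partial reorganization can be sketched in miniature. The toy model below is not Kreon's actual implementation (the class and method names are invented): values are written once to an append-only log, and spilling a level merges only the small sorted per-level indexes, leaving the values themselves in place.

```python
# Toy sketch of index-only level spills (invented names, not Kreon's code).
# A classic LSM compaction re-sorts and rewrites whole key-value segments;
# here the value log is written once and spills touch only (key, offset) pairs.
import bisect

class ToyStore:
    def __init__(self, l0_capacity=4):
        self.log = []                 # append-only value log (written once)
        self.levels = [[], []]        # per-level sorted (key, offset) indexes
        self.l0_capacity = l0_capacity

    def put(self, key, value):
        self.log.append(value)        # the value is written exactly once
        bisect.insort(self.levels[0], (key, len(self.log) - 1))
        if len(self.levels[0]) > self.l0_capacity:
            self._spill()

    def _spill(self):
        # Partial reorganization: merge index entries into the next level;
        # the value log is untouched, keeping write amplification low.
        merged = {}
        for k, off in self.levels[1] + self.levels[0]:
            merged[k] = max(merged.get(k, -1), off)   # newer offset wins
        self.levels[1] = sorted(merged.items())
        self.levels[0] = []

    def get(self, key):
        for index in self.levels:     # search the newest level first
            offsets = [off for k, off in index if k == key]
            if offsets:
                return self.log[max(offsets)]
        return None
```

The sketch shows why the approach saves CPU and I/O: a spill merges two already-sorted index lists instead of re-sorting and rewriting the much larger values.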


2017 ◽  
Vol 40 (2) ◽  
pp. 530-539 ◽  
Author(s):  
Leonardo Arduino Marano ◽  
Letícia Marcorin ◽  
Erick da Cruz Castelli ◽  
Celso Teixeira Mendes-Junior

2018 ◽  
Author(s):  
Levent Albayrak ◽  
Kamil Khanipov ◽  
George Golovko ◽  
Yuriy Fofanov

Abstract
Motivation: The data generation capabilities of High Throughput Sequencing (HTS) instruments have increased exponentially over the last few years, while the cost of sequencing has decreased dramatically, allowing this technology to become widely used in biomedical studies. For small labs and individual researchers, however, storage and transfer of large amounts of HTS data present a significant challenge. The recent trends toward increased sequencing quality and genome coverage can be used to reconsider HTS data storage strategies.
Results: We present Broom, a stand-alone application designed to select and store only high-quality sequencing reads at extremely high compression rates. Written in C++, the application accepts single- and paired-end reads in FASTQ and FASTA formats and decompresses data in FASTA format.
Availability: C++ code available at https://scsb.utmb.edu/labgroups/fofanov/
Contact: [email protected]
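The core idea, selecting reads by quality before compressing them, can be sketched as follows. This is an illustrative example, not Broom's algorithm; the function names and the quality threshold are arbitrary choices here.

```python
# Hypothetical sketch of quality-based read triage before compression
# (not Broom itself): keep only FASTQ reads whose mean Phred quality
# clears a threshold, then store the survivors gzipped in FASTA form.
import gzip

def mean_phred(qual: str) -> float:
    """Mean Phred score of a Sanger-encoded (ASCII offset 33) quality string."""
    return sum(ord(c) - 33 for c in qual) / len(qual)

def triage_fastq(lines, min_quality=30.0):
    """Yield (header, sequence) for FASTQ records passing the quality filter."""
    it = iter(lines)
    for header in it:
        seq, _plus, qual = next(it), next(it), next(it)
        if mean_phred(qual.strip()) >= min_quality:
            yield header.strip(), seq.strip()

def pack(records) -> bytes:
    """Gzip the retained reads as FASTA (quality strings are dropped)."""
    fasta = "".join(f">{h[1:]}\n{s}\n" for h, s in records)
    return gzip.compress(fasta.encode())
```

Dropping the quality strings is what makes FASTA output so much smaller than FASTQ; the trade-off, as in the abstract, is that decompression can only return FASTA.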


Author(s):  
Susana Posada-Céspedes ◽  
David Seifert ◽  
Ivan Topolsky ◽  
Karin J. Metzner ◽  
Niko Beerenwinkel

Abstract
High-throughput sequencing technologies are used increasingly, not only in viral genomics research but also in clinical surveillance and diagnostics. These technologies facilitate the assessment of the genetic diversity of intra-host virus populations, which affects the transmission, virulence, and pathogenesis of viral infections. However, there are two major challenges in analysing viral diversity. First, amplification and sequencing errors confound the identification of true biological variants, and second, the large data volumes pose computational challenges. To support viral high-throughput sequencing studies, we developed V-pipe, a bioinformatics pipeline combining various state-of-the-art statistical models and computational tools for automated end-to-end analyses of raw sequencing reads. V-pipe supports quality control, read mapping and alignment, low-frequency mutation calling, and inference of viral haplotypes. For generating high-quality read alignments, we developed a novel method, called ngshmmalign, based on profile hidden Markov models and tailored to small and highly diverse viral genomes. V-pipe also includes benchmarking functionality providing a standardized environment for comparative evaluations of different pipeline configurations. We demonstrate this capability by assessing the impact of three different read aligners (Bowtie 2, BWA MEM, ngshmmalign) and two different variant callers (LoFreq, ShoRAH) on the performance of calling single-nucleotide variants in intra-host virus populations. V-pipe supports various pipeline configurations and is implemented in a modular fashion to facilitate adaptations to the continuously changing technology landscape. V-pipe is freely available at https://github.com/cbg-ethz/V-pipe.
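The principle behind low-frequency mutation calling can be illustrated with a minimal sketch. This is not how LoFreq or ShoRAH work internally (both use far more careful error models); it simply piles up aligned bases per reference position and reports alternative alleles whose frequency exceeds an error-rate cutoff.

```python
# Minimal illustration of frequency-based SNV calling (invented names,
# not V-pipe's callers): count bases per position across aligned reads
# and flag non-reference alleles above a frequency threshold.
from collections import Counter

def call_snvs(reference, reads, min_freq=0.05):
    """reads: list of (start, sequence) pairs already aligned, no indels.
    Returns [(position, ref_base, alt_base, frequency), ...]."""
    pileup = [Counter() for _ in reference]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1
    variants = []
    for pos, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth == 0:
            continue                      # uncovered position
        for base, n in counts.items():
            freq = n / depth
            if base != reference[pos] and freq >= min_freq:
                variants.append((pos, reference[pos], base, freq))
    return variants
```

The hard part that real callers solve, and this sketch ignores, is distinguishing a genuine 1% variant from a 1% sequencing error rate, which is why the abstract benchmarks dedicated variant callers.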


2021 ◽  
Vol 15 (1) ◽  
Author(s):  
Zeeshan Ahmed ◽  
Eduard Gibert Renart ◽  
Saman Zeeshan ◽  
XinQi Dong

Abstract
Background: Genetic disposition is considered critical for identifying subjects at high risk of disease development. Investigating disease-causing genes and genes with high or low expression can support finding the root causes of uncertainties in patient care. However, independent and timely analysis of high-throughput next-generation sequencing data is still a challenge for non-computational biologists and geneticists.
Results: In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform database application for RNA-seq-driven annotation and expression analysis of variable and complex gene-disease data, with dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analyses of different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized with GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases.
Conclusions: We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. Experts in clinics and researchers in the life sciences can use GVViZ to visualize and interpret transcriptomics data, making it a powerful tool for studying the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.
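The annotation step described above, joining expression values with gene-disease associations for later visualization or CSV export, can be sketched in a few lines. The data layout below is invented for illustration and is not GVViZ's actual schema.

```python
# Hypothetical sketch of gene-disease annotation of RNA-seq abundances
# (field names and layout invented, not GVViZ's schema): join each gene's
# expression value with its known disease associations.
def annotate_expression(expression, gene_disease):
    """expression: {ensembl_id: (gene_name, abundance)};
    gene_disease: {gene_name: [disease, ...]}.
    Returns rows of (ensembl_id, gene_name, abundance, diseases)."""
    rows = []
    for ensg, (name, abundance) in sorted(expression.items()):
        diseases = gene_disease.get(name, [])       # empty if unannotated
        rows.append((ensg, name, abundance, ";".join(diseases)))
    return rows
```

Each row carries exactly the fields the abstract lists for CSV export (gene name, ENSG ID, quantified abundance, annotated diseases), which is what makes the output directly usable for a heat map.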


2021 ◽  
Vol 99 (2) ◽  
Author(s):  
Yuhua Fu ◽  
Pengyu Fan ◽  
Lu Wang ◽  
Ziqiang Shu ◽  
Shilin Zhu ◽  
...  

Abstract
Despite the broad variety of available microRNA (miRNA) research tools and methods, their application to the identification, annotation, and target prediction of miRNAs in nonmodel organisms is still limited. In this study, we collected nearly all public sRNA-seq data to improve the annotation of known miRNAs and to identify novel miRNAs that have not yet been annotated in pigs (Sus scrofa). We newly annotated 210 mature sequences in known miRNAs and found that 43 of the known miRNA precursors were problematic owing to redundant or missing annotations or incorrect sequences. We also predicted 811 novel miRNAs with high confidence, twice the current number of known miRNAs for pigs in miRBase. In addition, we proposed a correlation-based strategy to predict target genes for miRNAs using a large amount of sRNA-seq and RNA-seq data. We found that the correlation-based strategy provided additional evidence of expression compared with traditional target prediction methods. It also identified regulatory pairs controlled by non-binding sites with a particular pattern, complementing existing approaches to studying the mechanisms by which miRNAs regulate gene expression. In summary, our study improved the annotation of known miRNAs, identified a large number of novel miRNAs, and predicted target genes for all pig miRNAs using massive public data. This strategy based on large public data sets is also applicable to other nonmodel organisms with incomplete annotation information.
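The correlation-based strategy can be sketched as follows; the cutoff and data layout are illustrative assumptions, not the authors' exact procedure. The intuition is that a miRNA which represses a gene should show a strongly negative expression correlation with that gene across many samples.

```python
# Illustrative correlation-based target screening (cutoff and names are
# assumptions, not the paper's exact method): flag genes whose expression
# is strongly anti-correlated with a miRNA's expression across samples.
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def candidate_targets(mirna_expr, gene_exprs, cutoff=-0.7):
    """gene_exprs: {gene: [expression per sample]}; returns genes whose
    expression is strongly anti-correlated with the miRNA's expression."""
    return sorted(g for g, e in gene_exprs.items()
                  if pearson(mirna_expr, e) <= cutoff)
```

In practice such expression evidence is combined with sequence-based binding-site prediction, which is what lets the approach surface the non-binding-site regulatory pairs mentioned above.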

