pyBedGraph: a Python package for fast operations on 1-dimensional genomic signal tracks

Mapping Intimacies ◽

10.1101/709683 ◽

2019 ◽

Author(s):

Henry B. Zhang ◽

Minji Kim ◽

Jeffrey H. Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Summary Statistics ◽

Text Format ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing ◽

Genomic Signal

Abstract Motivation: Modern genomic research relies heavily on next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results: We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph file. When tested on 8 ChIP-seq and ATAC-seq datasets, pyBedGraph is on average 245 times faster than the existing program. Notably, pyBedGraph can look up the exact mean signal of 1 million regions in ~0.26 seconds on a conventional laptop. An approximate mean for 10,000 regions can be computed in ~0.0012 seconds with an error rate of less than 5 percent. Availability: pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license.

pyBedGraph: a python package for fast operations on 1D genomic signal tracks

Bioinformatics ◽

10.1093/bioinformatics/btaa061 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3234-3235

Author(s):

Henry B Zhang ◽

Minji Kim ◽

Jeffrey H Chuang ◽

Yijun Ruan

Keyword(s):

Chromatin Accessibility ◽

Genomic Research ◽

Supplementary Information ◽

Summary Statistics ◽

Rna Seq ◽

Binary Format ◽

Modern Genomic ◽

Binding Intensity ◽

Python Package ◽

Generation Sequencing

Abstract Motivation: Modern genomic research is driven by next-generation sequencing experiments such as ChIP-seq and ChIA-PET that generate coverage files for transcription factor binding, as well as DHS and ATAC-seq that yield coverage files for chromatin accessibility. Such files are in a bedGraph text format or a bigWig binary format. Obtaining summary statistics in a given region is a fundamental task in analyzing protein binding intensity or chromatin accessibility. However, the existing Python package for operating on coverage files is not optimized for speed. Results: We developed pyBedGraph, a Python package to quickly obtain summary statistics for a given interval in a bedGraph or a bigWig file. When tested on 12 ChIP-seq, ATAC-seq, RNA-seq and ChIA-PET datasets, pyBedGraph is on average 260 times faster than the existing program pyBigWig. On average, pyBedGraph can look up the exact mean signal of 1 million regions in ∼0.26 s and can compute their approximate means in <0.12 s on a conventional laptop. Availability and implementation: pyBedGraph is publicly available at https://github.com/TheJacksonLaboratory/pyBedGraph under the MIT license. Supplementary information: Supplementary data are available at Bioinformatics online.
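
To make the task in this abstract concrete, here is a minimal sketch of looking up the mean signal over an interval of a bedGraph track. It is not pyBedGraph's API or implementation: the function names (`parse_bedgraph`, `build_prefix_sums`, `exact_mean`, `build_bin_means`, `approx_mean`) are hypothetical, the prefix-sum lookup is a generic technique chosen for illustration, and the binned approximation is only one possible way to trade accuracy for speed.

```python
from itertools import accumulate


def parse_bedgraph(lines, chrom):
    """Collect (start, end, value) records for one chromosome from bedGraph text lines."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] != chrom:
            continue  # skip track headers, blank lines and other chromosomes
        records.append((int(fields[1]), int(fields[2]), float(fields[3])))
    return records


def build_prefix_sums(records, chrom_size):
    """Expand records to a per-base signal and return its prefix sums (length chrom_size + 1)."""
    signal = [0.0] * chrom_size
    for start, end, value in records:
        for pos in range(start, min(end, chrom_size)):
            signal[pos] = value
    return [0.0] + list(accumulate(signal))


def exact_mean(prefix, start, end):
    """Exact mean over the half-open interval [start, end) using two lookups."""
    return (prefix[end] - prefix[start]) / (end - start)


def build_bin_means(prefix, bin_size):
    """Precompute the mean signal of each fixed-size bin (the last bin may be shorter)."""
    n = len(prefix) - 1
    means = []
    for s in range(0, n, bin_size):
        e = min(s + bin_size, n)
        means.append((prefix[e] - prefix[s]) / (e - s))
    return means


def approx_mean(bin_means, bin_size, start, end):
    """Approximate mean: average the bins touched by [start, end), ignoring partial overlap."""
    first, last = start // bin_size, (end - 1) // bin_size
    touched = bin_means[first:last + 1]
    return sum(touched) / len(touched)


if __name__ == "__main__":
    text = ["track type=bedGraph", "chr1 0 4 2.0", "chr1 4 8 6.0", "chr2 0 5 1.0"]
    prefix = build_prefix_sums(parse_bedgraph(text, "chr1"), chrom_size=10)
    bins = build_bin_means(prefix, bin_size=4)
    print(exact_mean(prefix, 2, 6), approx_mean(bins, 4, 2, 6))
```

Once the prefix sums are built, each exact query costs two array lookups regardless of interval length. The binned version touches fewer values for long intervals but can deviate from the exact mean when an interval only partly covers its first or last bin.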

Next-Generation Sequencing: An Emerging Tool for Drug Designing

Current Pharmaceutical Design ◽

10.2174/1381612825666190911155508 ◽

2019 ◽

Vol 25 (31) ◽

pp. 3350-3357 ◽

Cited By ~ 1

Author(s):

Pooja Tripathi ◽

Jyotsna Singh ◽

Jonathan A. Lal ◽

Vijay Tripathi

Keyword(s):

Next Generation Sequencing ◽

High Throughput ◽

Large Scale ◽

Massively Parallel Sequencing ◽

Genomic Research ◽

Biological Research ◽

Next Generation ◽

Human Welfare ◽

Drug Designing ◽

Generation Sequencing

Background: With the advent of high-throughput next-generation sequencing (NGS), the biological research of drug discovery has been directed towards the oncology and infectious disease therapeutic areas, with extensive use in biopharmaceutical development and vaccine production. Method: In this review, an effort was made to address the basic background of NGS technologies and the potential applications of NGS in drug designing. Our purpose is also to provide a brief introduction to various next-generation sequencing techniques. Discussions: The high-throughput methods execute Large-scale Unbiased Sequencing (LUS), which comprises Massively Parallel Sequencing (MPS) or NGS technologies. These are related terms that describe a DNA sequencing technology which has revolutionized genomic research. Using NGS, an entire human genome can be sequenced within a single day. Conclusion: Analysis of NGS data unravels important clues in the quest for the treatment of various life-threatening diseases and other scientific problems related to human welfare.

The Influence of Memory-Aware Computation on Distributed BLAST

Current Bioinformatics ◽

10.2174/1574893613666180601080811 ◽

2019 ◽

Vol 14 (2) ◽

pp. 157-163

Author(s):

Majid Hajibaba ◽

Mohsen Sharifi ◽

Saeid Gorgin

Keyword(s):

Search Time ◽

Genomic Research ◽

Local Alignment ◽

Negative Effects ◽

Sequencing Technologies ◽

Percent Improvement ◽

Fast Processing ◽

Search Tool ◽

Memory Awareness ◽

Generation Sequencing

Background: One of the pivotal challenges in today's genomic research is the fast processing of voluminous data such as those produced by high-throughput Next-Generation Sequencing technologies. On the other hand, BLAST (Basic Local Alignment Search Tool), a long-established and renowned tool in Bioinformatics, has been shown to be incredibly slow in this regard. Objective: To improve the performance of BLAST in the processing of voluminous data, we have applied a novel memory-aware technique to BLAST for faster parallel processing of voluminous data. Method: We have used a master-worker model for the processing of voluminous data alongside a memory-aware technique in which the master partitions the whole data into equal chunks, one chunk for each worker, and each worker then further splits and formats its allocated data chunk according to the size of its memory. Each worker searches every split of its data one by one through a list of queries. Results: We have chosen a list of queries with different lengths to run insensitive searches in a huge database called UniProtKB/TrEMBL. Our experiments show a 20 percent improvement in performance when workers used our proposed memory-aware technique compared to when they were not memory aware. Comparatively, experiments show an even higher performance improvement, approximately 50 percent, when we applied our memory-aware technique to mpiBLAST. Conclusion: We have shown that memory awareness in formatting a bulky database, when running BLAST, can improve performance significantly, while preventing unexpected crashes in low-memory environments. Even though distributed computing attempts to mitigate search time by partitioning and distributing database portions, our memory-aware technique alleviates the negative effects of page faults on performance.
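
The two-level partitioning described in the Method can be sketched as follows. This is an illustrative reading of the abstract rather than the authors' code: the function names, the use of index ranges over a list of sequence sizes, and the greedy splitting rule are all assumptions, and no BLAST formatting or searching is performed.

```python
def partition_equally(n_items, n_workers):
    """Master step: one (start, end) index range per worker, sizes differing by at most 1."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def split_by_memory(chunk, item_sizes, memory_limit):
    """Worker step: cut a chunk into consecutive pieces whose total size fits memory_limit.

    An item larger than memory_limit ends up alone in its own piece.
    """
    start, end = chunk
    pieces, piece_start, used = [], start, 0
    for i in range(start, end):
        size = item_sizes[i]
        if used and used + size > memory_limit:
            pieces.append((piece_start, i))  # close the current piece before item i
            piece_start, used = i, 0
        used += size
    if piece_start < end:
        pieces.append((piece_start, end))
    return pieces


if __name__ == "__main__":
    sizes = [5, 5, 5, 5, 3, 3, 3, 9, 1, 1]    # hypothetical per-sequence sizes
    worker_memory = [10, 20, 10]               # hypothetical per-worker memory budgets
    chunks = partition_equally(len(sizes), len(worker_memory))
    for chunk, limit in zip(chunks, worker_memory):
        print(chunk, split_by_memory(chunk, sizes, limit))
```

Each worker would then format and search its pieces one at a time, so that the working set stays within its own memory and page faults are avoided.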

Pancreatic cancer prognosis is predicted by an ATAC-array technology for assessing chromatin accessibility

Nature Communications ◽

10.1038/s41467-021-23237-2 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

S. Dhara ◽

S. Chhangawala ◽

H. Chintalapudi ◽

G. Askan ◽

V. Aveson ◽

...

Keyword(s):

Low Cost ◽

Disease Free Survival ◽

Chromatin Accessibility ◽

Cancer Prognosis ◽

Ductal Adenocarcinoma ◽

Binding Motifs ◽

Free Survival ◽

Array Technology ◽

Treatment Naïve ◽

Generation Sequencing

Abstract Unlike other malignancies, therapeutic options in pancreatic ductal adenocarcinoma (PDAC) are largely limited to cytotoxic chemotherapy without the benefit of molecular markers predicting response. Here we report tumor-cell-intrinsic chromatin accessibility patterns of treatment-naïve surgically resected PDAC tumors that were subsequently treated with (Gem)/Abraxane adjuvant chemotherapy. By ATAC-seq analyses of EpCAM+ PDAC malignant epithelial cells sorted from 54 freshly resected human tumors, we show here the discovery of a signature of 1092 chromatin loci displaying differential accessibility between patients with disease free survival (DFS) < 1 year and patients with DFS > 1 year. Analyzing transcription factor (TF) binding motifs within these loci, we identify two TFs (ZKSCAN1 and HNF1b) displaying differential nuclear localization between patients with short vs. long DFS. We further develop a chromatin accessibility microarray methodology termed “ATAC-array”, an easy-to-use platform obviating the time and cost of next generation sequencing. Applying this methodology to the original ATAC-seq libraries as well as independent libraries generated from patient-derived organoids, we validate ATAC-array technology in both the original ATAC-seq cohort as well as in an independent validation cohort. We conclude that PDAC prognosis can be predicted by ATAC-array, which represents a low-cost, clinically feasible technology for assessing chromatin accessibility profiles.

Detecting somatic mutations in genomic sequences by means of Kolmogorov–Arnold analysis

Royal Society Open Science ◽

10.1098/rsos.150143 ◽

2015 ◽

Vol 2 (8) ◽

pp. 150143 ◽

Cited By ~ 3

Author(s):

V. G. Gurzadyan ◽

H. Yan ◽

G. Vlahovic ◽

A. Kashin ◽

P. Killela ◽

...

Keyword(s):

Clinical Diagnostics ◽

Genomic Research ◽

Genomic Sequences ◽

Sequencing Data ◽

Sequencing Technologies ◽

Cancer Genome Sequencing ◽

Frequent Mutations ◽

Using Data ◽

First Time ◽

Generation Sequencing

The Kolmogorov–Arnold stochasticity parameter technique is applied for the first time to the study of cancer genome sequencing, to reveal mutations. Using data generated by next-generation sequencing technologies, we have analysed the exome sequences of brain tumour patients with matched tumour and normal blood. We show that mutations contained in sequencing data can be revealed using this technique, thus providing a new methodology for determining subsequences of given length containing mutations, i.e. its value differs from those of subsequences without mutations. A potential application for this technique involves simplifying the procedure of finding segments with mutations, speeding up genomic research and accelerating its implementation in clinical diagnostics. Moreover, the prediction of a mutation associated with a family of frequent mutations in numerous types of cancers based purely on the value of the Kolmogorov function indicates that this applied marker may recognize genomic sequences that are in extremely low abundance and can be used in revealing new types of mutations.
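
For readers unfamiliar with the quantities named in this abstract, the sketch below computes Kolmogorov's stochasticity parameter, λ_n = √n · sup|F_n(x) − F(x)|, for a numeric sample against a theoretical distribution F, and evaluates the Kolmogorov function Φ(λ) = Σ_k (−1)^k exp(−2k²λ²). How the authors encode genomic subsequences as numbers is not stated in the abstract, so the uniform reference distribution and the example sample here are purely illustrative assumptions, as are the function names.

```python
import math


def stochasticity_parameter(sample, cdf):
    """Return sqrt(n) * sup |F_n(x) - F(x)| for the empirical distribution of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    sup = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        sup = max(sup, i / n - f, f - (i - 1) / n)
    return math.sqrt(n) * sup


def kolmogorov_function(lam, terms=100):
    """Truncated series for Kolmogorov's limiting distribution (inaccurate for very small lam)."""
    if lam <= 0:
        return 0.0
    return sum(
        (-1) ** k * math.exp(-2.0 * k * k * lam * lam)
        for k in range(-terms, terms + 1)
    )


if __name__ == "__main__":
    sample = [0.1, 0.4, 0.7]                          # hypothetical values in [0, 1]
    lam = stochasticity_parameter(sample, lambda x: x)  # uniform CDF on [0, 1]
    print(lam, kolmogorov_function(lam))
```

Values of Φ(λ) close to 0 or 1 indicate a sample that fits the reference distribution unusually well or unusually poorly, which is the kind of contrast the abstract describes between subsequences with and without mutations.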

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive

F1000Research ◽

10.12688/f1000research.18676.1 ◽

2019 ◽

Vol 8 ◽

pp. 532 ◽

Cited By ~ 2

Author(s):

Saket Choudhary

Keyword(s):

Next Generation Sequencing ◽

Research Community ◽

Command Line ◽

Next Generation ◽

Multiple Use ◽

Sequencing Data ◽

Sequence Read Archive ◽

Python Package ◽

Generation Sequencing ◽

Ncbi Sequence Read Archive

The NCBI Sequence Read Archive (SRA) is the primary archive of next-generation sequencing datasets. SRA makes metadata and raw sequencing data available to the research community to encourage reproducibility and to provide avenues for testing novel hypotheses on publicly available data. However, methods to programmatically access this data are limited. We introduce the Python package, pysradb, which provides a collection of command line methods to query and download metadata and data from SRA, utilizing the curated metadata database available through the SRAdb project. We demonstrate the utility of pysradb on multiple use cases for searching and downloading SRA datasets. It is available freely at https://github.com/saketkc/pysradb.

The molecular pathophysiology of vascular anomalies: Genomic research

Archives of Plastic Surgery ◽

10.5999/aps.2020.00591 ◽

2020 ◽

Vol 47 (3) ◽

pp. 203-208

Author(s):

Jong Seong Kim ◽

Su-Kyeong Hwang ◽

Ho Yun Chung

Keyword(s):

Genetic Testing ◽

Treatment Options ◽

Current Treatment ◽

Vascular Anomalies ◽

Clinical Severity ◽

Genomic Research ◽

Location Type ◽

Molecular Pathophysiology ◽

Generation Sequencing ◽

Disease Specific

Vascular anomalies are congenital localized abnormalities that result from improper development and maintenance of the vasculature. The lesions of vascular anomalies vary in location, type, and clinical severity of the phenotype, and the current treatment options are often unsatisfactory. Most vascular anomalies are sporadic, but patterns of inheritance have been noted in some cases, making genetic analysis relevant. Developments in the field of genomics, including next-generation sequencing, have provided novel insights into the genetic and molecular pathophysiological mechanisms underlying vascular anomalies. These insights may pave the way for new approaches to molecular diagnosis and potential disease-specific therapies. This article provides an introduction to genetic testing for vascular anomalies and presents a brief summary of the etiology and genetics of vascular anomalies.

Kronos: a workflow assembler for genome analytics and informatics

10.1101/040352 ◽

2016 ◽

Cited By ~ 3

Author(s):

M Jafar Taghiyar ◽

Jamie Rosner ◽

Diljot Grewal ◽

Bruno Grande ◽

Radhouane Aniba ◽

...

Keyword(s):

Feature Detection ◽

Automated Analysis ◽

Configuration File ◽

Detection Methods ◽

Reproducible Research ◽

Robust Implementation ◽

Free Open Source ◽

Standard Framework ◽

Python Package ◽

Generation Sequencing

The field of next generation sequencing informatics has matured to a point where algorithmic advances in sequence alignment and individual feature detection methods have stabilized. Practical and robust implementation of complex analytical workflows (where such tools are structured into "best practices" for automated analysis of NGS datasets) still requires significant programming investment and expertise. We present Kronos, a software platform for automating the development and execution of reproducible, auditable and distributable bioinformatics workflows. Kronos obviates the need for explicit coding of workflows by compiling a text configuration file into executable Python applications. The framework of each workflow includes a run manager to execute the encoded workflows locally (or on a cluster or cloud), parallelize tasks, and log all runtime events. Resulting workflows are highly modular and configurable by construction, facilitating flexible and extensible meta-applications which can be modified easily through configuration file editing. The workflows are fully encoded for ease of distribution and can be instantiated on external systems, promoting and facilitating reproducible research and comparative analyses. We introduce a framework for building Kronos components which function as shareable, modular nodes in Kronos workflows. The Kronos platform provides a standard framework for developers to implement custom tools, reuse existing tools, and contribute to the community at large. Kronos is shipped with both Docker and Amazon AWS machine images. It is free, open source and available through PyPI (Python Package Index) and https://github.com/jtaghiyar/kronos. Keywords: genomics; workflow; pipeline; reproducibility

AMAS: a fast tool for alignment manipulation and computing of summary statistics

10.7287/peerj.preprints.1355v1 ◽

2015 ◽

Author(s):

Marek L Borowiec

Keyword(s):

Amino Acid ◽

Source Code ◽

Data Sets ◽

Command Line ◽

Summary Statistics ◽

Computationally Efficient ◽

Python Package ◽

Alignment Length ◽

Amino Acid Alphabet ◽

Gc Contents

The amount of data used in phylogenetics has grown explosively in recent years, and many phylogenies are inferred with hundreds or even thousands of loci and many taxa. These modern phylogenomic studies often entail separate analyses of each of the loci in addition to multiple analyses of subsets of genes or concatenated sequences. Computationally efficient tools for handling and computing properties of thousands of single-locus or large concatenated alignments are needed. Here I present AMAS (Alignment Manipulation And Summary), a tool that can be used either as a stand-alone command-line utility or as a Python package. AMAS works on amino acid and nucleotide alignments and combines capabilities of sequence manipulation with a function that calculates basic statistics. The manipulation functions include conversions among popular formats, concatenation, extracting sites and splitting according to a pre-defined partitioning scheme, and creation of replicate data sets. The statistics calculated include the number of taxa, alignment length, total count of matrix cells, overall number of undetermined characters, percent of missing data, AT and GC contents (for DNA alignments), count and proportion of variable sites, count and proportion of parsimony informative sites, and counts of all characters relevant for a nucleotide or amino acid alphabet. AMAS is particularly suitable for very large alignments with hundreds of taxa and thousands of loci. It performs better at concatenation and summarizing alignments than other popular tools. AMAS is a Python 3 program that relies solely on Python’s core modules. AMAS source code and manual can be downloaded from http://github.com/marekborowiec/AMAS/
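
Several of the statistics listed in this abstract can be illustrated with a short, standard-library-only sketch for a DNA alignment. This does not use AMAS or reproduce its code: `summarize_alignment` is a hypothetical function, treating only "-" and "?" as undetermined characters is an assumption, and GC content is computed here as a share of the A, C, G and T characters.

```python
from collections import Counter

MISSING = {"-", "?"}  # assumption: only gaps and question marks count as undetermined


def summarize_alignment(alignment):
    """Compute basic summary statistics for a dict mapping taxon name -> aligned DNA sequence."""
    seqs = [s.upper() for s in alignment.values()]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences in an alignment must have the same length")

    cells = len(seqs) * length
    counts = Counter("".join(seqs))
    missing = sum(counts[c] for c in MISSING)
    acgt = sum(counts[c] for c in "ACGT")

    variable = informative = 0
    for column in zip(*seqs):
        col_counts = Counter(c for c in column if c not in MISSING)
        if len(col_counts) > 1:
            variable += 1  # at least two different determined characters
        if sum(1 for n in col_counts.values() if n >= 2) >= 2:
            informative += 1  # at least two characters, each seen at least twice

    return {
        "taxa": len(seqs),
        "length": length,
        "cells": cells,
        "missing": missing,
        "missing_percent": 100.0 * missing / cells,
        "gc_content": (counts["G"] + counts["C"]) / acgt if acgt else 0.0,
        "variable_sites": variable,
        "parsimony_informative_sites": informative,
    }


if __name__ == "__main__":
    example = {"t1": "ACGTAC", "t2": "ACGTTC", "t3": "AC-TTG", "t4": "ATGAAG"}
    for name, value in summarize_alignment(example).items():
        print(name, value)
```

A single pass over the columns is enough for the site counts, which is why this kind of summary scales to alignments with many taxa and loci.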

amplimap: a versatile tool to process and analyze targeted NGS data

Bioinformatics ◽

10.1093/bioinformatics/btz582 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5349-5350

Author(s):

Nils Koelling ◽

Marie Bernkopf ◽

Eduardo Calpena ◽

Geoffrey J Maher ◽

Kerry A Miller ◽

...

Keyword(s):

Command Line ◽

User Friendliness ◽

Targeted Next Generation Sequencing ◽

Base Calling ◽

Targeted Ngs ◽

Command Line Tool ◽

Versatile Tool ◽

Ngs Data ◽

Python Package ◽

Generation Sequencing

Abstract Summary: amplimap is a command-line tool to automate the processing and analysis of data from targeted next-generation sequencing experiments with PCR-based amplicons or capture-based enrichment systems. From raw sequencing reads, amplimap generates output such as read alignments, annotated variant calls, target coverage statistics and variant allele counts and frequencies for each target base pair. In addition to its focus on user-friendliness and reproducibility, amplimap supports advanced features such as consensus base calling for read families based on unique molecular identifiers and filtering false positive variant calls caused by amplification of off-target loci. Availability and implementation: amplimap is available as a free Python package under the open-source Apache 2.0 License. Documentation, source code and installation instructions are available at https://github.com/koelling/amplimap.
