Simphony: simulating large-scale, rhythmic data

bioRxiv ◽

10.1101/497859 ◽

2018 ◽

Author(s):

Jordan M Singer ◽

Darwin Y Fu ◽

Jacob J Hughey

Keyword(s):

Experimental Design ◽

Large Scale ◽

Method Development ◽

Negative Binomial ◽

Simulated Data ◽

General Purpose ◽

Computational Method ◽

Next Generation Sequencing Data ◽

Multiple Time ◽

Sequencing Data

Simulated data are invaluable for assessing a computational method's ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature's rhythmic properties (e.g., shape, amplitude, and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from next-generation sequencing data. We show an example of using Simphony to benchmark a method for detecting rhythms. Our results suggest that Simphony can aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.

Simphony: simulating large-scale, rhythmic data

PeerJ ◽

10.7717/peerj.6985 ◽

2019 ◽

Vol 7 ◽

pp. e6985 ◽

Cited By ~ 5

Author(s):

Jordan M. Singer ◽

Darwin Y. Fu ◽

Jacob J. Hughey

Keyword(s):

Experimental Design ◽

Large Scale ◽

Method Development ◽

Negative Binomial ◽

Simulated Data ◽

R Package ◽

General Purpose ◽

Computational Method ◽

Multiple Time ◽

Multiple Time Points

Simulated data are invaluable for assessing a computational method’s ability to distinguish signal from noise. Although many biological systems show rhythmicity, there is no general-purpose tool to simulate large-scale, rhythmic data. Here we present Simphony, an R package for simulating data from experiments in which the abundances of rhythmic and non-rhythmic features (e.g., genes) are measured at multiple time points in multiple conditions. Simphony has parameters for specifying experimental design and each feature’s rhythmic properties (e.g., amplitude and phase). In addition, Simphony can sample measurements from Gaussian and negative binomial distributions, the latter of which approximates read counts from RNA-seq data. We show an example of using Simphony to evaluate the accuracy of rhythm detection. Our results suggest that Simphony will aid experimental design and computational method development. Simphony is thoroughly documented and freely available at https://github.com/hugheylab/simphony.
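The kind of simulation the abstract describes can be illustrated with a minimal Python sketch. This is not Simphony's API (Simphony is an R package); the function names, the cosine rhythm shape, the 24-hour default period, and the choice to treat the expected abundance as log2 of the expected count in the negative binomial case are all assumptions made for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def negative_binomial(rng, mean, dispersion):
    """Draw a negative binomial variate as a gamma-Poisson mixture.

    The resulting variance is mean + dispersion * mean**2.
    """
    shape = 1.0 / dispersion
    scale = mean * dispersion  # chosen so the gamma rate has expectation `mean`
    return poisson(rng, rng.gammavariate(shape, scale))


def expected_abundance(time, base, amplitude, phase, period=24.0):
    """Cosine rhythm peaking at time == phase; amplitude 0 gives a flat feature."""
    return base + amplitude * math.cos(2.0 * math.pi * (time - phase) / period)


def simulate(features, times, family="gaussian", sd=1.0, dispersion=0.1, seed=0):
    """Return one row of measurements per feature, one column per time point."""
    rng = random.Random(seed)
    data = []
    for feat in features:
        row = []
        for t in times:
            mu = expected_abundance(t, feat["base"], feat["amplitude"], feat["phase"])
            if family == "gaussian":
                row.append(rng.gauss(mu, sd))
            else:
                # Assumption: mu is the log2 of the expected read count.
                row.append(negative_binomial(rng, 2.0 ** mu, dispersion))
        data.append(row)
    return data


# Hypothetical design: one rhythmic and one non-rhythmic feature,
# sampled every 4 hours over 2 days.
features = [
    {"base": 6.0, "amplitude": 1.0, "phase": 6.0},
    {"base": 6.0, "amplitude": 0.0, "phase": 0.0},
]
times = list(range(0, 48, 4))
counts = simulate(features, times, family="negbinom")
```

Because the rhythmic and non-rhythmic features are known by construction, output like `counts` can serve as ground truth when benchmarking a rhythm-detection method.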

Speeding Up Large-Scale Next Generation Sequencing Data Analysis with pBWA

Journal of Applied Bioinformatics & Computational Biology ◽

10.4172/2329-9533.1000101 ◽

2017 ◽

Vol 01 (01) ◽

Cited By ~ 4

Author(s):

Darren Peters ◽

Xuemei Luo ◽

Ke Qiu ◽

Ping Liang

Keyword(s):

Data Analysis ◽

Next Generation Sequencing ◽

Large Scale ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing ◽

Sequencing Data Analysis

InteractomeSeq: a web server for the identification and profiling of domains and epitopes from phage display and next generation sequencing data

Nucleic Acids Research ◽

10.1093/nar/gkaa363 ◽

2020 ◽

Vol 48 (W1) ◽

pp. W200-W207

Author(s):

Simone Puccio ◽

Giorgio Grillo ◽

Arianna Consiglio ◽

Maria Felicia Soluri ◽

Daniele Sblattero ◽

...

Keyword(s):

Phage Display ◽

Large Scale ◽

High Throughput Sequencing ◽

Gene Annotation ◽

Web Server ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Phage Display Technology ◽

Essential Information ◽

Research Fields

High-Throughput Sequencing technologies are transforming many research fields, including the analysis of phage display libraries. The phage display technology coupled with deep sequencing was introduced more than a decade ago and holds the potential to circumvent the traditional laborious picking and testing of individual phage rescued clones. However, from a bioinformatics point of view, the analysis of this kind of data was always performed by adapting tools designed for other purposes, thus not considering the noise background typical of the ‘interactome sequencing’ approach and the heterogeneity of the data. InteractomeSeq is a web server allowing data analysis of protein domains (‘domainome’) or epitopes (‘epitome’) from either Eukaryotic or Prokaryotic genomic phage libraries generated and selected by following an Interactome sequencing approach. InteractomeSeq allows users to upload raw sequencing data and to obtain an accurate characterization of domainome/epitome profiles after setting the parameters required to tune the analysis. The release of this tool is relevant for the scientific and clinical community, because InteractomeSeq will fill an existing gap in the field of large-scale biomarkers profiling, reverse vaccinology, and structural/functional studies, thus contributing essential information for gene annotation or antigen identification. InteractomeSeq is freely available at https://InteractomeSeq.ba.itb.cnr.it/

Abstract 2376: MutationValidator: A computational method for variant cross-validation in next-generation sequencing data

10.1158/1538-7445.am2014-2376 ◽

2014 ◽

Author(s):

Mara Rosenberg ◽

Gad Getz ◽

Adam Kiezun ◽

Andrey Sivachenko

Keyword(s):

Next Generation Sequencing ◽

Cross Validation ◽

Computational Method ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

ASCOT identifies key regulators of neuronal subtype-specific splicing

Nature Communications ◽

10.1038/s41467-019-14020-5 ◽

2020 ◽

Vol 11 (1) ◽

Cited By ~ 2

Author(s):

Jonathan P. Ling ◽

Christopher Wilks ◽

Rone Charles ◽

Patrick J. Leavey ◽

Devlina Ghosh ◽

...

Keyword(s):

Rna Splicing ◽

Large Scale ◽

Splice Variants ◽

Next Generation Sequencing Data ◽

Data Sets ◽

Cell Type ◽

Sequencing Data ◽

Large Scale Analysis ◽

Cell Type Specific ◽

Public Archives

Public archives of next-generation sequencing data are growing exponentially, but the difficulty of marshaling this data has led to its underutilization by scientists. Here, we present ASCOT, a resource that uses annotation-free methods to rapidly analyze and visualize splice variants across tens of thousands of bulk and single-cell data sets in the public archive. To demonstrate the utility of ASCOT, we identify novel cell type-specific alternative exons across the nervous system and leverage ENCODE and GTEx data sets to study the unique splicing of photoreceptors. We find that PTBP1 knockdown and MSI1 and PCBP2 overexpression are sufficient to activate many photoreceptor-specific exons in HepG2 liver cancer cells. This work demonstrates how large-scale analysis of public RNA-Seq data sets can yield key insights into cell type-specific control of RNA splicing and underscores the importance of considering both annotated and unannotated splicing events.

Detection of Pathogenic Microbe Composition Using Next-Generation Sequencing Data

Frontiers in Genetics ◽

10.3389/fgene.2020.603093 ◽

2020 ◽

Vol 11 ◽

Author(s):

Haiyong Zhao ◽

Shuang Wang ◽

Xiguo Yuan

Keyword(s):

Next Generation Sequencing ◽

Computational Method ◽

Superior Performance ◽

Next Generation Sequencing Data ◽

Support Vector ◽

Next Generation ◽

Sequencing Data ◽

Microbial Composition ◽

High Resolution Data ◽

Generation Sequencing

Next-generation sequencing (NGS) technologies have provided great opportunities to analyze pathogenic microbes with high-resolution data. The main goal is to accurately detect microbial composition and abundances in a sample. However, high similarity among sequences from different species and the existence of sequencing errors pose various challenges. Numerous methods have been developed for quantifying microbial composition and abundance, but they are not versatile enough for the analysis of samples with mixtures of noise. In this paper, we propose a new computational method, PGMicroD, for the detection of pathogenic microbial composition in a sample using NGS data. The method first filters out reads that may have been mapped incorrectly and extracts multiple species-related features from the sequencing reads of 16S rRNA. Then it trains a Support Vector Machine classifier to predict the microbial composition. Finally, it assigns all multiple-mapped sequencing reads to the references of the predicted species to estimate the abundance of each species. The performance of PGMicroD is evaluated on both simulated and real sequencing data and is compared with several existing methods. The results demonstrate that our proposed method achieves superior performance. The software package of PGMicroD is available at https://github.com/BDanalysis/PGMicroD.

A bivariate zero-inflated negative binomial model for identifying underlying dependence with application to single cell RNA sequencing data

10.1101/2020.03.06.977728 ◽

2020 ◽

Author(s):

Hunyong Cho ◽

Chuwen Liu ◽

John S. Preisser ◽

Di Wu

Keyword(s):

Single Cell ◽

Rna Sequencing ◽

Latent Variable ◽

Large Scale ◽

Negative Binomial ◽

Model Fitting ◽

Sequencing Data ◽

Excess Zeros ◽

Binomial Distributions ◽

Single Cell Rna Sequencing

Measuring gene-gene dependence in single cell RNA sequencing (scRNA-seq) count data is often of interest and remains challenging, because an unidentified portion of the zero counts represent non-detected RNA due to technical reasons. Conventional statistical methods that fail to account for technical zeros incorrectly measure the dependence among genes. To address this problem, we propose a bivariate zero-inflated negative binomial (BZINB) model constructed using a bivariate Poisson-gamma mixture with dropout indicators for the technical (excess) zeros. Parameters are estimated based on the EM algorithm and are used to measure the underlying dependence by decomposing the two sources of zeros. Compared to existing models, the proposed BZINB model is specifically designed for estimating dependence and is more flexible, while preserving the marginal zero-inflated negative binomial distributions. Additionally, it has a simple latent variable framework, allowing parameters to have clear and intuitive interpretations, and its computation is feasible with large scale data. Using a recent scRNA-seq dataset, we illustrate model fitting and how the model-based measures can be different from naive measures. The inferential ability of the proposed model is evaluated in a simulation study. An R package ‘bzinb’ is available on CRAN.
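One generic way to obtain dependent counts from a Poisson-gamma mixture with dropout indicators is sketched below in Python. This is not the paper's parameterization and not the `bzinb` package's interface: the shared-gamma construction, the independent dropout indicators, and all function and parameter names are assumptions chosen for illustration.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate with Knuth's method (adequate for modest lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_pair(rng, shape_shared, shape_1, shape_2, scale, dropout_1, dropout_2):
    """Sample one (x1, x2) pair of dependent, zero-inflated counts.

    A gamma component shared by both Poisson rates induces the dependence;
    each marginal rate is gamma distributed, so each marginal count is
    negative binomial before dropout is applied.
    """
    shared = rng.gammavariate(shape_shared, scale)
    x1 = poisson(rng, shared + rng.gammavariate(shape_1, scale))
    x2 = poisson(rng, shared + rng.gammavariate(shape_2, scale))
    # Simplifying assumption: the two dropout indicators are independent.
    if rng.random() < dropout_1:
        x1 = 0  # technical (excess) zero
    if rng.random() < dropout_2:
        x2 = 0
    return x1, x2


def pearson(pairs):
    """Naive Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
clean = [sample_pair(rng, 5.0, 1.0, 1.0, 2.0, 0.0, 0.0) for _ in range(5000)]
noisy = [sample_pair(rng, 5.0, 1.0, 1.0, 2.0, 0.4, 0.4) for _ in range(5000)]
print(pearson(clean), pearson(noisy))
```

Comparing the naive correlation with and without dropout shows how technical zeros can distort a dependence measure that ignores them, which is the problem the abstract sets out to address.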

Deconvolute individual genomes from metagenome sequences through short read clustering

PeerJ ◽

10.7717/peerj.8966 ◽

2020 ◽

Vol 8 ◽

pp. e8966 ◽

Cited By ~ 1

Author(s):

Kexue Li ◽

Yakang Lu ◽

Li Deng ◽

Lili Wang ◽

Lizhen Shi ◽

...

Keyword(s):

Large Scale ◽

False Negative ◽

Next Generation Sequencing Data ◽

Clustering Methods ◽

Sequencing Data ◽

Short Reads ◽

Clustering Problem ◽

Metagenome Assembly ◽

Real World Datasets ◽

Almost All

Metagenome assembly from short next-generation sequencing data is a challenging process due to its large scale and computational complexity. Clustering short reads by species before assembly offers a unique opportunity for parallel downstream assembly of genomes with individualized optimization. However, current read clustering methods suffer from either false negative (under-clustering) or false positive (over-clustering) problems. Here we extended our previous read clustering software, SpaRC, by exploiting statistics derived from multiple samples in a dataset to reduce the under-clustering problem. Using synthetic and real-world datasets, we demonstrated that this method has the potential to cluster almost all of the short reads from genomes with sufficient sequencing coverage. The improved read clustering in turn leads to improved downstream genome assembly quality.

NGSPERL: a semi-automated framework for large scale next generation sequencing data analysis

International Journal of Computational Biology and Drug Design ◽

10.1504/ijcbdd.2015.072082 ◽

2015 ◽

Vol 8 (3) ◽

pp. 203

Author(s):

Quanhu Sheng ◽

Shilin Zhao ◽

Mingsheng Guo ◽

Yu Shyr

Keyword(s):

Data Analysis ◽

Next Generation Sequencing ◽

Large Scale ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing ◽

Sequencing Data Analysis

GenoPheno: cataloging large-scale phenotypic and next-generation sequencing data within human datasets

Briefings in Bioinformatics ◽

10.1093/bib/bbaa033 ◽

2020 ◽

Author(s):

Alba Gutiérrez-Sacristán ◽

Carlos De Niz ◽

Cartik Kothari ◽

Sek Won Kong ◽

Kenneth D Mandl ◽

...

Keyword(s):

Next Generation Sequencing ◽

Web Application ◽

Large Scale ◽

Human Subjects ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Phenotypic Data ◽

Data Repositories ◽

Generation Sequencing

Precision medicine promises to revolutionize treatment, shifting therapeutic approaches from the classical one-size-fits-all to those more tailored to the patient’s individual genomic profile, lifestyle and environmental exposures. Yet, to advance precision medicine’s main objective (ensuring the optimum diagnosis, treatment and prognosis for each individual), investigators need access to large-scale clinical and genomic data repositories. Despite the vast proliferation of these datasets, locating and obtaining access to many remains a challenge. We sought to provide an overview of available patient-level datasets that contain both genotypic data, obtained by next-generation sequencing, and phenotypic data, and to create a dynamic, online catalog for consultation, contribution and revision by the research community. Datasets included in this review meet six inclusion criteria: they (i) contain data from more than 500 human subjects; (ii) contain both genotypic and phenotypic data from the same subjects; (iii) include whole genome sequencing or whole exome sequencing data; (iv) include at least 100 recorded phenotypic variables per subject; (v) are accessible through a website or collaboration with investigators and (vi) make access information available in English. Using these criteria, we identified 30 datasets, reviewed them and provided results in the release version of a catalog, which is publicly available through a dynamic Web application and on GitHub. Users can review as well as contribute new datasets for inclusion (Web: https://avillachlab.shinyapps.io/genophenocatalog/; GitHub: https://github.com/hms-dbmi/GenoPheno-CatalogShiny).
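The six inclusion criteria listed in the abstract can be read as a single predicate over a dataset record. The Python sketch below is illustrative only: the `Dataset` class, its field names, and the example record are hypothetical and are not taken from the GenoPheno catalog's code or data.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical record describing one candidate dataset."""
    name: str
    n_subjects: int
    paired_genotype_phenotype: bool  # both data types for the same subjects
    has_wgs_or_wes: bool             # whole genome or whole exome sequencing
    n_phenotypic_variables: int      # recorded per subject
    accessible: bool                 # via a website or collaboration
    access_info_in_english: bool


def meets_inclusion_criteria(d):
    """Return True only if the record satisfies all six criteria."""
    return (
        d.n_subjects > 500                    # (i) more than 500 human subjects
        and d.paired_genotype_phenotype       # (ii)
        and d.has_wgs_or_wes                  # (iii)
        and d.n_phenotypic_variables >= 100   # (iv) at least 100 variables
        and d.accessible                      # (v)
        and d.access_info_in_english          # (vi)
    )


# Made-up example record, not a real catalog entry.
example = Dataset("ExampleCohort", 1200, True, True, 150, True, True)
print(meets_inclusion_criteria(example))
```

Note the asymmetry in the thresholds as stated: the subject count is a strict bound (more than 500), while the phenotypic variable count is inclusive (at least 100).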
