Clustering gene expression time series data using an infinite Gaussian process mixture model

2017
Author(s):
Ian C. McDowell
Dinesh Manandhar
Christopher M. Vockley
Amy K. Schmid
Timothy E. Reddy
...

Transcriptome-wide time series expression profiling is used to characterize the cellular response to environmental perturbations. The first step in analyzing transcriptional response data is often to cluster genes with similar responses. Here, we present a nonparametric model-based method, the Dirichlet process Gaussian process mixture model (DPGP), which jointly models cluster number with a Dirichlet process and temporal dependencies with Gaussian processes. We demonstrate the accuracy of DPGP in comparison with state-of-the-art approaches using hundreds of simulated data sets. To further test our method, we apply DPGP to published microarray data from a microbial model organism exposed to stress and to novel RNA-seq data from a human cell line exposed to the glucocorticoid dexamethasone. We validate our clusters by examining local transcription factor binding and histone modifications. Our results demonstrate that jointly modeling cluster number and temporal dependencies can reveal novel regulatory mechanisms. DPGP software is freely available online at https://github.com/PrincetonUniversity/DP_GP_cluster.
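As an illustration of how a Dirichlet process prior leaves the number of clusters open, the following minimal sketch draws cluster assignments from the Chinese restaurant process, a standard representation of the Dirichlet process. It is not taken from the DPGP software: the function name `sample_crp_assignments`, the concentration value, and the seed are assumptions made for this example, and the Gaussian process likelihood that DPGP combines with this prior is omitted.

```python
import random


def sample_crp_assignments(n_items, alpha, seed=0):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    Hypothetical helper for illustration only. alpha > 0 is the
    concentration parameter: larger values favor more clusters.
    """
    rng = random.Random(seed)
    assignments = []
    cluster_sizes = []
    for _ in range(n_items):
        # An existing cluster is chosen in proportion to its size,
        # a new cluster in proportion to alpha.
        weights = cluster_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(cluster_sizes):
            cluster_sizes.append(1)  # open a new cluster
        else:
            cluster_sizes[choice] += 1
        assignments.append(choice)
    return assignments, cluster_sizes


if __name__ == "__main__":
    labels, sizes = sample_crp_assignments(n_items=100, alpha=1.0, seed=42)
    print("number of clusters:", len(sizes))
    print("cluster sizes:", sizes)
```

The number of clusters is not fixed in advance; it emerges from the draws and tends to grow slowly with the number of items.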

2020
Author(s):
Sk Md Mosaddek Hossain
Aanzil Akram Halsana
Lutfunnesa Khatun
Sumanta Ray
Anirban Mukhopadhyay

Pancreatic ductal adenocarcinoma (PDAC) is the most lethal type of pancreatic cancer (PC), and its late detection leads to therapeutic failure. This study aims to identify key regulatory genes and their impact on the progression of the disease, helping to clarify its etiology, which is still largely unknown. We leverage time-series gene expression data of this disease, so that the identified key regulators capture the characteristics of gene activity patterns during the progression of the cancer. We have identified the key modules and predicted gene functions of top genes from the compiled gene association network (GAN). Here, we have used the natural cubic spline regression model (splineTimeR) to identify differentially expressed genes (DEG) from the PDAC microarray time-series data downloaded from the Gene Expression Omnibus (GEO). First, we have identified key transcriptomic regulators (TR) and DNA binding transcription factors (DbTF). Subsequently, the Dirichlet process and Gaussian process (DPGP) mixture model is utilized to identify the key gene modules. A variation of the partial correlation method is utilized to analyze the GAN, followed by gene function prediction from the network. Finally, a panel of key genes related to PDAC is highlighted from each of the analyses performed.


2018
Vol 14 (1)
pp. e1005896
Author(s):
Ian C. McDowell
Dinesh Manandhar
Christopher M. Vockley
Amy K. Schmid
Timothy E. Reddy
...

Author(s):  
Puneet Agarwal ◽  
William Walker ◽  
Kenneth Bhalla

The most probable maximum (MPM) is the extreme value statistic commonly used in the offshore industry. Extreme values of vessel motions, structural response, and environment are often expressed using the MPM. For a Gaussian process, the MPM is a function of the root-mean-square value and the zero-crossing rate of the process. Accurate estimates of the MPM may be obtained in the frequency domain from spectral moments of the known power spectral density. If the MPM is to be estimated from the time series of a random process, either from measurements or from simulations, the time series data should be of sufficient duration, be sampled at an adequate rate, and comprise an ensemble of multiple realizations. This is not the case when measured data are recorded for an insufficient duration, or when one wants to make decisions (requiring an estimate of the MPM) in real time based on observing the data only for a short duration. Sometimes, the instrumentation system may not be properly designed to measure the dynamic vessel motions with a fine sampling rate, or it may be a legacy instrumentation system. The question then becomes whether the short-duration and/or undersampled data are useful at all, whether some useful information (i.e., an estimate of the MPM) can be extracted, and, if so, what the accuracy and uncertainty of such estimates are. In this paper, a procedure for estimation of the MPM from the short-time maxima, i.e., the maximum value from a time series of short duration (say, 10 or 30 minutes), is presented. For this purpose, pitch data are simulated from the vessel RAOs (response amplitude operators). Factors to convert the short-time maxima to the MPM are computed for various non-exceedance levels. It is shown that the factors estimated from simulation can also be obtained from the theory of extremes of a Gaussian process.
Afterwards, estimation of the MPM from the short-time maxima is explored for an undersampled process, with the caveat that adequately sampled data should be preferred whenever they are available. It is found that the undersampled data can be somewhat useful and that factors to convert the short-time maxima to the MPM can be derived for an associated non-exceedance level. However, compared to the adequately sampled data, the factors for the undersampled data are less useful since they depend on more variables and have more uncertainty. While the vessel pitch data were the focus of this paper, the results and conclusions are valid for any adequately sampled narrow-banded Gaussian process.
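The dependence of the MPM on the root-mean-square value and the zero-crossing rate can be sketched with the standard narrow-band result, in which peaks are treated as independent and Rayleigh distributed. This is an assumption of the example and not a reproduction of the paper's procedure; the function names and the numerical inputs below are hypothetical.

```python
import math


def most_probable_maximum(rms, zero_crossing_rate_hz, duration_s):
    """MPM of a narrow-banded Gaussian process: rms * sqrt(2 ln N).

    N is the expected number of cycles, zero_crossing_rate_hz * duration_s.
    Assumes independent, Rayleigh-distributed peaks.
    """
    n_cycles = zero_crossing_rate_hz * duration_s
    return rms * math.sqrt(2.0 * math.log(n_cycles))


def extreme_quantile(rms, zero_crossing_rate_hz, duration_s, p):
    """Maximum that is not exceeded with probability p over the duration.

    Inverts the extreme-value CDF (1 - exp(-x^2 / (2 rms^2)))^N.
    """
    n_cycles = zero_crossing_rate_hz * duration_s
    return rms * math.sqrt(-2.0 * math.log(1.0 - p ** (1.0 / n_cycles)))


def conversion_factor(zero_crossing_rate_hz, short_s, long_s, p):
    """Ratio of the long-duration MPM to the short-time maximum at level p."""
    long_mpm = most_probable_maximum(1.0, zero_crossing_rate_hz, long_s)
    short_max = extreme_quantile(1.0, zero_crossing_rate_hz, short_s, p)
    return long_mpm / short_max


if __name__ == "__main__":
    # Hypothetical inputs: 0.1 Hz zero-crossing rate, 30-minute record,
    # 3-hour target duration, short-time maximum at the median (p = 0.5).
    print(conversion_factor(0.1, 1800.0, 10800.0, 0.5))
```

Under these assumptions the factor does not depend on the root-mean-square value, because it cancels in the ratio; only the cycle counts and the chosen non-exceedance level matter.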


Author(s):  
Nobuhiko Yamaguchi

Gaussian Process Dynamical Models (GPDMs) constitute a nonlinear dimensionality reduction technique that provides a probabilistic representation of time series data in terms of Gaussian process priors. In this paper, we report a method based on GPDMs to visualize the states of time-series data. Conventional GPDMs are unsupervised, and therefore, even when the labels of data are available, it is not possible to use this information. To overcome the problem, we propose a supervised GPDM (S-GPDM) that utilizes both the data and their corresponding labels. We demonstrate experimentally that the S-GPDM can locate related motion data closer together than conventional GPDMs.


Author(s):  
Rosmanjawati Binti Abdul Rahman
Seuk Wai Phoong
Mohd Tahir Ismail
Seuk Yen Phoong

2012
Vol 2012
pp. 1-16
Author(s):
John W. Lau
Ed Cripps

Traditional GARCH models describe volatility levels that evolve smoothly over time, generated by a single GARCH regime. However, nonstationary time series data may exhibit abrupt changes in volatility, suggesting changes in the underlying GARCH regimes. Further, the number and times of regime changes are not always obvious. This article outlines a nonparametric mixture of GARCH models that is able to estimate the number and time of volatility regime changes by mixing over the Poisson-Kingman process. The process is a generalisation of the Dirichlet process typically used in nonparametric models for time-dependent data; it provides a richer clustering structure, and its application to time series data is novel. Inference is Bayesian, and a Markov chain Monte Carlo algorithm to explore the posterior distribution is described. The methodology is illustrated on the Standard and Poor's 500 financial index.
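For readers unfamiliar with what a single GARCH regime looks like, the following minimal sketch computes the conditional variance recursion of a GARCH(1,1) model. The choice of order (1,1), the function name, and the parameter values are assumptions made for illustration; the abstract does not state the order used, and the Poisson-Kingman mixture itself is not implemented here.

```python
def garch11_variance(returns, omega, alpha, beta):
    """Conditional variances of a GARCH(1,1) model for a return series.

    var_t = omega + alpha * r_{t-1}^2 + beta * var_{t-1}
    Requires alpha + beta < 1 so the unconditional variance exists;
    the recursion is started from that unconditional variance.
    """
    if not returns:
        return []
    var = omega / (1.0 - alpha - beta)
    variances = [var]
    for r in returns[:-1]:
        var = omega + alpha * r * r + beta * var
        variances.append(var)
    return variances


if __name__ == "__main__":
    # Hypothetical return series with one large shock in the middle.
    series = [0.1, -0.2, 0.05, 3.0, -0.1, 0.2]
    for t, v in enumerate(garch11_variance(series, 0.1, 0.1, 0.8)):
        print(t, round(v, 4))
```

Within one regime the parameters `omega`, `alpha`, and `beta` stay fixed, so volatility changes only through the recursion. A regime change corresponds to switching to a different parameter set at some time point.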


Author(s):  
Seuk Yen Phoong
Mohd Tahir Ismail
Seuk Wai Phoong
Rosmanjawati Binti Abdul Rahman
