MSIGNET: a Metropolis sampling-based method for global optimal significant network identification

2018 ◽  
Author(s):  
Xi Chen ◽  
Jianhua Xuan

Abstract
In this paper, we propose a novel approach, named MSIGNET, to identify subnetworks of significantly expressed genes by integrating context-specific gene expression and protein-protein interaction (PPI) data. Specifically, we integrate the differential expression of each gene and the mutual information of gene pairs in a Bayesian framework and use Metropolis sampling to identify functional interactions. During the sampling process, a conditional probability is calculated for a randomly selected gene to control the network state transition. Our method computes global statistics over all genes and their interactions and thus achieves a globally optimal subnetwork. We apply MSIGNET to simulated data and demonstrate its superior performance over comparable network identification tools. Using a validated Parkinson’s disease data set, we show that the network identified by MSIGNET is consistent with previously reported results but provides a more biologically meaningful interpretation of Parkinson’s disease. Finally, to study networks related to ovarian cancer recurrence, we investigate two patient data sets. The networks identified from these independent data sets are functionally consistent, and their common genes and interactions are well supported by current biological knowledge.
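
The state-transition idea can be sketched as a toy Metropolis sampler over gene subsets. Everything below (the additive score, the edge bonus, the function and variable names) is an illustrative assumption, not MSIGNET's actual Bayesian score:

```python
import math
import random

def metropolis_subnetwork(node_scores, edges, n_steps=10000, seed=0):
    """Toy Metropolis sampler over gene subsets (illustrative only).

    node_scores: dict gene -> differential-expression score (higher = better)
    edges: set of frozenset({g1, g2}) PPI edges
    Returns the best-scoring subset visited during sampling.
    """
    rng = random.Random(seed)
    genes = list(node_scores)
    state = set(rng.sample(genes, k=max(1, len(genes) // 4)))

    def score(subset):
        # node evidence plus a bonus for each PPI edge internal to the subset
        s = sum(node_scores[g] for g in subset)
        s += sum(1.0 for e in edges if e <= subset)
        return s

    best, best_score = set(state), score(state)
    cur_score = best_score
    for _ in range(n_steps):
        g = rng.choice(genes)          # propose toggling one gene's membership
        proposal = state ^ {g}
        if not proposal:
            continue
        new_score = score(proposal)
        # Metropolis acceptance with exp(score) as the unnormalised target
        if math.log(rng.random() + 1e-12) < new_score - cur_score:
            state, cur_score = proposal, new_score
            if cur_score > best_score:
                best, best_score = set(state), cur_score
    return best
```

On a three-gene toy problem with two high-scoring, connected genes and one low-scoring gene, the sampler settles on the connected pair.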

2022 ◽  
Vol 12 (1) ◽  
Author(s):  
Zsigmond Benkő ◽  
Tamás Bábel ◽  
Zoltán Somogyvári

Abstract
Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called “unicorn” or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF), which measures the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall outside the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies, and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord detection algorithms even in recognizing traditional outliers, and it also detected unique events that those methods missed. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully retrieved unique events in cases where they were already known, such as the gravitational waves of a binary black hole merger in LIGO detector data and the signs of respiratory failure in an ECG data series. Furthermore, unique events were found in the LIBOR data set covering the last 30 years.
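
One plausible reading of the TOF construction, sketched below (an assumed form, not the authors' reference implementation): embed the series by time delay, find each point's k nearest neighbours in state space, and score the temporal spread of those neighbours. A uniquely occurring pattern has only temporally adjacent neighbours, hence a small TOF:

```python
import numpy as np

def temporal_outlier_factor(x, dim=3, delay=1, k=4):
    """TOF-style uniqueness score on a 1-D series (assumed form).

    Small TOF = the point's state-space neighbours are clustered in
    time around it, i.e. the pattern occurs only once.
    """
    n = len(x) - (dim - 1) * delay
    # time-delay embedding: row t = (x[t], x[t+delay], ..., x[t+(dim-1)*delay])
    emb = np.column_stack([x[i * delay : i * delay + n] for i in range(dim)])
    tof = np.empty(n)
    for t in range(n):
        d = np.linalg.norm(emb - emb[t], axis=1)
        d[t] = np.inf                        # exclude the point itself
        nbrs = np.argsort(d)[:k]             # k nearest in state space
        tof[t] = np.sqrt(np.mean((nbrs - t) ** 2.0))  # temporal spread
    return tof
```

On a periodic signal with a single injected transient, the TOF minimum localises the transient, because the periodic points find neighbours in every cycle while the transient finds them only in its own vicinity.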


2021 ◽  
pp. gr.273631.120
Author(s):  
Xinhao Liu ◽  
Huw A Ogilvie ◽  
Luay Nakhleh

Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. Coalescent hidden Markov model (coalHMM) methods are a promising avenue for the analysis of large genomic alignments, which are increasingly common, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and, through a divide-and-conquer approach, with more taxa. Using a simulated data set resembling a human-chimp-gorilla scenario, we show that our method achieves accuracy comparable to or better than previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation it is flexible enough to enable future implementations of all kinds of population models.
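
The HMM backbone of such a method is the standard forward recursion, with hidden states playing the role of local genealogies along the alignment. The sketch below shows only that generic piece; the simulation-derived transition rates and the black-box variational inference are not reproduced:

```python
import numpy as np

def hmm_forward_loglik(obs_loglik, log_trans, log_init):
    """Forward algorithm in log space.

    obs_loglik: (n_sites, n_states) per-site emission log-likelihoods
    log_trans:  (n_states, n_states) log transition matrix
    log_init:   (n_states,) log initial distribution
    Returns the total log-likelihood of the alignment.
    """
    n_sites, n_states = obs_loglik.shape
    alpha = log_init + obs_loglik[0]
    for t in range(1, n_sites):
        # log-sum-exp over previous states, vectorised per current state
        m = alpha.max()
        alpha = np.log(np.exp(alpha - m) @ np.exp(log_trans)) + m + obs_loglik[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

Sanity checks: with one state and deterministic transitions, the log-likelihood is just the sum of emission terms; with state-independent emissions, the hidden chain marginalises out entirely.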


2016 ◽  
Vol 2016 ◽  
pp. 1-7
Author(s):  
Zhizheng Liang

Feature scaling has attracted considerable attention during the past several decades because of its important role in feature selection. In this paper, a novel algorithm for learning scaling factors of features is proposed. It first assigns a nonnegative scaling factor to each feature of the data and then adopts a generalized performance measure to learn the optimal scaling factors. Notably, the proposed model can be transformed into a convex optimization problem, a second-order cone program (SOCP), so the scaling factors learned by our method are globally optimal in that sense. Several experiments on simulated data, UCI data sets, and a gene expression data set demonstrate that the proposed method is more effective than previous methods.
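
As a rough illustration of the nonnegative-scaling idea, here is a projected-gradient toy surrogate; the paper's actual formulation is an SOCP with a generalized performance measure, which this sketch does not solve:

```python
import numpy as np

def learn_feature_scales(X, y, lr=0.1, n_iter=200):
    """Toy stand-in for scaled-feature learning (illustrative only).

    Learns nonnegative per-feature scales w that reward class
    separation relative to within-class scatter, projecting onto
    w >= 0 at each step. Returned scales are normalised to max 1.
    """
    X = np.asarray(X, float)
    y = np.asarray(y)
    mu0, mu1 = X[y == 0].mean(0), X[y == 1].mean(0)
    between = (mu0 - mu1) ** 2                     # per-feature class separation
    within = X[y == 0].var(0) + X[y == 1].var(0)   # per-feature within-class scatter
    w = np.ones(X.shape[1])
    for _ in range(n_iter):
        grad = between - within * w                # gradient of w.between - 0.5 w^2.within
        w = np.maximum(w + lr * grad, 0.0)         # projection onto the nonnegative orthant
    return w / max(w.max(), 1e-12)
```

With one discriminative feature and one pure-noise feature, the learned scale of the noise feature collapses toward zero.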


2005 ◽  
Vol 30 (4) ◽  
pp. 369-396 ◽  
Author(s):  
Eisuke Segawa

Multi-indicator growth models were formulated as special three-level hierarchical generalized linear models to analyze growth of a trait latent variable measured by ordinal items. Items are nested within time-points, and time-points are nested within subjects. These models are special because they include a factor-analytic structure. The model can analyze not only data with item- and time-level missing observations, but also data whose time points are freely specified over subjects. Furthermore, features useful for longitudinal analyses were included: a first-order autoregressive (AR(1)) error structure for the trait residuals and estimated time-scores. The approach is Bayesian, using Markov chain Monte Carlo, and the model is implemented in WinBUGS. The models are illustrated with two simulated data sets and one real data set with planned missing items within a scale.
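
The three-level structure can be summarized schematically; the notation below is assumed for illustration and is not the author's:

```latex
% Level 1: ordinal item j at time t for subject i (cumulative logit,
% factor-analytic loadings \lambda_j on the latent trait \theta_{it})
P(Y_{ijt} \le c) = \operatorname{logit}^{-1}\!\left(\tau_{jc} - \lambda_j \theta_{it}\right)

% Level 2: growth of the latent trait over estimated time-scores s_t,
% with an AR(1) structure on the trait residuals
\theta_{it} = \beta_{0i} + \beta_{1i} s_t + \epsilon_{it},
\qquad \epsilon_{it} = \rho\,\epsilon_{i,t-1} + u_{it}

% Level 3: subject-level random growth coefficients
(\beta_{0i}, \beta_{1i})^{\top} \sim \mathcal{N}(\mu_\beta, \Sigma_\beta)
```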


2015 ◽  
Author(s):  
Maike Ahrens ◽  
Michael Turewicz ◽  
Katrin Marcus ◽  
Helmut E. Meyer ◽  
Caroline May ◽  
...  

In personalized medicine, one major goal is the identification of yet unknown patient subgroups with specific gene or protein expression. Different subgroups can indicate different molecular subtypes of a disease. These subtypes might correlate with disease progression, prognosis or therapy response, and the subgroup-specific genes or proteins are potential drug targets. Using high-throughput molecular data, the aim is to characterize a patient subgroup by identifying both the set of samples that show a distinct expression pattern and the set of features that are affected. We present the new workflow FSOL for the identification of patient subgroups from two-sample comparisons (e.g. healthy vs. diseased). First, a pre-filter based on the univariate score FisherSum (FS) is applied to assess subgroup-specific expression of the features. FS has been shown to outperform competing methods in several settings. Second, the selected features are compared with regard to the samples that form the affected subgroup. This step uses the OrderedList (OL) method, originally developed for comparing result lists from gene expression studies. We compare our workflow FSOL to a reference workflow based on biclustering using real-world and simulated data. On a leukemia data set, a true biological subgroup is detected with higher stability by FSOL. On simulated data, FSOL shows higher sensitivity and accuracy than biclustering, especially for small to moderate differences. The exploratory approach FSOL may help in identifying yet unknown mechanisms in pathologic processes and may assist in the generation of new research hypotheses.
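
The FisherSum score itself is defined in the paper and is not reproduced here. As a stand-in, an outlier-sum-style statistic captures the same filtering intuition: a feature scores highly when only a subset of diseased samples deviates far from the healthy distribution:

```python
import numpy as np

def outlier_sum_score(x_healthy, x_diseased, q=0.95):
    """Outlier-sum-style subgroup score (illustrative stand-in, not
    FisherSum). Sums how far diseased samples exceed an upper
    quantile of the healthy distribution, so a shifted subgroup of
    diseased samples produces a large score.
    """
    x_healthy = np.asarray(x_healthy, float)
    x_diseased = np.asarray(x_diseased, float)
    cutoff = np.quantile(x_healthy, q)            # upper healthy quantile
    excess = x_diseased[x_diseased > cutoff] - cutoff
    return float(excess.sum())                    # mass of the outlying subgroup
```

A feature where 20 of 100 diseased samples are strongly up-regulated scores far above a feature with no subgroup effect.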


2019 ◽  
Author(s):  
Attila Lengyel ◽  
David W. Roberts ◽  
Zoltán Botta-Dukát

Abstract
Aims: To introduce REMOS, a new iterative reallocation method (with two variants) for vegetation classification, and to compare its performance with OPTSIL. We test (1) how effectively REMOS and OPTSIL maximize mean silhouette width and minimize the number of negative silhouette widths when run on classifications with different structure; (2) how the three reallocation methods differ in runtime with different sample sizes; and (3) whether classifications produced by the three methods differ in the number of diagnostic species, a surrogate for interpretability.
Study area: Simulation; example data sets from grasslands in Hungary and forests in Wyoming and Utah, USA.
Methods: We classified random subsets of simulated data with the flexible-beta algorithm for different values of beta. These classifications were subsequently optimized by REMOS and OPTSIL and compared for mean silhouette width and proportion of negative silhouette widths. Then, we classified three vegetation data sets of different sizes into two to ten clusters, optimized them with the reallocation methods, and compared their runtimes, mean silhouette widths, numbers of negative silhouette widths, and numbers of diagnostic species.
Results: In terms of mean silhouette width, OPTSIL performed best when the initial classification already had high mean silhouette width. The REMOS algorithms reached slightly lower mean silhouette widths than the maximum achievable with OPTSIL, but their efficiency was consistent across different initial classifications; thus REMOS was significantly superior to OPTSIL when the initial classification had low mean silhouette width. REMOS resulted in zero or a negligible number of negative silhouette widths across all classifications. OPTSIL performed similarly when the initial classification was effective but could not reach as low a proportion of misclassified objects when the initial classification was inefficient. The REMOS algorithms were typically more than an order of magnitude faster to calculate than OPTSIL. There was no clear difference between REMOS and OPTSIL in the number of diagnostic species.
Conclusions: The REMOS algorithms may be preferable to OPTSIL when (1) the primary objective is to reduce or eliminate negative silhouette widths in a classification, (2) the initial classification has low mean silhouette width, or (3) the time efficiency of the algorithm matters because of the size of the data set or a high number of clusters.
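
A minimal reallocation loop in the spirit of REMOS (not the published algorithm; the distance measure, update order, and stopping rule are simplified assumptions) moves every object with a negative silhouette width to its closest alternative cluster and repeats until no moves remain:

```python
import numpy as np

def silhouette_reallocate(X, labels, max_iter=20):
    """Reallocate objects with negative silhouette width (sketch).

    An object's silhouette width is negative when the mean distance
    to its own cluster (a) exceeds the mean distance to the nearest
    other cluster (b); such objects are moved to that cluster.
    """
    X = np.asarray(X, float)
    labels = np.asarray(labels).copy()
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for _ in range(max_iter):
        moved = False
        ks = np.unique(labels)
        for i in range(len(X)):
            means = {}
            for k in ks:
                mask = labels == k
                mask_i = mask.copy()
                mask_i[i] = False            # exclude the object itself
                if mask_i.any():
                    means[k] = D[i, mask_i].mean()
            a = means.get(labels[i], np.inf)                       # own cluster
            others = {k: v for k, v in means.items() if k != labels[i]}
            if not others:
                continue
            k_best = min(others, key=others.get)
            b = others[k_best]
            if b < a:                        # negative silhouette width
                labels[i] = k_best
                moved = True
        if not moved:
            break
    return labels
```

On two well-separated point clouds with a couple of deliberately misplaced labels, the loop restores the natural partition in one or two passes.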


2021 ◽  
Author(s):  
Gah-Yi Ban ◽  
N. Bora Keskin

We consider a seller who can dynamically adjust the price of a product at the individual customer level by utilizing information about customers’ characteristics encoded as a d-dimensional feature vector. We assume a personalized demand model, the parameters of which depend on s out of the d features. The seller initially does not know the relationship between the customer features and the product demand but learns it through sales observations over a selling horizon of T periods. We prove that the seller’s expected regret, that is, the revenue loss against a clairvoyant who knows the underlying demand relationship, is at least of order [Formula: see text] under any admissible policy. We then design a near-optimal pricing policy for a semiclairvoyant seller (who knows which s of the d features are in the demand model) that achieves an expected regret of order [Formula: see text]. We extend this policy to the more realistic setting in which the seller does not know the true demand predictors, and show that it has an expected regret of order [Formula: see text], which is also near-optimal. Finally, we test our theory on simulated data and on a data set from an online auto loan company in the United States. On both data sets, our experimentation-based pricing policy is superior to intuitive and/or widely practiced customized pricing methods, such as myopic pricing and segment-then-optimize policies. Furthermore, our policy improves upon the loan company’s historical pricing decisions by 47% in expected revenue over a six-month period. This paper was accepted by Noah Gans, stochastic models and simulation.
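
The experimentation-based idea can be caricatured as a simple explore-then-commit loop under an assumed linear demand model. This toy ignores customer features entirely and is far from the paper's near-optimal policy; all names and parameter values are illustrative:

```python
import numpy as np

def explore_then_exploit_pricing(a=10.0, b=2.0, T=500, n_explore=50, seed=0):
    """Toy pricing loop: demand is d = a - b*p + noise (unknown to the
    seller). Experiment with random prices for n_explore periods, then
    repeatedly fit (a, b) by least squares and charge the estimated
    revenue-maximising price a_hat / (2 * b_hat).
    """
    rng = np.random.default_rng(seed)
    prices, demands = [], []
    revenue = 0.0
    for t in range(T):
        if t < n_explore:
            p = rng.uniform(0.5, 4.0)         # experimentation phase
        else:
            A = np.column_stack([np.ones(len(prices)), prices])
            coef, *_ = np.linalg.lstsq(A, np.array(demands), rcond=None)
            a_hat, b_hat = coef[0], -coef[1]
            p = a_hat / (2.0 * max(b_hat, 1e-6))  # greedy optimal price
        d = max(a - b * p + rng.normal(0, 0.5), 0.0)
        prices.append(p)
        demands.append(d)
        revenue += p * d
    return p, revenue
```

With the defaults, the committed price converges near the true revenue maximiser a/(2b) = 2.5; myopic pricing without the exploration phase would have nothing to fit.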


Author(s):  
Guro Dørum ◽  
Lars Snipen ◽  
Margrete Solheim ◽  
Solve Saebo

Gene set analysis methods have become a widely used tool for including prior biological knowledge in the statistical analysis of gene expression data. Advantages of these methods include increased sensitivity, easier interpretation and more conformity in the results. However, gene set methods do not employ all the available information about gene relations. Genes are arranged in complex networks, where the network distances contain detailed information about inter-gene dependencies. We propose a method that uses gene networks to smooth gene expression data with the aim of reducing the number of false positives and identifying important subnetworks. Gene dependencies are extracted from the network topology and are used to smooth genewise test statistics. To find the optimal degree of smoothing, we propose a criterion that considers the correlation between the network and the data. The network smoothing is shown to improve the ability to identify important genes in simulated data. Applied to a real data set, the smoothing accentuates parts of the network with a high density of differentially expressed genes.
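
A generic propagation-style smoother illustrates the basic mechanism; the paper derives its weights from network distances and chooses the degree of smoothing via a network-data correlation criterion, neither of which is reproduced here:

```python
import numpy as np

def smooth_statistics(z, adj, alpha=0.5, n_iter=50):
    """Smooth genewise test statistics over a network (sketch).

    Iterates s <- (1 - alpha) * z + alpha * W s, where W is the
    row-normalised adjacency matrix, so each gene borrows strength
    from its network neighbours; alpha sets the degree of smoothing.
    """
    adj = np.asarray(adj, float)
    z = np.asarray(z, float)
    deg = adj.sum(1, keepdims=True)
    W = adj / np.maximum(deg, 1.0)     # row-normalised adjacency
    s = z.copy()
    for _ in range(n_iter):
        s = (1.0 - alpha) * z + alpha * (W @ s)
    return s
```

On a five-node path with a single high statistic in the middle, smoothing elevates the immediate neighbours while the peak remains the maximum, i.e. signal spreads locally rather than globally.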


2011 ◽  
Vol 76 (3) ◽  
pp. 547-572 ◽  
Author(s):  
Charles Perreault

I examine how our capacity to produce accurate culture-historical reconstructions changes as more archaeological sites are discovered, dated, and added to a data set. More precisely, I describe, using simulated data sets, how increases in the number of known sites impact the accuracy and precision of our estimations of (1) the earliest and (2) latest date of a cultural tradition, (3) the date and (4) magnitude of its peak popularity, as well as (5) its rate of spread and (6) disappearance in a population. I show that the accuracy and precision of inferences about these six historical processes are not affected in the same fashion by changes in the number of known sites. I also consider the impact of two simple taphonomic site destruction scenarios on the results. Overall, the results presented in this paper indicate that unless we are in possession of near-total samples of sites, and can be certain that there are no taphonomic biases in the universe of sites to be sampled, we will make inferences of varying precision and accuracy depending on the aspect of a cultural trait’s history in question.
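
The effect on the earliest-date estimate can be reproduced with a few lines of simulation (a toy uniform model, not Perreault's design): the oldest sampled site is always biased late relative to the true start of the tradition, and the bias shrinks as more sites become known:

```python
import numpy as np

def earliest_date_bias(n_sites, true_start=-5000.0, true_end=-2000.0,
                       n_reps=2000, seed=0):
    """Mean bias (in years) of estimating a tradition's start date as
    the oldest known site, under a uniform spread of site dates.
    Illustrative toy model only.
    """
    rng = np.random.default_rng(seed)
    dates = rng.uniform(true_start, true_end, size=(n_reps, n_sites))
    est_start = dates.min(axis=1)                  # oldest sampled site
    return float(np.mean(est_start - true_start))  # always >= 0 (biased late)
```

For a uniform model the expected bias is roughly the date range divided by n_sites + 1, so five known sites leave a bias of several centuries while fifty leave only decades.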


2019 ◽  
Vol 35 (23) ◽  
pp. 4955-4961
Author(s):  
Yongzhuang Liu ◽  
Jian Liu ◽  
Yadong Wang

Abstract
Motivation: Whole-genome sequencing (WGS) of tumor–normal sample pairs is a powerful approach for comprehensively characterizing germline copy number variations (CNVs) and somatic copy number alterations (SCNAs) in cancer research and clinical practice. Existing computational approaches for detecting copy number events cannot detect germline CNVs and SCNAs simultaneously, and they yield low accuracy for SCNAs.
Results: In this study, we developed TumorCNV, a novel approach for jointly detecting germline CNVs and SCNAs from WGS data of a matched tumor–normal sample pair. We compared TumorCNV with existing copy number event detection approaches using simulated data and real data for the COLO-829 melanoma cell line. The experimental results showed that TumorCNV achieved superior performance compared with existing approaches.
Availability and implementation: The software TumorCNV is implemented in a combination of Java and R, and it is freely available at https://github.com/yongzhuang/TumorCNV.
Supplementary information: Supplementary data are available at Bioinformatics online.
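
The underlying read-depth signal can be sketched as a per-bin log2 tumor/normal coverage ratio with naive thresholds (a deliberate oversimplification; TumorCNV jointly models germline CNVs and SCNAs rather than thresholding, and the cutoffs here are arbitrary):

```python
import numpy as np

def call_scna_bins(tumor_depth, normal_depth, gain=0.3, loss=-0.3):
    """Flag candidate gains/losses from binned read depths (sketch).

    Depths are mean-normalised per sample, then the per-bin log2
    tumor/normal ratio is thresholded.
    """
    t = np.asarray(tumor_depth, float)
    n = np.asarray(normal_depth, float)
    lr = np.log2((t / t.mean()) / (n / n.mean()))   # normalised log2 ratio
    calls = np.where(lr > gain, "gain",
                     np.where(lr < loss, "loss", "neutral"))
    return lr, calls
```

Doubled coverage in a run of bins yields a log2 ratio near +1 ("gain"), halved coverage near -1 ("loss"), and matched coverage stays near 0.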

