Assessing Significance of Information Flow in High Dimensional Dynamical Systems With Few Data

Author(s):  
Ross P. Anderson ◽  
Maurizio Porfiri

Information-theoretical notions of causality provide a model-free approach to identification of the magnitude and direction of influence among sub-components of a stochastic dynamical system. In addition to detecting causal influences, any effective test should also report the level of statistical significance of the finding. Here, we focus on transfer entropy, which has recently been considered for causality detection in a variety of fields based on statistical significance tests that are valid only in the asymptotic regime, that is, with enormous amounts of data. In the interest of applications with limited available data, we develop a non-asymptotic theory for the probability distribution of the difference between the empirically-estimated transfer entropy and the true transfer entropy. Based on this result, we additionally demonstrate an approach for statistical hypothesis testing for directed information flow in dynamical systems with a given number of observed time steps.
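
As a hedged illustration of the kind of test this abstract discusses (not the authors' non-asymptotic procedure), the Python sketch below estimates transfer entropy with a simple plug-in histogram estimator on quantile-binned data and assesses significance by shuffling the source series to build a surrogate null distribution; the function names, binning choice, and synthetic coupled pair are illustrative assumptions.

```python
import numpy as np

def transfer_entropy(x, y, bins=3):
    """Plug-in estimate of TE(X -> Y), in bits, with one-step histories."""
    # Quantile-based coarse graining into `bins` partitions.
    cuts = lambda s: np.quantile(s, np.linspace(0, 1, bins + 1)[1:-1])
    xq, yq = np.digitize(x, cuts(x)), np.digitize(y, cuts(y))
    # Joint histogram over (y_{t+1}, y_t, x_t).
    joint = np.zeros((bins, bins, bins))
    for a, b, c in zip(yq[1:], yq[:-1], xq[:-1]):
        joint[a, b, c] += 1
    joint /= joint.sum()
    p_yx = joint.sum(axis=0)        # p(y_t, x_t)
    p_yy = joint.sum(axis=2)        # p(y_{t+1}, y_t)
    p_y = joint.sum(axis=(0, 2))    # p(y_t)
    a, b, c = np.nonzero(joint)
    return float(np.sum(joint[a, b, c] *
                        np.log2(joint[a, b, c] * p_y[b] /
                                (p_yy[a, b] * p_yx[b, c]))))

def surrogate_test(x, y, n_surrogates=200, bins=3, seed=0):
    """One-sided P-value for TE(X -> Y) against source-shuffled surrogates."""
    rng = np.random.default_rng(seed)
    observed = transfer_entropy(x, y, bins)
    null = np.array([transfer_entropy(rng.permutation(x), y, bins)
                     for _ in range(n_surrogates)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_surrogates)
    return observed, p_value

# Synthetic coupled pair: Y follows X with a one-step lag.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = np.concatenate(([0.0], 0.8 * x[:-1])) + 0.2 * rng.normal(size=2000)
print(surrogate_test(x, y))   # small P-value expected for the X -> Y direction
```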

Author(s):  
Sach Mukherjee

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This Chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
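
The chapter's own procedures are not reproduced here; as a hedged illustration of the multi-variable issue it raises, the sketch below applies the standard Benjamini-Hochberg false discovery rate correction to a large set of per-variable P-values. The simulated P-values and the 5% level are illustrative assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Largest k with p_(k) <= (k / m) * alpha; reject all hypotheses up to k.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

# 10,000 variables: 100 with real effects, the rest pure noise.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=9900)
p_signal = rng.beta(0.1, 10, size=100)   # concentrated near zero
rejected = benjamini_hochberg(np.concatenate([p_signal, p_null]))
print(rejected.sum(), "hypotheses rejected")
```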


2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
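
As a hedged illustration of the independent-samples t-test described in the article, the sketch below runs scipy.stats.ttest_ind on two synthetic samples of beak-depth measurements; the numbers are made up for demonstration and are not the Grants' data.

```python
import numpy as np
from scipy import stats

# Illustrative beak-depth measurements (mm) for two samples of finches;
# these values are synthetic, not the Grants' field data.
rng = np.random.default_rng(42)
before_drought = rng.normal(loc=9.4, scale=0.8, size=50)
after_drought = rng.normal(loc=10.1, scale=0.8, size=50)

# Independent-samples t-test: are the two sample means consistent with
# the null hypothesis of a single population mean?
t_stat, p_value = stats.ttest_ind(before_drought, after_drought)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
# A small P-value (e.g., < 0.05) leads us to reject the null hypothesis
# of equal means at that significance level.
```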


Author(s):  
VICTOR K. Y. CHAN ◽  
W. ERIC WONG ◽  
T. F. XIE

Software metric models predict the target software metric(s), e.g., the development work effort or defect rates, for any future software project based on the project's predictor software metric(s), e.g., the project team size. Obviously, the construction of such a software metric model makes use of a data sample of such metrics from analogous past projects. However, such data samples are often incomplete. Moreover, the decision on whether a particular predictor metric should be included is most likely based on an intuitive or experience-based assumption that the predictor metric has a statistically significant impact on the target metric. This assumption is usually not verifiable "retrospectively" after the model is constructed, which can lead to redundant predictor metric(s) and/or unnecessary predictor metric complexity. To address these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. This methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious incomplete data. This paper documents the methodology and the tests on these two types of software metric models.
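
A hedged sketch of the two ingredients named in this abstract, using off-the-shelf tools rather than the authors' derived methodology: scikit-learn's KNNImputer for k-NN imputation of missing predictor metrics, followed by an ordinary least-squares fit whose coefficient P-values provide the retrospective significance check. The project data below are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
import statsmodels.api as sm

# Hypothetical project sample: columns = [team_size, code_size_kloc, effort_pm],
# with missing predictor values marked as np.nan. Values are illustrative only.
data = np.array([
    [ 5.0,   20.0,  90.0],
    [ 8.0, np.nan, 150.0],
    [ 3.0,   10.0,  40.0],
    [12.0,   55.0, 230.0],
    [ 6.0, np.nan, 110.0],
    [ 9.0,   42.0, 180.0],
    [ 4.0,   15.0,  60.0],
    [11.0,   50.0, 210.0],
])

# Step 1: k-NN imputation of missing values (k = 3 most similar projects).
imputed = KNNImputer(n_neighbors=3).fit_transform(data)

# Step 2: fit the metric model and test each predictor's significance.
X = sm.add_constant(imputed[:, :2])     # team_size, code_size_kloc
y = imputed[:, 2]                       # effort (person-months)
model = sm.OLS(y, X).fit()
print(model.pvalues)  # retrospective check: reconsider predictors with large P-values
```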


2013 ◽  
Vol 12 (04) ◽  
pp. 1350019 ◽  
Author(s):  
XUEJIAO WANG ◽  
PENGJIAN SHANG ◽  
JINGJING HUANG ◽  
GUOCHEN FENG

Recently, the information-theoretic concept of transfer entropy was introduced by Schreiber. It quantifies, in a nonparametric and explicitly nonsymmetric way, the flow of information between two time series. This model-free approach, based on Shannon entropy, in principle allows us to detect statistical dependencies of all types, i.e., linear and nonlinear temporal correlations. In practice, however, transfer entropy is estimated from data that have been discretized into three partitions by some coarse graining, so we are naturally interested in how the discretization of the two series affects the transfer entropy. In this paper, we analyze results based on data generated by linear and ARFIMA models, as well as a dataset consisting of seven indices over the period 1992–2002. The results show that the higher the degree of data discretization, the larger the value of the transfer entropy; moreover, the direction of the information flow is unchanged as the degree of discretization varies.
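
A hedged sketch of the discretization effect studied here, assuming a generic plug-in transfer entropy estimator with quantile-based coarse graining (mirroring the estimator sketched earlier in this collection, not the paper's exact estimator or datasets); it sweeps the number of partitions on a synthetic linearly coupled pair.

```python
import numpy as np

def plug_in_te(x, y, n_bins):
    """Plug-in transfer entropy TE(X -> Y), in bits, after quantile coarse graining."""
    cuts = lambda s: np.quantile(s, np.linspace(0, 1, n_bins + 1)[1:-1])
    xq, yq = np.digitize(x, cuts(x)), np.digitize(y, cuts(y))
    joint = np.zeros((n_bins,) * 3)          # histogram over (y_{t+1}, y_t, x_t)
    for a, b, c in zip(yq[1:], yq[:-1], xq[:-1]):
        joint[a, b, c] += 1
    joint /= joint.sum()
    p_yx, p_yy, p_y = joint.sum(0), joint.sum(2), joint.sum((0, 2))
    a, b, c = np.nonzero(joint)
    return float(np.sum(joint[a, b, c] * np.log2(
        joint[a, b, c] * p_y[b] / (p_yy[a, b] * p_yx[b, c]))))

# Synthetic linearly coupled pair: Y follows X with a one-step lag.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.concatenate(([0.0], 0.7 * x[:-1])) + 0.3 * rng.normal(size=5000)

# Sweep the number of partitions used in the coarse graining.
for n_bins in (2, 3, 4, 6, 8):
    print(f"{n_bins} partitions: "
          f"TE(X->Y) = {plug_in_te(x, y, n_bins):.3f} bits, "
          f"TE(Y->X) = {plug_in_te(y, x, n_bins):.3f} bits")
```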


2019 ◽  
Vol 3 (3) ◽  
pp. 827-847 ◽  
Author(s):  
Leonardo Novelli ◽  
Patricia Wollstadt ◽  
Pedro Mediano ◽  
Michael Wibral ◽  
Joseph T. Lizier

Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present—as implemented in the IDTxl open-source software—addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.
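
A hedged sketch of the IDTxl workflow for multivariate transfer entropy network inference, modeled on the package's published example; the settings keys and method names below follow that example as recalled and may differ across versions, and the JIDT (Java) backend must be installed for the chosen estimator.

```python
from idtxl.multivariate_te import MultivariateTE
from idtxl.data import Data

# Example data generator shipped with IDTxl (a small test network of coupled processes).
data = Data()
data.generate_mute_data(n_samples=1000, n_replications=5)

settings = {
    'cmi_estimator': 'JidtGaussianCMI',  # linear-Gaussian conditional MI estimator
    'max_lag_sources': 5,
    'min_lag_sources': 1,
}

# Greedy, hierarchically tested inference of the minimal directed network.
network_analysis = MultivariateTE()
results = network_analysis.analyse_network(settings=settings, data=data)

# Print the inferred directed edges; the significance thresholds used in the
# hierarchical tests are configurable through the settings dictionary.
results.print_edge_list(weights='max_te_lag', fdr=False)
```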


2019 ◽  
Vol 26 (2) ◽  
pp. 91-108 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are methods that assess the degree to which a wavelet quantity (e.g., power and coherence) exceeds background noise. Commonly, a point-wise approach is adopted in which a wavelet quantity at every point in a wavelet spectrum is individually compared to the critical level of the point-wise test. However, because adjacent wavelet coefficients are correlated and wavelet spectra often contain many wavelet quantities, the point-wise test can produce many false positive results that occur in clusters or patches. To circumvent the point-wise test drawbacks, it is necessary to implement the recently developed area-wise, geometric, cumulative area-wise, and topological significance tests, which are reviewed and developed in this paper. To improve the computational efficiency of the cumulative area-wise test, a simplified version of the testing procedure is created based on the idea that its output is the mean of individual estimates of statistical significance calculated from the geometric test applied at a set of point-wise significance levels. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate wavelet spectral features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to strictly test for periodicities by using normalized arclength, which is defined as the number of points composing a cross section of a patch divided by the wavelet scale in question. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs) measuring the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.
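
A hedged sketch of the normalized-arclength quantity defined above: given a (scale x time) field of point-wise P-values, it labels contiguous significance patches with scipy.ndimage and divides each patch's cross-section length at a scale by that scale. The toy field and the reporting threshold are illustrative assumptions; the full cumulative arc-wise test (comparison against noise-derived critical levels) is not reproduced here.

```python
import numpy as np
from scipy import ndimage

def normalized_arclengths(pointwise_p, scales, alpha=0.05):
    """Label contiguous patches of point-wise significance in a (scale x time)
    P-value field and return each patch's normalized arclength per scale:
    (points in the patch cross-section at that scale) / (the wavelet scale)."""
    significant = pointwise_p < alpha
    labels, n_patches = ndimage.label(significant)
    results = {}
    for patch in range(1, n_patches + 1):
        mask = labels == patch
        cross_sections = mask.sum(axis=1)      # width of the patch at each scale
        results[patch] = cross_sections / scales
    return results

# Toy example: 32 scales x 500 time points of uniform "noise" P-values,
# plus one artificial band of low P-values mimicking a periodic signal.
rng = np.random.default_rng(0)
p = rng.uniform(size=(32, 500))
p[10:13, 100:400] = 1e-4
scales = np.arange(1, 33, dtype=float)

# Report patches much wider than their scale, suggesting periodicity
# rather than a singularity-like structure.
for patch, arcs in normalized_arclengths(p, scales).items():
    if arcs.max() > 5.0:
        print("patch", patch, "max normalized arclength:", round(arcs.max(), 1))
```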


2019 ◽  
Vol 35 (19) ◽  
pp. 3592-3598 ◽  
Author(s):  
Justin G Chitpin ◽  
Aseel Awdeh ◽  
Theodore J Perkins

Abstract. Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown. Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation: The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP. Supplementary information: Supplementary data are available at Bioinformatics online.
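
A hedged sketch of the recalibration idea behind RECAP, not its actual implementation: raw P-values are mapped through the empirical distribution of P-values obtained on resampled null data, a monotone transform that restores approximate uniformity under the null. The biased "peak caller" below is a synthetic stand-in.

```python
import numpy as np

def recalibrate(raw_p, null_p):
    """Monotone recalibration: map each raw P-value to the fraction of
    null-data P-values at least as small (an empirical-CDF transform)."""
    null_sorted = np.sort(null_p)
    ranks = np.searchsorted(null_sorted, raw_p, side='right')
    return (ranks + 1) / (null_p.size + 1)

# Toy example: a biased "peak caller" that reports anti-conservative P-values.
rng = np.random.default_rng(0)
biased_null = rng.uniform(size=20000) ** 3   # P-values from resampled null data
biased_obs = rng.uniform(size=5000) ** 3     # P-values from data with no real signal
recal = recalibrate(biased_obs, biased_null)

print("raw P-values < 0.05 under the null:   ", np.mean(biased_obs < 0.05))  # far above 5%
print("recalibrated P-values < 0.05:         ", np.mean(recal < 0.05))       # roughly 5%
```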


2020 ◽  
Vol 34 (04) ◽  
pp. 5595-5603
Author(s):  
Dimitrios Sarigiannis ◽  
Thomas Parnell ◽  
Haralampos Pozidis

The combined algorithm selection and hyperparameter tuning (CASH) problem is characterized by large hierarchical hyperparameter spaces. Model-free hyperparameter tuning methods can explore such large spaces efficiently since they are highly parallelizable across multiple machines. When no prior knowledge or meta-data exists to boost their performance, these methods commonly sample random configurations following a uniform distribution. In this work, we propose a novel sampling distribution as an alternative to uniform sampling and prove theoretically that it has a better chance of finding the best configuration in a worst-case setting. In order to compare competing methods rigorously in an experimental setting, one must perform statistical hypothesis testing. We show that there is little-to-no agreement in the automated machine learning literature regarding which methods should be used. We contrast this disparity with the methods recommended by the broader statistics literature, and identify a suitable approach. We then select three popular model-free solutions to CASH and evaluate their performance, with uniform sampling as well as the proposed sampling scheme, across 67 datasets from the OpenML platform. We investigate the trade-off between exploration and exploitation across the three algorithms, and verify empirically that the proposed sampling distribution improves performance in all cases.
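
As a hedged illustration of the kind of rigorous comparison this abstract calls for (the paper's own recommended approach is not reproduced here), the sketch below applies the Wilcoxon signed-rank test, a common recommendation in the statistics literature for paired per-dataset comparisons, to synthetic scores from two tuning methods across 67 datasets.

```python
import numpy as np
from scipy import stats

# Hypothetical best scores achieved by two tuning methods on the same 67 datasets,
# paired by dataset; the numbers are synthetic, for illustration only.
rng = np.random.default_rng(0)
uniform_scores = rng.beta(8, 2, size=67)
proposed_scores = np.clip(uniform_scores + rng.normal(0.01, 0.02, size=67), 0, 1)

# Wilcoxon signed-rank test on the paired per-dataset differences.
stat, p_value = stats.wilcoxon(proposed_scores, uniform_scores)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.4f}")
```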

