Assessing Significance of Information Flow in High Dimensional Dynamical Systems With Few Data

Author(s):  
Ross P. Anderson ◽  
Maurizio Porfiri

Information-theoretical notions of causality provide a model-free approach to identification of the magnitude and direction of influence among sub-components of a stochastic dynamical system. In addition to detecting causal influences, any effective test should also report the level of statistical significance of the finding. Here, we focus on transfer entropy, which has recently been considered for causality detection in a variety of fields based on statistical significance tests that are valid only in the asymptotic regime, that is, with enormous amounts of data. In the interest of applications with limited available data, we develop a non-asymptotic theory for the probability distribution of the difference between the empirically-estimated transfer entropy and the true transfer entropy. Based on this result, we additionally demonstrate an approach for statistical hypothesis testing for directed information flow in dynamical systems with a given number of observed time steps.
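
As a hedged illustration of the kind of test this abstract discusses (not the authors' non-asymptotic procedure), the Python sketch below estimates transfer entropy with a simple plug-in histogram estimator on quantile-binned data and assesses significance by shuffling the source series to build a surrogate null distribution; the function names, binning choice, and synthetic coupled pair are illustrative assumptions.

```python
import numpy as np

def transfer_entropy(x, y, bins=3):
    """Plug-in estimate of TE(X -> Y), in bits, with one-step histories."""
    # Quantile-based coarse graining into `bins` partitions.
    cuts = lambda s: np.quantile(s, np.linspace(0, 1, bins + 1)[1:-1])
    xq, yq = np.digitize(x, cuts(x)), np.digitize(y, cuts(y))
    # Joint histogram over (y_{t+1}, y_t, x_t).
    joint = np.zeros((bins, bins, bins))
    for a, b, c in zip(yq[1:], yq[:-1], xq[:-1]):
        joint[a, b, c] += 1
    joint /= joint.sum()
    p_yx = joint.sum(axis=0)        # p(y_t, x_t)
    p_yy = joint.sum(axis=2)        # p(y_{t+1}, y_t)
    p_y = joint.sum(axis=(0, 2))    # p(y_t)
    a, b, c = np.nonzero(joint)
    return float(np.sum(joint[a, b, c] *
                        np.log2(joint[a, b, c] * p_y[b] /
                                (p_yy[a, b] * p_yx[b, c]))))

def surrogate_test(x, y, n_surrogates=200, bins=3, seed=0):
    """One-sided P-value for TE(X -> Y) against source-shuffled surrogates."""
    rng = np.random.default_rng(seed)
    observed = transfer_entropy(x, y, bins)
    null = np.array([transfer_entropy(rng.permutation(x), y, bins)
                     for _ in range(n_surrogates)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_surrogates)
    return observed, p_value

# Synthetic coupled pair: Y follows X with a one-step lag.
rng = np.random.default_rng(1)
x = rng.normal(size=2000)
y = np.concatenate(([0.0], 0.8 * x[:-1])) + 0.2 * rng.normal(size=2000)
print(surrogate_test(x, y))   # small P-value expected for the X -> Y direction
```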

Author(s):  
Sach Mukherjee

A number of important problems in data mining can be usefully addressed within the framework of statistical hypothesis testing. However, while the conventional treatment of statistical significance deals with error probabilities at the level of a single variable, practical data mining tasks tend to involve thousands, if not millions, of variables. This Chapter looks at some of the issues that arise in the application of hypothesis tests to multi-variable data mining problems, and describes two computationally efficient procedures by which these issues can be addressed.
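
The chapter's own procedures are not reproduced here; as a hedged illustration of the multi-variable issue it raises, the sketch below applies the standard Benjamini-Hochberg false discovery rate correction to a large set of per-variable P-values. The simulated P-values and the 5% level are illustrative assumptions.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of rejected hypotheses at FDR level alpha."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Largest k with p_(k) <= (k / m) * alpha; reject all hypotheses up to k.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[:k + 1]] = True
    return reject

# 10,000 variables: 100 with real effects, the rest pure noise.
rng = np.random.default_rng(0)
p_null = rng.uniform(size=9900)
p_signal = rng.beta(0.1, 10, size=100)   # concentrated near zero
rejected = benjamini_hochberg(np.concatenate([p_signal, p_null]))
print(rejected.sum(), "hypotheses rejected")
```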


2019 ◽  
Vol 81 (8) ◽  
pp. 535-542
Author(s):  
Robert A. Cooper

Statistical methods are indispensable to the practice of science. But statistical hypothesis testing can seem daunting, with P-values, null hypotheses, and the concept of statistical significance. This article explains the concepts associated with statistical hypothesis testing using the story of “the lady tasting tea,” then walks the reader through an application of the independent-samples t-test using data from Peter and Rosemary Grant's investigations of Darwin's finches. Understanding how scientists use statistics is an important component of scientific literacy, and students should have opportunities to use statistical methods like this in their science classes.
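
As a hedged illustration of the independent-samples t-test described in the article, the sketch below runs scipy.stats.ttest_ind on two synthetic samples of beak-depth measurements; the numbers are made up for demonstration and are not the Grants' data.

```python
import numpy as np
from scipy import stats

# Illustrative beak-depth measurements (mm) for two samples of finches;
# these values are synthetic, not the Grants' field data.
rng = np.random.default_rng(42)
before_drought = rng.normal(loc=9.4, scale=0.8, size=50)
after_drought = rng.normal(loc=10.1, scale=0.8, size=50)

# Independent-samples t-test: are the two sample means consistent with
# the null hypothesis of a single population mean?
t_stat, p_value = stats.ttest_ind(before_drought, after_drought)
print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
# A small P-value (e.g., < 0.05) leads us to reject the null hypothesis
# of equal means at that significance level.
```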


Author(s):  
VICTOR K. Y. CHAN ◽  
W. ERIC WONG ◽  
T. F. XIE

Software metric models predict the target software metric(s), e.g., the development work effort or defect rates, for any future software project based on the project's predictor software metric(s), e.g., the project team size. Obviously, the construction of such a software metric model makes use of a data sample of such metrics from analogous past projects. However, such data samples are often incomplete. Moreover, the decision on whether a particular predictor metric should be included is most likely based on an intuitive or experience-based assumption that the predictor metric has a statistically significant impact on the target metric. This assumption is usually not verifiable "retrospectively" after the model is constructed, which can lead to redundant predictor metric(s) and/or unnecessary predictor metric complexity. To address these problems, we derived a methodology consisting of the k-nearest neighbors (k-NN) imputation method, statistical hypothesis testing, and a "goodness-of-fit" criterion. This methodology was tested on software effort metric models and software quality metric models, the latter of which usually suffer from far more serious incomplete data. This paper documents the methodology and the tests on these two types of software metric models.
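
A hedged sketch of the two ingredients named in this abstract, using off-the-shelf tools rather than the authors' derived methodology: scikit-learn's KNNImputer for k-NN imputation of missing predictor metrics, followed by an ordinary least-squares fit whose coefficient P-values provide the retrospective significance check. The project data below are invented for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer
import statsmodels.api as sm

# Hypothetical project sample: columns = [team_size, code_size_kloc, effort_pm],
# with missing predictor values marked as np.nan. Values are illustrative only.
data = np.array([
    [ 5.0,   20.0,  90.0],
    [ 8.0, np.nan, 150.0],
    [ 3.0,   10.0,  40.0],
    [12.0,   55.0, 230.0],
    [ 6.0, np.nan, 110.0],
    [ 9.0,   42.0, 180.0],
    [ 4.0,   15.0,  60.0],
    [11.0,   50.0, 210.0],
])

# Step 1: k-NN imputation of missing values (k = 3 most similar projects).
imputed = KNNImputer(n_neighbors=3).fit_transform(data)

# Step 2: fit the metric model and test each predictor's significance.
X = sm.add_constant(imputed[:, :2])     # team_size, code_size_kloc
y = imputed[:, 2]                       # effort (person-months)
model = sm.OLS(y, X).fit()
print(model.pvalues)  # retrospective check: reconsider predictors with large P-values
```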


2013 ◽  
Vol 12 (04) ◽  
pp. 1350019 ◽  
Author(s):  
XUEJIAO WANG ◽  
PENGJIAN SHANG ◽  
JINGJING HUANG ◽  
GUOCHEN FENG

Recently, the information-theoretic concept of transfer entropy was introduced by Schreiber. It quantifies, in a nonparametric and explicitly nonsymmetric way, the flow of information between two time series. This model-free approach, based on Shannon entropy, in principle allows us to detect statistical dependencies of all types, i.e., linear and nonlinear temporal correlations. In practice, however, transfer entropy is estimated from data that have been discretized into three partitions by some coarse graining, so we are naturally interested in how the discretization of the two series affects the transfer entropy. In this paper, we analyze results based on data generated by linear and ARFIMA models, as well as a dataset consisting of seven indices over the period 1992–2002. The results show that the higher the degree of data discretization, the larger the value of the transfer entropy; moreover, the direction of the information flow is unchanged as the degree of discretization varies.
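
A hedged sketch of the discretization effect studied here, assuming a generic plug-in transfer entropy estimator with quantile-based coarse graining (mirroring the estimator sketched earlier in this collection, not the paper's exact estimator or datasets); it sweeps the number of partitions on a synthetic linearly coupled pair.

```python
import numpy as np

def plug_in_te(x, y, n_bins):
    """Plug-in transfer entropy TE(X -> Y), in bits, after quantile coarse graining."""
    cuts = lambda s: np.quantile(s, np.linspace(0, 1, n_bins + 1)[1:-1])
    xq, yq = np.digitize(x, cuts(x)), np.digitize(y, cuts(y))
    joint = np.zeros((n_bins,) * 3)          # histogram over (y_{t+1}, y_t, x_t)
    for a, b, c in zip(yq[1:], yq[:-1], xq[:-1]):
        joint[a, b, c] += 1
    joint /= joint.sum()
    p_yx, p_yy, p_y = joint.sum(0), joint.sum(2), joint.sum((0, 2))
    a, b, c = np.nonzero(joint)
    return float(np.sum(joint[a, b, c] * np.log2(
        joint[a, b, c] * p_y[b] / (p_yy[a, b] * p_yx[b, c]))))

# Synthetic linearly coupled pair: Y follows X with a one-step lag.
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = np.concatenate(([0.0], 0.7 * x[:-1])) + 0.3 * rng.normal(size=5000)

# Sweep the number of partitions used in the coarse graining.
for n_bins in (2, 3, 4, 6, 8):
    print(f"{n_bins} partitions: "
          f"TE(X->Y) = {plug_in_te(x, y, n_bins):.3f} bits, "
          f"TE(Y->X) = {plug_in_te(y, x, n_bins):.3f} bits")
```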


2019 ◽  
Vol 3 (3) ◽  
pp. 827-847 ◽  
Author(s):  
Leonardo Novelli ◽  
Patricia Wollstadt ◽  
Pedro Mediano ◽  
Michael Wibral ◽  
Joseph T. Lizier

Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present—as implemented in the IDTxl open-source software—addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.
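
A hedged sketch of the IDTxl workflow for multivariate transfer entropy network inference, modeled on the package's published example; the settings keys and method names below follow that example as recalled and may differ across versions, and the JIDT (Java) backend must be installed for the chosen estimator.

```python
from idtxl.multivariate_te import MultivariateTE
from idtxl.data import Data

# Example data generator shipped with IDTxl (a small test network of coupled processes).
data = Data()
data.generate_mute_data(n_samples=1000, n_replications=5)

settings = {
    'cmi_estimator': 'JidtGaussianCMI',  # linear-Gaussian conditional MI estimator
    'max_lag_sources': 5,
    'min_lag_sources': 1,
}

# Greedy, hierarchically tested inference of the minimal directed network.
network_analysis = MultivariateTE()
results = network_analysis.analyse_network(settings=settings, data=data)

# Print the inferred directed edges; the significance thresholds used in the
# hierarchical tests are configurable through the settings dictionary.
results.print_edge_list(weights='max_te_lag', fdr=False)
```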


2019 ◽  
Vol 26 (2) ◽  
pp. 91-108 ◽  
Author(s):  
Justin A. Schulte

Abstract. Statistical hypothesis tests in wavelet analysis are methods that assess the degree to which a wavelet quantity (e.g., power and coherence) exceeds background noise. Commonly, a point-wise approach is adopted in which a wavelet quantity at every point in a wavelet spectrum is individually compared to the critical level of the point-wise test. However, because adjacent wavelet coefficients are correlated and wavelet spectra often contain many wavelet quantities, the point-wise test can produce many false positive results that occur in clusters or patches. To circumvent the point-wise test drawbacks, it is necessary to implement the recently developed area-wise, geometric, cumulative area-wise, and topological significance tests, which are reviewed and developed in this paper. To improve the computational efficiency of the cumulative area-wise test, a simplified version of the testing procedure is created based on the idea that its output is the mean of individual estimates of statistical significance calculated from the geometric test applied at a set of point-wise significance levels. Ideal examples are used to show that the geometric and cumulative area-wise tests are unable to differentiate wavelet spectral features arising from singularity-like structures from those associated with periodicities. A cumulative arc-wise test is therefore developed to strictly test for periodicities by using normalized arclength, which is defined as the number of points composing a cross section of a patch divided by the wavelet scale in question. A previously proposed topological significance test is formalized using persistent homology profiles (PHPs) measuring the number of patches and holes corresponding to the set of all point-wise significance values. Ideal examples show that the PHPs can be used to distinguish time series containing signal components from those that are purely noise. To demonstrate the practical uses of the existing and newly developed statistical methodologies, a first comprehensive wavelet analysis of Indian rainfall is also provided. An R software package has been written by the author to implement the various testing procedures.
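
A hedged sketch of the normalized-arclength quantity defined above: given a (scale x time) field of point-wise P-values, it labels contiguous significance patches with scipy.ndimage and divides each patch's cross-section length at a scale by that scale. The toy field and the reporting threshold are illustrative assumptions; the full cumulative arc-wise test (comparison against noise-derived critical levels) is not reproduced here.

```python
import numpy as np
from scipy import ndimage

def normalized_arclengths(pointwise_p, scales, alpha=0.05):
    """Label contiguous patches of point-wise significance in a (scale x time)
    P-value field and return each patch's normalized arclength per scale:
    (points in the patch cross-section at that scale) / (the wavelet scale)."""
    significant = pointwise_p < alpha
    labels, n_patches = ndimage.label(significant)
    results = {}
    for patch in range(1, n_patches + 1):
        mask = labels == patch
        cross_sections = mask.sum(axis=1)      # width of the patch at each scale
        results[patch] = cross_sections / scales
    return results

# Toy example: 32 scales x 500 time points of uniform "noise" P-values,
# plus one artificial band of low P-values mimicking a periodic signal.
rng = np.random.default_rng(0)
p = rng.uniform(size=(32, 500))
p[10:13, 100:400] = 1e-4
scales = np.arange(1, 33, dtype=float)

# Report patches much wider than their scale, suggesting periodicity
# rather than a singularity-like structure.
for patch, arcs in normalized_arclengths(p, scales).items():
    if arcs.max() > 5.0:
        print("patch", patch, "max normalized arclength:", round(arcs.max(), 1))
```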


2019 ◽  
Vol 35 (19) ◽  
pp. 3592-3598 ◽  
Author(s):  
Justin G Chitpin ◽  
Aseel Awdeh ◽  
Theodore J Perkins

Abstract. Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown. Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation: The RECAP software is available through www.perkinslab.ca or on github at https://github.com/theodorejperkins/RECAP. Supplementary information: Supplementary data are available at Bioinformatics online.
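
A hedged sketch of the recalibration idea behind RECAP, not its actual implementation: raw P-values are mapped through the empirical distribution of P-values obtained on resampled null data, a monotone transform that restores approximate uniformity under the null. The biased "peak caller" below is a synthetic stand-in.

```python
import numpy as np

def recalibrate(raw_p, null_p):
    """Monotone recalibration: map each raw P-value to the fraction of
    null-data P-values at least as small (an empirical-CDF transform)."""
    null_sorted = np.sort(null_p)
    ranks = np.searchsorted(null_sorted, raw_p, side='right')
    return (ranks + 1) / (null_p.size + 1)

# Toy example: a biased "peak caller" that reports anti-conservative P-values.
rng = np.random.default_rng(0)
biased_null = rng.uniform(size=20000) ** 3   # P-values from resampled null data
biased_obs = rng.uniform(size=5000) ** 3     # P-values from data with no real signal
recal = recalibrate(biased_obs, biased_null)

print("raw P-values < 0.05 under the null:   ", np.mean(biased_obs < 0.05))  # far above 5%
print("recalibrated P-values < 0.05:         ", np.mean(recal < 0.05))       # roughly 5%
```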


2020 ◽  
Vol 34 (04) ◽  
pp. 5595-5603
Author(s):  
Dimitrios Sarigiannis ◽  
Thomas Parnell ◽  
Haralampos Pozidis

The combined algorithm selection and hyperparameter tuning (CASH) problem is characterized by large hierarchical hyperparameter spaces. Model-free hyperparameter tuning methods can explore such large spaces efficiently since they are highly parallelizable across multiple machines. When no prior knowledge or meta-data exists to boost their performance, these methods commonly sample random configurations following a uniform distribution. In this work, we propose a novel sampling distribution as an alternative to uniform sampling and prove theoretically that it has a better chance of finding the best configuration in a worst-case setting. In order to compare competing methods rigorously in an experimental setting, one must perform statistical hypothesis testing. We show that there is little-to-no agreement in the automated machine learning literature regarding which methods should be used. We contrast this disparity with the methods recommended by the broader statistics literature, and identify a suitable approach. We then select three popular model-free solutions to CASH and evaluate their performance, with uniform sampling as well as the proposed sampling scheme, across 67 datasets from the OpenML platform. We investigate the trade-off between exploration and exploitation across the three algorithms, and verify empirically that the proposed sampling distribution improves performance in all cases.
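
As a hedged illustration of the kind of rigorous comparison this abstract calls for (the paper's own recommended approach is not reproduced here), the sketch below applies the Wilcoxon signed-rank test, a common recommendation in the statistics literature for paired per-dataset comparisons, to synthetic scores from two tuning methods across 67 datasets.

```python
import numpy as np
from scipy import stats

# Hypothetical best scores achieved by two tuning methods on the same 67 datasets,
# paired by dataset; the numbers are synthetic, for illustration only.
rng = np.random.default_rng(0)
uniform_scores = rng.beta(8, 2, size=67)
proposed_scores = np.clip(uniform_scores + rng.normal(0.01, 0.02, size=67), 0, 1)

# Wilcoxon signed-rank test on the paired per-dataset differences.
stat, p_value = stats.wilcoxon(proposed_scores, uniform_scores)
print(f"Wilcoxon statistic = {stat:.1f}, P = {p_value:.4f}")
```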

