Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing

2019
Vol 3 (3)
pp. 827-847
Author(s):
Leonardo Novelli
Patricia Wollstadt
Pedro Mediano
Michael Wibral
Joseph T. Lizier

Network inference algorithms are valuable tools for the study of large-scale neuroimaging datasets. Multivariate transfer entropy is well suited for this task, being a model-free measure that captures nonlinear and lagged dependencies between time series to infer a minimal directed network model. Greedy algorithms have been proposed to efficiently deal with high-dimensional datasets while avoiding redundant inferences and capturing synergistic effects. However, multiple statistical comparisons may inflate the false positive rate and are computationally demanding, which limited the size of previous validation studies. The algorithm we present—as implemented in the IDTxl open-source software—addresses these challenges by employing hierarchical statistical tests to control the family-wise error rate and to allow for efficient parallelization. The method was validated on synthetic datasets involving random networks of increasing size (up to 100 nodes), for both linear and nonlinear dynamics. The performance increased with the length of the time series, reaching consistently high precision, recall, and specificity (>98% on average) for 10,000 time samples. Varying the statistical significance threshold showed a more favorable precision-recall trade-off for longer time series. Both the network size and the sample size are one order of magnitude larger than previously demonstrated, showing feasibility for typical EEG and magnetoencephalography experiments.
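Since the algorithm is distributed in the open-source IDTxl Python toolkit, the analysis described above can be invoked in a few lines. The following is a minimal sketch assuming IDTxl's documented API; the toy data and the settings values (estimator, candidate lags, permutation counts) are purely illustrative.

```python
# Minimal sketch of a multivariate-TE network analysis with IDTxl
# (https://github.com/pwollstadt/IDTxl). Settings are illustrative;
# consult the IDTxl documentation for the full set of options.
import numpy as np
from idtxl.multivariate_te import MultivariateTE
from idtxl.data import Data

# Toy data: 5 processes, 1000 samples, 1 replication
# ('psr' = processes x samples x replications).
data = Data(np.random.randn(5, 1000, 1), dim_order='psr')

settings = {
    'cmi_estimator': 'JidtGaussianCMI',  # Gaussian estimator (linear dynamics)
    'max_lag_sources': 5,                # candidate source lags to test
    'min_lag_sources': 1,
    'max_lag_target': 5,                 # candidate target (self) lags
    'n_perm_max_stat': 200,              # permutations for hierarchical tests
    'alpha_max_stat': 0.05,
}

network_analysis = MultivariateTE()
results = network_analysis.analyse_network(settings=settings, data=data)

# Inferred directed network after network-level correction:
print(results.get_adjacency_matrix(weights='binary', fdr=True))
```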


2018
Author(s):
Diana Domanska
Chakravarthi Kanduri
Boris Simovski
Geir Kjetil Sandve

Background: The difficulties associated with sequencing and assembling some regions of the DNA sequence result in gaps in the reference genomes, typically represented as stretches of Ns. Although the presence of assembly gaps causes a slight reduction in the mapping rate in many experimental settings, this does not invalidate the typical statistical testing that compares read count distributions across experimental conditions. However, we hypothesize that not handling assembly gaps in the null model may confound statistical testing of the co-localization of genomic features. Results: First, we performed a series of explorative analyses to understand whether and how public genomic tracks intersect the assembly gaps track (hg19). The findings confirm that the genomic regions in public genomic tracks intersect assembly gaps very little, and that the observed intersections were confined to the start and end regions of the gaps rather than spanning them entirely. Further, to test our hypothesis that not avoiding assembly gaps in the null model would result in a spurious inflation of statistical significance, we simulated a set of query and reference genomic tracks in a way that nullified any dependence between them. We then contrasted the distributions of test statistics and p-values of Monte Carlo simulation-based permutation tests that either avoided or did not avoid assembly gaps in the null model when testing for significant co-localization between a pair of query and reference tracks. We observed that the statistical tests that did not account for assembly gaps in the null model resulted in a distribution of the test statistic shifted to the right and a distribution of p-values shifted to the left (leading to inflated significance). Conclusion: Our results show that not accounting for assembly gaps in statistical testing of co-localization analysis may lead to false positives and over-optimistic findings.
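To make the mechanism concrete, here is a self-contained toy version of such a permutation test (synthetic coordinates and tracks, not the authors' pipeline). Because real features never fall inside gaps, a null model that scatters features into gaps underestimates the null co-localization statistic and thus inflates significance.

```python
# Toy Monte Carlo co-localization test illustrating the role of assembly
# gaps in the null model. All coordinates are synthetic.
import numpy as np

rng = np.random.default_rng(0)
CHROM_LEN = 1_000_000
gaps = [(200_000, 300_000), (700_000, 750_000)]  # assembly gaps (Ns)

def in_gap(pos):
    return any(s <= pos < e for s, e in gaps)

def sample_positions(n, avoid_gaps):
    out = []
    while len(out) < n:
        p = rng.integers(0, CHROM_LEN)
        if avoid_gaps and in_gap(p):
            continue
        out.append(p)
    return np.array(out)

# Real tracks never intersect gaps (as observed for public tracks).
query = sample_positions(500, avoid_gaps=True)
reference = sample_positions(500, avoid_gaps=True)

def n_close(a, b, tol=1000):
    """Test statistic: query points within tol bp of a reference point."""
    b = np.sort(b)
    idx = np.searchsorted(b, a)
    left = np.abs(a - b[np.clip(idx - 1, 0, len(b) - 1)])
    right = np.abs(a - b[np.clip(idx, 0, len(b) - 1)])
    return int(np.sum(np.minimum(left, right) <= tol))

obs = n_close(query, reference)
for avoid in (False, True):
    null = [n_close(sample_positions(len(query), avoid), reference)
            for _ in range(500)]
    p = (1 + sum(s >= obs for s in null)) / (1 + len(null))
    print(f"null avoids gaps={avoid}: p={p:.3f}")
```

The gap-ignorant null (avoid=False) wastes probability mass inside gaps, so its statistic is systematically lower and the resulting p-value smaller, exactly the inflation the study reports.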


2020
Vol 19 (8)
pp. 1396-1408
Author(s):
Veit Schwämmle
Christina E. Hagensen
Adelina Rogowska-Wrzesinska
Ole N. Jensen

Statistical testing remains one of the main challenges for high-confidence detection of differentially regulated proteins or peptides in large-scale quantitative proteomics experiments by mass spectrometry. Statistical tests need to be sufficiently robust to deal with experiment-intrinsic data structures and variations, and often also with reduced feature coverage across different biological samples due to ubiquitous missing values. A robust statistical test provides accurate confidence scores for large-scale proteomics results, regardless of instrument platform, experimental protocol, and software tools. However, the multitude of possible combinations of experimental strategies, mass spectrometry techniques, and informatics methods complicates the choice of appropriate statistical approaches. We address this challenge by introducing PolySTest, a user-friendly web service for statistical testing, data browsing, and data visualization. We introduce a new method, Miss test, that simultaneously tests for missingness and feature abundance, thereby complementing common statistical tests by rescuing otherwise discarded data features. We demonstrate that PolySTest with the integrated Miss test achieves higher confidence and higher sensitivity on artificial and experimental proteomics data sets with known ground truth. Application of PolySTest to mass spectrometry-based large-scale proteomics data obtained from differentiating muscle cells resulted in the rescue of 10–20% additional proteins in the identified molecular networks relevant to muscle differentiation. We conclude that PolySTest is a valuable addition to existing tools and instrument enhancements that improve the coverage and depth of large-scale proteomics experiments. A fully functional demo version of PolySTest and Miss test is available via http://computproteomics.bmb.sdu.dk/Apps/PolySTest.
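As a rough illustration of why testing missingness pays off, the sketch below pairs a standard abundance test with a missingness test on synthetic data. It mimics the general idea behind Miss test; it is not PolySTest's actual algorithm.

```python
# Toy combination of an abundance test and a missingness test, in the
# spirit of (but not identical to) PolySTest's Miss test.
import numpy as np
from scipy import stats

def combined_test(group_a, group_b):
    """P-values for abundance difference and differential missingness.

    group_a, group_b: 1-D arrays of log intensities, np.nan = missing.
    """
    obs_a = group_a[~np.isnan(group_a)]
    obs_b = group_b[~np.isnan(group_b)]

    # Abundance: Welch t-test on observed values (needs >= 2 per group).
    if len(obs_a) >= 2 and len(obs_b) >= 2:
        p_abund = stats.ttest_ind(obs_a, obs_b, equal_var=False).pvalue
    else:
        p_abund = 1.0

    # Missingness: Fisher exact test on missing/observed counts.
    table = [[np.isnan(group_a).sum(), len(obs_a)],
             [np.isnan(group_b).sum(), len(obs_b)]]
    _, p_miss = stats.fisher_exact(table)
    return p_abund, p_miss

# A protein quantified in all controls but missing in all treated
# replicates: the abundance test is powerless, the missingness test is not.
control = np.array([22.1, 21.8, 22.4, 22.0])
treated = np.array([np.nan, np.nan, np.nan, np.nan])
print(combined_test(control, treated))  # (1.0, ~0.03)
```

A feature like this would be discarded by a t-test alone; scoring its missingness pattern rescues it, which is the behaviour the 10–20% rescue figure above reflects.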


2018
Vol 115 (9)
pp. 2252-2257
Author(s):
Justin D. Finkle
Jia J. Wu
Neda Bagheri

Accurate inference of regulatory networks from experimental data facilitates the rapid characterization and understanding of biological systems. High-throughput technologies can provide a wealth of time-series data to better interrogate the complex regulatory dynamics inherent to organisms, but many network inference strategies do not effectively use temporal information. We address this limitation by introducing Sliding Window Inference for Network Generation (SWING), a generalized framework that incorporates multivariate Granger causality to infer network structure from time-series data. SWING moves beyond existing Granger methods by generating windowed models that simultaneously evaluate multiple upstream regulators at several potential time delays. We demonstrate that SWING elucidates network structure with greater accuracy in both in silico and experimentally validated in vitro systems. We estimate the apparent time delays present in each system and demonstrate that SWING infers time-delayed, gene–gene interactions that are distinct from baseline methods. By providing a temporal framework to infer the underlying directed network topology, SWING generates testable hypotheses for gene–gene influences.
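The core idea, regressing a target on candidate regulators at several lags within sliding windows, can be sketched compactly. The following is an illustration under simplifying assumptions (ridge regression, absolute coefficients as edge scores), not the published SWING implementation.

```python
# Sketch of windowed multivariate Granger scoring in the spirit of SWING:
# within each sliding window, regress a target gene on all genes at
# several lags and accumulate each (regulator, lag) weight.
import numpy as np
from sklearn.linear_model import Ridge

def windowed_granger_scores(X, target_idx, window=20, max_lag=3, alpha=1.0):
    """X: (T, n_genes) time series. Returns (n_genes, max_lag) scores."""
    T, n = X.shape
    scores = np.zeros((n, max_lag))
    n_windows = 0
    for start in range(T - window - max_lag + 1):
        seg = X[start:start + window + max_lag]
        # Design matrix: every gene at lags 1..max_lag within the window.
        feats = np.hstack([seg[max_lag - lag: max_lag - lag + window]
                           for lag in range(1, max_lag + 1)])
        y = seg[max_lag: max_lag + window, target_idx]
        model = Ridge(alpha=alpha).fit(feats, y)
        scores += np.abs(model.coef_).reshape(max_lag, n).T  # (gene, lag)
        n_windows += 1
    return scores / max(n_windows, 1)

# Toy network: gene 0 drives gene 1 with a 2-step delay.
rng = np.random.default_rng(2)
T = 200
X = rng.normal(size=(T, 3))
for t in range(2, T):
    X[t, 1] += 0.8 * X[t - 2, 0]

scores = windowed_granger_scores(X, target_idx=1)
print(np.round(scores, 2))  # largest weight: regulator 0 at lag 2
```

Scanning several lags per window is what lets this family of methods recover the apparent time delay of an interaction rather than assuming a fixed one-step influence.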


2014
Vol 11 (91)
pp. 20130585
Author(s):
Bernard Cazelles
Kévin Cazelles
Mario Chavez

Wavelet analysis is now frequently used to extract information from ecological and epidemiological time series. Statistical hypothesis tests are conducted on associated wavelet quantities to assess the likelihood that they are due to a random process. Such random processes represent null models and are generally based on synthetic data that share some statistical characteristics with the original time series, which allows null statistics to be compared with those obtained from the original series. When creating synthetic datasets, different resampling techniques result in different characteristics shared by the synthetic time series, so it is crucial to consider the impact of the resampling method on the results. We have addressed this point by comparing seven different statistical testing methods applied to different real and simulated datasets. Our results show that the statistical assessment of periodic patterns is strongly affected by the choice of resampling method: two different resampling techniques can lead to two different conclusions about the same time series. Moreover, our results clearly show the inadequacy of resampling with series generated by white noise and red noise, even though these are the methods currently used in the vast majority of wavelet applications. Our results highlight that the characteristics of a time series, namely its Fourier spectrum and autocorrelation, are important to consider when choosing a resampling technique, and they suggest that data-driven resampling methods, such as the hidden Markov model algorithm and the ‘beta-surrogate’ method, should be used.
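The sensitivity to the null model is easy to reproduce. The sketch below uses a periodogram peak instead of a full wavelet quantity for brevity: an AR(1) "red" series containing no genuine periodicity is declared significant against white-noise surrogates but not against surrogates matched to its autocorrelation.

```python
# How the choice of surrogate null model changes assessed significance.
# Statistic: periodogram peak (a stand-in for a wavelet power peak).
import numpy as np

rng = np.random.default_rng(3)
T = 512

# AR(1) "red" series with NO genuine periodic component.
phi = 0.9
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal()

def peak_power(series):
    f = np.fft.rfft(series - series.mean())
    return np.max(np.abs(f[1:]) ** 2)

obs = peak_power(x)

def white_surrogate():
    return rng.normal(size=T) * x.std()

def red_surrogate():
    # AR(1) surrogate with coefficient estimated from the data.
    phi_hat = np.corrcoef(x[:-1], x[1:])[0, 1]
    sigma = np.sqrt(np.var(x) * (1 - phi_hat ** 2))
    s = np.zeros(T)
    for t in range(1, T):
        s[t] = phi_hat * s[t - 1] + rng.normal(scale=sigma)
    return s

for name, gen in [("white noise", white_surrogate),
                  ("red noise", red_surrogate)]:
    null = [peak_power(gen()) for _ in range(500)]
    p = (1 + sum(v >= obs for v in null)) / (1 + len(null))
    print(f"{name} null: p = {p:.3f}")
```

The same series yields opposite conclusions under the two nulls, which is exactly the pitfall the study documents; the data-driven surrogates it recommends push this matching of series characteristics further still.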


Author(s):
Andreas Buja
Dianne Cook
Heike Hofmann
Michael Lawrence
Eun-Kyung Lee
...

We propose to furnish visual statistical methods with an inferential framework and protocol, modelled on confirmatory statistical testing. In this framework, plots take on the role of test statistics, and human cognition the role of statistical tests. Statistical significance of ‘discoveries’ is measured by having the human viewer compare the plot of the real dataset with collections of plots of simulated datasets. A simple but rigorous protocol that provides inferential validity is modelled after the ‘lineup’ popular from criminal legal procedures. Another protocol modelled after the ‘Rorschach’ inkblot test, well known from (pop-)psychology, will help analysts acclimatize to random variability before being exposed to the plot of the real data. The proposed protocols will be useful for exploratory data analysis, with reference datasets simulated by using a null assumption that structure is absent. The framework is also useful for model diagnostics in which case reference datasets are simulated from the model in question. This latter point follows up on previous proposals. Adopting the protocols will mean an adjustment in working procedures for data analysts, adding more rigour, and teachers might find that incorporating these protocols into the curriculum improves their students’ statistical thinking.
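A minimal version of the lineup protocol is easy to script (synthetic data; the panel count and layout are arbitrary choices): the real plot is hidden among null plots generated under the no-structure assumption, and picking it out corresponds to a significance level of roughly one over the number of panels.

```python
# Sketch of the 'lineup' protocol: the viewer must pick the real plot out
# of a grid of null plots. Correctly identifying it is significant at
# about 1 / n_panels.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
n, n_panels = 80, 20

# "Real" data with a weak linear trend.
x = rng.uniform(0, 1, n)
y = 0.5 * x + rng.normal(scale=0.3, size=n)

real_slot = rng.integers(n_panels)  # kept secret from the viewer
fig, axes = plt.subplots(4, 5, figsize=(10, 8), sharex=True, sharey=True)
for i, ax in enumerate(axes.flat):
    if i == real_slot:
        ax.scatter(x, y, s=8)
    else:
        # Null panel: permuting y destroys any x-y association.
        ax.scatter(x, rng.permutation(y), s=8)
    ax.set_title(str(i + 1), fontsize=8)
plt.tight_layout()
plt.show()
# If the viewer singles out panel real_slot + 1, the association is
# significant at about 1/20 = 0.05.
```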


2019
Vol 125
pp. 357-363
Author(s):
Zhihong Zhang
Genzhou Zhang
Zhonghao Zhang
Guo Chen
Yangbin Zeng
...

2016
Author(s):
Christopher Frederik Blum
Nadia Heramvand
Armin S. Khonsari
Markus Kollmann

Generating a comprehensive map of molecular interactions in living cells is difficult, and great efforts are undertaken to infer molecular interactions from large-scale perturbation experiments. Here, we develop the analytical and numerical tools to quantify the fundamental limits for inferring transcriptional networks from gene knockout screens, and we introduce a network inference method that is unbiased and scalable to large network sizes. We show that it is possible to infer gene regulatory interactions with high statistical significance, even if prior knowledge about potential regulators is absent.
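For context, a common baseline for this setting (not the method introduced in the paper) scores an edge i → j by how far gene j's expression under the knockout of gene i deviates from wild-type replicate variability:

```python
# Z-score baseline for knockout screens: call edge i -> j when deleting
# gene i shifts gene j beyond what replicate noise explains. Synthetic
# data; not the paper's inference method.
import numpy as np

rng = np.random.default_rng(5)
n_genes, n_reps = 20, 6

def expression(knockout=None):
    """Expression profile with gene `knockout` deleted (None = wild type)."""
    base = rng.normal(loc=10.0, scale=0.2, size=n_genes)
    if knockout != 0:
        base[1] -= 3.0   # gene 0 present: it represses gene 1
    if knockout == 2:
        base[3] -= 3.0   # gene 2 absent: gene 3 loses its activation
    if knockout is not None:
        base[knockout] = 0.0
    return base

wildtype = np.stack([expression() for _ in range(n_reps)])
mu = wildtype.mean(axis=0)
sd = wildtype.std(axis=0, ddof=1)

for ko in range(n_genes):
    z = (expression(knockout=ko) - mu) / sd
    z[ko] = 0.0          # ignore the knocked-out gene itself
    for j in np.flatnonzero(np.abs(z) > 5.0):
        print(f"edge {ko} -> {j} (z = {z[j]:.1f})")
```

This recovers the two planted edges (0 → 1 and 2 → 3) but, as the paper's limit analysis makes precise, such direct-effect scoring degrades quickly with noise, indirect effects, and network size.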


2019
Vol 9 (1)
Author(s):
John Palowitch

In scientific problems involving systems that can be modeled as a network (or "graph"), it is often of interest to find network communities, strongly connected node subsets, for unsupervised learning, feature discovery, anomaly detection, or scientific study. The vast majority of community detection methods proceed via optimization of a quality function, which is possible even on random networks without communities. Therefore, there is usually no easy way to tell whether a community is "significant", meaning in this context more internally connected than would be expected under a random graph model without communities. This paper generalizes existing null models and statistical tests for this purpose to bipartite graphs and introduces a new significance-scoring algorithm, Fast Optimized Community Significance (FOCS), that is highly scalable and agnostic to the type of graph. Compared with existing methods on unipartite graphs, FOCS is more numerically stable and better balances the trade-off between detection power and false positives. On a large-scale bipartite graph derived from the Internet Movie Database (IMDB), the significance scores provided by FOCS correlate strongly with meaningful actor/director collaborations on serial cinematic projects.
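The kind of null comparison such significance scores build on can be sketched in a few lines. This is a minimal unipartite version with a crude Poisson-like variance approximation, not the FOCS algorithm: compare a node set's internal edge count with its expectation under a configuration-model null.

```python
# Minimal community-significance score: internal edges of a node set vs.
# their expectation under a configuration-model null. Not FOCS itself.
import networkx as nx
from scipy import stats

def community_z_score(G, nodes):
    nodes = set(nodes)
    m = G.number_of_edges()
    internal = sum(1 for u, v in G.edges(nodes)
                   if u in nodes and v in nodes)
    # Configuration model: P(u ~ v) ~ deg(u) * deg(v) / (2m), so the
    # expected internal edge count over pairs u < v in the set is:
    degs = [G.degree(n) for n in nodes]
    expected = (sum(degs) ** 2 - sum(d * d for d in degs)) / (4 * m)
    var = max(expected, 1e-12)          # Poisson-like approximation
    z = (internal - expected) / var ** 0.5
    return z, stats.norm.sf(z)          # one-sided p-value

# A planted clique inside an Erdos-Renyi background:
G = nx.gnp_random_graph(200, 0.02, seed=6)
clique = list(range(10))
G.add_edges_from((u, v) for u in clique for v in clique if u < v)

print(community_z_score(G, clique))         # large z, tiny p
print(community_z_score(G, range(50, 60)))  # z near 0: not significant
```

The second call shows why such a test matters: an arbitrary node set in a random graph scores near zero, whereas a quality-function optimizer would still happily return "communities" there.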


Author(s):
Ross P. Anderson
Maurizio Porfiri

Information-theoretical notions of causality provide a model-free approach to identifying the magnitude and direction of influence among subcomponents of a stochastic dynamical system. In addition to detecting causal influences, any effective test should also report the level of statistical significance of the finding. Here, we focus on transfer entropy, which has recently been considered for causality detection in a variety of fields, based on statistical significance tests that are valid only in the asymptotic regime, that is, with enormous amounts of data. In the interest of applications with limited available data, we develop a non-asymptotic theory for the probability distribution of the difference between the empirically estimated transfer entropy and the true transfer entropy. Based on this result, we additionally demonstrate an approach to statistical hypothesis testing for directed information flow in dynamical systems with a given number of observed time steps.
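The non-asymptotic distributional results themselves do not reduce to a code snippet, but the object they describe does: the plug-in transfer entropy estimate computed from finitely many observed time steps. Below is a minimal estimator for binary series with a time-shift surrogate test; the series, history length, and surrogate scheme are illustrative choices, not the paper's construction.

```python
# Plug-in transfer entropy for binary time series, with a surrogate-based
# significance test. This is the standard empirical estimator whose
# finite-sample behaviour the paper analyses.
import numpy as np

def transfer_entropy(source, target):
    """TE(source -> target) with history length 1, in bits."""
    x, y = np.asarray(source), np.asarray(target)
    trip = np.stack([y[1:], y[:-1], x[:-1]], axis=1)  # (y_t, y_t-1, x_t-1)
    te = 0.0
    for state in {tuple(r) for r in trip}:
        yt, yp, _ = state
        p_full = np.mean((trip == state).all(axis=1))
        p_hist = np.mean((trip[:, 1] == state[1]) & (trip[:, 2] == state[2]))
        p_y_given_yp = (np.mean((trip[:, 0] == yt) & (trip[:, 1] == yp))
                        / np.mean(trip[:, 1] == yp))
        te += p_full * np.log2(p_full / (p_hist * p_y_given_yp))
    return te

rng = np.random.default_rng(7)
T = 2000
x = rng.integers(0, 2, T)
y = np.roll(x, 1) ^ (rng.random(T) < 0.1)  # y copies x with 10% noise

obs = transfer_entropy(x, y)
# Surrogates: circularly shifting the source destroys directed coupling
# while preserving each series' own statistics.
null = [transfer_entropy(np.roll(x, rng.integers(20, T - 20)), y)
        for _ in range(200)]
p = (1 + sum(v >= obs for v in null)) / (1 + len(null))
print(f"TE = {obs:.3f} bits, p = {p:.4f}")
```

With only a few hundred samples the null distribution of this estimator is noticeably biased away from zero, which is precisely why a finite-sample theory, rather than an asymptotic one, is needed to calibrate such tests.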

