Understanding the Variability in Graph Data Sets through Statistical Modeling on the Stiefel Manifold

Entropy ◽  
2021 ◽  
Vol 23 (4) ◽  
pp. 490
Author(s):  
Clément Mantoux ◽  
Baptiste Couvy-Duchesne ◽  
Federica Cacciamani ◽  
Stéphane Epelbaum ◽  
Stanley Durrleman ◽  
...  

Network analysis provides a rich framework for modeling complex phenomena such as human brain connectivity, and it has proven effective for understanding their natural properties and for designing predictive models. In this paper, we study the variability within groups of networks, i.e., the structure of connection similarities and differences across a set of networks. We propose a statistical framework to model these variations based on manifold-valued latent factors. Each network adjacency matrix is decomposed as a weighted sum of rank-one matrix patterns, and each pattern is described as a random perturbation of a dictionary element. As a hierarchical statistical model, it enables the analysis of heterogeneous populations of adjacency matrices using mixtures. Our framework can also be used to infer the weights of missing edges. We estimate the parameters of the model using an Expectation-Maximization-based algorithm. In experiments on synthetic data, we show that the algorithm accurately estimates the latent structure in both low and high dimensions. We apply our model to a large data set of functional brain connectivity matrices from the UK Biobank. Our results suggest that the proposed model accurately describes the complex variability in the data set with a small number of degrees of freedom.
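The core decomposition idea can be sketched in a few lines of numpy: an adjacency matrix built as a weighted sum of rank-one patterns, with the patterns forming orthonormal columns (a point on a Stiefel manifold). This is a minimal illustration of the representation, not the paper's EM algorithm or noise model; all dimensions and weights below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionary: K orthonormal column patterns, i.e. a point on
# the Stiefel manifold (as assumed by the paper's latent model).
n, K = 6, 2
X, _ = np.linalg.qr(rng.standard_normal((n, K)))  # X.T @ X = I_K

# Build an adjacency-like matrix as a weighted sum of rank-one patterns.
weights = np.array([3.0, 1.5])
A = sum(w * np.outer(X[:, k], X[:, k]) for k, w in enumerate(weights))

# Because the patterns are orthonormal, each weight can be recovered by
# projecting A back onto its pattern.
recovered = np.array([X[:, k] @ A @ X[:, k] for k in range(K)])
print(np.allclose(recovered, weights))  # True
```

The orthonormality of the dictionary is what makes each weight identifiable by a simple projection; the paper's hierarchical model adds random perturbations of the dictionary element on top of this structure.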

Geophysics ◽  
2020 ◽  
Vol 85 (6) ◽  
pp. G129-G141
Author(s):  
Diego Takahashi ◽  
Vanderlei C. Oliveira Jr. ◽  
Valéria C. F. Barbosa

We have developed an efficient and very fast equivalent-layer technique for gravity data processing by modifying an iterative method grounded on an excess-mass constraint that does not require the solution of linear systems. Taking advantage of the symmetric block-Toeplitz Toeplitz-block (BTTB) structure of the sensitivity matrix that arises when regular grids of observation points and equivalent sources (point masses) are used to set up a fictitious equivalent layer, we develop an algorithm that greatly reduces the computational complexity and RAM necessary to estimate a 2D mass distribution over the equivalent layer. The symmetric BTTB matrix is fully determined by the first column of the sensitivity matrix, which, in turn, can be embedded into a symmetric block-circulant with circulant-block (BCCB) matrix. Likewise, only the first column of the BCCB matrix is needed to reconstruct the full sensitivity matrix. From the first column of the BCCB matrix, its eigenvalues can be calculated using the 2D fast Fourier transform (2D FFT), which can then be used to readily compute the matrix-vector product of the forward modeling in the fast equivalent-layer technique. As a result, our method is efficient for processing very large data sets. Tests with synthetic data demonstrate the ability of our method to satisfactorily upward- and downward-continue gravity data. Our results show very small border effects and noise amplification compared with those produced by the classic approach in the Fourier domain. In addition, they show that, whereas the running time of our method is [Formula: see text] s for processing [Formula: see text] observations, the fast equivalent-layer technique used [Formula: see text] s with [Formula: see text]. A test with field data from the Carajás Province, Brazil, illustrates the low computational cost of our method to process a large data set composed of [Formula: see text] observations.
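The Toeplitz-to-circulant embedding behind this speedup is easiest to see in 1D (the paper's 2D BTTB/BCCB case applies the same idea blockwise): a symmetric Toeplitz matrix embeds in a circulant of twice the size, whose eigenvalues are the FFT of its first column, so a matrix-vector product costs O(n log n) instead of O(n²). A minimal sketch with arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
c = rng.standard_normal(n)   # first column of a symmetric Toeplitz matrix
x = rng.standard_normal(n)

# Dense reference: T[i, j] = c[|i - j|]
idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
T = c[idx]
y_dense = T @ x

# FFT route: embed T in a 2n x 2n circulant, whose eigenvalues are the
# FFT of its first column; the product then needs only FFTs.
col = np.concatenate([c, [0.0], c[-1:0:-1]])   # circulant first column
x_pad = np.concatenate([x, np.zeros(n)])
y_fft = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad)).real[:n]

print(np.allclose(y_dense, y_fft))  # True
```

Only the first column `c` is ever stored, which is exactly why the method's memory footprint stays small even for very large grids.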


2021 ◽  
Vol 25 (4) ◽  
pp. 2223-2237
Author(s):  
William Rust ◽  
Mark Cuthbert ◽  
John Bloomfield ◽  
Ron Corstanje ◽  
Nicholas Howden ◽  
...  

Abstract. An understanding of multi-annual behaviour in streamflow allows for better estimation of the risks associated with hydrological extremes. This can enable improved preparedness for streamflow-dependent services, such as freshwater ecology, drinking water supply and agriculture. Recently, efforts have focused on detecting relationships between long-term hydrological behaviour and oscillatory climate systems (such as the North Atlantic Oscillation – NAO). For instance, the approximately 7-year periodicity of the NAO has been detected in groundwater-level records in the North Atlantic region, offering potential improvements in preparedness for future water resource extremes owing to their repetitive, periodic nature. However, the extent to which these 7-year, NAO-like signals are propagated to streamflow, and the catchment processes that modulate this propagation, are currently unknown. Here, we show statistically significant evidence that these 7-year periodicities are present in streamflow (and associated catchment rainfall) by applying multi-resolution analysis to a large data set of streamflow and associated catchment rainfall across the UK. Our results provide new evidence for spatial patterns of NAO periodicities in UK rainfall, with the areas of greatest NAO signal found in southwest England, south Wales, Northern Ireland and central Scotland, and show that NAO-like periodicities account for a greater proportion of streamflow variability in these areas. Furthermore, we find that catchments with a greater subsurface pathway contribution, as characterised by the baseflow index (BFI), generally show increased NAO-like signal strength, and that subsurface response times (as characterised by the groundwater response time – GRT) of between 4 and 8 years show a greater signal presence.
Our results provide a foundation for screening and using streamflow teleconnections to improve the practice and policy of long-term streamflow resource management.
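As a much simpler stand-in for the wavelet-based multi-resolution analysis used in the paper, a plain periodogram already illustrates how a multi-annual periodicity can be pulled out of a noisy monthly record. The synthetic "streamflow" below, its 7-year cycle, and the noise level are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic monthly "streamflow": a 7-year (NAO-like) cycle plus noise.
# Illustrative periodogram only; the paper uses wavelet-based
# multi-resolution analysis on real UK records.
months = np.arange(50 * 12)                      # 50 years, monthly
signal = np.sin(2 * np.pi * months / (7 * 12))   # 7-year period
flow = signal + 0.5 * rng.standard_normal(months.size)

freqs = np.fft.rfftfreq(months.size, d=1.0)      # cycles per month
power = np.abs(np.fft.rfft(flow - flow.mean())) ** 2
peak_period_years = 1.0 / freqs[np.argmax(power[1:]) + 1] / 12
print(round(peak_period_years, 1))   # close to the injected 7-year cycle
```

A wavelet decomposition goes further by localising such periodicities in time, which matters when the teleconnection strength varies between decades.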


2015 ◽  
Vol 42 (9) ◽  
pp. 622-633 ◽  
Author(s):  
Mathieu Dubé ◽  
Benoit Turcotte ◽  
Brian Morse

The development of ice dams in steep channels dictates water level variations and influences flow rates and habitat conditions. Despite the dominance of ice dam development in cold region gravel bed channels, practicing engineers and scientists have access to very little quantitative information describing this complex freezeup process. This paper aims to fill this gap by presenting a large data set on the process. The substantial variations observed in formation and melting rates from one site to the next, and from one year to the next at the same site, are explained with a physically based numerical model that includes a complete heat budget applied to a single step-pool sequence. The model successfully simulates the entire development of an ice dam and shows that the process depends on multiple parameters, or degrees of freedom. It also reveals that morphological characteristics greatly influence ice dam dynamics.
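One term of the kind of heat budget such a model resolves is conductive ice growth, which on its own reduces to the classical Stefan problem. The sketch below time-steps only that single term with textbook constants; the paper's model additionally handles step-pool hydraulics, melt, and the other heat fluxes, so this is an illustration of the balance, not their model.

```python
# Minimal conductive ice-growth sketch (Stefan problem): thickness grows
# as latent heat released at the ice bottom is conducted through the
# cover. One term of a full freeze-up heat budget only.
k_ice = 2.2      # W/(m K), thermal conductivity of ice
rho_ice = 917.0  # kg/m^3, ice density
L_f = 334e3      # J/kg, latent heat of fusion
dT = 10.0        # K, temperature drop across the ice cover (assumed)
dt = 3600.0      # s, one-hour time step

h = 0.01         # m, initial ice thickness
for _ in range(24 * 7):                  # one week of steady freezing
    flux = k_ice * dT / h                # conductive heat flux, W/m^2
    h += flux * dt / (rho_ice * L_f)     # thickness gained this step
print(round(h, 2))                       # roughly 0.3 m after a week
```

Because growth slows as the cover thickens (flux scales with 1/h), thickness follows a square-root-of-time curve, one reason observed formation rates vary so strongly between sites and years once other fluxes enter the budget.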


2018 ◽  
Vol 7 (3) ◽  
pp. 67-97 ◽  
Author(s):  
Gillian M Raab ◽  
Beata Nowok ◽  
Chris Dibben

We describe results on the creation and use of synthetic data that were derived in the context of a project to make synthetic extracts available to users of the UK Longitudinal Studies. A critical review of existing methods of inference from large synthetic data sets is presented. We introduce new variance estimates for use with large samples of completely synthesised data; they do not require the synthetic data to be generated from the posterior predictive distribution derived from the observed data, and they can be used with a single synthetic data set. We make recommendations on how to synthesise data based on these results. The practical consequences of these results are illustrated with an example from the Scottish Longitudinal Study.
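The setting can be made concrete with a toy "completely synthesised" data set: fit a simple parametric model to the observed sample, then release a fresh sample drawn from the fitted model. The distribution and sizes below are invented, and the sketch illustrates only the setting, not the paper's variance estimators.

```python
import random
import statistics

random.seed(0)

# Observed (confidential) sample, here drawn from a known normal just to
# have something to fit.
observed = [random.gauss(50.0, 10.0) for _ in range(1000)]
mu_hat = statistics.fmean(observed)
sd_hat = statistics.stdev(observed)

# "Completely synthesised" release: a fresh sample from the fitted model,
# so no observed record appears in the released data.
synthetic = [random.gauss(mu_hat, sd_hat) for _ in range(1000)]

# An analyst sees only the synthetic data; their estimate carries extra
# variance from the synthesis draw on top of ordinary sampling error,
# which is what synthetic-data variance estimators must account for.
print(round(statistics.fmean(synthetic), 1), round(mu_hat, 1))
```

The point of single-data-set variance estimators is precisely that the analyst can quantify that extra synthesis variance without needing multiple released copies.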


Geophysics ◽  
2009 ◽  
Vol 74 (1) ◽  
pp. E75-E91 ◽  
Author(s):  
Gong Li Wang ◽  
Carlos Torres-Verdín ◽  
Jesús M. Salazar ◽  
Benjamin Voss

In addition to reliability and stability, the efficiency and expediency of inversion methods have long been a strong concern for their routine applications by well-log interpreters. We have developed and successfully validated a new inversion method to estimate 2D parametric spatial distributions of electrical resistivity from array-induction measurements acquired in a vertical well. The central component of the method is an efficient approximation to Fréchet derivatives in which both the incident and adjoint fields are precomputed and kept unchanged during inversion. To further enhance the overall efficiency of the inversion, we combined the new approximation with both the improved numerical mode-matching method and domain decomposition. Examples of application with synthetic data sets show that the new method is computationally efficient and capable of retrieving the original model resistivities even in the presence of noise, performing equally well at both high and low contrasts of formation resistivity. In thin resistive beds, the new inversion method estimates more accurate resistivities than standard commercial deconvolution software. We also considered examples of application with field data sets that confirm the new method can successfully process a large data set that includes 200 beds in approximately [Formula: see text] of CPU time on a desktop computer. In addition to 2D parametric spatial distributions of electrical resistivity, the new inversion method provides a qualitative indicator of the uncertainty of estimated parameters based on the estimator’s covariance matrix. The uncertainty estimator provides a qualitative measure of the nonuniqueness of estimated resistivity parameters when the data misfit lies within the measurement error (noise).
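The efficiency idea of precomputing and freezing the Fréchet derivatives can be shown in miniature: a Gauss-Newton-style iteration whose Jacobian is computed once at the start model and never updated. The toy forward model `f(m) = exp(G m)` below merely stands in for the induction response; everything here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy nonlinear forward model f(m) = exp(G m) with a known answer.
G = rng.standard_normal((20, 3))
m_true = np.array([0.1, -0.05, 0.08])
data = np.exp(G @ m_true)

m = np.zeros(3)
J = np.exp(G @ m)[:, None] * G   # Frechet derivative at the start model
for _ in range(200):             # Gauss-Newton with the FROZEN Jacobian
    r = data - np.exp(G @ m)     # only the forward model is re-run
    m += np.linalg.lstsq(J, r, rcond=None)[0]
print(np.allclose(m, m_true, atol=1e-4))  # True
```

Each iteration costs one forward evaluation plus a small least-squares solve, rather than a full Jacobian rebuild; the trade-off is slower (but often still reliable) convergence when the true sensitivities drift far from those at the start model.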


Geophysics ◽  
2019 ◽  
Vol 84 (3) ◽  
pp. Q13-Q25 ◽  
Author(s):  
Michał Chamarczuk ◽  
Michał Malinowski ◽  
Yohei Nishitsuji ◽  
Jan Thorbecke ◽  
Emilia Koivisto ◽  
...  

The main issues related to passive-source reflection imaging with seismic interferometry (SI) are inadequate acquisition parameters for sufficient spatial wavefield sampling and the vulnerability of surface arrays to the dominant influence of omnipresent surface-wave sources. Additionally, long recordings provide large data volumes that require robust and efficient processing methods. We address these problems by developing a two-step wavefield evaluation and event detection (TWEED) method for body waves in recorded ambient noise. TWEED evaluates the spatiotemporal characteristics of noise recordings by simultaneous analysis of adjacent receiver lines. We test our method on synthetic data representing transient ambient-noise sources at the surface and in the deeper subsurface, and discriminate between basic types of seismic events by using three adjacent receiver lines. Subsequently, we apply TWEED to 600 h of ambient noise acquired with an approximately 1000-receiver array deployed over an active underground mine in Eastern Finland. We calibrate the detection of body-wave events related to mine blasts and other routine mining activities using a representative 1 h noise panel. Using TWEED, we successfully detect 1093 body-wave events in the full data set. To increase the computational efficiency, we use slowness parameters derived from the first step of TWEED as input to a support vector machine (SVM) algorithm. Using this approach, we detect 94% of the TWEED-evaluated body-wave events, indicating that the illumination analysis could be limited to a single step, increasing time efficiency at the price of a lower detection rate. Alternatively, TWEED applied to a small volume of the recorded data, followed by SVM on the rest, could be used for quick and robust (real-time) scanning for body-wave energy in large data volumes, for subsequent application of SI to retrieve reflections.
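The slowness measurement that feeds the classification step can be sketched simply: fit arrival time against receiver offset along one line, and use the apparent slowness (the moveout slope) to separate fast body waves from slow surface waves. The velocities and the threshold below are illustrative assumptions, not the paper's values or its actual TWEED/SVM pipeline.

```python
import numpy as np

# Receiver line with 50 m spacing (assumed geometry).
offsets = np.arange(0.0, 1000.0, 50.0)   # receiver positions, m

def apparent_slowness(arrivals, offsets):
    # Least-squares slope of the arrival-time moveout, in s/m.
    slope, _ = np.polyfit(offsets, arrivals, 1)
    return slope

body = offsets / 5000.0      # P wave at 5000 m/s (assumed)
surface = offsets / 300.0    # surface wave at 300 m/s (assumed)

threshold = 1.0 / 1000.0     # 1000 m/s cutoff (assumed)
for name, t in [("body", body), ("surface", surface)]:
    kind = "body" if apparent_slowness(t, offsets) < threshold else "surface"
    print(name, "->", kind)
```

In the paper this kind of slowness feature, evaluated across adjacent receiver lines, is what the SVM consumes to reproduce the TWEED classification at much lower cost.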


Author(s):  
Erik M Volz ◽  
Verity Hill ◽  
John T McCrone ◽  
Anna Price ◽  
David Jorgensen ◽  
...  

In February 2020 a substitution at the interface between SARS-CoV-2 Spike protein subunits, Spike D614G, was observed in public databases. The Spike 614G variant subsequently increased in frequency in many locations throughout the world. Global patterns of dispersal of Spike 614G are suggestive of a selective advantage of this variant; however, the origin of Spike 614G is associated with early colonization events in Europe and subsequent radiations to the rest of the world, so its increasing frequency may instead be due to a random founder effect. We investigate the hypothesis of positive selection of Spike 614G at the level of an individual country, the United Kingdom, using more than 25,000 whole-genome SARS-CoV-2 sequences collected by the COVID-19 Genomics UK Consortium. Using phylogenetic analysis, we identify Spike 614G and 614D clades with unique origins in the UK, and from these we extrapolate and compare growth rates of co-circulating transmission clusters. We find that Spike 614G clusters are introduced into the UK later on average than 614D clusters and grow to a larger size after adjusting for time of introduction. Phylodynamic analysis does not show a significant increase in growth rates for clusters with the 614G variant, but population genetic modelling indicates that 614G increases in frequency relative to 614D in a manner consistent with a selective advantage. We also investigate the potential influence of Spike 614D versus G on virulence by matching a subset of records to clinical data on patient outcomes. We do not find any indication that patients infected with the Spike 614G variant have higher COVID-19 mortality, but younger patients have slightly increased odds of 614G carriage.
Despite the availability of a very large data set well represented by both Spike 614 variants, not all approaches showed a conclusive signal of a higher transmission rate for 614G; however, significant differences in the growth, size, and composition of these lineages indicate a need for continued study.
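The population-genetic reasoning here rests on a standard result: under a constant selective advantage s, the odds of the variant grow exponentially, so its frequency follows a logistic curve. The sketch below uses that textbook model with invented numbers (it is not the paper's fitted model or its estimates).

```python
import math

def variant_frequency(p0, s, t):
    """Frequency of a variant with selection coefficient s after time t,
    starting from frequency p0: odds grow as exp(s * t)."""
    odds = p0 / (1.0 - p0) * math.exp(s * t)
    return odds / (1.0 + odds)

p0 = 0.10   # initial variant frequency (illustrative)
s = 0.05    # selective advantage per day (illustrative)
for day in (0, 30, 60, 90):
    print(day, round(variant_frequency(p0, s, day), 2))
# frequency rises from 0.10 toward fixation over ~3 months
```

Fitting s from an observed frequency trajectory is how "consistent with a selective advantage" is distinguished from the flat trajectory a pure founder effect would predict.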


2016 ◽  
Vol 37 (2) ◽  
pp. 105-111 ◽  
Author(s):  
Adrian Furnham ◽  
Helen Cheng

Abstract. This study used a longitudinal data set of 5,672 adults followed for 50 years to determine the factors that influence adult trait Openness-to-Experience. In a large, nationally representative sample in the UK (the National Child Development Study), data were collected at birth, in childhood (age 11), adolescence (age 16), and adulthood (ages 33, 42, and 50) to examine the effects of family social background, childhood intelligence, school motivation during adolescence, education, and occupation on the personality trait Openness assessed at age 50 years. Structural equation modeling showed that parental social status, childhood intelligence, school motivation, education, and occupation all had modest, but direct, effects on trait Openness, among which childhood intelligence was the strongest predictor. Gender was not significantly associated with trait Openness. Limitations and implications of the study are discussed.


2020 ◽  
Vol 39 (5) ◽  
pp. 6419-6430
Author(s):  
Dusan Marcek

To forecast time series data, two methodological frameworks of statistical and computational intelligence modelling are considered. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with the Maximum Likelihood (ML) estimation method. As a competitive tool to statistical forecasting models, we use a classic perceptron-type neural network (NN). To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are applied to the large data set. A comparative analysis of the selected learning methods is performed and evaluated. Our experiments indicate that a population size of 20 is likely optimal, yielding the lowest training time among the NNs trained by evolutionary algorithms, with a prediction accuracy that is lower but still acceptable to managers.
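The GA-as-trainer idea can be shown with a minimal genetic algorithm searching the weights of a single linear neuron. The data, population size 20 (the value the study found optimal), and the selection/mutation scheme below are a deliberately bare sketch, far simpler than the networks and GA operators actually compared in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression task for a single linear neuron.
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def fitness(w):
    return -np.mean((X @ w - y) ** 2)   # negative MSE: higher is better

pop = rng.standard_normal((20, 3))      # population size 20
for _ in range(200):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]             # keep best half
    children = (parents[rng.integers(0, 10, 10)]
                + 0.1 * rng.standard_normal((10, 3)))   # mutate copies
    pop = np.vstack([parents, children])                # elitist update

best = pop[np.argmax([fitness(w) for w in pop])]
print(np.round(best, 2))   # close to w_true
```

Unlike back-propagation, the GA needs no gradients, which is why such heuristics remain usable when the error surface is non-differentiable or riddled with local minima; the price, as the study notes, is accuracy and training time trade-offs governed by the population size.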


2019 ◽  
Vol 21 (9) ◽  
pp. 662-669 ◽  
Author(s):  
Junnan Zhao ◽  
Lu Zhu ◽  
Weineng Zhou ◽  
Lingfeng Yin ◽  
Yuchen Wang ◽  
...  

Background: Thrombin is the central protease of the vertebrate blood coagulation cascade and is closely related to cardiovascular diseases. The inhibitory constant Ki is the most significant property of thrombin inhibitors. Method: This study was carried out to predict Ki values of thrombin inhibitors based on a large data set by using machine learning methods. Because machine learning can uncover non-intuitive regularities in high-dimensional data sets, it can be used to build effective predictive models. A total of 6554 descriptors for each compound were collected, and an efficient descriptor selection method was chosen to find the appropriate descriptors. Four different methods, including multiple linear regression (MLR), K Nearest Neighbors (KNN), Gradient Boosting Regression Tree (GBRT) and Support Vector Machine (SVM), were implemented to build prediction models with these selected descriptors. Results: The SVM model was the best among these methods, with R2=0.84, MSE=0.55 for the training set and R2=0.83, MSE=0.56 for the test set. Several validation methods, such as the y-randomization test and applicability domain evaluation, were adopted to assess the robustness and generalization ability of the model. The final model shows excellent stability and predictive ability and can be employed for rapid estimation of the inhibitory constant, which is helpful for designing novel thrombin inhibitors.
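The simplest of the four compared models, MLR, amounts to an ordinary least-squares fit of activity values on the selected descriptors, with R² on a held-out split as the headline metric. The sketch below uses mock synthetic descriptors and activities (stand-ins for the paper's 6554-descriptor data set), not real chemistry.

```python
import numpy as np

rng = np.random.default_rng(5)

# Mock descriptor matrix and activity values (synthetic stand-ins).
n_compounds, n_descriptors = 200, 10
X = rng.standard_normal((n_compounds, n_descriptors))
coefs = rng.standard_normal(n_descriptors)
activity = X @ coefs + 0.3 * rng.standard_normal(n_compounds)

# Train/test split, then MLR via least squares with an intercept column.
X_tr, X_te = X[:150], X[150:]
y_tr, y_te = activity[:150], activity[150:]
beta, *_ = np.linalg.lstsq(
    np.column_stack([np.ones(len(X_tr)), X_tr]), y_tr, rcond=None)
pred = np.column_stack([np.ones(len(X_te)), X_te]) @ beta

# Test-set R^2, the figure of merit reported in the abstract.
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 2))
```

The y-randomization check mentioned in the abstract repeats exactly this fit after shuffling the activity values; a model that still scores well on shuffled labels is fitting chance correlations.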

