GEOMAX: beyond linear compression for three-point galaxy clustering statistics

2020 ◽  
Vol 497 (1) ◽  
pp. 776-792 ◽  
Author(s):  
Davide Gualdi ◽  
Héctor Gil-Marín ◽  
Marc Manera ◽  
Benjamin Joachimi ◽  
Ofer Lahav

ABSTRACT We present the GEOMAX algorithm and its Python implementation for a two-step compression of bispectrum measurements. The first step groups bispectra by the geometric properties of their arguments; the second step then maximizes the Fisher information with respect to a chosen set of model parameters in each group. The algorithm only requires the derivatives of the data vector with respect to the parameters and a small number of mock data sets, producing an effective, non-linear compression. By applying GEOMAX to bispectrum monopole measurements from BOSS DR12 CMASS redshift-space galaxy clustering data, we reduce the 68 per cent credible intervals for the inferred parameters (b1, b2, f, σ8) by 50.4, 56.1, 33.2, and 38.3 per cent with respect to standard MCMC on the full data vector. We run the analysis and comparison between compression methods over 100 galaxy mocks to test the statistical significance of the improvements. On average, GEOMAX performs ∼15 per cent better than geometrical or maximal linear compression alone and is consistent with being lossless. Given its flexibility, the GEOMAX approach has the potential to optimally exploit three-point statistics of various cosmological probes, such as weak lensing or line-intensity maps, from current and future cosmological data sets such as DESI, Euclid, PFS, and SKA.
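The maximal linear step can be illustrated with a MOPED-style compression built from the parameter derivatives of the data vector and a mock-estimated covariance. The sketch below is a minimal Python illustration under that reading; all names are illustrative, not the GEOMAX API.

```python
import numpy as np

def moped_weights(deriv, cov):
    """MOPED-style compression vectors, one per parameter, for a single group.

    deriv : (n_params, n_data) derivatives of the mean data vector
    cov   : (n_data, n_data) covariance estimated from mocks
    """
    cinv = np.linalg.inv(cov)
    weights = []
    for mu_i in deriv:
        b = cinv @ mu_i
        # Gram-Schmidt against earlier vectors so each compressed number
        # carries independent Fisher information.
        for b_j in weights:
            b = b - (mu_i @ b_j) * b_j
        weights.append(b / np.sqrt(b @ cov @ b))  # normalize so b^T C b = 1
    return np.array(weights)

rng = np.random.default_rng(0)
mocks = rng.normal(size=(100, 30))       # 100 mocks of a 30-bin bispectrum group
cov = np.cov(mocks, rowvar=False)
deriv = rng.normal(size=(4, 30))         # d(data)/d(b1, b2, f, sigma8)
W = moped_weights(deriv, cov)
compressed = W @ mocks[0]                # 4 numbers instead of 30
```

Applied group by group, this reduces each block of the bispectrum data vector to one compressed statistic per parameter.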

Marketing ZFP ◽  
2019 ◽  
Vol 41 (4) ◽  
pp. 33-42
Author(s):  
Thomas Otter

Empirical research in marketing is often, at least in part, exploratory. The goal of exploratory research, by definition, extends beyond the empirical calibration of parameters in well-established models and includes the empirical assessment of different model specifications. In this context, researchers often rely on the statistical information about parameters in a given model to learn about likely model structures. An example is the search for the 'true' set of covariates in a regression model based on confidence intervals of regression coefficients. The purpose of this paper is to illustrate and compare different measures of statistical information about model parameters in the context of a generalized linear model: classical confidence intervals, bootstrapped confidence intervals, and Bayesian posterior credible intervals from a model that adapts its dimensionality as a function of the information in the data. I find that inference from the adaptive Bayesian model dominates that based on classical and bootstrapped intervals in a given model.
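A minimal sketch of two of the interval types compared, on an illustrative Poisson GLM: the classical asymptotic (Wald) interval and a nonparametric bootstrap percentile interval. Data, model, and replication count are placeholders.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = sm.add_constant(rng.normal(size=(n, 2)))
y = rng.poisson(np.exp(X @ np.array([0.3, 0.5, 0.0])))  # third covariate is pure noise

# Classical Wald interval from the asymptotic covariance of the MLE.
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print("classical 95% CI:\n", fit.conf_int())

# Bootstrap percentile interval: refit the model on resampled rows.
boot = []
for _ in range(999):
    idx = rng.integers(0, n, size=n)
    boot.append(sm.GLM(y[idx], X[idx], family=sm.families.Poisson()).fit().params)
print("bootstrap 95% CI:\n", np.percentile(np.array(boot), [2.5, 97.5], axis=0).T)
```

Whether the interval for the noise coefficient covers zero is exactly the kind of evidence exploratory covariate selection leans on.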


2020 ◽  
Vol 501 (1) ◽  
pp. 994-1001
Author(s):  
Suman Sarkar ◽  
Biswajit Pandey ◽  
Snehasish Bhattacharjee

ABSTRACT We use an information theoretic framework to analyse data from the Galaxy Zoo 2 project and study whether there are any statistically significant correlations between the presence of bars in spiral galaxies and their environment. We measure the mutual information between the barredness of galaxies and their environments in a volume limited sample (Mr ≤ −21) and compare it with the same quantity in data sets where (i) the bar/unbar classifications are randomized and (ii) the spatial distribution of galaxies is shuffled on different length scales. We assess the statistical significance of the differences in the mutual information using a t-test and find that neither randomization of the morphological classifications nor shuffling of the spatial distribution alters the mutual information in a statistically significant way. The non-zero mutual information between barredness and environment arises from the finite and discrete nature of the data set and can be entirely explained by mock Poisson distributions. We also separately compare the cumulative distribution functions of the barred and unbarred galaxies as a function of their local density. Using a Kolmogorov–Smirnov test, we find that the null hypothesis cannot be rejected even at the 75 per cent confidence level. Our analysis indicates that environments do not play a significant role in the formation of a bar, which is largely determined by the internal processes of the host galaxy.
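A minimal sketch of the randomization part of such an analysis: mutual information between binary bar labels and a binned local-density estimate, compared against shuffled labels to build a null distribution. Inputs are placeholders, and the z-score below stands in for the paper's t-test.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
barred = rng.integers(0, 2, size=5000)        # placeholder bar/unbar classifications
log_density = rng.normal(size=5000)           # placeholder environment measure
edges = np.quantile(log_density, np.linspace(0, 1, 11)[1:-1])
env_bins = np.digitize(log_density, edges)    # 10 equal-occupancy density bins

mi_data = mutual_info_score(barred, env_bins)

# Null distribution: shuffling the labels destroys any real correlation, so the
# residual MI reflects only the finite, discrete nature of the sample.
mi_null = np.array([mutual_info_score(rng.permutation(barred), env_bins)
                    for _ in range(1000)])
z = (mi_data - mi_null.mean()) / mi_null.std()
print(f"MI = {mi_data:.4e}, null = {mi_null.mean():.4e}, z = {z:.2f}")
```

A data MI consistent with the shuffled null is the signature the authors report: non-zero mutual information that is entirely a finite-sample effect.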


Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 1850
Author(s):  
Rashad A. R. Bantan ◽  
Farrukh Jamal ◽  
Christophe Chesneau ◽  
Mohammed Elgarhy

Unit distributions are commonly used in probability and statistics to describe useful quantities with values between 0 and 1, such as proportions, probabilities, and percentages. Some unit distributions are defined in a natural analytical manner, while others are derived by transforming an existing distribution defined on a larger domain. In this article, we introduce the unit gamma/Gompertz distribution, founded on the inverse-exponential scheme and the gamma/Gompertz distribution. The gamma/Gompertz distribution is known to be a very flexible three-parameter lifetime distribution, and we aim to transpose this flexibility to the unit interval. First, we check this aspect through the analytical behavior of the primary functions. It is shown that the probability density function can be increasing, decreasing, “increasing-decreasing”, or “decreasing-increasing”, with pliant asymmetric properties. On the other hand, the hazard rate function can have monotonically increasing, decreasing, or constant shapes. We complete the theoretical part with some propositions on stochastic ordering, moments, quantiles, and the reliability coefficient. Practically, the maximum likelihood method is used to estimate the model parameters from unit data. We present some simulation results to evaluate this method. Two applications using real data sets, one on trade shares and the other on flood levels, demonstrate the importance of the new model when compared to other unit models.
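A minimal sketch of the construction, assuming the inverse-exponential scheme means Y = exp(−X) with X gamma/Gompertz distributed (parameters b, s, β as in the base law); the closed forms below follow from that assumption and may differ from the paper's exact notation.

```python
import numpy as np

def ugg_pdf(y, b, s, beta):
    """Density of Y = exp(-X) for X ~ gamma/Gompertz(b, s, beta), 0 < y < 1."""
    return s * b * beta**s * y**(-b - 1) / (beta - 1 + y**(-b))**(s + 1)

def ugg_cdf(y, b, s, beta):
    # P(Y <= y) = P(X >= -log y) = 1 - F_X(-log y), which simplifies to:
    return beta**s / (beta - 1 + y**(-b))**s

def ugg_rvs(size, b, s, beta, seed=None):
    # Inverse-transform sampling from the closed-form quantile function.
    u = np.random.default_rng(seed).uniform(size=size)
    return (beta * u**(-1.0 / s) - beta + 1.0)**(-1.0 / b)

# Quick self-check: the empirical CDF of draws should track the analytic CDF.
y = ugg_rvs(100_000, b=1.5, s=0.8, beta=2.0, seed=0)
print(abs(np.mean(y <= 0.5) - ugg_cdf(0.5, 1.5, 0.8, 2.0)))  # should be ~0
```

The same log-density, summed over observations, is what a maximum likelihood routine would optimize for the applications mentioned.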


2020 ◽  
Vol 70 (1) ◽  
pp. 145-161 ◽  
Author(s):  
Marnus Stoltz ◽  
Boris Baeumer ◽  
Remco Bouckaert ◽  
Colin Fox ◽  
Gordon Hiscott ◽  
...  

Abstract We describe a new and computationally efficient Bayesian methodology for inferring species trees and demographics from unlinked binary markers. Likelihood calculations are carried out using diffusion models of allele frequency dynamics combined with novel numerical algorithms. The diffusion approach allows for the analysis of data sets containing hundreds or thousands of individuals. The method, which we call Snapper, has been implemented as part of the BEAST2 package. We conducted simulation experiments to assess numerical error, computational requirements, and accuracy in recovering known model parameters. A reanalysis of soybean SNP data demonstrates that the models implemented in Snapp and Snapper can be difficult to distinguish in practice, a characteristic which we tested with further simulations. We demonstrate the scale of analysis possible using a SNP data set sampled from 399 freshwater turtles in 41 populations. [Bayesian inference; diffusion models; multi-species coalescent; SNP data; species trees; spectral methods.]
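A minimal sketch of the model structure (not Snapper's spectral solver): allele frequencies diffuse along a branch under the neutral Wright-Fisher diffusion, and the observed allele counts are binomial draws given the tip frequency. The crude Euler-Maruyama Monte Carlo below is purely illustrative, and all names are hypothetical.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(2)

def simulate_wf(x0, t, N=1000, n_paths=10_000, dt=1e-3):
    """Euler-Maruyama paths of the neutral Wright-Fisher diffusion
    dx = sqrt(x(1-x)/(2N)) dW, started at frequency x0 and run for time t."""
    x = np.full(n_paths, float(x0))
    for _ in range(int(t / dt)):
        var = np.clip(x * (1 - x), 0.0, None) / (2 * N)
        x = np.clip(x + np.sqrt(var * dt) * rng.normal(size=n_paths), 0.0, 1.0)
    return x

# Monte Carlo likelihood of observing k derived alleles among n sampled
# chromosomes at the tip of a branch of length t = 0.5.
x_tip = simulate_wf(x0=0.3, t=0.5)
k, n = 7, 20
print("P(k | model) ≈", binom.pmf(k, n, x_tip).mean())
```

Snapper replaces this sampling with a deterministic spectral representation of the transition density, which is what makes likelihoods tractable for hundreds of individuals.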


2018 ◽  
Vol 612 ◽  
pp. A70 ◽  
Author(s):  
J. Olivares ◽  
E. Moraux ◽  
L. M. Sarro ◽  
H. Bouy ◽  
A. Berihuete ◽  
...  

Context. Membership analyses of the DANCe and Tycho + DANCe data sets provide the largest and least contaminated sample of Pleiades candidate members to date. Aims. We aim to reassess the different proposals for the number surface density of the Pleiades in the light of this new and most complete list of candidate members, and to infer the parameters of the most adequate model. Methods. We compute the Bayesian evidence and Bayes factors for variations of the classical radial models. These include elliptical symmetry and luminosity segregation. As a by-product of the model comparison, we obtain posterior distributions for each set of model parameters. Results. We find that the model comparison results depend on the spatial extent of the region used for the analysis. For a circle of 11.5 parsecs around the cluster centre (the most homogeneous and complete region), we find no compelling reason to abandon King’s model, although the Generalised King model introduced here has slightly better fitting properties. Furthermore, we find strong evidence against radially symmetric models when compared to the elliptic extensions. Finally, we find that including mass segregation, in the form of luminosity segregation in the J band, is strongly supported in all our models. Conclusions. We have put the question of the projected spatial distribution of the Pleiades cluster on a solid probabilistic footing, and inferred its properties using the most exhaustive and least contaminated list of Pleiades candidate members available to date. Our results suggest, however, that this sample may still lack about 20% of the expected number of cluster members. Therefore, this study should be revised when the completeness and homogeneity of the data can be extended beyond the 11.5 parsec limit. Such a study will allow for a more precise determination of the Pleiades spatial distribution, its tidal radius, ellipticity, number of objects, and total mass.
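A minimal sketch of the kind of computation involved: the King (1962) surface-density profile as a likelihood for projected member radii, with the Bayesian evidence approximated by brute-force averaging of the likelihood over a uniform prior grid. Radii, priors, and grids are placeholders, not the paper's setup.

```python
import numpy as np

def king_density(r, rc, rt):
    """Unnormalized King surface density, zero beyond the tidal radius rt."""
    k = (1/np.sqrt(1 + (r/rc)**2) - 1/np.sqrt(1 + (rt/rc)**2))**2
    return np.where(r < rt, k, 0.0)

def log_like(radii, rc, rt):
    # Normalize over the 11.5 pc analysis circle so the density is a proper pdf
    # in r (including the 2*pi*r Jacobian of the projected radial coordinate).
    grid = np.linspace(1e-3, 11.5, 2000)
    norm = np.trapz(2*np.pi*grid*king_density(grid, rc, rt), grid)
    dens = 2*np.pi*radii*king_density(radii, rc, rt) / norm
    return np.sum(np.log(np.clip(dens, 1e-300, None)))

rng = np.random.default_rng(3)
radii = rng.uniform(0.1, 10.0, size=500)       # placeholder projected radii in pc

# Crude evidence: mean likelihood over uniform priors rc ~ U(0.5,5), rt ~ U(10,40).
rc_grid, rt_grid = np.linspace(0.5, 5, 40), np.linspace(10, 40, 40)
logL = np.array([[log_like(radii, rc, rt) for rc in rc_grid] for rt in rt_grid])
logZ = logL.max() + np.log(np.mean(np.exp(logL - logL.max())))
print("log-evidence ≈", logZ)
```

Repeating the same integral for an elliptical or luminosity-segregated variant and differencing the log-evidences gives the Bayes factor used to rank the models.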


2014 ◽  
Vol 70 (a1) ◽  
pp. C344-C344
Author(s):  
Silvia Russi ◽  
Shawn Kann ◽  
Henry van den Bedem ◽  
Ana M. González

Protein crystallography data collection at synchrotrons today is routinely carried out at cryogenic temperatures to mitigate radiation damage to the crystal. Although damage still takes place at 100 K and below, the immobilization of free radicals increases the lifetime of the crystals by orders of magnitude. Increasingly, however, experiments are carried out at room temperature, for reasons that include the lack of adequate cryo-protectants, the lattice changes or internal disorder induced by the cooling process, and the convenience of collecting data directly from the crystallization plates. Moreover, recent studies have shown that flash-freezing affects the conformational ensemble of crystal structures [1] and can hide important functional mechanisms from observation [2]. While there has been a considerable amount of effort in studying radiation damage at cryo-temperatures, its effects at room temperature are still not well understood. We investigated the effects of data collection temperature on secondary local damage to the side chains and main chains of different proteins. Data were collected from crystals of thaumatin and lysozyme at 100 K and at room temperature. To carefully control the total absorbed dose, full data sets at room temperature were assembled from a few diffraction images per crystal. Several data sets were collected at increasing levels of absorbed dose. Our analysis shows that while at cryogenic temperatures radiation damage increases the conformational variability, at room temperature it has the opposite effect. We also observed that disulfide bonds appear to break at a different relative rate at room temperature, perhaps because of a more active repair mechanism. Our analysis suggests that the elevated conformational heterogeneity in crystal structures at room temperature is observed despite radiation damage, and not as a result thereof.


2017 ◽  
Vol 3 (5) ◽  
pp. e192 ◽  
Author(s):  
Corina Anastasaki ◽  
Stephanie M. Morris ◽  
Feng Gao ◽  
David H. Gutmann

Objective: To ascertain the relationship between the germline NF1 gene mutation and glioma development in patients with neurofibromatosis type 1 (NF1). Methods: The relationship between the type and location of the germline NF1 mutation and the presence of a glioma was analyzed in 37 participants with NF1 from one institution (Washington University School of Medicine [WUSM]) with a clinical diagnosis of NF1. Odds ratios (ORs) were calculated using both unadjusted and weighted analyses of this data set in combination with 4 previously published data sets. Results: While no statistically significant association was observed between the location and type of the NF1 mutation and glioma in the WUSM cohort, power calculations revealed that a sample size of 307 participants would be required to determine the predictive value of the position or type of the NF1 gene mutation. Combining our data set with 4 previously published data sets (n = 310), children with glioma were found to be more likely to harbor 5′-end gene mutations (OR = 2; p = 0.006). Moreover, while not clinically predictive due to insufficient sensitivity and specificity, this association with glioma was stronger for participants with 5′-end truncating (OR = 2.32; p = 0.005) or 5′-end nonsense (OR = 3.93; p = 0.005) mutations relative to those without glioma. Conclusions: Individuals with NF1 and glioma are more likely to harbor nonsense mutations in the 5′ end of the NF1 gene, suggesting that the NF1 mutation may be one predictive factor for glioma in this at-risk population.
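A minimal sketch of the unadjusted odds-ratio calculation on a 2×2 table of 5′-end mutation status against glioma status; the counts are placeholders, not the study's data.

```python
import numpy as np
from scipy.stats import fisher_exact

table = np.array([[40, 30],    # 5'-end mutation:  glioma, no glioma
                  [60, 180]])  # other mutation:   glioma, no glioma
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")

# 95% CI via the standard normal approximation on the log odds ratio.
log_or = np.log(table[0, 0] * table[1, 1] / (table[0, 1] * table[1, 0]))
se = np.sqrt((1 / table.astype(float)).sum())
print("95% CI:", np.exp(log_or + np.array([-1.96, 1.96]) * se))
```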


2020 ◽  
Vol 9 (1) ◽  
pp. 61-81
Author(s):  
Lazhar BENKHELIFA

A new lifetime model with four positive parameters, called the Weibull Birnbaum-Saunders distribution, is proposed. The proposed model extends the Birnbaum-Saunders distribution and provides great flexibility in modeling data in practice. Some mathematical properties of the new distribution are obtained, including expansions for the cumulative and density functions, moments, the generating function, mean deviations, order statistics, and reliability. Estimation of the model parameters is carried out by the maximum likelihood method. A simulation study is presented to show the performance of the maximum likelihood estimates of the model parameters. The flexibility of the new model is examined by applying it to two real data sets.
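A minimal sketch of the likelihood estimation, assuming the Weibull-G construction F(x) = 1 − exp(−λ[G(x)/(1−G(x))]^k) with G the Birnbaum-Saunders CDF (scipy's fatiguelife); the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.stats import fatiguelife
from scipy.optimize import minimize

def neg_log_like(log_params, x):
    lam, k, a, b = np.exp(log_params)    # log-parameterization keeps all four > 0
    G = np.clip(fatiguelife.cdf(x, a, scale=b), 1e-12, 1 - 1e-12)
    g = fatiguelife.pdf(x, a, scale=b)
    r = G / (1 - G)                      # odds transform of the baseline cdf
    # log density of the Weibull-G family with Birnbaum-Saunders baseline
    logf = np.log(lam * k * g) + (k - 1) * np.log(r) - 2 * np.log1p(-G) - lam * r**k
    return -np.sum(logf)

x = fatiguelife.rvs(0.5, scale=2.0, size=300, random_state=4)  # placeholder data
res = minimize(neg_log_like, x0=np.zeros(4), args=(x,), method="Nelder-Mead")
print("MLE (lambda, k, alpha, beta):", np.exp(res.x))
```

A simulation study of the kind the abstract describes would repeat this fit over many synthetic samples and report bias and mean squared error of the estimates.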


Author(s):  
Kaitlyn Johnson ◽  
Grant R. Howard ◽  
Daylin Morgan ◽  
Eric A. Brenner ◽  
Andrea L. Gardner ◽  
...  

Summary A significant challenge in the field of biomedicine is the development of methods to integrate the multitude of dispersed data sets into comprehensive frameworks that can be used to generate optimal clinical decisions. Recent technological advances in single cell analysis allow for high-dimensional molecular characterization of cells and populations, but to date, few mathematical models have attempted to integrate measurements from the single cell scale with other data types. Here, we present a framework that actionizes static outputs from a machine learning model and leverages them as measurements of state variables in a dynamic, mechanistic model of treatment response. We apply this framework to breast cancer cells to integrate single cell transcriptomic data with longitudinal population-size data. We demonstrate that the explicit inclusion of the transcriptomic information in the parameter estimation is critical for identification of the model parameters and enables accurate prediction of new treatment regimens. Inclusion of the transcriptomic data improves predictive accuracy on new treatment response dynamics, with a concordance correlation coefficient (CCC) of 0.89, compared to CCC = 0.79 without integrating the single cell RNA sequencing (scRNA-seq) data directly into the model calibration. To the best of our knowledge, this is the first work that explicitly integrates single cell, clonally resolved transcriptome data sets with longitudinal treatment response data in a mechanistic mathematical model of drug resistance dynamics. We anticipate this approach to be a first step demonstrating the feasibility of incorporating multimodal data sets into identifiable mathematical models to develop optimized treatment regimens from data.
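A minimal sketch of the integration idea: a two-subpopulation (sensitive/resistant) growth model fit to longitudinal cell counts, with the scRNA-seq-derived resistant fraction entering the objective as an extra measurement of the initial state. Model form, data, and weights are illustrative assumptions, not the paper's model.

```python
import numpy as np
from scipy.optimize import least_squares

t = np.array([0., 1., 2., 3., 4., 5.])                   # days after treatment
counts = np.array([1000., 900., 870., 950., 1150., 1500.])  # placeholder cell counts
phi_scrnaseq = 0.15          # resistant fraction from a single-cell classifier

def model(params, t, N0):
    gs, gr, phi0 = params    # sensitive/resistant net growth rates, initial fraction
    return N0 * (1 - phi0) * np.exp(gs * t) + N0 * phi0 * np.exp(gr * t)

def residuals(params):
    fit = model(params, t, counts[0])
    # Joint objective: population-size residuals plus the single-cell constraint
    # on phi0. Without the second term, phi0 is only weakly identifiable from
    # bulk counts alone.
    return np.r_[(fit - counts) / counts.std(), 10.0 * (params[2] - phi_scrnaseq)]

sol = least_squares(residuals, x0=[-0.3, 0.2, 0.5], bounds=([-2, -2, 0], [2, 2, 1]))
print("gs, gr, phi0 =", sol.x)
```

Dropping the scRNA-seq residual and refitting illustrates the identifiability gap the abstract describes: several (gs, gr, phi0) combinations then fit the counts almost equally well.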


2016 ◽  
Author(s):  
Kassian Kobert ◽  
Alexandros Stamatakis ◽  
Tomáš Flouri

The phylogenetic likelihood function is the major computational bottleneck in several applications of evolutionary biology, such as phylogenetic inference, species delimitation, model selection, and divergence time estimation. Given the alignment, a tree, and the evolutionary model parameters, the likelihood function computes the conditional likelihood vectors for every node of the tree. Vector entries for which all input data are identical result in redundant likelihood operations which, in turn, yield identical conditional values. Such operations can be omitted to improve run-time and, using appropriate data structures, to reduce memory usage. We present a fast, novel method for identifying and omitting such redundant operations in phylogenetic likelihood calculations, and assess the performance improvement and memory savings attained by our method. Using empirical and simulated data sets, we show that a prototype implementation of our method yields up to 10-fold speedups and uses up to 78% less memory than one of the fastest and most highly tuned implementations of the phylogenetic likelihood function currently available. Our method is generic and can seamlessly be integrated into any phylogenetic likelihood implementation.
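The simplest instance of this redundancy is repeated alignment columns: identical site patterns yield identical conditional likelihoods, so they can be collapsed into unique patterns with multiplicity weights before any likelihood pass, as in the sketch below. The authors' method generalizes this idea to repeated subtree patterns at internal nodes.

```python
import numpy as np

def compress_sites(alignment):
    """alignment: (n_taxa, n_sites) array of integer-encoded states.
    Returns the unique site patterns (columns) and their multiplicities."""
    patterns, weights = np.unique(alignment, axis=1, return_counts=True)
    return patterns, weights

aln = np.array([[0, 1, 0, 0, 2, 1],
                [0, 1, 0, 0, 3, 1],
                [1, 2, 1, 1, 0, 2]])
patterns, w = compress_sites(aln)
print(patterns.shape[1], "unique patterns out of", aln.shape[1], "sites; weights:", w)
# The total log-likelihood is then sum_i w[i] * log L(pattern_i): each repeated
# column is computed once and weighted, exactly the saving the node-level method extends.
```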

