Statistical Analysis of Chemical Element Compositions in Food Science: Problems and Possibilities

Molecules ◽  
2021 ◽  
Vol 26 (19) ◽  
pp. 5752
Author(s):  
Matthias Templ ◽  
Barbara Templ

In recent years, many analyses have been carried out to investigate the chemical components of food data. However, studies rarely consider the compositional pitfalls of such analyses. This is problematic, as it may lead to arbitrary results when non-compositional statistical methods are applied to compositional datasets. In this study, compositional data analysis (CoDa), which is widely used in other research fields, is compared with classical statistical analysis to demonstrate how the results vary depending on the approach and to identify the most appropriate analysis. For example, honey and saffron are highly susceptible to adulteration and imitation, so the determination of their chemical elements requires the best possible statistical treatment. Our study demonstrated how principal component analysis (PCA) and classification results are influenced by the pre-processing steps conducted on the raw data and by the replacement strategies for missing values and non-detects. Furthermore, it demonstrated the differences in results when compositional and non-compositional methods were applied. Our results suggested that log-ratio analysis provided better separation between the pure and adulterated data, easier interpretability of the results, and higher classification accuracy. Similarly, it showed that classification with artificial neural networks (ANNs) works poorly if the CoDa pre-processing steps are left out. From these results, we advise the application of CoDa methods for analyses of the chemical elements of food and for the characterization and authentication of food products.
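The centred log-ratio (clr) step underlying such CoDa pre-processing can be sketched in a few lines; the element concentrations and the sample below are purely illustrative, not taken from the paper:

```python
import math

def clr(parts):
    """Centred log-ratio transform: log of each part relative to the
    geometric mean of the composition. It maps the simplex into real
    space so that standard tools such as PCA apply without the
    spurious correlations induced by a constant-sum constraint."""
    logs = [math.log(p) for p in parts]
    g = sum(logs) / len(logs)          # log of the geometric mean
    return [x - g for x in logs]

# Hypothetical element concentrations (mg/kg) for one honey sample.
sample = [120.0, 35.0, 8.0, 0.5]
z = clr(sample)
print([round(v, 3) for v in z])
# clr coordinates always sum to zero; PCA would be run on these.
```

A PCA or ANN classifier would then consume the clr coordinates instead of the raw concentrations.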

Minerals ◽  
2020 ◽  
Vol 10 (6) ◽  
pp. 501
Author(s):  
Caterina Gozzi ◽  
Roberta Sauro Graziano ◽  
Antonella Buccianti

Nature is often characterized by systems that are far from thermodynamic equilibrium, and the rivers of the Earth’s critical zone are no exception. When the chemical composition of stream waters is investigated, it emerges that riverine systems behave as complex systems: the compositions have properties that depend on the integrity of the whole (i.e., the composition with all its chemical constituents), properties that arise from the innumerable nonlinear interactions between the elements of the composition. The presence of these interconnections indicates that the properties of the whole cannot be fully understood by examining the parts of the system in isolation. In this work, we propose investigating the complexity of riverine chemistry by using the CoDA (Compositional Data Analysis) methodology and the behavior of the perturbation operator in the simplex geometry. With riverine bicarbonate considered a key component of regional and global biogeochemical cycles and Ca2+ considered mostly related to the weathering of carbonatic rocks, perturbations were calculated for successive pairs of compositions after ranking the data by increasing values of the log-ratio ln(Ca2+/HCO3−). The numerical values were analyzed using robust principal component analysis and non-parametric correlations between compositional parts (heat map), combined with distributional and multifractal methods. The results indicate that HCO3−, Ca2+, Mg2+ and Sr2+ are more resilient, contributing to compositional changes to a lesser degree than the other chemical elements/components across all values of ln(Ca2+/HCO3−). Moreover, the complementary cumulative distribution function of all the sequences tracing the compositional change, and the nonlinear relationship between the Q-th moment and the scaling exponents for each of them, indicate the presence of multifractal variability, revealing scaling properties of the fluctuations.
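The perturbation operator in the simplex can be illustrated in a few lines; the two three-part water compositions below are hypothetical, a simplification of the full suite of ions analysed:

```python
def closure(parts, total=1.0):
    """Rescale a vector of positive parts to a constant sum."""
    s = sum(parts)
    return [total * p / s for p in parts]

def perturb(x, y):
    """Aitchison perturbation x (+) y = C(x1*y1, ..., xD*yD):
    the component-wise product followed by closure, which plays
    the role of addition in the simplex geometry."""
    return closure([a * b for a, b in zip(x, y)])

# Two hypothetical three-part stream-water compositions.
x = closure([0.50, 0.30, 0.20])
y = closure([0.20, 0.30, 0.50])
print([round(v, 3) for v in perturb(x, y)])  # still sums to 1
```

Applying the inverse perturbation (component-wise division, then closure) between successive ranked samples is the simplex analogue of differencing a time series.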


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
Antoni Susin ◽  
Yiwen Wang ◽  
Kim-Anh Lê Cao ◽  
M Luz Calle

Abstract Though variable selection is one of the most relevant tasks in microbiome analysis, e.g. for the identification of microbial signatures, many studies still rely on methods that ignore the compositional nature of microbiome data. The applicability of compositional data analysis methods has been hampered by the limited availability of software and the difficulty of interpreting their results. This work focuses on three methods for variable selection that acknowledge the compositional structure of microbiome data: selbal, a forward selection approach for the identification of compositional balances, and clr-lasso and coda-lasso, two penalized regression models for compositional data analysis. This study highlights the links between these methods and brings out some limitations of the centered log-ratio transformation for variable selection. In particular, the fact that it is not subcompositionally consistent means the microbial signatures obtained from clr-lasso are not readily transferable. Coda-lasso is computationally efficient and suitable when the focus is the identification of the most strongly associated microbial taxa. Selbal stands out when the goal is to obtain a parsimonious model with optimal prediction performance, but it is computationally greedy. We provide a reproducible vignette for the application of these methods that will enable researchers to fully leverage their potential in microbiome studies.
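The subcompositional inconsistency of the clr transformation is easy to demonstrate: dropping one taxon changes the geometric-mean reference, and with it the clr coordinate of every remaining taxon. The abundances below are hypothetical:

```python
import math

def clr(parts):
    """Centred log-ratio: log of each part over the geometric mean."""
    logs = [math.log(p) for p in parts]
    g = sum(logs) / len(logs)
    return [x - g for x in logs]

full = [0.5, 0.25, 0.15, 0.10]   # four hypothetical taxa
sub  = full[:3]                  # same sample, last taxon dropped

# Taxon 1 keeps the same ratios to taxa 2 and 3, yet its clr
# coordinate changes, so a clr-based signature does not carry
# over to the subcomposition.
print(round(clr(full)[0], 3), round(clr(sub)[0], 3))  # -> 0.877 0.632
```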


2020 ◽  
Vol 2 (2) ◽  
Author(s):  
David R Lovell ◽  
Xin-Yi Chua ◽  
Annette McGrath

Abstract Thanks to sequencing technology, modern molecular bioscience datasets are often compositions of counts, e.g. counts of amplicons, mRNAs, etc. While there is growing appreciation that compositional data need special analysis and interpretation, less well understood is the discrete nature of these count compositions (or, as we call them, lattice compositions) and the impact this has on statistical analysis, particularly log-ratio analysis (LRA) of pairwise association. While LRA methods are scale-invariant, count compositional data are not; consequently, the conclusions we draw from LRA of lattice compositions depend on the scale of counts involved. We know that additive variation affects the relative abundance of small counts more than large counts; here we show that additive (quantization) variation comes from the discrete nature of count data itself, as well as (biological) variation in the system under study and (technical) variation from measurement and analysis processes. Variation due to quantization is inevitable, but its impact on conclusions depends on the underlying scale and distribution of counts. We illustrate the different distributions of real molecular bioscience data from different experimental settings to show why it is vital to understand the distributional characteristics of count data before applying and drawing conclusions from compositional data analysis methods.
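The scale effect the authors describe, that additive (e.g. quantization) variation perturbs the log-ratios of rare features far more than those of abundant ones, can be seen with two invented counts measured against a fixed reference feature:

```python
import math

ref = 1000  # hypothetical reference feature count
for count in (2, 2000):
    before = math.log(count / ref)
    after = math.log((count + 1) / ref)
    # A single-count change shifts the log-ratio of the rare
    # feature by ~0.41 but the abundant one by only ~0.0005.
    print(count, round(after - before, 4))
```

The shift equals ln((count+1)/count), which is why the same quantization noise is negligible at high counts but dominant at low ones.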


2016 ◽  
Vol 54 (1) ◽  
pp. 143-151 ◽  
Author(s):  
Enrique García Ordóñez ◽  
María del Carmen Iglesias Pérez ◽  
Carlos Touriño González

Abstract The aim of the present study was to identify the groups of offensive performance indicators which best discriminated between match scores (favourable, balanced or unfavourable) in water polo. The sample comprised 88 regular season games (2011-2014) from the Spanish Professional Water Polo League. The offensive performance indicators were clustered into five groups: Attacks in relation to the different playing situations; Shots in relation to the different playing situations; Attacks outcome; Origin of shots; Technical execution of shots. The variables of each group had a constant sum equal to 100%. Since the data were compositional, the variables were transformed by means of the additive log-ratio (alr) transformation. Multivariate discriminant analyses comparing the match scores were calculated using the transformed variables. With regard to the percentage of correct classification, the results showed that the group that discriminated most between the match scores was “Attacks outcome” (60.4% for the original sample and 52.2% for cross-validation). The performance indicators that discriminated most between the match scores in games with penalties were goals (structure coefficient (SC) = .761), counterattack shots (SC = .541) and counterattacks (SC = .481). In matches without penalties, goals were the primary discriminating factor (SC = .576). This approach provides a new tool for comparing the importance of the offensive performance groups and their effect on match score discrimination.
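The additive log-ratio (alr) transformation applied to each constant-sum group can be sketched as follows; the three percentages are invented, not indicators from the study:

```python
import math

def alr(parts):
    """Additive log-ratio: log of each part relative to the last
    part, mapping a D-part composition onto D-1 unconstrained
    coordinates suitable for discriminant analysis."""
    ref = parts[-1]
    return [math.log(p / ref) for p in parts[:-1]]

# A hypothetical three-part indicator group summing to 100%.
outcome = [45.0, 30.0, 25.0]
print([round(v, 3) for v in alr(outcome)])  # -> [0.588, 0.182]
```

The discriminant analysis is then run on these D-1 coordinates rather than on the raw percentages.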


2021 ◽  
Vol 50 (2) ◽  
pp. 38-55
Author(s):  
Carolina Navarro ◽  
Silvia Gonzalez-Morcillo ◽  
Carles Mulet-Forteza ◽  
Salvador Linares-Mustaros

This study presents a comprehensive bibliometric analysis of the paper published by John Aitchison in the Journal of the Royal Statistical Society, Series B (Methodological) in 1982. Having recently passed the milestone of 35 years since its publication, this pioneering paper was the first to illustrate the use of the "Compositional Data Analysis" (CoDA) methodology. By October 2019, the paper had received over 780 citations, making it the most widely cited and influential article among those using said methodology. The bibliometric approach used in this study encompasses a wide range of techniques, including a specific analysis of the main authors and institutions to have cited Aitchison's paper. The VOSviewer software was also used to develop network maps for said publication; specifically, the techniques used were co-citation and bibliographic coupling. The results clearly show the significant impact the paper has had on scientific research, having been cited by authors and institutions publishing all around the world.


2020 ◽  
Author(s):  
Luis P.V. Braga ◽  
Dina Feigenbaum

Abstract Background Covid-19 case data pose an enormous challenge to any analysis. The evaluation of such a global pandemic requires matching reports that follow different procedures and even overcoming some countries’ censorship that restricts publications. Methods This work proposes a methodology that could assist future studies. Compositional Data Analysis (CoDa) is proposed as the proper approach, as Covid-19 case data are compositional in nature. Under this methodology, three attributes were selected for each country: cumulative number of deaths (D); cumulative number of recovered patients (R); present number of patients (A). Results After the operation called closure, with c = 1, a ternary diagram and log-ratio plots, as well as compositional statistics, are presented. Cluster analysis is then applied, splitting the countries into discrete groups. Conclusions This methodology can also be applied to other data sets, such as countries, cities, provinces or districts, in order to help authorities and governmental agencies improve their actions to fight against a pandemic.
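The closure operation with c = 1 amounts to rescaling each country's (D, R, A) triple to proportions; a minimal sketch with invented case numbers:

```python
def closure(parts, c=1.0):
    """Closure: rescale positive parts to the constant sum c, here
    c = 1, so each country plots as a point in a ternary diagram."""
    s = sum(parts)
    return [c * p / s for p in parts]

# Hypothetical country record: deaths (D), recovered (R), active (A).
d, r, a = 5_000, 80_000, 15_000
print(closure([d, r, a]))  # -> [0.05, 0.8, 0.15]
```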


GigaScience ◽  
2019 ◽  
Vol 8 (9) ◽  
Author(s):  
Thomas P Quinn ◽  
Ionas Erb ◽  
Greg Gloor ◽  
Cedric Notredame ◽  
Mark F Richardson ◽  
...  

Abstract Background Next-generation sequencing (NGS) has made it possible to determine the sequence and relative abundance of all nucleotides in a biological or environmental sample. A cornerstone of NGS is the quantification of RNA or DNA presence as counts. However, these counts are not counts per se: their magnitude is determined arbitrarily by the sequencing depth, not by the input material. Consequently, counts must undergo normalization prior to use. Conventional normalization methods require a set of assumptions: they assume that the majority of features are unchanged and that all environments under study have the same carrying capacity for nucleotide synthesis. These assumptions are often untestable and may not hold when heterogeneous samples are compared. Results Methods developed within the field of compositional data analysis offer a general solution that is assumption-free and valid for all data. Herein, we synthesize the extant literature to provide a concise guide on how to apply compositional data analysis to NGS count data. Conclusions In highlighting the limitations of total library size, effective library size, and spike-in normalizations, we propose the log-ratio transformation as a general solution to answer the question, “Relative to some important activity of the cell, what is changing?”
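Why a log-ratio transformation side-steps sequencing depth can be shown directly: scaling every count by the same depth factor leaves the centred log-ratio (clr) coordinates unchanged. The counts are invented:

```python
import math

def clr(counts):
    """Centred log-ratio of a count vector (no zeros assumed)."""
    logs = [math.log(c) for c in counts]
    g = sum(logs) / len(logs)
    return [x - g for x in logs]

# The same hypothetical library sequenced at 1x and at 10x depth.
shallow = [10, 40, 50]
deep    = [100, 400, 500]
# Raw counts differ by the arbitrary depth; clr coordinates do not.
print(max(abs(a - b) for a, b in zip(clr(shallow), clr(deep))) < 1e-9)  # -> True
```

Multiplying all counts by a constant adds the same term to every log and to the geometric mean, so it cancels in the difference.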


2020 ◽  
Author(s):  
Heinz Reitner ◽  
Christian Benold ◽  
Peter Filzmoser ◽  
Maria Heinrich ◽  
Gerhard Hobiger ◽  
...  

Austrian loess and loess loam deposits have been an important source of raw materials for the heavy clay industry for centuries. The building material quality of loess and loess loam deposits and their suitability for different applications are significantly influenced by their heterogeneous properties. These depend on the geology of the source area, climatic conditions, geomorphological location, stratigraphic position, intensity of weathering and redeposition potential. The description of the occurrences, properties and availability of these raw materials is therefore an important prerequisite for meeting industrial quality requirements. A large number of different sub-datasets exist at the Geological Survey of Austria, comprising grain-size analyses, bulk rock composition, clay mineralogy and geochemistry data for loess and loess loam. Within our project, these individual data sets underwent a thorough examination and have been merged into a coherent database to enable joint regional and statistical analysis of the data. By applying a log-ratio approach, the compositional nature of the analytical data has been taken into account in the multivariate statistical methods.
Within our study we focused on the classic Austrian loess regions in the Northern Alpine foreland areas of Upper and Lower Austria and in the Vienna Basin. By transferring the results of the statistical analysis to a Geographic Information System (GIS), these served as the fundamental basis for our categorization of the loess and loess loam occurrences. Taking into account previously published approaches based on soil profile classifications, as well as trends and patterns derived from the analytical data, we were finally able to delineate different districts of brick raw material deposits. These will be made publicly accessible to the industry and interested parties as part of the web application of the Austrian Interactive Raw Material Information System IRIS-Online.


2020 ◽  
Vol 2 (4) ◽  
Author(s):  
Laura Sisk-Hackworth ◽  
Scott T Kelley

Abstract Compositional data analysis (CoDA) methods have increased in popularity as a new framework for analyzing next-generation sequencing (NGS) data. CoDA methods, such as the centered log-ratio (clr) transformation, adjust for the compositional nature of NGS counts, which is not addressed by traditional normalization methods. CoDA has only been sparsely applied to NGS data generated from microbial communities or to multiple ‘omics’ datasets. In this study, we applied CoDA methods to analyze NGS and untargeted metabolomic datasets obtained from bacterial and fungal communities. Specifically, we used clr transformation to reanalyze NGS amplicon and metabolomics data from a study investigating the effects of building material type, moisture and time on microbial and metabolomic diversity. Compared to analysis of untransformed data, analysis of clr-transformed data revealed novel relationships and stronger associations between sample conditions and microbial and metabolic community profiles.


Author(s):  
Thomas P. Quinn ◽  
Ionas Erb

Abstract In the health sciences, many data sets produced by next-generation sequencing (NGS) contain only relative information because of biological and technical factors that limit the total number of nucleotides observed for a given sample. As the components are mutually dependent, it is not possible to interpret any component in isolation, at least without normalization. The field of compositional data analysis (CoDA) has emerged with alternative methods for relative data based on log-ratio transforms. However, NGS data often contain many more features than samples, and thus require creative new ways to reduce the dimensionality of the data without sacrificing interpretability. The summation of parts, called amalgamation, is a practical way of reducing dimensionality, but can introduce a non-linear distortion to the data. We exploit this non-linearity to propose a powerful yet interpretable dimension reduction method. In this report, we present data-driven amalgamation as a new method and conceptual framework for reducing the dimensionality of compositional data. Unlike expert-driven amalgamation, which requires prior domain knowledge, our data-driven amalgamation method uses a genetic algorithm to answer the question, “What is the best way to amalgamate the data to achieve the user-defined objective?”. We present a user-friendly R package, called amalgam, that can quickly find the optimal amalgamation to (a) preserve the distance between samples, or (b) classify samples as diseased or not. Our benchmark on 13 real data sets confirms that these amalgamations compete with the state-of-the-art unsupervised and supervised dimension reduction methods in terms of performance, but result in new variables that are much easier to understand: they are groups of features added together.
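Amalgamation itself, the summation of parts, is simple; what the genetic algorithm contributes is the search for the best grouping. A hand-picked, purely illustrative grouping:

```python
def amalgamate(parts, groups):
    """Sum the parts inside each group, reducing a D-part
    composition to len(groups) parts. The grouping here is
    hand-picked; the authors' amalgam package instead searches
    for it with a genetic algorithm under a user-defined
    objective (distance preservation or classification)."""
    return [sum(parts[i] for i in g) for g in groups]

# Six hypothetical relative abundances collapsed into two groups.
x = [0.05, 0.20, 0.10, 0.25, 0.15, 0.25]
print([round(v, 2) for v in amalgamate(x, [(0, 1, 2), (3, 4, 5)])])
# -> [0.35, 0.65]
```

The reduced variables remain compositional and, being plain sums of named features, stay directly interpretable.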

