Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD

Mathematics ◽  
2021 ◽  
Vol 9 (16) ◽  
pp. 2015
Author(s):  
Jose Giovany Babativa-Márquez ◽  
José Luis Vicente-Villardón

Multivariate binary data are increasingly frequent in practice. Although some adaptations of principal component analysis are used to reduce dimensionality for this kind of data, none of them provide a simultaneous representation of rows and columns (biplot). Recently, a technique named logistic biplot (LB) has been developed to represent the rows and columns of a binary data matrix simultaneously, even though the algorithm used to fit the parameters is too computationally demanding to be useful in the presence of sparsity or when the matrix is large. We propose the fitting of an LB model using nonlinear conjugate gradient (CG) or majorization–minimization (MM) algorithms, and a cross-validation procedure is introduced to select the hyperparameter that represents the number of dimensions in the model. A Monte Carlo study that considers scenarios with several sparsity levels and different dimensions of the binary data set shows that the procedure based on cross-validation is successful in the selection of the model for all algorithms studied. The comparison of the running times shows that the CG algorithm is more efficient in the presence of sparsity and when the matrix is not very large, while the performance of the MM algorithm is better when the binary matrix is balanced or large. As a complement to the proposed methods and to give practical support, a package has been written in the R language called BiplotML. To complete the study, real binary data on gene expression methylation are used to illustrate the proposed methods.
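To make the model in this abstract concrete, the following is a minimal sketch of a logistic biplot fit, assuming the usual formulation in which the log-odds of entry (i, j) equal a column intercept plus the inner product of a row marker and a column marker. It uses plain gradient descent, not the CG or MM algorithms the authors propose, and it is not the BiplotML implementation. The function names, the toy matrix, the step size, and the iteration count are all illustrative assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def softplus(t):
    """Stable log(1 + exp(t)), used for the Bernoulli negative log-likelihood."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def linear_predictor(A, B, b0, i, j):
    """Log-odds of X[i][j] = 1: column intercept plus row/column marker product."""
    return b0[j] + sum(a * b for a, b in zip(A[i], B[j]))


def neg_log_likelihood(X, A, B, b0):
    """Bernoulli negative log-likelihood summed over all cells of X."""
    total = 0.0
    for i in range(len(X)):
        for j in range(len(X[0])):
            t = linear_predictor(A, B, b0, i, j)
            total += softplus(t) - X[i][j] * t
    return total


def fit_logistic_biplot(X, dims=2, steps=500, lr=0.05, seed=0):
    """Fit row markers A, column markers B and intercepts b0 by gradient descent.

    Plain gradient descent is a stand-in chosen for brevity (an assumption of
    this sketch); the abstract's algorithms are nonlinear CG and MM.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    A = [[rng.uniform(-0.1, 0.1) for _ in range(dims)] for _ in range(n)]
    B = [[rng.uniform(-0.1, 0.1) for _ in range(dims)] for _ in range(p)]
    b0 = [0.0] * p
    for _ in range(steps):
        # Residuals (fitted probability minus observed value) drive every gradient.
        R = [[sigmoid(linear_predictor(A, B, b0, i, j)) - X[i][j]
              for j in range(p)] for i in range(n)]
        grad_A = [[sum(R[i][j] * B[j][k] for j in range(p))
                   for k in range(dims)] for i in range(n)]
        grad_B = [[sum(R[i][j] * A[i][k] for i in range(n))
                   for k in range(dims)] for j in range(p)]
        grad_b0 = [sum(R[i][j] for i in range(n)) for j in range(p)]
        A = [[A[i][k] - lr * grad_A[i][k] for k in range(dims)] for i in range(n)]
        B = [[B[j][k] - lr * grad_B[j][k] for k in range(dims)] for j in range(p)]
        b0 = [b0[j] - lr * grad_b0[j] for j in range(p)]
    return A, B, b0


# Toy binary matrix with two row groups (illustrative data, not from the paper).
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]
A, B, b0 = fit_logistic_biplot(X)
final_loss = neg_log_likelihood(X, A, B, b0)
```

Plotting the rows of A as points and the rows of B as directions in the same plane gives the simultaneous row and column display that the abstract refers to as a biplot.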

2020 ◽  
Author(s):  
Elzbieta Gralinska ◽  
Martin Vingron

Summary: In molecular biology, just as in many other fields of science, data often come in the form of matrices or contingency tables with many measurements (rows) for a set of variables (columns). While projection methods like Principal Component Analysis or Correspondence Analysis can be applied to obtain an overview of such data, when the matrix is very large the loss of information upon projection into two or three dimensions may be dramatic. However, when the set of variables can be grouped into clusters, this opens up a new angle on the data. We focus on the question of which measurements are associated with a cluster and distinguish it from other clusters. Correspondence Analysis employs a geometry geared towards answering this question. We exploit this feature to introduce Association Plots for visualizing cluster-specific measurements in complex data. Association Plots are two-dimensional, independent of the size of the data matrix or cluster, and depict the measurements associated with a cluster of variables. We demonstrate our method first on a small data set and then on a genomic example comprising more than 10,000 conditions. We show that Association Plots can clearly highlight those measurements which characterize a cluster of variables.


2020 ◽  
Vol 2 (1) ◽  
pp. 96-101
Author(s):  
Ahmad Fauzi ◽  
Riki Supriyadi ◽  
Nurlaelatul Maulidah

Abstract - Screening is an early-detection effort to identify diseases or abnormalities that are not yet clinically apparent, using particular tests, examinations, or procedures. It can be used to quickly distinguish people who appear healthy but actually have a disorder. The main aim of this study is to improve classification performance in breast cancer diagnosis by applying feature selection to several classification algorithms. The study uses the Breast Cancer Coimbra Data Set. A feature selection method based on principal component analysis is paired with several classification algorithms and methods, such as LogitBoost, Bagging, and Random Forest. The study uses 10-fold cross-validation as the evaluation method. The results show that the feature selection method based on principal component analysis improved classification performance significantly when paired with Random Forest and LogitBoost; Random Forest showed the best performance, with an accuracy of 79.3103% and an AUC of 0.843. Keywords: Feature Selection, PCA, Breast Cancer, Screening, Random Forest


Author(s):  
S.M. Shaharudin ◽  
N. Ahmad ◽  
N.H. Zainuddin ◽  
N.S. Mohamed

A robust dimension reduction method in Principal Component Analysis (PCA) was used to rectify the issue of unbalanced clusters in rainfall patterns caused by the skewed nature of rainfall data. A robust measure in PCA that uses Tukey's biweight correlation to downweight observations was introduced, and an optimum breakdown point for extracting the number of components in PCA with this approach is proposed. A simulated data matrix that mimicked the real data set was used to determine an appropriate breakdown point for robust PCA and to compare the performance of both approaches. The simulated data indicated that a breakdown point of 70% cumulative percentage of variance gave a good balance in extracting the number of components. The results showed a more significant and substantial improvement with the robust PCA than with the PCA based on Pearson correlation, in terms of the average number of clusters obtained and the cluster quality.


1981 ◽  
Vol 46 (2) ◽  
pp. 272-283 ◽  
Author(s):  
Robert K. Vierra ◽  
David L. Carlson

Multivariate statistical techniques such as factor analysis are capable of producing patterned results with most, if not all, data matrices. This paper demonstrates that patterned results are obtainable when principal component analysis is applied to a random data set. It is suggested that Bartlett's test for the statistical significance of a correlation matrix be employed in deciding whether a factor analysis of the matrix is justified.
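As a hedged illustration of the test this abstract recommends, the sketch below computes Bartlett's sphericity statistic in its common textbook form, chi-square = -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, where R is the sample correlation matrix of n observations on p variables. The helper names and the four-row toy data set are assumptions made for the example; the paper itself gives no code.

```python
import math


def correlation_matrix(data):
    """Pearson correlation matrix of the columns of `data` (a list of rows)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    cross = [[sum(r[a] * r[b] for r in centered) for b in range(p)]
             for a in range(p)]
    return [[cross[a][b] / math.sqrt(cross[a][a] * cross[b][b])
             for b in range(p)] for a in range(p)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    size = len(a)
    det = 1.0
    for c in range(size):
        pivot = max(range(c, size), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, size):
            factor = a[r][c] / a[c][c]
            for k in range(c, size):
                a[r][k] -= factor * a[c][k]
    return det


def bartlett_sphericity(data):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    n, p = len(data), len(data[0])
    det_r = determinant(correlation_matrix(data))
    statistic = -(n - 1 - (2 * p + 5) / 6) * math.log(det_r)
    dof = p * (p - 1) // 2
    return statistic, dof


# Toy data: two variables with correlation 0.6 over four observations.
toy = [[1, 2], [2, 1], [3, 4], [4, 3]]
statistic, dof = bartlett_sphericity(toy)
# statistic is roughly 0.669 with dof == 1
```

With one degree of freedom the 5% chi-square critical value is 3.841, so in this toy case the hypothesis of an identity correlation matrix would not be rejected, and a factor analysis of such data would not be justified under the abstract's recommendation.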


2016 ◽  
Vol 35 (2) ◽  
pp. 173-190 ◽  
Author(s):  
S. Shahid Shaukat ◽  
Toqeer Ahmed Rao ◽  
Moazzam A. Khan

Abstract: In this study, we used bootstrap simulation of a real data set to investigate the impact of sample size (N = 20, 30, 40 and 50) on the eigenvalues and eigenvectors resulting from principal component analysis (PCA). For each sample size, 100 bootstrap samples were drawn from an environmental data matrix of water quality variables (p = 22) from a small data set comprising 55 samples (stations from which water samples were collected). Because data sets in ecology and environmental sciences are invariably small owing to the high cost of collecting and analysing samples, we restricted our study to relatively small sample sizes. We focused on comparing the first 6 eigenvectors and the first 10 eigenvalues. Data sets were compared by agglomerative cluster analysis using Ward's method, which does not require any stringent distributional assumptions.
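The resampling design described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the data are synthetic (55 rows, 3 columns instead of 22 variables), only the leading eigenvalue of the covariance matrix is tracked, and it is obtained by power iteration so that the example needs no external libraries. All function names are hypothetical.

```python
import random


def covariance(data):
    """Sample covariance matrix (divisor n - 1) of the columns of `data`."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in data) / (n - 1)
             for b in range(p)] for a in range(p)]


def leading_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    p = len(matrix)
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    # Rayleigh quotient of the converged unit vector.
    return sum(v[i] * sum(matrix[i][j] * v[j] for j in range(p))
               for i in range(p))


def bootstrap_leading_eigenvalues(data, sample_size, replicates, seed=0):
    """Leading covariance eigenvalue for each bootstrap sample (with replacement)."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        sample = [rng.choice(data) for _ in range(sample_size)]
        values.append(leading_eigenvalue(covariance(sample)))
    return values


# Synthetic stand-in for the 55-station data set (assumption: 3 variables).
gen = random.Random(1)
toy_data = [[gen.gauss(0, 1), gen.gauss(0, 2), gen.gauss(0, 0.5)]
            for _ in range(55)]

# 100 bootstrap replicates for each sample size named in the abstract.
results = {n: bootstrap_leading_eigenvalues(toy_data, n, replicates=100)
           for n in (20, 30, 40, 50)}
```

Comparing the spread of `results[20]` with that of `results[50]` gives a rough view of how the stability of an eigenvalue depends on sample size, which is the question the abstract poses.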


2005 ◽  
Vol 59 (5) ◽  
pp. 630-638 ◽  
Author(s):  
Slobodan Šašić ◽  
Donald A. Clark ◽  
John C. Mitchell ◽  
Martin J. Snowden

Sample–sample (SS) two-dimensional (2D) correlation spectroscopy is applied in this study as a spectral selection tool to produce chemical images of real-world pharmaceutical samples consisting of two, three, and four components. The most unique spectra in a Raman mapping spectral matrix are found after analysis of the covariance matrix. (This is obtained by multiplying the original mapping data matrix by itself.) These spectra are identified by analyzing the slices of the covariance matrix at the positions where covariance values are at maxima. Chemical images are subsequently produced in a univariate fashion by visually selecting the wavenumbers in the extracted spectra that are least overlapped. The performance of SS 2D correlation is compared with principal component analysis in terms of highlighting the most prominent spectral differences across the whole data set (which typically comprises several thousand spectra) and determining the total number of species present. In addition, the selection of the unique spectra by SS 2D correlation is compared with the selection obtained by the orthogonal projection approach (OPA). Both comparisons are found to be satisfactory and demonstrate that a quite simple SS 2D correlation routine can be used for producing reliable images of unknown samples. The main benefit of using SS 2D correlation is that it is based on a few data processing commands that can be executed separately and produce results that are closely related to the chemical features of the system.
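The matrix product mentioned in this abstract can be illustrated with a small sketch. It assumes that each row of the mapping matrix is one spectrum, that the sample-sample matrix is the product of the data matrix with its own transpose without mean-centering, and that a candidate spectrum is located at the largest entry of that matrix. The selection rule is a simplification for illustration, and the function names and toy spectra are hypothetical.

```python
def sample_sample_matrix(spectra):
    """Product of the data matrix with its transpose: one entry per pair of spectra."""
    return [[sum(a * b for a, b in zip(s, t)) for t in spectra] for s in spectra]


def strongest_slice(matrix):
    """Locate the largest entry and return its position plus the row slice through it."""
    size = len(matrix)
    i, j = max(((r, c) for r in range(size) for c in range(size)),
               key=lambda rc: matrix[rc[0]][rc[1]])
    return i, j, matrix[i]


# Three toy "spectra" with three wavenumber channels each (illustrative values).
spectra = [
    [1, 0, 0],
    [0, 2, 0],
    [1, 1, 0],
]
ss = sample_sample_matrix(spectra)
row, col, row_slice = strongest_slice(ss)
# ss == [[1, 0, 1], [0, 4, 2], [1, 2, 2]]; the maximum 4 sits at (1, 1)
```

The slice through the maximum shows how strongly every other spectrum co-varies with the selected one, which is the kind of information the abstract describes using to pick out the most distinctive spectra.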


2021 ◽  
Vol 58 (6A) ◽  
pp. 288
Author(s):  
Hoang Quoc Tuan ◽  
Lai Quoc Dat ◽  
Cung Thi To Quynh ◽  
Nguyen Hoang Dung ◽  
Nguyen Xuan Loi ◽  
...  

The fatty acid and amino acid compositions of coffee beans from Arabica and Robusta cultivars grown in three regions of Vietnam were investigated. Principal component analysis (PCA) and hierarchical cluster analysis (HCA) were performed on the complete data set to reveal chemical differences among all samples and to identify markers characteristic of a particular botanical or geographical origin of the coffee. The major fatty acids in the coffee oil analyzed in this study were linoleic acid (C18:2), stearic acid (C18:0), oleic acid (C18:1), palmitic acid (C16:0), and myristic acid (C14:0), followed by small amounts of arachic acid (C20:0), docosanoic acid (C22:0), and eicosenoic acid (C20:1). Glutamic acid and aspartic acid were found in high amounts in Robusta coffee, from 271 to 786 mg/100 g DW and from 373 to 486 mg/100 g DW, respectively, whereas alanine and glutamic acid were found in high amounts in Arabica coffee, at 268 to 351 mg/100 g DW and 209 to 285 mg/100 g DW, respectively. Leucine (301 to 416 mg/100 g DW), phenylalanine (226 to 305 mg/100 g DW), and lysine (199 to 269 mg/100 g DW). PCA of the complete data matrix demonstrated significant differences among the coffee cultivars and geographical origins; HCA supported the PCA results and achieved satisfactory classification performance.


2020 ◽  
Vol 8 (2) ◽  
pp. 346-358
Author(s):  
Alberto Oliveira da Silva ◽  
Adelaide Freitas

Extracting the essential features of a real-valued time series is crucial for exploring it, modeling it, and producing, for example, forecasts. Taking advantage of the representation of a time series by its Hankel trajectory matrix, constructed using Singular Spectrum Analysis, and of the decomposition of that matrix through Principal Component Analysis via Partial Least Squares, we implement a graphical display based on the biplot methodology. Different types of biplots can be constructed depending on the two matrices considered in the factorization of the trajectory matrix. In this work, we discuss the so-called HJ-biplot, which yields a simultaneous representation of both the rows and the columns of the matrix with maximum quality. The interpretation of this type of biplot on Hankel-related trajectory matrices is discussed using a real-world data set.
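For readers unfamiliar with the trajectory matrix named in this abstract, here is a minimal sketch of its construction, assuming the standard Singular Spectrum Analysis embedding in which entry (i, j) of the matrix is the series value at position i + j. The function name and the window length are illustrative choices, and the sketch covers only the embedding step, not the PCA via Partial Least Squares or the HJ-biplot itself.

```python
def trajectory_matrix(series, window):
    """Build the window x K Hankel trajectory matrix of a time series.

    Row i holds series[i : i + K] with K = len(series) - window + 1,
    so every anti-diagonal of the result is constant.
    """
    n = len(series)
    if not 1 <= window <= n:
        raise ValueError("window must be between 1 and len(series)")
    k = n - window + 1  # number of lagged columns
    return [[series[i + j] for j in range(k)] for i in range(window)]


X = trajectory_matrix([1, 2, 3, 4, 5], window=3)
# X == [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

A biplot of this matrix then displays its rows (positions within the window) and its columns (lagged segments of the series) together, which is the factorization step the abstract builds on.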

