Expression reflects population structure

2018 ◽  
Author(s):  
Brielin C Brown ◽  
Nicolas L. Bray ◽  
Lior Pachter

Abstract: Population structure in genotype data has been extensively studied, and is revealed by looking at the principal components of the genotype matrix. However, no similar analysis of population structure in gene expression data has been conducted, in part because a naïve principal components analysis of the gene expression matrix does not cluster by population. We identify a linear projection that reveals population structure in gene expression data. Our approach relies on the coupling of the principal components of genotype to the principal components of gene expression via canonical correlation analysis. Furthermore, we analyze the variance of each gene within the projection matrix to determine which genes significantly influence the projection. We identify thousands of significant genes, and show that a number of the top genes have been implicated in diseases that disproportionately impact African Americans.

Author Summary: High-dimensional, multi-modal genomics datasets are becoming increasingly common, which warrants investigation into analysis techniques that can reveal structure in the data without over-fitting. Here, we show that the coupling of principal component analysis to canonical correlation analysis offers an efficient approach to exploratory analysis of this kind of data. We apply this method to the GEUVADIS dataset of genotype and gene expression values of European and Yoruban individuals, finding as-of-yet unstudied population structure in the gene expression values. Moreover, many of the top genes identified by our method have been previously implicated in diseases that disproportionately impact African Americans.
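The coupling of principal components to canonical correlation analysis described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `pca_scores` and `pca_cca`, the numbers of components, and the synthetic data are all assumptions made for illustration, and the gene-level variance analysis from the abstract is not covered. The sketch relies on the fact that, once each data matrix is reduced to orthonormal principal component scores, the canonical correlations are the singular values of the cross-product of the two score matrices.

```python
import numpy as np

def pca_scores(X, k):
    """Top-k orthonormal PC score vectors of the column-centered matrix X
    (rows are samples, columns are features)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :k]

def pca_cca(G, E, k_g, k_e):
    """CCA between the top PCs of two matrices measured on the same samples.

    G and E are hypothetical stand-ins for a genotype matrix and an
    expression matrix. Returns the canonical correlations and the
    canonical variates (one column per canonical pair) for each matrix.
    """
    Ug = pca_scores(G, k_g)
    Ue = pca_scores(E, k_e)
    # With orthonormal, centered score columns, CCA reduces to an SVD
    # of the cross-product of the two score matrices.
    A, rho, Bt = np.linalg.svd(Ug.T @ Ue, full_matrices=False)
    return rho, Ug @ A, Ue @ Bt.T

if __name__ == "__main__":
    # Synthetic example: two groups of samples with a shared group effect.
    rng = np.random.default_rng(0)
    group = np.repeat([0.0, 1.0], 40)
    G = group[:, None] * rng.normal(size=(1, 50)) + rng.normal(size=(80, 50))
    E = group[:, None] * rng.normal(size=(1, 30)) + rng.normal(size=(80, 30))
    rho, u, v = pca_cca(G, E, k_g=5, k_e=5)
    print("canonical correlations:", np.round(rho, 3))
```

Plotting the first columns of the returned variates against each other is one way to inspect whether samples separate by group. The number of components kept on each side is a tuning choice that this sketch leaves to the user.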

2005 ◽  
Vol 03 (02) ◽  
pp. 303-316 ◽  
Author(s):  
ZHENQIU LIU ◽  
DECHANG CHEN ◽  
HALIMA BENSMAIL ◽  
YING XU

Kernel principal component analysis (KPCA) has been applied to data clustering and graphic cut in the last couple of years. This paper discusses the application of KPCA to microarray data clustering. A new algorithm based on KPCA and fuzzy C-means is proposed. Experiments with microarray data show that the proposed algorithm is in general superior to traditional algorithms.
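As a rough illustration of how KPCA and fuzzy C-means can be chained, the sketch below projects samples with an RBF-kernel PCA and then runs standard fuzzy C-means on the projected coordinates. The abstract does not specify the kernel, the fuzzifier, or how the two steps are combined, so the RBF kernel, `gamma`, `m = 2`, and the function names `rbf_kernel_pca` and `fuzzy_c_means` are assumptions rather than the paper's algorithm.

```python
import numpy as np

def rbf_kernel_pca(X, n_components, gamma):
    """Project the rows of X onto the top kernel principal components (RBF kernel)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    # Center the kernel matrix in feature space.
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]   # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def fuzzy_c_means(Z, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy C-means. Returns memberships (n x c) and cluster centers."""
    rng = np.random.default_rng(seed)
    U = rng.random((Z.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)             # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ Z) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        inv = np.maximum(d, 1e-12) ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers

if __name__ == "__main__":
    # Two synthetic groups of samples standing in for expression profiles.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 0.5, (30, 5)), rng.normal(3.0, 0.5, (30, 5))])
    Z = rbf_kernel_pca(X, n_components=2, gamma=0.1)
    U, centers = fuzzy_c_means(Z, n_clusters=2)
    print("hard labels:", U.argmax(axis=1))
```

The membership matrix gives a soft assignment of each sample to each cluster. Taking the row-wise argmax, as in the demo, yields hard labels when those are needed.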


1997 ◽  
Vol 25 ◽  
pp. 347-352 ◽  
Author(s):  
Chris Derksen ◽  
Kevin Misurak ◽  
Ellsworth Ledrew ◽  
Joe Piwowar ◽  
Barry Goodison

The stochastic relationships between terrestrial snow water equivalent (SWE) and measures of the atmospheric circulation were investigated for the Canadian Prairies and the American Great Plains for the winter of 1988. Snow-cover extent, derived from EASE-grid SSM/I satellite data, and gridded atmospheric data from the National Meteorological Center were averaged at five-day intervals. Principal components analysis (PCA) was performed for the time series of SSM/I snow-cover imagery as well as for 700 mb geopotential height and temperature, 500 mb height and 700–500 mb thickness. Canonical correlation analysis of the derived principal component weights was used to identify relationships between atmospheric variables and SWE. Results of the PCA indicate that a high degree of variance in upper air variables (>75%) can be explained by the first three principal components, while the first three SWE components account for over 90% of the variance in the original data. Results of the canonical correlation analysis show positive relationships between snow-cover accumulation and a meridional pressure distribution pattern, while snow ablation is linked to a zonal atmospheric pressure pattern.


2020 ◽  
Vol 15 ◽  
Author(s):  
Chen-An Tsai ◽  
James J. Chen

Background: Gene set enrichment analyses (GSEA) provide a useful and powerful approach to identify differentially expressed gene sets with prior biological knowledge. Several GSEA algorithms have been proposed to perform enrichment analyses on groups of genes. However, many of these algorithms have focused on identification of differentially expressed gene sets in a given phenotype. Objective: In this paper, we propose a gene set analytic framework, Gene Set Correlation Analysis (GSCoA), that simultaneously measures within- and between-gene-set variation to identify sets of genes enriched for differential expression and highly co-related pathways. Methods: We apply co-inertia analysis to comparisons across gene sets in gene expression data to measure the co-structure of expression profiles in pairs of gene sets. Co-inertia analysis (CIA) is a multivariate method for identifying trends or co-relationships in multiple datasets that contain the same samples. The objective of CIA is to seek ordinations (dimension reduction diagrams) of two gene sets such that the squared covariance between the projections of the gene sets on successive axes is maximized. Simulation studies illustrate that CIA offers superior performance in identifying co-relationships between gene sets in all simulation settings when compared to correlation-based gene set methods. Results and Conclusion: We also combine between-gene-set CIA and GSEA to discover the relationships between gene sets significantly associated with phenotypes. In addition, we provide a graphical technique for visualizing and simultaneously exploring the associations between and within gene sets and their interaction and network. We then demonstrate integration of within- and between-gene-set variation using CIA and GSEA, applied to the p53 gene expression data using the c2 curated gene sets. Ultimately, the GSCoA approach provides an attractive tool for identification and visualization of novel associations between pairs of gene sets by integrating co-relationships between gene sets into gene set analysis.
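The CIA objective stated above, maximizing the squared covariance between projections of two tables on successive axes, can be sketched in its simplest form as a singular value decomposition of the cross-covariance matrix. The sketch assumes uniform sample weights and identity column metrics. The function name `co_inertia`, the returned dictionary keys, and the synthetic data are hypothetical, and this is not the GSCoA implementation.

```python
import numpy as np

def co_inertia(X, Y, n_axes=2):
    """Simplified co-inertia analysis of two tables sharing the same samples (rows).

    Assumes uniform row weights and identity column metrics. Successive pairs
    of singular vectors of the cross-covariance matrix give axes whose
    projections have maximal squared covariance.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    C = Xc.T @ Yc / n                              # cross-covariance (p x q)
    A, s, Bt = np.linalg.svd(C, full_matrices=False)
    A, B = A[:, :n_axes], Bt.T[:, :n_axes]
    # RV coefficient: a global measure of co-structure between 0 and 1.
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    rv = np.sum(C ** 2) / np.sqrt(np.sum(Sxx ** 2) * np.sum(Syy ** 2))
    return {
        "axes_x": A,                  # co-inertia axes for X
        "axes_y": B,                  # co-inertia axes for Y
        "sq_cov": s[:n_axes] ** 2,    # squared covariance on each axis pair
        "scores_x": Xc @ A,           # sample projections from X
        "scores_y": Yc @ B,           # sample projections from Y
        "rv": rv,
    }

if __name__ == "__main__":
    # Two synthetic "gene sets" driven by one shared latent factor.
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(50, 1))
    X = latent @ rng.normal(size=(1, 8)) + 0.5 * rng.normal(size=(50, 8))
    Y = latent @ rng.normal(size=(1, 6)) + 0.5 * rng.normal(size=(50, 6))
    res = co_inertia(X, Y)
    print("RV coefficient:", round(float(res["rv"]), 3))
```

In this simplified setting the RV coefficient summarizes the overall co-structure of a pair of gene sets, while the paired sample scores can be plotted to see how strongly the two sets agree along each axis.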


Author(s):  
Qiang Zhao ◽  
Jianguo Sun

Statistical analysis of microarray gene expression data has recently attracted a great deal of attention. One problem of interest is to relate genes to survival outcomes of patients with the purpose of building regression models for the prediction of future patients' survival based on their gene expression data. For this, several authors have discussed the use of the proportional hazards or Cox model after reducing the dimension of the gene expression data. This paper presents a new approach to conduct the Cox survival analysis of microarray gene expression data with the focus on models' predictive ability. The method modifies the correlation principal component regression (Sun, 1995) to handle the censoring problem of survival data. The results based on simulated data and a set of publicly available data on diffuse large B-cell lymphoma show that the proposed method works well in terms of models' robustness and predictive ability in comparison with some existing partial least squares approaches. Also, the new approach is simpler and easier to implement.

