Estimation of Mediation Effect for High-dimensional Omics Mediators with Application to the Framingham Heart Study

2019 ◽  
Author(s):  
Tianzhong Yang ◽  
Jingbo Niu ◽  
Han Chen ◽  
Peng Wei

SUMMARY Environmental exposures can regulate intermediate molecular phenotypes, such as gene expression, by different mechanisms and thereby lead to various health outcomes. It is of significant scientific interest to unravel the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposure and traits. Mediation analysis is an important tool for investigating such relationships. However, it has mainly focused on low-dimensional settings, and there is a lack of a good measure of the total mediation effect. Here, we extend an R-squared (Rsq) effect size measure, originally proposed in the single-mediator setting, to the moderate- and high-dimensional mediator settings in the mixed model framework. Based on extensive simulations, we compare our measure and estimation procedure with several frequently used mediation measures, including product, proportion, and ratio measures. Our Rsq measure has small bias and variance under the correctly specified model. To mitigate potential bias induced by non-mediators, we examine two variable selection procedures, i.e., iterative sure independence screening and false discovery rate control, to exclude the non-mediators. We evaluate the consistency of the proposed estimation procedures and introduce a resampling-based confidence interval. By applying the proposed estimation procedure, we find that more than half of the aging-related variations in systolic blood pressure can be explained by gene expression profiles in the Framingham Heart Study.
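For orientation, the product, proportion, and ratio measures named in the summary can be sketched for the simplest case of one exposure X, one mediator M, and one outcome Y. This is not the authors' estimation procedure. It assumes the usual two-regression setup (M on X, then Y on X and M) fitted by ordinary least squares, and the function name `mediation_measures` and the toy data are hypothetical.

```python
def _mean(values):
    return sum(values) / len(values)


def _cov(u, v):
    # Sample covariance; _cov(u, u) is the sample variance.
    mu, mv = _mean(u), _mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def mediation_measures(x, m, y):
    """Single-mediator product, proportion, and ratio measures via OLS.

    a  : slope of M ~ X
    b  : coefficient of M in Y ~ X + M
    c' : coefficient of X in Y ~ X + M (direct effect)
    """
    sxx, smm, sxm = _cov(x, x), _cov(m, m), _cov(x, m)
    sxy, smy = _cov(x, y), _cov(m, y)

    a = sxm / sxx
    # Solve the 2x2 normal equations of Y ~ X + M in closed form.
    det = sxx * smm - sxm ** 2
    c_prime = (smm * sxy - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det

    indirect = a * b                 # product measure
    total = indirect + c_prime       # equals the slope of Y ~ X under OLS
    return {
        "product": indirect,
        "proportion": indirect / total,
        "ratio": indirect / c_prime,
    }


# Toy data (illustrative only): y = 3*m + 1*x exactly, and m has slope 2 on x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
m = [1.0, 0.0, 6.0, 4.0, 9.0]
y = [3.0 * mi + xi for mi, xi in zip(m, x)]
measures = mediation_measures(x, m, y)
```

The proportion and ratio measures divide by the total and direct effects respectively, so they become unstable when those quantities are near zero, which is one motivation for considering alternative effect-size measures.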

2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Tianzhong Yang ◽  
Jingbo Niu ◽  
Han Chen ◽  
Peng Wei

Abstract Background: Environmental exposures can regulate intermediate molecular phenotypes, such as gene expression, by different mechanisms and thereby lead to various health outcomes. It is of significant scientific interest to unravel the role of potentially high-dimensional intermediate phenotypes in the relationship between environmental exposure and traits. Mediation analysis is an important tool for investigating such relationships. However, it has mainly focused on low-dimensional settings, and there is a lack of a good measure of the total mediation effect. Here, we extend an R-squared ($R^2$) effect size measure, originally proposed in the single-mediator setting, to the moderate- and high-dimensional mediator settings in the mixed model framework. Results: Based on extensive simulations, we compare our measure and estimation procedure with several frequently used mediation measures, including product, proportion, and ratio measures. Our $R^2$-based second-moment measure has small bias and variance under the correctly specified model. To mitigate potential bias induced by non-mediators, we examine two variable selection procedures, i.e., iterative sure independence screening and false discovery rate control, to exclude the non-mediators. We establish the consistency of the proposed estimation procedures and introduce a resampling-based confidence interval. By applying the proposed estimation procedure, we found that 38% of the age-related variations in systolic blood pressure can be explained by gene expression profiles in the Framingham Heart Study of 1711 individuals. An R package “RsqMed” is available on CRAN. Conclusion: R-squared ($R^2$) is an effective and efficient measure of the total mediation effect, especially in high-dimensional settings.
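To make the single-mediator starting point concrete, here is a minimal sketch of an R-squared mediation measure. It assumes the commonly cited single-mediator form, R2_med = r2(M,Y) + r2(X,Y) - R2(Y ~ X + M), that is, the variance in Y explained jointly by X and M. It does not reproduce the high-dimensional mixed-model estimator described in the abstract or the interface of the RsqMed package, and the function names and toy data are hypothetical.

```python
import math


def pearson(u, v):
    # Pearson correlation computed from centered sums of products.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def rsq_mediation(x, m, y):
    """Single-mediator R-squared mediation measure (assumed form, see lead-in)."""
    r_xy, r_my, r_xm = pearson(x, y), pearson(m, y), pearson(x, m)
    # R-squared of Y ~ X + M written in terms of pairwise correlations.
    r2_full = (r_xy ** 2 + r_my ** 2 - 2.0 * r_xy * r_my * r_xm) / (1.0 - r_xm ** 2)
    # Variance in Y shared between X and M.
    return r_my ** 2 + r_xy ** 2 - r2_full


# Toy data (illustrative only): y is an exact linear function of x and m,
# so the full-model R-squared equals 1.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
m = [1.0, 0.0, 6.0, 4.0, 9.0]
y = [3.0 * mi + xi for mi, xi in zip(m, x)]
r2_med = rsq_mediation(x, m, y)
```

Unlike the proportion and ratio measures, this quantity is built from variance components rather than from a quotient of regression coefficients, which is the property the abstract extends to many mediators.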


Author(s):  
Ingrid M. Lönnstedt ◽  
Sven Nelander

Abstract The systematic study of transcriptional responses to genetic and chemical perturbations in human cells is still in its early stages. The largest available dataset to date is the newly released L1000 compendium. With its 1.3 million gene expression profiles of treated human cells it offers many opportunities for biomedical data mining, but also data normalization challenges of new dimensions. We developed a novel and practical approach to obtain accurate estimates of fold change response profiles from L1000, based on the RUV (Remove Unwanted Variation) statistical framework. Extending RUV to a big data setting, we propose an estimation procedure, in which an underlying RUV model is tuned by feedback through dataset-specific statistical measures, reflecting


2006 ◽  
Vol 18 (2) ◽  
pp. 239
Author(s):  
J. Piedrahita ◽  
S. Bischoff ◽  
J. Estrada ◽  
B. Freking ◽  
D. Nonneman ◽  
...  

Genomic imprinting arises from differential epigenetic markings including DNA methylation and histone modifications and results in one allele being expressed in a parent-of-origin specific manner. For further insight into the porcine epigenome, gene expression profiles of parthenogenetic (PRT; two maternally derived chromosome sets) and biparental embryos (BP; one maternal and one paternal set of chromosomes) were compared using microarrays. Comparison of the expression profiles of the two tissue types permits identification of both maternally and paternally imprinted genes and thus the degree of conservation of imprinted genes between swine and other mammalian species. Diploid porcine parthenogenetic fetuses were generated using follicular oocytes (BOMED, Madison, WI, USA). Oocytes with a visible polar body were activated using a single square pulse of direct current of 50 V/mm for 100 µs and diploidized by culture in 10 µg/mL cycloheximide for 6 h to limit extrusion of the second polar body. Following culture, BP embryos obtained by natural matings, and PRT embryos, were surgically transferred to oviducts on the first day of estrus. Fetuses recovered at 28-30 days of gestation were dissected to separate viscera including brain, liver, and placenta; the visceral tissues were then flash-frozen in liquid nitrogen. Porcine fibroblast tissue was obtained from the remaining carcass by mincing, trypsinization, and plating cells in α-MEM. Total RNA was extracted from frozen tissue or cell culture using RNA Aqueous kit (Ambion, Austin, TX, USA) according to the manufacturer's protocol. Gene expression differences between BP and PRT tissues were determined using the GeneChip® Porcine Genome Array (Affymetrix, Santa Clara, CA) containing 23 256 transcripts from Sus scrofa and representing 42 genes known to be imprinted in human and/or mice. Triplicate arrays were utilized for each tissue type, and for PRT versus BP combination.
Significant differential gene expression was identified by a linear mixed model analysis using SAS 5.0 (SAS Institute, Cary, NC, USA). Storey's q-value method was used to correct for multiple testing at q ≤ 0.05. The following genes were classified as imprinted on the basis of their expression profiles: In fibroblasts, ARHI, HTR2A, MEST, NDN, NNAT, PEG3, PLAGL1, PEG10, SGCE, SNRPN, and UBE3A; in liver, IGF2, PEG3, PLAGL1, PEG10, and SNRPN; in placenta, HTR2A, IGF2, MEST, NDN, NNAT, PEG3, PLAGL1, PEG10, and SNRPN; and in brain, none. Additionally, several genes not known to be imprinted in humans/mice were highly differentially expressed between the two tissue types. Overall, utilizing the PRT models and gene expression profiles, we have identified thirteen genes where imprinting is conserved between swine and humans/mice, and several candidate genes that represent potentially imprinted genes. Presently, our efforts are focused on the identification of single nucleotide polymorphisms (SNPs) to more carefully evaluate the behavior of these genes in normal and abnormal gestations and to test whether the candidate genes are indeed imprinted. This research was supported by USDA-CSREES grant 524383 to J. P. and B. F.


2019 ◽  
Author(s):  
Chan Wang ◽  
Jiyuan Hu ◽  
Martin J Blaser ◽  
Huilin Li

Abstract Motivation: Recent microbiome association studies have revealed important associations between microbiome and disease/health status. Such findings encourage scientists to dive deeper to uncover the causal role of microbiome in the underlying biological mechanism, and have led to applying statistical models to quantify causal microbiome effects and to identify the specific microbial agents. However, there are no existing causal mediation methods specifically designed to handle high dimensional and compositional microbiome data. Results: We propose a rigorous Sparse Microbial Causal Mediation Model (SparseMCMM) specifically designed for the high dimensional and compositional microbiome data in a typical three-factor (treatment, microbiome and outcome) causal study design. In particular, a linear log-contrast regression model and a Dirichlet regression model are proposed to estimate the causal direct effect of treatment and the causal mediation effects of microbiome at both the community and individual taxon levels. Regularization techniques are used to perform the variable selection in the proposed model framework to identify signature causal microbes. Two hypothesis tests on the overall mediation effect are proposed and their statistical significance is estimated by permutation procedures. Extensive simulated scenarios show that SparseMCMM has excellent performance in estimation and hypothesis testing. Finally, we showcase the utility of the proposed SparseMCMM method in a study in which the murine microbiome was manipulated, providing a clear and sensible causal path among antibiotic treatment, microbiome composition and mouse weight. Availability and implementation: https://sites.google.com/site/huilinli09/software and https://github.com/chanw0/SparseMCMM. Supplementary information: Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 11 (10) ◽  
pp. 2690-2698 ◽  
Author(s):  
Mirko Francesconi ◽  
Ben Lehner

Gene expression profiling is a fast, cheap and standardised analysis that provides a high dimensional measurement of the state of a biological sample, including of single cells. Computational methods to reconstruct the composition of samples and spatial and temporal information from expression profiles are described, as well as how they can be used to describe the effects of genetic variation.


2017 ◽  
Author(s):  
Brian Cleary ◽  
Le Cong ◽  
Eric S. Lander ◽  
Aviv Regev

Abstract RNA profiling is an excellent phenotype of cellular responses and tissue states, but can be costly to generate at the massive scale required for studies of regulatory circuits, genetic states or perturbation screens. Here, we draw on a series of advances over the last decade in the field of mathematics to establish a rigorous link between biological structure, data compressibility, and efficient data acquisition. We propose that very few random composite measurements – in which gene abundances are combined in a random linear combination – are needed to approximate the high-dimensional similarity between any pair of gene abundance profiles. We then show how finding latent, sparse representations of gene expression data would enable us to “decompress” a small number of random composite measurements and recover high-dimensional gene expression levels that were not measured (unobserved). We present a new algorithm for finding sparse, modular structure, which improves the ability to interpret samples in terms of small numbers of active modules, and show that the modular structure we find is sufficient to recover gene expression profiles from composite measurements (with ~100-fold fewer composite measurements than genes). Moreover, the knowledge that sparse, modular structures exist allows us to recover expression profiles from composite measurements, even without access to any training data. Finally, we present a proof-of-concept experiment for making composite measurements in the laboratory, involving the measurement of linear combinations of RNA abundances. Altogether, our results suggest new compressive modalities in experimental biology that can form a foundation for massive scaling in high-throughput measurements, while also offering new insights into the interpretation of high-dimensional data.
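The first claim in this abstract, that a small number of random composite measurements approximates the similarity between two expression profiles, can be illustrated with a plain random projection. The sketch below is not the authors' algorithm and does not attempt the sparse "decompression" step. It assumes Gaussian random weights, uses simulated profiles rather than real expression data, and all names and sizes (2000 genes, 200 composite measurements) are hypothetical choices.

```python
import math
import random


def random_composite_weights(n_measurements, n_genes, rng):
    # Each row defines one composite measurement: a random linear combination
    # of gene abundances. Scaling by 1/sqrt(n_measurements) keeps squared
    # distances unbiased after projection.
    scale = 1.0 / math.sqrt(n_measurements)
    return [
        [rng.gauss(0.0, 1.0) * scale for _ in range(n_genes)]
        for _ in range(n_measurements)
    ]


def composite_measurements(weights, profile):
    # One number per composite measurement (a matrix-vector product).
    return [sum(w * g for w, g in zip(row, profile)) for row in weights]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_genes, n_measurements = 2000, 200

# Simulated abundance profiles (illustrative only).
profile_a = [rng.random() for _ in range(n_genes)]
profile_b = [rng.random() for _ in range(n_genes)]

weights = random_composite_weights(n_measurements, n_genes, rng)
compressed_a = composite_measurements(weights, profile_a)
compressed_b = composite_measurements(weights, profile_b)

d_full = euclidean(profile_a, profile_b)
d_compressed = euclidean(compressed_a, compressed_b)
relative_error = abs(d_compressed - d_full) / d_full
```

With ten times fewer measurements than genes, the distance in the compressed space typically stays within a few percent of the full-dimensional distance. Recovering the unmeasured gene levels themselves requires the additional sparse, modular structure the abstract describes.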


2009 ◽  
Vol 6 (1) ◽  
Author(s):  
Andrej Kastrin

The high dimensionality of global gene expression profiles, where the number of variables (genes) is very large compared to the number of observations (samples), presents challenges that affect the generalizability and applicability of microarray analysis. Latent variable modeling offers a promising approach to deal with high-dimensional microarray data. The latent variable model is based on a few latent variables that capture most of the gene expression information. Here, we describe how to accomplish a reduction in dimension by a latent variable methodology, which can greatly reduce the number of features used to characterize microarray data. We propose a general latent variable framework for prediction of predefined classes of samples using gene expression profiles from microarray experiments. The framework consists of (i) selection of a smaller number of genes that are most differentially expressed between samples, (ii) dimension reduction using hierarchical clustering, where each cluster partition is identified as a latent variable, (iii) discretization of the gene expression matrix, (iv) fitting the Rasch item response model for genes in each cluster partition to estimate the expression of the latent variable, and (v) construction of a prediction model with latent variables as covariates to study the relationship between latent variables and phenotype. Two different microarray data sets are used to illustrate a general framework of the approach. We show that the predictive performance of our method is comparable to the current best approach based on an all-gene space. The method is general and can be applied to other high-dimensional data problems.


Circulation ◽  
2014 ◽  
Vol 129 (suppl_1) ◽  
Author(s):  
Michael M Mendelson ◽  
Brian Chen ◽  
Chunyu Liu ◽  
Roby Joehanes ◽  
Peter Munson ◽  
...  

Objective: To describe the influence of type of dietary fat on the activity of metabolic pathways, as measured by gene expression profiles, and the relation to plasma lipids. Background: Metabolic studies have demonstrated strong associations of dietary fatty acid composition with plasma lipids. Relatively little is known about the pathways and gene expression changes that mediate these relationships in the general population. Methods: We analyzed self-reported dietary intake of fatty acids, plasma lipid levels, and genome-wide gene expression data from Framingham Heart Study Offspring and Third Generation cohort participants. We excluded participants on lipid therapy. Multivariable linear regression models were conducted with plasma lipids as separate outcomes, energy-adjusted residuals of dietary fats as predictors, and adjustment for clinical and dietary covariates. Normalized gene expression from whole blood derived RNA was similarly modeled with additional adjustment for cell count and batch effects. Results: Among 3681 participants, higher polyunsaturated fatty acid (PUFA) intake was associated with lower LDL-C (estimated β [regression coefficient] = -0.5, p=0.002), higher HDL-C (β= 0.4, p<0.0001), and lower triglyceride (β= -0.009, p=0.0003) concentrations after adjustment for age, sex, carbohydrate, protein, and alcohol intake. Higher PUFA intake was associated with differential gene expression of cholesterol efflux transporters (ABCA1/ABCG1), LDL receptor degrader (IDOL), and non-lipoprotein metabolism related transcripts (FDR < 0.05). In contrast, higher saturated fat (SFA) intake showed inverse associations with ABCA1 expression levels and HDL cholesterol (Figure 1). Conclusions: Higher PUFA intake is associated with a less atherogenic lipid profile and higher ABCA1 expression with inverse associations for higher SFA intake.
Gene expression analysis reveals important links between dietary fat type, specific cholesterol metabolism pathways, and lipids in a community cohort.
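The "energy-adjusted residuals" used as predictors in the Methods can be sketched as follows. The abstract does not say how the adjustment was done, so this assumes the common residual method: regress each nutrient intake on total energy intake and keep the residuals. The function name and all numbers below are hypothetical toy values, not study data.

```python
def energy_adjusted_residuals(nutrient, energy):
    """Residuals from a simple linear regression of nutrient intake on energy.

    Assumed residual method (see lead-in): the residuals are uncorrelated
    with total energy intake by construction.
    """
    n = len(energy)
    mean_e = sum(energy) / n
    mean_n = sum(nutrient) / n
    sxx = sum((e - mean_e) ** 2 for e in energy)
    sxy = sum((e - mean_e) * (v - mean_n) for e, v in zip(energy, nutrient))
    slope = sxy / sxx
    intercept = mean_n - slope * mean_e
    return [v - (intercept + slope * e) for e, v in zip(energy, nutrient)]


# Toy values (illustrative only): total energy in kcal/day, PUFA in g/day.
energy_kcal = [1800.0, 2000.0, 2200.0, 2500.0, 3000.0]
pufa_g = [10.0, 14.0, 12.0, 18.0, 17.0]
pufa_adjusted = energy_adjusted_residuals(pufa_g, energy_kcal)
```

The adjusted values would then enter the multivariable regression in place of raw intake, so that an association with lipids reflects the composition of the diet and not its overall quantity.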

