surrogate variable analysis
Recently Published Documents



2020
Author(s): Jiacheng Dai, Yu Chen, Chao Chen, Chunyu Liu

Abstract: Agonal factors, the conditions that occur just prior to death, can impact the molecular quality of postmortem brains, influencing gene expression results. Nevertheless, study designs using postmortem brain tissue rarely, if ever, account for these factors, and previous studies have neither documented nor adjusted for agonal factors. Our study used gene expression data of 262 samples from ROSMAP with the following terminal states recorded for each donor: surgery, fever, infection, unconsciousness, difficulty breathing, and mechanical ventilation. Differential gene expression and weighted gene co-expression network analyses (WGCNA) showed that fever and infection were the primary contributors to brain gene expression changes. Fever and infection also contributed to brain cell-type specific gene expression and cell proportion changes. Furthermore, the gene expression patterns implicated in fever and infection were distinct from those of other agonal factors. We also found that previous studies of gene expression in postmortem brains were confounded by variables of hypoxia or oxygen level pathways. Therefore, correction through probabilistic estimation of expression residuals (PEER) or surrogate variable analysis (SVA) is recommended to control for unknown agonal factors. Our analyses revealed that fever and infection contribute to gene expression changes in postmortem brains and emphasized the necessity of study designs that document and account for agonal factors.


2020
Author(s): Annie I. Arockiaraj, Dongjing Liu, John R. Shaffer, Theresa A. Koleck, Elizabeth A. Crago, ...

Abstract: One challenge in conducting DNA methylation-based epigenome-wide association studies (EWAS) is the appropriate cleaning and quality-checking of the methylation values to minimize biases and experimental artifacts, while simultaneously retaining potential biological signals. These issues are compounded in studies that include multiple tissue types, and/or tissues for which reference data are unavailable to assist in adjusting for cell-type mixture, for example cerebrospinal fluid (CSF). For our study that evaluated blood and CSF taken from aneurysmal subarachnoid hemorrhage (aSAH) patients, we developed a protocol to clean and quality-check genome-wide methylation levels and compared the methylomic profiles of the two tissues to determine whether blood is a suitable surrogate for CSF. CSF samples were collected from 279 aSAH patients longitudinally during the first 14 days of hospitalization, and a subset of 88 of these patients also provided blood samples within the first two days. Quality control (QC) procedures included identification and exclusion of poor performing samples and low-quality probes, functional normalization, and correction for cell-type heterogeneity via surrogate variable analysis (SVA). A significant difference in the rate of poor sample performance was observed between blood (1.1% failing QC) and CSF (9.12% failing QC; p = 0.003). Functional normalization increased the concordance of methylation values among technical replicates in both CSF and blood. Likewise, SVA improved the asymptotic behavior of the test of association in a simulated EWAS under the null hypothesis. To determine the suitability of blood as a surrogate for CSF, we calculated the correlation of adjusted methylation values between blood and CSF globally and by genomic regions. Overall, mean correlation (r < 0.26) was low, suggesting that blood is not a suitable surrogate for global methylation in CSF. However, differences in the magnitude of the correlation were observed by genomic region (CpG island, shore, shelf, open sea; p < 0.001 for all) and orientation with respect to nearby genes (3’ UTR, transcription start site, exon, body, 5’ UTR; p < 0.01 for all). In conclusion, the correlation analysis and QC pipelines indicated that DNA extracted from blood was not, overall, a suitable surrogate for DNA extracted from CSF in aSAH methylomic studies.
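
As a rough illustration of the blood versus CSF comparison described in this abstract, the following Python sketch computes a Pearson correlation per probe across paired samples and then averages those correlations within each genomic region. The abstract does not say whether correlations were computed per probe or per sample, so the per-probe reading, the function names, and the array layout (samples in rows, probes in columns) are all assumptions made for illustration.

```python
import numpy as np


def per_probe_correlation(blood, csf):
    """Pearson r for each probe (column) across paired samples (rows)."""
    b = blood - blood.mean(axis=0)
    c = csf - csf.mean(axis=0)
    numerator = (b * c).sum(axis=0)
    denominator = np.sqrt((b ** 2).sum(axis=0) * (c ** 2).sum(axis=0))
    return numerator / denominator


def mean_correlation_by_region(r, regions):
    """Average the per-probe correlations within each region label."""
    regions = np.asarray(regions)
    return {str(label): float(r[regions == label].mean())
            for label in np.unique(regions)}
```

Calling `mean_correlation_by_region(per_probe_correlation(blood, csf), regions)` with a hypothetical `regions` array holding one label per probe (for example "island" or "shore") would return one mean correlation per label.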


2020, Vol 36 (11), pp. 3582-3584
Author(s): Nathan Lawlor, Eladio J Marquez, Donghyung Lee, Duygu Ucar

Abstract. Summary: Single-cell RNA-sequencing (scRNA-seq) technology enables studying gene expression programs from individual cells. However, these data are subject to diverse sources of variation, including ‘unwanted’ variation that needs to be removed in downstream analyses (e.g. batch effects) and ‘wanted’ or biological sources of variation (e.g. variation associated with a cell type) that need to be precisely described. Surrogate variable analysis (SVA)-based algorithms are commonly used for batch correction and more recently for studying ‘wanted’ variation in scRNA-seq data. However, interpreting whether these variables are biologically meaningful or stem from technical sources remains a challenge. To facilitate the interpretation of surrogate variables detected by algorithms including IA-SVA, SVA or ZINB-WaVE, we developed an R Shiny application [Visual Surrogate Variable Analysis (V-SVA)] that provides a web-browser interface for the identification and annotation of hidden sources of variation in scRNA-seq data. This interactive framework includes tools for discovery of genes associated with detected sources of variation, gene annotation using publicly available databases and gene sets, and data visualization using dimension reduction methods. Availability and implementation: The V-SVA Shiny application is publicly hosted at https://vsva.jax.org/ and the source code is freely available at https://github.com/nlawlor/V-SVA. Supplementary information: Supplementary data are available at Bioinformatics online.


2020
Author(s): Basile Jumentier, Kevin Caye, Barbara Heude, Johanna Lepeule, Olivier François

Abstract: Association of phenotypes or exposures with genomic and epigenomic data faces important statistical challenges. One of these challenges is to remove variation due to unobserved confounding factors, such as individual ancestry or cell-type composition in tissues. This issue can be addressed with penalized latent factor regression models, where penalties are introduced to cope with high dimension in the data. If a relatively small proportion of genomic or epigenomic markers correlate with the variable of interest, sparsity penalties may help to capture the relevant associations, but the improvement over non-sparse approaches has not been fully evaluated yet. In this study, we introduced least-squares algorithms that jointly estimate effect sizes and confounding factors in sparse latent factor regression models. Computer simulations provided evidence that sparse latent factor regression models achieve higher statistical performance than other sparse methods, including the least absolute shrinkage and selection operator (LASSO) and a Bayesian sparse linear mixed model (BSLMM). Additional simulations based on real data showed that sparse latent factor regression models were more robust to departure from the generative model than non-sparse approaches, such as surrogate variable analysis (SVA) and other methods. We applied sparse latent factor regression models to a genome-wide association study of a flowering trait for the plant Arabidopsis thaliana and to an epigenome-wide association study of smoking status in pregnant women. For both applications, sparse latent factor regression models facilitated the estimation of non-null effect sizes while avoiding multiple testing problems. The results were not only consistent with previous discoveries, but they also pinpointed new genes with functional annotations relevant to each application.


2019, Vol 36 (4), pp. 852-860
Author(s): Kevin Caye, Basile Jumentier, Johanna Lepeule, Olivier François

Abstract: Gene-environment association (GEA) studies are essential to understand the past and ongoing adaptations of organisms to their environment, but those studies are complicated by confounding due to unobserved demographic factors. Although the confounding problem has recently received considerable attention, the proposed approaches do not scale with the high dimensionality of genomic data. Here, we present a new estimation method for latent factor mixed models (LFMMs) implemented in an upgraded version of the corresponding computer program. We developed a least-squares estimation approach for confounder estimation that provides a unique framework for several categories of genomic data, not restricted to genotypes. The new algorithm is several orders of magnitude faster than existing GEA approaches and than our previous version of the LFMM program. In addition, the new method outperforms other fast approaches based on principal component or surrogate variable analysis. We illustrate the program use with analyses of the 1000 Genomes Project data set, leading to new findings on adaptation of humans to their environment, and with analyses of DNA methylation profiles providing insights on how tobacco consumption could affect DNA methylation in patients with rheumatoid arthritis. Software availability: Software is available in the R package lfmm at https://bcm-uga.github.io/lfmm/.


2018
Author(s): L Collado-Torres, EE Burke, A Peterson, JH Shin, RE Straub, ...

Abstract: Recent large-scale genomics efforts have better characterized the molecular correlates of schizophrenia in postmortem human neocortex, but not in the hippocampus, a brain region prominently implicated in its pathogenesis. Here in the second phase of the BrainSeq Consortium (Phase II), we have generated RiboZero RNA-seq data for 900 samples across both the dorsolateral prefrontal cortex (DLPFC) and the hippocampus (HIPPO) for 551 individuals (286 affected by schizophrenia disorder: SCZD). We identify substantial regional differences in gene expression, in both pre- and post-natal life, and find widespread differences in how genes are regulated across development. By extending quality surrogate variable analysis (qSVA) to multiple brain regions, we identified 48 and 245 differentially expressed genes (DEG) by SCZD diagnosis (FDR<5%) in HIPPO and DLPFC, respectively, with surprisingly minimal overlap in DEG between the two brain regions. We further identified 205,618 brain region-dependent eQTLs (FDR<1%) and found that 124 GWAS risk loci contain eQTLs in at least one of the regions. We also identify potential molecular correlates of in vivo evidence of altered prefrontal-hippocampal functional coherence in schizophrenia. These results underscore the complexity and regional heterogeneity of the transcriptional correlates of schizophrenia, and suggest future schizophrenia therapeutics may need to target molecular pathologies localized to specific brain regions.


2018
Author(s): Kevin Caye, Basile Jumentier, Olivier François

Abstract. Motivation: Genome-wide, epigenome-wide and gene-environment association studies are plagued with the problems of confounding and causality. Although those problems have received considerable attention in each application field, no consensus has emerged on which approaches are the most appropriate to solve them. Current methods use approximate heuristics for estimating confounders, and often ignore correlation between confounders and primary variables, resulting in suboptimal power and precision. Results: In this study, we developed a least-squares estimation theory of confounder estimation using latent factor models, providing a unique framework for several categories of genomic data. Based on statistical learning methods, the proposed algorithms are fast and efficient, and can be proven to provide optimal solutions mathematically. In simulations, the algorithms outperformed commonly used methods based on principal components and surrogate variable analysis. In analysis of methylation profiles and genotypic data, they provided new insights on the molecular basis of diseases and adaptation of humans to their environment. Availability and implementation: Software is available in the R package lfmm at https://bcm-uga.github.io/lfmm/.
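
To make the idea of least-squares confounder estimation with a latent factor model more concrete, here is a minimal Python sketch. It assumes a ridge-penalized objective, ||Y - X Bᵀ - W||² + λ||B||² with W constrained to rank K, which the abstract itself does not spell out; the function names, the default λ, and the simulation settings are likewise assumptions for illustration. It is not the lfmm package's code, which remains the reference for the authors' implementation. Under this assumed objective, rotating the data into a basis aligned with X and down-weighting the directions spanned by X reduces the problem to a truncated SVD followed by a ridge regression.

```python
import numpy as np


def ridge_latent_factor_fit(Y, X, K, lam=1e-5):
    """Minimize ||Y - X B^T - W||_F^2 + lam * ||B||_F^2 subject to rank(W) <= K.

    Y: (n, p) responses, X: (n, d) primary variables, with n > d.
    Returns factor scores U (n, K), the rank-K term W (n, p), effects B (p, d).
    """
    n, p = Y.shape
    d = X.shape[1]
    # Orthonormal basis of sample space; its first d columns span X
    # when X has full column rank.
    Q, sigma, _ = np.linalg.svd(X, full_matrices=True)
    # Shrink the directions spanned by X, leave the orthogonal ones untouched.
    weights = np.ones(n)
    weights[:d] = np.sqrt(lam / (lam + sigma ** 2))
    # In rotated, weighted coordinates the problem is a rank-K approximation.
    Z = weights[:, None] * (Q.T @ Y)
    Uz, sz, Vzt = np.linalg.svd(Z, full_matrices=False)
    U = Q @ (Uz[:, :K] / weights[:, None])   # latent factor scores
    W = U @ (sz[:K, None] * Vzt[:K])         # rank-K confounding term
    # Ridge regression of the de-confounded responses on X.
    B = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ (Y - W)).T
    return U, W, B


def simulate_confounded_data(n=100, p=300, K=2, seed=0):
    """Toy data in which hidden factors are correlated with the primary variable."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 1))
    U = 0.7 * x + rng.normal(size=(n, K))    # confounders correlated with x
    V = rng.normal(size=(p, K))              # confounder loadings
    B = np.zeros((p, 1))
    B[:20] = 1.5                             # only 20 features carry a true effect
    Y = x @ B.T + U @ V.T + rng.normal(size=(n, p))
    return Y, x, B
```

On the simulated data, `ridge_latent_factor_fit(Y, x, K=2)` can be compared with a plain least-squares fit of Y on x to see how far ignoring the hidden factors distorts the effect estimates.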


2017, Vol 1, pp. 168-183
Author(s): Wilson Wen Bin Goh, Judy Chia-Ghee Sng, Jie Yin Yee, Yuen Mei See, Tih-Shih Lee, ...

The ultra-high risk (UHR) state was originally conceived to identify individuals at imminent risk of developing psychosis. Although recent studies have suggested that most individuals designated UHR do not develop psychosis, they constitute a distinctive group, exhibiting cognitive and functional impairments alongside multiple psychiatric morbidities. UHR characterization using molecular markers may improve understanding, provide novel insight into pathophysiology, and perhaps improve psychosis prediction reliability. Whole-blood gene expression data from 56 UHR subjects and 28 healthy controls are checked for the existence of a consistent gene expression profile (signature) underlying UHR, across a variety of normalization and heterogeneity-removal techniques, including simple log-conversion, quantile normalization, gene fuzzy scoring (GFS), and surrogate variable analysis. During functional analysis, consistent and reproducible identification of important genes depends largely on how data are normalized. Normalization techniques that address sample heterogeneity are superior. The best performer, the unsupervised GFS, produced a strong and concise 12-gene signature, enriched for psychosis-associated genes. Importantly, when applied on random subsets of data, classifiers built with GFS are “meaningful” in the sense that the classifier models built using genes selected after other forms of normalization do not outperform random ones, but GFS-derived classifiers do. Data normalization can present highly disparate interpretations on biological data. Comparative analysis has shown that GFS is efficient at preserving signals while eliminating noise. Using this, we demonstrate confidently that the UHR designation is well correlated with a distinct blood-based gene signature.


2017, Vol 114 (38), pp. 10286-10291
Author(s): Xin Fang, Anand Sastry, Nathan Mih, Donghyuk Kim, Justin Tan, ...

Transcriptional regulatory networks (TRNs) have been studied intensely for >25 y. Yet, even for the Escherichia coli TRN, probably the best characterized TRN, several questions remain. Here, we address three questions: (i) How complete is our knowledge of the E. coli TRN; (ii) how well can we predict gene expression using this TRN; and (iii) how robust is our understanding of the TRN? First, we reconstructed a high-confidence TRN (hiTRN) consisting of 147 transcription factors (TFs) regulating 1,538 transcription units (TUs) encoding 1,764 genes. The 3,797 high-confidence regulatory interactions were collected from published, validated chromatin immunoprecipitation (ChIP) data and RegulonDB. For 21 different TF knockouts, up to 63% of the differentially expressed genes in the hiTRN were traced to the knocked-out TF through regulatory cascades. Second, we trained supervised machine learning algorithms to predict the expression of 1,364 TUs given TF activities using 441 samples. The algorithms accurately predicted condition-specific expression for 86% (1,174 of 1,364) of the TUs, while 193 TUs (14%) were predicted better than random TRNs. Third, we identified 10 regulatory modules whose definitions were robust against changes to the TRN or expression compendium. Using surrogate variable analysis, we also identified three unmodeled factors that systematically influenced gene expression. Our computational workflow comprehensively characterizes the predictive capabilities and systems-level functions of an organism’s TRN from disparate data types.


2017, Vol 114 (27), pp. 7130-7135
Author(s): Andrew E. Jaffe, Ran Tao, Alexis L. Norris, Marc Kealhofer, Abhinav Nellore, ...

RNA sequencing (RNA-seq) is a powerful approach for measuring gene expression levels in cells and tissues, but it relies on high-quality RNA. We demonstrate here that statistical adjustment using existing quality measures largely fails to remove the effects of RNA degradation when RNA quality associates with the outcome of interest. Using RNA-seq data from molecular degradation experiments of human primary tissues, we introduce a method—quality surrogate variable analysis (qSVA)—as a framework for estimating and removing the confounding effect of RNA quality in differential expression analysis. We show that this approach results in greatly improved replication rates (>3×) across two large independent postmortem human brain studies of schizophrenia and also removes potential RNA quality biases in earlier published work that compared expression levels of different brain regions and other diagnostic groups. Our approach can therefore improve the interpretation of differential expression analysis of transcriptomic data from human tissue.
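
The general idea of adjusting a differential expression model for RNA quality with surrogate variables can be sketched as follows. This Python example assumes that quality surrogate variables are taken as the leading principal components of a matrix of degradation-associated features and then added as covariates in a per-gene linear model. The abstract does not describe the procedure at this level of detail, so the function names, the matrix layout (samples in rows), and the choice of plain principal components are illustrative assumptions rather than the published qSVA implementation.

```python
import numpy as np


def top_principal_components(features, k):
    """First k principal component scores of a samples x features matrix."""
    centered = features - features.mean(axis=0)
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :k] * s[:k]


def adjusted_effects(expr, outcome, covariates):
    """Per-gene OLS of expr on intercept + outcome + covariates.

    expr: (n, genes), outcome: (n,), covariates: (n, k), where k may be 0.
    Returns the outcome coefficient for each gene.
    """
    n = expr.shape[0]
    design = np.column_stack([np.ones(n), outcome, covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return coef[1]
```

Passing `top_principal_components(degradation_features, k)` as `covariates` gives the quality-adjusted estimates, while passing an empty `(n, 0)` array gives the unadjusted ones for comparison; `degradation_features` is a hypothetical input name.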

