Mapping Tumor-Specific Expression QTLs in Impure Tumor Samples

Mapping Intimacies ◽

10.1101/136614 ◽

2017 ◽

Cited By ~ 3

Author(s):

Douglas R. Wilson ◽

Wei Sun ◽

Joseph G. Ibrahim

Keyword(s):

Gene Expression ◽

Type I Error ◽

The Cancer Genome Atlas ◽

Type I ◽

Eqtl Mapping ◽

Rna Seq ◽

Specific Expression ◽

Normal Cells ◽

Technology Application ◽

Tumor Tissues

Abstract: The study of gene expression quantitative trait loci (eQTL) is an effective approach to illuminate the functional roles of genetic variants. Computational methods have been developed for eQTL mapping using gene expression data from microarray or RNA-seq technology. Applying these methods to eQTL mapping in tumor tissues is problematic because tumor tissues are composed of both tumor and infiltrating normal cells (e.g., immune cells), and eQTL effects may vary between tumor and infiltrating normal cells. To address this challenge, we have developed a new method for eQTL mapping using RNA-seq data from tumor samples. Our method separately estimates the eQTL effects in tumor and infiltrating normal cells using both total expression and allele-specific expression (ASE). We demonstrate that our method controls type I error rate and has higher power than some alternative approaches. We applied our method to study RNA-seq data from The Cancer Genome Atlas and illustrated the similarities and differences of eQTL effects in tumor and normal cells.
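
The idea of separating tumor and normal eQTL effects in a mixed sample can be illustrated with a much simpler model than the one the abstract describes. The sketch below assumes that each sample's tumor purity is known and that observed expression is a purity-weighted average of a normal-cell and a tumor-cell linear genotype effect. It uses ordinary least squares on total expression only and ignores ASE and count-data modeling, so it is a toy analogue and not the authors' method. The function name `fit_cell_type_eqtl` and all simulated values are hypothetical.

```python
import numpy as np

def fit_cell_type_eqtl(expr, genotype, purity):
    """Least-squares estimates of the (normal, tumor) genotype slopes.

    Assumed toy model (not the published method):
      expr = (1 - purity) * (a_N + b_N * g) + purity * (a_T + b_T * g) + noise
    """
    expr = np.asarray(expr, dtype=float)
    g = np.asarray(genotype, dtype=float)
    rho = np.asarray(purity, dtype=float)
    # Columns: normal intercept, normal slope, tumor intercept, tumor slope.
    X = np.column_stack([1 - rho, (1 - rho) * g, rho, rho * g])
    coef, *_ = np.linalg.lstsq(X, expr, rcond=None)
    return coef[1], coef[3]

# Simulated data with different eQTL effects in normal and tumor cells.
rng = np.random.default_rng(0)
n = 2000
genotype = rng.binomial(2, 0.3, size=n)    # minor-allele counts 0/1/2
purity = rng.uniform(0.2, 0.9, size=n)     # assumed known per sample
b_normal_true, b_tumor_true = 0.5, -0.3
expr = ((1 - purity) * (1.0 + b_normal_true * genotype)
        + purity * (2.0 + b_tumor_true * genotype)
        + rng.normal(0, 0.1, size=n))

b_normal_hat, b_tumor_hat = fit_cell_type_eqtl(expr, genotype, purity)
print(f"normal slope ~ {b_normal_hat:.2f}, tumor slope ~ {b_tumor_hat:.2f}")
```

The two slopes are identifiable here only because purity varies across samples; if every sample had the same purity, the four design columns would be linearly dependent.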

The genetic architecture of gene expression levels in wild baboons

eLife ◽

10.7554/elife.04729 ◽

2015 ◽

Vol 4 ◽

Cited By ~ 70

Author(s):

Jenny Tung ◽

Xiang Zhou ◽

Susan C Alberts ◽

Matthew Stephens ◽

Yoav Gilad

Keyword(s):

Gene Expression ◽

Genetic Architecture ◽

Eqtl Mapping ◽

Rna Seq ◽

Specific Expression ◽

Data Set ◽

Expression Levels ◽

Trait Locus ◽

Eqtl Data ◽

Gene Expression Levels

Primate evolution has been argued to result, in part, from changes in how genes are regulated. However, we still know little about gene regulation in natural primate populations. We conducted an RNA sequencing (RNA-seq)-based study of baboons from an intensively studied wild population. We performed complementary expression quantitative trait locus (eQTL) mapping and allele-specific expression analyses, discovering substantial evidence for, and surprising power to detect, genetic effects on gene expression levels in the baboons. eQTL were most likely to be identified for lineage-specific, rapidly evolving genes; interestingly, genes with eQTL significantly overlapped between baboons and a comparable human eQTL data set. Our results suggest that genes vary in their tolerance of genetic perturbation, and that this property may be conserved across species. Further, they establish the feasibility of eQTL mapping using RNA-seq data alone, and represent an important step towards understanding the genetic architecture of gene expression in primates.

On Using Local Ancestry to Characterize the Genetic Architecture of Human Phenotypes: Genetic Regulation of Gene Expression in Multiethnic or Admixed Populations as a Model

10.1101/483107 ◽

2018 ◽

Cited By ~ 1

Author(s):

Yizhen Zhong ◽

Minoli Perera ◽

Eric R. Gamazon

Keyword(s):

Gene Expression ◽

Complex Traits ◽

Genetic Architecture ◽

Genetic Regulation ◽

Regulation Of Gene Expression ◽

Type I ◽

Eqtl Mapping ◽

Entire Genome ◽

Local Ancestry ◽

Heritability Estimation

Abstract. Background: Understanding the nature of the genetic regulation of gene expression promises to advance our understanding of the genetic basis of disease. However, the methodological impact of using local ancestry in high-dimensional omics analyses in admixed populations, most prominently expression quantitative trait loci (eQTL) mapping and trait heritability estimation, remains critically underexplored. Results: Here we develop a statistical framework that characterizes the relationships among the determinants of the genetic architecture of an important class of molecular traits. We estimate the trait variance explained by ancestry using local admixture relatedness between individuals. Using National Institute of General Medical Sciences (NIGMS) and Genotype-Tissue Expression (GTEx) datasets, we show that use of local ancestry can substantially improve eQTL mapping and heritability estimation and characterize the sparse versus polygenic component of gene expression in admixed and multiethnic populations, respectively. Using simulations of diverse genetic architectures to estimate trait heritability and the level of confounding, we show improved accuracy given individual-level data and evaluate a summary-statistics-based approach. Furthermore, we provide a computationally efficient approach to local ancestry analysis in eQTL mapping while increasing control of type I and type II error over traditional approaches. Conclusion: Our study has important methodological implications for genetic analysis of omics traits across a range of genomic contexts, from a single variant to a prioritized region to the entire genome. Our findings highlight the importance of using local ancestry to better characterize the heritability of complex traits and to more accurately map genetic associations.

DECENT: differential expression with capture efficiency adjustmeNT for single-cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz453 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5155-5162 ◽

Cited By ~ 10

Author(s):

Chengzhong Ye ◽

Terence P Speed ◽

Agus Salim

Keyword(s):

Single Cell ◽

Differential Expression ◽

Type I Error ◽

R Package ◽

Supplementary Information ◽

Type I ◽

Common Phenomenon ◽

Rna Seq ◽

Capture Process ◽

Technological Platforms

Abstract. Motivation: Dropout is a common phenomenon in single-cell RNA-seq (scRNA-seq) data, and when left unaddressed it affects the validity of the statistical analyses. Despite this, few current methods for differential expression (DE) analysis of scRNA-seq data explicitly model the process that gives rise to the dropout events. We develop DECENT, a method for DE analysis of scRNA-seq data that explicitly and accurately models the molecule capture process in scRNA-seq experiments. Results: We show that DECENT demonstrates improved DE performance over existing DE methods that do not explicitly model dropout. This improvement is consistently observed across several public scRNA-seq datasets generated using different technological platforms. The improvement is especially large when the capture process is overdispersed. DECENT maintains type I error well while achieving better sensitivity. Its performance without spike-ins is almost as good as when spike-ins are used to calibrate the capture model. Availability and implementation: The method is implemented as an R package publicly available from https://github.com/cz-ye/DECENT. Supplementary information: Supplementary data are available at Bioinformatics online.

Optimal selection of genetic variants for adjustment of population stratification in European association studies

Briefings in Bioinformatics ◽

10.1093/bib/bbz023 ◽

2019 ◽

Vol 21 (3) ◽

pp. 753-761 ◽

Cited By ~ 2

Author(s):

Regina Brinster ◽

Dominique Scherer ◽

Justo Lorenzo Bermejo

Keyword(s):

Genetic Variants ◽

Population Stratification ◽

Statistical Power ◽

Type I Error ◽

Association Studies ◽

Reference Sample ◽

Error Rates ◽

The Cancer Genome Atlas ◽

Type I ◽

Genotype Data

Abstract: Population stratification is usually corrected using principal component analysis (PCA) of genome-wide genotype data, even in populations considered genetically homogeneous, such as Europeans. Genotyping only a small number of genetic variants that show large differences in allele frequency among subpopulations, so-called ancestry-informative markers (AIMs), instead of the whole genome for stratification adjustment could be an advantage for replication studies and candidate gene/pathway studies. Here we compare the correction performance of classical and robust principal components (PCs) with the use of AIMs selected according to four different methods: the informativeness for assignment measure ($IN$-AIMs), the combination of PCA and F-statistics, PCA-correlated measurement and the PCA weighted loadings for each genetic variant. We used real genotype data from the Population Reference Sample and The Cancer Genome Atlas to simulate European genetic association studies and to quantify type I error rate and statistical power in different case–control settings. In studies with the same numbers of cases and controls per country and control-to-case ratios reflecting actual rates of disease prevalence, no adjustment for population stratification was required. The unnecessary inclusion of the country of origin, PCs or AIMs as covariates in the regression models increased type I error rates. In studies with cases and controls from separate countries, no investigated method was able to adequately correct for population stratification. The first classical and the first two robust PCs achieved the lowest (although inflated) type I error, followed at some distance by the first eight $IN$-AIMs.
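
As a rough illustration of the classical PCA step mentioned in this abstract, the sketch below computes the top principal components of a samples-by-variants genotype matrix with a singular value decomposition. It covers only classical PCs (not robust PCs or AIM selection), the function name `top_genotype_pcs` is hypothetical, and the two simulated subpopulations use made-up allele frequencies chosen so that the first PC separates them clearly.

```python
import numpy as np

def top_genotype_pcs(genotypes, n_pcs=2):
    """Return the top principal-component scores of a genotype matrix.

    genotypes: array of shape (n_samples, n_variants) with allele counts 0/1/2.
    """
    G = np.asarray(genotypes, dtype=float)
    centered = G - G.mean(axis=0)            # center each variant
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_pcs] * S[:n_pcs]          # sample scores on each PC

# Two simulated subpopulations with very different allele frequencies.
rng = np.random.default_rng(1)
pop_a = rng.binomial(2, 0.2, size=(50, 100))
pop_b = rng.binomial(2, 0.8, size=(50, 100))
pcs = top_genotype_pcs(np.vstack([pop_a, pop_b]), n_pcs=2)
pc1 = pcs[:, 0]
print("mean PC1, population A:", pc1[:50].mean())
print("mean PC1, population B:", pc1[50:].mean())
```

In an association model these scores would enter as covariates; note that the abstract reports that including them when no adjustment is needed increased type I error rates.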

No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data

Statistical Applications in Genetics and Molecular Biology ◽

10.1515/sagmb-2017-0010 ◽

2017 ◽

Vol 16 (2) ◽

Cited By ~ 1

Author(s):

Aaron T. L. Lun ◽

Gordon K. Smyth

Keyword(s):

Software Package ◽

Error Control ◽

Degrees Of Freedom ◽

Linear Models ◽

Type I Error ◽

Real Data ◽

Type I ◽

Rna Seq ◽

Study Gene Expression ◽

Complex Models

Abstract: RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
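
The d.f. issue can be made concrete for the simplest case, a one-way layout with one coefficient per treatment group. The sketch below is one reading of the abstract for that case only: a group whose counts are all zero has fitted values of exactly zero, so its replicates are assumed to contribute nothing to the residual d.f. It is not the formula from the article or from edgeR, and the function name `residual_df_one_way` is hypothetical.

```python
def residual_df_one_way(counts_by_group):
    """Standard and reduced residual d.f. for one gene in a one-way layout.

    counts_by_group: list of lists, one list of replicate counts per group.
    Standard d.f.: n_samples - n_groups.
    Reduced d.f. (assumed rule): drop groups in which every count is zero.
    """
    standard = 0
    reduced = 0
    for counts in counts_by_group:
        n = len(counts)
        standard += n - 1
        if any(c != 0 for c in counts):      # group carries variance information
            reduced += n - 1
    return standard, reduced

# One gene, two groups of three replicates; the first group is all zeros.
standard_df, reduced_df = residual_df_one_way([[0, 0, 0], [5, 8, 6]])
print(standard_df, reduced_df)  # 4 2
```

Dividing the residual deviance by the standard d.f. (4) instead of the reduced d.f. (2) would understate the variance in this example, which is the direction of bias the abstract describes.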

Testing equality of means in partially paired data with incompleteness in single response

Statistical Methods in Medical Research ◽

10.1177/0962280218765007 ◽

2018 ◽

Vol 28 (5) ◽

pp. 1508-1522 ◽

Cited By ~ 1

Author(s):

Qianya Qi ◽

Li Yan ◽

Lili Tian

Keyword(s):

Type I Error ◽

Real Data ◽

The Cancer Genome Atlas ◽

P Value ◽

Type I ◽

Paired Data ◽

Data Set ◽

Equality Of Means ◽

Breast Cancer Study ◽

Single Response

When testing for differentially expressed genes between tumor and healthy tissues, data are usually collected in paired form. However, incomplete paired data often occur. While extensive statistical research exists for paired data with incompleteness in both arms, hardly any recent work can be found on paired data with incompleteness in a single arm. This paper aims to fill this gap by proposing some new methods, namely, P-value pooling methods and a nonparametric combination test. Simulation studies are conducted to investigate the performance of the proposed methods in terms of type I error and power at small to moderate sample sizes. A real data set from The Cancer Genome Atlas (TCGA) breast cancer study is analyzed using the proposed methods.
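
To show what pooling P-values can look like, the sketch below implements Fisher's classical combination rule using only the standard library. The abstract does not say which pooling rules the authors propose, so Fisher's method is an assumption made here for illustration, and the function name `fisher_pool` is hypothetical. In this setting the inputs might be one P-value from a paired test on the complete pairs and one from a two-sample test that uses the unpaired observations. Fisher's rule assumes the pooled P-values are independent, which may not hold when the two tests share data.

```python
import math

def fisher_pool(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(log p_i) is chi-square with 2k d.f. under the
    null; for even d.f. the upper tail has the closed form
    exp(-X/2) * sum_{i<k} (X/2)^i / i!.
    """
    k = len(p_values)
    if k == 0 or any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("need at least one p-value, each in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)   # X / 2
    tail = sum(half_stat ** i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail

# Hypothetical inputs: paired-test and two-sample-test p-values for one gene.
pooled = fisher_pool([0.05, 0.05])
print(f"pooled p-value: {pooled:.4f}")
```

With a single input the function returns that P-value unchanged, which is a quick sanity check on the closed-form tail.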

Multi-Omic Analyses of the m5C Regulator ALYREF Reveal Its Essential Roles in Hepatocellular Carcinoma

Frontiers in Oncology ◽

10.3389/fonc.2021.633415 ◽

2021 ◽

Vol 11 ◽

Author(s):

Chen Xue ◽

Yalei Zhao ◽

Ganglei Li ◽

Lanjuan Li

Keyword(s):

Gene Expression ◽

Hepatocellular Carcinoma ◽

Immune Cell ◽

The Cancer Genome Atlas ◽

Cancer Tissue ◽

Hub Genes ◽

Specific Expression ◽

Tissue Samples ◽

Expression Levels ◽

Advanced Tumor

The ALYREF protein acts as a crucial epigenetic regulator in several cancers. However, the specific expression levels and functional roles of ALYREF in cancers, including hepatocellular carcinoma (HCC), are largely unknown. In a pan-cancer tissue analysis that included HCC, we assessed the expression of ALYREF compared to normal tissues using The Cancer Genome Atlas database. Associations between ALYREF gene expression and the clinical characteristics of HCC patient samples were assessed using the UALCAN database. Kaplan-Meier analyses were performed to assess HCC patient prognosis, and the TIMER database was used to explore associations between ALYREF expression and immune-cell infiltration. The same methods were used to assess eIF4A3 expression in HCC patient samples. In addition, ALYREF- and eIF4A3-related differentially expressed genes (DEGs) were determined using LinkedOmics, associated protein functionalities were predicted for positively associated DEGs, and both the TargetScan and miRDB databases were used to predict potential upstream miRNAs that control ALYREF and eIF4A3 expression. We found that ALYREF gene expression was dysregulated in several cancers and was significantly elevated in HCC patient tissue samples and HCC cell lines. The overexpression of ALYREF was significantly related to both advanced tumor-node-metastasis stages and poor HCC prognosis. Furthermore, we found that eIF4A3 expression was significantly correlated with ALYREF expression, and that upregulated eIF4A3 was significantly associated with poor HCC patient outcomes. In the protein-protein interaction network, we identified eight hub genes based on the positively associated DEGs in common between ALYREF and eIF4A3, and the high expression levels of these hub genes were positively associated with patient clinical outcomes. In addition, we identified miR-4666a-5p and miR-6124 as potential regulators of ALYREF and eIF4A3 expression. These findings suggest that increased ALYREF expression may function as a novel biomarker for both HCC diagnosis and prognosis predictions.

recount-brain: a curated repository of human brain RNA-seq datasets metadata

10.1101/618025 ◽

2019 ◽

Author(s):

Ashkaun Razmara ◽

Shannon E. Ellis ◽

Dustin J. Sokolowski ◽

Sean Davis ◽

Michael D. Wilson ◽

...

Keyword(s):

Gene Expression ◽

Human Brain ◽

Gene Expression Data ◽

Differential Expression Analysis ◽

Controlled Vocabulary ◽

The Cancer Genome Atlas ◽

Tissue Type ◽

Expression Data ◽

Rna Seq ◽

Human Brain Tissue

Abstract: The usability of publicly available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information ("metadata"). We released the recount2 project, which involved re-processing ~70,000 samples in the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well characterized with extensive metadata, the ~50,000 RNA-seq samples from SRA in recount2 are inconsistently annotated with metadata. Tissue type, sex, and library type can be estimated from the RNA sequencing (RNA-seq) data itself. However, more detailed and harder-to-predict metadata, like age and diagnosis, must ideally be provided by labs that deposit the data. To facilitate more analyses within human brain tissue data, we have complemented phenotype predictions by manually constructing a uniformly curated database of public RNA-seq samples present in SRA and recount2. We describe the reproducible curation process for constructing recount-brain that involves systematic review of the primary manuscript, which can serve as a guide to annotate other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples as well as linking to controlled vocabulary terms for tissue, Brodmann area and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility as individual researchers do not have to manually curate the sample metadata. recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.

Computational staining of pathology images to study tumor microenvironment in lung cancer

10.1101/630749 ◽

2019 ◽

Author(s):

Shidan Wang ◽

Ruichen Rong ◽

Donghan M. Yang ◽

Ling Cai ◽

Lin Yang ◽

...

Keyword(s):

Gene Expression ◽

Tumor Microenvironment ◽

Spatial Organization ◽

Risk Group ◽

Biological Pathways ◽

The Cancer Genome Atlas ◽

Analysis Tool ◽

Cell Nuclei ◽

Tumor Tissues ◽

Different Types

Abstract: The spatial organization of different types of cells in tumor tissues reveals important information about the tumor microenvironment (TME). In order to facilitate the study of cellular spatial organization and interactions, we developed a comprehensive nuclei segmentation and classification tool to characterize the TME from standard Hematoxylin and Eosin (H&E)-stained pathology images. This tool can computationally "stain" different types of cell nuclei in H&E pathology images to help pathologists analyze the TME. A Mask Regional-Convolutional Neural Network (Mask-RCNN) model was developed to segment the nuclei of tumor, stromal, lymphocyte, macrophage, karyorrhexis and red blood cells in lung adenocarcinoma (ADC). Using this tool, we identified and classified cell nuclei and extracted 48 cell spatial organization-related features that characterize the TME. Using these features, we developed a prognostic model from the National Lung Screening Trial dataset, and independently validated the model in The Cancer Genome Atlas (TCGA) lung ADC dataset, in which the predicted high-risk group showed significantly worse survival than the low-risk group (pv = 0.001), with a hazard ratio of 2.23 [1.37-3.65] after adjusting for clinical variables. Furthermore, the image-derived TME features were significantly correlated with the gene expression of biological pathways. For example, transcription activation of both the T-cell receptor (TCR) and Programmed cell death protein 1 (PD1) pathways was positively correlated with the density of detected lymphocytes in tumor tissues, while expression of the extracellular matrix organization pathway was positively correlated with the density of stromal cells. This study developed a deep learning-based analysis tool to dissect the TME from tumor tissue images. Using this tool, we demonstrated that the spatial organization of different cell types is predictive of patient survival and associated with the gene expression of biological pathways. Although developed from the pathology images of lung ADC, this model can be adapted to other types of cancer.

Explainable autoencoder-based representation learning for gene expression data

10.1101/2021.12.21.473742 ◽

2021 ◽

Author(s):

Yang Yu ◽

Pathum Kossinna ◽

Wenyuan Liao ◽

Qingrun Zhang

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Hidden Variables ◽

Representation Learning ◽

The Cancer Genome Atlas ◽

Expression Data ◽

Rna Seq ◽

Gene Expression Data Analysis ◽

Cancer Genome Atlas ◽

Modern Machine

Modern machine learning methods have been extensively utilized in gene expression data analysis. In particular, autoencoders (AE) have been employed in processing noisy and heterogeneous RNA-Seq data. However, AEs usually lead to "black-box" hidden variables that are difficult to interpret, hindering downstream experimental validation and clinical translation. To bridge the gap between complicated models and biological interpretations, we developed a tool, XAE4Exp (eXplainable AutoEncoder for Expression data), which integrates AE and SHapley Additive exPlanations (SHAP), a flagship technique in the field of eXplainable AI (XAI). It quantitatively evaluates the contributions of each gene to the hidden structure learned by an AE, substantially improving the expandability of AE outcomes. By applying XAE4Exp to The Cancer Genome Atlas (TCGA) breast cancer gene expression data, we identified genes that are not differentially expressed, and pathways in various cancer-related classes. This tool will enable researchers and practitioners to analyze high-dimensional expression data intuitively, paving the way towards broader uses of deep learning.
