SArKS: de novo discovery of gene expression regulatory motifs and domains by suffix array kernel smoothing

2017 ◽  
Author(s):  
Dennis Wylie ◽  
Hans A. Hofmann ◽  
Boris V. Zemelman

Abstract Motivation We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score (fold-change, test-statistic, p-value) comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies nonparametric kernel smoothing to uncover promoter motifs that correlate with elevated differential expression scores. SArKS detects motifs by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motifs can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. Results We applied SArKS to published gene expression data representing distinct neocortical neuron classes in M. musculus and interneuron developmental states in H. sapiens. When benchmarked against several existing algorithms for correlative motif discovery using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. Availability https://github.com/denniscwylie/[email protected] Supplementary information appended to document.

2019 ◽  
Vol 35 (20) ◽  
pp. 3944-3952 ◽  
Author(s):  
Dennis C Wylie ◽  
Hans A Hofmann ◽  
Boris V Zemelman

Abstract Motivation We set out to develop an algorithm that can mine differential gene expression data to identify candidate cell type-specific DNA regulatory sequences. Differential expression is usually quantified as a continuous score—fold-change, test-statistic, P-value—comparing biological classes. Unlike existing approaches, our de novo strategy, termed SArKS, applies non-parametric kernel smoothing to uncover promoter motif sites that correlate with elevated differential expression scores. SArKS detects motif k-mers by smoothing sequence scores over sequence similarity. A second round of smoothing over spatial proximity reveals multi-motif domains (MMDs). Discovered motif sites can then be merged or extended based on adjacency within MMDs. False positive rates are estimated and controlled by permutation testing. Results We applied SArKS to published gene expression data representing distinct neocortical neuron classes in Mus musculus and interneuron developmental states in Homo sapiens. When benchmarked against several existing algorithms using a cross-validation procedure, SArKS identified larger motif sets that formed the basis for regression models with higher correlative power. Availability and implementation https://github.com/denniscwylie/sarks. Supplementary information Supplementary data are available at Bioinformatics online.
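
The first smoothing step described in this abstract (smoothing sequence scores over sequence similarity) can be illustrated with a toy sketch. The Python below is not the SArKS implementation. It assumes a naive suffix array built by sorting, a uniform window as the kernel, and a simple maximum-over-permutations threshold. The function names `suffix_array_smooth` and `permutation_threshold` and all parameters are hypothetical.

```python
import random

def suffix_array_smooth(seqs, scores, half_width):
    """Average per-sequence scores over neighbouring suffixes (toy sketch)."""
    # Concatenate sequences, each followed by a "$" terminator, and record
    # which sequence every position of the concatenated text belongs to.
    text = "".join(s + "$" for s in seqs)
    owner = [i for i, s in enumerate(seqs) for _ in range(len(s) + 1)]
    # Naive suffix array: positions sorted by the suffix that starts there.
    # Suffixes sharing a long prefix (a similar k-mer) end up adjacent.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Each suffix inherits the score of the sequence it starts in.
    y = [scores[owner[i]] for i in sa]
    # Uniform kernel: mean over a window of ranks in suffix-array order,
    # truncated at both ends of the array.
    smoothed = []
    for rank in range(len(y)):
        window = y[max(0, rank - half_width): rank + half_width + 1]
        smoothed.append(sum(window) / len(window))
    return sa, smoothed

def permutation_threshold(seqs, scores, half_width, n_perm=100, seed=0):
    """Largest smoothed score seen when scores are shuffled across sequences."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_perm):
        shuffled = list(scores)
        rng.shuffle(shuffled)
        _, sm = suffix_array_smooth(seqs, shuffled, half_width)
        best = max(best, max(sm))
    return best
```

In this sketch, suffix-array positions whose smoothed score exceeds the permutation threshold would be the candidate motif sites. The second round of smoothing over spatial proximity is not shown.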


2014 ◽  
Author(s):  
Liyang Diao ◽  
Antoine Marcais ◽  
Scott Norton ◽  
Kevin C. Chen

MicroRNAs (miRNAs) are a class of ~22nt non-coding RNAs that potentially regulate over 60% of human protein-coding genes. MiRNA activity is highly specific, differing between cell types, developmental stages and environmental conditions, so the identification of active miRNAs in a given sample is of great interest. Here we present a novel computational approach for analyzing both mRNA sequence and gene expression data, called MixMir. Our method corrects for 3' UTR background sequence similarity between transcripts, which is known to correlate with mRNA transcript abundance. We demonstrate that after accounting for k-mer sequence similarities in 3' UTRs, a statistical linear model based on motif presence/absence can effectively discover active miRNAs in a sample. MixMir utilizes fast software implementations for solving mixed linear models, which are widely used in genome-wide association studies (GWAS). Essentially we use 3' UTR sequence similarity in place of population cryptic relatedness in the GWAS problem. Compared to similar methods such as miREDUCE, Sylamer and cWords, we found that MixMir performed better at discovering true miRNA motifs in Dicer knockout CD4+ T-cells, as well as protein and mRNA expression data obtained from miRNA transfection experiments in human cell lines. MixMir can be freely downloaded from https://github.com/ldiao/MixMir.
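
The idea of using 3' UTR sequence similarity in place of a relatedness matrix can be sketched as follows. This is an illustration only: the abstract does not state which similarity measure MixMir uses, so cosine similarity between k-mer count profiles is an assumption here, and the names `kmer_counts` and `kmer_similarity_matrix` are hypothetical. The mixed linear model fit itself is not shown.

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity_matrix(utrs, k=3):
    """Pairwise cosine similarity of k-mer profiles (assumed measure).

    Every sequence must be at least k long, otherwise its profile is empty
    and the normalisation below would divide by zero.
    """
    profiles = [kmer_counts(u, k) for u in utrs]
    norms = [math.sqrt(sum(c * c for c in p.values())) for p in profiles]
    n = len(utrs)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Counter returns 0 for k-mers absent from the other profile.
            dot = sum(c * profiles[j][kmer] for kmer, c in profiles[i].items())
            sim[i][j] = sim[j][i] = dot / (norms[i] * norms[j])
    return sim
```

A matrix of this kind would play the role that a kinship matrix plays in a GWAS mixed model, with motif presence/absence as the tested fixed effect.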


2017 ◽  
Author(s):  
Ionas Erb ◽  
Thomas Quinn ◽  
David Lovell ◽  
Cedric Notredame

Abstract Gene expression data, such as those generated by next generation sequencing technologies (RNA-seq), are of an inherently relative nature: the total number of sequenced reads has no biological meaning. This issue is most often addressed with various normalization techniques which all face the same problem: once information about the total mRNA content of the origin cells is lost, it cannot be recovered by mere technical means. Additional knowledge, in the form of an unchanged reference, is necessary; however, this reference can usually only be estimated. Here we propose a novel method where sample normalization is unnecessary, but important insights can be obtained nevertheless. Instead of trying to recover absolute abundances, our method is entirely based on ratios, so normalization factors cancel by default. Although the differential expression of individual genes cannot be recovered this way, the ratios themselves can be differentially expressed (even when their constituents are not). Yet, most current analyses are blind to these cases, while our approach reveals them directly. Specifically, we show how the differential expression of gene ratios can be formalized by decomposing log-ratio variance (LRV) and deriving intuitive statistics from it. Although small LRVs have been used to detect proportional genes in gene expression data before, we focus here on the change in proportionality factors between groups of samples (e.g. tissue-specific proportionality). For this, we propose a statistic that is equivalent to the squared t-statistic of one-way ANOVA, but for gene ratios. In doing so, we show how precision weights can be incorporated to account for the peculiarities of count data, and, moreover, how a moderated statistic can be derived in the same way as the one following from a hierarchical model for individual genes. We also discuss approaches to deal with zero counts, deriving an expression of our statistic that is able to incorporate them. 
In providing a detailed analysis of the connections between the differential expression of genes and the differential proportionality of pairs, we facilitate a clear interpretation of new concepts. The proposed framework is applied to a data set from GTEx consisting of 98 samples from the cerebellum and cortex, with selected examples shown. A computationally efficient implementation of the approach in R has been released as an addendum to the propr package.
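
Two claims in this abstract lend themselves to a small numerical check: per-sample normalization factors cancel in a gene ratio, and a one-way ANOVA statistic computed on log-ratios reduces to a squared t-statistic when there are two groups. The Python sketch below is not the propr implementation. It uses an unweighted, unmoderated decomposition of the log-ratio variance into within-group and between-group parts, assumes strictly positive counts, and uses the hypothetical function names `log_ratios` and `ratio_anova_f`.

```python
import math

def log_ratios(gene_a, gene_b):
    """Per-sample log-ratio of two genes; any per-sample scaling cancels."""
    return [math.log(a / b) for a, b in zip(gene_a, gene_b)]

def ratio_anova_f(gene_a, gene_b, groups):
    """One-way ANOVA F statistic computed on the log-ratio of a gene pair."""
    z = log_ratios(gene_a, gene_b)
    n = len(z)
    labels = sorted(set(groups))
    grand_mean = sum(z) / n
    # Total sum of squares of the log-ratio (proportional to the LRV).
    total = sum((v - grand_mean) ** 2 for v in z)
    # Within-group sum of squares (proportional to the group-wise LRVs).
    within = 0.0
    for g in labels:
        zg = [v for v, lab in zip(z, groups) if lab == g]
        mean_g = sum(zg) / len(zg)
        within += sum((v - mean_g) ** 2 for v in zg)
    between = total - within
    return (between / (len(labels) - 1)) / (within / (n - len(labels)))
```

Multiplying both genes in each sample by an arbitrary sample-specific factor leaves the result unchanged, which is the sense in which no normalization is required. The precision weights, moderation, and zero handling mentioned in the abstract are not covered by this sketch.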


2020 ◽  
Author(s):  
Thomas J. Hall ◽  
Michael P. Mullen ◽  
Gillian P. McHugo ◽  
Kate E. Killick ◽  
Siobhán C. Ring ◽  
...  

Abstract Background Bovine TB (BTB), caused by infection with Mycobacterium bovis, is a major endemic disease affecting global cattle production, particularly in many developing countries. The key innate immune cell that first encounters the pathogen is the alveolar macrophage, previously shown to be substantially reprogrammed during intracellular infection by the pathogen. Here we use differential expression, and correlation- and interaction-based network approaches to analyse the host response to infection with M. bovis at the transcriptome level to identify core infection response pathways and gene modules. These outputs were then integrated with genome-wide association study (GWAS) data sets to enhance detection of genomic variants for susceptibility/resistance to M. bovis infection. Results The host gene expression data consisted of bovine RNA-seq data from alveolar macrophages infected with M. bovis at 24 and 48 hours post-infection. These RNA-seq data were analysed using three distinct analysis pipelines and novel response pathways and modules were further refined using cross-comparison and integration of the results. First, a differential expression analysis was carried out to determine the most significantly differentially expressed (DE) genes between conditions at each time point. Second, two networks were constructed at each time point using gene correlation patterns to determine changes in expression across conditions. Functional sub-modules within each correlation network were selected by statistical criteria for modularity. Third, a base gene interaction network of the mammalian host response to mycobacterial infection was generated using the GeneCards database and InnateDB. Differential gene expression data were superimposed on this base network to extract functional modules of interconnected DE genes. Conclusions Bovine GWAS data were obtained from a published BTB susceptibility/resistance study. 
The results from the three parallel analyses were integrated with this data to determine which of the three approaches identified genes significantly enriched for SNPs associated with susceptibility/resistance to M. bovis infection. Results indicate distinct and significant overlap in SNP discovery, demonstrating that network-based integration of biologically relevant transcriptomics data can leverage substantial additional information from GWAS data sets.


Author(s):  
Pau Erola ◽  
Johan L M Björkegren ◽  
Tom Michoel

Abstract Motivation Recently, it has become feasible to generate large-scale, multi-tissue gene expression data, where expression profiles are obtained from multiple tissues or organs sampled from dozens to hundreds of individuals. When traditional clustering methods are applied to this type of data, important information is lost, because they either require all tissues to be analyzed independently, ignoring dependencies and similarities between tissues, or to merge tissues in a single, monolithic dataset, ignoring individual characteristics of tissues. Results We developed a Bayesian model-based multi-tissue clustering algorithm, revamp, which can incorporate prior information on physiological tissue similarity, and which results in a set of clusters, each consisting of a core set of genes conserved across tissues as well as differential sets of genes specific to one or more subsets of tissues. Using data from seven vascular and metabolic tissues from over 100 individuals in the STockholm Atherosclerosis Gene Expression (STAGE) study, we demonstrate that multi-tissue clusters inferred by revamp are more enriched for tissue-dependent protein-protein interactions compared to alternative approaches. We further demonstrate that revamp results in easily interpretable multi-tissue gene expression associations to key coronary artery disease processes and clinical phenotypes in the STAGE individuals. Availability and implementation Revamp is implemented in the Lemon-Tree software, available at https://github.com/eb00/lemon-tree. Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Cynthia Z Ma ◽  
Michael R Brent

Abstract Motivation The activity of a transcription factor (TF) in a sample of cells is the extent to which it is exerting its regulatory potential. Many methods of inferring TF activity from gene expression data have been described, but due to the lack of appropriate large-scale datasets, systematic and objective validation has not been possible until now. Results We systematically evaluate and optimize the approach to TF activity inference in which a gene expression matrix is factored into a condition-independent matrix of control strengths and a condition-dependent matrix of TF activity levels. We find that expression data in which the activities of individual TFs have been perturbed are both necessary and sufficient for obtaining good performance. To a considerable extent, control strengths inferred using expression data from one growth condition carry over to other conditions, so the control strength matrices derived here can be used by others. Finally, we apply these methods to gain insight into the upstream factors that regulate the activities of yeast TFs Gcr2, Gln3, Gcn4 and Msn2. Availability and implementation Evaluation code and data are available at https://doi.org/10.5281/zenodo.4050573. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
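
The factorization described in this abstract writes the expression matrix as a product of control strengths (genes by TFs) and activities (TFs by conditions). One consequence stated above is that control strengths can be reused in other conditions. A minimal sketch of that reuse, assuming the control strength matrix is already known and using ordinary least squares, is shown below. It is not the authors' evaluation code, it does not infer the control strengths, and the function name `infer_tf_activity` is hypothetical.

```python
import numpy as np

def infer_tf_activity(expression, control_strengths):
    """Least-squares TF activities given fixed control strengths.

    expression:        genes x conditions array
    control_strengths: genes x TFs array (condition-independent)
    returns:           TFs x conditions array of activity levels
    """
    expression = np.asarray(expression, dtype=float)
    control_strengths = np.asarray(control_strengths, dtype=float)
    # Solve control_strengths @ activity ~= expression for activity.
    activity, *_ = np.linalg.lstsq(control_strengths, expression, rcond=None)
    return activity
```

When the expression data are exactly the product of the two matrices and the control strength matrix has full column rank, this recovers the activities exactly. With real data it yields the best fit in the least-squares sense.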


Author(s):  
Ramon Viñas ◽  
Helena Andrés-Terré ◽  
Pietro Liò ◽  
Kevin Bryson

Abstract Motivation High-throughput gene expression can be used to address a wide range of fundamental biological problems, but datasets of an appropriate size are often unavailable. Moreover, existing transcriptomics simulators have been criticized because they fail to emulate key properties of gene expression data. In this article, we develop a method based on a conditional generative adversarial network to generate realistic transcriptomics data for Escherichia coli and humans. We assess the performance of our approach across several tissues and cancer-types. Results We show that our model preserves several gene expression properties significantly better than widely used simulators, such as SynTReN or GeneNetWeaver. The synthetic data preserve tissue- and cancer-specific properties of transcriptomics data. Moreover, they exhibit real gene clusters and ontologies both at local and global scales, suggesting that the model learns to approximate the gene expression manifold in a biologically meaningful way. Availability and implementation Code is available at: https://github.com/rvinas/adversarial-gene-expression. Supplementary information Supplementary data are available at Bioinformatics online.


2006 ◽  
Vol 04 (04) ◽  
pp. 833-852 ◽  
Author(s):  
CORNELIU HENEGAR ◽  
RAFFAELLA CANCELLO ◽  
SOPHIE ROME ◽  
HUBERT VIDAL ◽  
KARINE CLÉMENT ◽  
...  

Motivation: Functional profiling is a key step of microarray gene expression data analysis. Identifying co-regulated biological processes could help to better understand the underlying biological interactions within the studied biological frame. Results: We present herein an original approach designed to search for putatively co-regulated biological processes sharing a significant number of co-expressed genes. An R language implementation named "FunCluster" was built and tested on two gene expression data sets. A discriminatory functional analysis of the first data set, related to experiments performed on separated adipocytes and stroma vascular fraction cells of human white adipose tissue, highlighted the prevalent role of nonadipose cells in the synthesis of inflammatory and immunity molecules in human adiposity. On the second data set, resulting from a model investigating insulin coordinated regulation of gene expression in human skeletal muscle, FunCluster analysis spotlighted novel functional classes of putatively co-regulated biological processes related to protein metabolism and the regulation of muscular contraction. Availability: Supplementary information about the FunCluster tool is available on-line at .

