Maps of open chromatin highlight cell type-restricted patterns of regulatory sequence variation at hematological trait loci

D. S. Paul; C. A. Albers; A. Rendon; K. Voss; J. Stephens; P. van der Harst; J. C. Chambers; N. Soranzo; W. H. Ouwehand; P. Deloukas

doi:10.1101/gr.155127.113

The impact of different negative training data on regulatory sequence predictions

PLoS ONE ◽

10.1371/journal.pone.0237412 ◽

2020 ◽

Vol 15 (12) ◽

pp. e0237412

Author(s):

Louisa-Marie Krützfeldt ◽

Max Schubach ◽

Martin Kircher

Keyword(s):

Model Performance ◽

Training Data ◽

Training Dataset ◽

Support Vector ◽

Regulatory Sequence ◽

Open Chromatin ◽

Regulatory Sequences ◽

Cell Type ◽

The Impact ◽

Negative Training

Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing the best overall results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs matched and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
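As a rough illustration of the second negative-set strategy named in the abstract above (sequence shuffles of the positive sequences), the following minimal Python sketch produces one shuffled negative per positive. It assumes a plain mononucleotide shuffle, which preserves only length and base composition; the abstract does not state which shuffling scheme the authors used, and the function name, seed, and example sequences are hypothetical.

```python
import random
from collections import Counter


def shuffled_negatives(positives, seed=0):
    """Return one mononucleotide-shuffled copy of each positive sequence.

    Assumption: a simple per-base shuffle; this is an illustrative choice,
    not necessarily the scheme used in the paper.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    negatives = []
    for seq in positives:
        bases = list(seq)
        rng.shuffle(bases)  # permute the bases in place
        negatives.append("".join(bases))
    return negatives


# Hypothetical positive (open chromatin) sequences, for illustration only.
positives = ["ACGTGGGCTA", "TTTACGCGAA"]
negatives = shuffled_negatives(positives)

# Each negative keeps the length and base counts of its positive.
for pos, neg in zip(positives, negatives):
    assert len(pos) == len(neg)
    assert Counter(pos) == Counter(neg)
```

A shuffle of this kind keeps base composition but destroys longer sequence features, which is consistent with the abstract's observation that models trained on highly shuffled sequences appear to learn features of artificial sequences.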


The impact of different negative training data on regulatory sequence predictions

10.1101/2020.07.28.224485 ◽

2020 ◽

Author(s):

Louisa-Marie Krützfeldt ◽

Max Schubach ◽

Martin Kircher

Keyword(s):

Model Performance ◽

Training Data ◽

Training Dataset ◽

Support Vector ◽

Regulatory Sequence ◽

Open Chromatin ◽

Regulatory Sequences ◽

Cell Type ◽

The Impact ◽

Negative Training

Abstract Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements’ relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing the best overall results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs matched and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
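The abstract above notes that insufficient matching of genomic background sequences leads to model biases. As a hedged sketch of what such matching could look like, the Python example below greedily pairs each positive with an unused background candidate of the same length and the closest GC fraction. Matching on length and GC content is an assumption made here for illustration; the abstract does not specify the matching criteria, and all names and sequences are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def match_background(positives, candidates):
    """For each positive, pick the unused same-length candidate whose GC
    fraction is closest. Returns None where no same-length candidate is left.

    Assumption: length and GC content are the only matching criteria.
    """
    available = list(candidates)
    matched = []
    for pos in positives:
        same_len = [c for c in available if len(c) == len(pos)]
        if not same_len:
            matched.append(None)
            continue
        best = min(same_len, key=lambda c: abs(gc_fraction(c) - gc_fraction(pos)))
        available.remove(best)  # use each background sequence at most once
        matched.append(best)
    return matched


# Hypothetical sequences, for illustration only.
positives = ["GGCCGGAT", "ATATTTAC"]
candidates = ["ATATATGC", "GCGCGCAT", "ACGTACGT"]
matched = match_background(positives, candidates)
print(matched)  # ['GCGCGCAT', 'ATATATGC']
```

A greedy pass like this is order-dependent and only approximates a balanced match; it is meant to show the idea of composition matching rather than a production procedure.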


Comprehensive analysis of single cell ATAC-seq data with SnapATAC

Nature Communications ◽

10.1038/s41467-021-21583-9 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Rongxin Fang ◽

Sebastian Preissl ◽

Yang Li ◽

Xiaomeng Hou ◽

Jacinta Lucero ◽

...

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Expression Patterns ◽

Regulatory Elements ◽

Cellular Heterogeneity ◽

Specific Gene ◽

Open Chromatin ◽

Cell Type ◽

Process Data ◽

Cell Type Specific

Abstract Identification of the cis-regulatory elements controlling cell-type specific gene expression patterns is essential for understanding the origin of cellular diversity. Conventional assays that map regulatory elements via open chromatin analysis of primary tissues are hindered by sample heterogeneity. Single cell analysis of accessible chromatin (scATAC-seq) can overcome this limitation. However, the high level of noise in each single cell profile and the large volume of data pose unique computational challenges. Here, we introduce SnapATAC, a software package for analyzing scATAC-seq datasets. SnapATAC dissects cellular heterogeneity in an unbiased manner and maps the trajectories of cellular states. Using the Nyström method, SnapATAC can process data from up to a million cells. Furthermore, SnapATAC incorporates existing tools into a comprehensive package for analyzing single cell ATAC-seq datasets. As a demonstration of its utility, SnapATAC is applied to 55,592 single-nucleus ATAC-seq profiles from the mouse secondary motor cortex. The analysis reveals ~370,000 candidate regulatory elements in 31 distinct cell populations in this brain region and infers candidate cell-type specific transcriptional regulators.


Antagonising Chromatin Remodelling Activities in the Regulation of Mammalian Ribosomal Transcription

Genes ◽

10.3390/genes12070961 ◽

2021 ◽

Vol 12 (7) ◽

pp. 961

Author(s):

Kanwal Tariq ◽

Ann-Kristin Östlund Farrants

Keyword(s):

Stress Responses ◽

Ribosomal Gene ◽

Chromatin Remodelling ◽

Chromatin State ◽

Rrna Gene ◽

Open Chromatin ◽

Histone Chaperones ◽

Embryonic Cells ◽

Cell Type ◽

Chromatin States

Ribosomal transcription constitutes the major energy-consuming process in cells and is regulated in response to proliferation, differentiation and metabolic conditions by several signalling pathways. These act on the transcription machinery but also on chromatin factors and ncRNA. The many ribosomal gene repeats are organised in a number of different chromatin states: active, poised, pseudosilent and repressed gene repeats. Some of these chromatin states, such as the active state organised with the HMG protein UBF, are unique to the 47S rRNA gene repeat and do not occur at other locations in the genome, whereas other chromatin states are nucleosomal, harbouring both active and inactive histone marks. The number of repeats in a certain state varies with developmental stage and cell type; embryonic cells have more rRNA gene repeats organised in an open chromatin state, which is replaced by heterochromatin during differentiation, establishing different states depending on cell type. The 47S rRNA gene transcription is regulated in different ways depending on stimulus and chromatin state of individual gene repeats. This review will discuss the present knowledge about factors involved, such as chromatin remodelling factors NuRD, NoRC, CSB, B-WICH, histone modifying enzymes and histone chaperones, in altering gene expression and switching chromatin states in proliferation, differentiation, metabolic changes and stress responses.


Predicting lineage-specific differences in open chromatin across dozens of mammalian genomes

10.1101/2020.12.04.410795 ◽

2020 ◽

Author(s):

Irene M. Kaplow ◽

Morgan E. Wirthlin ◽

Alyssa J. Lawler ◽

Ashley R. Brown ◽

Michael Kleyman ◽

...

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Genome Sequence ◽

Evaluation Metrics ◽

Open Chromatin ◽

Learning Models ◽

Cell Type ◽

Mammalian Genomes ◽

Multiple Species ◽

Machine Learning Models

ABSTRACT Many phenotypes have evolved through gene expression, meaning that differences between species are caused in part by differences in enhancers. Here, we demonstrate that we can accurately predict differences between species in open chromatin status at putative enhancers using machine learning models trained on genome sequence across species. We present a new set of criteria that we designed to explicitly demonstrate if models are useful for studying open chromatin regions whose orthologs are not open in every species. Our approach and evaluation metrics can be applied to any tissue or cell type with open chromatin data available from multiple species.


Analysis of putative cis-regulatory elements regulating blood pressure variation

Human Molecular Genetics ◽

10.1093/hmg/ddaa098 ◽

2020 ◽

Vol 29 (11) ◽

pp. 1922-1932

Author(s):

Priyanka Nandakumar ◽

Dongwon Lee ◽

Thomas J Hoffmann ◽

Georg B Ehret ◽

Dan Arking ◽

...

Keyword(s):

Blood Pressure ◽

Association Studies ◽

Specific Effect ◽

Cell Types ◽

Regulatory Elements ◽

Open Chromatin ◽

Genome Wide Association Studies ◽

Cell Type ◽

Functional Scores ◽

Cell Type Specific

Abstract Hundreds of loci have been associated with blood pressure (BP) traits from many genome-wide association studies. We identified an enrichment of these loci in aorta and tibial artery expression quantitative trait loci in our previous work in ~100 000 Genetic Epidemiology Research on Aging study participants. In the present study, we sought to fine-map known loci and identify novel genes by determining putative regulatory regions for these and other tissues relevant to BP. We constructed maps of putative cis-regulatory elements (CREs) using publicly available open chromatin data for the heart, aorta and tibial arteries, and multiple kidney cell types. Variants within these regions may be evaluated quantitatively for their tissue- or cell-type-specific regulatory impact using deltaSVM functional scores, as described in our previous work. We aggregate variants within these putative CREs within 50 Kb of the start or end of ‘expressed’ genes in these tissues or cell types using public expression data and use deltaSVM scores as weights in the group-wise sequence kernel association test to identify candidates. We test for association with both BP traits and expression within these tissues or cell types of interest and identify the candidates MTHFR, C10orf32, CSK, NOV, ULK4, SDCCAG8, SCAMP5, RPP25, HDGFRP3, VPS37B and PPCDC. Additionally, we examined two known QT interval genes, SCN5A and NOS1AP, in the Atherosclerosis Risk in Communities Study, as a positive control, and observed the expected heart-specific effect. Thus, our method identifies variants and genes for further functional testing using tissue- or cell-type-specific putative regulatory information.


A combination of closely associated positive and negative cis-acting promoter elements regulates transcription of the skeletal alpha-actin gene.

Molecular and Cellular Biology ◽

10.1128/mcb.10.2.528 ◽

1990 ◽

Vol 10 (2) ◽

pp. 528-538 ◽

Cited By ~ 69

Author(s):

K L Chow ◽

R J Schwartz

Keyword(s):

Transcriptional Activity ◽

Primary Cultures ◽

Gene Promoter ◽

Regulatory Sequence ◽

Actin Gene ◽

Cell Type ◽

Specific Expression ◽

Cis Acting ◽

Cell Type Specific Expression ◽

Alpha Actin

The chicken skeletal alpha-actin gene promoter region provides at least a 75-fold-greater transcriptional activity in muscle cells than in fibroblasts. The cis-acting sequences required for cell type-restricted expression within this 200-base-pair (bp) region were elucidated by chloramphenicol acetyltransferase assays of site-directed BglII linker-scanning mutations transiently transfected into primary cultures. Four positive cis-acting elements were identified and are required for efficient transcriptional activity in myogenic cells. These elements, conserved across vertebrate evolution, include the ATAAAA box (-24 bp), paired CCAAT-box-associated repeats (CBARs; at -83 bp and -127 bp), and the upstream T+A-rich regulatory sequence (at -176 bp). Basal transcriptional activity in fibroblasts was not as dependent on the upstream CBAR or regions of the upstream T+A-rich regulatory sequence. Transfection experiments provided evidence that positive regulatory factors required for alpha-actin expression in fibroblasts are limiting. In addition, negative cis-acting elements were detected and found closely associated with the G+C-rich sequences that surround the paired CBARs. Negative elements may have a role in restricting developmentally timed expression in myoblasts and appear to inhibit promoter activity in nonmyogenic cells. Cell type-specific expression of the skeletal alpha-actin gene promoter is regulated by combinatorial and possibly competitive interactions between multiple positive and negative cis-acting elements.


Percoll Gradient Separation of Cord Blood Mononuclear Cells Reveals the Presence of a Novel Population of CXCR4+ Oct-4+ Small Embryonic-Like Stem Cells.

Blood ◽

10.1182/blood.v106.11.1069.1069 ◽

2005 ◽

Vol 106 (11) ◽

pp. 1069-1069 ◽

Cited By ~ 1

Author(s):

Magda Kucia ◽

Krzysztof Oldak ◽

Mariusz Z. Ratajczak ◽

Janina Ratajczak ◽

Zygmunt Pojda

Keyword(s):

Stem Cells ◽

Stem Cell ◽

Cord Blood ◽

Intestinal Epithelium ◽

Skeletal Muscles ◽

Mononuclear Cells ◽

Neural Tissue ◽

Open Chromatin ◽

Percoll Gradient ◽

Cell Type

Abstract Cord blood (CB) mononuclear cells (MNC) have been shown to contribute to organ/tissue regeneration; however, the identity of the specific cell type(s) involved remains unknown. Recently, a mobile, SDF-1-, HGF/SF- and LIF-responsive population of CXCR4+ non-hematopoietic MNC was identified in adult bone marrow (BM) that expresses markers (RQ-PCR, immunohistochemistry) of early pluripotent/tissue committed stem cells (TCSC) for skeletal muscles, heart, neural tissue, liver, pancreas, epidermis and intestinal epithelium (Leukemia 2004;18:29–40). We hypothesized that a similar population of these rare cells may also be present in CB, but that their final yield may depend on the method of MNC preparation. We further hypothesized that since these cells are very small (~3–5 μm), they may cross a Ficoll-Paque gradient or sediment more rapidly than other CB MNC and, as a result, be lost during routine CB preparations. Thus, taking their small size into consideration, we isolated small CB MNC using a Percoll gradient (1.078–1.095) to evaluate whether these cells are present in CB. We found that this allows us to isolate from CB a population of small cells (2.5% of the total number of MNC) that is enriched in the TCSC/PSC population (~0.002% of MNC) that we originally identified in BM. Accordingly, these CB-derived TCSC/PSC CXCR4+ cells are very small (~3 μm), possess large nuclei that contain embryonic stem cell type open chromatin (euchromatin), and express several markers for skeletal muscles, heart, neural tissue, liver, pancreas, epidermis and intestinal epithelium as well as pluripotent stem cell (PSC) transcription factors such as Oct-4, Nanog and Rex-1. In vitro cultures of CB-derived small TCSC/PSC are able to grow neurospheres that give rise to neuronal (β-III tubulin+, nestin+) and macroglia (O4+, MBP+, GFAP+) lineages and cardiomyocytes (β-myosin heavy chain+, α-sarcomeric actin+).
Based on this, we conclude that CB contains embryonic-like stem cells that may be lost during routine procedures to isolate MNC. Thus, Percoll gradient centrifugation allows for optimal isolation of these small CXCR4+ PSC/TCSC. We postulate that the CB tissue/organ regenerating potential may be much higher than initially postulated, and we are currently testing this hypothesis in vivo in animal models.


Activation-Dependent TRAF3 Exon 8 Alternative Splicing Is Controlled by CELF2 and hnRNP C Binding to an Upstream Intronic Element

Molecular and Cellular Biology ◽

10.1128/mcb.00488-16 ◽

2016 ◽

Vol 37 (7) ◽

Cited By ~ 6

Author(s):

Astrid-Solveig Schultz ◽

Marco Preussner ◽

Mario Bunse ◽

Rotem Karni ◽

Florian Heyd

Keyword(s):

Alternative Splicing ◽

Exon Skipping ◽

Regulatory Elements ◽

Model Systems ◽

Regulatory Sequence ◽

Dependent Manner ◽

Cell Type ◽

Activated T Cells ◽

Hnrnp C ◽

Cell Type Specific

ABSTRACT Cell-type-specific and inducible alternative splicing has a fundamental impact on regulating gene expression and cellular function in a variety of settings, including activation and differentiation. We have recently shown that activation-induced skipping of TRAF3 exon 8 activates noncanonical NF-κB signaling upon T cell stimulation, but the regulatory basis for this splicing event remains unknown. Here we identify cis- and trans-regulatory elements rendering this splicing switch activation dependent and cell type specific. The cis-acting element is located 340 to 440 nucleotides upstream of the regulated exon and acts in a distance-dependent manner, since altering the location reduces its activity. A small interfering RNA screen, followed by cross-link immunoprecipitation and mutational analyses, identified CELF2 and hnRNP C as trans-acting factors that directly bind the regulatory sequence and together mediate increased exon skipping in activated T cells. CELF2 expression levels correlate with TRAF3 exon skipping in several model systems, suggesting that CELF2 is the decisive factor, with hnRNP C being necessary but not sufficient. These data suggest an interplay between CELF2 and hnRNP C as the mechanistic basis for activation-dependent alternative splicing of TRAF3 exon 8 and additional exons and uncover an intronic splicing silencer whose full activity depends on the precise location more than 300 nucleotides upstream of the regulated exon.


Analysis of putative cis-regulatory elements regulating blood pressure variation

10.1101/820522 ◽

2019 ◽

Author(s):

Priyanka Nandakumar ◽

Dongwon Lee ◽

Thomas J. Hoffmann ◽

Georg B. Ehret ◽

Dan Arking ◽

...

Keyword(s):

Gene Expression ◽

Blood Pressure ◽

Cell Types ◽

Regulatory Elements ◽

Open Chromatin ◽

Genome Wide Association Studies ◽

Cell Type ◽

Functional Scores ◽

Cell Type Specific ◽

Different Tissues

Abstract Hundreds of loci have been associated with blood pressure traits from many genome-wide association studies. We identified an enrichment of these loci in aorta and tibial artery expression quantitative trait loci in our previous work in ∼100,000 Genetic Epidemiology Research on Aging (GERA) study participants. In the present study, we subsequently focused on determining putative regulatory regions for these and other tissues of relevance to blood pressure, both to fine-map these loci by pinpointing genes and variants of functional interest within them and to identify any novel genes. We constructed maps of putative cis-regulatory elements using publicly available open chromatin data for the heart, aorta and tibial arteries, and multiple kidney cell types. Sequence variants within these regions may be evaluated quantitatively for their tissue- or cell-type-specific regulatory impact using deltaSVM functional scores, as described in our previous work. In order to identify genes of interest, we aggregate the variants in these putative cis-regulatory elements within 50Kb of the start or end of genes considered as “expressed” in these tissues or cell types using publicly available gene expression data, and use the deltaSVM scores as weights in the well-known group-wise sequence kernel association test (SKAT). We test for association with both blood pressure traits and expression within these tissues or cell types of interest, and identify several genes, including MTHFR, C10orf32, CSK, NOV, ULK4, SDCCAG8, SCAMP5, RPP25, HDGFRP3, VPS37B, and PPCDC. Although our study centers on blood pressure traits, we additionally examined two known genes, SCN5A and NOS1AP, involved in the cardiac trait QT interval, in the Atherosclerosis Risk in Communities Study (ARIC), as a positive control, and observed an expected heart-specific effect. Thus, our method may be used to identify variants and genes for further functional testing using tissue- or cell-type-specific putative regulatory information.

Author Summary: Sequence changes in genes (“variants”) are linked to the presence and severity of different traits or diseases. However, as genes may be expressed in different tissues and at different times and degrees, using this information is expected to identify genes of interest more accurately. Variants matter not only within the genes themselves but also in the sequences (“regulatory elements”) that control the genes’ expression in different tissues or cell types. In this study, we aim to use this information about expression and variants potentially involved in gene expression regulation to better pinpoint genes and variants in regulatory elements of interest for blood pressure regulation. We do so by taking advantage of publicly available data and using methods that combine information about variants in aggregate within a gene’s putative regulatory elements in tissues thought to be relevant for blood pressure. We identify several genes, with the aim of enabling experimental follow-up.
