Multi-scale deep tensor factorization learns a latent representation of the human epigenome

10.1101/364976 ◽

2018 ◽

Cited By ~ 12

Author(s):

Jacob Schreiber ◽

Timothy Durham ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Gene Expression ◽

Human Genome ◽

Replication Timing ◽

Cell Types ◽

Factorization Method ◽

Computational Genomics ◽

Tensor Factorization ◽

Multi Scale ◽

Predict Gene Expression ◽

Machine Learning Models

Abstract:

The human epigenome has been experimentally characterized by measurements of protein binding, chromatin accessibility, methylation, and histone modification in hundreds of cell types. The result is a huge compendium of data, consisting of thousands of measurements for every basepair in the human genome. These data are difficult to make sense of, not only for humans, but also for computational methods that aim to detect genes and other functional elements, predict gene expression, characterize polymorphisms, etc. To address this challenge, we propose a deep neural network tensor factorization method, Avocado, that compresses epigenomic data into a dense, information-rich representation of the human genome. We use data from the Roadmap Epigenomics Consortium to demonstrate that this learned representation of the genome is broadly useful: first, by imputing epigenomic data more accurately than previous methods, and second, by showing that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture. Our findings suggest the broad utility of Avocado’s learned latent representation for computational genomics and epigenomics.
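As a rough illustration of the idea described in the abstract above, the following sketch shows a forward pass through a neural tensor factorization: each axis of the data tensor (cell type, assay, genomic position) gets its own factor matrix, and a small neural network combines the three factor vectors into a predicted signal value. All sizes, variable names, and the network shape are hypothetical choices made for this example. The sketch omits training and any multi-scale genome factors, and should not be read as the Avocado implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_cell_types, n_assays, n_positions = 4, 3, 50
d_cell, d_assay, d_pos, d_hidden = 8, 8, 16, 32

# One factor matrix per tensor axis. In a real model these would be learned;
# here they are randomly initialised.
cell_factors = rng.normal(scale=0.1, size=(n_cell_types, d_cell))
assay_factors = rng.normal(scale=0.1, size=(n_assays, d_assay))
position_factors = rng.normal(scale=0.1, size=(n_positions, d_pos))

# A small multilayer perceptron stands in for the dot product used by
# classical (linear) factorization.
W1 = rng.normal(scale=0.1, size=(d_cell + d_assay + d_pos, d_hidden))
b1 = np.zeros(d_hidden)
W2 = rng.normal(scale=0.1, size=(d_hidden, 1))
b2 = np.zeros(1)


def predict_signal(cell, assay, positions):
    """Predict one signal value per position for a (cell type, assay) pair."""
    positions = np.asarray(positions)
    n = positions.shape[0]
    # Concatenate the three factor vectors for every requested position.
    features = np.concatenate(
        [
            np.tile(cell_factors[cell], (n, 1)),
            np.tile(assay_factors[assay], (n, 1)),
            position_factors[positions],
        ],
        axis=1,
    )
    hidden = np.maximum(features @ W1 + b1, 0.0)  # ReLU activation
    return (hidden @ W2 + b2).ravel()


# Predict a full track for one (cell type, assay) combination.
track = predict_signal(cell=1, assay=2, positions=np.arange(n_positions))
```

In this framing, the rows of `position_factors` play the role of the latent representation of the genome: once fitted, they could be passed as input features to a separate downstream model.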

A pitfall for machine learning methods aiming to predict across cell types

Genome Biology ◽

10.1186/s13059-020-02177-y ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Jacob Schreiber ◽

Ritambhara Singh ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Cell Types ◽

Chromatin Domain ◽

Learning Models ◽

Machine Learning Methods ◽

Domain Boundaries ◽

Average Activity ◽

Test Sets ◽

Machine Learning Models

Abstract:

Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.

Author Correction: Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome

Genome Biology ◽

10.1186/s13059-021-02470-4 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Jacob Schreiber ◽

Timothy Durham ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Factorization Method ◽

Tensor Factorization ◽

Multi Scale

A pitfall for machine learning methods aiming to predict across cell types

10.1101/512434 ◽

2019 ◽

Cited By ~ 11

Author(s):

Jacob Schreiber ◽

Ritambhara Singh ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Gene Expression ◽

Machine Learning ◽

Cell Types ◽

Chromatin Domain ◽

Genomic Locus ◽

Enhancer Activity ◽

Domain Boundaries ◽

Multiple Cell ◽

Average Activity ◽

Predict Gene Expression

Abstract:

Machine learning models used to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that when the training set contains examples derived from the same genomic loci across multiple cell types, the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
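One way to make the pitfall in the abstract above concrete is to score a trivial baseline that predicts each locus's activity in a held-out cell type as the mean of that locus across the training cell types. The sketch below does this on synthetic data. The data-generating process, array shapes, and function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def average_activity_baseline(train_activity):
    """Predict each locus's activity as its mean over the training cell types.

    train_activity has shape (n_train_cell_types, n_loci); the result has
    shape (n_loci,).
    """
    return np.asarray(train_activity, dtype=float).mean(axis=0)


def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data (an assumption for illustration): every locus has a shared
# mean activity plus a smaller cell-type-specific deviation.
rng = np.random.default_rng(0)
n_cell_types, n_loci = 6, 2000
locus_mean = rng.normal(size=n_loci)
activity = locus_mean + 0.3 * rng.normal(size=(n_cell_types, n_loci))

# Hold out the last cell type but keep the same loci in train and test.
train, held_out = activity[:-1], activity[-1]

baseline = average_activity_baseline(train)
baseline_r = pearson(baseline, held_out)
print(f"average-activity baseline correlation: {baseline_r:.3f}")
```

If a trained model does not clearly beat this baseline on the held-out cell type, its apparent accuracy may mostly reflect memorized per-locus averages rather than cell-type-specific signal.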

Allele-specific control of replication timing and genome organization during development

10.1101/221762 ◽

2017 ◽

Cited By ~ 2

Author(s):

Juan Carlos Rivera-Mulia ◽

Andrew Dimond ◽

Daniel Vera ◽

Claudia Trevilla-Garcia ◽

Takayo Sasaki ◽

...

Keyword(s):

Gene Expression ◽

Cell Fate ◽

Genome Organization ◽

Replication Timing ◽

Cell Types ◽

Chromatin Accessibility ◽

Parental Origin ◽

F1 Hybrid ◽

Primary Mouse ◽

Allele Specific

Abstract:

DNA replication occurs in a defined temporal order known as the replication-timing (RT) program. RT is regulated during development in discrete chromosomal units, coordinated with transcriptional activity and 3D genome organization. Here, we derived distinct cell types from F1 hybrid musculus X castaneus mouse crosses and exploited the high single nucleotide polymorphism (SNP) density to characterize allelic differences in RT (Repli-seq), genome organization (Hi-C and promoter-capture Hi-C), gene expression (nuclear RNA-seq) and chromatin accessibility (ATAC-seq). We also present HARP, a new computational tool for sorting SNPs in phased genomes to efficiently measure allele-specific genome-wide data. Analysis of 6 different hybrid mESC clones with different genomes (C57BL/6, 129/sv and CAST/Ei), parental configurations and gender revealed significant RT asynchrony between alleles across ~12% of the autosomal genome linked to sub-species genomes but not to parental origin, growth conditions or gender. RT asynchrony in mESCs strongly correlated with changes in Hi-C compartments between alleles but not SNP density, gene expression, imprinting or chromatin accessibility. We then tracked mESC RT asynchronous regions during development by analyzing differentiated cell types including extraembryonic endoderm stem (XEN) cells, 4 male and female primary mouse embryonic fibroblasts (MEFs) and neural precursors (NPCs) differentiated in vitro from mESCs with opposite parental configurations. Surprisingly, we found that RT asynchrony and allelic discordance in Hi-C compartments seen in mESCs were largely lost in all differentiated cell types, coordinated with a more uniform Hi-C compartment arrangement, suggesting that genome organization of homologues converges to similar folding patterns during cell fate commitment.

RT States: systematic annotation of the human genome using cell type-specific replication timing programs

10.1101/394601 ◽

2018 ◽

Author(s):

Axel Poulet ◽

Ben Li ◽

Tristan Dubos ◽

Juan Carlos Rivera-Mulia ◽

David M. Gilbert ◽

...

Keyword(s):

Human Genome ◽

Cell Fate ◽

Developmental Stages ◽

Replication Timing ◽

Cell Types ◽

Biological Properties ◽

Biological Processes ◽

Genomic Locus ◽

Genome Wide ◽

Fundamental Biological Process

Abstract:

The replication timing (RT) program has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technological progress now allows us to characterize the RT program in the entire human genome in a high-throughput and high-resolution fashion. These experiments suggest that RT changes dynamically during development in coordination with gene activity. Since RT is such a fundamental biological process, we believe that an effective quantitative profile of the local RT program from a diverse set of cell types in various developmental stages and lineages can provide crucial biological insights for a genomic locus. In the present study, we explored recurrent and spatially coherent combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We found that a Hidden Markov Model with 15 hidden states provides a good model to describe these genome-wide RT profiling data. Each hidden state represents a unique combination of RT profiles across different cell types, which we refer to as “RT states”. To understand the biological properties of these RT states, we inspected their relationship with chromatin states, gene expression, functional annotation and 3D chromosomal organization. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

Author Summary:

The replication timing (RT) program is an important cellular mechanism and has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technological progress now allows us to characterize the RT program in the entire human genome. Results from these experiments suggest that RT changes dynamically across different developmental stages. Since RT is such a fundamental biological process, we believe that the local RT program from a diverse set of cell types in various developmental stages can provide crucial biological insights for a genomic locus. In the present study, we explored combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We developed a statistical model consisting of 15 “RT states” to describe these genome-wide RT profiling data. To understand the biological properties of these RT states, we inspected the relationship between RT states and other types of functional annotations of the genome. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome

Genome Biology ◽

10.1186/s13059-020-01977-6 ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 2

Author(s):

Jacob Schreiber ◽

Timothy Durham ◽

Jeffrey Bilmes ◽

William Stafford Noble

Keyword(s):

Factorization Method ◽

Tensor Factorization ◽

Multi Scale

RT States: systematic annotation of the human genome using cell type-specific replication timing programs

Bioinformatics ◽

10.1093/bioinformatics/bty957 ◽

2018 ◽

Vol 35 (13) ◽

pp. 2167-2176

Author(s):

Axel Poulet ◽

Ben Li ◽

Tristan Dubos ◽

Juan Carlos Rivera-Mulia ◽

David M Gilbert ◽

...

Keyword(s):

Human Genome ◽

Cell Fate ◽

Developmental Stages ◽

Replication Timing ◽

Cell Types ◽

Biological Properties ◽

Supplementary Information ◽

Genomic Locus ◽

Genome Wide ◽

Fundamental Biological Process

Abstract:

Motivation: The replication timing (RT) program has been linked to many key biological processes including cell fate commitment, 3D chromatin organization and transcription regulation. Significant technological progress now allows us to characterize the RT program in the entire human genome in a high-throughput and high-resolution fashion. These experiments suggest that RT changes dynamically during development in coordination with gene activity. Since RT is such a fundamental biological process, we believe that an effective quantitative profile of the local RT program from a diverse set of cell types in various developmental stages and lineages can provide crucial biological insights for a genomic locus.

Results: In this study, we explored recurrent and spatially coherent combinatorial profiles from 42 RT programs collected from multiple lineages at diverse differentiation states. We found that a Hidden Markov Model with 15 hidden states provides a good model to describe these genome-wide RT profiling data. Each hidden state represents a unique combination of RT profiles across different cell types, which we refer to as ‘RT states’. To understand the biological properties of these RT states, we inspected their relationship with chromatin states, gene expression, functional annotation and 3D chromosomal organization. We found that the newly defined RT states possess interesting genome-wide functional properties that add complementary information to the existing annotation of the human genome.

Availability and implementation: R scripts for inferring HMM models and Perl scripts for further analysis are available at https://github.com/PouletAxel/script_HMM_Replication_timing.

Supplementary information: Supplementary data are available at Bioinformatics online.

Learning a latent representation of human genomics using Avocado

10.1101/2020.06.18.159756 ◽

2020 ◽

Author(s):

Jacob Schreiber ◽

William Noble

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

High Throughput Sequencing ◽

Factorization Method ◽

High Accuracy ◽

Tensor Factorization ◽

Learning Models ◽

Human Genomics ◽

The Past ◽

Machine Learning Models

Abstract:

In the past decade, the use of high-throughput sequencing assays has allowed researchers to experimentally acquire thousands of functional measurements for each basepair in the human genome. Despite their value, these measurements are only a small fraction of the potential experiments that could be performed while also being too numerous to easily visualize or compute on. In a recent pair of publications, we address both of these challenges with a deep neural network tensor factorization method, Avocado, that compresses these measurements into dense, information-rich representations. We demonstrate that these learned representations can be used to impute with high accuracy the output of experimental assays that have not yet been performed and that machine learning models that leverage these representations outperform those trained directly on the functional measurements on a variety of genomics tasks. The code is publicly available at https://github.com/jmschrei/avocado.
