SpectralTAD: an R package for defining a hierarchy of Topologically Associated Domains using spectral clustering

Mapping Intimacies ◽

10.1101/549170 ◽

2019 ◽

Cited By ~ 2

Author(s):

Kellen G. Cresswell ◽

John C. Stansfield ◽

Mikhail G. Dozmorov

Keyword(s):

Spectral Clustering ◽

3D Structure ◽

Real Life ◽

Depth Resolution ◽

Building Blocks ◽

R Package ◽

Sequencing Depth ◽

Ctcf Binding ◽

Robust Detection ◽

Topologically Associated Domains

Abstract The three-dimensional (3D) structure of the genome plays a crucial role in regulating gene expression. Chromatin conformation capture technologies (Hi-C) have revealed that the genome is organized in a hierarchy of topologically associated domains (TADs), the fundamental building blocks of the genome. Identifying such hierarchical structures is a critical step in understanding regulatory interactions within the genome. Existing tools for TAD calling frequently require tunable parameters, are sensitive to biases such as sequencing depth, resolution, and sparsity of Hi-C data, and are computationally inefficient. Furthermore, the choice of TAD callers within the R/Bioconductor ecosystem is limited. To address these challenges, we frame the problem of TAD detection in a spectral clustering framework. Our SpectralTAD R package has automatic parameter selection, is robust to sequencing depth, resolution, and sparsity of Hi-C data, and detects hierarchical, biologically relevant TAD structure. Using simulated and real-life Hi-C data, we show that SpectralTAD outperforms rGMAP and TopDom, two state-of-the-art R-based TAD callers. TAD boundaries that are shared among multiple levels of the hierarchy were more enriched in relevant genomic annotations, e.g., CTCF binding sites, suggesting their higher biological importance. In contrast, boundaries of primary TADs, defined as TADs which cannot be split into sub-TADs, were found to be less enriched in genomic annotations, suggesting their more dynamic role in genome regulation. In summary, we present a simple, fast, and user-friendly R package for robust detection of TAD hierarchies supported by biological evidence. SpectralTAD is available on https://github.com/dozmorovlab/SpectralTAD and Bioconductor (submitted).
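
The idea of framing TAD detection as spectral clustering can be illustrated with a small sketch. This is not the SpectralTAD algorithm: it is a simplified, assumed variant that splits a toy contact matrix into two domains using the sign of the second eigenvector of the normalized adjacency matrix, computed by power iteration. The function names and the toy matrix are hypothetical.

```python
import math

def fiedler_vector(contacts, iterations=200):
    """Approximate the second eigenvector of D^-1/2 A D^-1/2 by power iteration.

    `contacts` is a symmetric matrix (list of lists) with positive row sums.
    """
    n = len(contacts)
    degree = [sum(row) for row in contacts]
    # Symmetrically normalized adjacency matrix M = D^-1/2 A D^-1/2.
    m = [[contacts[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    # The top eigenvector of M is known in closed form: sqrt(degree), normalized.
    norm = math.sqrt(sum(degree))
    v1 = [math.sqrt(d) / norm for d in degree]
    # Deterministic start vector (an assumption; a random start also works).
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Deflate: remove the component along the trivial eigenvector.
        proj = sum(a * b for a, b in zip(x, v1))
        x = [a - proj * b for a, b in zip(x, v1)]
        # Multiply by (M + I) / 2 so that all eigenvalues are non-negative.
        x = [0.5 * (sum(m[i][j] * x[j] for j in range(n)) + x[i]) for i in range(n)]
        length = math.sqrt(sum(a * a for a in x))
        x = [a / length for a in x]
    return x

def sign_change_boundaries(vector):
    """Return indices where consecutive bins fall on opposite sides of zero."""
    return [i for i in range(1, len(vector))
            if (vector[i - 1] < 0) != (vector[i] < 0)]

# Toy Hi-C-like matrix: two blocks of three bins with strong within-block contacts.
toy = [[10 if (i < 3) == (j < 3) else 1 for j in range(6)] for i in range(6)]
print(sign_change_boundaries(fiedler_vector(toy)))
```

On the toy matrix the only sign change falls between the third and fourth bins, which is where the two blocks meet. A hierarchy could be approximated by applying the same split recursively within each domain, although the package's own procedure may differ.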

Correction to: SpectralTAD: an R package for defining a hierarchy of topologically associated domains using spectral clustering

BMC Bioinformatics ◽

10.1186/s12859-020-03710-3 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Kellen G. Cresswell ◽

John C. Stansfield ◽

Mikhail G. Dozmorov

Keyword(s):

Spectral Clustering ◽

R Package ◽

Topologically Associated Domains

DNA sequence-dependent chromatin architecture and nuclear hubs formation

Scientific Reports ◽

10.1038/s41598-019-51036-9 ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 4

Author(s):

Kamel Jabbari ◽

Maharshi Chakraborty ◽

Thomas Wiehe

Keyword(s):

Dna Sequence ◽

3D Structure ◽

Ctcf Binding ◽

Nuclear Speckles ◽

Dna Base Composition ◽

Sequence Composition ◽

Chromatin Architecture ◽

Chromatin Interactions ◽

Dna Base ◽

Topologically Associated Domains

Abstract In this study, by exploring chromatin conformation capture data, we show that DNA sequence composition contributes to the nuclear segregation of Topologically Associated Domains (TADs). GC-peaks and valleys of TADs strongly influence interchromosomal interactions and chromatin 3D structure. To gain insight into the compositional and functional constraints associated with chromatin interactions and TAD formation, we analysed intra-TAD and intra-loop GC variations. This led to the identification of clear GC-gradients, along which the density of genes, super-enhancers, transcriptional activity, and CTCF binding site occupancy co-vary non-randomly. Further, the analysis of DNA base composition of nucleolar aggregates and nuclear speckles showed strong sequence-dependent effects. We conjecture that dynamic DNA binding affinity and flexibility underlie the emergence of chromatin condensates; their growth is likely promoted in mechanically soft (GC-rich) regions of the lowest chromatin and nucleosome densities. As a practical perspective, the strong linear association between sequence composition and interchromosomal contacts can help define consensus chromatin interactions, which in turn may be used to study alternative states of chromatin architecture.
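
A GC profile of the kind analysed above can be computed with a sliding window. The following is a minimal sketch under assumed conventions (window and step sizes in bases, ambiguous bases ignored); the function names are hypothetical and it does not reproduce the authors' pipeline.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases, or None if there are none."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return None
    return sum(base in "GC" for base in counted) / len(counted)

def windowed_gc(seq, window, step):
    """GC fraction for each full window of length `window`, moving by `step`."""
    return [gc_fraction(seq[start:start + window])
            for start in range(0, len(seq) - window + 1, step)]

# Toy sequence: a GC-rich half followed by an AT-rich half.
print(windowed_gc("GGCCAATT", window=4, step=4))
```

Plotting such a profile along a TAD or loop would show whether GC content rises or falls across the domain, which is the sense of "GC-gradient" used in the abstract.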

SpectralTAD: an R package for defining a hierarchy of topologically associated domains using spectral clustering

BMC Bioinformatics ◽

10.1186/s12859-020-03652-w ◽

2020 ◽

Vol 21 (1) ◽

Cited By ~ 1

Author(s):

Kellen G. Cresswell ◽

John C. Stansfield ◽

Mikhail G. Dozmorov

Keyword(s):

Spectral Clustering ◽

R Package ◽

Topologically Associated Domains

HiCcompare: a method for joint normalization of Hi-C datasets and differential chromatin interaction detection

10.1101/147850 ◽

2017 ◽

Cited By ~ 2

Author(s):

John C. Stansfield ◽

Mikhail G. Dozmorov

Keyword(s):

Gene Expression ◽

Comparative Analysis ◽

Regulation Of Gene Expression ◽

3D Structure ◽

Real Life ◽

R Package ◽

Chromatin Interaction ◽

Interaction Detection ◽

Chromatin Interactions ◽

A Genome

Abstract Changes in spatial chromatin interactions are now emerging as a unifying mechanism orchestrating regulation of gene expression. Evolution of chromatin conformation capture methods into Hi-C sequencing technology now allows an insight into chromatin interactions on a genome-wide scale. However, Hi-C data contains many DNA sequence- and technology-driven biases. These biases prevent effective comparison of chromatin interactions aimed at identifying genomic regions differentially interacting between disease-normal states or different cell types. Several methods have been developed for normalizing individual Hi-C datasets. However, they fail to account for biases between two or more Hi-C datasets, hindering comparative analysis of chromatin interactions. We developed a simple and effective method HiCcompare for the joint normalization and differential analysis of multiple Hi-C datasets. The method avoids constraining Hi-C data within a rigid statistical model, allowing a data-driven normalization of biases using locally weighted linear regression (loess). The method identifies region-specific chromatin interaction changes complementary to changes due to large-scale genomic rearrangements, such as copy number variants (CNVs). HiCcompare outperforms methods for normalizing individual Hi-C datasets in detecting a priori known chromatin interaction differences in simulated and real-life settings while detecting biologically relevant changes. HiCcompare is freely available as a Bioconductor R package https://bioconductor.org/packages/HiCcompare/.

Author Summary Advances in chromosome conformation capture sequencing technologies (Hi-C) have sparked interest in studying the 3-dimensional (3D) chromatin interaction structure of the human genome. The 3D structure of the genome is now considered a primary regulator of gene expression. Changes to the 3D chromatin interactions are now emerging as a hallmark of cancer and other complex diseases. With the growing availability of Hi-C data generated under different conditions (e.g. tumor-normal, cell-type-specific), methods are needed to compare them. However, biases in Hi-C data hinder their comparative analysis. To account for biases, several normalization techniques have been developed for removing biases in individual Hi-C datasets, but very few were designed to account for between-dataset biases. We developed a new method and R package HiCcompare for the joint normalization of multiple Hi-C datasets and differential chromatin interaction detection. Our results show the superiority of our joint normalization methods compared to methods for normalizing individual datasets in detecting true chromatin interaction changes. HiCcompare enables further research into discovering the dynamics of 3D genomic changes.
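
Joint normalization of two datasets can be sketched as removing a distance-dependent trend from the log ratio of their interaction frequencies. The abstract names loess as the smoother; to stay self-contained, the sketch below swaps in a plain per-distance mean, which is an assumption and a much cruder estimate. The function name, input layout, and the choice to split the correction evenly between the two datasets are also illustrative assumptions.

```python
import math
from collections import defaultdict

def joint_normalize(pairs):
    """Remove the per-distance mean log2 ratio between two datasets.

    `pairs` is a list of (distance, if1, if2) tuples with positive
    interaction frequencies from dataset 1 and dataset 2.
    """
    log_ratios = defaultdict(list)
    for distance, if1, if2 in pairs:
        log_ratios[distance].append(math.log2(if2 / if1))
    # Stand-in for a loess fit: the mean log ratio at each distance.
    trend = {d: sum(values) / len(values) for d, values in log_ratios.items()}
    adjusted = []
    for distance, if1, if2 in pairs:
        # Split the correction evenly so neither dataset is the reference.
        shift = trend[distance] / 2
        adjusted.append((distance, if1 * 2 ** shift, if2 * 2 ** (-shift)))
    return adjusted

pairs = [(1, 10.0, 20.0), (1, 10.0, 10.0), (2, 8.0, 2.0)]
for distance, if1, if2 in joint_normalize(pairs):
    print(distance, round(if1, 3), round(if2, 3), round(math.log2(if2 / if1), 3))
```

After the adjustment, the mean log ratio at every distance is zero, so differences that remain are deviations from the shared trend rather than global between-dataset bias.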

SuperTAD: robust detection of hierarchical topologically associated domains with optimized structural information

Genome Biology ◽

10.1186/s13059-020-02234-6 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Yu Wei Zhang ◽

Meng Bo Wang ◽

Shuai Cheng Li

Keyword(s):

Structural Information ◽

Current Method ◽

Structural Proteins ◽

Robust Detection ◽

Topologically Associating Domains ◽

Topologically Associated Domains ◽

Significant Enrichment ◽

High Consistency ◽

Public Datasets ◽

State Of Art

Abstract Topologically associating domains (TADs) are the organizational units of chromosome structures. TADs can contain TADs, thus forming a hierarchy. TAD hierarchies can be inferred from Hi-C data through coding trees. However, the current method for computing coding trees is not optimal. In this paper, we propose optimal algorithms for this computation. In comparison with seven state-of-the-art methods using two public datasets, from GM12878 and IMR90 cells, SuperTAD shows a significant enrichment of structural proteins around detected boundaries and histone modifications within TADs and displays a high consistency between various resolutions of identical Hi-C matrices.

LncRNA:DNA triplex-forming sites are positioned at specific areas of genome organization and are predictors for Topologically Associated Domains

BMC Genomics ◽

10.1186/s12864-021-07727-7 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Benjamin Soibam ◽

Ayzhamal Zhamangaraeva

Keyword(s):

Genome Organization ◽

Chromatin Organization ◽

Ctcf Binding ◽

General Mechanism ◽

Promoter Regions ◽

Cellular Processes ◽

A Genome ◽

Topologically Associated Domains ◽

Multiple Cell ◽

Genomic Regions

Abstract Background Chromosomes are organized into units called topologically associated domains (TADs). TADs dictate regulatory landscapes and other DNA-dependent processes. Even though various factors that contribute to the specification of TADs have been proposed, the mechanism is not fully understood. Understanding the process for specification and maintenance of these units is essential in dissecting cellular processes and disease mechanisms. Results In this study, we report a genome-wide study that considers the idea of long noncoding RNAs (lncRNAs) mediating chromatin organization using lncRNA:DNA triplex-forming sites (TFSs). By analyzing the TFSs of expressed lncRNAs in multiple cell lines, we find that they are enriched in TADs, their boundaries, and loop anchors. However, they are evenly distributed across different regions of a TAD, showing no preference for any specific portions within TADs. No relationship is observed between the locations of these TFSs and CTCF binding sites. However, TFSs are located not just in promoter regions but also in intronic, intergenic, and 3′ UTR regions. We also show these triplex-forming sites can be used as predictors in machine learning models to discriminate TADs from other genomic regions. Finally, we compile a list of important “TAD-lncRNAs” which are top predictors for TAD identification. Conclusions Our observations advocate the idea that lncRNA:DNA TFSs are positioned at specific areas of the genome organization and are important predictors for TADs. LncRNA:DNA triplex formation most likely is a general mechanism of action exhibited by some lncRNAs, not just for direct gene regulation but also to mediate 3D chromatin organization.

Identifying home locations in human mobility data: an open-source R package for comparison and reproducibility

10.31235/osf.io/k3jp2 ◽

2021 ◽

Author(s):

Qingqing Chen ◽

Ate Poorthuis

Keyword(s):

Software Package ◽

Ad Hoc ◽

Human Mobility ◽

Building Blocks ◽

R Package ◽

Location Based Services ◽

R Software ◽

Mobility Data ◽

Residential Population ◽

Research Goal

Identifying meaningful locations, such as home or work, from human mobility data has become an increasingly common prerequisite for geographic research. Although location-based services (LBS) and other mobile technology have rapidly grown in recent years, it can be challenging to infer meaningful places from such data, which, compared to conventional datasets, can be devoid of context. Existing approaches are often developed ad hoc and can lack transparency and reproducibility. To address this, we introduce an R software package for inferring home locations from LBS data. The package implements pre-existing algorithms and provides building blocks to make writing algorithmic ‘recipes’ more convenient. We evaluate this approach by analyzing a de-identified LBS dataset from Singapore that aims to balance ethics and privacy with the research goal of identifying meaningful locations. We show that ensemble approaches, combining multiple algorithms, can be especially valuable in this regard as the resulting patterns of inferred home locations closely correlate with the distribution of residential population. We hope this package, and others like it, will contribute to an increase in use and sharing of comparable algorithms, research code and data. This will increase transparency and reproducibility in mobility analyses and further the ongoing discourse around ethical big data research.
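
One simple home-location heuristic is to pick, for each user, the place visited most often at night. The sketch below illustrates that idea only; it is not one of the package's recipes, and the record layout, the function name, and the 22:00 to 06:00 night window are all assumptions.

```python
from collections import Counter

def infer_home(records, night_start=22, night_end=6):
    """Return {user: location} using the most frequent night-time location.

    `records` is an iterable of (user_id, location_id, hour_of_day) tuples,
    with hour_of_day in 0..23. Users without night-time records are omitted.
    """
    night_counts = {}
    for user, location, hour in records:
        # The night window wraps around midnight.
        if hour >= night_start or hour < night_end:
            night_counts.setdefault(user, Counter())[location] += 1
    return {user: counts.most_common(1)[0][0]
            for user, counts in night_counts.items()}

records = [
    ("u1", "A", 23), ("u1", "A", 2), ("u1", "B", 1),
    ("u1", "B", 14), ("u1", "B", 15), ("u1", "B", 16),
    ("u2", "C", 12),
]
print(infer_home(records))
```

Here user `u1` is assigned location `A` even though `B` has more records overall, because only night-time records count, and `u2` receives no home because no night-time data exist. An ensemble, as discussed in the abstract, would combine the outputs of several such rules.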

Improving Transparency, Falsifiability, and Rigour by Making Hypothesis Tests Machine Readable

10.31234/osf.io/5xcda ◽

2020 ◽

Cited By ~ 2

Author(s):

Daniel Lakens ◽

Lisa Marie DeBruine

Keyword(s):

Peer Review ◽

Hypothesis Test ◽

Scientific Information ◽

Real Life ◽

R Package ◽

Statistical Prediction ◽

Hypothesis Tests ◽

Research Questions ◽

Meta Analyses ◽

Machine Readable

Making scientific information machine-readable greatly facilitates its re-use. Many scientific articles aim to test a hypothesis, so making the tests of statistical predictions easier to find and access could be very beneficial. We propose an approach that can be used to make hypothesis tests machine-readable. We believe there are two benefits to specifying a hypothesis test in a way that a computer can evaluate whether the statistical prediction is corroborated or not. First, hypothesis tests will become more transparent, falsifiable, and rigorous. Second, scientists will benefit if information related to hypothesis tests in scientific articles is easily findable and re-usable, for example when performing meta-analyses, during peer review, and when examining meta-scientific research questions. We examine what a machine-readable hypothesis test should look like, and demonstrate the feasibility of machine-readable hypothesis tests in a real-life example using the fully operational prototype R package scienceverse.
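
A machine-readable hypothesis test can be thought of as a structured record that states, before the data are seen, which analysis results must satisfy which conditions. The sketch below shows one assumed encoding; the field names and the `evaluate` function are hypothetical and do not reflect the scienceverse format.

```python
import operator

# Supported comparison symbols mapped to Python operators.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

def evaluate(hypothesis, results):
    """Return True if every criterion of the hypothesis holds for `results`."""
    return all(
        COMPARATORS[criterion["comparator"]](
            results[criterion["result"]], criterion["value"]
        )
        for criterion in hypothesis["criteria"]
    )

# Hypothetical prediction: a significant, positive effect.
hypothesis = {
    "id": "H1",
    "description": "The treatment group scores higher than the control group.",
    "criteria": [
        {"result": "p_value", "comparator": "<", "value": 0.05},
        {"result": "effect_size", "comparator": ">", "value": 0.0},
    ],
}

print(evaluate(hypothesis, {"p_value": 0.01, "effect_size": 0.4}))
print(evaluate(hypothesis, {"p_value": 0.20, "effect_size": 0.4}))
```

Because the criteria are data rather than prose, the same record could be stored alongside an article and re-evaluated automatically during peer review or a meta-analysis.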

Monitoring Robust Estimates for Compositional Data

Austrian Journal of Statistics ◽

10.17713/ajs.v50i2.1067 ◽

2021 ◽

Vol 50 (2) ◽

pp. 16-37

Author(s):

Valentin Todorov

Keyword(s):

Compositional Data ◽

Real Life ◽

R Package ◽

Diagnostic Tools ◽

Data Set ◽

Mahalanobis Distances ◽

Robust Estimates ◽

Real Life Data ◽

Manufactured Exports ◽

Parameter Values

In a number of recent articles, Riani, Cerioli, Atkinson and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach, the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. This approach is applicable to various robust multivariate estimates such as S- and MM-estimates, MVE and MCD, as well as to the Forward Search, in which monitoring is part of the robust method. The key tools for detecting multivariate outliers and for monitoring robust estimates are the Mahalanobis distances and statistics related to these distances. However, the results obtained with these tools in the case of compositional data might be unrealistic, since compositional data contain relative rather than absolute information and need to be transformed to the usual Euclidean geometry before the standard statistical tools can be applied. Various data transformations of compositional data have been introduced in the literature, and theoretical results exist on the equivalence of the additive, the centered, and the isometric logratio transformations in the context of outlier identification. To illustrate the problem of monitoring compositional data and to demonstrate the usefulness of monitoring in this case, we start with a simple example and then analyze a real-life data set presenting the technological structure of manufactured exports. The analysis is conducted with the R package fsdaR, which makes the analytical and graphical tools provided in the MATLAB FSDA library available for R users.
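
The need to transform compositional data before applying distance-based tools can be seen with the centered logratio (clr) transformation, one of the three transformations named above. The sketch below is a generic illustration, not code from fsdaR or FSDA; the function names are hypothetical, and all parts are assumed to be strictly positive.

```python
import math

def clr(composition):
    """Centered logratio: log of each part minus the mean log of all parts."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr coordinates of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# The same composition expressed as counts and as proportions.
counts = [1.0, 2.0, 7.0]
shares = [0.1, 0.2, 0.7]
print(aitchison_distance(counts, shares))
```

The distance between the two vectors is zero up to rounding, because only the ratios between parts carry information. Since clr coordinates always sum to zero, their covariance matrix is singular, which is one reason Mahalanobis distances are usually computed in isometric logratio coordinates or with a generalized inverse.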

Chicdiff: a computational pipeline for detecting differential chromosomal interactions in Capture Hi-C data

Bioinformatics ◽

10.1093/bioinformatics/btz450 ◽

2019 ◽

Vol 35 (22) ◽

pp. 4764-4766 ◽

Cited By ~ 3

Author(s):

Jonathan Cairns ◽

William R Orchard ◽

Valeriya Malysheva ◽

Mikhail Spivakov

Keyword(s):

Multiple Testing ◽

Positive Association ◽

R Package ◽

Supplementary Information ◽

Gene Promoters ◽

Robust Detection ◽

Testing Procedures ◽

Multiple Testing Procedures ◽

Powerful Approach ◽

Cell Type Specific

Abstract Summary Capture Hi-C is a powerful approach for detecting chromosomal interactions involving, at least on one end, DNA regions of interest, such as gene promoters. We present Chicdiff, an R package for robust detection of differential interactions in Capture Hi-C data. Chicdiff enhances a state-of-the-art differential testing approach for count data with bespoke normalization and multiple testing procedures that account for specific statistical properties of Capture Hi-C. We validate Chicdiff on published Promoter Capture Hi-C data in human monocytes and CD4+ T cells, identifying multitudes of cell type-specific interactions, and confirming the overall positive association between promoter interactions and gene expression. Availability and implementation Chicdiff is implemented as an R package that is publicly available at https://github.com/RegulatoryGenomicsGroup/chicdiff. Supplementary information Supplementary data are available at Bioinformatics online.
