SpectralTAD: an R package for defining a hierarchy of Topologically Associated Domains using spectral clustering

2019 ◽  
Author(s):  
Kellen G. Cresswell ◽  
John C. Stansfield ◽  
Mikhail G. Dozmorov

Abstract The three-dimensional (3D) structure of the genome plays a crucial role in regulating gene expression. Chromatin conformation capture technologies (Hi-C) have revealed that the genome is organized in a hierarchy of topologically associated domains (TADs), the fundamental building blocks of the genome. Identifying such hierarchical structures is a critical step in understanding regulatory interactions within the genome. Existing tools for TAD calling frequently require tunable parameters, are sensitive to biases such as sequencing depth, resolution, and sparsity of Hi-C data, and are computationally inefficient. Furthermore, the choice of TAD callers within the R/Bioconductor ecosystem is limited. To address these challenges, we frame the problem of TAD detection in a spectral clustering framework. Our SpectralTAD R package has automatic parameter selection, is robust to sequencing depth, resolution, and sparsity of Hi-C data, and detects hierarchical, biologically relevant TAD structure. Using simulated and real-life Hi-C data, we show that SpectralTAD outperforms rGMAP and TopDom, two state-of-the-art R-based TAD callers. TAD boundaries that are shared among multiple levels of the hierarchy were more enriched in relevant genomic annotations, e.g., CTCF binding sites, suggesting their higher biological importance. In contrast, boundaries of primary TADs, defined as TADs which cannot be split into sub-TADs, were found to be less enriched in genomic annotations, suggesting their more dynamic role in genome regulation. In summary, we present a simple, fast, and user-friendly R package for robust detection of TAD hierarchies supported by biological evidence. SpectralTAD is available at https://github.com/dozmorovlab/SpectralTAD and on Bioconductor (submitted).
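The spectral framing described in this abstract can be illustrated with a minimal sketch. It is not SpectralTAD's actual procedure, and every name in it (`fiedler_vector`, `domain_boundaries`, the toy matrix) is hypothetical. The sketch assumes a small symmetric contact matrix, builds the unnormalized graph Laplacian, approximates its second-smallest eigenvector by shifted power iteration, and places a boundary wherever adjacent bins fall on opposite sides of zero.

```python
def fiedler_vector(contacts, iterations=200):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue."""
    n = len(contacts)
    degree = [sum(row) for row in contacts]
    # Shift so that (shift*I - L) has non-negative eigenvalues (Gershgorin bound).
    shift = 2.0 * max(degree)
    v = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Remove the constant component, i.e. the trivial eigenvector of L.
        mean = sum(v) / n
        v = [x - mean for x in v]
        # w = (shift*I - L) v, where L = D - A.
        w = [
            shift * v[i] - degree[i] * v[i]
            + sum(contacts[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def domain_boundaries(contacts):
    """Return indices of bins that start a new domain (sign change in the vector)."""
    v = fiedler_vector(contacts)
    signs = [x >= 0 for x in v]
    return [i + 1 for i in range(len(signs) - 1) if signs[i] != signs[i + 1]]


# Toy matrix: two 3-bin domains, strong contacts inside, weak contacts between.
toy = [
    [0 if i == j else (10 if (i < 3) == (j < 3) else 1) for j in range(6)]
    for i in range(6)
]
print(domain_boundaries(toy))
```

On this toy input the only sign change falls between the third and fourth bins. Real Hi-C matrices are noisy and far larger, so a practical caller would need a proper eigen-solver and a principled way to choose the number of domains.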

2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Kamel Jabbari ◽  
Maharshi Chakraborty ◽  
Thomas Wiehe

Abstract In this study, by exploring chromatin conformation capture data, we show that DNA sequence composition contributes to the nuclear segregation of Topologically Associated Domains (TADs). GC-peaks and valleys of TADs strongly influence interchromosomal interactions and chromatin 3D structure. To gain insight into the compositional and functional constraints associated with chromatin interactions and TAD formation, we analysed intra-TAD and intra-loop GC variations. This led to the identification of clear GC-gradients, along which the density of genes, super-enhancers, transcriptional activity, and CTCF binding site occupancy co-vary non-randomly. Further, the analysis of DNA base composition of nucleolar aggregates and nuclear speckles showed strong sequence-dependent effects. We conjecture that dynamic DNA binding affinity and flexibility underlie the emergence of chromatin condensates; their growth is likely promoted in mechanically soft regions (GC-rich) of the lowest chromatin and nucleosome densities. As a practical perspective, the strong linear association between sequence composition and interchromosomal contacts can help define consensus chromatin interactions, which in turn may be used to study alternative states of chromatin architecture.


2017 ◽  
Author(s):  
John C. Stansfield ◽  
Mikhail G. Dozmorov

Abstract Changes in spatial chromatin interactions are now emerging as a unifying mechanism orchestrating regulation of gene expression. Evolution of chromatin conformation capture methods into Hi-C sequencing technology now allows an insight into chromatin interactions on a genome-wide scale. However, Hi-C data contains many DNA sequence- and technology-driven biases. These biases prevent effective comparison of chromatin interactions aimed at identifying genomic regions differentially interacting between disease-normal states or different cell types. Several methods have been developed for normalizing individual Hi-C datasets. However, they fail to account for biases between two or more Hi-C datasets, hindering comparative analysis of chromatin interactions. We developed a simple and effective method, HiCcompare, for the joint normalization and differential analysis of multiple Hi-C datasets. The method avoids constraining Hi-C data within a rigid statistical model, allowing a data-driven normalization of biases using locally weighted linear regression (loess). The method identifies region-specific chromatin interaction changes complementary to changes due to large-scale genomic rearrangements, such as copy number variants (CNVs). HiCcompare outperforms methods for normalizing individual Hi-C datasets in detecting a priori known chromatin interaction differences in simulated and real-life settings while detecting biologically relevant changes. HiCcompare is freely available as a Bioconductor R package at https://bioconductor.org/packages/HiCcompare/. Author Summary Advances in chromosome conformation capture sequencing technologies (Hi-C) have sparked interest in studying the 3-dimensional (3D) chromatin interaction structure of the human genome. The 3D structure of the genome is now considered as a primary regulator of gene expression. Changes to the 3D chromatin interactions are now emerging as a hallmark of cancer and other complex diseases.
With the growing availability of Hi-C data generated under different conditions (e.g. tumor-normal, cell-type-specific), methods are needed to compare them. However, biases in Hi-C data hinder their comparative analysis. To account for biases, several normalization techniques have been developed for removing biases in individual Hi-C datasets, but very few were designed to account for between-datasets biases. We developed a new method and R package HiCcompare for the joint normalization of multiple Hi-C datasets and differential chromatin interaction detection. Our results show the superiority of our joint normalization methods compared to methods for normalizing individual datasets in detecting true chromatin interaction changes. HiCcompare enables further research into discovering the dynamics of 3D genomic changes.
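The idea of joint normalization described in this abstract can be sketched as follows. This is a deliberately simplified stand-in, not HiCcompare's implementation: it replaces the loess fit with a per-distance mean, and it assumes the between-dataset bias is summarized by the log2 ratio M of the two interaction frequencies at each genomic distance D. The function name `joint_normalize` and the input layout are hypothetical.

```python
import math
from collections import defaultdict


def joint_normalize(pairs):
    """Center the log2 ratio of two Hi-C datasets at each genomic distance.

    pairs: list of (i, j, if1, if2), where i and j are bin indices and
    if1, if2 are positive interaction frequencies from the two datasets.
    """
    # M = log2(if2 / if1), grouped by distance D = |i - j|.
    m_by_distance = defaultdict(list)
    for i, j, if1, if2 in pairs:
        m_by_distance[abs(i - j)].append(math.log2(if2 / if1))

    # Per-distance mean of M: a crude substitute for a loess curve over D.
    trend = {d: sum(ms) / len(ms) for d, ms in m_by_distance.items()}

    # Split the correction evenly so neither dataset is treated as the reference.
    normalized = []
    for i, j, if1, if2 in pairs:
        half = trend[abs(i - j)] / 2.0
        normalized.append((i, j, if1 * 2.0 ** half, if2 / 2.0 ** half))
    return normalized


# Dataset 2 is 2x dataset 1 at distance 1 and 4x at distance 2.
example = [(0, 1, 10.0, 20.0), (1, 2, 8.0, 16.0), (0, 2, 3.0, 12.0)]
for i, j, a, b in joint_normalize(example):
    print(i, j, round(a, 3), round(b, 3))
```

After the adjustment, the distance-dependent offset between the two datasets is removed, so any remaining difference in a pair of frequencies is a candidate for a real interaction change rather than a global bias.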


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yu Wei Zhang ◽  
Meng Bo Wang ◽  
Shuai Cheng Li

Abstract Topologically associating domains (TADs) are the organizational units of chromosome structures. TADs can contain TADs, thus forming a hierarchy. TAD hierarchies can be inferred from Hi-C data through coding trees. However, the current method for computing coding trees is not optimal. In this paper, we propose optimal algorithms for this computation. In comparison with seven state-of-the-art methods using two public datasets, from GM12878 and IMR90 cells, SuperTAD shows a significant enrichment of structural proteins around detected boundaries and histone modifications within TADs and displays a high consistency between various resolutions of identical Hi-C matrices.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Benjamin Soibam ◽  
Ayzhamal Zhamangaraeva

Abstract Background Chromosomes are organized into units called topologically associated domains (TADs). TADs dictate regulatory landscapes and other DNA-dependent processes. Even though various factors that contribute to the specification of TADs have been proposed, the mechanism is not fully understood. Understanding the process for specification and maintenance of these units is essential in dissecting cellular processes and disease mechanisms. Results In this study, we report a genome-wide study that considers the idea of long noncoding RNAs (lncRNAs) mediating chromatin organization using lncRNA:DNA triplex-forming sites (TFSs). By analyzing the TFSs of expressed lncRNAs in multiple cell lines, we find that they are enriched in TADs, their boundaries, and loop anchors. However, they are evenly distributed across different regions of a TAD, showing no preference for any specific portions within TADs. No relationship is observed between the locations of these TFSs and CTCF binding sites. However, TFSs are located not just in promoter regions but also in intronic, intergenic, and 3’UTR regions. We also show these triplex-forming sites can be used as predictors in machine learning models to discriminate TADs from other genomic regions. Finally, we compile a list of important “TAD-lncRNAs” which are top predictors for TAD identification. Conclusions Our observations advocate the idea that lncRNA:DNA TFSs are positioned at specific areas of the genome organization and are important predictors for TADs. LncRNA:DNA triplex formation most likely is a general mechanism of action exhibited by some lncRNAs, not just for direct gene regulation but also to mediate 3D chromatin organization.


2021 ◽  
Author(s):  
Qingqing Chen ◽  
Ate Poorthuis

Identifying meaningful locations, such as home or work, from human mobility data has become an increasingly common prerequisite for geographic research. Although location-based services (LBS) and other mobile technology have rapidly grown in recent years, it can be challenging to infer meaningful places from such data, which, compared to conventional datasets, can be devoid of context. Existing approaches are often developed ad hoc and can lack transparency and reproducibility. To address this, we introduce an R software package for inferring home locations from LBS data. The package implements pre-existing algorithms and provides building blocks to make writing algorithmic ‘recipes’ more convenient. We evaluate this approach by analyzing a de-identified LBS dataset from Singapore that aims to balance ethics and privacy with the research goal of identifying meaningful locations. We show that ensemble approaches, combining multiple algorithms, can be especially valuable in this regard as the resulting patterns of inferred home locations closely correlate with the distribution of residential population. We hope this package, and others like it, will contribute to an increase in use and sharing of comparable algorithms, research code and data. This will increase transparency and reproducibility in mobility analyses and further the ongoing discourse around ethical big data research.


2020 ◽  
Author(s):  
Daniel Lakens ◽  
Lisa Marie DeBruine

Making scientific information machine-readable greatly facilitates its re-use. Many scientific articles have the goal to test a hypothesis, so making the tests of statistical predictions easier to find and access could be very beneficial. We propose an approach that can be used to make hypothesis tests machine-readable. We believe there are two benefits to specifying a hypothesis test in a way that a computer can evaluate whether the statistical prediction is corroborated or not. First, hypothesis tests will become more transparent, falsifiable, and rigorous. Second, scientists will benefit if information related to hypothesis tests in scientific articles is easily findable and re-usable, for example when performing meta-analyses, during peer review, and when examining meta-scientific research questions. We examine what a machine-readable hypothesis test should look like, and demonstrate the feasibility of machine-readable hypothesis tests in a real-life example using the fully operational prototype R package scienceverse.
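What it might mean for a computer to evaluate whether a statistical prediction is corroborated can be sketched in a few lines. This is an illustrative assumption, not the scienceverse API: the dictionary layout, the field names, and the `is_corroborated` function are all hypothetical. The prediction is stored as data (a list of criteria), and a small evaluator checks each criterion against reported results.

```python
import operator

# Comparison operators a criterion is allowed to use.
OPERATORS = {"<": operator.lt, ">": operator.gt}


def is_corroborated(hypothesis, results):
    """Return True only if every criterion of the hypothesis holds."""
    return all(
        OPERATORS[c["operator"]](results[c["result"]], c["threshold"])
        for c in hypothesis["criteria"]
    )


# Hypothetical specification: the effect must be significant and non-trivial.
hypothesis = {
    "id": "H1",
    "description": "Group A scores higher than group B.",
    "criteria": [
        {"result": "p_value", "operator": "<", "threshold": 0.05},
        {"result": "effect_size", "operator": ">", "threshold": 0.2},
    ],
}

# Hypothetical analysis output.
results = {"p_value": 0.012, "effect_size": 0.45}
print(is_corroborated(hypothesis, results))
```

Because the criteria are plain data, they can be written down before the analysis is run, stored alongside the article, and checked later by anyone holding the results.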


2021 ◽  
Vol 50 (2) ◽  
pp. 16-37
Author(s):  
Valentin Todorov

In a number of recent articles Riani, Cerioli, Atkinson and others advocate the technique of monitoring robust estimates computed over a range of key parameter values. Through this approach the diagnostic tools of choice can be tuned in such a way that highly robust estimators which are as efficient as possible are obtained. This approach is applicable to various robust multivariate estimates like S- and MM-estimates, MVE and MCD as well as to the Forward Search in which monitoring is part of the robust method. The key tools for detecting multivariate outliers and for monitoring robust estimates are the Mahalanobis distances and statistics related to these distances. However, the results obtained with this tool in the case of compositional data might be unrealistic since compositional data contain relative rather than absolute information and need to be transformed to the usual Euclidean geometry before the standard statistical tools can be applied. Various data transformations of compositional data have been introduced in the literature and theoretical results on the equivalence of the additive, the centered, and the isometric logratio transformation in the context of outlier identification exist. To illustrate the problem of monitoring compositional data and to demonstrate the usefulness of monitoring in this case we start with a simple example and then analyze a real-life data set presenting the technological structure of manufactured exports. The analysis is conducted with the R package fsdaR, which makes the analytical and graphical tools provided in the MATLAB FSDA library available for R users.
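The point that compositional data carry relative rather than absolute information can be made concrete with the centered logratio (clr) transformation, one of the three transformations named in this abstract. The sketch below is a plain illustration and does not use fsdaR or the FSDA library; the function name `clr` is chosen here for convenience. Each part is divided by the geometric mean of the composition before taking logarithms, so rescaling the whole composition leaves the result unchanged.

```python
import math


def clr(composition):
    """Centered logratio: log of each part relative to the geometric mean."""
    logs = [math.log(x) for x in composition]  # parts must be strictly positive
    mean_log = sum(logs) / len(logs)           # log of the geometric mean
    return [value - mean_log for value in logs]


shares = [0.2, 0.3, 0.5]      # proportions summing to 1
percent = [20.0, 30.0, 50.0]  # the same composition expressed in percent

print([round(v, 4) for v in clr(shares)])
print([round(v, 4) for v in clr(percent)])
```

Both calls print the same coordinates, and the clr values of a composition always sum to zero. That zero-sum constraint makes the covariance matrix of clr data singular, which is one reason the choice of logratio transformation matters before distance-based outlier tools are applied.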


2019 ◽  
Vol 35 (22) ◽  
pp. 4764-4766 ◽  
Author(s):  
Jonathan Cairns ◽  
William R Orchard ◽  
Valeriya Malysheva ◽  
Mikhail Spivakov

Abstract Summary Capture Hi-C is a powerful approach for detecting chromosomal interactions involving, at least on one end, DNA regions of interest, such as gene promoters. We present Chicdiff, an R package for robust detection of differential interactions in Capture Hi-C data. Chicdiff enhances a state-of-the-art differential testing approach for count data with bespoke normalization and multiple testing procedures that account for specific statistical properties of Capture Hi-C. We validate Chicdiff on published Promoter Capture Hi-C data in human monocytes and CD4+ T cells, identifying multitudes of cell type-specific interactions, and confirming the overall positive association between promoter interactions and gene expression. Availability and implementation Chicdiff is implemented as an R package that is publicly available at https://github.com/RegulatoryGenomicsGroup/chicdiff. Supplementary information Supplementary data are available at Bioinformatics online.

