Data mining and machine learning methods for chromosome conformation data analysis

2019 ◽  
Author(s):  
Oluwatosin Oluwadare

Sixteen years after the sequencing of the human genome by the Human Genome Project (HGP), and 17 years after the introduction of Chromosome Conformation Capture (3C) technologies, three-dimensional (3D) inference and big data remain problematic in the field of genomics, and specifically in 3C data analysis. Three-dimensional inference involves the reconstruction of a genome's 3D structure or, in some cases, an ensemble of structures from contact interaction frequencies extracted from a variant of the 3C technology called Hi-C. Further questions remain about chromosome topology and structure; enhancer-promoter interactions; the location of genes, gene clusters, and transcription factors; the relationship between gene expression and epigenetics; and chromosome visualization at a higher scale, among others. In this dissertation, four major contributions are described: first, 3DMax, a tool for chromosome and genome 3D structure prediction from Hi-C data using an optimization algorithm; second, GSDB, a comprehensive and common repository that contains 3D structures for Hi-C datasets from novel 3D structure reconstruction tools developed over the years; third, ClusterTAD, a method for extracting topologically associating domains (TADs) from Hi-C data using an unsupervised learning algorithm; and finally, GenomeFlow, a comprehensive graphical tool to facilitate the entire process of modeling and analysis of 3D genome organization. It is worth noting that GenomeFlow and GSDB are the first of their kind in the 3D chromosome and genome research field. All the methods are freely available to the scientific community as software tools.
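As a rough illustration of what reconstruction from contact interaction frequencies involves, the sketch below converts a small contact matrix into target distances and scores a candidate structure against them. The inverse power-law conversion d = 1 / IF^alpha is a convention widely used in the Hi-C reconstruction literature; that 3DMax uses exactly this form, and the function names `contacts_to_distances` and `distance_error`, are assumptions made for illustration, not details taken from the abstract.

```python
import math

def contacts_to_distances(contacts, alpha=1.0):
    """Map interaction frequencies to target distances with d = 1 / IF**alpha.

    Assumed convention: pairs with zero contacts (and the diagonal) get no
    target distance and are stored as None.
    """
    n = len(contacts)
    dist = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and contacts[i][j] > 0:
                dist[i][j] = 1.0 / contacts[i][j] ** alpha
    return dist

def distance_error(coords, target):
    """Sum of squared differences between structure and target distances."""
    err = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            if target[i][j] is not None:
                d = math.dist(coords[i], coords[j])
                err += (d - target[i][j]) ** 2
    return err

# Toy 3-locus example (hypothetical numbers, not real Hi-C data).
contacts = [[0, 4, 1],
            [4, 0, 2],
            [1, 2, 0]]
target = contacts_to_distances(contacts, alpha=1.0)
candidate = [(0.0, 0.0, 0.0), (0.25, 0.0, 0.0), (0.75, 0.0, 0.0)]
print(distance_error(candidate, target))
```

An optimization-based method would adjust the coordinates to reduce such an error (or maximize a related likelihood); the sketch only evaluates one fixed candidate.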

2019 ◽  
Author(s):  
Oluwatosin Oluwadare ◽  
Max Highsmith ◽  
Jianlin Cheng

Advances in chromosome conformation capture (3C) technologies, such as the Hi-C technique, which captures chromosomal interactions at a genome-wide scale, have led to the development of methods that reconstruct three-dimensional (3D) chromosome and genome structures from Hi-C data. The 3D genome structure is important because it plays a role in a variety of biological activities such as DNA replication, gene regulation, genome interaction, and gene expression. In recent years, numerous Hi-C datasets have been generated, and likewise, a number of genome structure construction algorithms have been developed. However, until now, there has been no freely available repository for 3D chromosome structures. In this work, we outline the construction of a novel Genome Structure Database (GSDB), a comprehensive repository that contains 3D structures for Hi-C datasets constructed by a variety of 3D structure reconstruction tools. GSDB contains over 50,000 structures constructed by 12 state-of-the-art chromosome and genome structure prediction methods for publicly used Hi-C datasets with varying resolution. The database is useful for the community to study the function of the genome from a 3D perspective. GSDB is accessible at http://sysbio.rnet.missouri.edu/3dgenome/GSDB


2021 ◽  
Author(s):  
Van Hovenga ◽  
Oluwatosin Oluwadare ◽  
Jugal Kalita

Chromosome conformation capture (3C) is a method of measuring chromosome topology in terms of loci interaction. The Hi-C method is a derivative of 3C that allows for genome-wide quantification of chromosome interaction. From such interaction data, it is possible to infer the three-dimensional (3D) structure of the underlying chromosome. In this paper, we use a node embedding algorithm and a graph neural network to predict the 3D coordinates of each genomic locus from the corresponding Hi-C contact data. Unlike other chromosome structure prediction methods, our method can generalize a single model across Hi-C resolutions, multiple restriction enzymes, and multiple cell populations while maintaining reconstruction accuracy. We derive these results using three separate Hi-C data sets from the GM12878, GM06990, and K562 cell lines. We also compare the reconstruction accuracy of our method to four other existing methods and show that our method yields superior performance. Our algorithm outperforms the state-of-the-art methods in prediction accuracy and introduces a novel method for 3D structure prediction from Hi-C data.


Sequencing ◽  
2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Amitava Moulick ◽  
Debashis Mukhopadhyay ◽  
Shonima Talapatra ◽  
Nirmalya Ghoshal ◽  
Sarmistha Sen Raychaudhuri

Plantago ovata Forsk is a medicinally important plant. Metallothioneins are cysteine-rich proteins involved in the detoxification of heavy metals. Molecular cloning and modeling of MT from P. ovata have not yet been reported. The present investigation describes the isolation, structure prediction, characterization, and expression under copper stress of type 2 metallothionein (MT2) from this species. The gene encoding the protein comprises three exons and two introns. The deduced protein sequence contains 81 amino acids with a calculated molecular weight of about 8.1 kDa and a theoretical pI value of 4.77. The transcript level of this protein increased in response to copper stress. Homology modeling was used to construct a three-dimensional structure of P. ovata MT2. The 3D structure model of P. ovata MT2 will provide a significant clue for further structural and functional study of this protein.


2021 ◽  
Author(s):  
Rebeca San Martin ◽  
Priyojit Das ◽  
Renata Dos Reis Marques ◽  
Yang Xu ◽  
Rachel Patton McCord

Prostate cancer aggressiveness and metastatic potential are influenced by gene expression, genomic aberrations, and cellular morphology. These processes are in turn dependent in part on the 3D structure of chromosomes, packaged inside the nucleus. Using chromosome conformation capture (Hi-C), we conducted a systematic genome architecture comparison on a cohort of cell lines that model prostate cancer progression, ranging from normal epithelium to bone metastasis. Here, we describe how chromatin compartmentalization identity (A-open vs. B-closed) changes with progression: specifically, we find that 48 gene clusters switch from the B to the A compartment, including androgen receptor, WNT5A, and CDK14. These switches could prelude transcription activation and are accompanied by changes in the structure, size, and boundaries of the topologically associating domains (TADs). Further, compartmentalization changes in chromosome 21 are exacerbated with progression and may explain, in part, the genesis of the TMPRSS2-ERG translocation: one of the main drivers of prostate cancer. These results suggest that discrete, 3D genome structure changes play a deleterious role in prostate cancer progression.


2021 ◽  
Author(s):  
Marina A Pak ◽  
Karina A Markhieva ◽  
Mariia S Novikova ◽  
Dmitry S Petrov ◽  
Ilya S Vorobyev ◽  
...  

AlphaFold changed the field of structural biology by achieving three-dimensional (3D) structure prediction from protein sequence at experimental quality. The astounding success even led to claims that the protein folding problem is "solved". However, the protein folding problem is more than just structure prediction from sequence. Presently, it is unknown whether the AlphaFold-triggered revolution could help to solve other problems related to protein folding. Here we assay the ability of AlphaFold to predict the impact of single mutations on protein stability (ΔΔG) and function. To study the question, we extracted metrics from AlphaFold predictions before and after a single mutation in a protein and correlated the predicted change with the experimentally known ΔΔG values. Additionally, we correlated the AlphaFold predictions of the impact of a single mutation on structure with a large-scale dataset of single mutations in GFP with experimentally assayed levels of fluorescence. We found a very weak or no correlation between AlphaFold output metrics and change of protein stability or fluorescence. Our results imply that AlphaFold cannot be immediately applied to other problems or applications in protein folding.


Author(s):  
Badri Adhikari

Protein structure prediction continues to stand as an unsolved problem in bioinformatics and biomedicine. Deep learning algorithms and the availability of metagenomic sequences have led to the development of new approaches to predict inter-residue distances, the key intermediate step. Unlike the recently successful methods, which frame the problem as a multi-class classification problem, this article introduces a real-valued distance prediction method, REALDIST. Using a representative set of 43 thousand protein chains, a variant of deep ResNet is trained to predict real-valued distance maps. The contacts derived from the real-valued distance maps predicted by this method, on the most difficult CASP13 free-modeling protein datasets, demonstrate a long-range top-L precision of 52%, which is 17% higher than the top CASP13 predictor Raptor-X and slightly higher than the more recent trRosetta method. Similar improvements are observed on the CAMEO ‘hard’ and ‘very hard’ datasets. Three-dimensional (3D) structure prediction guided by real-valued distances reveals that for short proteins the mean accuracy of the 3D models is slightly higher than that of the top human predictor AlphaFold and the server predictor Quark in the CASP13 competition.
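For readers unfamiliar with the "long-range top-L precision" figure quoted above, the following minimal sketch shows one way to compute it from a predicted real-valued distance map. It assumes the common conventions that a contact is a residue pair closer than 8 Å, that long-range means a sequence separation of at least 24 residues, and that the L pairs with the smallest predicted distances are evaluated (L is the chain length). The function name and defaults are hypothetical and are not taken from the REALDIST code.

```python
def top_l_long_range_precision(pred_dist, true_dist, min_sep=24, threshold=8.0):
    """Fraction of the L most confident long-range pairs that are true contacts.

    pred_dist, true_dist: L x L nested lists of distances in angstroms.
    Confidence is taken to be the predicted distance itself (smaller = more
    confident), which is an assumption of this sketch.
    """
    length = len(pred_dist)
    # Collect every pair (i, j) separated by at least min_sep residues.
    pairs = [(pred_dist[i][j], i, j)
             for i in range(length)
             for j in range(i + min_sep, length)]
    pairs.sort()                  # smallest predicted distance first
    top = pairs[:length]          # keep at most L pairs
    if not top:
        return 0.0
    hits = sum(1 for _, i, j in top if true_dist[i][j] < threshold)
    return hits / len(top)
```

If a chain has fewer than L long-range pairs, the sketch scores all of them; evaluation pipelines differ on how they handle that case.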


2021 ◽  
Author(s):  
Michael Heinzinger ◽  
Maria Littmann ◽  
Ian Sillitoe ◽  
Nicola Bordin ◽  
Christine Orengo ◽  
...  

Thanks to the recent advances in protein three-dimensional (3D) structure prediction, in particular through AlphaFold 2 and RoseTTAFold, the abundance of protein 3D information will explode over the next year(s). Expert resources based on 3D structures such as SCOP and CATH have been organizing the complex sequence-structure-function relations into a hierarchical classification schema. Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), transferring annotations from a protein with experimentally known annotation to a query without annotation. Here, we presented a novel approach that expands the concept of HBI from a low-dimensional sequence-distance lookup to the level of a high-dimensional embedding-based annotation transfer (EAT). Secondly, we introduced a novel solution using single protein sequence representations from protein Language Models (pLMs), so-called embeddings (Prose, ESM-1b, ProtBERT, and ProtT5), as input to contrastive learning, by which a new set of embeddings was created that optimized constraints captured by hierarchical classifications of protein 3D structures. These new embeddings (dubbed ProtTucker) clearly improved what was historically referred to as threading or fold recognition. Thereby, the new embeddings enabled the intrusion into the midnight zone of protein comparisons, i.e., the region in which the level of pairwise sequence similarity is akin to random relations and therefore is hard to navigate by HBI methods. Cautious benchmarking showed that ProtTucker reached much further than advanced sequence comparisons without the need to compute alignments, allowing it to be orders of magnitude faster. Code is available at https://github.com/Rostlab/EAT .


2019 ◽  
Author(s):  
Max Highsmith ◽  
Oluwatosin Oluwadare ◽  
Jianlin Cheng

Motivation: The three-dimensional (3D) organization of an organism’s genome and chromosomes plays a significant role in many biological processes. Currently, methods exist for modeling chromosomal 3D structure using contact matrices generated via chromosome conformation capture (3C) techniques such as Hi-C. However, the effectiveness of these methods is inherently bottlenecked by the quality of the Hi-C data, which may be corrupted by experimental noise. Consequently, it is valuable to develop methods for eliminating the impact of noise on the quality of reconstructed structures. Results: We develop unsupervised and semi-supervised deep learning algorithms (i.e. deep convolutional autoencoders) to denoise Hi-C contact matrix data and improve the quality of chromosome structure predictions. When applied to noisy synthetic contact matrices of the yeast genome, our network demonstrates consistent improvement across metrics for contact matrix similarity including: Pearson Correlation, Spearman Correlation and Signal-to-Noise Ratio. Positive improvement across these metrics is seen consistently across a wide space of parameters to both Gaussian and Poisson noise.
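To make the similarity metrics concrete, here is a small standard-library sketch that perturbs a toy contact matrix with Gaussian noise and compares it with the clean matrix using Pearson and Spearman correlation over the upper triangle. The toy matrix, the noise level, and all function names are assumptions for illustration; this is not the authors' evaluation code, and the Signal-to-Noise Ratio is omitted because the abstract does not give its definition.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def upper_triangle(m):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def toy_contact_matrix(n):
    """Hypothetical contacts that decay with genomic separation."""
    return [[100.0 / (1 + abs(i - j)) for j in range(n)] for i in range(n)]

def add_gaussian_noise(m, sigma, rng):
    """Symmetric copy of m with Gaussian noise, clipped at zero."""
    n = len(m)
    noisy = [row[:] for row in m]
    for i in range(n):
        for j in range(i + 1, n):
            v = max(0.0, m[i][j] + rng.gauss(0.0, sigma))
            noisy[i][j] = noisy[j][i] = v
    return noisy

clean = toy_contact_matrix(20)
noisy = add_gaussian_noise(clean, sigma=5.0, rng=random.Random(0))
print(pearson(upper_triangle(clean), upper_triangle(noisy)))
print(spearman(upper_triangle(clean), upper_triangle(noisy)))
```

A denoiser would be judged by how much closer its output is to the clean matrix than the noisy input is, under the same metrics.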


2008 ◽  
Vol 06 (01) ◽  
pp. 183-201 ◽  
Author(s):  
YONGGANG LU ◽  
JING HE ◽  
CHARLIE E. M. STRAUSS

Cryoelectron microscopy (cryoEM) is an experimental technique to determine the three-dimensional (3D) structure of large protein complexes. Currently, this technique is able to generate protein density maps at 6–9 Å resolution, at which the skeleton of the structure (which is composed of α-helices and β-sheets) can be visualized. As a step towards predicting the entire backbone of the protein from the protein density map, we developed a method to predict the topology and sequence alignment for the skeleton helices. Our method combines the geometrical information of the skeleton helices with the Rosetta ab initio structure prediction method to derive a consensus topology and sequence alignment for the skeleton helices. We tested the method with 60 proteins. For 45 proteins, the majority of the skeleton helices were assigned a correct topology from one of our top ten predictions. The offsets of the alignment for most of the assigned helices were within ±2 amino acids in the sequence. We also analyzed the use of the skeleton helices as a clustering tool for the decoy structures generated by Rosetta. Our comparison suggests that the topology clustering is a better method than a general overlap clustering method to enrich the ranking of decoys, particularly when the decoy pool is small.

