Motto: Representing motifs in consensus sequences with minimum information loss

2019 ◽  
Author(s):  
Mengchi Wang ◽  
David Wang ◽  
Kai Zhang ◽  
Vu Ngo ◽  
Shicai Fan ◽  
...  

ABSTRACT Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, representing motifs by wildcard-style consensus sequences is compact and sufficient for interpreting the motif information and searching for motif matches. Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized alphabets. Here we show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves 0.81 area under the precision-recall curve, significantly (p-value < 0.01) outperforming all existing methods, including maximal positional weight, Douglas and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification. AVAILABILITY Motto is freely available at http://wanglab.ucsd.edu/star/motto.

Genetics ◽  
2020 ◽  
Vol 216 (2) ◽  
pp. 353-358
Author(s):  
Mengchi Wang ◽  
David Wang ◽  
Kai Zhang ◽  
Vu Ngo ◽  
Shicai Fan ◽  
...  

Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener’s method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification.
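The abstract does not spell out the optimization, so the following is only a minimal sketch of the general idea, not the Motto algorithm itself. It assumes a DNA alphabet, assumes each PWM column is a probability vector in A, C, G, T order, and assumes each candidate wildcard is modeled as a uniform distribution over its letters. For every column it tries all non-empty letter subsets by brute force and keeps the one with the smallest Jensen-Shannon divergence from the column. The names `jsd`, `best_subset`, and `to_consensus` are hypothetical.

```python
import math
import re
from itertools import combinations

ALPHABET = "ACGT"  # assumed column order of the PWM


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def best_subset(column):
    """Return the letters whose uniform distribution is closest to the column."""
    best = None
    for k in range(1, len(ALPHABET) + 1):
        for subset in combinations(range(len(ALPHABET)), k):
            q = [1.0 / k if i in subset else 0.0 for i in range(len(ALPHABET))]
            d = jsd(column, q)
            if best is None or d < best[0]:
                best = (d, subset)
    # List the most probable letter first, as in [GC][AT]GATAAG[GAC].
    ordered = sorted(best[1], key=lambda i: -column[i])
    return "".join(ALPHABET[i] for i in ordered)


def to_consensus(pwm):
    """Convert a list of probability columns to a wildcard-style consensus."""
    parts = []
    for column in pwm:
        letters = best_subset(column)
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "".join(parts)


# Toy PWM (illustrative numbers, not from the paper).
toy_pwm = [
    [0.05, 0.45, 0.45, 0.05],
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
]
consensus = to_consensus(toy_pwm)
# The bracket notation is also valid regular-expression syntax,
# so the consensus can be used directly to search a sequence.
matches = re.findall(consensus, "TTGATCAC")
```

Because the wildcard notation doubles as a regular expression, searching for motif matches needs no scoring threshold, which is one practical appeal of a consensus representation over PWM scanning.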


2021 ◽  
Vol 10 (1) ◽  
pp. 70
Author(s):  
Oladosu Oyebisi Oladimeji ◽  
Abimbola Oladimeji ◽  
Oladimeji Olayanju

Introduction: Hepatitis C is a chronic infection caused by the hepatitis C virus, a blood-borne virus. Infection therefore occurs through exposure to small quantities of blood. The World Health Organization (WHO) estimates that it has affected 71 million people worldwide. The infection imposes a heavy cost on individuals, groups, and governments because no vaccine is available yet. The disease is likely to continue affecting more people because its long asymptomatic phase makes early detection infeasible. Material and Methods: In this study, we present machine learning models that automatically classify hepatitis diagnostic test results and rank the test features by their contribution to the classification, which can support decision making in the health care industry. The synthetic minority oversampling technique (SMOTE) was used to address the problem of dataset imbalance. Results: The models were evaluated using the Matthews correlation coefficient, F-measure, Precision-Recall curve, and Receiver Operating Characteristic Area Under Curve. We found that using SMOTE helped raise the performance of the predictive models. Random forest (RF) had the best performance based on Matthews correlation coefficient (0.99), F-measure (0.99), Precision-Recall curve (1.00), and Receiver Operating Characteristic Area Under Curve (0.99). Conclusion: This finding has the potential to impact clinical practice when health workers aim to classify the diagnostic result of a disease at an early stage.
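The abstract names SMOTE without describing it. As a rough illustration of the core idea only, the sketch below creates each synthetic minority sample by interpolating between a randomly chosen minority point and one of its nearest minority-class neighbours. It is a simplified pure-Python version written for this listing, not the implementation the authors used; the function name `smote`, its parameters, and the toy data are all assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic samples from a list of minority-class points.

    Simplified sketch: requires at least two minority points and numeric
    features only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: squared_distance(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


# Toy minority class with three samples and two features.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority_points, n_new=5)
```

Every synthetic point lies on a line segment between two real minority samples, so the minority class is densified rather than merely duplicated.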


2013 ◽  
Vol 11 (01) ◽  
pp. 1340004 ◽  
Author(s):  
IVAN KULAKOVSKIY ◽  
VICTOR LEVITSKY ◽  
DMITRY OSHCHEPKOV ◽  
LEONID BRYZGALOV ◽  
ILYA VORONTSOV ◽  
...  

Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) has become a method of choice to locate DNA segments bound by different regulatory proteins. ChIP-Seq produces extremely valuable information to study transcriptional regulation. The wet-lab workflow is often supported by downstream computational analysis including construction of models of nucleotide sequences of transcription factor binding sites (TFBSs) in DNA, which can be used to detect binding sites in ChIP-Seq data at single base pair resolution. The most popular TFBS model is the positional weight matrix (PWM), with statistically independent positional weights of nucleotides in different columns; such PWMs are constructed from a gapless multiple local alignment of sequences containing experimentally identified TFBSs. Modern high-throughput techniques, including ChIP-Seq, provide enough data for careful training of advanced models containing more parameters than a PWM. Yet, many suggested multiparametric models provide only incremental improvement in TFBS recognition quality compared to traditional PWMs trained on ChIP-Seq data. We present a novel computational tool, diChIPMunk, that constructs TFBS models as optimal dinucleotide PWMs, thus accounting for correlations between neighboring nucleotides in input sequences. diChIPMunk utilizes many advantages of ChIPMunk, its ancestor algorithm, accounting for ChIP-Seq base coverage profiles ("peak shape") and using the effective subsampling-based core procedure, which allows processing of large datasets. We demonstrate that diPWMs constructed by diChIPMunk outperform traditional PWMs constructed by ChIPMunk from the same ChIP-Seq data. Software website: http://autosome.ru/dichipmunk/
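To make the difference between the two model types concrete, here is a small sketch of how a site is scored under each. It is a generic illustration, not diChIPMunk code: the weights are invented toy values, and `score_pwm` and `score_dipwm` are hypothetical names. A mononucleotide PWM adds one weight per position, while a dinucleotide PWM adds one weight per pair of adjacent positions, so a site of length L uses L - 1 dinucleotide columns.

```python
def score_pwm(pwm, site):
    """Sum independent per-position weights (one column per nucleotide)."""
    return sum(column[base] for column, base in zip(pwm, site))


def score_dipwm(dipwm, site, default=-1.0):
    """Sum weights of overlapping adjacent pairs (one column per dinucleotide).

    Pairs missing from a column receive the assumed default penalty.
    """
    pairs = [site[i:i + 2] for i in range(len(site) - 1)]
    return sum(column.get(pair, default) for column, pair in zip(dipwm, pairs))


# Toy case: suppose the factor binds only "AA" or "TT".
# A mononucleotide model sees A and T as equally good at both positions...
toy_pwm = [
    {"A": 0.5, "C": -1.0, "G": -1.0, "T": 0.5},
    {"A": 0.5, "C": -1.0, "G": -1.0, "T": 0.5},
]
# ...whereas a dinucleotide model can reward the two observed pairs only.
toy_dipwm = [{"AA": 1.0, "TT": 1.0}]

mono_bound = score_pwm(toy_pwm, "AA")
mono_unbound = score_pwm(toy_pwm, "AT")
di_bound = score_dipwm(toy_dipwm, "AA")
di_unbound = score_dipwm(toy_dipwm, "AT")
```

In this toy case the mononucleotide model cannot separate the bound site "AA" from the unbound "AT", while the dinucleotide model can, which is the kind of neighbor correlation the abstract refers to.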


Genome ◽  
1989 ◽  
Vol 31 (2) ◽  
pp. 503-509 ◽  
Author(s):  
Veronica C. Blasquez ◽  
Ann O. Sperry ◽  
Peter N. Cockerill ◽  
William T. Garrard

We have recently identified an evolutionarily conserved class of sequences that organize chromosomal loops in the interphase nucleus, which we have termed "matrix association regions" (MARs). MARs are about 200 bp long, AT-rich, contain topoisomerase II consensus sequences and other AT-rich sequence motifs, often reside near cis-acting regulatory sequences, and their binding sites are abundant (> 10 000 per mammalian nucleus). Here we demonstrate that the interactions between the mouse κ immunoglobulin gene MAR and topoisomerase II or the "nuclear matrix" occur between multiple and sometimes overlapping binding sites. Interestingly, the sites most susceptible to topoisomerase II cleavage are localized near the breakpoints of a previously described illegitimate recombination event. The presence of multiple binding sites within single MARs may allow DNA and RNA polymerase passage without disrupting primary loop organization. Key words: MARs, chromatin loops, topoisomerase II, nuclear matrix.


1988 ◽  
Vol 8 (6) ◽  
pp. 2275-2279 ◽  
Author(s):  
M E Cerdan ◽  
R S Zitomer

In Saccharomyces cerevisiae, the two genes, CYC1 and CYC7, that encode the isoforms of cytochrome c are expressed at different levels. Oxygen regulation is mediated by the expression of the CYP1 gene, and the CYP1 protein interacts with both CYC1 upstream activation sequence 1 (UAS1) and CYC7 UASo. In this study, the homology between the CYP1-binding sites of both genes was investigated. The most noticeable difference between the CYC1 and CYC7 UASs is the presence of GC base pairs at the same positions in a repeated sequence in CYC7 compared with CG base pairs in CYC1. Directed mutagenesis changing these GC residues to CG residues in CYC7 led to CYC1-like expression of CYC7 both in a CYP1 wild-type strain and in a strain carrying the semidominant mutation CYP1-16 which reverses the oxygen-dependent expression of the two genes. Our results strongly support the hypothesis that the CYP1-binding sites in CYC1 and CYC7 are related forms of the same sequence and that the CYP1-16 protein has altered specificity for the variant forms of the consensus sequences in both genes.


2021 ◽  
Vol 25 (1) ◽  
pp. 7-17
Author(s):  
A. V. Tsukanov ◽  
V. G. Levitsky ◽  
T. I. Merkulova

The most popular model for searching ChIP-seq data for transcription factor binding sites (TFBS) is the positional weight matrix (PWM). However, this model does not take into account dependencies between nucleotide occurrences at different site positions. Two recently proposed models, BaMM and InMoDe, do account for them. However, application of these models has usually been limited to comparing their recognition accuracies with that of PWMs, and no analysis of the co-prediction and relative positioning of hits of different models in peaks has yet been performed. To close this gap, we propose a pipeline called MultiDeNA. The pipeline includes stages of model training, assessment of recognition accuracy, scanning of ChIP-seq peaks, and classification of peaks based on scan results. We applied our pipeline to 22 ChIP-seq datasets of TF FOXA2 and considered PWM, dinucleotide PWM (diPWM), BaMM and InMoDe models. The combination of these four models allowed a significant increase, of 26.3 %, in the fraction of recognized peaks compared to that for the PWM model alone. The BaMM model provided the main contribution to the recognition of sites. Although the major fraction of predicted peaks contained TFBS of different models at coinciding positions, the medians of the fraction of peaks containing predictions of only a single model were 1.08, 0.49, 4.15 and 1.73 % for PWM, diPWM, BaMM and InMoDe, respectively. Thus, FOXA2 BSs were not fully described by any single model, which indicates their heterogeneity. We assume that the BaMM model is the most successful in describing the structure of the FOXA2 BS in the ChIP-seq datasets under study.


2021 ◽  
Author(s):  
Eunsaem Lee ◽  
Se Young Jung ◽  
Hyung Ju Hwang ◽  
Jaewoo Jung

BACKGROUND Nationwide population-based cohorts provide a new opportunity to build automated risk prediction models at the patient level, and claim data are one of the more useful resources to this end. To avoid unnecessary diagnostic intervention after cancer screening tests, patient-level prediction models should be developed. OBJECTIVE We aimed to develop cancer prediction models using nationwide claim databases with machine learning algorithms, which are explainable and easily applicable in real-world environments. METHODS As source data, we used the Korean National Insurance System Database. Every Korean aged ≥40 years undergoes a national health checkup every 2 years. We gathered all variables from the database including demographic information, basic laboratory values, anthropometric values, and previous medical history. We applied conventional logistic regression methods, light gradient boosting methods, neural networks, survival analysis, and one-class embedding classifier methods to effectively analyze high-dimensional data based on deep learning–based anomaly detection. Performance was measured with area under the curve and area under the precision-recall curve. We validated our models externally with a health checkup database from a tertiary hospital. RESULTS The one-class embedding classifier model received the highest area under the curve scores with values of 0.868, 0.849, 0.798, 0.746, 0.800, 0.749, and 0.790 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. For area under the precision-recall curve, the light gradient boosting models had the highest score with values of 0.383, 0.401, 0.387, 0.300, 0.385, 0.357, and 0.296 for liver, lung, colorectal, pancreatic, gastric, breast, and cervical cancers, respectively. CONCLUSIONS Our results show that it is possible to easily develop applicable cancer prediction models with nationwide claim data using machine learning. The 7 models showed acceptable performances and explainability, and thus can be distributed easily in real-world environments.
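Several abstracts in this listing report the area under the precision-recall curve. As a hedged illustration of what that number measures, the sketch below computes average precision, a common step-wise estimate of that area: the mean of the precision values at the rank of each true positive. The abstracts do not say which estimator the authors used, so this choice is an assumption, as are the function name `average_precision` and the toy labels and scores. Tied scores are not treated specially here.

```python
def average_precision(y_true, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    y_true: list of 0/1 labels; scores: list of predicted scores,
    where a higher score means "more likely positive".
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Toy example: positives are ranked 1st and 3rd out of 4.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike the area under the ROC curve, this value depends strongly on class prevalence, which is worth keeping in mind when reading precision-recall scores for rare outcomes such as individual cancers.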

