The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers

Entropy ◽  
2021 ◽  
Vol 23 (10) ◽  
pp. 1357
Author(s):  
Katrin Sophie Bohnsack ◽  
Marika Kaden ◽  
Julia Abel ◽  
Sascha Saralajew ◽  
Thomas Villmann

In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
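For orientation, the three entropies named in the abstract can be sketched for the symbol distribution of a single sequence. This is a minimal illustration of the textbook definitions only; it is not the authors' resolved mutual information function, and the function names (`symbol_probs`, `shannon_entropy`, `renyi_entropy`, `tsallis_entropy`) are hypothetical.

```python
import math
from collections import Counter


def symbol_probs(seq):
    """Relative frequency of each symbol in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return [c / n for c in counts.values()]


def shannon_entropy(p):
    """Shannon entropy in bits: -sum p_i * log2(p_i)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1 - alpha)


def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1): (1 - sum p_i^q) / (q - 1)."""
    if q == 1:
        raise ValueError("q must be different from 1")
    return (1 - sum(x ** q for x in p if x > 0)) / (q - 1)


if __name__ == "__main__":
    p = symbol_probs("ACGTACGT")  # toy sequence, uniform over four bases
    print(shannon_entropy(p), renyi_entropy(p, 2), tsallis_entropy(p, 2))
```

For a uniform distribution the Shannon and Rényi values coincide, while the Tsallis value differs because it is not logarithmic.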

Author(s):  
Krishna Prasad K ◽  
Aithal P. S. ◽  
Navin N. Bappalige ◽  
Soumya S

Purpose: Predicting and then preventing cardiac arrest in an ICU patient is highly challenging, even for the most skilled professional. The volume of data collected for a patient in the ICU is huge, and selecting the portion of it that matters for preventing cardiac arrest within a limited time is decisive; analysing and predicting from such large data requires an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict cardiovascular risks. Machine learning is a suitable choice for managing patients at risk of cardiac arrest. Methodology: In this work we collected and merged three data sets: the Cleveland dataset of US patients with 303 records, the Statlog dataset of UK patients with 270 records, and the Hungarian dataset of Hungary, Switzerland with 617 records. The merged data set is the most comprehensive, combining all three data sets with 11 common features and 1190 records. Findings/Results: The feature extraction phase extracts 7 features that contribute to the event. The extracted features are used to train the selected machine learning classifier models, and the results are then evaluated on test data. The Extra Tree Classifier has the highest average area under the curve (AUC), 0.957. Originality: The originality lies in analysing this combined dataset with machine learning classifier models, among which the Extra Tree Classifier achieves the highest average AUC of 0.957. Paper Type: Experimental Research Keywords: Cardiac, Machine Learning, Random Forest, XBOOST, ROC AUC, ST Slope.
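As a rough illustration of the evaluation metric reported above, the sketch below computes the area under the ROC curve from labels and scores using the pairwise-ranking formulation, and checks the record counts quoted in the abstract. The function name `roc_auc` and the toy scores are assumptions for illustration; the authors' actual pipeline and data are not reproduced here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Record counts as stated in the abstract; their sum matches the quoted total.
record_counts = {"Cleveland": 303, "Statlog": 270, "Hungarian": 617}
total_records = sum(record_counts.values())

if __name__ == "__main__":
    # Toy labels and scores, purely illustrative (not from the paper).
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
    print(total_records)
```

The quadratic pairwise loop is chosen for readability; a rank-based computation would scale better on larger test sets.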


2020 ◽  
Author(s):  
Marika Kaden ◽  
Katrin Sophie Bohnsack ◽  
Mirko Weber ◽  
Mateusz Kudła ◽  
Kaja Gutowska ◽  
...  

Abstract: We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations using phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. Those vector quantizers belong to the class of interpretable machine learning methods, which on the one hand provide additional knowledge about the classification decisions, such as discriminant feature correlations, and on the other hand can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e. the model refuses to make a classification decision if the model evidence for the presented data is insufficient. After training such a classifier on the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types and, on the other hand, the rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary: The currently emerging global disease COVID-19, caused by novel SARS-CoV-2 viruses, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus, and its types. Investigations of the virus sequence are of special interest. Frequently, those are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e. those that do not act as black boxes. To this end, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of sequences. For sequences without virus type information (second data set), our trained model can predict the virus type. In doing so, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.
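The reject option described above can be sketched as a nearest-prototype classifier that declines to decide when the two closest classes are almost equally near. This is a minimal sketch under stated assumptions: the abstract does not specify the reject rule, so the relative distance difference used here (the quantity that generalized learning vector quantization uses as its classifier function) and the `threshold` value are illustrative choices, and all names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify_with_reject(x, prototypes, threshold=0.2):
    """Return (label, certainty); label is None when the sample is rejected.

    prototypes: list of (vector, label) pairs, assumed to be already trained.
    certainty: (d_other - d_best) / (d_other + d_best), a value in [0, 1].
    """
    dists = sorted((sq_dist(x, w), label) for w, label in prototypes)
    d_best, best_label = dists[0]
    # Distance to the closest prototype that belongs to a different class.
    d_other = next((d for d, lab in dists if lab != best_label), None)
    if d_other is None:  # only one class present, nothing to compare against
        return best_label, 1.0
    denom = d_best + d_other
    certainty = (d_other - d_best) / denom if denom > 0 else 0.0
    if certainty < threshold:
        return None, certainty  # assumed rule: evidence too weak, reject
    return best_label, certainty


if __name__ == "__main__":
    # Toy prototypes in a 2-D feature space (illustrative only).
    protos = [([0.0, 0.0], "A"), ([1.0, 1.0], "B")]
    print(classify_with_reject([0.1, 0.0], protos))
    print(classify_with_reject([0.5, 0.5], protos))
```

Rejected samples are the ones that, in the setting described above, would be inspected as candidates for previously unseen virus types.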


1992 ◽  
Vol 02 (01) ◽  
pp. 137-154 ◽  
Author(s):  
Wentian Li

This paper aims at understanding the statistical features of nucleic acid sequences from the knowledge of the dynamical process that produces them. Two studies are carried out. First, the mutual information functions of the limiting sequences generated by simple sequence manipulation dynamics with replications and mutations are calculated numerically (sometimes analytically). It is shown that elongation and replication can easily produce long-range correlations. These long-range correlations can be destroyed to various degrees by mutation in different sequence manipulation models. Second, mutual information functions for several human nucleic acid sequences are determined. It is observed that intron sequences (noncoding sequences) tend to have longer correlation lengths than exon sequences (protein-coding sequences).
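A mutual information function of the kind discussed above measures how much knowing the symbol at one position tells about the symbol k positions further along. The sketch below estimates it from pair counts in a single sequence. It is an illustrative estimator, not the paper's code: the marginals are taken from the same pairs used for the joint counts (an assumption that keeps the estimate non-negative), and the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information_function(seq, k):
    """Estimate I(k) in bits between symbols that are k positions apart."""
    n = len(seq) - k  # number of symbol pairs at distance k
    if k < 1 or n < 1:
        raise ValueError("k must satisfy 1 <= k < len(seq)")
    pairs = Counter((seq[i], seq[i + k]) for i in range(n))
    left = Counter(seq[i] for i in range(n))        # first symbol of each pair
    right = Counter(seq[i + k] for i in range(n))   # second symbol of each pair
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        p_a = left[a] / n
        p_b = right[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


if __name__ == "__main__":
    seq = "ACACACACAC"  # toy periodic sequence, strongly correlated at every lag
    for k in range(1, 4):
        print(k, mutual_information_function(seq, k))
```

Evaluating the function over a range of k gives a profile in which slowly decaying values indicate long-range correlations.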


2011 ◽  
Vol 225-226 ◽  
pp. 601-604
Author(s):  
Gao Rong Zeng ◽  
Jian Ming Liu ◽  
Ai Wen Jiang

A mutual information function is defined as a criterion for measuring the robustness of a watermarking algorithm. For the quantization index modulation (QIM) scheme, the error probability of watermarking can be calculated to validate the mutual information measure. By means of numerical computation, the mutual information under Gaussian noise and uniform noise is calculated as the noise standard deviation changes. In the experiment, an audio section is selected as the host and its third-level wavelet detail coefficients are quantized according to the watermark bit series. Experimental results show that the statistical bit error rate (BER) agrees with the evaluation given by the mutual information method when the quantization step is small. The mutual information function can be selected as a cost function to evaluate the robustness of a watermarking algorithm and to predict the BER.
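The QIM scheme and the bit error rate mentioned above can be sketched with a scalar, two-lattice quantizer. This is a minimal sketch of textbook binary QIM, assuming a single step size `delta` and plain numeric host values; it does not reproduce the wavelet decomposition or the noise models of the paper, and all function names are hypothetical.

```python
import math


def _quantize(x, delta, offset):
    """Nearest point to x on the lattice {offset + m * delta}."""
    return delta * math.floor((x - offset) / delta + 0.5) + offset


def qim_embed(x, bit, delta):
    """Embed one bit by quantizing x onto the lattice selected by the bit."""
    return _quantize(x, delta, bit * delta / 2)


def qim_decode(y, delta):
    """Decode the bit whose lattice lies closest to the received value."""
    errors = [abs(y - _quantize(y, delta, b * delta / 2)) for b in (0, 1)]
    return 0 if errors[0] <= errors[1] else 1


def bit_error_rate(sent, received):
    """Fraction of positions where the decoded bit differs from the sent bit."""
    return sum(s != r for s, r in zip(sent, received)) / len(sent)


if __name__ == "__main__":
    host = [0.3, -1.7, 2.2, 5.9]  # toy host coefficients (illustrative)
    bits = [1, 0, 1, 0]
    delta = 1.0
    marked = [qim_embed(x, b, delta) for x, b in zip(host, bits)]
    # A perturbation smaller than delta / 4 cannot move a value to the other lattice.
    noisy = [m + 0.2 for m in marked]
    decoded = [qim_decode(y, delta) for y in noisy]
    print(bit_error_rate(bits, decoded))
```

In this scalar setting, decoding is error-free as long as the perturbation stays below a quarter of the step size, which is why the step size trades embedding distortion against robustness.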


Autism is a neuro-developmental disability that affects human communication and behaviour. It is a condition associated with a complex disorder of the brain, which can lead to significant changes in the social interaction and behaviour of a human being. Machine learning techniques are being applied to autism data sets to discover useful hidden patterns and to construct predictive models for detecting its risk. This paper focuses on finding the best machine learning classifier on the UCI autism disorder data set for identifying the main factors associated with autism. The results obtained using Multilayer Perceptron, Naive Bayes Classifier, and Bayesian Network were compared with the J48 decision tree algorithm. The superiority of the Multilayer Perceptron over the well-known classification algorithms in predicting autism risk is established in this paper.

