The Resolved Mutual Information Function as a Structural Fingerprint of Biomolecular Sequences for Interpretable Machine Learning Classifiers

Katrin Sophie Bohnsack; Marika Kaden; Julia Abel; Sascha Saralajew; Thomas Villmann

doi:10.3390/e23101357

Entropy ◽

10.3390/e23101357 ◽

2021 ◽

Vol 23 (10) ◽

pp. 1357

Author(s):

Katrin Sophie Bohnsack ◽

Marika Kaden ◽

Julia Abel ◽

Sascha Saralajew ◽

Thomas Villmann

Keyword(s):

Machine Learning ◽

Mutual Information ◽

Data Sets ◽

Sequence Classification ◽

Information Function ◽

Theoretical Justification ◽

Learning Classifier ◽

Interpretable Machine Learning ◽

Interpretable Models ◽

Mutual Information Function

In the present article we propose the application of variants of the mutual information function as characteristic fingerprints of biomolecular sequences for classification analysis. In particular, we consider the resolved mutual information functions based on Shannon-, Rényi-, and Tsallis-entropy. In combination with interpretable machine learning classifier models based on generalized learning vector quantization, a powerful methodology for sequence classification is achieved which allows substantial knowledge extraction in addition to the high classification ability due to the model-inherent robustness. Any potential (slightly) inferior performance of the used classifier is compensated by the additional knowledge provided by interpretable models. This knowledge may assist the user in the analysis and understanding of the used data and considered task. After theoretical justification of the concepts, we demonstrate the approach for various example data sets covering different areas in biomolecular sequence analysis.
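
As a rough illustration of the kind of fingerprint described in this abstract, the following minimal sketch computes a Shannon mutual information function for a symbol sequence, i.e. the mutual information between symbols that are k positions apart, for k = 1..max_lag. The function names `resolved_mif` and `mif` are hypothetical, and keeping the per-symbol-pair terms as the "resolved" view is an assumption made for illustration only; the article's exact definition, and its Rényi and Tsallis variants, are not reproduced here.

```python
from collections import Counter
from math import log2


def resolved_mif(seq, max_lag):
    """For each lag k = 1..max_lag, return a dict mapping a symbol pair (a, b)
    to its term p_ab(k) * log2(p_ab(k) / (p_a * p_b)).

    Assumption: the marginals p_a and p_b are taken from the same pair
    distribution (first and second element of each pair at lag k).
    """
    n = len(seq)
    per_lag = []
    for k in range(1, max_lag + 1):
        total = n - k                        # number of pairs at distance k
        pairs = Counter(zip(seq, seq[k:]))   # joint counts of (s_i, s_{i+k})
        left = Counter(seq[:n - k])          # marginal counts of s_i
        right = Counter(seq[k:])             # marginal counts of s_{i+k}
        terms = {}
        for (a, b), count in pairs.items():
            p_ab = count / total
            p_a = left[a] / total
            p_b = right[b] / total
            terms[(a, b)] = p_ab * log2(p_ab / (p_a * p_b))
        per_lag.append(terms)
    return per_lag


def mif(seq, max_lag):
    """Shannon mutual information function: sum of the pair terms per lag."""
    return [sum(terms.values()) for terms in resolved_mif(seq, max_lag)]


if __name__ == "__main__":
    sequence = "ACGT" * 50
    for lag, value in enumerate(mif(sequence, 4), start=1):
        print(lag, round(value, 4))
```

The resulting vector of values (one per lag, or one per lag and symbol pair) has a fixed length regardless of sequence length, which is what makes it usable as an alignment-free input to a vector-based classifier.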

Detection of Different Authorship of Text Sequences through Self-organizing Maps and Mutual Information Function

Advances in Soft Computing - Lecture Notes in Computer Science ◽

10.1007/978-3-642-16773-7_16 ◽

2010 ◽

pp. 186-195 ◽

Cited By ~ 1

Author(s):

Antonio Neme ◽

Blanca Lugo ◽

Alejandra Cervera

Keyword(s):

Mutual Information ◽

Information Function ◽

Self Organizing Maps ◽

Mutual Information Function ◽

Self Organizing

Performance analysis of the mutual information function for nonlinear and linear signal processing

1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258) ◽

10.1109/icassp.1999.756217 ◽

1999 ◽

Cited By ~ 4

Author(s):

H.-P. Bernhard ◽

G.A. Darbellay

Keyword(s):

Signal Processing ◽

Performance Analysis ◽

Mutual Information ◽

Information Function ◽

Linear Signal ◽

Mutual Information Function

An Integration of Cardiovascular Event Data and Machine Learning Models for Cardiac Arrest Predictions

International Journal of Health Sciences and Pharmacy ◽

10.47992/ijhsp.2581.6411.0061 ◽

2021 ◽

pp. 55-71

Author(s):

Krishna Prasad K ◽

Aithal P. S. ◽

Navin N. Bappalige ◽

Soumya S

Keyword(s):

Machine Learning ◽

Cardiac Arrest ◽

Area Under The Curve ◽

Computer Applications ◽

Data Sets ◽

Cardiovascular Risks ◽

Data Set ◽

Average Area ◽

Learning Classifier ◽

Tree Classifier

Purpose: Predicting and then preventing cardiac arrest in an ICU patient is highly challenging even for the most skilled professional. The data collected in the ICU for a patient are huge, and selecting the portion of the data relevant to preventing cardiac arrest within a short time is decisive; analysing such large data and predicting from it requires an effective system. An effective integration of computer applications and cardiovascular data is necessary to predict cardiovascular risks. Machine learning is a suitable technique for managing patients with cardiac arrest. Methodology: In this work we collected and merged three data sets: the Cleveland dataset of US patients with 303 records, the Statlog dataset of UK patients with 270 records, and the Hungarian dataset (Hungary, Switzerland) with 617 records. The combined data set comprises 1190 records with 11 common features. Findings/Results: The feature extraction phase extracts 7 features that contribute to the event. The extracted features are used to train the selected machine learning classifier models, and the results are then evaluated on test data. The Extra Tree Classifier has the highest average area under the curve (AUC), with a value of 0.957. Originality: The originality lies in analysing this combined dataset with machine learning classifier models, among which the Extra Tree Classifier reaches the highest average AUC of 0.957. Paper Type: Experimental Research. Keywords: Cardiac, Machine Learning, Random Forest, XBOOST, ROC AUC, ST Slope.

Analysis of SARS-CoV-2 RNA-Sequences by Interpretable Machine Learning Models

10.1101/2020.05.15.097741 ◽

2020 ◽

Author(s):

Marika Kaden ◽

Katrin Sophie Bohnsack ◽

Mirko Weber ◽

Mateusz Kudła ◽

Kaja Gutowska ◽

...

Keyword(s):

Machine Learning ◽

Virus Type ◽

Data Sets ◽

Data Set ◽

Machine Learning Methods ◽

Alignment Free ◽

Interpretable Machine Learning ◽

Vector Quantizers ◽

The One ◽

Viral Sequences

Abstract: We present an approach to investigate SARS-CoV-2 virus sequences based on alignment-free methods for RNA sequence comparison. In particular, we verify a given clustering result for the GISAID data set, which was obtained by analyzing the molecular differences in coronavirus populations with phylogenetic trees. For this purpose, we use alignment-free dissimilarity measures for sequences and combine them with learning vector quantization classifiers for virus type discriminant analysis and classification. Those vector quantizers belong to the class of interpretable machine learning methods, which, on the one hand, provide additional knowledge about the classification decisions, such as discriminant feature correlations, and, on the other hand, can be equipped with a reject option. This option gives the model the property of self-controlled evidence when applied to new data, i.e. the model refuses to make a classification decision if the model evidence for the presented data is not given. After training such a classifier on the GISAID data set, we apply the obtained classifier model to another, unlabeled SARS-CoV-2 virus data set. On the one hand, this allows us to assign new sequences to already known virus types and, on the other hand, the rejected sequences allow speculations about new virus types with respect to nucleotide base mutations in the viral sequences.

Author summary: The currently emerging global disease COVID-19, caused by novel SARS-CoV-2 viruses, requires all scientific effort to investigate the development of the viral epidemic, the properties of the virus and its types. Investigations of the virus sequence are of special interest. Frequently, those are based on mathematical/statistical analysis. However, machine learning methods represent a promising alternative if one focuses on interpretable models, i.e. those that do not act as black boxes. To this end, we apply variants of Learning Vector Quantizers to analyze the SARS-CoV-2 sequences. We encoded the sequences and compared them in their numerical representations to avoid the computationally costly comparison based on sequence alignments. Our resulting model is interpretable, robust, efficient, and has a self-controlling mechanism regarding its applicability to data. This framework was applied to two data sets concerning SARS-CoV-2. We were able to verify previously published virus type findings for one of the data sets by training our model to accurately identify the virus type of sequences. For sequences without virus type information (second data set), our trained model can predict the virus type. In doing so, we observe a new scattered spreading of the sequences in the data space, which is probably caused by mutations in the viral sequences.

Mutual Information Function in Respirocardial Coordinations of Healthy Human Neonates in Quiet and Active Sleep

Klinische Neurophysiologie ◽

10.1055/s-2004-831976 ◽

2004 ◽

Vol 35 (03) ◽

Author(s):

M Frasch ◽

U Zwiener ◽

D Hoyer ◽

M Eiselt

Keyword(s):

Mutual Information ◽

Information Function ◽

Active Sleep ◽

Healthy Human ◽

Human Neonates ◽

Mutual Information Function

Mutual Information Function Assesses Autonomic Information Flow of Heart Rate Dynamics at Different Time Scales

IEEE Transactions on Biomedical Engineering ◽

10.1109/tbme.2005.844023 ◽

2005 ◽

Vol 52 (4) ◽

pp. 584-592 ◽

Cited By ~ 65

Author(s):

D. Hoyer ◽

B. Pompe ◽

K.H. Chon ◽

H. Hardraht ◽

C. Wicher ◽

...

Keyword(s):

Heart Rate ◽

Mutual Information ◽

Time Scales ◽

Information Flow ◽

Information Function ◽

Heart Rate Dynamics ◽

Different Time Scales ◽

Mutual Information Function

GENERATING NONTRIVIAL LONG-RANGE CORRELATIONS AND 1/f SPECTRA BY REPLICATION AND MUTATION

International Journal of Bifurcation and Chaos ◽

10.1142/s0218127492000136 ◽

1992 ◽

Vol 02 (01) ◽

pp. 137-154 ◽

Cited By ~ 76

Author(s):

WENTIAN LI

Keyword(s):

Mutual Information ◽

Long Range ◽

Dynamical Process ◽

Information Function ◽

Protein Coding ◽

Correlation Lengths ◽

Noncoding Sequences ◽

Long Range Correlations ◽

Simple Sequence ◽

Mutual Information Function

This paper aims at understanding the statistical features of nucleic acid sequences from the knowledge of the dynamical process that produces them. Two studies are carried out. First, the mutual information functions of the limiting sequences generated by simple sequence manipulation dynamics with replications and mutations are calculated numerically (sometimes analytically). It is shown that elongation and replication can easily produce long-range correlations. These long-range correlations can be destroyed to various degrees by mutation in different sequence manipulation models. Second, mutual information functions for several human nucleic acid sequences are determined. It is observed that intron sequences (noncoding sequences) tend to have longer correlation lengths than exon sequences (protein-coding sequences).

Auto-Mutual Information Function for Predicting Pain Responses in EEG Signals during Sedation

IFMBE Proceedings - XIII Mediterranean Conference on Medical and Biological Engineering and Computing 2013 ◽

10.1007/978-3-319-00846-2_154 ◽

2014 ◽

pp. 623-626 ◽

Cited By ~ 1

Author(s):

U. Melia ◽

M. Vallverdú ◽

M. Jospin ◽

E. W. Jensen ◽

J. F. Valencia ◽

...

Keyword(s):

Mutual Information ◽

Information Function ◽

Eeg Signals ◽

Mutual Information Function ◽

Pain Responses

Robustness Analysis of QIM Watermarking against Additional Noise

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.225-226.601 ◽

2011 ◽

Vol 225-226 ◽

pp. 601-604

Author(s):

Gao Rong Zeng ◽

Jian Ming Liu ◽

Ai Wen Jiang

Keyword(s):

Mutual Information ◽

Bit Error Rate ◽

Gaussian Noise ◽

Error Probability ◽

Robustness Analysis ◽

Information Function ◽

Uniform Noise ◽

Additional Noise ◽

Information Method ◽

Mutual Information Function

A mutual information function is defined as a criterion for measuring the robustness of a watermarking algorithm. For the QIM scheme, the error probability of the watermark can be calculated to validate the mutual information measure. By means of numerical computation, the mutual information under Gaussian noise and uniform noise is calculated as the noise standard deviation changes. In the experiment, an audio section is selected as the host, and its third-level wavelet detail coefficients are quantized according to the watermark bit series. Experimental results show that the measured Bit Error Rate (BER) matches the conclusion of the mutual information method when the step is small. The mutual information function can thus be selected as a cost function to evaluate the robustness of a watermarking algorithm and to predict the BER.
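
The relationship between noise level, BER, and mutual information sketched in this abstract can be illustrated with a small simulation. The code below is a minimal sketch under stated assumptions, not the authors' experiment: it uses scalar QIM on random host values rather than wavelet coefficients of audio, Gaussian noise only, and it treats the bit channel as a binary symmetric channel with equiprobable bits so that the mutual information is 1 - H(BER). All function names and parameter values are hypothetical.

```python
import random
from math import log2


def qim_embed(x, bit, step):
    """Quantize x to the nearest point of the lattice assigned to `bit`
    (multiples of `step`, shifted by step/2 for bit 1)."""
    offset = bit * step / 2.0
    return round((x - offset) / step) * step + offset


def qim_decode(y, step):
    """Decode the bit whose lattice has the point closest to y."""
    d0 = abs(y - qim_embed(y, 0, step))
    d1 = abs(y - qim_embed(y, 1, step))
    return 0 if d0 <= d1 else 1


def bit_error_rate(sigma, step=1.0, n=20000, seed=0):
    """Empirical BER of scalar QIM under additive Gaussian noise."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n):
        bit = rng.randint(0, 1)
        host = rng.uniform(-10.0, 10.0)      # stand-in for a host coefficient
        received = qim_embed(host, bit, step) + rng.gauss(0.0, sigma)
        errors += qim_decode(received, step) != bit
    return errors / n


def bsc_mutual_information(p):
    """Mutual information (bits) of a binary symmetric channel with
    crossover probability p and equiprobable input bits: 1 - H(p)."""
    if p in (0.0, 1.0):
        return 1.0
    return 1.0 + p * log2(p) + (1.0 - p) * log2(1.0 - p)


if __name__ == "__main__":
    for sigma in (0.0, 0.1, 0.2, 0.5):
        ber = bit_error_rate(sigma)
        print(sigma, ber, round(bsc_mutual_information(ber), 4))
```

In this simplified setting the mutual information falls as the noise standard deviation grows relative to the quantization step, mirroring the rise in BER.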

Classification of Autism Spectrum Disorder Data using Machine Learning Techniques

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.f1114.0886s19 ◽

2019 ◽

Vol 8 (6S) ◽

pp. 565-569

Keyword(s):

Machine Learning ◽

Autism Spectrum ◽

Machine Learning Techniques ◽

Data Sets ◽

Human Communication ◽

Data Set ◽

Complex Disorder ◽

Learning Classifier ◽

Learning Techniques

Autism is a neuro-developmental disability that affects human communication and behaviour. It is a condition that is associated with the complex disorder of the brain which can lead to significant changes in social interaction and behaviour of a human being. Machine learning techniques are being applied to autism data sets to discover useful hidden patterns and to construct predictive models for detecting its risk. This paper focuses on finding the best machine learning classifier on the UCI autism disorder data set for identifying the main factors associated with autism. The results obtained using Multilayer Perceptron, Naive Bayes Classifier and Bayesian Network were compared with J48 Decision tree algorithm. The superiority of Multilayer Perceptron over the well known classification algorithms in predicting the autism risk is established in this paper.