Globally-Consistent Rule-Based Summary-Explanations for Machine Learning Models: Application to Credit-Risk Evaluation

Triage and diagnosis of COVID-19 from medical social media (Preprint)

10.2196/preprints.30397 ◽

2021 ◽

Author(s):

Abul Hasan ◽

Mark Levene ◽

David Weston ◽

Renate Fromson ◽

Nicolas Koslover ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Rule Based ◽

Additional Information ◽

Processing Pipeline ◽

Machine Learning Models

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19 when the models are trained on human-labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. We also highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
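The abstract above reports both macro- and micro-averaged F1 scores. As a minimal sketch of how the two averages differ for a single-label, multi-class task such as three-way triage, the following computes both from scratch. The function name `f1_scores` and the class labels in the usage note are hypothetical and are not taken from the paper.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 values, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, so every post counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro
```

For example, `f1_scores(["low", "low", "mid", "high"], ["low", "mid", "mid", "high"])` pools three correct predictions out of four for the micro average, while the macro average weights each of the three illustrative classes equally. The two diverge most when class sizes are imbalanced.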

Evaluation of rule-based, CountVectorizer, and Word2Vec machine learning models for tweet analysis to improve disaster relief

10.1109/ghtc53159.2021.9612486 ◽

2021 ◽

Author(s):

Radhika Goyal

Keyword(s):

Machine Learning ◽

Disaster Relief ◽

Learning Models ◽

Rule Based ◽

Machine Learning Models

Using Small Business Banking Data for Explainable Credit Risk Scoring

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i08.7055 ◽

2020 ◽

Vol 34 (08) ◽

pp. 13396-13401

Author(s):

Wei Wang ◽

Christopher Lesner ◽

Alexander Ran ◽

Marko Rukonic ◽

Jason Xue ◽

...

Keyword(s):

Machine Learning ◽

Small Business ◽

Credit Risk ◽

Risk Model ◽

Learning Models ◽

Machine Learning Model ◽

Risk Scoring ◽

Financial Transactions ◽

Credit Risk Model ◽

Machine Learning Models

Machine learning applied to financial transaction records can predict how likely a small business is to repay a loan. For this purpose, we compared a traditional scorecard credit risk model against various machine learning models and found that XGBoost with monotonic constraints outperformed the scorecard model by 7% in the K-S statistic. To deploy such a machine learning model in production for loan application risk scoring, it must comply with lending industry regulations that require lenders to provide understandable and specific reasons for credit decisions. Thus, we also developed a loan decision explanation technique based on the ideas of WoE and SHAP. Our research was carried out using a historical dataset of tens of thousands of loans and millions of associated financial transactions. The credit risk scoring model based on XGBoost with monotonic constraints and SHAP explanations described in this paper has been deployed by QuickBooks Capital to assess incoming loan applications since July 2019.
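The comparison above is stated in terms of the K-S (Kolmogorov-Smirnov) statistic. As a minimal sketch, assuming the common credit-scoring definition (the largest gap between the cumulative score distributions of the two outcome classes), it can be computed as follows. The function name `ks_statistic` and the 0/1 label encoding are assumptions made for illustration; the abstract does not describe how the authors computed it.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Largest gap between the empirical score CDFs of class 1 and class 0."""
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")

    best = 0.0
    for threshold in sorted(set(scores)):
        # Fraction of each class scoring at or below the threshold.
        cdf_pos = bisect_right(pos, threshold) / len(pos)
        cdf_neg = bisect_right(neg, threshold) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best
```

A value of 1.0 means some score threshold separates the two classes perfectly, and 0.0 means the two score distributions are indistinguishable.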

What can we learn from what a machine has learned? Interpreting credit risk machine learning models

The Journal of Risk Model Validation ◽

10.21314/jrmv.2020.235 ◽

2021 ◽

Author(s):

Nehalkumar Bharodia ◽

Wei Chen

Keyword(s):

Machine Learning ◽

Credit Risk ◽

Learning Models ◽

Machine Learning Models

Explainable Machine Learning Models of Consumer Credit Risk

SSRN Electronic Journal ◽

10.2139/ssrn.4006840 ◽

2022 ◽

Author(s):

Randall Davis ◽

Andrew W. Lo ◽

Sudhanshu Mishra ◽

Arash Nourian ◽

Manish Singh ◽

...

Keyword(s):

Machine Learning ◽

Credit Risk ◽

Consumer Credit ◽

Learning Models ◽

Machine Learning Models ◽

Consumer Credit Risk

Functional networks inference from rule-based machine learning models

BioData Mining ◽

10.1186/s13040-016-0106-4 ◽

2016 ◽

Vol 9 (1) ◽

Cited By ~ 4

Author(s):

Nicola Lazzarini ◽

Paweł Widera ◽

Stuart Williamson ◽

Rakesh Heer ◽

Natalio Krasnogor ◽

...

Keyword(s):

Machine Learning ◽

Functional Networks ◽

Learning Models ◽

Rule Based ◽

Machine Learning Models

Credit Risk Model Based on Central Bank Credit Registry Data

Journal of Risk and Financial Management ◽

10.3390/jrfm14030138 ◽

2021 ◽

Vol 14 (3) ◽

pp. 138

Author(s):

Fisnik Doko ◽

Slobodan Kalajdziski ◽

Igor Mishkovski

Keyword(s):

Machine Learning ◽

Random Forest ◽

Decision Tree ◽

Central Bank ◽

Credit Risk ◽

Commercial Banks ◽

Registry Data ◽

Learning Models ◽

Model Based ◽

Machine Learning Models

Data science and machine-learning techniques help banks to optimize enterprise operations, enhance risk analyses and gain competitive advantage. There is a vast amount of research in credit risk, but to our knowledge, none of it uses a credit registry as a data source to model the probability of default for individual clients. The goal of this paper is to evaluate different machine-learning models to create an accurate model for credit risk assessment using the data from the real credit registry dataset of the Central Bank of the Republic of North Macedonia. We strongly believe that the model developed in this research will be an additional source of valuable information to commercial banks, by leveraging historical data for the whole population of the country across all the commercial banks. Thus, in this research, we compare five machine-learning models to classify credit risk data, i.e., logistic regression, decision tree, random forest, support vector machines (SVM) and neural network. We evaluate the five models using different machine-learning metrics, and we propose a model based on credit registry data from the central bank, with a detailed methodology, that can predict credit risk based on the credit history of the population in the country. Our results show that the best accuracy is achieved by using decision tree performing on imbalanced data with and without scaling, followed by random forest and linear regression.

Advances in Electronic Phenotyping: From Rule-Based Definitions to Machine Learning Models

Annual Review of Biomedical Data Science ◽

10.1146/annurev-biodatasci-080917-013315 ◽

2018 ◽

Vol 1 (1) ◽

pp. 53-68 ◽

Cited By ~ 40

Author(s):

Juan M. Banda ◽

Martin Seneviratne ◽

Tina Hernandez-Boussard ◽

Nigam H. Shah

Keyword(s):

Machine Learning ◽

Clinical Decision ◽

Future Research ◽

Learning Models ◽

Rule Based ◽

Research Problems ◽

Fundamental Research ◽

Future Research Directions ◽

Effectiveness Studies ◽

Machine Learning Models

With the widespread adoption of electronic health records (EHRs), large repositories of structured and unstructured patient data are becoming available to conduct observational studies. Finding patients with specific conditions or outcomes, known as phenotyping, is one of the most fundamental research problems encountered when using these new EHR data. Phenotyping forms the basis of translational research, comparative effectiveness studies, clinical decision support, and population health analyses using routinely collected EHR data. We review the evolution of electronic phenotyping, from the early rule-based methods to the cutting edge of supervised and unsupervised machine learning models. We aim to cover the most influential papers in commensurate detail, with a focus on both methodology and implementation. Finally, future research directions are explored.

Raising the Flag: Monitoring User Perceived Disinformation on Reddit

Information ◽

10.3390/info12010004 ◽

2020 ◽

Vol 12 (1) ◽

pp. 4

Author(s):

Vlad Achimescu ◽

Pavel Dimitrov Chachev

Keyword(s):

Machine Learning ◽

Descriptive Analysis ◽

Learning Models ◽

Rule Based ◽

Part Of Speech ◽

News Websites ◽

Time Periods ◽

Internet Forums ◽

Rule Based Approach ◽

Machine Learning Models

The truth value of any new piece of information is not only investigated by media platforms, but also debated intensely on internet forums. Forum users are fighting back against misinformation by informally flagging suspicious posts as false or misleading in their comments. We propose extracting posts informally flagged by Reddit users as a means to narrow down the list of potential instances of disinformation. To identify these flags, we built a dictionary enhanced with part-of-speech tags and dependency parsing to filter out specific phrases. Our rule-based approach performs similarly to machine learning models, but offers more transparency and interactivity. Posts matched by our technique are presented in a publicly accessible, daily updated, and customizable dashboard. This paper offers a descriptive analysis of which topics, venues, and time periods were linked to perceived misinformation in the first half of 2020, and compares user-flagged sources with an external dataset of unreliable news websites. Using this method can help researchers understand how truth and falsehood are perceived in the subreddit communities, and identify new false narratives before they spread through the larger population.
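To illustrate the general idea of dictionary-based flag detection described above, here is a deliberately simplified sketch that uses plain regular expressions. It is not the authors' method: their dictionary is enhanced with part-of-speech tags and dependency parsing, which this sketch omits. The phrase list and the names `FLAG_PATTERNS` and `find_flags` are hypothetical.

```python
import re

# Hypothetical flag phrases; the paper's actual dictionary is not reproduced here.
FLAG_PATTERNS = [
    re.compile(r"\bfake news\b", re.IGNORECASE),
    re.compile(r"\bmisleading\b", re.IGNORECASE),
    re.compile(r"\bthis is (?:false|not true)\b", re.IGNORECASE),
]


def find_flags(comment):
    """Return the lowercased flag phrases matched in a comment, in pattern order."""
    return [
        match.group(0).lower()
        for pattern in FLAG_PATTERNS
        for match in pattern.finditer(comment)
    ]
```

Because every match traces back to a named pattern, a rule set like this is easy to inspect and edit, which is the kind of transparency the abstract attributes to rule-based approaches. Pure string matching cannot tell whether "misleading" refers to the post or to something else, which is the sort of ambiguity that syntactic information can help resolve.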

Editable machine learning models? A rule-based framework for user studies of explainability

Advances in Data Analysis and Classification ◽

10.1007/s11634-020-00419-2 ◽

2020 ◽

Author(s):

Stanislav Vojíř ◽

Tomáš Kliegr

Keyword(s):

Machine Learning ◽

User Studies ◽

Learning Models ◽

Rule Based ◽

Machine Learning Models
