Credit Scoring Using Machine Learning by Combing Social Network Information: Evidence from Peer-to-Peer Lending

Beibei Niu; Jinzheng Ren; Xiaotao Li

doi:10.3390/info10120397

Information ◽

10.3390/info10120397 ◽

2019 ◽

Vol 10 (12) ◽

pp. 397 ◽

Cited By ~ 1

Author(s):

Beibei Niu ◽

Jinzheng Ren ◽

Xiaotao Li

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Social Network ◽

Financial Institutions ◽

Learning Algorithm ◽

Credit Scoring ◽

Peer To Peer ◽

Machine Learning Algorithms ◽

Loan Default ◽

Network Information

Financial institutions use credit scoring to evaluate potential loan default risks. However, insufficient credit information limits the capacity of peer-to-peer (P2P) lending platforms to build effective credit scoring models. In recent years, many types of data have been used for credit scoring to compensate for the lack of credit history data. Whether social network information can strengthen financial institutions’ predictive power has received much attention in industry and academia. The aim of this study is to test the reliability of social network information in predicting loan default. We extract borrowers’ social network information from mobile phones and then use logistic regression to test the relationship between social network information and loan default. Three machine learning algorithms (random forest, AdaBoost, and LightGBM) were constructed to demonstrate the predictive performance of social network information. The logistic regression results show a statistically significant correlation between social network information and loan default. The machine learning results show that social network information can significantly improve loan default prediction performance. The experimental results suggest that social network information is valuable for credit scoring.
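
As an illustration of the logistic regression step described in the abstract above, the following is a minimal sketch that fits a default model by gradient descent and reads off the coefficient of a social network feature. The feature names, the toy data, and the learning-rate settings are all invented for illustration; the abstract does not give the study's variables, and this sketch does not compute statistical significance.

```python
import math

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, x):
    # weights[0] is the intercept; the rest align with the features in x.
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit logistic regression weights by batch gradient descent on log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict(weights, x) - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

# Invented toy data (hypothetical features):
# [credit_history_score, contact_count_scaled]; label 1 = default.
rows = [[0.2, 0.1], [0.4, 0.2], [0.3, 0.8], [0.6, 0.9],
        [0.8, 0.3], [0.7, 0.7], [0.1, 0.4], [0.9, 0.6]]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

weights = fit_logistic(rows, labels)
social_coefficient = weights[2]  # sign shows the direction of the association
```

In this toy data a negative `social_coefficient` means that borrowers with more contacts are predicted to default less often; a real analysis would also report standard errors and p-values for each coefficient.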

Implementation of Machine Learning Algorithms for Prediction of Fluidelastic Instability in Tube Arrays

Journal of Pressure Vessel Technology ◽

10.1115/1.4049876 ◽

2021 ◽

Vol 143 (2) ◽

Author(s):

Joaquin E. Moran ◽

Yasser Selima

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Two Phase ◽

Factors Affecting ◽

Logistic Regression Models ◽

Number Of Factors ◽

Tube Arrays ◽

Fluidelastic Instability

Fluidelastic instability (FEI) in tube arrays has been studied extensively, both experimentally and theoretically, for the last 50 years because of its potential to cause significant damage in short periods. Incidents similar to those observed at San Onofre Nuclear Generating Station indicate that the problem is not yet fully understood, probably due to the large number of factors affecting the phenomenon. In this study, a new approach for the analysis and interpretation of FEI data using machine learning (ML) algorithms is explored. FEI data for both single and two-phase flows have been collected from the literature and used to train a machine learning algorithm in order to either provide estimates of the reduced velocity (single and two-phase) or indicate whether the bundle is stable or unstable under certain conditions (two-phase). The analysis included the use of logistic regression as a classification algorithm for two-phase flow problems to determine whether specific conditions produce a stable or unstable response. The results of this study provide some insight into the capability and potential of logistic regression models to analyze FEI if appropriate quantities of experimental data are available.

An implementation of ensemble methods, logistic regression, and neural network for default prediction in Peer-to-Peer lending

Zbornik radova Ekonomskog fakulteta u Rijeci časopis za ekonomsku teoriju i praksu/Proceedings of Rijeka Faculty of Economics Journal of Economics and Business ◽

10.18045/zbefri.2021.1.163 ◽

2021 ◽

Vol 39 (1) ◽

pp. 163-197

Author(s):

Aneta Dzik-Walczak ◽

Mateusz Heba

Keyword(s):

Neural Network ◽

Neural Networks ◽

Logistic Regression ◽

Regression Model ◽

Financial Institutions ◽

Logistic Regression Model ◽

Credit Scoring ◽

Peer To Peer ◽

Ensemble Model ◽

Peer Lending

Credit scoring has become an important issue because competition among financial institutions is intense and even a small improvement in predictive accuracy can result in significant savings. Financial institutions are looking for optimal strategies using credit scoring models. Therefore, credit scoring tools are extensively studied. As a result, various parametric statistical methods, non-parametric statistical tools and soft computing approaches have been developed to improve the accuracy of credit scoring models. In this paper, different approaches are used to classify customers into those who repay the loan and those who default on a loan. The purpose of this study is to investigate the performance of two credit scoring techniques: a logistic regression model estimated on categorized variables transformed with WOE (Weight of Evidence), and neural networks. We also combine multiple classifiers and test whether ensemble learning performs better. To evaluate the feasibility and effectiveness of these methods, the analysis is performed on Lending Club data. In addition, we investigate peer-to-peer lending, also called social lending. From the results, it can be concluded that the logistic regression model can provide better performance than neural networks. The proposed ensemble model (a combination of logistic regression and neural network obtained by averaging the probabilities from both models) has a higher AUC, Gini coefficient and Kolmogorov-Smirnov statistic than the other models. Therefore, we can conclude that the ensemble model makes it possible to reduce the potential risk of losses due to misclassification costs.
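
The ensemble described in the abstract above (averaging the probabilities of two models) and the three reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions: the probability lists are invented toy values, not Lending Club results, and the Gini coefficient is taken as 2·AUC − 1, which is the usual credit-scoring convention.

```python
def split_by_label(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return pos, neg

def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos, neg = split_by_label(scores, labels)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(scores, labels):
    # Common credit-scoring convention: Gini = 2 * AUC - 1.
    return 2 * auc(scores, labels) - 1

def ks_statistic(scores, labels):
    """Largest gap between the cumulative score distributions of the two classes."""
    pos, neg = split_by_label(scores, labels)
    best = 0.0
    for t in sorted(set(scores)):
        cdf_pos = sum(s <= t for s in pos) / len(pos)
        cdf_neg = sum(s <= t for s in neg) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best

def average_ensemble(p_a, p_b):
    # Average the two models' predicted default probabilities per loan.
    return [(a + b) / 2 for a, b in zip(p_a, p_b)]

# Invented toy predictions; label 1 = default.
labels = [1, 1, 1, 0, 0, 0]
p_logit = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]   # hypothetical logistic regression output
p_nn = [0.7, 0.3, 0.8, 0.2, 0.6, 0.1]      # hypothetical neural network output
p_ens = average_ensemble(p_logit, p_nn)
```

In this toy example each single model misranks one pair of loans while the averaged scores rank every defaulter above every non-defaulter, which shows how averaging can cancel uncorrelated errors. It says nothing about the size of the effect on real data.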

Predicting the Grade of Prostate Cancer Based on a Biparametric MRI Radiomics Signature

Contrast Media & Molecular Imaging ◽

10.1155/2021/7830909 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Li Zhang ◽

Xia Zhe ◽

Min Tang ◽

Jing Zhang ◽

Jialiang Ren ◽

...

Keyword(s):

Prostate Cancer ◽

Machine Learning ◽

Logistic Regression ◽

Learning Algorithm ◽

Area Under The Curve ◽

Machine Learning Algorithms ◽

Support Vector ◽

Low Grade ◽

High Grade ◽

Random Forest Tree

Purpose. This study aimed to investigate the value of biparametric magnetic resonance imaging (bp-MRI)-based radiomics signatures for the preoperative prediction of prostate cancer (PCa) grade compared with visual assessments by radiologists based on the Prostate Imaging Reporting and Data System Version 2.1 (PI-RADS V2.1) scores of multiparametric MRI (mp-MRI). Methods. This retrospective study included 142 consecutive patients with histologically confirmed PCa who underwent mp-MRI before surgery. MRI images were scored and evaluated by two independent radiologists using PI-RADS V2.1. The radiomics workflow was divided into five steps: (a) image selection and segmentation, (b) feature extraction, (c) feature selection, (d) model establishment, and (e) model evaluation. Three machine learning algorithms (random forest tree (RF), logistic regression, and support vector machine (SVM)) were constructed to differentiate high-grade from low-grade PCa. Receiver operating characteristic (ROC) analysis was used to compare the machine learning-based analysis of bp-MRI radiomics models with PI-RADS V2.1. Results. In all, 8 stable radiomics features out of 804 extracted features based on T2-weighted imaging (T2WI) and ADC sequences were selected. Radiomics signatures successfully categorized high-grade and low-grade PCa cases (P < 0.05) in both the training and test datasets. The radiomics model-based RF method (area under the curve, AUC: 0.982; 0.918), logistic regression (AUC: 0.886; 0.886), and SVM (AUC: 0.943; 0.913) had better diagnostic performance in both the training and test cohorts than PI-RADS V2.1 (AUC: 0.767; 0.813) when predicting PCa grade. Conclusions. The results of this clinical study indicate that machine learning-based analysis of bp-MRI radiomic models may be helpful for distinguishing high-grade from low-grade PCa and outperformed the PI-RADS V2.1 scores based on mp-MRI. Among the machine learning algorithms, the RF model performed slightly better.

Machine learning algorithm to predict delirium from emergency department data

10.1101/2021.02.19.21251956 ◽

2021 ◽

Author(s):

Sangil Lee ◽

Brianna Mueller ◽

W. Nick Street ◽

Ryan M. Carnahan

Keyword(s):

Machine Learning ◽

Emergency Department ◽

Logistic Regression ◽

Random Forest ◽

Barthel Index ◽

Risk Model ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Sensitivity Threshold ◽

Mortality And Morbidity

Introduction. Delirium is a cerebral dysfunction seen commonly in the acute care setting. Delirium is associated with increased mortality and morbidity and is frequently missed in the emergency department (ED) by clinical gestalt alone. Identifying those at risk of delirium may help prioritize screening and interventions. Objective. Our objective was to identify clinically valuable predictive models for prevalent delirium within the first 24 hours of hospitalization based on the available data by assessing the performance of logistic regression and a variety of machine learning models. Methods. This was a retrospective cohort study to develop and validate a predictive risk model to detect delirium using patient data obtained around an ED encounter. Data from electronic health records for patients hospitalized from the ED between January 1, 2014, and December 31, 2019, were extracted. Eligible patients were aged 65 or older, admitted to an inpatient unit from the emergency department, and had at least one DOSS assessment or CAM-ICU recorded while hospitalized. The outcome measure of this study was delirium within one day of hospitalization determined by a positive DOSS or CAM assessment. We developed the model with and without the Barthel index for activity of daily living, since this was measured after hospital admission. Results. The area under the ROC curves for delirium ranged from .69 to .77 without the Barthel index. Random forest and gradient-boosted machine showed the highest AUC of .77. At the 90% sensitivity threshold, gradient-boosted machine, random forest, and logistic regression achieved a specificity of 35%. After the Barthel index was included, random forest, gradient-boosted machine, and logistic regression models demonstrated the best predictive ability with respective AUCs of .85 to .86. Conclusion. This study demonstrated the use of machine learning algorithms to identify the combination of variables that are predictive of delirium within 24 hours of hospitalization from the ED.
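
The "specificity at the 90% sensitivity threshold" reported in the abstract above can be computed from predicted risk scores as sketched below. This is a minimal illustration: the scores and labels are invented, and the rule used (take the best specificity among all cut-offs whose sensitivity is at least the target) is one reasonable reading, not necessarily the authors' exact procedure.

```python
def specificity_at_sensitivity(scores, labels, target=0.90):
    """Best specificity among cut-offs whose sensitivity is at least `target`.

    A patient is flagged positive when score >= cut-off; label 1 = delirium.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        if tp / n_pos >= target:
            best = max(best, tn / n_neg)
    return best

# Invented toy risk scores: ten patients with delirium, ten without.
pos_scores = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.1]
neg_scores = [0.05, 0.15, 0.2, 0.3, 0.45, 0.52, 0.58, 0.62, 0.72, 0.4]
scores = pos_scores + neg_scores
labels = [1] * len(pos_scores) + [0] * len(neg_scores)

spec = specificity_at_sensitivity(scores, labels)
```

Fixing sensitivity first reflects a screening setting in which missed cases are costly; the specificity then tells how many patients without delirium would be flagged unnecessarily at that operating point.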

Machine Learning Algorithm’s Measurement and Analytical Visualization of User’s Reviews for Google Play Store

10.20944/preprints202003.0249.v1 ◽

2020 ◽

Cited By ~ 1

Author(s):

Abdul Karim ◽

Azhari Azhari ◽

Samir Brahim Belhaouri ◽

Ali Adil Qureshi

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Naïve Bayes ◽

Android Apps ◽

Online Marketplace ◽

Document Frequency ◽

Logistic Regression Algorithm ◽

Google Play

It is clear that almost everybody around the world uses Android apps. Half of the population of this planet is associated with messaging, social media, gaming, and browsers. This online marketplace provides free and paid access to users. On the Google Play store, users are encouraged to download countless applications belonging to predefined categories. In this research paper, we scraped thousands of user reviews and app ratings, covering the reviews of 148 apps from 14 categories. We collected 506259 reviews from the Google Play store and then checked the semantics of users' reviews of some applications to determine whether the reviews are positive, negative, or neutral. We evaluated the results using different machine learning algorithms: Naïve Bayes, Random Forest, and Logistic Regression. We calculated Term Frequency (TF) and Inverse Document Frequency (IDF) and compared the statistical results of these algorithms on measures such as accuracy, precision, recall, and F1. We visualized these statistical results in the form of a bar chart. In this paper, each algorithm is analyzed in turn and the results are compared. We found that Logistic Regression is the best algorithm for review analysis across the Google Play store: it achieved the best precision, accuracy, recall, and F1 both on the data as collected and after preprocessing.
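
The TF and IDF features mentioned in the abstract above can be sketched in a few lines. This is a minimal illustration using the textbook definition (relative term frequency times the natural log of the inverse document frequency); the example reviews are invented, and libraries typically apply smoothing and normalization that this sketch omits.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Invented toy reviews, tokenized on whitespace.
reviews = ["great app love it", "app crashes a lot", "love the new update"]
docs = [review.split() for review in reviews]
weights = tf_idf(docs)
```

Terms that occur in many reviews (here "app") receive a lower weight than terms specific to one review (here "crashes"), which is what makes the resulting vectors useful as input to classifiers such as Naïve Bayes or Logistic Regression.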

Comparing Performance of Machine Learning Algorithms for Default Risk Prediction in Peer to Peer Lending

TEM Journal ◽

10.18421/tem101-16 ◽

2021 ◽

pp. 133-143

Author(s):

Yanka Aleksandrova

Keyword(s):

Machine Learning ◽

Default Risk ◽

Credit Scoring ◽

Learning Algorithms ◽

Peer To Peer ◽

Machine Learning Algorithms ◽

Ensemble Classifiers ◽

Peer Lending ◽

Official Site ◽

Heterogeneous Ensemble

The purpose of this research is to evaluate several popular machine learning algorithms for credit scoring for peer to peer lending. The dataset to fit the models is extracted from the official site of Lending Club. Several models have been implemented, including single classifiers (logistic regression, decision tree, multilayer perceptron), homogeneous ensembles (XGBoost, GBM, Random Forest) and heterogeneous ensemble classifiers like Stacked Ensembles. Results show that ensemble classifiers outperform single ones with Stacked Ensemble and XGBoost being the leaders.

Network context matters: graph convolutional network model over social networks improves the detection of unknown HIV infections among young men who have sex with men

Journal of the American Medical Informatics Association ◽

10.1093/jamia/ocz070 ◽

2019 ◽

Vol 26 (11) ◽

pp. 1263-1271 ◽

Cited By ~ 3

Author(s):

Yang Xiang ◽

Kayo Fujimoto ◽

John Schneider ◽

Yuxi Jia ◽

Degui Zhi ◽

...

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Social Network ◽

Random Forest ◽

Hiv Infections ◽

Hiv Status ◽

Learning Methods ◽

Network Information ◽

Machine Learning Methods ◽

Sex With Men

Objective. HIV infection risk can be estimated based on not only individual features but also social network information. However, there have been insufficient studies using machine learning methods that can maximize the utility of such information. Leveraging a state-of-the-art network topology modeling method, graph convolutional networks (GCN), our main objective was to include network information for the task of detecting previously unknown HIV infections. Materials and Methods. We used multiple social network data (peer referral, social, sex partners, and affiliation with social and health venues) that include 378 young men who had sex with men in Houston, TX, collected between 2014 and 2016. Due to the limited sample size, an ensemble approach was used that integrates GCN for modeling information flow with statistical machine learning methods, including random forest and logistic regression, to efficiently model sparse features in individual nodes. Results. Modeling network information using GCN effectively improved the prediction of HIV status in the social network. The ensemble approach achieved 96.6% accuracy and a 94.6% F1 measure, outperforming the baseline methods (GCN, logistic regression, and random forest: 79.0%, 90.5%, 94.4% accuracy, respectively; and 57.7%, 80.2%, 90.4% F1). In the networks with missing HIV status, the ensemble also produced promising results. Conclusion. Network context is a necessary component in modeling infectious disease transmissions such as HIV. GCN, when combined with traditional machine learning approaches, achieved promising performance in detecting previously unknown HIV infections, which may provide a useful tool for combatting the HIV epidemic.
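
To make the idea of "modeling information flow" with a GCN more concrete, the sketch below implements one graph-convolution layer in plain Python. The abstract does not give the layer definition, so this assumes the commonly used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 H W); the three-node graph, the feature matrix, and the weight matrix are invented.

```python
import math

def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric 0/1 adjacency matrix without self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weight)
    return [[max(0.0, v) for v in row] for row in propagated]

# Invented toy network: a path 0 - 1 - 2, one feature per node.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]   # only node 0 carries a signal
weight = [[1.0]]                   # identity-like weight for readability

hidden = gcn_layer(adj, features, weight)
```

After one layer the signal on node 0 has reached its neighbour (node 1) but not node 2, which is two hops away; stacking layers widens the neighbourhood each node draws information from.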

Analisis Klasifikasi Sentimen Terhadap Sekolah Daring pada Twitter Menggunakan Supervised Machine Learning [Sentiment Classification Analysis of Online School on Twitter Using Supervised Machine Learning]

Jurnal Teknik Informatika dan Sistem Informasi ◽

10.28932/jutisi.v7i1.3216 ◽

2021 ◽

Vol 7 (1) ◽

Author(s):

Ni Luh Putu Chandra Savitri ◽

Radya Amirur Rahman ◽

Reyhan Venyutzky ◽

Nur Aini Rakhmawati

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Teaching Method ◽

Learning Algorithm ◽

Confusion Matrix ◽

Machine Learning Algorithms ◽

Sentence Length ◽

Supervised Machine Learning ◽

Positive Side ◽

Online School

The Covid-19 pandemic has pushed countries to limit interaction among their people to reduce transmission. Indonesia requires people to carry out activities at home, one of which is online school. Many people share their thoughts on the social media platform Twitter. The authors therefore conducted sentiment analysis using supervised machine learning algorithms to determine the distribution of words used in commenting on online school, the relationship between sentence length and sentiment, and the algorithm that gives the most accurate results. In this study, the authors crawled data from Twitter with RapidMiner, then performed data cleansing and classification using the Random Forest Classifier, Logistic Regression, BernoulliNB, and SVC algorithms. The results were evaluated using a confusion matrix, accuracy rate, and classification report. The authors found positive, negative, and neutral sentiments expressed in comments about the implementation of online school. The three words most used to express positive sentiment are bahagia (happy), rajin (diligent), and senang (glad). For negative sentiment, the top three words are capek (tired), muak (fed up), and bosen (bored). For neutral sentiment, the top three words are tidur (sleep), capek (tired), and buka (open). Lengthy tweets usually carry negative remarks. Conversely, other tweets tend to be positive, and neutral tweets are usually stable in length. The authors conclude that the weaknesses of online school are the workload, which makes students tired, and ineffective teaching methods, which make it hard for students to understand the material given by the school. On the positive side, some people agree with the policies that have been implemented and feel they have benefited from them. Of the four supervised machine learning algorithms tested, Logistic Regression shows the highest accuracy, 0.87. The analysis shows that society tends to be neutral toward the implementation of online school.
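
The evaluation step named in the abstract above (confusion matrix and accuracy rate for three sentiment classes) can be sketched as follows. The label lists are invented for illustration and do not reproduce the study's data.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes, in `classes` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

def accuracy(matrix):
    # Correct predictions sit on the diagonal.
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

classes = ["positive", "negative", "neutral"]

# Invented toy labels for eight tweets.
actual = ["positive", "positive", "negative", "negative",
          "neutral", "neutral", "neutral", "negative"]
predicted = ["positive", "neutral", "negative", "negative",
             "neutral", "neutral", "positive", "negative"]

matrix = confusion_matrix(actual, predicted, classes)
acc = accuracy(matrix)
```

Reading the off-diagonal cells shows which sentiments are confused with each other (here, positive and neutral), which a single accuracy figure hides.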

Bots Recognition in Social Networks Using the Random Forest Algorithm

Mechanical Engineering and Computer Science ◽

10.24108/0419.0001473 ◽

2019 ◽

pp. 24-41

Author(s):

M. G. Khachatrian ◽

P. G. Klyucharev

Keyword(s):

Machine Learning ◽

Social Networks ◽

Social Network ◽

Random Forest ◽

Online Social Networks ◽

Learning Algorithm ◽

Online Social Network ◽

Machine Learning Algorithms ◽

Random Forest Algorithm ◽

Twitter Account

Online social networks are an essential communication tool for millions of people in their real-world lives. However, online social networks also serve as an arena of information war. One tool of infowar is bots, which are software designed to simulate real users’ behaviour in online social networks. The objective of the paper is to develop a model for recognizing bots in online social networks. To develop this model, the machine-learning algorithm “Random Forest” was used. Since machine-learning algorithms require as much data as possible, the Twitter online social network was used to solve the problem of bot recognition. This online social network is regularly used in studies on bot recognition. For training and testing the Random Forest algorithm, a Twitter account dataset was used, which comprised more than 3,000 users and more than 6,000 bots. During training and testing, the optimal hyper-parameters of the algorithm were determined, namely those at which the highest value of the F1 metric was reached. Python, which is frequently used for problems related to machine learning, was chosen as the implementation language. To compare the developed model with other authors’ models, testing was performed on two Twitter account datasets, each consisting of half bots and half real users. On these datasets, F1 metrics of 0.973 and 0.923 were obtained. These F1 values are quite high compared with those reported in other authors’ papers. As a result, the paper obtained a high-accuracy model that can recognize bots in the Twitter online social network.
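
The hyper-parameter selection described in the abstract above (keep the setting with the highest F1) can be sketched as an exhaustive grid search. The abstract does not say which hyper-parameters or search strategy the authors used, so the names `n_trees` and `max_depth`, the grid values, and the `toy_evaluate` function are hypothetical stand-ins; in practice `evaluate` would train a random forest and score it on held-out accounts.

```python
from itertools import product

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def grid_search(grid, evaluate):
    """Try every combination in `grid`; keep the one with the highest score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_evaluate(params):
    # Invented error counts standing in for "train a model, count errors on test accounts".
    fp = abs(params["n_trees"] - 200) // 10 + 5
    fn = abs(params["max_depth"] - 8) * 2 + 5
    return f1_score(tp=100, fp=fp, fn=fn)

grid = {"n_trees": [50, 200, 500], "max_depth": [4, 8, 16]}  # hypothetical grid
best_params, best_f1 = grid_search(grid, toy_evaluate)
```

Selecting on F1 rather than accuracy is a sensible choice when the classes are imbalanced, as in a dataset with roughly twice as many bots as real users.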

Predicting the risk of asthma attacks in children, adolescents and adults: protocol for a machine learning algorithm derived from a primary care-based retrospective cohort

BMJ Open ◽

10.1136/bmjopen-2019-036099 ◽

2020 ◽

Vol 10 (7) ◽

pp. e036099

Author(s):

Zain Hussain ◽

Syed Ahmar Shah ◽

Mome Mukherjee ◽

Aziz Sheikh

Keyword(s):

Machine Learning ◽

Primary Care ◽

Logistic Regression ◽

Learning Algorithm ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Research Database ◽

Data Ethics ◽

Adolescents And Adults ◽

And Performance

Introduction. Most asthma attacks and subsequent deaths are potentially preventable. We aim to develop a prognostic tool for identifying patients at high risk of asthma attacks in primary care by leveraging advances in machine learning. Methods and analysis. Current prognostic tools use logistic regression to develop a risk scoring model for asthma attacks. We propose to build on this by systematically applying various well-known machine learning techniques to a large longitudinal deidentified primary care database, the Optimum Patient Care Research Database, and comparatively evaluate their performance with the existing logistic regression model and against each other. Machine learning algorithms vary in their predictive abilities based on the dataset and the approach to analysis employed. We will undertake feature selection, classification (both one-class and two-class classifiers) and performance evaluation. Patients who have had actively treated clinician-diagnosed asthma, aged 8–80 years and with 3 years of continuous data, from 2016 to 2018, will be selected. Risk factors will be obtained from the first year, while the next 2 years will form the outcome period, in which the primary endpoint will be the occurrence of an asthma attack. Ethics and dissemination. We have obtained approval from OPCRD’s Anonymous Data Ethics Protocols and Transparency (ADEPT) Committee. We will seek ethics approval from The University of Edinburgh’s Research Ethics Group (UREG). We aim to present our findings at scientific conferences and in peer-reviewed journals.
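
The cohort design described in the protocol above (risk factors from the first year, outcome in the following two years) can be sketched as a simple windowing step. The record layout, the field names, and the choice of "prior attacks" as a risk factor are hypothetical; the protocol does not list its variables or the database schema.

```python
from datetime import date

BASELINE_START = date(2016, 1, 1)   # first year: risk factors
OUTCOME_START = date(2017, 1, 1)    # next two years: outcome period
OUTCOME_END = date(2018, 12, 31)

def build_example(patient):
    """Turn one patient's dated records into baseline features and an outcome label.

    Returns None for patients outside the 8-80 age range.
    """
    age = patient["age_at_baseline"]
    if not 8 <= age <= 80:
        return None
    attacks = patient["attack_dates"]
    prior_attacks = sum(1 for d in attacks if BASELINE_START <= d < OUTCOME_START)
    had_attack = any(OUTCOME_START <= d <= OUTCOME_END for d in attacks)
    return {
        "features": {"age": age, "prior_attacks": prior_attacks},
        "label": int(had_attack),
    }

# Invented toy patients (hypothetical record layout).
patients = [
    {"age_at_baseline": 34, "attack_dates": [date(2016, 5, 1), date(2018, 3, 2)]},
    {"age_at_baseline": 61, "attack_dates": [date(2016, 11, 20)]},
    {"age_at_baseline": 5, "attack_dates": [date(2017, 2, 14)]},
]
examples = [build_example(p) for p in patients]
```

Keeping the feature window strictly before the outcome window prevents information from the outcome period leaking into the predictors, which is the main reason for splitting the three years this way.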
