A New Framework of Feature Engineering for Machine Learning in Financial Fraud Detection

2020 ◽  
Author(s):  
Chie Ikeda ◽  
Karim Ouazzane ◽  
Qicheng Yu

Financial fraud activities have soared despite the advancement of fraud detection models empowered by machine learning (ML). To address this issue, we propose a new framework of feature engineering for ML models. The framework consists of feature creation that combines feature aggregation and feature transformation, and feature selection that accommodates a variety of ML algorithms. To illustrate the effectiveness of the framework, we conduct an experiment using an actual financial transaction dataset and show that the framework significantly improves the performance of ML fraud detection models. Specifically, all the ML models complemented by a feature set generated from our framework surpass the same models without such a feature set by nearly 40% on the F1-measure and 20% on the Area Under the Curve (AUC) value.
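The abstract names feature aggregation and feature transformation as the two parts of feature creation but does not give the concrete features. A minimal sketch of the idea, assuming hypothetical transaction records with `account` and `amount` fields and non-negative amounts (none of these names come from the paper):

```python
import math
from collections import defaultdict


def aggregate_and_transform(transactions):
    """Add per-account aggregates and a log-transformed amount to each row.

    `transactions` is a list of dicts with the hypothetical keys
    'account' and 'amount'; the real dataset schema is not given.
    """
    # Feature aggregation: total amount and transaction count per account.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in transactions:
        totals[t["account"]] += t["amount"]
        counts[t["account"]] += 1

    rows = []
    for t in transactions:
        mean_amount = totals[t["account"]] / counts[t["account"]]
        rows.append({
            "account": t["account"],
            "amount": t["amount"],
            "acct_txn_count": counts[t["account"]],
            "acct_mean_amount": mean_amount,
            # How unusual this amount is relative to the account's mean.
            "amount_to_mean": t["amount"] / mean_amount if mean_amount else 0.0,
            # Feature transformation: log1p compresses heavy-tailed amounts.
            "log_amount": math.log1p(t["amount"]),
        })
    return rows
```

Each output row keeps the raw amount next to the derived features, so a downstream feature selection step can decide which of them to retain.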

2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Nicholas Nuechterlein ◽  
Beibin Li ◽  
Abdullah Feroze ◽  
Eric C Holland ◽  
Linda Shapiro ◽  
...  

Background: Combined whole-exome sequencing (WES) and somatic copy number alteration (SCNA) information can separate isocitrate dehydrogenase (IDH)1/2-wildtype glioblastoma into two prognostic molecular subtypes, which cannot be distinguished by epigenetic or clinical features. The potential for radiographic features to discriminate between these molecular subtypes has yet to be established. Methods: Radiologic features (n = 35 340) were extracted from 46 multisequence, pre-operative magnetic resonance imaging (MRI) scans of IDH1/2-wildtype glioblastoma patients from The Cancer Imaging Archive (TCIA), all of whom have corresponding WES/SCNA data. We developed a novel feature selection method that leverages the structure of extracted MRI features to mitigate the dimensionality challenge posed by the disparity between the large number of features and the limited number of patients in our cohort. Six traditional machine learning classifiers were trained to distinguish molecular subtypes using our feature selection method, which was compared to least absolute shrinkage and selection operator (LASSO) feature selection, recursive feature elimination, and variance thresholding. Results: We were able to classify glioblastomas into two prognostic subgroups with a cross-validated area under the curve score of 0.80 (±0.03) using ridge logistic regression on the 15-dimensional principal component analysis (PCA) embedding of the features selected by our novel feature selection method. An interrogation of the selected features suggested that features describing contours in the T2 signal abnormality region on the T2-weighted fluid-attenuated inversion recovery (FLAIR) MRI sequence may best distinguish these two groups from one another. Conclusions: We successfully trained a machine learning model that allows for relevant targeted feature extraction from standard MRI to accurately predict molecularly defined, risk-stratifying IDH1/2-wildtype glioblastoma patient groups.
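Variance thresholding is one of the baselines the abstract compares against. A minimal sketch of that baseline only, assuming a row-major list of numeric feature vectors; it is not the authors' novel selection method, which the abstract does not describe in detail:

```python
from statistics import pvariance


def variance_threshold(X, threshold=0.0):
    """Return indices of the columns whose population variance exceeds `threshold`.

    `X` is a list of equally long rows (one row per patient, one column
    per feature). Constant columns are dropped with the default threshold.
    """
    n_features = len(X[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in X]
        if pvariance(column) > threshold:
            keep.append(j)
    return keep
```

With tens of thousands of features and 46 scans, a filter like this only removes uninformative columns; it does not use the subtype labels at all.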


Information ◽  
2022 ◽  
Vol 13 (1) ◽  
pp. 35
Author(s):  
Jibouni Ayoub ◽  
Dounia Lotfi ◽  
Ahmed Hammouch

The analysis of social networks has attracted a lot of attention during the last two decades. These networks are dynamic: new links appear and disappear. Link prediction is the problem of inferring links that will appear in the future from the current state of the network. We use information from nodes and edges and calculate the similarity between users. The more similar two users are, the higher the probability that they will connect in the future. Similarity metrics therefore play an important role in the link prediction field. Because such metrics are simple and flexible, many authors have proposed them, including Jaccard, AA, and Katz, and evaluated them using the area under the curve (AUC). In this paper, we propose a new parameterized method to enhance the AUC value of the link prediction metrics by combining them with the mean received resources (MRRs). Experiments show that the proposed method improves the performance of the state-of-the-art metrics. Moreover, we used machine learning algorithms to classify links and confirm the efficiency of the proposed combination.
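Two of the similarity metrics named above can be sketched for an undirected graph stored as a dict of neighbour sets. This assumes "AA" refers to the Adamic-Adar index; the proposed MRR combination is not reproduced here because the abstract does not define it.

```python
import math


def jaccard(adj, u, v):
    """Shared neighbours of u and v divided by all neighbours of either."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbours of u and v.

    Low-degree common neighbours count more than hubs. The degree check
    guards against log(1) == 0 in malformed adjacency data.
    """
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1
    )
```

A higher score for an unconnected pair is read as a higher chance that the link appears later, which is what the AUC evaluation then measures.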


Sensors ◽  
2020 ◽  
Vol 20 (16) ◽  
pp. 4575 ◽  
Author(s):  
Jihyun Lee ◽  
Jiyoung Woo ◽  
Ah Reum Kang ◽  
Young-Seob Jeong ◽  
Woohyun Jung ◽  
...  

Hypotensive events in the initial stage of anesthesia can cause serious, potentially fatal complications after surgery. In this study, we aimed to predict hypotension following tracheal intubation one minute in advance using machine learning and deep learning techniques. Meta learning models, such as random forest and extreme gradient boosting (Xgboost), and deep learning models, specifically the convolutional neural network (CNN) and the deep neural network (DNN), were trained to predict hypotension occurring between tracheal intubation and incision, using data from four minutes to one minute before tracheal intubation. Vital records and electronic health records (EHR) were collected for 282 of 319 patients who underwent laparoscopic cholecystectomy from October 2018 to July 2019. Among the 282 patients, 151 developed post-induction hypotension. Our experiments covered two scenarios: using raw vital records and applying feature engineering to vital records. The experiments on raw data showed that CNN had the best accuracy of 72.63%, followed by random forest (70.32%) and Xgboost (64.6%). The experiments on feature engineering showed that random forest combined with feature selection had the best accuracy of 74.89%, while CNN reached 68.95%, lower than its accuracy on raw data. Our study extends previous studies by detecting hypotension before intubation, one minute in advance. To improve accuracy, we built a model using state-of-the-art algorithms. We found that CNN performed well, but that random forest performed better when combined with feature selection. In addition, we found that the examination period (data period) is also important.
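The feature engineering scenario can be illustrated by summarising one vital sign over the window from four minutes to one minute before intubation. This is a sketch under assumptions: the summary statistics (mean, min, max, least-squares slope) and the function name are illustrative choices, not the features used in the study.

```python
from statistics import mean


def window_features(times, values, t_event, start=240.0, end=60.0):
    """Summarise samples recorded between `start` and `end` seconds before `t_event`.

    `times` are timestamps in seconds and `values` the matching readings
    (for example, blood pressure). Both names are hypothetical.
    """
    window = [
        (t, v) for t, v in zip(times, values)
        if t_event - start <= t <= t_event - end
    ]
    if len(window) < 2:
        raise ValueError("need at least two samples in the window")
    ts = [t for t, _ in window]
    vs = [v for _, v in window]
    t_bar, v_bar = mean(ts), mean(vs)
    denom = sum((t - t_bar) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    # Least-squares slope: the trend of the signal per second.
    slope = sum((t - t_bar) * (v - v_bar) for t, v in window) / denom
    return {"mean": v_bar, "min": min(vs), "max": max(vs), "slope": slope}
```

Changing `start` and `end` is a simple way to probe the abstract's remark that the data period matters.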


2019 ◽  
Vol 26 (3) ◽  
pp. 1810-1826 ◽  
Author(s):  
Behnaz Raef ◽  
Masoud Maleki ◽  
Reza Ferdousi

The aim of this study is to develop a computational prediction model for implantation outcome after an embryo transfer cycle. Information on 500 patients and 1360 transferred embryos, including cleavage and blastocyst stages and fresh or frozen embryos, was collected from April 2016 to February 2018. A dataset containing 82 attributes and a target label (indicating positive and negative implantation outcomes) was constructed. Six dominant machine learning approaches were examined based on their performance in predicting embryo transfer outcomes. Feature selection procedures were also used to identify effective predictive factors and to determine the optimum number of features based on classifier performance. The results revealed that random forest was the best classifier (accuracy = 90.40% and area under the curve = 93.74%) with optimum features based on a 10-fold cross-validation test. According to the Support Vector Machine-Feature Selection algorithm, the ideal number of features is 78. Follicle stimulating hormone/human menopausal gonadotropin dosage for ovarian stimulation was the most important predictive factor across all examined embryo transfer features. The proposed machine learning-based prediction model could predict embryo transfer outcome and implantation of embryos with high accuracy, before the start of an embryo transfer cycle.
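The 10-fold cross-validation test mentioned above splits the samples into ten disjoint folds and holds each one out in turn. A minimal sketch of the index bookkeeping, assuming a plain shuffled split (the abstract does not say whether the study stratified by outcome or grouped embryos by patient):

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Every sample is used for testing exactly once, so the reported accuracy is an average over ten held-out folds rather than a single split.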


Stroke ◽  
2020 ◽  
Vol 51 (Suppl_1) ◽  
Author(s):  
Masaki Ito ◽  
Satoshi Kuroda ◽  
Hidetsugu Asanoi ◽  
Taku Sugiyama ◽  
Takafumi Shindo ◽  
...  

Background: Outcomes of stroke with cancer-related coagulopathy (Trousseau syndrome) are predominantly attributed to cancer management; however, stroke management with anticoagulants can contribute to the best supportive care. We aimed to find predictors of the outcome by multivariate analysis, including machine-learning (ML) based feature engineering. Methods: A single-center retrospective study using a prospective cohort was conducted between April 2011 and June 2019. Out of the cumulative total of 110 acute ischemic stroke patients with malignancy, 65 were treated with anticoagulants, including warfarin (n=19), non-vitamin K dependent oral anticoagulants (NOAC, n=40), or subcutaneous heparin injections (n=6). Cancer-related coagulopathy was defined by elevated blood D-dimer levels at the onset of stroke with malignancy. The incidence of stroke recurrence was analyzed using 40 variables by logistic regression (LR) and in-house ML programs. Results: Out of 65 instances of cancer-related stroke, 12 (18.5%) stroke recurrences were observed during 455 ± 70 days (mean ± SEM). The stroke subtypes were cardioembolism (n=2), stroke with undetermined etiology (n=23), or other determined etiology (cancer-related coagulopathy, n=40). Multivariate LR revealed significant predictors of stroke recurrence, including NOAC usage and stroke subtype. In contrast, a combination of forward stepwise selection with Naïve Bayes (NB) or a support vector machine identified the blood D-dimer level as an additional important predictor. Adding the D-dimer level as an input alongside NOAC usage and stroke subtype yielded the best area under the curve (AUC) for both LR and NB, compared with adding warfarin or heparin usage. The AUC of LR on these 3 variables was better than that of NB. Conclusion: This study suggests that the incidence of stroke recurrence is high in this clinical situation. NOAC usage, stroke subtype, and blood D-dimer level at the onset of stroke have predictive value for the outcome.
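Forward stepwise selection, as used above with NB and the support vector machine, greedily adds the variable that most improves a score until nothing helps. A generic sketch, assuming the caller supplies a hypothetical `score` function (for example, a cross-validated AUC of the chosen classifier); the study's in-house programs are not described in the abstract:

```python
def forward_stepwise(features, score, max_features=None):
    """Greedy forward selection.

    `features` is a list of candidate names and `score(subset)` returns a
    number where higher is better. Returns (selected, best_score).
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Score every one-feature extension of the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no candidate improves the score, so stop
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

Because the search is greedy, the result depends on the scoring model, which is consistent with LR and NB highlighting different predictors.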


2020 ◽  
pp. 097215092092866
Author(s):  
Sonika Gupta ◽  
Sushil Kumar Mehta

The financial fraud detection problem involves the analysis of large financial datasets. The financial statement fraud detection process concentrates on two major aspects: first, identifying the financial variables and ratios, also termed features; second, applying data mining methods to classify organizations into two broad categories, fraudulent and non-fraudulent. If the input dataset contains a large number of irrelevant and correlated features, the computational load of the machine learning technique increases and the effectiveness of the classification outcomes decreases. The feature selection process selects a subset of the most significant attributes or variables that can represent the original data. This selected subset can help in learning the patterns in the data accurately and in much less time, in order to produce useful information for decision-making. This article briefly outlines the methods applied in prior studies for selecting features for financial statement fraud detection. It also presents an approach to feature selection using correlation-based filter selection methods, in which feature selection is performed based on an ensemble model, and tests the outcome of the approach by applying mean ratio analysis to financial data of Indian companies.
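The point about correlated features can be illustrated with a plain correlation filter that keeps a feature only if it is not strongly correlated with one already kept. This is a simplified sketch, not the ensemble-based approach the article proposes; the 0.9 threshold and the function names are assumptions.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(columns, threshold=0.9):
    """Return feature names to keep, scanning `columns` in insertion order.

    `columns` maps a feature name to its list of values.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

The filter is order-dependent: of two redundant ratios, the one listed first survives.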


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Magdalyn E. Elkin ◽  
Xingquan Zhu

In this study, we propose to use machine learning to understand terminated clinical trials. Our goal is to answer two fundamental questions: (1) what are the common factors/markers associated with terminated clinical trials? and (2) how can we accurately predict whether a clinical trial may be terminated? The answer to the first question provides effective ways to understand the characteristics of terminated trials so that stakeholders can better plan their trials; the answer to the second question can directly estimate the chance of success of a clinical trial in order to minimize costs. By using 311,260 trials to build a testbed with 68,999 samples, we use feature engineering to create 640 features, reflecting clinical trial administration, eligibility, study information, criteria, etc. Using feature ranking, a handful of features, such as trial eligibility, trial inclusion/exclusion criteria, and sponsor types, are found to be related to clinical trial termination. By using sampling and ensemble learning, we achieve over 67% Balanced Accuracy and over 0.73 AUC (Area Under the Curve) scores in predicting clinical trial termination, indicating that machine learning can help achieve satisfactory prediction results for clinical trial studies.
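Balanced accuracy, the metric reported above, is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when one class is much rarer than the other. A minimal sketch for binary labels, where 1 is assumed to mark the positive class:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2
```

In a toy case with two positives and eight negatives, a predictor that finds one positive and raises one false alarm has 80% accuracy but only 68.75% balanced accuracy.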


Author(s):  
Anita Ramachandran ◽  
Adarsh Ramesh ◽  
Aditya Sukhlecha ◽  
Avtansh Pandey ◽  
Anupama Karuppiah

The application of machine learning techniques to detect and classify falls is a prominent area of research in the domain of intelligent assisted living systems. Machine learning (ML) based solutions for fall detection systems built on wearable devices use various sources of information, such as inertial motion units (IMU), vital signs, and acoustic or channel state information parameters. Most existing research relies on only one of these sources; however, more experimentation is needed to observe the efficiency of ML classifiers when features from diverse sources are coupled. In addition, fall detection systems based on wearable devices require intelligent feature engineering and selection for dimensionality reduction, so as to reduce the computational complexity of the devices. In this paper, we present a comprehensive performance analysis of ML classifiers for fall detection on a dataset we collected. The analysis covers the impact of the following aspects on the performance of ML classifiers for fall detection: (i) using a combination of features from two sensors, an IMU sensor and a heart rate sensor, (ii) feature engineering and feature selection based on statistical methods, and (iii) using ensemble techniques for fall detection. We find that the inclusion of heart rate along with IMU sensor parameters improves the accuracy of fall detection. The conclusions from our experiments on feature selection and ensemble analysis can serve as inputs for researchers designing wearable device-based fall detection systems.
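Combining features from the two sensors can be sketched as building one feature vector per time window from accelerometer samples and heart rate readings. The particular statistics and the function name below are illustrative assumptions; the abstract does not list the features actually used.

```python
import math
from statistics import mean, pstdev


def fused_features(accel_xyz, heart_rate_bpm):
    """Build one feature vector from an IMU window and a heart rate window.

    `accel_xyz` is a list of (x, y, z) acceleration samples and
    `heart_rate_bpm` a list of heart rate readings for the same window.
    """
    # Orientation-independent magnitude of each acceleration sample.
    magnitudes = [math.sqrt(x * x + y * y + z * z) for x, y, z in accel_xyz]
    return [
        max(magnitudes),                              # impact peak
        pstdev(magnitudes),                           # movement variability
        mean(heart_rate_bpm),                         # average heart rate
        max(heart_rate_bpm) - min(heart_rate_bpm),    # heart rate range
    ]
```

A classifier trained on the first two entries alone versus all four gives a direct way to test whether heart rate adds information.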

