Current and Next Visit Prediction for Fatty Liver Disease with a Large-Scale Dataset (Preprint)
BACKGROUND Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation. Past research has shown that, if not well controlled, this inflammation may progress to liver fibrosis, cirrhosis, or even hepatocellular carcinoma.
OBJECTIVE We describe the construction of machine-learning models for two tasks: current-visit prediction (CVP), which can give physicians more information for accurate diagnosis, and next-visit prediction (NVP), which can help physicians advise potentially high-risk patients so as to prevent or delay health deterioration.
METHODS The large-scale, high-dimensional dataset used in this study comes from the MJ Health Research Foundation in Taipei. Our models use sequential forward selection (SFS) and one-pass ranking (OPR) for feature selection. For CVP, we explored multiple models: AdaBoost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian Naïve Bayes (GNB), C4.5 decision trees (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) as a sequence classifier with various input sets. Model performance is evaluated on two criteria: accuracy on the test set, and the intersection over union (IoU) and coverage between the features selected by OPR/SFS and those selected by domain experts.
RESULTS The dataset includes 34,856 unique visits by male patients and 31,394 unique visits by female patients from 2009 to 2016. The CVP test accuracies for AdaBoost, SVM, LR, RF, GNB, C4.5, and CART were 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%, respectively. The NVP test accuracies of LSTM with fixed and variable intervals were 78.20% and 76.79%, respectively. These two LSTM paradigms achieved error reductions of 39.29% and 41.21%, respectively, compared with a baseline model of simple induction.
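The error-reduction figures reported for NVP can be read as relative reductions in test error against the baseline. The sketch below shows that arithmetic under the assumption that "error reduction" means (baseline error − model error) / baseline error; the abstract does not state the formula, and the function name and the example accuracies are hypothetical, not values from the study.

```python
def relative_error_reduction(baseline_acc_pct, model_acc_pct):
    """Relative reduction in error rate, given two accuracies in percent.

    Assumed definition (not stated in the abstract):
        (baseline_error - model_error) / baseline_error
    """
    baseline_err = 100.0 - baseline_acc_pct
    model_err = 100.0 - model_acc_pct
    if baseline_err <= 0:
        raise ValueError("baseline must have a non-zero error rate")
    return (baseline_err - model_err) / baseline_err


# Hypothetical accuracies chosen only to illustrate the calculation.
reduction = relative_error_reduction(60.0, 80.0)
print(f"relative error reduction: {reduction:.2%}")
```

Under this reading, a model can show a large error reduction even when its absolute accuracy gain over the baseline is modest, because the reduction is measured against the baseline's remaining error.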
CONCLUSIONS This study explores a large FLD dataset with high dimensionality. We have developed models that can be used for both CVP and NVP of FLD. We have also implemented efficient feature selection schemes for CVP and NVP and compared the automatically selected features with expert-selected features.
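The comparison between automatically selected and expert-selected features can be illustrated with set operations. The following is a minimal sketch, assuming IoU is the size of the intersection divided by the size of the union of the two feature sets, and assuming coverage is the fraction of expert-selected features that the automatic method also selects. The abstract does not define coverage, so that definition, the function name, and the feature names are all hypothetical.

```python
def iou_and_coverage(selected, expert):
    """Compare an automatically selected feature set with an expert set.

    IoU      = |selected & expert| / |selected | expert|
    coverage = |selected & expert| / |expert|   (assumed definition)
    """
    selected, expert = set(selected), set(expert)
    union = selected | expert
    if not union:
        raise ValueError("both feature sets are empty")
    overlap = selected & expert
    iou = len(overlap) / len(union)
    coverage = len(overlap) / len(expert) if expert else 0.0
    return iou, coverage


# Hypothetical feature names, for illustration only.
auto_features = ["alt", "ast", "bmi", "age"]
expert_features = ["alt", "bmi", "triglycerides"]

iou, coverage = iou_and_coverage(auto_features, expert_features)
print(f"IoU = {iou:.2f}, coverage = {coverage:.2f}")
```

IoU penalizes extra automatically selected features that experts did not pick, whereas coverage as defined here only asks how many expert features were recovered, so the two measures can diverge when the automatic set is much larger than the expert set.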