Study of different data science methods for demand prediction and replenishment forecasting at retail network

11th International Scientific Conference “Business and Management 2020” ◽

10.3846/bm.2020.604 ◽

2020 ◽

Author(s):

Aleksei Iurasov ◽

Giedre Stanelyte

Keyword(s):

Fuzzy Logic ◽

Logistic Regression ◽

Random Forest ◽

Linear Regression ◽

Prediction Model ◽

Data Science ◽

Science Methods ◽

Demand Prediction ◽

One Step ◽

Additive Regression

Demand prediction is becoming an essential tool for retail businesses to remain in, or even lead, the competition. A well-built demand prediction model can help a retailer track inventory levels, orders, and sales in the most effective way and achieve the best results. However, there are many different methods for, and opinions on, creating a demand prediction model. In this paper, we analyse the most commonly used methods (Linear Regression, Logistic Regression, Probabilistic Neural Network, Bayesian Additive Regression Trees, Random Forest, and Fuzzy Logic) together with their specifications and limitations as reported in the literature. After the review, all methods are compared according to selected characteristics. Moreover, to obtain more practical results, the accuracy of the Logistic Regression and Random Forest methods is compared on milk sales data collected from a retail network. To construct a decision support system for a retail network, we need to go one step beyond demand prediction, to replenishment forecasting. We conclude that there is no single best method for forecasting replenishment and that results can differ depending on the data and conditions analysed. In every situation, authors seek to select the method with the highest accuracy and the lowest possible number of errors. Limitations of the research: a limited number of goods and stores were included in the modelling.

Risk Scoring System of Mortality and Prediction Model of Hospital Stay for Critically Ill Patients Receiving Parenteral Nutrition

Healthcare ◽

10.3390/healthcare9070853 ◽

2021 ◽

Vol 9 (7) ◽

pp. 853

Author(s):

Jee-Yun Kim ◽

Jeong Yee ◽

Tae-Im Park ◽

So-Youn Shin ◽

Man-Ho Ha ◽

...

Keyword(s):

Logistic Regression ◽

Regression Analysis ◽

Linear Regression ◽

Parenteral Nutrition ◽

Prediction Model ◽

Scoring System ◽

Prediction Equation ◽

Icu Patients ◽

Risk Scoring ◽

Risk Scoring System

Predicting the clinical progression of intensive care unit (ICU) patients is crucial for survival and prognosis. Therefore, this retrospective study aimed to develop a risk scoring system for mortality and a prediction model for ICU length of stay (LOS) among patients admitted to the ICU. Data were collected from ICU patients aged at least 18 years who received parenteral nutrition support for ≥50% of the daily calorie requirement from February 2014 to January 2018. In-hospital mortality and log-transformed LOS were analyzed by logistic regression and linear regression, respectively. To calculate risk scores, each coefficient was obtained from the regression model. Of 445 patients, 97 died in the ICU; the observed mortality rate was 21.8%. Using logistic regression analysis, APACHE II score (15–29: 1 point; 30 or higher: 2 points), qSOFA score ≥ 2 (2 points), serum albumin level < 3.4 g/dL (1 point), and infectious or respiratory disease (1 point) were incorporated into the risk scoring system for mortality; patients with 0, 1, 2–4, and 5–6 points had approximately 10%, 20%, 40%, and 65% risk of death, respectively. For LOS, linear regression analysis yielded the following prediction equation: log(LOS) = 0.01 × (APACHE II) + 0.04 × (total bilirubin) − 0.09 × (admission diagnosis of gastrointestinal disease or injury, poisoning, or other external cause) + 0.970. Our study provides a mortality risk score and an LOS prediction equation, which could help clinicians identify patients at risk and optimize ICU management.
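
The scoring rule and the LOS equation reported in this abstract can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function and parameter names are hypothetical, and the risk percentages are the approximate values quoted in the abstract. Because the abstract states neither the base of the logarithm nor the unit of total bilirubin, the sketch returns log(LOS) on the abstract's own scale and does not back-transform it.

```python
def mortality_risk_points(apache_ii, qsofa, albumin_g_dl, infectious_or_respiratory):
    """Sum the points of the mortality risk score described in the abstract."""
    points = 0
    # APACHE II: 15-29 gives 1 point, 30 or higher gives 2 points.
    if apache_ii >= 30:
        points += 2
    elif apache_ii >= 15:
        points += 1
    # qSOFA score of 2 or more gives 2 points.
    if qsofa >= 2:
        points += 2
    # Serum albumin below 3.4 g/dL gives 1 point.
    if albumin_g_dl < 3.4:
        points += 1
    # Infectious or respiratory disease gives 1 point.
    if infectious_or_respiratory:
        points += 1
    return points


def approximate_mortality_risk(points):
    """Map a score (0-6) to the approximate risk of death quoted in the abstract."""
    if not 0 <= points <= 6:
        raise ValueError("score must be between 0 and 6")
    if points == 0:
        return 0.10
    if points == 1:
        return 0.20
    if points <= 4:
        return 0.40
    return 0.65


def log_length_of_stay(apache_ii, total_bilirubin, gi_or_external_cause):
    """Evaluate the reported equation; the result is log(LOS), not LOS itself."""
    indicator = 1 if gi_or_external_cause else 0
    return 0.01 * apache_ii + 0.04 * total_bilirubin - 0.09 * indicator + 0.970


if __name__ == "__main__":
    score = mortality_risk_points(apache_ii=32, qsofa=2, albumin_g_dl=3.0,
                                  infectious_or_respiratory=True)
    print(score, approximate_mortality_risk(score))
    print(log_length_of_stay(apache_ii=20, total_bilirubin=1.0,
                             gi_or_external_cause=True))
```

The maximum possible score is 6 (2 + 2 + 1 + 1), which matches the upper band of 5–6 points given in the abstract.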

Data Science for Extubation Prediction and Value of Information in Surgical Intensive Care Unit

Journal of Clinical Medicine ◽

10.3390/jcm8101709 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1709 ◽

Cited By ~ 3

Author(s):

Tsung-Lun Tsai ◽

Min-Hsin Huang ◽

Chia-Yen Lee ◽

Wu-Wei Lai

Keyword(s):

Intensive Care Unit ◽

Logistic Regression ◽

Intensive Care ◽

Prediction Model ◽

Value Of Information ◽

Data Science ◽

Apache Ii ◽

Surgical Intensive Care Unit ◽

Surgical Intensive Care ◽

Science Framework

In addition to traditional indices such as biochemistry, arterial blood gas, the rapid shallow breathing index (RSBI), and the acute physiology and chronic health evaluation (APACHE) II score, this study proposes a data science framework for extubation prediction in the surgical intensive care unit (SICU) and investigates the value of the information the prediction model provides. The framework comprises variable selection (e.g., multivariate adaptive regression splines, stepwise logistic regression, and random forest), prediction models (e.g., support vector machine, boosting logistic regression, and backpropagation neural network (BPN)), and decision analysis (e.g., Bayesian method), and is used to identify the important variables and support the extubation decision. An empirical study at a leading hospital in Taiwan in 2015–2016 was conducted to validate the proposed framework. The results show that APACHE II and white blood cells (WBC) are the two most critical variables, followed in priority order by eye opening, heart rate, glucose, sodium, and hematocrit. BPN with the selected variables shows better prediction performance (sensitivity: 0.830; specificity: 0.890; accuracy: 0.860) than that with APACHE II or RSBI. The value of information is further investigated and shows that the expected value of experimentation (EVE), 0.652 days of patient stay in the ICU, is saved compared with current clinical experience. Furthermore, the maximal value of information occurs at a failure rate of around 7.1%, which reveals the “best applicable condition” of the proposed prediction model. The results validate the decision quality and the useful information provided by our prediction model.

FORECASTING PRICES IN THE RENTAL HOUSING MARKET WITH MACHINE LEARNING METHODS

Bulletin of V. N. Karazin Kharkiv National University Economic Series ◽

10.26565/2311-2379-2020-99-12 ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Random Forest ◽

Linear Regression ◽

Regression Models ◽

Data Science ◽

Polynomial Regression ◽

Short Term ◽

Learning Methods ◽

Machine Learning Methods ◽

Pricing Factors

This paper studies pricing factors in the short-term rental market. Airbnb, a platform for listing, searching for, and renting accommodation around the world, was chosen as the object of the study. At the beginning of 2021, the company offered 7 million homes from more than 220 countries. Data science methods play a significant role in the company's success, and one of its key algorithms is the pricing algorithm. Using the "Price Recommendations" feature, a homeowner can see which dates are most likely to be booked at the current price and which are not, which helps in forming a favorable offer. The system calculates the recommended price of housing based on hundreds of parameters; some are easy to recognize, but there are also less obvious factors that can affect demand. The paper proposes an algorithm for identifying implicit pricing factors in the short-term rental market using machine learning methods, which includes: 1) data mining and data preparation; 2) building and analysis of linear regression models; 3) building and analysis of nonlinear regression models. The study was based on listings from the Airbnb site for Washington and New York and used scripts developed in Python. The following models were built and analyzed: simple linear regression, multiple linear regression, polynomial regression, decision trees, random forest, and boosting. The results showed that the most important factors are accommodates, cleaning_fee, room_type, and bedrooms. However, based on the model evaluation criteria, the models cannot be used for implementation: the linear models are of low quality, while the random forest, boosting, and decision tree models are overfitted. Still, the results can be used in business analysis.

Comparison of Machine Learning With Logistic Regression for Prediction of Chronic Kidney Disease in the Thai Adult Population

Ramathibodi Medical Journal ◽

10.33165/rmj.2021.44.4.250334 ◽

2021 ◽

Vol 44 (4) ◽

pp. 1-12

Author(s):

Ratchainant Thammasudjarit ◽

Punnathorn Ingsathit ◽

Sigit Ari Saputro ◽

Atiporn Ingsathit ◽

Ammarin Thakkinstian

Keyword(s):

Neural Network ◽

Machine Learning ◽

Chronic Kidney Disease ◽

Logistic Regression ◽

Random Forest ◽

Kidney Disease ◽

Decision Tree ◽

Prediction Model ◽

Prediction Models ◽

The Neural Network

Background: Chronic kidney disease (CKD) consumes a large amount of resources for treatment. Early detection with a risk prediction model should be useful for identifying at-risk patients and providing early treatment. Objective: To compare the performance of traditional logistic regression with machine learning (ML) in predicting the risk of CKD in the Thai population. Methods: This study used Thai Screening and Early Evaluation of Kidney Disease (SEEK) data. Seventeen features were initially considered in constructing prediction models using logistic regression and 4 ML methods (Random Forest, Naïve Bayes, Decision Tree, and Neural Network). Data were split into train and test sets at a ratio of 70:30. Model performance was assessed by estimating recall, C statistics, accuracy, F1, and precision. Results: Seven of the 17 features were included in the prediction models. The logistic regression model discriminated CKD from non-CKD patients well, with C statistics of 0.79 and 0.78 in the train and test data, respectively. The Neural Network performed best among the ML methods, followed by Random Forest, Naïve Bayes, and Decision Tree, with corresponding C statistics of 0.82, 0.80, 0.78, and 0.77 in the training data set. The performance of these models in the test data decreased by about 5%, 3%, 1%, and 2%, respectively, compared with a decrease of about 2% for the logistic model. Conclusions: A CKD risk prediction model constructed with the logit equation may yield better discrimination and a lower tendency to overfit than ML models, including the Neural Network and Random Forest.
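
The evaluation protocol in this abstract (a 70:30 split, then discrimination measured by the C statistic) can be illustrated with a short standard-library sketch. It is not the authors' code: the function names and the toy scores are hypothetical, and the C statistic is computed here by direct pairwise comparison, which is equivalent to the area under the ROC curve for a binary outcome.

```python
import random


def split_70_30(rows, seed=0):
    """Shuffle a copy of the rows and split it into 70% train and 30% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def c_statistic(labels, scores):
    """Probability that a random positive case scores higher than a random
    negative case; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


if __name__ == "__main__":
    train, test = split_70_30(range(10))
    print(len(train), len(test))
    # Toy predicted risks for four hypothetical patients (1 = CKD, 0 = no CKD).
    print(c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A drop in the C statistic between the training and test sets, as reported above for the ML models, is the usual sign of overfitting.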

Telecom Churn Prediction Using Seven Machine Learning Experiments integrating Features engineering and Normalization

10.21203/rs.3.rs-239201/v1 ◽

2021 ◽

Author(s):

Hemlata Jain ◽

Ajay Khunteta ◽

Sumit Private Shrivastav

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Logistic Regression ◽

Deep Learning ◽

Random Forest ◽

Prediction Model ◽

Confusion Matrix ◽

Hybrid Models ◽

Churn Prediction ◽

Learning Technique

Machine learning and deep learning classification has become an important topic in the area of telecom churn prediction. Researchers have produced very efficient experiments for churn prediction and have given the telecommunication industry a new direction for retaining its customers. Companies are eagerly developing models for predicting churn and putting effort into retaining potential churners. Therefore, for a better churn prediction model, finding the factors of churn is very important. This study aims to find the factors of user churn by evaluating users' past service usage details. For this purpose, the study takes advantage of feature importance, feature normalisation, feature correlation, and feature extraction. After feature selection and extraction, the study performs seven different experiments on the dataset to bring out the best results and compares the techniques. The first experiment uses a hybrid model of Decision Tree and Logistic Regression; the second uses PCA with Logistic Regression and Logit Boost; the third uses a deep learning technique, CNN-VAE (Convolutional Neural Network with Variational Autoencoder); and the fourth, fifth, sixth, and seventh experiments use Logistic Regression, Logit Boost, XGBoost, and Random Forest, respectively. The first four experiments are hybrid models and the rest use standalone techniques. The Orange dataset, which has 3333 subscriber entries and 21 features, was used in this study. In addition, these experiments are compared with existing models from the literature. Performance was evaluated using Accuracy, Precision, Recall, F-measure, Confusion Matrix, Macro Average, and Weighted Average. This study obtained better results than the older models. Random Forest performed best in this study, achieving 95% accuracy, and all other experiments also produced very good results.
The study states the importance of data mining techniques for a churn prediction model and proposes a comparison framework in which standalone machine learning techniques, a deep learning technique, and hybrid models with feature extraction are applied to and compared on the same dataset, so that the techniques' performance can be evaluated better.
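
Two of the summary metrics named in this abstract, the macro average and the weighted average, differ only in how per-class scores are combined. The sketch below shows the difference for the F-measure on a toy imbalanced example; the function names and the labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class to its F1 score."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[c] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_and_weighted_f1(y_true, y_pred):
    """Macro average: plain mean over classes.
    Weighted average: mean weighted by each class's number of true samples."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    macro = sum(f1.values()) / len(f1)
    weighted = sum(f1[c] * support[c] for c in f1) / len(y_true)
    return macro, weighted


if __name__ == "__main__":
    # Toy churn labels: 0 = stays, 1 = churns (the minority class).
    y_true = [0, 0, 0, 0, 1, 1]
    y_pred = [0, 0, 0, 1, 1, 0]
    print(macro_and_weighted_f1(y_true, y_pred))
```

On imbalanced churn data the weighted average is pulled toward the majority (non-churn) class, so the macro average is often the more demanding of the two.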

Flood Severity assessment of the coastal tract situated between Muriganga and Saptamukhi estuaries of Sundarban delta of India using Frequency Ratio (FR), Fuzzy Logic (FL), Logistic Regression (LR) and Random Forest (RF) models

Regional Studies in Marine Science ◽

10.1016/j.rsma.2021.101624 ◽

2021 ◽

Vol 42 ◽

pp. 101624

Author(s):

Abhishek Ghosh ◽

Priyanka Dey

Keyword(s):

Fuzzy Logic ◽

Logistic Regression ◽

Random Forest ◽

Frequency Ratio ◽

Severity Assessment

A Prediction Model for Bacteremia and Transfer to Intensive Care in Pediatric and Adolescent Cancer Patients With Febrile Neutropenia

10.21203/rs.3.rs-1120441/v1 ◽

2021 ◽

Author(s):

Muayad Alali ◽

Anoop Mayampurath ◽

Yangyang Dai ◽

Allison H. Bartlett

Keyword(s):

Logistic Regression ◽

Intensive Care ◽

High Risk ◽

Random Forest ◽

Febrile Neutropenia ◽

Prediction Model ◽

Cancer Patients ◽

Low Risk ◽

Adolescent Cancer ◽

Time Of Presentation

Objectives: Febrile neutropenia (FN) is a common condition in children receiving chemotherapy. Our goal in this study was to develop a model for predicting blood stream infection (BSI) and transfer to intensive care (TIC) at the time of presentation in pediatric cancer patients with FN. Methods: We conducted an observational cohort analysis of pediatric and adolescent cancer patients younger than 24 years admitted for fever and chemotherapy-induced neutropenia over a 7-year period. We excluded stem cell transplant recipients who developed FN after transplant and febrile non-neutropenic episodes. The primary outcome was onset of BSI, as determined by a positive blood culture within 7 days of onset of FN. The secondary outcome was TIC within 14 days of FN onset. Predictor variables included demographic, clinical, and laboratory measures on initial presentation for FN. Data were divided into independent derivation (2009-2015) and prospective validation (2015-2016) cohorts. Prediction models were built for both outcomes using logistic regression and random forest and compared with the Hakim model. Performance was assessed using area under the receiver operating characteristic curve (AUC) metrics. Results: A total of 505 FN episodes (FNEs) were identified in 230 patients. BSI was diagnosed in 106 (21%) and TIC occurred in 56 (10.6%) episodes. The most common oncologic diagnosis with FN was acute lymphoblastic leukemia (ALL), and the highest rate of BSI was in patients with acute myeloid leukemia (AML). Patients who had BSI had higher maximum temperature, higher rates of prior BSI, and a higher incidence of hypotension than patients who did not have BSI. FN patients who were transferred to intensive care had higher temperature and a higher incidence of hypotension at presentation than FN patients who were not transferred. We compared 3 models: (1) random forest, (2) logistic regression, and (3) the Hakim model.
The AUCs for BSI prediction were 0.79, 0.65, and 0.64 (P < 0.05) for models 1, 2, and 3, respectively; for TIC prediction they were 0.88, 0.76, and 0.65 (P < 0.05), respectively. The random forest model demonstrated higher accuracy in predicting BSI and TIC and showed a negative predictive value (NPV) of 0.91 and 0.97 for BSI and TIC, respectively, at the best cutoff point as determined by Youden’s Index. Likelihood ratios (LRs) (post-test probability) for the RF model have potential utility for identifying patients at low risk for BSI and TIC (0.24 and 0.12, respectively) and patients at high risk (3.5 and 6.8, respectively). Conclusions: Our prediction model has good diagnostic performance in clinical practice for both BSI and TIC in FN patients at the time of presentation. The model can be used to identify a group of individuals at low risk for BSI who may benefit from early discharge and reduced length of stay; it can also identify FN patients at high risk of complications who might benefit from more intensive therapies at presentation.
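
For readers unfamiliar with the quantities reported here, the sketch below shows how a cutoff can be chosen by Youden's Index (sensitivity + specificity − 1) and how the NPV and likelihood ratios follow from the resulting confusion counts. It uses the standard textbook definitions on toy data; the function names and numbers are hypothetical and are not taken from the study.

```python
import math


def confusion_counts(labels, scores, cutoff):
    """Count TP, FP, TN, FN when scores >= cutoff are called positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= cutoff
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def diagnostic_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, Youden's J, NPV and both likelihood ratios."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
        "npv": tn / (tn + fn) if tn + fn else float("nan"),
        "lr_positive": sensitivity / (1 - specificity) if specificity < 1 else math.inf,
        "lr_negative": (1 - sensitivity) / specificity if specificity > 0 else math.inf,
    }


def best_cutoff_by_youden(labels, scores):
    """Try every observed score as a cutoff and keep the one maximising J."""
    if 0 not in labels or 1 not in labels:
        raise ValueError("both classes must be present")
    return max(
        sorted(set(scores)),
        key=lambda c: diagnostic_metrics(*confusion_counts(labels, scores, c))["youden"],
    )


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 0, 1, 1]
    scores = [0.1, 0.2, 0.3, 0.35, 0.6, 0.65, 0.8, 0.9]
    cutoff = best_cutoff_by_youden(labels, scores)
    print(cutoff, diagnostic_metrics(*confusion_counts(labels, scores, cutoff)))
```

A negative likelihood ratio well below 1 supports ruling out the outcome, and a positive likelihood ratio well above 1 supports ruling it in, which is how the low-risk and high-risk values in the abstract are read.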

Data Science bidang Pemasaran : Analisis Prilaku Pelanggan [Data Science in Marketing: Customer Behavior Analysis]

Data Sciences Indonesia (DSI) ◽

10.47709/dsi.v1i1.1194 ◽

2021 ◽

Vol 1 (1) ◽

pp. 21-32

Author(s):

Mawaddah Harahap ◽

Yusniar Lubis ◽

Zakarias Situmorang

Keyword(s):

Logistic Regression ◽

Random Forest ◽

Data Science ◽

Random Forest Classifier ◽

Digital Data

In digital marketing activities, data science (DS) plays an important role in understanding the performance of the marketing industry before digital marketing techniques are applied to product marketing. This is because each customer responds differently to each offer. Customer behavior also changes over time, because customers may have different needs in different situations. This paper focuses on presenting a business analysis that applies DS to explore behavior patterns and to predict how customers will respond to different offers. Exploratory data analysis is also applied to answer several business questions. The observations yield five customer groups, presented as visualizations, and the Random Forest Classifier model has the best prediction accuracy score, 91%, followed by the K Neighbors Classifier and Logistic Regression.

Pendekatan Data Science untuk Menemukan Churn Pelanggan pada Sector Perbankan dengan Machine Learning [A Data Science Approach to Detecting Customer Churn in the Banking Sector with Machine Learning]

Data Sciences Indonesia (DSI) ◽

10.47709/dsi.v1i1.1169 ◽

2021 ◽

Vol 1 (1) ◽

pp. 8-13

Author(s):

Amir Mahmud Husein ◽

Mawaddah Harahap

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Data Science ◽

Random Forest Classifier ◽

Random Tree ◽

Chi Square ◽

Tree Classifier

Customer churn is the phenomenon in which a company's customers stop buying or interacting, so it is very important for companies, especially banks, to predict the likelihood of customer churn; the results can be used to support customer retention and as part of company strategy. This paper presents an analysis and prediction of customer churn using five different models: KNeighbors Classifier, Logistic Regression, Linear SVC, Random Tree Classifier, and Random Forest Classifier. Based on the test results, the Random Forest Classifier and KNeighbors Classifier models perform better than the other models, with accuracies of 86% and 84%, respectively. Feature engineering with the ANOVA and Chi-Square approaches has a significant effect on improving the performance of the prediction models.

The Use of Fuzzy Estimators for the Construction of a Prediction Model Concerning an Environmental Ecosystem

Sustainability ◽

10.3390/su11185039 ◽

2019 ◽

Vol 11 (18) ◽

pp. 5039

Author(s):

Georgia Ellina ◽

Garyfalos Papaschinopoulos ◽

Basil Papadopoulos

Keyword(s):

Experimental Data ◽

Fuzzy Logic ◽

Linear Regression ◽

Prediction Model ◽

Shallow Lakes ◽

Water Body ◽

Fuzzy Numbers ◽

Fuzzy Linear Regression ◽

Fuzzy Implications ◽

Close Interval

As a variable system, the Lake of Kastoria is a good example of the pattern of Mediterranean shallow lakes. This study investigates the lake’s eutrophication, using fuzzy logic to analyze the relations among the basic factors that affect this phenomenon. Because a proposition can take values in the closed interval [0,1], many fuzzy implications can be used; in the method we suggest, we investigate the most appropriate implication for the studied water body. We propose a method for evaluating fuzzy implications by constructing triangular non-asymptotic fuzzy numbers for each of the studied parameters from experimental data. This is achieved with the use of fuzzy estimators and fuzzy linear regression. In this way, we achieve a better understanding of the mechanisms and functions that regulate this ecosystem.
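
As background for the terms used in this abstract, the sketch below shows a generic triangular membership function and three well-known fuzzy implications on [0,1] (Łukasiewicz, Kleene-Dienes, and Gödel). These are standard textbook definitions chosen for illustration only: the abstract does not say which implications were evaluated, and the authors' non-asymptotic fuzzy numbers are built from fuzzy estimators, which this sketch does not reproduce. All function names are hypothetical.

```python
def triangular_membership(x, left, peak, right):
    """Membership degree of x in a triangular fuzzy number (left, peak, right)."""
    if not left < peak < right:
        raise ValueError("expected left < peak < right")
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def lukasiewicz_implication(x, y):
    """I(x, y) = min(1, 1 - x + y)."""
    return min(1.0, 1.0 - x + y)


def kleene_dienes_implication(x, y):
    """I(x, y) = max(1 - x, y)."""
    return max(1.0 - x, y)


def godel_implication(x, y):
    """I(x, y) = 1 if x <= y, otherwise y."""
    return 1.0 if x <= y else y


if __name__ == "__main__":
    # Degrees of truth for two hypothetical propositions about a lake parameter.
    x = triangular_membership(1.5, left=1.0, peak=2.0, right=3.0)
    y = 0.3
    for implication in (lukasiewicz_implication, kleene_dienes_implication,
                        godel_implication):
        print(implication.__name__, implication(x, y))
```

Different implications can assign different truth values to the same pair of inputs, which is why selecting the most appropriate one for a given data set is a meaningful question.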
