Flight Delay Classification Prediction Based on Stacking Algorithm

2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Jia Yi ◽  
Honghai Zhang ◽  
Hao Liu ◽  
Gang Zhong ◽  
Guiyi Li

With the development of civil aviation, the number of flights keeps increasing, and flight delays have become a serious and increasingly routine problem. This paper aims to show that the Stacking algorithm has advantages in airport flight delay prediction, especially with respect to the algorithm selection problem in machine learning. The principle of the Stacking classification algorithm is introduced, the SMOTE algorithm is used to handle imbalanced datasets, and the Boruta algorithm is used for feature selection. The first-level learner of the Stacking model comprises five supervised machine learning algorithms: KNN, Random Forest, Logistic Regression, Decision Tree, and Gaussian Naive Bayes. The second-level learner is Logistic Regression. To verify the effectiveness of the proposed method, comparative experiments are carried out on Boston Logan International Airport flight datasets from January to December 2019. Multiple metrics are used to evaluate the prediction results comprehensively, including Accuracy, Precision, Recall, F1 Score, ROC curve, and AUC Score. The results show that the Stacking algorithm not only improves prediction accuracy but also remains highly stable.
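As an illustration of the SMOTE step named in this abstract, the following minimal Python sketch creates synthetic minority-class samples by interpolating between a sample and one of its nearest minority neighbours. The function name `smote`, the parameter values, and the toy flight features are assumptions made for illustration; this is not the authors' implementation.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic

# Hypothetical toy data: (departure hour, taxi-out minutes) of delayed flights.
delayed = [(8.0, 30.0), (9.0, 35.0), (18.0, 40.0), (19.0, 45.0)]
new_points = smote(delayed, n_new=6)
print(len(new_points), "synthetic samples created")
```

Because each synthetic point is an interpolation, it always falls inside the range spanned by the original minority samples. In a real pipeline, oversampling would normally be applied to the training split only.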

2021 ◽  
Vol 10 (1) ◽  
pp. 99
Author(s):  
Sajad Yousefi

Introduction: Heart disease is often associated with conditions such as arteries clogged by sediment accumulation, which cause chest pain and heart attack. Many people die of heart disease each year. Most countries have a shortage of cardiovascular specialists, and a significant percentage of cases are therefore misdiagnosed. Hence, predicting this disease is a serious issue. Using machine learning models applied to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction. Material and Methods: Several algorithms were used to predict heart disease, among which the supervised machine learning algorithms Decision Tree, Random Forest, and KNN are frequently mentioned. The algorithms are applied to a dataset of 294 samples taken from the UCI repository. The dataset includes heart disease features. To enhance algorithm performance, these features are analyzed, and feature importance scores and cross-validation are considered. Results: The algorithms were compared with one another on the basis of the ROC curve and criteria such as accuracy, precision, sensitivity, and F1 score, evaluated for each model. For the Decision Tree algorithm, accuracy and AUC ROC are 83% and 99%, respectively. The Logistic Regression algorithm, with accuracy and AUC ROC of 88% and 91%, respectively, performs better than the other algorithms. Therefore, these techniques can be useful for physicians in predicting heart disease in patients and prescribing for them correctly. Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and the evaluation criteria of several machine learning classification algorithms for heart disease prediction are compared to determine the most appropriate classifier. In this evaluation, the better performance was observed in the Decision Tree and Logistic Regression models.
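The evaluation criteria listed in this abstract (accuracy, precision, sensitivity, F1 score, and area under the ROC curve) can be computed as in the sketch below. The labels and scores are invented toy values, not data from the study, and the function names are assumptions chosen for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, sensitivity (recall), and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy example: 1 = disease present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics, auc)
```

Accuracy depends on the 0.5 decision threshold, while AUC depends only on how the scores rank the cases, which is why a model can lead on one measure and trail on the other.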


Author(s):  
Ni Luh Putu Chandra Savitri ◽  
Radya Amirur Rahman ◽  
Reyhan Venyutzky ◽  
Nur Aini Rakhmawati

The Covid-19 pandemic has urged countries to limit interaction among their people in order to reduce transmission. Indonesia requires people to carry out activities at home, one of which is online school. Many people share their thoughts through the social media platform Twitter. The authors therefore conducted sentiment analysis using supervised machine learning algorithms to determine the distribution of words used in commenting on online school, the relationship between sentence length and sentiment, and the algorithm that gives the most accurate results. In this study, the authors crawled data from Twitter with RapidMiner, then performed data cleansing and data processing with classification methods using the Random Forest Classifier, Logistic Regression, BernoulliNB, and SVC algorithms. The results were then evaluated using a confusion matrix, accuracy rate, and classification report. The authors found positive, negative, and neutral sentiments expressed in comments on the implementation of online school. The three words most used to express positive sentiment are bahagia, rajin, and senang. For negative sentiment, the top three words are capek, muak, and bosen. For neutral sentiment, the top three words are tidur, capek, and buka. Lengthy tweets usually carry negative remarks. Shorter tweets, on the other hand, tend to be positive, while neutral tweets are usually stable in length. The authors conclude that the weakness of online school is the workload, which tires students, together with ineffective teaching methods, which make it hard for students to understand the material given by the school. On the positive side, some people agree with the policies that have been implemented and feel that they have gained some benefit from them. Of the four supervised machine learning algorithms tested, Logistic Regression shows the highest accuracy, 0.87. The analysis shows that society tends to be neutral toward the implementation of online school.
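The per-sentiment word ranking described in this abstract can be sketched as a simple frequency count. The tweets below are invented placeholders that reuse words quoted in the abstract; the function name `top_words` is an assumption, and a real analysis would also need tokenisation and stop-word removal.

```python
from collections import Counter, defaultdict

def top_words(labelled_tweets, n=3):
    """Return the n most frequent words for each sentiment label."""
    counts = defaultdict(Counter)
    for text, sentiment in labelled_tweets:
        counts[sentiment].update(text.lower().split())
    return {label: [word for word, _ in counter.most_common(n)]
            for label, counter in counts.items()}

# Invented toy tweets paired with hand-assigned sentiment labels.
tweets = [
    ("senang senang rajin", "positive"),
    ("senang bahagia", "positive"),
    ("capek capek muak", "negative"),
    ("capek bosen", "negative"),
]

ranking = top_words(tweets, n=1)
print(ranking)
```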


Artificial intelligence is the technology that lets a machine mimic the thinking ability of a human being. Machine learning is a subset of AI that makes a machine exhibit human behavior by learning from known data, without being explicitly programmed. The health care sector has adopted this technology for developing medical procedures, maintaining large volumes of patient records, assisting physicians in the prediction, detection, and treatment of diseases, and more. In this paper, a comparative study of six supervised machine learning algorithms, namely Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), k-nearest neighbor (k-NN), and Naive Bayes (NB), is made for the classification and prediction of diseases. The results show that, among the compared supervised learning algorithms, Logistic Regression performs best with an accuracy of 81.4%, and k-NN performs worst with an accuracy of 69.01%.


2021 ◽  
pp. 1-11
Author(s):  
Daniel A. Harris ◽  
Kyla L. Pyndiura ◽  
Shelby L. Sturrock ◽  
Rebecca A.G. Christensen

Money laundering is a pervasive legal and economic problem that hides criminal activity. Identifying money laundering is a priority for both banks and governments; thus, machine learning algorithms have emerged as a possible strategy to detect suspicious financial activity within financial institutions. We used traditional regression and supervised machine learning techniques to identify bank customers at an increased risk of committing money laundering. Specifically, we assessed whether model performance differed across varying operationalizations of the outcome (e.g., multinomial vs. binary classification) and determined whether the inclusion of investigator-derived novel features (e.g., averages across existing features) could improve model performance. We received two proprietary datasets from Scotiabank, a large bank headquartered in Canada. The datasets included customer account information (N = 4,469) and customers’ monthly transaction histories (N = 2,827) from April 15, 2019 to April 15, 2020. We implemented traditional logistic regression, logistic regression with LASSO regularization (LASSO), K-nearest neighbours (KNN), and extreme gradient boosted models (XGBoost). Results indicated that traditional logistic regression with a binary outcome, conducted with investigator-derived novel features, performed the best with an F1 score of 0.79 and accuracy of 0.72. Models with a binary outcome had higher accuracy than the multinomial models, but the F1 scores yielded mixed results. For KNN and XGBoost, we observed little change or worsening performance after the introduction of the investigator-derived novel features. However, the investigator-derived novel features improved model performance for LASSO and traditional logistic regression.
Our findings demonstrate that investigators should consider different operationalizations of the outcome, where possible, and include novel features derived from existing features to potentially improve the detection of customers at risk of committing money laundering.
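The two design choices this abstract examines, deriving a new feature as an average across existing ones and recoding a multinomial outcome as binary, can be sketched as follows. The field names, risk labels, and values are hypothetical, since the study's proprietary data schema is not described here.

```python
from statistics import mean

def add_derived_features(customer):
    """Return a copy of the record with an average across the existing
    monthly features added as a new, investigator-derived feature."""
    enriched = dict(customer)
    enriched["avg_monthly_total"] = mean(customer["monthly_totals"])
    return enriched

def to_binary(risk_level):
    """Collapse a multinomial risk label into a binary outcome:
    0 for the lowest category, 1 for anything above it."""
    return 0 if risk_level == "low" else 1

# Hypothetical customer record with three months of transaction totals.
customer = {
    "customer_id": "C001",
    "monthly_totals": [100.0, 300.0, 200.0],
    "risk_level": "medium",
}

enriched = add_derived_features(customer)
binary_outcome = to_binary(customer["risk_level"])
print(enriched["avg_monthly_total"], binary_outcome)
```

Where the binary cut-off is placed is itself a modelling decision, which is one reason the multinomial and binary versions of a model can rank differently on accuracy and F1.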


2018 ◽  
Author(s):  
Nazmul Hossain ◽  
Fumihiko Yokota ◽  
Akira Fukuda ◽  
Ashir Ahmed

BACKGROUND Predictive analytics through machine learning has been used extensively across industries, including eHealth and mHealth, for analyzing patients’ health data, predicting diseases, enhancing the productivity of technology or devices used for providing healthcare services, and so on. However, not enough studies have been conducted to predict the usage of eHealth by rural patients in developing countries. OBJECTIVE The objective of this study is to predict rural patients’ use of eHealth through supervised machine learning algorithms and to propose the best-fitted model after evaluating their performance in terms of predictive accuracy. METHODS Data were collected between June and July 2016 through a field survey with a structured questionnaire from 292 randomly selected rural patients in a remote North-Western sub-district of Bangladesh. Four supervised machine learning algorithms, namely logistic regression, boosted decision tree, support vector machine, and artificial neural network, were chosen for this experiment. A ‘correlation-based feature selection’ technique was applied to include the most relevant but not redundant features in the model. A 10-fold cross-validation technique was applied to reduce bias and over-fitting. RESULTS Logistic regression outperformed the other three algorithms with 85.9% predictive accuracy, 86.4% precision, 90.5% recall, 88.1% F-score, and an AUC of 91.5%, followed by neural network, decision tree, and support vector machine with accuracy rates of 84.2%, 82.9%, and 80.4%, respectively. CONCLUSIONS The findings of this study are expected to help eHealth practitioners select appropriate areas to serve and deal with both under-capacity and over-capacity by predicting patients’ responses in advance with a certain level of accuracy and precision.
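The 10-fold cross-validation mentioned under METHODS can be illustrated by the index-splitting sketch below, which uses the sample size of 292 reported in the abstract. The function name `k_fold_indices` is an assumption, and the sketch omits the shuffling (and often stratification) that would normally precede the split.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs so that every sample appears
    in exactly one test fold."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(292, k=10))
for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)}")
```

Each model is trained ten times, once per fold, and the reported metric is typically the average over the ten held-out folds.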


2021 ◽  
Author(s):  
Daniela A. Gomez-Cravioto ◽  
Ramon E. Diaz-Ramos ◽  
Neil Hernandez Gress ◽  
Jose Luis Preciado ◽  
Hector G. Ceballos

Abstract Background: This paper explores different machine learning algorithms and approaches for predicting alum income to obtain insights on the strongest predictors for income and a ‘high’ earners’ class. Methods: The study examines the alum sample data obtained from a survey from Tecnologico de Monterrey, a multicampus Mexican private university, and analyses it within the cross-industry standard process for data mining. Survey results include 17,898 and 12,275 observations before and after cleaning and pre-processing, respectively. The dataset includes values for income and a large set of independent variables, including demographic and occupational attributes of the former students and academic attributes from the institution’s history. We conduct an in-depth analysis to determine whether the accuracy of traditional algorithms in econometric research to predict income can be improved with a data science approach. Furthermore, we present insights on patterns obtained using explainable artificial intelligence techniques. Results: Results show that the gradient boosting model outperformed the parametric models, linear and logistic regression, in predicting alum’s current income with statistically significant results (p < 0.05) in three tasks: ordinary least-squares regression, multi-class classification and binary classification. Moreover, the linear and logistic regression models were found to be the most accurate methods for predicting the alum’s first income. The non-parametric models showed no significant improvements. Conclusion: We identified that age, gender, working hours per week, first income after graduation and variables related to the alum’s job position and firm contributed to explaining their income. Findings indicated a gender wage gap, suggesting that further work is needed to enable equality.


2021 ◽  
Author(s):  
Naveen Kunnathuvalappil Hariharan

Learning the determinants of successful project budgeting is crucial. This research attempts to empirically find the determinants of a successful budget. To find this, this work applied three different supervised machine learning algorithms for classification: Support Vector Machine (SVM), Logistic regression, and Probit regression with data from 470 projects. Five features have been selected: coordination, participation, budget control, communication, and motivation. The SVM analysis results showed that SVM could predict successful and failed budgets with fairly good accuracy. The results from Logistic and Probit regression showed that if managers properly focus on coordination, participation, budget control, and communication, the probability of success in project-budget increases.
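Logistic and Probit regression differ only in the link function that maps a weighted sum of the features to a probability of success. The sketch below shows both links applied to the five features named in this abstract; the weights, bias, and feature values are invented for illustration and are not estimates from the study.

```python
import math

def logistic_cdf(z):
    """Logistic link: standard logistic cumulative distribution."""
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    """Probit link: standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def success_probability(features, weights, bias, link):
    """Probability of a successful budget under the given link function."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return link(z)

# Feature order: coordination, participation, budget control,
# communication, motivation. All numbers below are hypothetical.
weights = [0.8, 0.6, 0.7, 0.5, 0.1]
bias = -1.5
low_coordination = [0.2, 0.2, 0.2, 0.2, 0.2]
high_coordination = [0.9, 0.2, 0.2, 0.2, 0.2]

p_logit_low = success_probability(low_coordination, weights, bias, logistic_cdf)
p_logit_high = success_probability(high_coordination, weights, bias, logistic_cdf)
p_probit_low = success_probability(low_coordination, weights, bias, probit_cdf)
p_probit_high = success_probability(high_coordination, weights, bias, probit_cdf)
print(p_logit_low, p_logit_high, p_probit_low, p_probit_high)
```

With a positive weight, raising a feature raises the predicted probability under either link, which is the sense in which the abstract says that focusing on these factors increases the probability of success.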


2020 ◽  
Vol 14 (2) ◽  
pp. 140-159
Author(s):  
Anthony-Paul Cooper ◽  
Emmanuel Awuni Kolog ◽  
Erkki Sutinen

This article builds on previous research exploring the content of church-related tweets. It does so by examining whether the qualitative thematic coding of such tweets can, in part, be automated by the use of machine learning. It compares three supervised machine learning algorithms to understand how useful each algorithm is at a classification task, based on a dataset of human-coded church-related tweets. The study finds that one such algorithm, Naïve-Bayes, performs better than the other algorithms considered, returning Precision, Recall and F-measure values which each exceed an acceptable threshold of 70%. This has far-reaching consequences at a time when the high volume of social media data, in this case Twitter data, means that the resource-intensity of manual coding approaches can act as a barrier to understanding how the online community interacts with, and talks about, church. The findings presented in this article offer a way forward for scholars of digital theology to better understand the content of online church discourse.
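To show how a Naïve-Bayes classifier can assign thematic codes to short texts, here is a minimal multinomial sketch with add-one smoothing. The class name `NaiveBayesText`, the two theme labels, and the example tweets are invented for illustration and do not come from the study's coded dataset.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for w in words:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented, human-coded toy tweets with hypothetical theme labels.
texts = [
    "join us for sunday worship",
    "worship service this sunday morning",
    "youth group bake sale on friday",
    "charity bake sale this friday",
]
labels = ["worship", "worship", "event", "event"]

model = NaiveBayesText().fit(texts, labels)
prediction_a = model.predict("sunday morning worship")
prediction_b = model.predict("bake sale friday")
print(prediction_a, prediction_b)
```

Precision, Recall, and F-measure would then be computed per theme by comparing such predictions against held-out human codes.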

