Developing an Ensembled Machine Learning Prediction Model for Marine Fish and Aquaculture Production

Labonnah Farzana Rahman; Mohammad Marufuzzaman; Lubna Alam; Md Azizul Bari; Ussif Rashid Sumaila; Lariyah Mohd Sidek

doi:10.3390/su13169124

Sustainability ◽

10.3390/su13169124 ◽

2021 ◽

Vol 13 (16) ◽

pp. 9124

Author(s):

Labonnah Farzana Rahman ◽

Mohammad Marufuzzaman ◽

Lubna Alam ◽

Md Azizul Bari ◽

Ussif Rashid Sumaila ◽

...

Keyword(s):

Machine Learning ◽

Marine Fish ◽

Climatic Variables ◽

Gradient Boosting ◽

Global Changes ◽

Fish Production ◽

Random Forest Regression ◽

Linear Gradient ◽

Feature Importance ◽

Aquaculture Production

The fishing industry is identified as a strategic sector for raising domestic protein production and supply in Malaysia. Global changes in climatic variables have impacted and continue to impact marine fish and aquaculture production, yet machine learning (ML) methods have not been used extensively to study aquatic systems in Malaysia. ML-based algorithms can be paired with feature importance (i.e., identifying the features that have the most predictive power) to achieve better prediction accuracy and provide new insights into fish production. This research aims to develop an ML-based prediction model for marine fish and aquaculture production. Based on the feature importance scores, we select the group of climatic variables for three different ML models: linear, gradient boosting, and random forest regression. Climatic variables and fish production data from the past 20 years (2000–2019) were used to train and test the ML models. Finally, an ensemble approach named voting regression combines those three ML models. Performance metrics are generated, and the results show that the ensembled ML model obtains R2 values of 0.75, 0.81, and 0.55 for marine water, freshwater, and brackish water, respectively, outperforming the single ML models in predicting all three types of fish production (in tons) in Malaysia.
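As a rough illustration of the voting-regression idea described in this abstract, the sketch below averages the predictions of three base models and scores the result with R2. It is a minimal stdlib-only sketch, not the authors' implementation: the function names `voting_predict` and `r2_score` and all numbers are hypothetical toy values, not data from the paper.

```python
from statistics import mean


def voting_predict(model_predictions):
    """Average the predictions of several models, sample by sample."""
    return [mean(sample) for sample in zip(*model_predictions)]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, NOT data from the paper.
y_true = [10.0, 12.0, 14.0, 16.0]
linear_pred = [9.0, 12.0, 15.0, 17.0]     # stands in for linear regression
boosting_pred = [11.0, 13.0, 13.0, 16.0]  # stands in for gradient boosting
forest_pred = [10.0, 11.0, 14.0, 18.0]    # stands in for random forest

ensemble_pred = voting_predict([linear_pred, boosting_pred, forest_pred])

for name, pred in [("linear", linear_pred), ("boosting", boosting_pred),
                   ("forest", forest_pred), ("ensemble", ensemble_pred)]:
    print(name, round(r2_score(y_true, pred), 3))
```

In this toy case the errors of the individual models partly cancel, so the averaged prediction scores higher than each base model. Averaging does not guarantee an improvement in general; it tends to help when the base models make different errors.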

Interpretable Machine Learning for Early Neurological Deterioration Prediction in Atrial Fibrillation-Related Stroke

10.21203/rs.3.rs-446890/v1 ◽

2021 ◽

Author(s):

Seong Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi O ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multi-center prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanations (SHAP) method to evaluate feature importance. Of the 3,623 stroke patients, the 2,363 who had arrived at the hospital within 24 hours of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.778; 95% CI, 0.726–0.830). The feature importance analysis revealed that fasting glucose level and the National Institute of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the SHAP method can be adjusted to individualize the features’ effects on the predictive power of the model.

Swindling Shonky Anatomization of Credit Card Transactions using Machine Learning

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d7621.118419 ◽

2019 ◽

Vol 8 (4) ◽

pp. 1477-1483

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Random Forest ◽

Decision Tree ◽

Credit Card ◽

Naive Bayes ◽

Gradient Boosting ◽

Decision Tree Classifier ◽

Tree Classifier ◽

Feature Importance

With rapid technological advancement, internet usage has increased sharply in every field. Online credit card transactions are now used for a wide variety of applications, including online shopping, banking, bill settlement, ticket booking for travel and hotels, fee payment to educational organizations, payment for hospital treatment, and supermarket purchases. This leads to fraudulent use of other people's accounts and to transactions that result in loss of service and profit for the institution. Against this background, this paper focuses on predicting fraudulent credit card transactions. The Credit Card Transaction dataset from the Kaggle machine learning repository is used for the prediction analysis. The analysis is carried out in four steps. First, the relationships between the variables of the dataset are identified and represented graphically. Second, the feature importance of the dataset is identified using Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting and Naive Bayes classifiers. Third, the extracted feature importance of the credit card transaction dataset is fitted to the Random Forest, AdaBoost, Logistic Regression, Decision Tree, Extra Tree, Gradient Boosting and Naive Bayes classifiers. Fourth, performance is analyzed using metrics such as accuracy, F-score, AUC score, precision and recall. The implementation is done in Python using the Spyder integrated development environment from Anaconda Navigator. Experimental results show that the Decision Tree classifier achieved effective prediction, with a precision of 1.0, recall of 1.0, F-score of 1.0, AUC score of 89.09 and accuracy of 99.92%.
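The performance metrics named in this abstract can be computed directly from a confusion matrix. The sketch below is a minimal stdlib-only illustration; the function name `classification_metrics` and the transaction counts are hypothetical and are not taken from the paper or its dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-score and accuracy from confusion counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives, with "fraud" as the positive class.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy}


# Hypothetical counts: 10,000 transactions, of which 20 are fraudulent.
# The classifier flags 12 transactions; 10 of those are real fraud.
metrics = classification_metrics(tp=10, fp=2, fn=10, tn=9978)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these toy counts, accuracy is very high even though half of the fraudulent transactions are missed. On heavily imbalanced data, precision, recall and F-score are therefore usually read alongside accuracy.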

An Attempt to Boost Molecular Descriptors with Quantum-Derived Features in Prediction of Maximum Emission Wavelengths of Chromophores

10.26434/chemrxiv.14534136.v1 ◽

2021 ◽

Author(s):

Bartłomiej Fliszkiewicz

Keyword(s):

Machine Learning ◽

Experimental Data ◽

Optical Properties ◽

Molecular Descriptors ◽

Gradient Boosting ◽

Learning Models ◽

Linear Gradient ◽

Maximum Emission ◽

Improve Accuracy ◽

Machine Learning Models

This research assesses the capability of machine learning to predict the maximum emission wavelength of organic compounds. The predictions are based on structure descriptors and fingerprints widely applied in cheminformatics. In an attempt to further improve accuracy, the developed machine learning models were enriched with features derived from quantum mechanics. Multiple linear, gradient boosting and random forest regressions were applied. The models were trained and tested on a database of experimental optical-property data.

Machine Learning Approaches to Determine Feature Importance for Predicting Infant Autopsy Outcome

10.1101/2020.05.21.20105221 ◽

2020 ◽

Author(s):

John Booth ◽

Ben Margetts ◽

William Bryant ◽

Richard Issitt ◽

John Ciaran Hutchinson ◽

...

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Cause Of Death ◽

Policy Development ◽

Predictive Accuracy ◽

Predictive Performance ◽

Gradient Boosting ◽

Objective Evidence ◽

Wide Range ◽

Feature Importance

Introduction: Sudden unexpected death in infancy (SUDI) represents the commonest presentation of postneonatal death, yet despite full postmortem examination (autopsy), the cause of death is only determined in around 45% of cases, the majority remaining unexplained. In order to aid counselling and understand how to improve the investigation, we explored whether machine learning could be used to derive data-driven insights for prediction of infant autopsy outcome. Methods: A paediatric autopsy database containing >7,000 cases in total, with >300 variables per case, was analysed with cases categorised both by stage of examination (external, internal, and internal with histology) and by autopsy outcome, classified as explained (medical cause of death identified) or unexplained. For the purposes of this study, only deaths of infants and children aged ≤2 years were included (N=3100). Decision tree, random forest, and gradient boosting models were then iteratively trained and evaluated for each stage of the post-mortem examination and compared using predictive accuracy metrics. Results: Data from 3,100 infant and young child autopsies were included. The naive decision tree model using initial external examination data had a predictive performance of 68% for determining whether a medical cause of death could be identified. Model performance increased when internal examination data were included, and a core set of data items was identified using model feature importance as key variables for determining autopsy outcome. The most effective model was XG Boost, with an overall predictive performance of 80%, demonstrating age at death and cardiovascular or respiratory histological findings as the most important variables associated with determining cause of death. Conclusion: This study demonstrates the feasibility of using machine learning models to objectively determine component importance of complex medical procedures, in this case infant autopsy, to inform clinical practice. It further highlights the value of collecting routine clinical procedural data according to defined standards. This approach can be applied to a wide range of clinical and operational healthcare scenarios, providing objective, evidence-based information for uses such as counselling, decision making and policy development.

Exploration of Multiple Linear Regression with Ensembling Schemes for Roof Fall Assessment using Machine Learning

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.l3474.1081219 ◽

2019 ◽

Vol 8 (12) ◽

pp. 134-139

Keyword(s):

Machine Learning ◽

Linear Regression ◽

Multiple Linear Regression ◽

Mean Squared Error ◽

Absolute Error ◽

Gradient Boosting ◽

The People ◽

Roof Fall ◽

Feature Importance ◽

Feature Scaling

Roof fall in buildings is a major threat to society, as it can cause severe harm to people's lives. Engineers have recently focused on predicting roof fall in order to avoid damage to the environment and to people. Early prediction of roof fall is a social responsibility of engineers toward the health and wealth of the nation. This paper attempts to identify the essential attributes of the roof fall dataset, taken from the UCI Machine Learning Repository, for predicting the occurrence of roof fall. The important features are extracted using several ensembling methods: Gradient Boosting Regressor, Random Forest Regressor, AdaBoost Regressor and Extra Trees Regressor. The important features extracted by each ensembling method are then fitted with multiple linear regression to analyze performance. The same features are also subjected to feature scaling and then fitted with multiple linear regression. Performance is analyzed using Mean Squared Log Error (MSLE), Mean Absolute Error (MAE), R2 score, Mean Squared Error (MSE) and Explained Variance Score (EVS). The code is written in Python and run in the IPython console of Spyder under Anaconda Navigator. Experimental results show that, before feature scaling, the Extra Trees Regressor is the most effective of the ensembling methods, with an MSE of 0.06, MAE of 0.07, R2 score of 87%, EVS of 0.89 and MSLE of 0.02. Likewise, after feature scaling, the features extracted by the Extra Trees Regressor remain the most effective, with an MSE of 0.04, MAE of 0.03, R2 score of 96%, EVS of 0.9 and MSLE of 0.01.
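The feature-scaling step and three of the error metrics mentioned in this abstract can be sketched as follows. The abstract does not say which scaling method was used, so standardization (zero mean, unit variance) is an assumption here. The function names and toy numbers are hypothetical and do not come from the paper.

```python
from math import log1p
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance (assumed method)."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]  # a constant column has no spread to scale
    return [(x - mu) / sigma for x in column]


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def msle(y_true, y_pred):
    """Mean squared log error; inputs must be non-negative."""
    return mean((log1p(t) - log1p(p)) ** 2 for t, p in zip(y_true, y_pred))


# Hypothetical toy values, NOT from the roof fall dataset.
scaled = standardize([2.0, 4.0, 6.0, 8.0])
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.5, 2.5]

print("scaled feature:", [round(x, 3) for x in scaled])
print("MSE:", round(mse(y_true, y_pred), 4))
print("MAE:", round(mae(y_true, y_pred), 4))
print("MSLE:", round(msle(y_true, y_pred), 4))
```

Ordinary least-squares predictions are in principle unaffected by rescaling the inputs, so differences before and after scaling typically depend on other details of the pipeline, which the abstract does not describe.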

An Attempt to Boost Molecular Descriptors with Quantum-Derived Features in Prediction of Maximum Emission Wavelengths of Chromophores

10.26434/chemrxiv.14534136 ◽

2021 ◽

Author(s):

Bartłomiej Fliszkiewicz

Keyword(s):

Machine Learning ◽

Experimental Data ◽

Optical Properties ◽

Molecular Descriptors ◽

Gradient Boosting ◽

Learning Models ◽

Linear Gradient ◽

Maximum Emission ◽

Improve Accuracy ◽

Machine Learning Models

Machine Learning Approaches to Determine Feature Importance for Predicting Infant Autopsy Outcome

Pediatric and Developmental Pathology ◽

10.1177/10935266211001644 ◽

2021 ◽

pp. 109352662110016

Author(s):

John Booth ◽

Ben Margetts ◽

Will Bryant ◽

Richard Issitt ◽

Ciaran Hutchinson ◽

...

Keyword(s):

Machine Learning ◽

Decision Tree ◽

Cause Of Death ◽

Predictive Performance ◽

Gradient Boosting ◽

Learning Approaches ◽

Unexpected Death ◽

Component Importance ◽

Infant And Young Child ◽

Feature Importance

Introduction: Sudden unexpected death in infancy (SUDI) represents the commonest presentation of postneonatal death. We explored whether machine learning could be used to derive data-driven insights for prediction of infant autopsy outcome. Methods: A paediatric autopsy database containing >7,000 cases, with >300 variables, was analysed by examination stage and autopsy outcome, classified as ‘explained (medical cause of death identified)’ or ‘unexplained’. Decision tree, random forest, and gradient boosting models were iteratively trained and evaluated. Results: Data from 3,100 infant and young child (<2 years) autopsies were included. A naïve decision tree using external examination data had a performance of 68% for predicting an explained death. Core data items were identified using model feature importance. The most effective model was XG Boost, with an overall predictive performance of 80%, demonstrating age at death and cardiovascular and respiratory histological findings as the most important variables associated with determining medical cause of death. Conclusion: This study demonstrates the feasibility of using machine learning to evaluate component importance of complex medical procedures (paediatric autopsy) and highlights the value of collecting routine clinical data according to defined standards. This approach can be applied to a range of clinical and operational healthcare scenarios.

Interpretable machine learning for early neurological deterioration prediction in atrial fibrillation-related stroke

Scientific Reports ◽

10.1038/s41598-021-99920-7 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Seong-Hwan Kim ◽

Eun-Tae Jeon ◽

Sungwook Yu ◽

Kyungmi Oh ◽

Chi Kyung Kim ◽

...

Keyword(s):

Machine Learning ◽

Atrial Fibrillation ◽

Neurological Deterioration ◽

Gradient Boosting ◽

Support Vector ◽

Light Gradient ◽

Interpretable Machine Learning ◽

Extreme Gradient Boosting ◽

Early Neurological Deterioration ◽

Feature Importance

We aimed to develop a novel prediction model for early neurological deterioration (END) based on an interpretable machine learning (ML) algorithm for atrial fibrillation (AF)-related stroke and to evaluate the prediction accuracy and feature importance of ML models. Data from multicenter prospective stroke registries in South Korea were collected. After stepwise data preprocessing, we utilized logistic regression, support vector machine, extreme gradient boosting, light gradient boosting machine (LightGBM), and multilayer perceptron models. We used the Shapley additive explanation (SHAP) method to evaluate feature importance. Of the 3,213 stroke patients, the 2,363 who had arrived at the hospital within 24 h of symptom onset and had available information regarding END were included. Of these, 318 (13.5%) had END. The LightGBM model showed the highest area under the receiver operating characteristic curve (0.772; 95% confidence interval, 0.715–0.829). The feature importance analysis revealed that fasting glucose level and the National Institute of Health Stroke Scale score were the most influential factors. Among ML algorithms, the LightGBM model was particularly useful for predicting END, as it revealed new and diverse predictors. Additionally, the effects of the features on the predictive power of the model were individualized using the SHAP method.
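The SHAP method mentioned in this abstract is built on Shapley values, which attribute a prediction to features by averaging each feature's marginal contribution over all orderings. The sketch below computes exact Shapley values for a tiny hand-written value function. It illustrates the underlying definition only and is not the SHAP library or the authors' pipeline. The feature names are borrowed from the abstract for readability, while `toy_value` and every number in it are invented.

```python
from itertools import permutations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {feature: 0.0 for feature in features}
    for order in permutations(features):
        present = set()
        for feature in order:
            before = value_fn(frozenset(present))
            present.add(feature)
            after = value_fn(frozenset(present))
            totals[feature] += after - before
    n_orders = factorial(len(features))
    return {feature: total / n_orders for feature, total in totals.items()}


def toy_value(known):
    """Hypothetical model output when only the features in `known` are available."""
    score = 0.0
    if "glucose" in known:
        score += 0.2
    if "nihss" in known:
        score += 0.3
    if "glucose" in known and "nihss" in known:
        score += 0.1  # invented interaction term, shared equally by both features
    return score


features = ["glucose", "nihss", "age"]
phi = shapley_values(toy_value, features)
for feature, value in phi.items():
    print(feature, round(value, 3))
```

The values sum to the difference between the output with all features and the output with none, which is the additivity property that makes per-patient explanations possible. Exact enumeration grows factorially with the number of features, so practical tools rely on faster model-specific or sampling-based algorithms.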

Nowcasting thunderstorm hazards using machine learning: the impact of data sources on performance

10.5194/nhess-2021-171 ◽

2021 ◽

Author(s):

Jussi Leinonen ◽

Ulrich Hamann ◽

Urs Germann ◽

John R. Mecikalski

Keyword(s):

Machine Learning ◽

Weather Prediction ◽

Radar Data ◽

Data Sources ◽

Gradient Boosting ◽

Predictive Variables ◽

Feature Importance ◽

Radar Echo ◽

Elevation Model ◽

The Impact

Abstract. In order to aid feature selection in thunderstorm nowcasting, we present an analysis of the utility of various sources of data for machine-learning-based nowcasting of hazards related to thunderstorms. We considered ground-based radar data, satellite-based imagery and lightning observations, forecast data from numerical weather prediction (NWP) and the topography from a digital elevation model (DEM), ending up with 106 different predictive variables. We evaluated machine-learning models to nowcast radar reflectivity (representing precipitation), lightning occurrence, and the 45 dBZ radar echo top height that can be used as an indicator of hail, producing predictions for lead times up to 60 min. The study was carried out in an area in the northeast United States, where observations from the Geostationary Operational Environmental Satellite 16 are available and can be used as a proxy for the upcoming Meteosat Third Generation capabilities in Europe. The benefits of the data sources were evaluated using two complementary approaches: using feature importance reported by the machine learning model based on gradient boosted trees, and by repeating the analysis using all possible combinations of the data sources. The two approaches sometimes yielded seemingly contradictory results, as the feature importance reported by the gradient boosting algorithm sometimes disregards certain features that are still useful in the absence of more powerful predictors, while at times it overstates the importance of other features. We found that the radar data is overall the most important predictor, the satellite imagery is beneficial for all of the studied predictands, and the lightning data is very useful for nowcasting lightning but of limited use for the other hazards. The benefits of the NWP data are more limited over the nowcast period, and we did not find evidence that the nowcast benefits from the DEM data.
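The second evaluation approach in this abstract, repeating the analysis for all possible combinations of data sources, can be sketched as a simple enumeration. The five source labels below are shorthand for the sources listed in the abstract. The function name `all_source_combinations` and the exclusion of the empty combination are assumptions made for illustration.

```python
from itertools import combinations

# Shorthand labels for the five data sources named in the abstract.
DATA_SOURCES = ["radar", "satellite", "lightning", "nwp", "dem"]


def all_source_combinations(sources):
    """Yield every non-empty subset of the data sources, smallest first."""
    for size in range(1, len(sources) + 1):
        yield from combinations(sources, size)


combos = list(all_source_combinations(DATA_SOURCES))
print(len(combos))  # 2**5 - 1 = 31 non-empty subsets
for combo in combos[:6]:
    print(combo)
```

Training one model per subset shows what each source contributes when stronger predictors are absent, which a single model's feature importance can hide. The cost doubles with every additional source, so the approach is practical only for a small number of source groups.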

Developing a Machine Learning prediction model for bedside decision support by predicting readmission or death following discharge from the Intensive Care unit

10.21203/rs.2.21940/v1 ◽

2020 ◽

Author(s):

Patrick J. Thoral ◽

Mattia Fornasa ◽

Daan P. de Bruin ◽

Hidde Hovenkamp ◽

Ronald H. Driessen ◽

...

Keyword(s):

Machine Learning ◽

Relative Risk ◽

Risk Reduction ◽

Impact Analysis ◽

Relative Risk Reduction ◽

Patient Characteristics ◽

Gradient Boosting ◽

Icu Discharge ◽

Feature Importance ◽

Supervised Learning Algorithms

Background: Unexpected ICU readmission is associated with longer length of stay and an increase in mortality. Real-time support systems could prevent untimely discharge from the ICU. We aim to develop a machine learning model for implementation at the bedside by predicting the risk of ICU readmission or death at the time of potential discharge, showing feature importance and visualizing day-to-day changes in risk. Methods: Data from adult patients admitted to our mixed surgical-medical ICU between 2004 and 2016 were used in the analysis. Patient characteristics, clinical observations, (automated) physiological measurements, laboratory studies and treatment data were considered as model features. Different supervised learning algorithms were trained to predict ICU readmission and/or death, both within 7 days from ICU discharge, using 10-fold cross-validation. Feature importance was determined using SHapley Additive exPlanations. We constructed readmission probability-time curves to identify subgroups. Results: Our dataset included 14,105 admissions. The combined readmission/mortality rate within seven days of ICU discharge was 5.3%. Using Gradient Boosting, the model achieved a Receiver Operating Characteristic AUC of 0.802 (95% CI 0.789-0.816) and a Precision-Recall AUC of 0.198 (95% CI 0.185-0.211). The most predictive features were well-known parameters, including physiological parameters, as well as less apparent features like nutritional support. Impact analysis using probability-time curves identified specific patient groups, which might lead to a change in discharge management with a relative risk reduction of 17%. Conclusions: We developed a model that can accurately predict readmission and mortality after ICU discharge. Impact analysis showed that a relative risk reduction of 17% could be achievable. Given the large and increasing number of ICU admissions worldwide, this modest reduction may have a significant impact for patients and society.
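The 10-fold cross-validation mentioned in the Methods can be sketched as an index split in which every sample is used for testing exactly once. This is a minimal stdlib-only sketch with a hypothetical function name, `k_fold_indices`. It uses contiguous folds without shuffling or stratification, which may differ from what the authors did.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Fold sizes differ by at most one; no shuffling or stratification is applied.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical toy size; a real dataset would be split the same way.
folds = list(k_fold_indices(25, k=10))
for train, test in folds[:3]:
    print("test fold:", test, "train size:", len(train))
```

Each model is fitted on the training indices and evaluated on the held-out fold, and the ten scores are then summarized. With a rare outcome such as the 5.3% event rate reported here, stratified folds are a common refinement so that every fold contains some positive cases.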
