Extracting Knowledge from Big Data for Sustainability: A Comparison of Machine Learning Techniques

2019 ◽  
Vol 11 (23) ◽  
pp. 6669 ◽  
Author(s):  
Raghu Garg ◽  
Himanshu Aggarwal ◽  
Piera Centobelli ◽  
Roberto Cerchione

At present, due to the scarcity of natural resources, society should take maximum advantage of data, information, and knowledge to achieve sustainability goals. In today's world, the existence of humans is not possible without the essential proliferation of plants. Through photosynthesis, plants use solar energy and convert it into chemical energy. This process is responsible for all life on earth, and the main controlling factor for proper plant growth is soil, since it holds water, air, and all the essential nutrients for plant nourishment. However, due to overexposure, soil becomes degraded, so fertilizer is an essential component for maintaining soil quality. In that regard, soil analysis is a suitable method to determine soil quality. Soil analysis examines the soil in laboratories and generates reports of unorganized, unstructured data. In this study, different big data machine learning methods are used to extract knowledge from these data and identify fertilizer recommendation classes based on the current soil nutrient composition. For this experiment, soil analysis reports were collected from the Tata soil and water testing center. In this paper, the Mahout library is used to analyze the performance of stochastic gradient descent (SGD) and artificial neural network (ANN) models in a Hadoop environment. For a broader performance evaluation, we also ran single-machine experiments for random forest (RF), K-nearest neighbors (K-NN), regression tree (RT), support vector machine (SVM) with a polynomial kernel, and SVM with a radial basis function (RBF) kernel. A detailed experimental analysis was carried out on the soil reports dataset using overall accuracy, the area under the receiver operating characteristic curve (AUC–ROC), mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2) as validation measurements.
The results provide a comparison of the solution classes and conclude that SGD outperforms the other approaches. Finally, the proposed results support selecting a solution class that suggests a suitable fertilizer to crops for maximum production.
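The core SGD classification step described above can be sketched with scikit-learn; this is an illustrative stand-in, not the authors' Mahout/Hadoop code, and the nutrient features, class rule, and data are all synthetic:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-ins for soil nutrient features (e.g. N, P, K, pH, organic carbon)
X = rng.normal(size=(600, 5))
# Two illustrative fertilizer-recommendation classes from an invented linear rule
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scaler = StandardScaler().fit(X_tr)

clf = SGDClassifier(max_iter=1000, random_state=0)  # hinge loss by default
clf.fit(scaler.transform(X_tr), y_tr)
acc = clf.score(scaler.transform(X_te), y_te)
print(f"test accuracy: {acc:.2f}")
```

Scaling matters here because SGD's updates are sensitive to feature magnitudes; on real soil reports the nutrient columns would span very different ranges.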

Sensors ◽  
2021 ◽  
Vol 21 (14) ◽  
pp. 4655
Author(s):  
Dariusz Czerwinski ◽  
Jakub Gęca ◽  
Krzysztof Kolano

In this article, the authors propose two models for BLDC motor winding temperature estimation using machine learning methods. For the purposes of the research, measurements were made over more than 160 h of motor operation and then preprocessed. The algorithms of linear regression, ElasticNet, stochastic gradient descent regression, support vector machines, decision trees, and AdaBoost were used for predictive modeling. The models' ability to generalize was achieved by hyperparameter tuning with the use of cross-validation. The research led to promising winding temperature estimation accuracy. In the case of sensorless temperature prediction (model 1), the mean absolute percentage error (MAPE) was below 4.5% and the coefficient of determination R2 was above 0.909. In addition, extending the model with a temperature measurement on the casing (model 2) reduced the error to about 1% and increased R2 to 0.990. The results obtained for the first proposed model show that overheating protection of the motor can be ensured without direct temperature measurement. In addition, the introduction of a simple casing temperature measurement system allows for estimation with accuracy suitable for compensating motor output torque changes related to temperature.
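The tune-by-cross-validation workflow the authors describe can be sketched for one of their model families (ElasticNet); the feature names, data, and hyperparameter grid below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
# Synthetic stand-ins for motor signals (e.g. current, speed, ambient temp, load)
X = rng.uniform(size=(500, 4))
# Invented linear "winding temperature" target with measurement noise
y = 40 + 30 * X[:, 0] + 10 * X[:, 1] + rng.normal(scale=1.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    {"alpha": [0.001, 0.01, 0.1], "l1_ratio": [0.2, 0.5, 0.8]},
    cv=5,  # 5-fold cross-validation picks the best hyperparameters
)
search.fit(X_tr, y_tr)
pred = search.predict(X_te)

mape = float(np.mean(np.abs((y_te - pred) / y_te)))  # mean absolute percentage error
r2 = r2_score(y_te, pred)
print(f"MAPE: {mape:.3%}  R2: {r2:.3f}")
```

The held-out test split plays the role of unseen motor operation data: the grid search only ever sees the training folds, which is what gives the generalization estimate meaning.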


Online discussion forums and blogs are very vibrant platforms for cancer patients to express their views in the form of stories. These stories sometimes become a source of inspiration for patients who are anxiously searching for similar cases. This paper proposes a method using natural language processing and machine learning to analyze unstructured texts accumulated from patients' reviews and stories. The proposed methodology aims to identify behavior, emotions, side-effects, decisions, and demographics associated with cancer victims. The pre-processing phase of our work involves extraction of web text followed by text cleaning, where special characters and symbols are omitted, and finally tagging the texts using NLTK's (Natural Language Toolkit) POS (Parts of Speech) Tagger. The post-processing phase performs training of seven machine learning classifiers (see Table 6). The Decision Tree classifier shows the highest precision (0.83) among the classifiers, while the area under the receiver operating characteristic curve (AUC) is highest (0.98) for the Support Vector Machine (SVM) classifier.
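The clean-then-classify pipeline shape described above can be sketched as follows; the tiny corpus and category labels are invented, and TF-IDF stands in here for the paper's POS-tagged features:

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline

def clean(text):
    """Drop special characters and symbols, lowercase, collapse whitespace."""
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()
    return re.sub(r"\s+", " ", text).strip()

# Invented forum snippets with illustrative category annotations
posts = [
    "Chemo was hard!!! but the side-effects faded after week 3 :)",
    "Feeling hopeful today, my scan results came back clear.",
    "The anxiety before every check-up is overwhelming...",
    "Support from this forum helped me decide on surgery.",
]
labels = ["side-effect", "emotion", "emotion", "decision"]

model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))
model.fit([clean(p) for p in posts], labels)
pred = model.predict([clean("the side-effects were rough this week")])[0]
print(pred)
```

A real run would replace the toy corpus with the scraped patient stories and the TF-IDF step with the NLTK POS-based features the paper uses.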


2020 ◽  
Vol 10 (5) ◽  
pp. 1691 ◽  
Author(s):  
Deliang Sun ◽  
Mahshid Lonbani ◽  
Behnam Askarian ◽  
Danial Jahed Armaghani ◽  
Reza Tarinejad ◽  
...  

Despite the vast usage of machine learning techniques to solve engineering problems, a very limited number of studies on the rock brittleness index (BI) have used these techniques to analyze issues in this field. The present study developed five well-known machine learning techniques and compared their performance in predicting the brittleness index of rock samples. The comparison of the models' performance was conducted through a ranking system. These techniques included the Chi-square automatic interaction detector (CHAID), random forest (RF), support vector machine (SVM), K-nearest neighbors (KNN), and artificial neural network (ANN). This study used a dataset from a water transfer tunneling project in Malaysia. Results of simple rock index tests, i.e., Schmidt hammer, p-wave velocity, point load, and density, were considered as model inputs. The results of this study indicated that while the RF model had the best performance for training (ranking = 25), the ANN outperformed the other models for testing (ranking = 22). However, the KNN model achieved the highest cumulative ranking, which was 37, and showed desirable stability for both training and testing. Nevertheless, the results of the validation stage indicated that the RF model, with a coefficient of determination (R2) of 0.971, provides higher predictive capacity for the rock BI than the KNN model (R2 = 0.807) and the ANN model (R2 = 0.860). The results of this study suggest a practical use of machine learning models in solving problems related to rock mechanics, especially the rock brittleness index.
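The cumulative ranking idea used above can be sketched as follows: score each model on the training and testing splits and sum its per-stage ranks. The data, target rule, and model subset here are synthetic stand-ins, not the Malaysian tunneling dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(2)
# Four synthetic inputs standing in for Schmidt hammer, p-wave velocity,
# point load, and density
X = rng.uniform(size=(300, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.05, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)
models = {
    "RF": RandomForestRegressor(random_state=2),
    "KNN": KNeighborsRegressor(),
    "SVM": SVR(),
}
# (train R2, test R2) for each model
scores = {name: (m.fit(X_tr, y_tr).score(X_tr, y_tr), m.score(X_te, y_te))
          for name, m in models.items()}

# Rank models within each stage (1 = worst) and accumulate, as in the
# paper's ranking system; the highest cumulative rank wins overall
ranks = {name: 0 for name in models}
for stage in (0, 1):
    for rank, name in enumerate(sorted(scores, key=lambda n: scores[n][stage]), 1):
        ranks[name] += rank
print(ranks)
```

With three models and two stages, the ranks 1–3 are handed out twice, so the cumulative ranks always total 12; the paper's version simply extends this to five models and more scoring metrics.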


In the current world scenario, the existence of humans is impossible without the necessary proliferation of plants. Plant health depends on water and soil nutrients, which help plants produce energy. Applying the appropriate recommended fertilizer quantity is necessary for a healthy plant. However, due to overexposure, soil sometimes becomes degraded, so fertilizer is an important element for retaining soil quality. Nowadays, decision support systems play a vital role in making such recommendations; these recommendation systems are based on historical data. In this respect, soil analysis is an appropriate approach to determine soil quality. Soil analysis generates reports of unstructured, hard-to-interpret data by testing soil in laboratories, which makes it agricultural big data. Systems of this type have generally been implemented in the banking and health care sectors for fraud detection and patient recommendations, respectively. In this paper, we propose a fertilizer recommendation system based on the current nutrient quantity in the soil. In this system, the useful data is extracted from soil analysis reports and saved into two files: 1) the first file stores the soil nutrient composition and a solution number that acts as the label; 2) the second file stores the solution number and the recommended fertilizer quantity. The soil composition is encoded into vectors used by the classification system for training. In this research work, the SGD big data machine learning technique is applied to identify fertilizer recommendation classes based on the current soil nutrient composition. Here, an SGD classification system is used to train the system. Our proposed system obtained 64.08% total average accuracy. The proposed model can also be used by agriculture experts to recommend fertilizer quantities according to crop type and current nutrient composition.
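The two-file design described above can be sketched as a classify-then-lookup step: one table maps soil composition to a solution number, the other maps that number to a fertilizer quantity. All nutrient values, solution numbers, and quantities below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# File 1 (illustrative): soil nutrient composition (N, P, K, pH) -> solution number
X = np.array([[240, 11, 110, 6.5], [180, 25, 300, 7.8],
              [300, 8, 90, 5.9], [150, 30, 280, 8.1]] * 25, dtype=float)
labels = np.array([1, 2, 1, 2] * 25)

# File 2 (illustrative): solution number -> recommended fertilizer quantity (kg/acre)
recommendation = {1: {"urea": 50, "DAP": 25}, 2: {"urea": 20, "MOP": 15}}

col_max = X.max(axis=0)                     # simple per-column feature scaling
clf = SGDClassifier(max_iter=1000, random_state=0)
clf.fit(X / col_max, labels)

sample = np.array([[250.0, 10.0, 100.0, 6.3]])   # a new soil report's vector
solution = int(clf.predict(sample / col_max)[0])
print(solution, recommendation[solution])
```

Separating the classifier's label space (solution numbers) from the fertilizer quantities keeps the recommendation table editable by agronomists without retraining the model.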


2021 ◽  
Author(s):  
Arfan Ahmed ◽  
Sarah Aziz ◽  
Marco Angus ◽  
Mahmood Alzubaidi ◽  
Alaa Abd-Alrazaq ◽  
...  

BACKGROUND: Big Data offers promise in the field of mental health and plays an important part in the automation, analysis, and prevention of mental health disorders. OBJECTIVE: The purpose of this scoping review is to explore how Big Data has been exploited in mental health. This review specifically addresses the volume, velocity, veracity, and variety of the collected data, as well as how the data were attained, stored, managed, and kept private and secure. METHODS: Six databases were searched to find relevant articles. The PRISMA Extension for Scoping Reviews (PRISMA-ScR) was used as the guideline methodology to develop a comprehensive scoping review. RESULTS: General and Big Data features were extracted from the studies reviewed. Various technologies were noted in the use of Big Data in mental health, with depression and anxiety being the focus of most of the studies. These included Machine Learning (ML) models in 22 studies, of which Random Forest (RF) was the most widely used; Logistic Regression (LR) was used in 4 studies, and Support Vector Machine (SVM) in 3 studies. CONCLUSIONS: A great effort is still needed in order to utilize Big Data as a way to mitigate mental health disorders and prevent their appearance altogether. Through the integration and analysis of Big Data, doctors and researchers alike can find patterns in otherwise difficult-to-identify data by making use of AI and machine learning techniques; similarly, these techniques can be used to automate the analytical process.


2020 ◽  
Vol 29 (03n04) ◽  
pp. 2060011
Author(s):  
Emna Hachicha Belghith ◽  
François Rioult ◽  
Medjber Bouzidi

During the last years, big data has become the new emerging trend, increasingly attracting the attention of the R&D community in several fields (e.g., image processing, database engineering, data mining, artificial intelligence). Marine data is one of the fields accommodating this growth, hence the appearance of the marine big data paradigm, which advocates monitoring and assessing the human impact on the marine environment. Nonetheless, support for acoustic sound classification is missing in such an environment, taking into account the diversity of the data (i.e., sounds of living undersea species, sounds of human activities, and sounds of environmental effects). To overcome this issue, we propose in this paper an approach that efficiently allows acoustic diversity classification using machine learning techniques. The aim is to reach automated support for marine big data analysis. We have conducted a set of experiments using a real marine dataset in order to validate our approach and show its effectiveness and efficiency. To do so, three machine learning techniques are employed: (i) classic machine learning models (i.e., k-nearest neighbor and support vector machine), (ii) deep learning based on convolutional neural networks, and (iii) transfer learning based on the reuse of pretrained models.
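The classic-ML arm of such an approach can be sketched end to end: extract a spectral feature from each audio frame with an FFT and classify it with k-nearest neighbors. The signals and sound classes below are synthetic toys, not the paper's marine recordings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
sr, n = 4000, 1024                      # sample rate (Hz) and frame length

def frame(freq):
    """One noisy sine frame at the given frequency (a toy 'species call')."""
    t = np.arange(n) / sr
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=n)

# Two invented sound classes centred at different frequencies
signals = [frame(200 + rng.normal(scale=5)) for _ in range(50)] + \
          [frame(800 + rng.normal(scale=5)) for _ in range(50)]
y = np.array([0] * 50 + [1] * 50)

# Feature vector: log-magnitude spectrum of each frame
X = np.log1p(np.abs(np.fft.rfft(np.array(signals), axis=1)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=3, stratify=y)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
acc = knn.score(X_te, y_te)
print(f"accuracy: {acc:.2f}")
```

Real marine audio would use richer features (e.g., mel-scale spectrograms) and far more classes, but the frame-to-feature-to-classifier shape is the same.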


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Jiake Fu ◽  
Huijing Tian ◽  
Lingguang Song ◽  
Mingchao Li ◽  
Shuo Bai ◽  
...  

Purpose: This paper presents a new approach to productivity estimation of cutter suction dredger operation through data mining and learning from real-time big data. Design/methodology/approach: The paper used big data, data mining, and machine learning techniques to extract features of cutter suction dredgers (CSD) for predicting productivity. The ElasticNet-SVR (Elastic Net-Support Vector Machine) method was used to filter the original monitoring data. Along with the actual working conditions of the CSD, 15 features were selected. Then, a box plot was used to clean the corresponding data by filtering out outliers. Finally, four algorithms, namely SVR (Support Vector Regression), XGBoost (Extreme Gradient Boosting), LSTM (Long Short-Term Memory Network), and BP (Back Propagation) Neural Network, were used for modeling and testing. Findings: The paper provides a comprehensive forecasting framework for productivity estimation, including feature selection, data processing, and model evaluation. The optimal coefficients of determination (R2) of the four algorithms were all above 80.0%, indicating that the selected features were representative. Finally, the BP neural network model coupled with the SVR model was selected as the final model. Originality/value: A machine learning algorithm incorporating domain expert judgments was used to select predictive features. The final optimal coefficient of determination (R2) of the coupled BP neural network and SVR model is 87.6%, indicating that the method proposed in this paper is effective for CSD productivity estimation.
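The box-plot cleaning step mentioned above amounts to dropping rows outside the 1.5×IQR whiskers; a minimal sketch on one synthetic monitoring channel (the sensor values and outliers are invented):

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic stand-in for one CSD monitoring channel, e.g. slurry flow rate
flow = rng.normal(loc=100.0, scale=5.0, size=500)
flow[:5] = [300, -50, 250, 280, -10]    # injected sensor outliers

# Box-plot whiskers: Q1 - 1.5*IQR and Q3 + 1.5*IQR
q1, q3 = np.percentile(flow, [25, 75])
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

clean = flow[(flow >= lo) & (flow <= hi)]
print(f"kept {clean.size} of {flow.size} samples")
```

In the paper's pipeline this filter would be applied per feature across the 15 selected monitoring variables before any model sees the data.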


Prediction of diseases is one of the challenging tasks in the healthcare domain. Conventionally, heart diseases were diagnosed by experienced medical professionals and cardiologists with the help of medical and clinical tests. Even with conventional methods, experienced medical professionals struggled to predict the disease with sufficient accuracy. In addition, manually analysing and extracting useful knowledge from archived disease data is time consuming as well as infeasible. The advent of machine learning techniques enables the prediction of various diseases in the healthcare domain. Machine learning algorithms are trained to learn from existing historical data, and prediction models are created to predict unknown raw data. For the past two decades, machine learning techniques have been extensively employed for disease prediction. Machine learning algorithms can learn from huge historical data stored in data marts and data warehouses using traditional database technologies such as Oracle OnLine Analytical Processing (OLAP). However, conventional database technologies suffer from the limitation that they cannot handle huge data, unstructured data, or data that arrives at speed. In this context, big data tools and technologies play a major role in storing and facilitating the processing of huge data. In this paper, an approach is proposed for the prediction of heart disease using the Support Vector Machine algorithm in a Spark environment. The Support Vector Machine algorithm is basically a binary classifier that classifies both linear and non-linear input data; it transforms non-linear data into a hyperplane with the help of different kernel functions. Spark is a distributed big data processing platform with the unique feature of keeping and processing huge data in memory. The proposed approach is tested with a benchmark dataset from the UCI repository and the results are discussed.
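Setting the Spark deployment aside, the core classifier can be sketched on a single machine with an RBF-kernel SVM via scikit-learn; the features below are invented stand-ins for the UCI heart-disease attributes, and the non-linear label rule is synthetic:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
# Synthetic stand-ins for clinical attributes (e.g. age, blood pressure, cholesterol)
X = rng.normal(size=(400, 6))
# A non-linear decision rule: exactly the case where an RBF kernel helps,
# since no single hyperplane in the raw feature space separates the classes
y = ((X[:, 0] ** 2 + X[:, 1] ** 2) < 2.0).astype(int)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=5)
print(f"5-fold accuracy: {scores.mean():.2f}")
```

The kernel trick is what the abstract's "transforms non-linear data into a hyperplane" refers to: the RBF kernel implicitly maps the inputs into a space where a separating hyperplane exists.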


Author(s):  
K Sumanth Reddy ◽  
Gaddam Pranith ◽  
Karre Varun ◽  
Thipparthy Surya Sai Teja

The compressive strength of concrete plays an important role in determining the durability and performance of concrete. Due to rapid growth in material engineering, finalizing an appropriate mix proportion to obtain the desired compressive strength of concrete has become a cumbersome and laborious task; moreover, the problem becomes more complex when seeking a rational relation between the concrete materials used and the strength obtained. Developments in computational methods can be used to obtain such a rational relation using machine learning techniques, which reduce the influence of outliers and other unwanted variables on the determination of compressive strength. In this paper, basic machine learning techniques, namely Multilayer Perceptron neural network (MLP), Support Vector Machines (SVM), Linear Regression (LR), and Classification and Regression Tree (CART), have been used to develop a model for determining the compressive strength for two different sets of data (ingredients). Among all the techniques used, SVM provides better results in comparison to the others; however, SVM cannot be considered a universal model, because much recent literature has shown that such models need more data and that the dynamics of the attributes involved play an important role in determining the efficacy of the model.
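A comparison of the four model families named above can be sketched on a synthetic mix dataset; the ingredient ranges and the strength formula are invented for illustration, not taken from the paper's data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
# Illustrative mix ingredients: cement, water, coarse agg., fine agg., age (days)
X = rng.uniform([200, 140, 800, 600, 3], [500, 220, 1100, 900, 90], (400, 5))
# Invented strength relation: more cement helps, more water hurts, age gains taper
y = 0.1 * X[:, 0] - 0.15 * X[:, 1] + 5 * np.log(X[:, 4]) \
    + rng.normal(scale=2.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=6)
models = {
    "LR": LinearRegression(),
    "SVM": SVR(C=100.0),
    "CART": DecisionTreeRegressor(random_state=6),
    "MLP": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=6),
}
# Scale features for every model so SVM and MLP are not handicapped
r2 = {name: make_pipeline(StandardScaler(), m).fit(X_tr, y_tr).score(X_te, y_te)
      for name, m in models.items()}
for name, score in r2.items():
    print(f"{name}: R2 = {score:.3f}")
```

On real mix data the ranking between these families shifts with dataset size and attribute dynamics, which is exactly the caveat the paragraph above raises against treating any one of them as universal.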


2020 ◽  
Vol 12 (11) ◽  
pp. 187 ◽  
Author(s):  
Amgad Muneer ◽  
Suliman Mohamed Fati

The advent of social media, particularly Twitter, raises many issues due to a misunderstanding regarding the concept of freedom of speech. One of these issues is cyberbullying, which is a critical global issue that affects both individual victims and societies. Many attempts have been introduced in the literature to intervene in, prevent, or mitigate cyberbullying; however, because these attempts rely on the victims’ interactions, they are not practical. Therefore, detection of cyberbullying without the involvement of the victims is necessary. In this study, we attempted to explore this issue by compiling a global dataset of 37,373 unique tweets from Twitter. Moreover, seven machine learning classifiers were used, namely, Logistic Regression (LR), Light Gradient Boosting Machine (LGBM), Stochastic Gradient Descent (SGD), Random Forest (RF), AdaBoost (ADB), Naive Bayes (NB), and Support Vector Machine (SVM). Each of these algorithms was evaluated using accuracy, precision, recall, and F1 score as the performance metrics to determine the classifiers’ recognition rates applied to the global dataset. The experimental results show the superiority of LR, which achieved a median accuracy of around 90.57%. Among the classifiers, logistic regression achieved the best F1 score (0.928), SGD achieved the best precision (0.968), and SVM achieved the best recall (1.00).
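The best-performing setup reported above (logistic regression on tweet text) can be sketched as a minimal pipeline; the tweets and bullying annotations below are invented, not samples from the 37,373-tweet dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented examples with illustrative labels (1 = cyberbullying, 0 = benign)
tweets = [
    "you are a great friend, congrats on the win",
    "nobody likes you, just leave this site",
    "lovely photo, thanks for sharing",
    "you are so stupid and worthless",
] * 10
labels = [0, 1, 0, 1] * 10

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(tweets, labels)
print(model.predict(["you are worthless, leave"]))
```

Because detection runs on the tweet text alone, this kind of classifier needs no interaction from the victim, which is the practicality argument made above.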

