Performance Analysis of Statistical and Supervised Learning Techniques in Stock Data Mining

Manik Sharma; Samriti Sharma; Gurvinder Singh

doi:10.3390/data3040054

Performance Analysis of Statistical and Supervised Learning Techniques in Stock Data Mining

Data ◽

10.3390/data3040054 ◽

2018 ◽

Vol 3 (4) ◽

pp. 54 ◽

Cited By ~ 7

Author(s):

Manik Sharma ◽

Samriti Sharma ◽

Gurvinder Singh

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

Misclassification Rate ◽

Support Vector ◽

P Value ◽

Linear Discriminant ◽

Statistical Measures ◽

Specificity And Sensitivity ◽

Learning Techniques ◽

Topsis Technique

Nowadays, an overwhelming amount of stock data is available, but it is only of use if it is properly examined and mined. In this paper, the last twelve years of ICICI Bank’s stock data have been extensively examined using statistical and supervised learning techniques. This study may be of great interest to those who wish to mine or study the stock data of banks or any financial organization. Different statistical measures have been computed to explore the nature, range, distribution, and deviation of data. The descriptive statistical measures assist in finding valuable metrics such as mean, variance, skewness, kurtosis, p-value, a-squared, and the 95% confidence interval of the mean of ICICI Bank’s stock data. Moreover, daily percentage changes occurring over the last 12 years have also been recorded and examined. Additionally, the intraday stock status has been mined using ten different classifiers. The performance of the classifiers has been evaluated on the basis of various parameters such as accuracy, misclassification rate, precision, recall, specificity, and sensitivity. Based upon these parameters, the predictive results obtained using logistic regression are more acceptable than the outcomes of the other classifiers, whereas naïve Bayes, C4.5, random forest, linear discriminant, and cubic support vector machine (SVM) merely act as random guessing machines. The outstanding performance of logistic regression has been validated using TOPSIS (technique for order preference by similarity to ideal solution) and WSA (weighted sum approach).
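
The TOPSIS and WSA validation step named in this abstract can be illustrated with a minimal sketch. The classifier names, scores, and equal weights below are hypothetical placeholders rather than values from the paper, and every criterion is assumed to be a benefit criterion (higher is better).

```python
import math

def topsis(scores, weights):
    """Return the TOPSIS closeness coefficient (0..1) for each alternative.

    All criteria are treated as benefit criteria (higher is better).
    """
    names = list(scores)
    n_crit = len(weights)
    # Vector-normalise each criterion column, then apply its weight.
    norms = [math.sqrt(sum(scores[n][j] ** 2 for n in names)) for j in range(n_crit)]
    weighted = {
        n: [weights[j] * scores[n][j] / norms[j] for j in range(n_crit)]
        for n in names
    }
    # Ideal and anti-ideal points take the best and worst value per criterion.
    ideal = [max(weighted[n][j] for n in names) for j in range(n_crit)]
    anti = [min(weighted[n][j] for n in names) for j in range(n_crit)]
    closeness = {}
    for n in names:
        d_best = math.dist(weighted[n], ideal)
        d_worst = math.dist(weighted[n], anti)
        total = d_best + d_worst
        closeness[n] = d_worst / total if total else 0.0
    return closeness

def weighted_sum(scores, weights):
    """Weighted sum approach: weighted total of the raw criterion values."""
    return {n: sum(w * v for w, v in zip(weights, vals)) for n, vals in scores.items()}

# Hypothetical scores: [accuracy, precision, recall] for three classifiers.
example_scores = {
    "classifier_a": [0.70, 0.72, 0.68],
    "classifier_b": [0.50, 0.49, 0.52],
    "classifier_c": [0.60, 0.55, 0.63],
}
example_weights = [1 / 3, 1 / 3, 1 / 3]

closeness = topsis(example_scores, example_weights)
wsa = weighted_sum(example_scores, example_weights)
best_topsis = max(closeness, key=closeness.get)
best_wsa = max(wsa, key=wsa.get)
```

In this toy data one alternative dominates on every criterion, so both rankings agree; with real scores the two methods can rank alternatives differently, which is one reason to report both.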

Improved semi-supervised learning technique for automatic detection of South African abusive language on Twitter

South African Computer Journal ◽

10.18489/sacj.v32i2.847 ◽

2020 ◽

Vol 32 (2) ◽

Author(s):

Oluwafemi Oriola ◽

Eduan Kotzé

Keyword(s):

Logistic Regression ◽

Supervised Learning ◽

South African ◽

Learning Curves ◽

Training Data ◽

Support Vector ◽

Learning Techniques ◽

Learning Technique ◽

Unlabelled Data ◽

Language Detection

Semi-supervised learning is a potential solution for improving training data in low-resourced abusive language detection contexts such as South African abusive language detection on Twitter. However, the existing semi-supervised learning methods have been skewed towards small amounts of labelled data with a small feature space. This paper, therefore, presents a semi-supervised learning technique that improves the distribution of training data by assigning labels to unlabelled data based on majority voting over different feature sets of labelled and unlabelled data clusters. The technique is applied to South African English corpora consisting of labelled and unlabelled abusive tweets. The proposed technique is compared with state-of-the-art self-learning and active learning techniques based on syntactic and semantic features. The performance of these techniques with Logistic Regression, Support Vector Machine and Neural Networks is evaluated. The proposed technique, with accuracy and F1-score of 0.97 and 0.95, respectively, outperforms existing semi-supervised learning techniques. The learning curves show that the training data was used more efficiently by the proposed technique compared to existing techniques. Overall, n-gram syntactic features with a Logistic Regression classifier record the highest performance. The paper concludes that the proposed semi-supervised learning technique effectively detected implicit and explicit South African abusive language on Twitter.
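
The idea of labelling unlabelled data by majority voting over feature sets can be sketched roughly as follows. This is not the authors' algorithm: it assumes each feature set has already proposed a label for every unlabelled tweet (for example from the cluster the tweet falls into), and the tweet identifiers and labels are hypothetical.

```python
from collections import Counter

def majority_vote(proposals):
    """Return the label proposed by a strict majority of feature sets, else None."""
    label, count = Counter(proposals).most_common(1)[0]
    return label if count > len(proposals) / 2 else None

def pseudo_label(votes_per_item):
    """Map each unlabelled item to its majority label, skipping items without one."""
    labelled = {}
    for item_id, proposals in votes_per_item.items():
        label = majority_vote(proposals)
        if label is not None:
            labelled[item_id] = label
    return labelled

# Hypothetical proposals: one label per feature set for each unlabelled tweet.
votes = {
    "tweet_1": ["abusive", "abusive", "not_abusive"],
    "tweet_2": ["not_abusive", "not_abusive", "not_abusive"],
    "tweet_3": ["abusive", "not_abusive"],  # tie, so it stays unlabelled
}
new_labels = pseudo_label(votes)
```

Requiring a strict majority is a design choice made for this sketch; it leaves ambiguous items unlabelled instead of adding noisy labels to the training data.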

A Classification Approach for Predicting COVID-19 Patient Survival Outcome with Machine Learning Techniques

10.1101/2020.08.02.20129767 ◽

2020 ◽

Author(s):

Abdulhameed Ado Osi ◽

Hussaini Garba Dikko ◽

Mannir Abdu ◽

Auwalu Ibrahim ◽

Lawan Adamu Isma'il ◽

...

Keyword(s):

Machine Learning ◽

Survival Outcome ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Support Vector ◽

P Value ◽

Kappa Index ◽

Quality Health Care ◽

Linear Discriminant ◽

Learning Techniques

COVID-19 is an infectious disease discovered after the outbreak began in Wuhan, China, in December 2019. COVID-19 remains a growing global threat to public health. The virus has spread to many countries across the globe. This paper analyzed and compared the performance of three different supervised machine learning techniques, Linear Discriminant Analysis (LDA), Random Forest (RF), and Support Vector Machine (SVM), on a COVID-19 dataset. The best of these three algorithms was determined by comparing several metrics for assessing predictive performance, such as accuracy, sensitivity, specificity, F-score, Kappa index, and ROC. From the analysis results, RF was found to be the best algorithm with 100% prediction accuracy, compared with 95.2% and 90.9% for LDA and SVM respectively. Our analysis shows that, of these three classification models, RF predicts a COVID-19 patient's survival outcome with the highest accuracy. A chi-square test reveals that all seven features except sex were significantly correlated with the COVID-19 patient's outcome (P-value < 0.005). Therefore, RF is recommended for COVID-19 patient outcome prediction, which will help in early identification of possible sensitive cases for quick provision of quality health care, support and supervision.
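
Several of the metrics listed in this abstract (accuracy, sensitivity, specificity, F-score, and the Kappa index) follow directly from a binary confusion matrix. The sketch below shows the standard definitions; the counts are made up for illustration and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_chance = p_yes + p_no
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_score": f_score,
        "kappa": kappa,
    }

# Hypothetical counts for 100 patients.
metrics = binary_metrics(tp=40, fp=5, fn=10, tn=45)
```

Kappa is useful alongside accuracy because it discounts the agreement a classifier would reach by chance, which matters when one outcome is much more common than the other.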

Study of Machine Learning Techniques for EEG Eye State Detection

Proceedings ◽

10.3390/proceedings2020054053 ◽

2020 ◽

Vol 54 (1) ◽

pp. 53

Author(s):

Francisco Laport ◽

Paula M. Castro ◽

Adriana Dapena ◽

Francisco J. Vazquez-Araujo ◽

Daniel Iglesia

Keyword(s):

Machine Learning ◽

Mental States ◽

Machine Learning Techniques ◽

Support Vector ◽

Discrete Wavelet ◽

Linear Discriminant ◽

State Identification ◽

Brain Signals ◽

State Classification ◽

Learning Techniques

A comparison of different machine learning techniques for eye state identification through Electroencephalography (EEG) signals is presented in this paper. (1) Background: We extend our previous work by studying several techniques for the extraction of the features corresponding to the mental states of open and closed eyes and their subsequent classification; (2) Methods: A prototype developed by the authors is used to capture the brain signals. We consider the Discrete Fourier Transform (DFT) and the Discrete Wavelet Transform (DWT) for feature extraction; Linear Discriminant Analysis (LDA) and Support Vector Machine (SVM) for state classification; and Independent Component Analysis (ICA) for preprocessing the data; (3) Results: The results obtained from some subjects show the good performance of the proposed methods; and (4) Conclusion: The combination of several techniques allows us to obtain a high accuracy of eye identification.
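
As a rough illustration of DFT-based feature extraction, the sketch below computes the power of a signal within a frequency band using a direct DFT. The sampling rate, the band limits, and the synthetic 10 Hz test signal are assumptions made for this example and do not come from the paper.

```python
import cmath
import math

def dft_power(signal):
    """Power of each non-negative frequency bin, computed with a direct DFT."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2 / n)
    return power

def band_power(signal, fs, low_hz, high_hz):
    """Sum the DFT power of all bins whose frequency lies in [low_hz, high_hz]."""
    n = len(signal)
    return sum(p for k, p in enumerate(dft_power(signal)) if low_hz <= k * fs / n <= high_hz)

# Synthetic one-second signal: a 10 Hz sine sampled at an assumed 128 Hz.
fs = 128
signal = [math.sin(2 * math.pi * 10 * t / fs) for t in range(fs)]

# Illustrative band limits in Hz (assumed, not taken from the paper).
low_band = band_power(signal, fs, 8, 13)
high_band = band_power(signal, fs, 14, 30)
features = [low_band, high_band]
```

Band powers like these could then be passed as a feature vector to a classifier such as LDA or an SVM. A production pipeline would normally use an FFT routine instead of this direct O(n²) sum.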

Identifying Botnet on IoT by Using Supervised Learning Techniques

Oriental journal of computer science and technology ◽

10.13005/ojcst12.04.04 ◽

2019 ◽

Vol 12 (4) ◽

pp. 185-193

Author(s):

Amirhossein Rezaei

Keyword(s):

Supervised Learning ◽

Gradient Boosting ◽

Support Vector ◽

K Nearest Neighbors ◽

Malicious Software ◽

Learning Techniques ◽

Security Challenges ◽

Learning Technique ◽

Security Challenge ◽

The Moment

Security on the IoT (Internet of Things) is one of the hottest and most pertinent topics at the moment, and it involves several security challenges. The Botnet is one of the most impactful of these challenges and is used for several purposes. A Botnet is a network of private computers infected by malicious software and controlled as a group without the knowledge of their owners, each of them running one or more bots. Normally, it is used for sending spam, stealing data, and performing DDoS attacks. One of the techniques that has been used for detecting Botnets is Supervised Learning. This study will examine several Supervised Learning methods, namely Linear Regression, Logistic Regression, Decision Tree, Naive Bayes, k-Nearest Neighbors, Random Forest, Gradient Boosting Machines, and Support Vector Machine, for identifying Botnets in IoT, with the aim of finding which Supervised Learning technique can achieve the highest accuracy and fastest detection while also minimizing the dependent variable.

Towards a Comprehensive Assessment of Statistical Versus Soft Computing Models in Hydrology: Application to Monthly Pan Evaporation Prediction

Water ◽

10.3390/w13172451 ◽

2021 ◽

Vol 13 (17) ◽

pp. 2451

Author(s):

Mohammad Zounemat-Kermani ◽

Behrooz Keshtegar ◽

Ozgur Kisi ◽

Miklas Scholz

Keyword(s):

Neural Network ◽

Soft Computing ◽

Input Data ◽

Computational Models ◽

The Other ◽

Support Vector ◽

P Value ◽

Pan Evaporation ◽

Statistical Measures ◽

Computing Models

This paper evaluates six soft computational models along with three statistical data-driven models for the prediction of pan evaporation (EP). Accordingly, improved kriging, as a novel statistical model, is proposed for accurate predictions of EP for two meteorological stations in Turkey. In the standard kriging model, the input data nonlinearity effects are increased by using a nonlinear map and transferring input data from a polynomial to an exponential basis function. The accuracy, precision, and over/under prediction tendencies of the response surface method, kriging, improved kriging, multilayer perceptron neural network using the Levenberg–Marquardt (MLP-LM) as well as a conjugate gradient (MLP-CG), radial basis function neural network (RBFNN), multivariate adaptive regression spline (MARS), M5Tree and support vector regression (SVR) were compared. Overall, all the applied models were highly capable of predicting monthly EP in both stations with a mean absolute error (MAE) < 0.77 mm and a Willmott index (d) > 0.95. Considering periodicity as an input parameter, the MLP-LM provided better results than the other methods among the soft computing models (MAE = 0.492 mm and d = 0.981). However, the improved kriging method surpassed all the other models based on the statistical measures (MAE = 0.471 mm and d = 0.983). Finally, the outcomes of the Mann–Whitney test indicated that the applied soft computational models do not have significant superiority over the statistical ones (p-value > 0.65 at α = 0.01 and α = 0.05).
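
The two error measures quoted in this abstract, mean absolute error and the Willmott index of agreement, can be computed as in the following sketch. It uses the standard form of the index, d = 1 - Σ(P - O)² / Σ(|P - Ō| + |O - Ō|)²; the observed and predicted values are invented for illustration.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between predictions and observations."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

def willmott_index(observed, predicted):
    """Willmott index of agreement d, where 1 indicates a perfect match."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((p - o) ** 2 for o, p in zip(observed, predicted))
    potential_error = sum(
        (abs(p - mean_obs) + abs(o - mean_obs)) ** 2
        for o, p in zip(observed, predicted)
    )
    return 1 - squared_error / potential_error

# Hypothetical monthly pan evaporation values in mm.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]

mae = mean_absolute_error(observed, predicted)
d = willmott_index(observed, predicted)
```

MAE is expressed in the units of the data, while d is dimensionless, so reporting both gives an absolute and a relative view of model error.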

Machine learning versus logistic regression methods for 2-year mortality prognostication in a small, heterogeneous glioma database

10.1101/472555 ◽

2018 ◽

Cited By ~ 2

Author(s):

Sandip S Panesar ◽

Rhett N D’Souza ◽

Fang-Cheng Yeh ◽

Juan C Fernandez-Miranda

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Machine Learning Techniques ◽

World Health ◽

Support Vector ◽

Molecular Characteristics ◽

Regression Methods ◽

Learning Techniques ◽

The World ◽

Health Organization

Background: Machine learning (ML) is the application of specialized algorithms to datasets for trend delineation, categorization or prediction. ML techniques have been traditionally applied to large, highly-dimensional databases. Gliomas are a heterogeneous group of primary brain tumors, traditionally graded using histopathological features. Recently the World Health Organization proposed a novel grading system for gliomas incorporating molecular characteristics. We aimed to study whether ML could achieve accurate prognostication of 2-year mortality in a small, highly-dimensional database of glioma patients. Methods: We applied three machine learning techniques: artificial neural networks (ANN), decision trees (DT), support vector machine (SVM), and classical logistic regression (LR) to a dataset consisting of 76 glioma patients of all grades. We compared the effect of applying the algorithms to the raw database, versus a database where only statistically significant features were included into the algorithmic inputs (feature selection). Results: Raw input consisted of 21 variables, and achieved performance of (accuracy/AUC): 70.7%/0.70 for ANN, 68%/0.72 for SVM, 66.7%/0.64 for LR and 65%/0.70 for DT. Feature selected input consisted of 14 variables and achieved performance of 73.4%/0.75 for ANN, 73.3%/0.74 for SVM, 69.3%/0.73 for LR and 65.2%/0.63 for DT. Conclusions: We demonstrate that these techniques can also be applied to small, yet highly-dimensional datasets. Our ML techniques achieved reasonable performance compared to similar studies in the literature. Though local databases may be small versus larger cancer repositories, we demonstrate that ML techniques can still be applied to their analysis, though traditional statistical methods are of similar benefit.
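
The accuracy/AUC pairs reported in this abstract can be understood through a small sketch. AUC is computed here as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which is equivalent to the area under the ROC curve. The labels, scores, and the 0.5 decision threshold are hypothetical.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at the given score threshold."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical outcomes (1 = died within 2 years) and model scores.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
```

Accuracy depends on the chosen threshold while AUC does not, which is why two models can swap places between the two measures, as some of the figures in the abstract do.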

Evaluation of Machine Learning Algorithms for Classification of Primary Biological Aerosol using a new UV-LIF spectrometer

10.5194/amt-2016-214 ◽

2016 ◽

Cited By ~ 1

Author(s):

Simon Ruske ◽

David O. Topping ◽

Virginia E. Foot ◽

Paul H. Kaye ◽

Warren R. Stanley ◽

...

Keyword(s):

Supervised Learning ◽

Fungal Spores ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Data Sets ◽

Agglomerative Clustering ◽

Real World Data ◽

Linear Discriminant ◽

Accuracy Of Measurements

Characterisation of bio-aerosols has important implications within Environment and Public Health sectors. Recent developments in Ultra-Violet Light Induced Fluorescence (UV-LIF) detectors such as the Wideband Integrated bio-aerosol Spectrometer (WIBS) and the newly introduced Multiparameter bio-aerosol Spectrometer (MBS) have allowed for the real time collection of fluorescence, size and morphology measurements for the purpose of discriminating between bacteria, fungal spores and pollen. This new generation of instruments has enabled ever larger data sets to be compiled with the aim of studying more complex environments. In real world data sets, particularly those from an urban environment, the population may be dominated by non-biological fluorescent interferents, bringing into question the accuracy of measurements of quantities such as concentrations. It is therefore imperative that we validate the performance of different algorithms which can be used for the task of classification. For unsupervised learning we test Hierarchical Agglomerative Clustering with various different linkages. For supervised learning, ten methods were tested, including decision trees; ensemble methods: Random Forests, Gradient Boosting and AdaBoost; two implementations of support vector machines: libsvm and liblinear; Gaussian methods: Gaussian naïve Bayesian, quadratic and linear discriminant analysis; and finally the k-nearest neighbours algorithm. The methods were applied to two different data sets measured using a new Multiparameter bio-aerosol Spectrometer which provides multichannel UV-LIF fluorescence signatures for single airborne biological particles. Clustering in general performs slightly worse than the supervised learning methods, correctly classifying, at best, only 72.7 and 91.1 percent for the two data sets respectively. For supervised learning the gradient boosting algorithm was found to be the most effective, on average correctly classifying 88.1 and 97.8 percent of the testing data respectively across the two data sets.

Perbandingan Algoritma Klasifikasi Sentimen Twitter Terhadap Insiden Kebocoran Data Tokopedia (Comparison of Twitter Sentiment Classification Algorithms for the Tokopedia Data Leak Incident)

JISKA (Jurnal Informatika Sunan Kalijaga) ◽

10.14421/jiska.2021.6.2.120-129 ◽

2021 ◽

Vol 6 (2) ◽

pp. 120-129

Author(s):

Nadhif Ikbar Wibowo ◽

Tri Andika Maulana ◽

Hamzah Muhammad ◽

Nur Aini Rakhmawati

Keyword(s):

Support Vector Machine ◽

Logistic Regression ◽

Random Forest ◽

Supervised Learning ◽

Support Vector ◽

Data Set ◽

Logistic Regression Classifier

Public responses posted on Twitter in reaction to the Tokopedia data leak incident were used as a data set to compare the performance of three different classifiers, trained using supervised learning, in classifying the sentiment of the text. All tweets were classified into positive, negative, or neutral classes. This study compares the performance of Random Forest, Support-Vector Machine, and Logistic Regression classifiers. Data was scraped automatically and used to evaluate several models; the SVM-based model has the highest F1-score, 0.503583. SVM is the best performing classifier.
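
For a three-class sentiment task like this one, an F1-score is often computed per class and then averaged. The abstract does not say which averaging the authors used, so the macro average in this sketch is an assumption, and the labels are invented.

```python
def f1_for_class(y_true, y_pred, label):
    """F1-score for one class, treating that class as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical true and predicted sentiment labels for six tweets.
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]

score = macro_f1(y_true, y_pred)
```

Macro averaging weights each class equally, so a classifier that ignores a rare class is penalised more than it would be under plain accuracy.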

PLS Generalized Linear Regression and Kernel Multilogit Algorithm (KMA) for Microarray Data Classification Problem

Revista Colombiana de Estadística ◽

10.15446/rce.v43n2.81811 ◽

2020 ◽

Vol 43 (2) ◽

pp. 233-249

Author(s):

Adolphus Wagala ◽

Graciela González-Farías ◽

Rogelio Ramos ◽

Oscar Dalmau

Keyword(s):

Logistic Regression ◽

Discriminant Analysis ◽

Linear Regression ◽

Linear Discriminant Analysis ◽

Least Squares ◽

Partial Least Squares ◽

Classification Error ◽

Support Vector ◽

Linear Discriminant ◽

Generalized Linear Regression

This study involves the implementation of extensions of the partial least squares generalized linear regression (PLSGLR) by combining it with logistic regression and linear discriminant analysis, to get a partial least squares generalized linear regression-logistic regression model (PLSGLR-log), and a partial least squares generalized linear regression-linear discriminant analysis model (PLSGLRDA). A comparative study of the obtained classifiers with classical methodologies like the k-nearest neighbours (KNN), linear discriminant analysis (LDA), partial least squares discriminant analysis (PLSDA), ridge partial least squares (RPLS), and support vector machines (SVM) is then carried out. Furthermore, a new methodology known as kernel multilogit algorithm (KMA) is also implemented and its performance compared with those of the other classifiers. The KMA emerged as the best classifier based on the lowest classification error rates compared to the others when applied to the two types of data considered: unpreprocessed and preprocessed.

Classification of all-rounders in limited over cricket - a machine learning approach

Journal of Sports Analytics ◽

10.3233/jsa-200467 ◽

2021 ◽

Vol 6 (4) ◽

pp. 295-306

Author(s):

Ananda B. W. Manage ◽

Ram C. Kafle ◽

Danush K. Wijekularathna

Keyword(s):

Machine Learning ◽

Support Vector Machine ◽

Logistic Regression ◽

Discriminant Function ◽

Linear Discriminant Function ◽

Classification Rule ◽

Support Vector ◽

Classification Methods ◽

Linear Discriminant ◽

Quadratic Discriminant Function

In cricket, all-rounders play an important role. A good all-rounder should be able to contribute to the team by both bat and ball as needed. However, these players still have their dominant role by which we categorize them as batting all-rounders or bowling all-rounders. Current practice is to do so by mostly subjective methods. In this study, the authors have explored different machine learning techniques to classify all-rounders into bowling all-rounders or batting all-rounders based on their observed performance statistics. In particular, logistic regression, linear discriminant function, quadratic discriminant function, naïve Bayes, support vector machine, and random forest classification methods were explored. Evaluation of the performance of the classification methods was done using the metrics accuracy and area under the ROC curve. While all the six methods performed well, logistic regression, linear discriminant function, quadratic discriminant function, and support vector machine showed outstanding performance suggesting that these methods can be used to develop an automated classification rule to classify all-rounders in cricket. Given the rising popularity of cricket, and the increasing revenue generated by the sport, the use of such a prediction tool could be of tremendous benefit to decision-makers in cricket.
