A New Nearest Centroid Neighbor Classifier Based on K Local Means Using Harmonic Mean Distance

Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 234 ◽  
Author(s):  
Sumet Mehta ◽  
Xiangjun Shen ◽  
Jiangping Gou ◽  
Dejiao Niu

The K-nearest neighbour classifier is a very effective and simple non-parametric technique in pattern classification; however, it only considers distance closeness, not the geometrical placement of the k neighbors. Also, its classification performance is highly influenced by the neighborhood size k and existing outliers. In this paper, we propose a new local mean based k-harmonic nearest centroid neighbor (LMKHNCN) classifier in order to consider both distance-based proximity and the spatial distribution of the k neighbors. In our method, the k nearest centroid neighbors in each class are first found and used to compute k different local mean vectors, and the harmonic mean distance from these local means to the query sample is then computed. Lastly, the query sample is assigned to the class with the minimum harmonic mean distance. The experimental results based on twenty-six real-world datasets show that the proposed LMKHNCN classifier achieves lower error rates, particularly in small sample-size situations, and that it is less sensitive to the parameter k when compared to the related four KNN-based classifiers.
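As a rough, non-authoritative sketch of the procedure described in this abstract (the function names and toy data below are hypothetical, and Euclidean distance is assumed), the LMKHNCN decision rule can be outlined in a few lines:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def ncn_neighbors(query, samples, k):
    """Greedily pick k nearest centroid neighbors: each new neighbor is the
    sample whose inclusion keeps the centroid of the chosen set closest to
    the query, so spatial placement matters, not just raw distance."""
    chosen, remaining = [], list(samples)
    for _ in range(min(k, len(remaining))):
        best = min(remaining, key=lambda s: dist(query, centroid(chosen + [s])))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def lmkhncn_predict(query, data_by_class, k):
    """Assign the query to the class whose k local centroid means have the
    smallest harmonic mean distance to it."""
    best_label, best_hmd = None, float("inf")
    for label, samples in data_by_class.items():
        ncn = ncn_neighbors(query, samples, k)
        # k local mean vectors: centroids of the first 1..k centroid neighbors
        local_means = [centroid(ncn[:i + 1]) for i in range(len(ncn))]
        dists = [max(dist(query, m), 1e-12) for m in local_means]
        hmd = len(dists) / sum(1.0 / d for d in dists)  # harmonic mean
        if hmd < best_hmd:
            best_label, best_hmd = label, hmd
    return best_label
```

The nearest centroid neighbor selection grows the neighbor set so that its centroid stays close to the query, which is what gives the method its sensitivity to spatial placement rather than distance alone.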

2021 ◽  
Vol 14 (1) ◽  
pp. 65-71
Author(s):  
Arya Widyadhana ◽  
Cornelius Bagus Purnama Putra ◽  
Rarasmaya Indraswari ◽  
Agus Zainal Arifin

K-nearest neighbor (KNN) is an effective nonparametric classifier that determines the neighbors of a point based only on distance proximity. The classification performance of KNN is disadvantaged by the presence of outliers in small sample size datasets, and it deteriorates on datasets with class imbalance. We propose a local Bonferroni Mean based Fuzzy K-Nearest Centroid Neighbor (BM-FKNCN) classifier that assigns the class label of a query sample based on the nearest local centroid mean vector, so as to better represent the underlying statistics of the dataset. The proposed classifier is robust towards outliers because the Nearest Centroid Neighborhood (NCN) concept also considers the spatial distribution and symmetrical placement of the neighbors. The proposed classifier can also overcome class domination by its neighbors in datasets with class imbalance, because it averages all the centroid vectors from each class to adequately interpret the distribution of the classes. The BM-FKNCN classifier is tested on datasets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and benchmarked against classification results from the KNN, Fuzzy-KNN (FKNN), BM-FKNN and FKNCN classifiers. The experimental results show that BM-FKNCN achieves the highest overall average classification accuracy, 89.86%, compared to the other four classifiers.
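The distinctive aggregation operator here is the Bonferroni mean. As an illustration only (scalar case with p = q = 1 as an assumed choice; the paper applies the operator to centroid vectors), it can be computed as:

```python
def bonferroni_mean(values, p=1, q=1):
    """Bonferroni mean BM^{p,q}: averages the pairwise products of distinct
    elements, then takes the (p+q)-th root. With p = q = 1 it blends each
    value with the mean of the remaining values, softening outliers."""
    n = len(values)
    total = sum(
        (values[i] ** p) * (values[j] ** q)
        for i in range(n) for j in range(n) if i != j
    )
    return (total / (n * (n - 1))) ** (1.0 / (p + q))
```

For identical inputs the operator reduces to the input itself, and for p = q = 1 it interpolates between the arithmetic and geometric behavior of the data, which is the interconnection property that motivates its use for aggregating centroid information.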


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Jing Zhang ◽  
Guang Lu ◽  
Jiaquan Li ◽  
Chuanwen Li

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to the high dimensionality and small sample size of microarray data. Feature selection is necessary in the process of constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough sets is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on intersection neighborhood is then used to select the important features in each subset, since it can select features without redundancy and deal with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select a subset of the base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is then selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.
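The Kappa-based grouping step can be sketched as follows. This is an assumption-laden illustration, not the authors' code: `combination_distance` and its equal weighting `alpha = 0.5` are hypothetical stand-ins for the paper's combined diversity/accuracy measure:

```python
def cohen_kappa(preds_a, preds_b):
    """Chance-corrected agreement between two classifiers' label outputs."""
    n = len(preds_a)
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    labels = set(preds_a) | set(preds_b)
    expected = sum(
        (preds_a.count(l) / n) * (preds_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

def combination_distance(preds_a, preds_b, acc_a, acc_b, alpha=0.5):
    """Hypothetical blend of Kappa-based diversity (low kappa = diverse)
    and accuracy difference, usable as a distance for k-means grouping."""
    diversity = 1.0 - cohen_kappa(preds_a, preds_b)
    return alpha * diversity + (1 - alpha) * abs(acc_a - acc_b)
```

Two classifiers that agree everywhere and have equal accuracy get distance zero, so k-means would place them in the same cluster and only one of them would survive into the ensemble.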


2016 ◽  
Vol 2016 ◽  
pp. 1-10
Author(s):  
Zhicheng Lu ◽  
Zhizheng Liang

Linear discriminant analysis has been widely studied in data mining and pattern recognition. However, when performing the eigen-decomposition on the matrix pair (within-class scatter matrix and between-class scatter matrix), one finds in some cases that there exist degenerated eigenvalues, which makes the information from the eigen-subspace corresponding to a degenerated eigenvalue indistinguishable. In order to address this problem, we revisit linear discriminant analysis in this paper and propose a stable and effective algorithm for linear discriminant analysis in terms of an optimization criterion. By discussing the properties of the optimization criterion, we find that the eigenvectors in some eigen-subspaces may be indistinguishable if a degenerated eigenvalue occurs. Inspired by the idea of the maximum margin criterion (MMC), we embed MMC into the eigen-subspace corresponding to the degenerated eigenvalue to exploit the discriminability of the eigenvectors in that eigen-subspace. Since the proposed algorithm can deal with the degenerated case of eigenvalues, it not only handles the small-sample-size problem but also enables us to select projection vectors from the null space of the between-class scatter matrix. Extensive experiments on several face image and microarray data sets are conducted to evaluate the proposed algorithm in terms of classification performance, and the experimental results show that our method has smaller standard deviations than other methods in most cases.
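To make the MMC idea concrete, a minimal sketch (assuming 1-D projections; `mmc_score` is a hypothetical helper, not the paper's algorithm) evaluates tr(S_b) - tr(S_w) for a candidate direction, the quantity MMC maximizes to rank otherwise indistinguishable directions:

```python
def mmc_score(data_by_class, w):
    """Between-class minus within-class scatter of the 1-D projections x.w:
    the maximum margin criterion favors directions that separate class
    means by more than the spread within each class."""
    proj = {c: [sum(xi * wi for xi, wi in zip(x, w)) for x in xs]
            for c, xs in data_by_class.items()}
    n = sum(len(v) for v in proj.values())
    overall = sum(sum(v) for v in proj.values()) / n
    sb = sum(len(v) * (sum(v) / len(v) - overall) ** 2
             for v in proj.values()) / n
    sw = sum(sum((p - sum(v) / len(v)) ** 2 for p in v)
             for v in proj.values()) / n
    return sb - sw
```

Unlike the Fisher ratio tr(S_b)/tr(S_w), this difference form stays well defined even when the within-class scatter along a direction vanishes, which is why it can discriminate among eigenvectors of a degenerated eigen-subspace.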


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5863 ◽  
Author(s):  
Annica Kristoffersson ◽  
Jiaying Du ◽  
Maria Ehn

Sensor-based fall risk assessment (SFRA) utilizes wearable sensors for monitoring individuals’ motions in fall risk assessment tasks. Previous SFRA reviews recommend methodological improvements to better support the use of SFRA in clinical practice. This systematic review aimed to investigate the existing evidence of SFRA (discriminative capability, classification performance) and the methodological factors (study design, samples, sensor features, and model validation) contributing to the risk of bias. The review was conducted according to recommended guidelines, and 33 of 389 screened records were eligible for inclusion. Evidence of SFRA was identified: several sensor features and three classification models differed significantly between groups with different fall risk (mostly fallers/non-fallers). Moreover, classification performance corresponding to AUCs of at least 0.74 and/or accuracies of at least 84% was obtained from sensor features in six studies and from classification models in seven studies. Specificity was at least as high as sensitivity among studies reporting both values. Insufficient use of prospective design, small sample size, low in-sample inclusion of participants with elevated fall risk, high amounts and low degree of consensus in used features, and limited use of recommended model validation methods were identified in the included studies. Hence, future SFRA research should further reduce the risk of bias by continuously improving methodology.


2021 ◽  
Vol 8 (12) ◽  
pp. 193
Author(s):  
Andrea Bizzego ◽  
Giulio Gabrieli ◽  
Michelle Jin Yee Neoh ◽  
Gianluca Esposito

Deep learning (DL) has greatly contributed to bioelectric signal processing, in particular to the extraction of physiological markers. However, the efficacy and applicability of the results proposed in the literature are often constrained to the population represented by the data used to train the models. In this study, we investigate the issues related to applying a DL model to heterogeneous datasets. In particular, by focusing on heartbeat detection from electrocardiogram (ECG) signals, we show that the performance of a model trained on data from healthy subjects decreases when applied to patients with cardiac conditions and to signals collected with different devices. We then evaluate the use of transfer learning (TL) to adapt the model to the different datasets. In particular, we show that classification performance is improved, even on datasets with a small sample size. These results suggest that a greater effort should be made towards the generalizability of DL models applied to bioelectric signals, in particular by retrieving more representative datasets.


2020 ◽  
Vol 492 (4) ◽  
pp. 5377-5390 ◽  
Author(s):  
Shengda Luo ◽  
Alex P Leung ◽  
C Y Hui ◽  
K L Li

ABSTRACT We have investigated a number of factors that can have significant impacts on the classification performance of gamma-ray sources detected by the Fermi Large Area Telescope (LAT) with machine learning techniques. We show that a framework of automatic feature selection can construct a simple model with a small set of features that yields better performance than previous results. Secondly, because of the small sample size of the training/test sets of certain classes of gamma-ray sources, nested re-sampling and cross-validation are suggested for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and pulsars (PSRs) in the Fermi-LAT 8-yr point source catalogue (4FGL) with the unidentified sources in the previous 3rd Fermi-LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building a classification model with the identified sources can suffer from the problem of covariate shift, which can be a result of various observational effects. This can hamper the actual performance when one applies such a model to classifying unidentified sources. Using our framework, both the AGN/PSR and young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated, with the new features and the enlarged training samples in the 4FGL catalogue incorporated. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores >98 per cent from the unidentified sources in the 4FGL catalogue, which can provide inputs for a multiwavelength identification campaign.
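The suggested re-sampling can be approximated with repeated random hold-out splits. The sketch below is illustrative only (hypothetical helper names, plain accuracy, no nested hyperparameter tuning); it shows how resampling yields a spread, not just a point estimate, for the quoted accuracy:

```python
import random
from statistics import mean, stdev

def repeated_holdout(X, y, fit, predict, n_repeats=20, test_frac=0.3, seed=0):
    """Repeatedly re-split the labelled sample and refit, returning the mean
    and standard deviation of hold-out accuracy: with small training/test
    sets the statistical fluctuation of accuracy is the quantity of interest."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))
        test, train = idx[:cut], idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accs.append(correct / len(test))
    return mean(accs), stdev(accs)
```

A large standard deviation across repeats is precisely the warning sign the authors raise: a single train/test split on a small class can quote an accuracy that is mostly luck.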


2021 ◽  
Vol 40 (1) ◽  
pp. 685-702
Author(s):  
Huiru Wang ◽  
Zhijian Zhou

In the Rough margin-based ν-Twin Support Vector Machine (Rν-TSVM) algorithm, rough set theory is introduced. Rν-TSVM gives different penalties to misclassified samples according to their positions, so it avoids the overfitting problem to some extent. When the input data is a tensor, however, Rν-TSVM cannot handle it directly and may not utilize the data information effectively. Therefore, we propose a novel classifier based on tensor data, termed the Rough margin-based ν-Twin Support Tensor Machine (Rν-TSTM). Similar to Rν-TSVM, Rν-TSTM constructs a rough lower margin, rough upper margin and rough boundary in tensor space. Rν-TSTM not only retains the superiority of Rν-TSVM but also has its own unique advantages. Firstly, the data topology is retained more efficiently by the direct use of the tensor representation. Secondly, it has better classification performance compared to other classification algorithms. Thirdly, it can avoid the overfitting problem to a great extent. Lastly, it is more suitable for high-dimensional, small sample size problems. To solve the corresponding optimization problem in Rν-TSTM, we adopt an alternating iteration method in which the parameters corresponding to the hyperplanes are estimated by solving a series of Rν-TSVM optimization problems. The efficiency and superiority of the proposed method are demonstrated by computational experiments.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Vahid Ebrahimi ◽  
Zahra Bagheri ◽  
Zahra Shayan ◽  
Peyman Jafari

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model highly depends on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study is aimed at proposing a new application of the elastic net regularized OLR model, as a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the powers and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions, including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, as compared with the nonregularized OLR model, increased by approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model, especially in extremely small sample size groups. Furthermore, the present research provides a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.
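For intuition, a binary (non-ordinal) stand-in for the elastic net regularized model can be fit by plain subgradient descent. Everything here is an assumption for illustration: the study uses an ordinal model and its own estimation routine, whereas this sketch only shows how a weighting parameter w mixes the L1 and L2 penalties:

```python
import math

def elastic_net_logreg(X, y, lam=0.1, w=0.5, lr=0.1, steps=2000):
    """Binary logistic regression with the elastic net penalty
    lam * (w * ||b||_1 + (1 - w) * ||b||_2^2), fit by (sub)gradient descent.
    The weighting parameter w trades the sparsity-inducing L1 term against
    the shrinkage-only L2 term."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        for j in range(d):
            sign = (beta[j] > 0) - (beta[j] < 0)  # subgradient of |beta_j|
            grad[j] += lam * (w * sign + 2.0 * (1.0 - w) * beta[j])
            beta[j] -= lr * grad[j]
    return beta
```

Heavier regularization shrinks the coefficient estimate towards zero, which is the bias-variance trade that buys the small-sample stability reported in the study.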


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Dongrui Gao ◽  
Rui Zhang ◽  
Tiejun Liu ◽  
Fali Li ◽  
Teng Ma ◽  
...  

Background. Usually the training set of an online brain-computer interface (BCI) experiment is small. A small training set lacks enough information to deeply train the classifier, resulting in poor classification performance during online testing. Methods. In this paper, on the basis of Z-LDA, we further calculate the classification probability of Z-LDA and then use it to select reliable samples from the testing set to enlarge the training set, aiming to mine additional information from the testing set to adjust the biased classification boundary obtained from the small training set. The proposed approach is an extension of the previous Z-LDA and is named enhanced Z-LDA (EZ-LDA). Results. We evaluated the classification performance of LDA, Z-LDA, and EZ-LDA on simulation and real BCI datasets with different sizes of training samples, and the classification results showed that EZ-LDA achieved the best classification performance. Conclusions. EZ-LDA is promising for dealing with the small sample size training problem that usually exists in online BCI systems.
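The EZ-LDA idea of enlarging a small training set with confidently classified test samples can be sketched with a nearest centroid stand-in for the Z-LDA classifier (the softmax confidence and the 0.9 threshold below are assumptions for illustration, not the paper's model):

```python
from math import dist, exp

def centroid_fit(X, y):
    """Per-class centroids: a minimal stand-in for the LDA-based classifier."""
    cents = {}
    for c in set(y):
        pts = [x for x, yi in zip(X, y) if yi == c]
        cents[c] = [sum(p[i] for p in pts) / len(pts)
                    for i in range(len(pts[0]))]
    return cents

def predict_proba(cents, x):
    """Softmax over negative centroid distances as a classification probability."""
    scores = {c: exp(-dist(x, m)) for c, m in cents.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def self_training_fit(X_train, y_train, X_test, threshold=0.9):
    """EZ-LDA-style enlargement: fold confidently classified test samples
    back into the small training set, then refit the classifier."""
    cents = centroid_fit(X_train, y_train)
    X_aug, y_aug = list(X_train), list(y_train)
    for x in X_test:
        probs = predict_proba(cents, x)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:  # only reliable samples enlarge the training set
            X_aug.append(x)
            y_aug.append(label)
    return centroid_fit(X_aug, y_aug)
```

Samples near the decision boundary get probabilities close to 0.5 and are excluded, so only confidently labelled test samples shift the biased boundary learned from the small training set.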

