A New Nearest Centroid Neighbor Classifier Based on K Local Means Using Harmonic Mean Distance

Information ◽  
2018 ◽  
Vol 9 (9) ◽  
pp. 234 ◽  
Author(s):  
Sumet Mehta ◽  
Xiangjun Shen ◽  
Jiangping Gou ◽  
Dejiao Niu

The K-nearest neighbour classifier is a very effective and simple non-parametric technique in pattern classification; however, it only considers distance closeness, not the geometrical placement of the k neighbors. Also, its classification performance is highly influenced by the neighborhood size k and existing outliers. In this paper, we propose a new local mean based k-harmonic nearest centroid neighbor (LMKHNCN) classifier in order to consider both distance-based proximity and the spatial distribution of the k neighbors. In our method, the k nearest centroid neighbors in each class are first found and used to compute k different local mean vectors, and the harmonic mean distance from these local means to the query sample is then computed. Lastly, the query sample is assigned to the class with the minimum harmonic mean distance. The experimental results based on twenty-six real-world datasets show that the proposed LMKHNCN classifier achieves lower error rates, particularly in small sample-size situations, and that it is less sensitive to the parameter k when compared to the related four KNN-based classifiers.
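As a rough, non-authoritative sketch of the procedure described in this abstract (the function names and toy data below are hypothetical, and Euclidean distance is assumed), the LMKHNCN decision rule can be outlined in a few lines:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Component-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def ncn_neighbors(query, samples, k):
    """Greedily pick k nearest centroid neighbors: each new neighbor is the
    sample whose inclusion keeps the centroid of the chosen set closest to
    the query, so spatial placement matters, not just raw distance."""
    chosen, remaining = [], list(samples)
    for _ in range(min(k, len(remaining))):
        best = min(remaining, key=lambda s: dist(query, centroid(chosen + [s])))
        chosen.append(best)
        remaining.remove(best)
    return chosen

def lmkhncn_predict(query, data_by_class, k):
    """Assign the query to the class whose k local centroid means have the
    smallest harmonic mean distance to it."""
    best_label, best_hmd = None, float("inf")
    for label, samples in data_by_class.items():
        ncn = ncn_neighbors(query, samples, k)
        # k local mean vectors: centroids of the first 1..k centroid neighbors
        local_means = [centroid(ncn[:i + 1]) for i in range(len(ncn))]
        dists = [max(dist(query, m), 1e-12) for m in local_means]
        hmd = len(dists) / sum(1.0 / d for d in dists)  # harmonic mean
        if hmd < best_hmd:
            best_label, best_hmd = label, hmd
    return best_label
```

The nearest centroid neighbor selection grows the neighbor set so that its centroid stays close to the query, which is what gives the method its sensitivity to spatial placement rather than distance alone.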

2021 ◽  
Vol 14 (1) ◽  
pp. 65-71
Author(s):  
Arya Widyadhana ◽  
Cornelius Bagus Purnama Putra ◽  
Rarasmaya Indraswari ◽  
Agus Zainal Arifin

K-nearest neighbor (KNN) is an effective nonparametric classifier that determines the neighbors of a point based only on distance proximity. The classification performance of KNN is disadvantaged by the presence of outliers in small sample size datasets, and it deteriorates on datasets with class imbalance. We propose a local Bonferroni Mean based Fuzzy K-Nearest Centroid Neighbor (BM-FKNCN) classifier that assigns the class label of a query sample based on the nearest local centroid mean vector, so as to better represent the underlying statistics of the dataset. The proposed classifier is robust towards outliers because the Nearest Centroid Neighborhood (NCN) concept also considers the spatial distribution and symmetrical placement of the neighbors. The proposed classifier can also overcome class domination by its neighbors in datasets with class imbalance, because it averages all the centroid vectors from each class to adequately interpret the distribution of the classes. The BM-FKNCN classifier is tested on datasets from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository and benchmarked against classification results from the KNN, Fuzzy-KNN (FKNN), BM-FKNN and FKNCN classifiers. The experimental results show that BM-FKNCN achieves the highest overall average classification accuracy, 89.86%, compared to the other four classifiers.
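The distinctive aggregation operator here is the Bonferroni mean. As an illustration only (scalar case with p = q = 1 as an assumed choice; the paper applies the operator to centroid vectors), it can be computed as:

```python
def bonferroni_mean(values, p=1, q=1):
    """Bonferroni mean BM^{p,q}: averages the pairwise products of distinct
    elements, then takes the (p+q)-th root. With p = q = 1 it blends each
    value with the mean of the remaining values, softening outliers."""
    n = len(values)
    total = sum(
        (values[i] ** p) * (values[j] ** q)
        for i in range(n) for j in range(n) if i != j
    )
    return (total / (n * (n - 1))) ** (1.0 / (p + q))
```

For identical inputs the operator reduces to the input itself, and for p = q = 1 it interpolates between the arithmetic and geometric behavior of the data, which is the interconnection property that motivates its use for aggregating centroid information.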


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Jing Zhang ◽  
Guang Lu ◽  
Jiaquan Li ◽  
Chuanwen Li

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to the high dimensionality and small sample size of microarray data. Feature selection is necessary in the process of constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough sets is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on intersection neighborhood is then used to select the important features in each subset, since it can select features without redundancy and deal with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select a subset of the base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is then selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets showed that the proposed method achieved better classification performance than existing ensemble models.
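The Kappa-based grouping step can be sketched as follows. This is an assumption-laden illustration, not the authors' code: `combination_distance` and its equal weighting `alpha = 0.5` are hypothetical stand-ins for the paper's combined diversity/accuracy measure:

```python
def cohen_kappa(preds_a, preds_b):
    """Chance-corrected agreement between two classifiers' label outputs."""
    n = len(preds_a)
    observed = sum(a == b for a, b in zip(preds_a, preds_b)) / n
    labels = set(preds_a) | set(preds_b)
    expected = sum(
        (preds_a.count(l) / n) * (preds_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected) if expected != 1 else 1.0

def combination_distance(preds_a, preds_b, acc_a, acc_b, alpha=0.5):
    """Hypothetical blend of Kappa-based diversity (low kappa = diverse)
    and accuracy difference, usable as a distance for k-means grouping."""
    diversity = 1.0 - cohen_kappa(preds_a, preds_b)
    return alpha * diversity + (1 - alpha) * abs(acc_a - acc_b)
```

Two classifiers that agree everywhere and have equal accuracy get distance zero, so k-means would place them in the same cluster and only one of them would survive into the ensemble.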


2016 ◽  
Vol 2016 ◽  
pp. 1-10
Author(s):  
Zhicheng Lu ◽  
Zhizheng Liang

Linear discriminant analysis has been widely studied in data mining and pattern recognition. However, when performing the eigen-decomposition on the matrix pair (within-class scatter matrix and between-class scatter matrix), one finds in some cases that there exist degenerated eigenvalues, which makes the information from the eigen-subspace corresponding to a degenerated eigenvalue indistinguishable. In order to address this problem, we revisit linear discriminant analysis in this paper and propose a stable and effective algorithm for linear discriminant analysis in terms of an optimization criterion. By discussing the properties of the optimization criterion, we find that the eigenvectors in some eigen-subspaces may be indistinguishable if a degenerated eigenvalue occurs. Inspired by the idea of the maximum margin criterion (MMC), we embed MMC into the eigen-subspace corresponding to the degenerated eigenvalue to exploit the discriminability of the eigenvectors in that eigen-subspace. Since the proposed algorithm can deal with the degenerated case of eigenvalues, it not only handles the small-sample-size problem but also enables us to select projection vectors from the null space of the between-class scatter matrix. Extensive experiments on several face image and microarray data sets are conducted to evaluate the proposed algorithm in terms of classification performance, and the experimental results show that our method has smaller standard deviations than other methods in most cases.
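To make the MMC idea concrete, a minimal sketch (assuming 1-D projections; `mmc_score` is a hypothetical helper, not the paper's algorithm) evaluates tr(S_b) - tr(S_w) for a candidate direction, the quantity MMC maximizes to rank otherwise indistinguishable directions:

```python
def mmc_score(data_by_class, w):
    """Between-class minus within-class scatter of the 1-D projections x.w:
    the maximum margin criterion favors directions that separate class
    means by more than the spread within each class."""
    proj = {c: [sum(xi * wi for xi, wi in zip(x, w)) for x in xs]
            for c, xs in data_by_class.items()}
    n = sum(len(v) for v in proj.values())
    overall = sum(sum(v) for v in proj.values()) / n
    sb = sum(len(v) * (sum(v) / len(v) - overall) ** 2
             for v in proj.values()) / n
    sw = sum(sum((p - sum(v) / len(v)) ** 2 for p in v)
             for v in proj.values()) / n
    return sb - sw
```

Unlike the Fisher ratio tr(S_b)/tr(S_w), this difference form stays well defined even when the within-class scatter along a direction vanishes, which is why it can discriminate among eigenvectors of a degenerated eigen-subspace.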


Sensors ◽  
2021 ◽  
Vol 21 (17) ◽  
pp. 5863 ◽  
Author(s):  
Annica Kristoffersson ◽  
Jiaying Du ◽  
Maria Ehn

Sensor-based fall risk assessment (SFRA) utilizes wearable sensors for monitoring individuals’ motions in fall risk assessment tasks. Previous SFRA reviews recommend methodological improvements to better support the use of SFRA in clinical practice. This systematic review aimed to investigate the existing evidence of SFRA (discriminative capability, classification performance) and the methodological factors (study design, samples, sensor features, and model validation) contributing to the risk of bias. The review was conducted according to recommended guidelines, and 33 of 389 screened records were eligible for inclusion. Evidence of SFRA was identified: several sensor features and three classification models differed significantly between groups with different fall risk (mostly fallers/non-fallers). Moreover, classification performance corresponding to AUCs of at least 0.74 and/or accuracies of at least 84% was obtained from sensor features in six studies and from classification models in seven studies. Specificity was at least as high as sensitivity among studies reporting both values. Insufficient use of prospective design, small sample size, low in-sample inclusion of participants with elevated fall risk, high amounts and low degree of consensus in used features, and limited use of recommended model validation methods were identified in the included studies. Hence, future SFRA research should further reduce the risk of bias by continuously improving methodology.


2021 ◽  
Vol 8 (12) ◽  
pp. 193
Author(s):  
Andrea Bizzego ◽  
Giulio Gabrieli ◽  
Michelle Jin Yee Neoh ◽  
Gianluca Esposito

Deep learning (DL) has greatly contributed to bioelectric signal processing, in particular to the extraction of physiological markers. However, the efficacy and applicability of the results proposed in the literature are often constrained to the population represented by the data used to train the models. In this study, we investigate the issues related to applying a DL model to heterogeneous datasets. In particular, by focusing on heartbeat detection from electrocardiogram (ECG) signals, we show that the performance of a model trained on data from healthy subjects decreases when applied to patients with cardiac conditions and to signals collected with different devices. We then evaluate the use of transfer learning (TL) to adapt the model to the different datasets. In particular, we show that classification performance is improved, even on datasets with a small sample size. These results suggest that a greater effort should be made towards the generalizability of DL models applied to bioelectric signals, in particular by retrieving more representative datasets.


2020 ◽  
Vol 492 (4) ◽  
pp. 5377-5390 ◽  
Author(s):  
Shengda Luo ◽  
Alex P Leung ◽  
C Y Hui ◽  
K L Li

ABSTRACT We have investigated a number of factors that can have significant impacts on the classification performance of gamma-ray sources detected by the Fermi Large Area Telescope (LAT) with machine learning techniques. We show that a framework of automatic feature selection can construct a simple model with a small set of features that yields better performance than previous results. Secondly, because of the small sample size of the training/test sets of certain classes of gamma-ray sources, nested re-sampling and cross-validation are suggested for quantifying the statistical fluctuations of the quoted accuracy. We have also constructed a test set by cross-matching the identified active galactic nuclei (AGNs) and pulsars (PSRs) in the Fermi-LAT 8-yr point source catalogue (4FGL) with the unidentified sources in the previous 3rd Fermi-LAT Source Catalog (3FGL). Using this cross-matched set, we show that some features used for building a classification model with the identified sources can suffer from the problem of covariate shift, which can be a result of various observational effects. This can hamper the actual performance when one applies such a model to classifying unidentified sources. Using our framework, both the AGN/PSR and young pulsar (YNG)/millisecond pulsar (MSP) classifiers are automatically updated, with the new features and the enlarged training samples in the 4FGL catalogue incorporated. Using a two-layer model with these updated classifiers, we have selected 20 promising MSP candidates with confidence scores >98 per cent from the unidentified sources in the 4FGL catalogue, which can provide inputs for a multiwavelength identification campaign.
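The suggested re-sampling can be approximated with repeated random hold-out splits. The sketch below is illustrative only (hypothetical helper names, plain accuracy, no nested hyperparameter tuning); it shows how resampling yields a spread, not just a point estimate, for the quoted accuracy:

```python
import random
from statistics import mean, stdev

def repeated_holdout(X, y, fit, predict, n_repeats=20, test_frac=0.3, seed=0):
    """Repeatedly re-split the labelled sample and refit, returning the mean
    and standard deviation of hold-out accuracy: with small training/test
    sets the statistical fluctuation of accuracy is the quantity of interest."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    accs = []
    for _ in range(n_repeats):
        rng.shuffle(idx)
        cut = max(1, int(len(idx) * test_frac))
        test, train = idx[:cut], idx[cut:]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accs.append(correct / len(test))
    return mean(accs), stdev(accs)
```

A large standard deviation across repeats is precisely the warning sign the authors raise: a single train/test split on a small class can quote an accuracy that is mostly luck.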


2021 ◽  
Vol 40 (1) ◽  
pp. 685-702
Author(s):  
Huiru Wang ◽  
Zhijian Zhou

In the Rough margin-based ν-Twin Support Vector Machine (Rν-TSVM) algorithm, rough set theory is introduced. Rν-TSVM gives different penalties to misclassified samples according to their positions, so it avoids the overfitting problem to some extent. When the input data is a tensor, however, Rν-TSVM cannot handle it directly and may not utilize the data information effectively. Therefore, we propose a novel classifier based on tensor data, termed the Rough margin-based ν-Twin Support Tensor Machine (Rν-TSTM). Similar to Rν-TSVM, Rν-TSTM constructs a rough lower margin, rough upper margin and rough boundary in tensor space. Rν-TSTM not only retains the superiority of Rν-TSVM but also has its own unique advantages. Firstly, the data topology is retained more efficiently by the direct use of the tensor representation. Secondly, it has better classification performance compared to other classification algorithms. Thirdly, it can avoid the overfitting problem to a great extent. Lastly, it is more suitable for high-dimensional, small sample size problems. To solve the corresponding optimization problem in Rν-TSTM, we adopt an alternating iteration method in which the parameters corresponding to the hyperplanes are estimated by solving a series of Rν-TSVM optimization problems. The efficiency and superiority of the proposed method are demonstrated by computational experiments.


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Vahid Ebrahimi ◽  
Zahra Bagheri ◽  
Zahra Shayan ◽  
Peyman Jafari

Assessing differential item functioning (DIF) using the ordinal logistic regression (OLR) model highly depends on the asymptotic sampling distribution of the maximum likelihood (ML) estimators. The ML estimation method, which is often used to estimate the parameters of the OLR model for DIF detection, may be substantially biased with small samples. This study is aimed at proposing a new application of the elastic net regularized OLR model, as a special type of machine learning method, for assessing DIF between two groups with small samples. Accordingly, a simulation study was conducted to compare the powers and type I error rates of the regularized and nonregularized OLR models in detecting DIF under various conditions, including moderate and severe magnitudes of DIF (DIF = 0.4 and 0.8), sample size (N), sample size ratio (R), scale length (I), and weighting parameter (w). The simulation results revealed that for I = 5 and regardless of R, the elastic net regularized OLR model with w = 0.1, as compared with the nonregularized OLR model, increased the power of detecting moderate uniform DIF (DIF = 0.4) by approximately 35% and 21% for N = 100 and 150, respectively. Moreover, for I = 10 and severe uniform DIF (DIF = 0.8), the average power of the elastic net regularized OLR model with 0.03 ≤ w ≤ 0.06, as compared with the nonregularized OLR model, increased by approximately 29.3% and 11.2% for N = 100 and 150, respectively. In these cases, the type I error rates of the regularized and nonregularized OLR models were below or close to the nominal level of 0.05. In general, this simulation study showed that the elastic net regularized OLR model outperformed the nonregularized OLR model, especially in extremely small sample size groups. Furthermore, the present research provides a guideline and some recommendations for researchers who conduct DIF studies with small sample sizes.
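For intuition, a binary (non-ordinal) stand-in for the elastic net regularized model can be fit by plain subgradient descent. Everything here is an assumption for illustration: the study uses an ordinal model and its own estimation routine, whereas this sketch only shows how a weighting parameter w mixes the L1 and L2 penalties:

```python
import math

def elastic_net_logreg(X, y, lam=0.1, w=0.5, lr=0.1, steps=2000):
    """Binary logistic regression with the elastic net penalty
    lam * (w * ||b||_1 + (1 - w) * ||b||_2^2), fit by (sub)gradient descent.
    The weighting parameter w trades the sparsity-inducing L1 term against
    the shrinkage-only L2 term."""
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * x for b, x in zip(beta, xi))))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        for j in range(d):
            sign = (beta[j] > 0) - (beta[j] < 0)  # subgradient of |beta_j|
            grad[j] += lam * (w * sign + 2.0 * (1.0 - w) * beta[j])
            beta[j] -= lr * grad[j]
    return beta
```

Heavier regularization shrinks the coefficient estimate towards zero, which is the bias-variance trade that buys the small-sample stability reported in the study.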


2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Dongrui Gao ◽  
Rui Zhang ◽  
Tiejun Liu ◽  
Fali Li ◽  
Teng Ma ◽  
...  

Background. Usually the training set of an online brain-computer interface (BCI) experiment is small. A small training set lacks enough information to deeply train the classifier, resulting in poor classification performance during online testing. Methods. In this paper, on the basis of Z-LDA, we further calculate the classification probability of Z-LDA and then use it to select reliable samples from the testing set to enlarge the training set, aiming to mine additional information from the testing set to adjust the biased classification boundary obtained from the small training set. The proposed approach is an extension of the previous Z-LDA and is named enhanced Z-LDA (EZ-LDA). Results. We evaluated the classification performance of LDA, Z-LDA, and EZ-LDA on simulation and real BCI datasets with different sizes of training samples, and the classification results showed that EZ-LDA achieved the best classification performance. Conclusions. EZ-LDA is promising for dealing with the small sample size training problem that usually exists in online BCI systems.
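The EZ-LDA idea of enlarging a small training set with confidently classified test samples can be sketched with a nearest centroid stand-in for the Z-LDA classifier (the softmax confidence and the 0.9 threshold below are assumptions for illustration, not the paper's model):

```python
from math import dist, exp

def centroid_fit(X, y):
    """Per-class centroids: a minimal stand-in for the LDA-based classifier."""
    cents = {}
    for c in set(y):
        pts = [x for x, yi in zip(X, y) if yi == c]
        cents[c] = [sum(p[i] for p in pts) / len(pts)
                    for i in range(len(pts[0]))]
    return cents

def predict_proba(cents, x):
    """Softmax over negative centroid distances as a classification probability."""
    scores = {c: exp(-dist(x, m)) for c, m in cents.items()}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

def self_training_fit(X_train, y_train, X_test, threshold=0.9):
    """EZ-LDA-style enlargement: fold confidently classified test samples
    back into the small training set, then refit the classifier."""
    cents = centroid_fit(X_train, y_train)
    X_aug, y_aug = list(X_train), list(y_train)
    for x in X_test:
        probs = predict_proba(cents, x)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= threshold:  # only reliable samples enlarge the training set
            X_aug.append(x)
            y_aug.append(label)
    return centroid_fit(X_aug, y_aug)
```

Samples near the decision boundary get probabilities close to 0.5 and are excluded, so only confidently labelled test samples shift the biased boundary learned from the small training set.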

