K Important Neighbors: A Novel Approach to Binary Classification in High Dimensional Data

2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Hadi Raeisi Shahraki ◽  
Saeedeh Pourahmad ◽  
Najaf Zare

K nearest neighbors (KNN) is known as one of the simplest nonparametric classifiers, but in high-dimensional settings its accuracy is affected by nuisance features. In this study, we propose K important neighbors (KIN) as a novel approach to binary classification in high-dimensional problems. To avoid the curse of dimensionality, we fit a smoothly clipped absolute deviation (SCAD) logistic regression at the initial stage and account for the importance of each feature when constructing the dissimilarity measure, by imposing each feature's contribution on the Euclidean distance as a function of its SCAD coefficient. This hybrid dissimilarity measure, which combines information from both features and distances, retains the good properties of SCAD penalized regression and of KNN simultaneously. Simulation studies showed that, compared with KNN, KIN performs well in terms of both accuracy and dimension reduction. The proposed approach was found to be capable of eliminating nearly all of the noninformative features, because it exploits the oracle property of SCAD penalized regression in the construction of the dissimilarity measure. In very sparse settings, KIN also outperforms support vector machine (SVM) and random forest (RF), which are regarded as the best classifiers.
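
The weighted dissimilarity described in this abstract can be illustrated with a minimal Python sketch. It assumes that each feature's contribution is the absolute value of its SCAD coefficient, which is only one possible choice, since the abstract does not state the exact function. The names weighted_distance and kin_predict, the toy data, and the coefficient values are all hypothetical.

```python
import math
from collections import Counter

def weighted_distance(x, z, weights):
    """Euclidean distance with a per-feature weight on each squared difference."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, z, weights)))

def kin_predict(train_X, train_y, query, coefs, k=3):
    """Majority vote among the k training points closest under the weighted distance.

    ASSUMPTION: the weight of a feature is |coef|, so a coefficient that the
    penalized fit shrank to zero removes that feature from the distance.
    """
    weights = [abs(c) for c in coefs]
    order = sorted(range(len(train_X)),
                   key=lambda i: weighted_distance(train_X[i], query, weights))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data: feature 0 is informative, feature 1 is pure noise.
train_X = [[0.0, 9.0], [0.2, -7.0], [0.1, 3.0], [1.0, 8.5], [0.9, -6.0], [1.1, 2.0]]
train_y = [0, 0, 0, 1, 1, 1]
coefs = [2.5, 0.0]   # hypothetical SCAD estimates, noise feature set to zero
query = [0.95, 9.0]

kin_label = kin_predict(train_X, train_y, query, coefs, k=3)
knn_label = kin_predict(train_X, train_y, query, [1.0, 1.0], k=3)  # unit weights = plain KNN
```

On this toy data the weighted vote assigns the query to class 1, while unit weights (plain KNN) let the noise feature pull it into class 0.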

2015 ◽  
Vol 2015 ◽  
pp. 1-7 ◽  
Author(s):  
Kang-Mo Jung

Classification is a very important research topic with a wide variety of applications, because data can be obtained easily these days. Among the many classification techniques, the support vector machine (SVM) is widely applied in bioinformatics and genetic analysis, because it has a sound theoretical background and its performance is superior to that of other methods. The SVM can be rewritten as a combination of the hinge loss function and a penalty function. The smoothly clipped absolute deviation penalty function satisfies desirable statistical properties. Since standard SVM techniques typically treat all classes equally, they are not well suited to data with unbalanced class proportions. We propose a robust method for unbalanced cases based on class weights. Simulation and a numerical example show that the proposed method is effective for analyzing data with unbalanced class proportions.
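
The "hinge loss plus penalty" formulation mentioned in this abstract can be sketched as follows. The SCAD penalty is written in its standard piecewise form with the conventional default a = 3.7. The way class weights enter the loss (one multiplicative weight per class) is an assumption made for illustration, since the abstract does not give the weighting scheme, and the function names are hypothetical.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty for a single coefficient (requires lam > 0 and a > 2)."""
    t = abs(theta)
    if t <= lam:
        return lam * t                                   # lasso-like near zero
    if t <= a * lam:
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2                       # constant for large |theta|

def weighted_hinge_loss(scores, labels, class_weight):
    """Sum of hinge losses max(0, 1 - y * f(x)), each scaled by its class weight.

    labels are +1 or -1; class_weight maps each label to a positive weight.
    ASSUMPTION: one multiplicative weight per class.
    """
    return sum(class_weight[y] * max(0.0, 1.0 - y * f)
               for f, y in zip(scores, labels))

def objective(coefs, scores, labels, class_weight, lam):
    """Weighted hinge loss plus the SCAD penalty summed over the coefficients."""
    return (weighted_hinge_loss(scores, labels, class_weight)
            + sum(scad_penalty(b, lam) for b in coefs))
```

Giving the minority class a larger weight makes its margin violations cost more, which is one common way to counteract unbalanced class proportions.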


2014 ◽  
Vol 2014 ◽  
pp. 1-7
Author(s):  
Omid Hamidi ◽  
Lily Tapak ◽  
Aarefeh Jafarzadeh Kohneloo ◽  
Majid Sadeghifar

Microarray technology results in high-dimensional, low-sample-size data sets. Fitting sparse models is therefore essential, because only a small number of influential genes can be reliably identified. A number of variable selection approaches have been proposed for high-dimensional time-to-event data based on Cox proportional hazards where censoring is present. The present study applied three sparse variable selection techniques (Lasso, smoothly clipped absolute deviation, and smooth integration of counting and absolute deviation) to gene expression survival time data using the additive risk model, which is adopted when the absolute effects of multiple predictors on the hazard function are of interest. The performance of the techniques was evaluated by time-dependent ROC curves and bootstrap .632+ prediction error curves. The genes selected by all methods were highly significant (P<0.001). The Lasso showed the maximum median area under the ROC curve over time (0.95), and smoothly clipped absolute deviation showed the lowest prediction error (0.105). The genes selected by all methods improved the prediction of a purely clinical model, indicating that the microarray features contain valuable information. It was concluded that the approaches used can satisfactorily predict survival based on selected gene expression measurements.


2021 ◽  
Vol 29 (2) ◽  
Author(s):  
Ishaq Abdullahi Baba ◽  
Habshah Midi ◽  
Leong Wah June ◽  
Gafurjan Ibragimove

The widely used least absolute deviation (LAD) estimator with the smoothly clipped absolute deviation (SCAD) penalty function (abbreviated as LAD-SCAD) is known to produce corrupt estimates in the presence of outlying observations. The problem becomes more complicated when the number of predictors diverges. To overcome these problems, LAD-SCAD based on the sure independence screening (SIS) technique has been put forward. The SIS method uses the rank correlation screening (RCS) algorithm in the pre-screening step and the traditional pathwise coordinate descent algorithm to compute the sequence of regularization parameters in the post-screening step for subsequent model selection. It is now evident that the rank correlation is less robust against outliers. Motivated by these inadequacies, we propose to improve the LAD-SCAD estimator with a robust wrapped correlation screening (WCS) method, replacing the rank correlation in the SIS method with the robust wrapped correlation. The proposed estimator is denoted WCS+LAD-SCAD and is employed for variable selection. The simulation study and real-life data examples show that the proposed procedure produces more efficient results than the existing methods.
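
The pre-screening idea in this abstract (rank each predictor by a marginal correlation with the response and keep only the strongest) can be sketched in a few lines of Python. The sketch shows the baseline rank-correlation step, not the proposed wrapped-correlation variant. Using Kendall's tau-a as the rank correlation is an assumption, because the abstract does not say which rank correlation is used, and the function names are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)   # +1 concordant, -1 discordant, 0 tied
    return s / (n * (n - 1) / 2)

def screen_features(columns, y, d):
    """Return the (sorted) indices of the d features with the largest |tau|.

    columns holds one list of observations per feature.
    """
    scores = [abs(kendall_tau(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])

# Toy example: features 0 and 2 are monotone in y, features 1 and 3 are not.
y = [1, 2, 3, 4, 5]
columns = [[2, 4, 6, 8, 10], [3, 1, 4, 1, 5], [5, 4, 3, 2, 1], [1, 3, 2, 5, 4]]
kept = screen_features(columns, y, d=2)
```

Only the retained features would then be passed to the penalized estimator in the post-screening step.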


Methodology ◽  
2020 ◽  
Vol 16 (2) ◽  
pp. 127-146 ◽  
Author(s):  
Seung Hyun Baek ◽  
Alberto Garcia-Diaz ◽  
Yuanshun Dai

Data mining is one of the most effective statistical methodologies for investigating a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize reliability of results and computational efficiency are required for the analysis of high-dimensional data. Optimization principles can play a significant role in the rationalization and validation of specialized data mining procedures. This paper presents a novel three-step methodology based on Multi-Choice Wavelet Thresholding (MCWT), consisting of three processes: perception (dimension reduction), decision (feature ranking), and cognition (model selection). In these steps, three concepts (wavelet thresholding, support vector machines for classification, and information complexity) are integrated to evaluate learning models. Three published data sets are used to illustrate the proposed methodology. Additionally, performance comparisons with recent and widely applied methods are shown.


2020 ◽  
Vol 17 (2) ◽  
pp. 0550
Author(s):  
Ali Hameed Yousef ◽  
Omar Abdulmohsin Ali

Penalized regression models have received considerable attention for variable selection and play an essential role in dealing with high-dimensional data. The arctangent penalty, denoted Atan, has recently been used as an efficient method for both estimation and variable selection. However, the Atan penalty is very sensitive to outliers in the response variable and to heavy-tailed error distributions. Least absolute deviation (LAD), by contrast, is a good method for obtaining robust regression estimates. The specific objective of this research is to propose a robust Atan estimator by combining these two ideas. Simulation experiments and real data applications show that the proposed LAD-Atan estimator has superior performance compared with other estimators.
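
The combination described in this abstract can be sketched as an objective function: an LAD loss plus an Atan penalty. The penalty below uses the form in which the Atan penalty is commonly stated, lam * (gamma + 2/pi) * arctan(|beta| / gamma). Whether the authors use exactly this parameterization, and how the loss and penalty are scaled against each other, is an assumption, and the function names are hypothetical.

```python
import math

def atan_penalty(beta, lam, gamma):
    """Atan penalty for one coefficient (assumed form, gamma > 0)."""
    return lam * (gamma + 2.0 / math.pi) * math.atan(abs(beta) / gamma)

def lad_loss(X, y, beta):
    """Sum of absolute residuals |y_i - x_i . beta|."""
    return sum(abs(yi - sum(xj * bj for xj, bj in zip(xi, beta)))
               for xi, yi in zip(X, y))

def lad_atan_objective(X, y, beta, lam, gamma):
    """LAD loss plus the Atan penalty summed over coefficients.

    ASSUMPTION: unscaled sum of the two terms; the abstract gives no formula.
    """
    return lad_loss(X, y, beta) + sum(atan_penalty(b, lam, gamma) for b in beta)
```

Because the absolute residual grows only linearly, a single outlying response changes the loss far less than it would under squared error, while the penalty stays bounded for large coefficients.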


Sensors ◽  
2019 ◽  
Vol 19 (14) ◽  
pp. 3214 ◽  
Author(s):  
Weiyi Yang ◽  
Yujuan Si ◽  
Di Wang ◽  
Gong Zhang

Cardiovascular disease (CVD) has become one of the most serious diseases threatening human health. Over the past decades, over 150 million people have died of CVDs. Hence, timely prediction of CVDs is especially important. Currently, deep learning algorithm-based CVD diagnosis methods are extensively employed; however, most such algorithms can only utilize one-lead ECGs, so the potential information in other-lead ECGs is not utilized. To address this issue, we have developed novel methods for diagnosing arrhythmia. In this work, DL-CCANet and TL-CCANet are proposed to extract abstract discriminating features from dual-lead and three-lead ECGs, respectively. Then, a linear support vector machine, which specializes in high-dimensional features, is used as the classifier model. On the MIT-BIH database, a 95.2% overall accuracy is obtained by detecting 15 types of heartbeats using DL-CCANet. On the INCART database, overall accuracies of 94.01% (II and V1 leads), 93.90% (V1 and V5 leads) and 94.07% (II and V5 leads) are achieved by detecting seven types of heartbeat using DL-CCANet, while TL-CCANet yields a higher overall accuracy of 95.52% using the above three leads. In addition, all of the above experiments are implemented using noisy ECG data. The proposed methods have the potential to be applied in clinical settings and on mobile devices.


2021 ◽  
Vol 7 ◽  
pp. e562
Author(s):  
Muhammad Hamraz ◽  
Naz Gul ◽  
Mushtaq Raza ◽  
Dost Muhammad Khan ◽  
Umair Khalil ◽  
...  

In this paper, a novel feature selection method for microarray gene expression datasets, called Robust Proportional Overlapping Score (RPOS), is proposed; it utilizes a robust measure of dispersion, the Median Absolute Deviation (MAD). The method robustly identifies the most discriminative genes by considering the overlapping scores of the gene expression values for binary class problems. Genes with a high degree of overlap between classes are discarded, and those that discriminate between the classes are selected. The results of the proposed method are compared with five state-of-the-art gene selection methods in terms of classification error, Brier score, and sensitivity on eleven gene expression datasets. Observations are classified, for different sets of genes selected by the proposed method, with three different classifiers: random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box-plots and stability scores of the results are also shown. The results reveal that in most cases the proposed method outperforms the other methods.
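
The robustness of the Median Absolute Deviation that this abstract relies on can be seen in a short Python sketch. It only illustrates MAD itself, not the RPOS scoring procedure, whose details the abstract does not give. The expression values are made up for illustration.

```python
import statistics

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

# Hypothetical expression values for one gene, with and without one outlier.
clean = [2.1, 2.3, 2.2, 2.4, 2.0]
contaminated = [2.1, 2.3, 2.2, 2.4, 20.0]

mad_clean = mad(clean)
mad_contaminated = mad(contaminated)
sd_clean = statistics.stdev(clean)
sd_contaminated = statistics.stdev(contaminated)
```

In this example a single extreme value leaves the MAD essentially unchanged, whereas the sample standard deviation increases many times over.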


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC values were for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)). Age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance for incident cancers in the Lifelines cohort.


2020 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Jin Tao ◽  
Kelly Brayton ◽  
Shira Broschat

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
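
The reported metrics can be checked against each other, since the F1 score is the harmonic mean of precision and recall. The short Python sketch below uses only the precision and recall stated in this abstract; the function name is hypothetical.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

reported_precision = 0.6519   # 65.19% from the abstract
reported_recall = 0.9125      # 91.25% from the abstract

f1 = f1_score(reported_precision, reported_recall)
print(f"{100 * f1:.2f}%")  # prints 76.05%
```

The result agrees with the F1 score of 76.05% reported in the abstract.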

