Inferring the outcomes of rejected loans: an application of semisupervised clustering

Author(s): Zhiyong Li, Xinyi Hu, Ke Li, Fanyin Zhou, Feng Shen
2016, Vol 24 (4), pp. 992-999
Author(s): Irene Diaz-Valenzuela, M. Amparo Vila, Maria J. Martin-Bautista

2011, Vol 19 (3), pp. 562-574
Author(s): Gleb Beliakov, Simon James, Gang Li

2013, Vol 2013, pp. 1-10
Author(s): Mingwei Leng, Jianjun Cheng, Jinjin Wang, Zhengquan Zhang, Hanhai Zhou, ...

Most existing semisupervised clustering algorithms that rely on a small labeled dataset have low accuracy on multidensity and imbalanced datasets, and labeling data is expensive and time-consuming in many real-world applications. This paper addresses active data selection and semisupervised clustering on multidensity and imbalanced datasets and proposes an active semisupervised clustering algorithm. The proposed algorithm uses an active data-selection mechanism to minimize the amount of labeled data required, and it uses multiple thresholds to expand the labeled dataset. Three standard datasets and one synthetic dataset are used to evaluate the algorithm. The experimental results show that it achieves higher accuracy and more stable performance than other clustering and semisupervised clustering algorithms, especially when the datasets are multidensity and imbalanced.
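The abstract does not spell out how the multithreshold expansion works, so the following is only a minimal sketch of one plausible reading: each class gets its own distance threshold, and a labeled point passes its label to unlabeled points within that class's threshold. The function name `expand_labels`, the toy points, and the threshold values are all hypothetical and are not taken from the paper.

```python
import math

def expand_labels(points, seeds, thresholds):
    """Spread seed labels to unlabeled points within a per-class radius.

    points:     list of (x, y) tuples
    seeds:      dict mapping point index -> class label (the few queried labels)
    thresholds: dict mapping class label -> distance threshold (assumed values)
    """
    labels = dict(seeds)
    frontier = list(seeds)
    while frontier:
        i = frontier.pop()
        cls = labels[i]
        for j, q in enumerate(points):
            # A point is claimed by the first class that reaches it.
            if j not in labels and math.dist(points[i], q) <= thresholds[cls]:
                labels[j] = cls
                frontier.append(j)
    return labels

# Toy data: a dense group, a sparse group, and one distant outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
          (5.0, 5.0), (6.0, 5.0), (7.0, 5.0),
          (20.0, 20.0)]
seeds = {0: "dense", 3: "sparse"}              # one labeled point per class
thresholds = {"dense": 0.15, "sparse": 1.5}    # tighter radius for the dense class

expanded = expand_labels(points, seeds, thresholds)
```

In this toy setting a single global threshold of 0.15 would leave the sparse group unexpanded, which is the kind of multidensity problem the abstract describes. The outlier at index 6 stays unlabeled under either threshold.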


2020, Vol 2020, pp. 1-13
Author(s): Rongfeng Zheng, Jiayong Liu, Weina Niu, Liang Liu, Kai Li, ...

The recent explosive growth in network traffic has increased the processing pressure on network intrusion detection systems. In addition, there is a lack of reliable methods for preprocessing network traffic generated by benign applications that do not steal users’ data from their devices. To alleviate these problems, this study analyzed the differences between the benign traffic produced by benign applications and the malicious traffic produced by malware. To capture these differences, this study proposed a new set of statistical features for training a clustering model. Furthermore, to mine the communication channels generated by benign applications in batches, a semisupervised clustering method was adopted. Using a small number of labeled samples, our method aggregated historical network traffic into two types of clusters. A cluster that did not contain labeled malicious samples was regarded as a benign traffic cluster. Four types of clustering algorithms were compared experimentally, and the density-based spatial clustering of applications with noise (DBSCAN) algorithm was selected to mine benign communication channels. We also compared our method with two other methods, and the results demonstrated that the benign channels mined through our method were more reliable. Finally, using our method, 1,811 benign transport layer security (TLS) channels were mined from 18,357 TLS communication channels. The flows carried by these benign channels accounted for 65.37% of all network flows, and no malicious flow was included in our results, which supports the effectiveness of our method.
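The labeling rule described above (a cluster with no labeled malicious sample is treated as benign) can be sketched as follows. This is an illustrative toy, not the authors' pipeline: the two-dimensional feature vectors, the `eps` and `min_pts` values, and the function names are assumptions, and the small DBSCAN routine is a plain textbook version written here so the example needs only the standard library.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster id per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point: border
            if labels[j] is not None:
                continue             # already assigned, do not expand again
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:   # only core points extend the cluster
                queue.extend(nj)
    return labels

def benign_clusters(cluster_ids, malicious_idx):
    """Clusters that contain no labeled malicious sample (noise excluded)."""
    tainted = {cluster_ids[i] for i in malicious_idx}
    return {c for c in set(cluster_ids) if c != -1 and c not in tainted}

# Hypothetical per-channel feature vectors (two made-up statistical features).
channels = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (1.1, 1.1),
            (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
            (9.0, 1.0)]
malicious_idx = [5]  # the small set of samples labeled as malicious

cluster_ids = dbscan(channels, eps=0.3, min_pts=3)
benign = benign_clusters(cluster_ids, malicious_idx)
benign_channels = [i for i, c in enumerate(cluster_ids) if c in benign]
```

Noise points are excluded from the benign set in this sketch, on the assumption that an unclustered channel carries no evidence either way.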


2005, Vol 17 (11), pp. 2482-2507
Author(s): Qi Zhao, David J. Miller

The goal of semisupervised clustering/mixture modeling is to learn the underlying groups comprising a given data set when there is also some form of instance-level supervision available, usually in the form of labels or pairwise sample constraints. Most prior work with constraints assumes the number of classes is known, with each learned cluster assumed to be a class and, hence, subject to the given class constraints. When the number of classes is unknown or when the one-cluster-per-class assumption is not valid, the use of constraints may actually be deleterious to learning the ground-truth data groups. We address this by (1) allowing allocation of multiple mixture components to individual classes and (2) estimating both the number of components and the number of classes. We also address new class discovery, with components void of constraints treated as putative unknown classes. For both real-world and synthetic data, our method is shown to accurately estimate the number of classes and to give favorable comparison with the recent approach of Shental, Bar-Hillel, Hertz, and Weinshall (2003).
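To illustrate the idea of letting several mixture components belong to one class, and of treating components without any supervision as putative unknown classes, here is a minimal sketch. It is a simplification and not the authors' method: it assumes hard component assignments are already available and maps components to classes by majority vote over the labeled samples, whereas the paper estimates these quantities within the mixture model. The names `map_components_to_classes`, `component_of`, and `known_labels` are hypothetical.

```python
from collections import Counter, defaultdict

def map_components_to_classes(component_of, known_labels):
    """Assign each mixture component to a class by majority vote.

    component_of: list giving the component index of every sample
    known_labels: dict mapping sample index -> class label (partial supervision)
    Components that own no labeled sample are flagged as putative new classes.
    """
    votes = defaultdict(Counter)
    for i, cls in known_labels.items():
        votes[component_of[i]][cls] += 1

    mapping = {}
    for comp in sorted(set(component_of)):
        if votes[comp]:
            mapping[comp] = votes[comp].most_common(1)[0][0]
        else:
            mapping[comp] = f"unknown_{comp}"  # no supervision touches this component
    return mapping

# Toy assignment: eight samples spread over four components.
component_of = [0, 0, 1, 1, 2, 2, 3, 3]
known_labels = {0: "A", 2: "A", 4: "B"}  # only three samples carry labels

mapping = map_components_to_classes(component_of, known_labels)
```

Here components 0 and 1 both map to class A, so the class is modeled by two components rather than one, and component 3 is set aside as a candidate unknown class.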

