A Log-Based Anomaly Detection Method with Efficient Neighbor Searching and Automatic K Neighbor Selection

2020 ◽  
Vol 2020 ◽  
pp. 1-17
Author(s):  
Bingming Wang ◽  
Shi Ying ◽  
Zhe Yang

Using the k-nearest neighbor (kNN) algorithm, a supervised learning method, to detect anomalies can yield more accurate results. However, kNN is inefficient at finding the k neighbors in large-scale log data; at the same time, log data are imbalanced in quantity, so selecting proper k neighbors for different data distributions is a challenge. In this paper, we propose a log-based anomaly detection method with efficient neighbor searching and automatic selection of k neighbors. First, we propose a neighbor search method based on minhash and MVP-tree. The minhash algorithm groups similar logs into the same bucket, and an MVP-tree model is built for the samples in each bucket. This reduces the effort of distance calculation and the number of neighbor samples that need to be compared, which improves the efficiency of finding neighbors. For selecting the k neighbors, we propose an automatic method based on the Silhouette Coefficient, which selects proper k neighbors to improve the accuracy of anomaly detection. Our method is verified on six different types of log data to demonstrate its universality and feasibility.
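The bucketing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the `num_hashes` and `band_size` parameters, and the banding scheme are assumptions chosen for illustration, and the per-bucket MVP-tree and the Silhouette Coefficient step are omitted.

```python
import hashlib
from collections import defaultdict


def tokens(line):
    """Split a log line into a set of lowercase whitespace-separated tokens."""
    return set(line.lower().split())


def minhash_signature(token_set, num_hashes=16):
    """Return a MinHash signature: the minimum hash value per seeded hash function."""
    if not token_set:
        return tuple(0 for _ in range(num_hashes))
    signature = []
    for seed in range(num_hashes):
        # Each seed acts as a different hash function over the tokens.
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big",
            )
            for t in token_set
        ))
    return tuple(signature)


def bucket_logs(lines, num_hashes=16, band_size=4):
    """Group line indices into buckets keyed by (band offset, band of the signature)."""
    buckets = defaultdict(list)
    for idx, line in enumerate(lines):
        sig = minhash_signature(tokens(line), num_hashes)
        for start in range(0, num_hashes, band_size):
            buckets[(start, sig[start:start + band_size])].append(idx)
    return buckets


logs = [
    "error disk full on node1",
    "node1 on full disk error",
    "user login ok",
]
buckets = bucket_logs(logs)
# Lines with the same token set get the same signature and share every bucket.
shared = [members for members in buckets.values() if 0 in members and 1 in members]
```

A neighbor search would then compare a query only against samples that share at least one bucket with it, rather than against the whole log set.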

2021 ◽  
Vol 15 (3) ◽  
pp. 1-22
Author(s):  
Shi Ying ◽  
Bingming Wang ◽  
Lu Wang ◽  
Qingshan Li ◽  
Yishi Zhao ◽  
...  

Logs that record abnormal system states (anomaly logs) can be regarded as outliers, and the k-Nearest Neighbor (kNN) algorithm has relatively high accuracy among outlier detection methods. Therefore, we use the kNN algorithm to detect anomalies in log data. However, using kNN for this purpose raises several problems, three of which are: excessive vector dimension makes the kNN algorithm inefficient, unlabeled log data cannot support the kNN algorithm, and the imbalance in the number of log data distorts the classification decision of the kNN algorithm. To solve these three problems, we propose an efficient log anomaly detection method based on an improved kNN algorithm with an automatically labeled sample set. The method first applies a log parsing method based on N-gram and frequent pattern mining (FPM), which reduces the dimension of the log vector converted with Term Frequency-Inverse Document Frequency (TF-IDF) technology. Then we use clustering and self-training to obtain a labeled log data sample set from historical logs automatically. Finally, we improve the kNN algorithm using average weighting technology, which improves its accuracy on unbalanced samples. The method in this article is validated on six log datasets of different types.
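To make the TF-IDF conversion mentioned in this abstract concrete, here is a minimal sketch of one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency). The whitespace tokenizer and the exact weighting formula are assumptions; the abstract does not specify which variant the authors use, and their N-gram and FPM parsing step is not reproduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            term: (count / len(toks)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


logs = [
    "connection lost node1",
    "connection lost node2",
    "connection restored node1",
]
vectors = tfidf(logs)
# "connection" occurs in every line, so its weight is zero;
# "restored" occurs in one line only, so its weight is positive.
```

Each distinct term adds one dimension to the vector, which is why reducing the vocabulary through log parsing shortens the vectors that kNN has to compare.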


Author(s):  
A. Murat Yagci ◽  
Tevfik Aytekin ◽  
Fikret S. Gurgen

Matrix factorization models often reveal the low-dimensional latent structure in high-dimensional spaces while bringing space efficiency to large-scale collaborative filtering problems. Improving the training and prediction time efficiency of these models is also important, since an accurate model may raise practical concerns if it is slow to capture the changing dynamics of the system. For the training task, powerful improvements have been proposed, especially using SGD, ALS, and their parallel versions. In this paper, we focus on the prediction task and combine matrix factorization with approximate nearest neighbor search methods to improve the efficiency of top-N prediction queries. Our efforts result in a meta-algorithm, MMFNN, which can employ various common matrix factorization models, drastically improve their prediction efficiency, and still perform comparably to standard prediction approaches or sometimes even better in terms of predictive power. Using various batch, online, and incremental matrix factorization models, we present detailed empirical analysis results on many large implicit feedback datasets from different application domains.
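The idea of answering top-N queries from learned factors can be sketched as follows. This is not MMFNN: it shows exhaustive inner-product scoring as the baseline, plus a candidate filter based on random-hyperplane hashing, which is one generic approximate nearest neighbor technique swapped in for illustration. The function names, the `bits` and `seed` parameters, and the toy factor vectors are all hypothetical. Note that sign-based hashing approximates cosine similarity rather than the inner product itself.

```python
import heapq
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_n_exact(user_vec, item_vecs, n):
    """Score every item against the user factors and keep the n best indices."""
    return heapq.nlargest(
        n, range(len(item_vecs)), key=lambda i: dot(user_vec, item_vecs[i])
    )


def make_hyperplanes(dim, bits, seed=0):
    """Draw random hyperplanes; the seed keeps the hashing reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]


def signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(dot(vec, plane) >= 0 for plane in planes)


def top_n_approx(user_vec, item_vecs, n, planes):
    """Score only the items whose hash signature matches the user's signature."""
    key = signature(user_vec, planes)
    candidates = [
        i for i, vec in enumerate(item_vecs) if signature(vec, planes) == key
    ]
    return heapq.nlargest(n, candidates, key=lambda i: dot(user_vec, item_vecs[i]))


user = [1.0, 0.0]
items = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
best = top_n_exact(user, items, 2)  # indices ordered by descending score
```

The approximate variant may return fewer than n items when few candidates share the user's signature; practical systems typically precompute the item buckets and query several hash tables to trade speed against recall.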


1990 ◽  
Vol 22 (3-4) ◽  
pp. 41-48 ◽  
Author(s):  
Frits A. Fastenau ◽  
Jaap H. J. M. van der Graaf ◽  
Gerard Martijnse

Diffuse pollution, caused by direct discharges from individual houses, small built-up nuclei, farms, camp-sites, etc., for which connection to central wastewater treatment systems is unfeasible, may be significantly reduced by on-site treatment. Based on large-scale research, including intensive field research on 14 systems of different types and sizes in a range equal to population equivalents (p.e.) of 5-200 persons, 8 different types of system were compared. The comparison involved technological features, such as removal efficiency, reliability, operational and maintenance aspects, environmental impacts and land claims, together with economic features showing significant differences. Advantages and disadvantages of each system are highlighted to enable a selection of suitable systems to be made. When no limiting factors are present, it was found that, in general, infiltration systems (infiltration pits; infiltration trenches) have the best features for on-site treatment up to 100 p.e. For larger capacities, or when infiltration is not possible, the rotating biological contactor will be the best solution, mainly because of the lower costs.


Author(s):  
Bingming Wang ◽  
Shi Ying ◽  
Guoli Cheng ◽  
Rui Wang ◽  
Zhe Yang ◽  
...  

Logs play an important role in the maintenance of large-scale systems. The number of logs that indicate normal states (normal logs) differs greatly from the number of logs that indicate anomalies (abnormal logs), and the two types of logs have certain differences. Automatically identifying faults with the K-Nearest Neighbor (KNN) algorithm, an outlier detection method with high accuracy, is an effective way to detect anomalies from logs. However, logs are large in scale and their samples are very unevenly distributed, which affects the results of the KNN algorithm in log-based anomaly detection. Thus, we propose an improved KNN-based method that uses the existing mean-shift clustering algorithm to efficiently select the training set from massive logs. We then assign different weights to samples at different distances, which reduces the negative effect of the unbalanced distribution of log samples on the accuracy of the KNN algorithm. Comparative experiments on log sets from five supercomputers show that the proposed method can be effectively applied to log-based anomaly detection, and that its accuracy, recall rate and F measure are higher than those of the traditional keyword search method.
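The distance-based weighting mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the weighting function, so inverse-distance weights are an assumption here, as are the function name, the `eps` smoothing term, and the toy two-dimensional samples. The mean-shift training set selection is not shown.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k=3, eps=1e-9):
    """Classify `query` by a vote of its k nearest samples, weighted by 1/distance.

    `train` is a list of (vector, label) pairs. Inverse-distance weighting is
    an illustrative assumption, not the weighting used in the paper.
    """
    nearest = sorted(train, key=lambda sample: math.dist(sample[0], query))[:k]
    votes = defaultdict(float)
    for vec, label in nearest:
        # Closer neighbors contribute more; eps avoids division by zero.
        votes[label] += 1.0 / (math.dist(vec, query) + eps)
    return max(votes, key=votes.get)


train = [
    ((0.0, 0.0), "normal"),
    ((0.0, 0.2), "normal"),
    ((1.0, 1.0), "anomaly"),
]
# Two of the three neighbors are "normal", but the single nearby "anomaly"
# sample outweighs them once votes are scaled by inverse distance.
label = weighted_knn(train, (1.0, 0.9), k=3)
```

With an unweighted majority vote the same query would be labeled "normal", which shows how weighting can offset the dominance of the more numerous class.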


2019 ◽  
Vol 116 (52) ◽  
pp. 27053-27062 ◽  
Author(s):  
Marcus Davidsson ◽  
Gang Wang ◽  
Patrick Aldrin-Kirk ◽  
Tiago Cardoso ◽  
Sara Nolbrant ◽  
...  

Adeno-associated virus (AAV) capsid modification enables the generation of recombinant vectors with tailored properties and tropism. Most approaches to date depend on random screening, enrichment, and serendipity. The approach explored here, called BRAVE (barcoded rational AAV vector evolution), enables efficient selection of engineered capsid structures on a large scale using only a single screening round in vivo. The approach stands in contrast to previous methods that require multiple generations of enrichment. With the BRAVE approach, each virus particle displays a peptide, derived from a protein, of known function on the AAV capsid surface, and a unique molecular barcode in the packaged genome. The sequencing of RNA-expressed barcodes from a single-generation in vivo screen allows the mapping of putative binding sequences from hundreds of proteins simultaneously. Using the BRAVE approach and hidden Markov model-based clustering, we present 25 synthetic capsid variants with refined properties, such as retrograde axonal transport in specific subtypes of neurons, as shown for both rodent and human dopaminergic neurons.


2021 ◽  
pp. 757-777
Author(s):  
Max Landauer ◽  
Georg Höld ◽  
Markus Wurzenberger ◽  
Florian Skopik ◽  
Andreas Rauber
