LightSEEN: Real-Time Unknown Traffic Discovery via Lightweight Siamese Networks

Security and Communication Networks ◽

10.1155/2021/8267298 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Ji Li ◽

Chunxiang Gu ◽

Fushan Wei ◽

Xieli Zhang ◽

Xinyi Hu ◽

...

Keyword(s):

Deep Level ◽

Data Sets ◽

Traffic Classification ◽

Training Time ◽

Traffic Identification ◽

Open World ◽

Closed World ◽

Public Data ◽

Flow Features ◽

True Detection

With the increase in the proportion of encrypted network traffic, encrypted traffic identification (ETI) is becoming a critical research topic for network management and security. At present, ETI under closed world assumption has been adequately studied. However, when the models are applied to the realistic environment, they will face unknown traffic identification challenges and model efficiency requirements. Considering these problems, in this paper, we propose a lightweight unknown traffic discovery model LightSEEN for open-world traffic classification and model update under practical conditions. The overall structure of LightSEEN is based on the Siamese network, which takes three simplified packet feature vectors as input on one side, uses the multihead attention mechanism to parallelly capture the interactions among packets, and adopts techniques including 1D-CNN and ResNet to promote the extraction of deep-level flow features and the convergence speed of the network. The effectiveness and efficiency of the proposed model are evaluated on two public data sets. The results show that the effectiveness of LightSEEN is overall at the same level as the state-of-the-art method and LightSEEN has even better true detection rate, but the parameter used in LightSEEN is 0.51 % of the baseline and its average training time is 37.9 % of the baseline.

Download Full-text

Neighborhood aggregation based graph attention networks for open-world knowledge graph reasoning

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211889 ◽

2021 ◽

pp. 1-12

Author(s):

Xiaojun Chen ◽

Ling Ding ◽

Yang Xiang

Keyword(s):

Knowledge Graph ◽

World Knowledge ◽

Training Time ◽

Attention Networks ◽

Convolutional Networks ◽

Open World ◽

Closed World ◽

Proposed Model ◽

Reasoning Tasks ◽

Weighted Combination

Knowledge graph reasoning or completion aims at inferring missing facts based on existing ones in a knowledge graph. In this work, we focus on the problem of open-world knowledge graph reasoning—a task that reasons about entities which are absent from KG at training time (unseen entities). Unfortunately, the performance of most existing reasoning methods on this problem turns out to be unsatisfactory. Recently, some works use graph convolutional networks to obtain the embeddings of unseen entities for prediction tasks. Graph convolutional networks gather information from the entity’s neighborhood, however, they neglect the unequal natures of neighboring nodes. To resolve this issue, we present an attention-based method named as NAKGR, which leverages neighborhood information to generate entities and relations representations. The proposed model is an encoder-decoder architecture. Specifically, the encoder devises an graph attention mechanism to aggregate neighboring nodes’ information with a weighted combination. The decoder employs an energy function to predict the plausibility for each triplets. Benchmark experiments show that NAKGR achieves significant improvements on the open-world reasoning tasks. In addition, our model also performs well on the closed-world reasoning tasks.

Download Full-text

A Group of Stable Features Suitable for the Traffic Classification

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.756-759.275 ◽

2013 ◽

Vol 756-759 ◽

pp. 275-279

Author(s):

Kang Liu

Keyword(s):

Data Sets ◽

Traffic Classification ◽

Data Set ◽

Specific Data ◽

Public Data ◽

C4.5 Decision Tree ◽

Stable Performance ◽

Decision Tree Method ◽

Tree Method ◽

Selection Of

In recent years, Internet traffic classification using machine learning has become new direction in network measurement. In this method, choose the appropriate traffic features is the key. The selection of feature in previous studies dependent on the specific data set and does not have the versatility to identify the data sets captured in the actual network conditions. We analyze and select a group of features based on public data set and the data collected in the actual network. Experimental results show that the selected feature set with stable performance and effective identification ability by using C4.5 decision tree method.

Download Full-text

MULFE: Multi-Label Learning via Label-Specific Feature Space Ensemble

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3451392 ◽

2021 ◽

Vol 16 (1) ◽

pp. 1-24

Author(s):

Yaojin Lin ◽

Qinghua Hu ◽

Jinghua Liu ◽

Xingquan Zhu ◽

Xindong Wu

Keyword(s):

Empirical Studies ◽

Feature Space ◽

Training Data ◽

Data Sets ◽

Learning Framework ◽

Feature Spaces ◽

Public Data ◽

Margin Distribution ◽

Label Correlations ◽

Label Correlation

In multi-label learning, label correlations commonly exist in the data. Such correlation not only provides useful information, but also imposes significant challenges for multi-label learning. Recently, label-specific feature embedding has been proposed to explore label-specific features from the training data, and uses feature highly customized to the multi-label set for learning. While such feature embedding methods have demonstrated good performance, the creation of the feature embedding space is only based on a single label, without considering label correlations in the data. In this article, we propose to combine multiple label-specific feature spaces, using label correlation, for multi-label learning. The proposed algorithm, mu lti- l abel-specific f eature space e nsemble (MULFE), takes consideration label-specific features, label correlation, and weighted ensemble principle to form a learning framework. By conducting clustering analysis on each label’s negative and positive instances, MULFE first creates features customized to each label. After that, MULFE utilizes the label correlation to optimize the margin distribution of the base classifiers which are induced by the related label-specific feature spaces. By combining multiple label-specific features, label correlation based weighting, and ensemble learning, MULFE achieves maximum margin multi-label classification goal through the underlying optimization framework. Empirical studies on 10 public data sets manifest the effectiveness of MULFE.

Download Full-text

Improvements for research data repositories: The case of text spam

Journal of Information Science ◽

10.1177/0165551521998636 ◽

2021 ◽

pp. 016555152199863

Author(s):

Ismael Vázquez ◽

María Novo-Lourés ◽

Reyes Pavón ◽

Rosalía Laza ◽

José Ramón Méndez ◽

...

Keyword(s):

Web Application ◽

Research Data ◽

Data Sets ◽

Data Repositories ◽

Software Applications ◽

Public Data ◽

Protection Mechanisms ◽

Experimental Protocols ◽

Learning Research ◽

Processing Steps

Current research has evolved in such a way scientists must not only adequately describe the algorithms they introduce and the results of their application, but also ensure the possibility of reproducing the results and comparing them with those obtained through other approximations. In this context, public data sets (sometimes shared through repositories) are one of the most important elements for the development of experimental protocols and test benches. This study has analysed a significant number of CS/ML ( Computer Science/ Machine Learning) research data repositories and data sets and detected some limitations that hamper their utility. Particularly, we identify and discuss the following demanding functionalities for repositories: (1) building customised data sets for specific research tasks, (2) facilitating the comparison of different techniques using dissimilar pre-processing methods, (3) ensuring the availability of software applications to reproduce the pre-processing steps without using the repository functionalities and (4) providing protection mechanisms for licencing issues and user rights. To show the introduced functionality, we created STRep (Spam Text Repository) web application which implements our recommendations adapted to the field of spam text repositories. In addition, we launched an instance of STRep in the URL https://rdata.4spam.group to facilitate understanding of this study.

Download Full-text

Fast and accurate detection of surface defect based on improved YOLOv4

Assembly Automation ◽

10.1108/aa-04-2021-0044 ◽

2021 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Jiawei Lian ◽

Junhong He ◽

Yun Niu ◽

Tianze Wang

Keyword(s):

Feature Extraction ◽

Real Time ◽

Surface Defect ◽

Steel Ingot ◽

Industrial Applications ◽

Data Sets ◽

Data Set ◽

Processing Technologies ◽

Content Type ◽

Public Data

Purpose The current popular image processing technologies based on convolutional neural network have the characteristics of large computation, high storage cost and low accuracy for tiny defect detection, which is contrary to the high real-time and accuracy, limited computing resources and storage required by industrial applications. Therefore, an improved YOLOv4 named as YOLOv4-Defect is proposed aim to solve the above problems. Design/methodology/approach On the one hand, this study performs multi-dimensional compression processing on the feature extraction network of YOLOv4 to simplify the model and improve the feature extraction ability of the model through knowledge distillation. On the other hand, a prediction scale with more detailed receptive field is added to optimize the model structure, which can improve the detection performance for tiny defects. Findings The effectiveness of the method is verified by public data sets NEU-CLS and DAGM 2007, and the steel ingot data set collected in the actual industrial field. The experimental results demonstrated that the proposed YOLOv4-Defect method can greatly improve the recognition efficiency and accuracy and reduce the size and computation consumption of the model. Originality/value This paper proposed an improved YOLOv4 named as YOLOv4-Defect for the detection of surface defect, which is conducive to application in various industrial scenarios with limited storage and computing resources, and meets the requirements of high real-time and precision.

Download Full-text

Feature Extraction From Time Resolved Reacting Flow Data Sets

Volume 4B: Combustion, Fuels, and Emissions ◽

10.1115/gt2018-77051 ◽

2018 ◽

Cited By ~ 2

Author(s):

H. Ek ◽

I. Chterev ◽

N. Rock ◽

B. Emerson ◽

J. Seitzman ◽

...

Keyword(s):

Swirling Flow ◽

Azimuthal Velocity ◽

Flow Fields ◽

Swirl Flow ◽

Reacting Flow ◽

Data Sets ◽

Time Resolved ◽

Recirculating Flow ◽

Flow Features ◽

Dynamical Flow

This paper presents measurements of the simultaneous fuel distribution, flame position and flow velocity in a high pressure, liquid fueled combustor. Its objective is to develop methods to process, display and compare large quantities of instantaneous data with computations. However, time-averaged flow fields rarely represent the instantaneous, dynamical flow fields in combustion systems. It is therefore important to develop methods that can algorithmically extract dynamical flow features and be directly compared between measurements and computations. While a number of data-driven approaches have been previously presented in the literature, the purpose of this paper is to propose several approaches that are based on understanding of key physical features of the flow — for this reacting swirl flow, these include the annular jet, the swirling flow which may be precessing, the recirculating flow between the annular jets, and the helical flow structures in the shear layers. This paper demonstrates nonlinear averaging of axial and azimuthal velocity profiles, which provide insights into the structure of the recirculation zone and degree of flow precession. It also presents probability fields for the location of vortex cores that enables a convenient method for comparison of their trajectory and phasing with computations. Taken together, these methods illustrate the structure and relative locations of the annular fluid jet, recirculating flow zone, spray location, flame location, and trajectory of the helical vortices.

Download Full-text

Ozone highs and associated flow features in the first half of the twentieth century in different data sets

Meteorologische Zeitschrift ◽

10.1127/0941-2948/2012/0284 ◽

2012 ◽

Vol 21 (1) ◽

pp. 49-59 ◽

Cited By ~ 7

Author(s):

Stefan Brönnimann ◽

Gilbert P. Compo

Keyword(s):

Twentieth Century ◽

Data Sets ◽

Flow Features

Download Full-text

A SELF-ORGANIZING MAP FOR MIXED CONTINUOUS AND CATEGORICAL DATA

International Journal of Computing ◽

10.47839/ijc.10.1.733 ◽

2011 ◽

pp. 24-32 ◽

Cited By ~ 1

Author(s):

Nicoleta Rogovschi ◽

Mustapha Lebbah ◽

Younès Bennani

Keyword(s):

Clustering Algorithm ◽

Clustering Algorithms ◽

Mixed Data ◽

Categorical Variables ◽

Data Sets ◽

Self Organizing Map ◽

Data Set ◽

Public Data ◽

Self Organizing

Most traditional clustering algorithms are limited to handle data sets that contain either continuous or categorical variables. However data sets with mixed types of variables are commonly used in data mining field. In this paper we introduce a weighted self-organizing map for clustering, analysis and visualization mixed data (continuous/binary). The learning of weights and prototypes is done in a simultaneous manner assuring an optimized data clustering. More variables has a high weight, more the clustering algorithm will take into account the informations transmitted by these variables. The learning of these topological maps is combined with a weighting process of different variables by computing weights which influence the quality of clustering. We illustrate the power of this method with data sets taken from a public data set repository: a handwritten digit data set, Zoo data set and other three mixed data sets. The results show a good quality of the topological ordering and homogenous clustering.

Download Full-text

Distributed training and scalability for the particle clustering method UCluster

EPJ Web of Conferences ◽

10.1051/epjconf/202125102054 ◽

2021 ◽

Vol 251 ◽

pp. 02054

Author(s):

Olga Sunneborn Gudnadottir ◽

Daniel Gedon ◽

Colin Desmarais ◽

Karl Bengtsson Bernander ◽

Raazesh Sainudiin ◽

...

Keyword(s):

Particle Physics ◽

Hadron Collider ◽

Large Data ◽

Large Data Sets ◽

Data Sets ◽

Training Time ◽

Distributed Training ◽

Machine Learning Methods ◽

Multi Class Classification

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data, that can be easily modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, and thereby extending its utility to learn from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU. It shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds the distributed training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with using parquet files for splitting data up between different compute nodes, the distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPU’s used. However, further improvements by a more exhaustive and possibly distributed hyper-parameter search is required in order to achieve the reported accuracy of the original UCluster method.

Download Full-text

SpiderSeqR: an R package for crawling the web of high-throughput multi-omic data repositories for data-sets and annotation

10.1101/2020.04.13.039420 ◽

2020 ◽

Author(s):

Anna M. Sozanska ◽

Charles Fletcher ◽

Dóra Bihary ◽

Shamith A. Samarajiwa

Keyword(s):

High Throughput ◽

R Package ◽

Data Reuse ◽

Massively Parallel ◽

Data Sets ◽

Similar Data ◽

Data Generation ◽

Data Repositories ◽

Public Data ◽

Omic Data

AbstractMore than three decades ago, the microarray revolution brought about high-throughput data generation capability to biology and medicine. Subsequently, the emergence of massively parallel sequencing technologies led to many big-data initiatives such as the human genome project and the encyclopedia of DNA elements (ENCODE) project. These, in combination with cheaper, faster massively parallel DNA sequencing capabilities, have democratised multi-omic (genomic, transcriptomic, translatomic and epigenomic) data generation leading to a data deluge in bio-medicine. While some of these data-sets are trapped in inaccessible silos, the vast majority of these data-sets are stored in public data resources and controlled access data repositories, enabling their wider use (or misuse). Currently, most peer reviewed publications require the deposition of the data-set associated with a study under consideration in one of these public data repositories. However, clunky and difficult to use interfaces, subpar or incomplete annotation prevent discovering, searching and filtering of these multi-omic data and hinder their re-purposing in other use cases. In addition, the proliferation of multitude of different data repositories, with partially redundant storage of similar data are yet another obstacle to their continued usefulness. Similarly, interfaces where annotation is spread across multiple web pages, use of accession identifiers with ambiguous and multiple interpretations and lack of good curation make these data-sets difficult to use. We have produced SpiderSeqR, an R package, whose main features include the integration between NCBI GEO and SRA databases, enabling an integrated unified search of SRA and GEO data-sets and associated annotations, conversion between database accessions, as well as convenient filtering of results and saving past queries for future use. All of the above features aim to promote data reuse to facilitate making new discoveries and maximising the potential of existing data-sets.Availabilityhttps://github.com/ss-lab-cancerunit/SpiderSeqR

Download Full-text