Semantic-Based Representation Binary Clone Detection for Cross-Architectures in the Internet of Things

Zhenhao Luo; Baosheng Wang; Yong Tang; Wei Xie

doi:10.3390/app9163283

Applied Sciences ◽

10.3390/app9163283 ◽

2019 ◽

Vol 9 (16) ◽

pp. 3283 ◽

Cited By ~ 3

Author(s):

Zhenhao Luo ◽

Baosheng Wang ◽

Yong Tang ◽

Wei Xie

Keyword(s):

Internet Of Things ◽

Language Processing ◽

Semantic Representation ◽

Compiler Optimization ◽

Symbolic Execution ◽

Detection Methods ◽

Intermediate Representation ◽

Clone Detection ◽

Code Reuse ◽

Representation Model

Code reuse is widespread in software development and in Internet of Things (IoT) devices. However, it introduces many problems, e.g., software plagiarism and known vulnerabilities. Solving these problems requires extensive manual reverse analysis. Binary clone detection can reduce this manual work by matching reused code against known code. However, many binary clone detection methods are not robust to different compiler optimization options and different architectures. While some clone detection methods can be applied across architectures, they rely on manually designed features based on human prior knowledge to generate feature vectors for assembly functions, and they fail to consider the internal associations between features from a semantic perspective. To address this problem, we propose and implement GeneDiff, a prototype of a semantic-representation-based binary clone detection approach for cross-architecture analysis. GeneDiff uses a representation model based on natural language processing (NLP) to generate a high-dimensional numeric vector for each function from its Valgrind intermediate representation (VEX). This is the first work that translates assembly instructions into an intermediate representation and uses a semantic representation model to implement cross-architecture clone detection. GeneDiff is robust to various compiler optimization options and different architectures. Compared to approaches using symbolic execution, GeneDiff is significantly more efficient and accurate. The area under the receiver operating characteristic (ROC) curve (AUC) of GeneDiff reaches 92.35%, which is considerably higher than that of the approaches that use symbolic execution. Extensive experiments indicate that GeneDiff can detect similarity with high accuracy even when the code has been compiled with different optimization options and targeted to different architectures. We also use real-world IoT firmware across different architectures as targets, demonstrating the practicality of GeneDiff in detecting known vulnerabilities.
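The abstract describes comparing per-function numeric vectors and reporting an ROC AUC. As a minimal sketch only, assuming cosine similarity as the comparison (the abstract does not name the metric) and using hypothetical helper names, the scoring and evaluation steps might look like this; it is not GeneDiff's implementation:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two function vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def roc_auc(scores, labels):
    """AUC as the probability that a true clone pair (label 1) receives a
    higher score than a non-clone pair (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one clone and one non-clone pair")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

In this sketch, each candidate function pair is scored with `cosine_similarity`, and `roc_auc` summarizes how well those scores separate known clone pairs from non-clone pairs.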

Improving Syntactical Clone Detection Methods through the Use of an Intermediate Representation

2020 IEEE 14th International Workshop on Software Clones (IWSC) ◽

10.1109/iwsc50091.2020.9047637 ◽

2020 ◽

Author(s):

Pedro M. Caldeira ◽

Kazunori Sakamoto ◽

Hironori Washizaki ◽

Yoshiaki Fukazawa ◽

Takahisa Shimada

Keyword(s):

Detection Methods ◽

Intermediate Representation ◽

Clone Detection

Semantic Representation With Heterogeneous Information Network Using Matrix Factorization for Clustering in the Internet of Things

IEEE Access ◽

10.1109/access.2019.2903310 ◽

2019 ◽

Vol 7 ◽

pp. 31233-31242 ◽

Cited By ~ 3

Author(s):

Liang Hu ◽

Yanlei Gong ◽

Yongheng Xing ◽

Feng Wang

Keyword(s):

Internet Of Things ◽

Matrix Factorization ◽

Semantic Representation ◽

The Internet ◽

Information Network ◽

Heterogeneous Information Network ◽

Heterogeneous Information ◽

The Internet Of Things

Automatically Representing TExt Meaning via an Interlingua-based System (ARTEMIS). A further step towards the computational representation of RRG

Journal of Computer-Assisted Linguistic Research ◽

10.4995/jclr.2017.7788 ◽

2017 ◽

Vol 1 (1) ◽

pp. 61 ◽

Cited By ~ 1

Author(s):

Ricardo Mairal-Usón ◽

Francisco Cortés-Rodríguez

Keyword(s):

Language Processing ◽

Semantic Representation ◽

Analysis Data ◽

Logical Structure ◽

Automatic Generation ◽

Role And Reference Grammar ◽

Reference Grammar ◽

Level 1 ◽

Computational Resources ◽

Computational Representation

Within the framework of FUNK Lab, a virtual laboratory for natural language processing inspired by a functionally oriented linguistic theory, Role and Reference Grammar (RRG), a number of computational resources have been built that deal with different aspects of language and have applications in different scientific domains, e.g., terminology, lexicography, sentiment analysis, document classification, text analysis, and data mining. One of these resources is ARTEMIS (Automatically Representing TExt Meaning via an Interlingua-Based System), which takes as its starting point the pioneering work of Periñán-Pascual (2013) and Periñán-Pascual & Arcas (2014). This computational tool is a proof-of-concept prototype that allows the automatic generation of a conceptual logical structure (CLS) (cf. Mairal-Usón, Periñán-Pascual and Pérez 2012; Van Valin and Mairal-Usón 2014), that is, a fully specified semantic representation of an input text on the basis of a reduced sample of sentences. The primary aim of this paper is to develop the syntactic rules that form part of the computational grammar for the representation of simple clauses in English. More specifically, this work focuses on the format of those syntactic rules that account for the upper levels of the RRG Layered Structure of the Clause (LSC), that is, the core (and the level-1 construction associated with it), the clause, and the sentence (Van Valin 2005). In essence, this analysis, together with that in Cortés-Rodríguez and Mairal-Usón (2016), offers an almost complete description of the computational grammar behind the LSC for simple clauses.

Detection of the Hardcoded Login Information from Socket and String Compare Symbols

Annals of Emerging Technologies in Computing ◽

10.33166/aetic.2021.01.003 ◽

2021 ◽

Vol 5 (1) ◽

pp. 28-39

Author(s):

Minami Yoda ◽

Shuji Sakuraba ◽

Yuichi Sei ◽

Yasuyuki Tahara ◽

Akihiko Ohsuga

Keyword(s):

Internet Of Things ◽

Static Analysis ◽

Real World ◽

Symbolic Execution ◽

The Internet ◽

User Input ◽

Network Function ◽

Private Data ◽

String Search ◽

Iot Devices

The Internet of Things (IoT) for smart homes enhances convenience; however, it also introduces the risk of leaking private data. The OWASP IoT Top 10 for 2018 lists "Weak, easy to predict, or embedded passwords" as the first vulnerability. This problem poses a risk because a user cannot fix, change, or detect a password that is embedded in firmware; only the firmware developer can control updates. In this study, we propose a lightweight static-analysis method, consisting of Socket Search and String Search, to detect hardcoded usernames and passwords in IoT devices and thereby protect against this first vulnerability. Hardcoded login information can be found where the user input is compared using strcmp or strncmp. Previous studies analyzed the strcmp and strncmp symbols to detect hardcoded login information. However, those studies required a lot of time because they used complicated algorithms such as symbolic execution. To develop a lightweight algorithm, we focus on network functions, such as the socket symbol in firmware, because an IoT device is compromised when someone invades it via the Internet. We propose two methods to detect hardcoded login information: string search and socket search. In string search, the algorithm finds functions that use the strcmp or strncmp symbol. In socket search, the algorithm finds functions that are referenced by the socket symbol. In our experiment, we measured the ability of the proposed methods by searching six real-world firmware images that contain a backdoor. We ran three methods (string search, socket search, and whole search) to compare the two proposed methods against a whole search. All methods found login information in five of the six firmware images, as well as one unexpected password. Our methods reduce the analysis time: the whole search generally takes 38 min to complete, whereas our methods finish in 4-6 min.
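The two searches described above can be illustrated on a toy call map. This is a rough sketch under stated assumptions: the `calls` dictionary (function name to the symbols it references) is a hypothetical stand-in for what a disassembler would extract from firmware, the function names are invented, and "referenced by the socket symbol" is read here simply as "calls `socket` directly". It is not the authors' tool.

```python
# Symbols whose use suggests a comparison against user input.
COMPARE_SYMBOLS = {"strcmp", "strncmp"}


def string_search(calls):
    """Return functions that reference strcmp or strncmp."""
    return sorted(
        name for name, symbols in calls.items()
        if COMPARE_SYMBOLS & set(symbols)
    )


def socket_search(calls):
    """Return functions that reference the socket symbol
    (one simple interpretation of the abstract's description)."""
    return sorted(
        name for name, symbols in calls.items()
        if "socket" in symbols
    )


# Hypothetical call map for a small firmware image.
example_calls = {
    "main": ["socket", "bind", "handle_client"],
    "handle_client": ["recv", "check_login"],
    "check_login": ["strcmp"],
    "log_message": ["printf"],
}
```

With `example_calls`, `string_search` flags `check_login` and `socket_search` flags `main`; both narrow the set of functions an analyst would inspect compared with examining every function.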

A multi-label text classification method via dynamic semantic representation model and deep neural network

Applied Intelligence ◽

10.1007/s10489-020-01680-w ◽

2020 ◽

Vol 50 (8) ◽

pp. 2339-2351 ◽

Cited By ~ 5

Author(s):

Tianshi Wang ◽

Li Liu ◽

Naiwen Liu ◽

Huaxiang Zhang ◽

Long Zhang ◽

...

Keyword(s):

Neural Network ◽

Text Classification ◽

Deep Neural Network ◽

Semantic Representation ◽

Classification Method ◽

Dynamic Semantic ◽

Representation Model

Weighting-based semantic similarity measure based on topological parameters in semantic taxonomy

Natural Language Engineering ◽

10.1017/s1351324918000190 ◽

2018 ◽

Vol 24 (6) ◽

pp. 861-886 ◽

Cited By ~ 3

Author(s):

Abdulgabbar Saif ◽

Ummi Zakiah Zainodin ◽

Nazlia Omar ◽

Abdullah Saeed Ghareb

Keyword(s):

Semantic Similarity ◽

Similarity Measure ◽

Language Processing ◽

Knowledge Engineering ◽

Semantic Representation ◽

Semantic Similarity Measure ◽

Topological Parameters ◽

Research Areas ◽

Comparison Results ◽

Feature Based

Semantic measures are used in handling different issues in several research areas, such as artificial intelligence, natural language processing, knowledge engineering, bioinformatics, and information retrieval. Hierarchical feature-based semantic measures have been proposed to estimate the semantic similarity between two concepts/words depending on the features extracted from a semantic taxonomy (hierarchy) of a given lexical source. The central issue in these measures is the constant weighting assumption that all elements in the semantic representation of the concept possess the same relevance. In this paper, a new weighting-based semantic similarity measure is proposed to address the issues in hierarchical feature-based measures. Four mechanisms are introduced to weigh the degree of relevance of features in the semantic representation of a concept by using topological parameters (edge, depth, descendants, and density) in a semantic taxonomy. With the semantic taxonomy of WordNet, the proposed semantic measure is evaluated for word semantic similarity in four gold-standard datasets. Experimental results show that the proposed measure outperforms hierarchical feature-based semantic measures in all the datasets. Comparison results also imply that the proposed measure is more effective than information-content measures in measuring semantic similarity.
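To make the contrast with constant weighting concrete, the following is an illustrative sketch only. It assumes a tiny hand-made taxonomy (not WordNet), takes a concept's features to be the concept plus its ancestors, and uses a hypothetical depth-based weight (depth + 1) inside a weighted Jaccard ratio. The paper's four weighting mechanisms are not reproduced here.

```python
# Toy taxonomy (hypothetical): child -> parent, with a single root.
PARENT = {
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
}


def ancestors_and_self(concept):
    """Feature set of a concept: the concept plus all of its ancestors."""
    features = {concept}
    while concept in PARENT:
        concept = PARENT[concept]
        features.add(concept)
    return features


def depth(concept):
    """Number of edges from the root to the concept."""
    return len(ancestors_and_self(concept)) - 1


def weighted_similarity(a, b):
    """Weighted Jaccard: deeper (more specific) shared features count more.
    The weight depth + 1 is an assumption made for illustration."""
    fa, fb = ancestors_and_self(a), ancestors_and_self(b)
    shared = sum(depth(f) + 1 for f in fa & fb)
    total = sum(depth(f) + 1 for f in fa | fb)
    return shared / total
```

Under this toy weighting, `dog` and `cat` share the relatively specific feature `animal` and score 1/3, whereas `dog` and `car` share only the root and score 1/11.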

Enhancing Source-Based Clone Detection Using Intermediate Representation

2010 17th Working Conference on Reverse Engineering ◽

10.1109/wcre.2010.33 ◽

2010 ◽

Cited By ~ 17

Author(s):

Gehan M.K. Selim ◽

King Chun Foo ◽

Ying Zou

Keyword(s):

Intermediate Representation ◽

Clone Detection

Efficient Large-Scale Stance Detection in Tweets

Deep Learning and Neural Networks ◽

10.4018/978-1-7998-0414-7.ch037 ◽

2020 ◽

pp. 667-683

Author(s):

Yilin Yan ◽

Jonathan Chen ◽

Mei-Ling Shyu

Keyword(s):

Deep Learning ◽

Language Processing ◽

Large Scale ◽

Research Direction ◽

Detection Methods ◽

Use Case ◽

Learning Techniques ◽

Test Use ◽

Presidential Election Campaign ◽

Important Research Direction

Stance detection is an important research direction that attempts to automatically determine the attitude (positive, negative, or neutral) of the author of a text (such as a tweet) towards a target. A number of frameworks using deep learning techniques have been proposed that show promising results in application domains such as automatic speech recognition and computer vision, as well as natural language processing (NLP). This article presents a novel deep learning-based fast stance detection framework in bipolar affinities on Twitter. Millions of tweets regarding Clinton and Trump were produced per day on Twitter during the 2016 United States presidential election campaign, and thus this campaign is used as a test use case because of its significant and unique counter-factual properties. In addition, stance detection can be utilized to infer the political tendency of the general public. Experimental results show that the proposed framework achieves high accuracy compared to several existing stance detection methods.

A Novel Sensor Data Pre-Processing Methodology for the Internet of Things Using Anomaly Detection and Transfer-By-Subspace-Similarity Transformation

Sensors ◽

10.3390/s19204536 ◽

2019 ◽

Vol 19 (20) ◽

pp. 4536 ◽

Cited By ~ 2

Author(s):

Yan Zhong ◽

Simon Fong ◽

Shimin Hu ◽

Raymond Wong ◽

Weiwei Lin

Keyword(s):

Data Mining ◽

Internet Of Things ◽

Anomaly Detection ◽

Sensor Data ◽

Detection Methods ◽

The Internet ◽

Quality Of Data ◽

Data Mining Algorithms ◽

Sensing Applications ◽

The Internet Of Things

The Internet of Things (IoT) and sensors are becoming increasingly popular, especially in monitoring large and ambient environments. Applications that embrace IoT and sensors often require mining the data feeds that are collected at frequent intervals for intelligence. Although such sensor data are massive, most of the data contents are identical and repetitive; for example, human traffic in a park at night. Most of the traditional classification algorithms were originally formulated decades ago, and they were not designed to handle such sensor data effectively. Hence, the performance of the learned model is often poor because of the small granularity in classification and the sporadic patterns in the data. To improve the quality of data mining from IoT data, a new pre-processing methodology based on subspace similarity detection is proposed. Our method can be well integrated with traditional data mining algorithms and anomaly detection methods. The pre-processing method is flexible enough to handle similar kinds of sporadic sensor data that exist in many ambient sensing applications. The proposed methodology is evaluated by extensive experiments with a collection of classical data mining models. An improvement in the precision rate is shown when using the proposed method.
