How are functionally similar code clones syntactically different? An empirical study and a benchmark

10.7287/peerj.preprints.1516v2 ◽

2016 ◽

Author(s):

Stefan Wagner ◽

Asim Abdulkhaleq ◽

Ivan Bogicevic ◽

Jan-Peter Ostberg ◽

Jasmin Ramadani

Keyword(s):

Data Structure ◽

Empirical Study ◽

Random Sample ◽

Source Code ◽

Clone Detection ◽

Code Clones ◽

Syntactic Similarity ◽

Syntactic Differences ◽

Similar Code

Background. Today, redundancy in source code caused by copy&paste, so-called “clones”, can be found reliably using clone detection tools. However, redundancy can also arise independently, without copy&paste. At present, it is not clear how clones that are only functionally similar (FSCs) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactic differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in fewer than 16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.
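To make the notion of a functionally similar clone concrete, the following is a minimal sketch rather than the study's setup. It assumes Python instead of the Java and C programs used in the experiment, two hypothetical contest-style solutions (`LOOP_VERSION`, `FORMULA_VERSION`), and a simple token-set Jaccard measure as a stand-in for the traditional detection tools, which the abstract does not name.

```python
import io
import tokenize

# Two hypothetical solutions to the same task: the sum of 1..n.
LOOP_VERSION = """
def total(n):
    result = 0
    for i in range(1, n + 1):
        result += i
    return result
"""

FORMULA_VERSION = """
def total(n):
    return n * (n + 1) // 2
"""

# Token types that carry layout only and are ignored in the comparison.
LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
          tokenize.DEDENT, tokenize.COMMENT, tokenize.ENDMARKER}


def token_set(source):
    """Return the set of distinct lexical token strings in the source."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string for tok in tokens if tok.type not in LAYOUT}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


def run(source, n):
    """Execute the source and call the `total` function it defines."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](n)


# Same observable behaviour on a range of inputs ...
behaviour_equal = all(
    run(LOOP_VERSION, n) == run(FORMULA_VERSION, n) for n in range(50)
)
# ... but only a modest overlap at the token level.
similarity = jaccard(token_set(LOOP_VERSION), token_set(FORMULA_VERSION))
print(behaviour_equal, round(similarity, 2))
```

The two versions differ in algorithm, one of the difference categories listed in the abstract, which is why a purely syntactic measure sees little in common even though the outputs agree.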

How are functionally similar code clones syntactically different? An empirical study and a benchmark

10.7287/peerj.preprints.1516 ◽

2016 ◽

Author(s):

Stefan Wagner ◽

Asim Abdulkhaleq ◽

Ivan Bogicevic ◽

Jan-Peter Ostberg ◽

Jasmin Ramadani

Keyword(s):

Data Structure ◽

Empirical Study ◽

Random Sample ◽

Source Code ◽

Clone Detection ◽

Code Clones ◽

Syntactic Similarity ◽

Syntactic Differences ◽

Similar Code

Background. Today, redundancy in source code caused by copy&paste, so-called “clones”, can be found reliably using clone detection tools. However, redundancy can also arise independently, without copy&paste. At present, it is not clear how clones that are only functionally similar (FSCs) differ from clones created by copy&paste. Our aim is to understand and categorise the syntactic differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in fewer than 16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.

How are functionally similar code clones different?

10.7287/peerj.preprints.1516v1 ◽

2015 ◽

Author(s):

Stefan Wagner ◽

Asim Abdulkhaleq ◽

Ivan Bogicevic ◽

Jan-Peter Ostberg ◽

Jasmin Ramadani

Keyword(s):

Data Structure ◽

Random Sample ◽

Source Code ◽

Clone Detection ◽

Code Clones ◽

Syntactic Similarity ◽

Similar Code

Background. Today, redundancy in source code caused by copy&paste, so-called “clones”, can be found reliably using clone detection tools. However, redundancy can also arise independently, without copy&paste. At present, it is not clear how clones that are only functionally similar (FSCs) differ from clones created by copy&paste. Our aim is to understand and categorise the differences in FSCs that distinguish them from copy&paste clones in a way that helps clone detection research. Methods. We conducted an experiment using known functionally similar programs in Java and C from coding contests. We analysed syntactic similarity with traditional detection tools and explored whether concolic clone detection can go beyond syntax. We ran all tools on 2,800 programs and manually categorised the differences in a random sample of 70 program pairs. Results. We found no FSCs where complete files were syntactically similar. We could detect a syntactic similarity in a part of the files in fewer than 16% of the program pairs. Concolic detection found 1 of the FSCs. The differences between program pairs were in the categories algorithm, data structure, OO design, I/O and libraries. We selected 58 pairs for an openly accessible benchmark representing these categories. Discussion. The majority of differences between functionally similar clones are beyond the capabilities of current clone detection approaches. Yet, our benchmark can help to drive further clone detection research.

Integrated Reasoning Engine for Code Clone Detection

ABC Journal of Advanced Research ◽

10.18034/abcjar.v3i2.575 ◽

2014 ◽

Vol 3 (2) ◽

pp. 143-152 ◽

Cited By ~ 5

Author(s):

Naresh Babu Bynagari

Keyword(s):

Clone Detection ◽

Code Clones ◽

High Pitch ◽

Detection Process ◽

Code Clone ◽

Similar Code ◽

Reasoning Engine

This article examines integrated reasoning for code clone detection and how it can be carried out effectively, given the amount of analysis usually associated with such activities. Detecting clones requires close familiarity with cloning systems and how they work. Hence, discovering similar code segments, usually regarded as code clones, is not an easy task. In particular, the detection process can serve key purposes in vulnerability discovery, refactoring, and the detection of copied code. The article shows that detecting identical code segments, usually described as code clones, is a demanding task, especially for large code bases [1; 2; 3; 4]. This kind of detection involves several established approaches and considerable technical detail. Drawing on the many sources behind this article, the reader can nonetheless identify the simplest approach for handling such difficult problems.

Peer Review #3 of "How are functionally similar code clones syntactically different? An empirical study and a benchmark (v0.1)"

10.7287/peerj-cs.49v0.1/reviews/3 ◽

2016 ◽

Keyword(s):

Empirical Study ◽

Peer Review ◽

Code Clones ◽

Similar Code

Peer Review #2 of "How are functionally similar code clones syntactically different? An empirical study and a benchmark (v0.1)"

10.7287/peerj-cs.49v0.1/reviews/2 ◽

2016 ◽

Keyword(s):

Empirical Study ◽

Peer Review ◽

Code Clones ◽

Similar Code

Peer Review #1 of "How are functionally similar code clones syntactically different? An empirical study and a benchmark (v0.1)"

10.7287/peerj-cs.49v0.1/reviews/1 ◽

2016 ◽

Keyword(s):

Empirical Study ◽

Peer Review ◽

Code Clones ◽

Similar Code

An empirical study on the maintenance of source code clones

Empirical Software Engineering ◽

10.1007/s10664-009-9108-x ◽

2009 ◽

Vol 15 (1) ◽

pp. 1-34 ◽

Cited By ~ 91

Author(s):

Suresh Thummalapenta ◽

Luigi Cerulo ◽

Lerina Aversano ◽

Massimiliano Di Penta

Keyword(s):

Empirical Study ◽

Source Code ◽

Code Clones

Enhancing the Software Clone Detection in BigCloneBench

International Journal of Open Source Software and Processes ◽

10.4018/ijossp.2021070102 ◽

2021 ◽

Vol 12 (3) ◽

pp. 17-31

Author(s):

Amandeep Kaur ◽

Munish Saini

Keyword(s):

Structural Information ◽

Source Code ◽

Maintenance Cost ◽

Clone Detection ◽

Abstract Syntax ◽

Code Clones ◽

Major Drawback ◽

Abstract Syntax Tree ◽

Detection Techniques ◽

Code Clone

In a software system, code snippets that are copied and pasted within the same software or into other software result in cloning. The basic cause of cloning is either a programmer's constraints or language constraints. The major drawback of code clones is an increase in the maintenance cost of software. So, clone detection techniques are required to remove or refactor code clones. Recent studies show that the abstract syntax tree (AST) captures the structural information of source code appropriately. Many researchers have used tree-based convolution for identifying clones, but this technique has certain drawbacks. Therefore, in this paper, the authors propose an approach that finds semantic clones through square-based convolution over an abstract syntax representation of source code. Experimental results show the effectiveness of the approach on the popular BigCloneBench benchmark.
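As a rough illustration of what "structural information" in an AST means, the sketch below compares two snippets by the shape of their syntax trees only. It is not the square-based convolution approach of the paper; it assumes Python's standard `ast` module, and the helper names (`structure`, `same_structure`) and sample snippets are hypothetical.

```python
import ast

# Hypothetical snippets: the first two differ only in identifier names,
# the third computes the same product with a different algorithm.
AREA = "def area(w, h):\n    return w * h\n"
PRODUCT = "def product(a, b):\n    return a * b\n"
REPEATED_ADD = (
    "def area(w, h):\n"
    "    total = 0\n"
    "    for _ in range(h):\n"
    "        total += w\n"
    "    return total\n"
)


def structure(node):
    """Nested tuple of node type names; identifiers and literal values,
    which are plain fields rather than child nodes, are left out."""
    children = tuple(structure(child) for child in ast.iter_child_nodes(node))
    return (type(node).__name__, children)


def same_structure(source_a, source_b):
    """True if both sources parse to trees with identical node-type shape."""
    return structure(ast.parse(source_a)) == structure(ast.parse(source_b))


print(same_structure(AREA, PRODUCT))       # renamed identifiers only
print(same_structure(AREA, REPEATED_ADD))  # different algorithm
```

An exact shape comparison like this catches renamed copies but not the third snippet, which behaves the same as the first for non-negative `h`. That gap is the kind of case learned, convolution-based models over the AST are meant to address.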

Two-Pass Technique for Clone Detection and Type Classification Using Tree-Based Convolution Neural Network

Applied Sciences ◽

10.3390/app11146613 ◽

2021 ◽

Vol 11 (14) ◽

pp. 6613

Author(s):

Young-Bin Jo ◽

Jihyun Lee ◽

Cheol-Jung Yoo

Keyword(s):

Neural Network ◽

Average Rate ◽

Convolution Neural Network ◽

Clone Detection ◽

Code Clones ◽

Classification Technique ◽

Development Costs ◽

Code Quality ◽

Type Information ◽

Type Classification

Appropriate reliance on code clones significantly reduces development costs and hastens the development process. Reckless cloning, in contrast, reduces code quality and ultimately adds costs and time. To avoid this scenario, many researchers have proposed methods for clone detection and refactoring. The developed techniques, however, are only reliably capable of detecting clones that are either entirely identical or that only use modified identifiers, and do not provide clone-type information. This paper proposes a two-pass clone classification technique that uses a tree-based convolution neural network (TBCNN) to detect multiple clone types, including clones that are not wholly identical or to which only small changes have been made, and automatically classify them by type. Our method was validated with BigCloneBench, a well-known and widely used dataset of cloned code. Our experimental results show that our technique detected clones with an average rate of 96% recall and precision, and classified clones with an average rate of 78% recall and precision.
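The two easy cases mentioned above, entirely identical clones and clones that differ only in identifiers, are often called Type-1 and Type-2 in the clone literature. The following sketch shows how such a pair could be told apart with plain token normalisation. It is a simple heuristic, not the paper's TBCNN-based two-pass technique; it assumes Python's standard `tokenize` module, and the function names, labels, and sample snippets are hypothetical.

```python
import io
import keyword
import tokenize

ORIGINAL = "def add(x, y):\n    return x + y\n"
REFORMATTED = "def add(x,y):   # sum\n  return x+y\n"   # layout and comment only
RENAMED = "def plus(a, b):\n    return a + b\n"         # identifiers changed
CHANGED = "def add(x, y):\n    return x - y\n"          # operator changed

SKIPPED = (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER)
BLOCK_MARKERS = (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)


def tokens(source, normalise=False):
    """Token strings without comments or layout; with normalise=True,
    identifiers become 'ID' and number/string literals become 'LIT'."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in BLOCK_MARKERS:
            # Keep block structure but not the concrete whitespace.
            out.append(tokenize.tok_name[tok.type])
        elif normalise and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append("ID")
        elif normalise and tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        else:
            out.append(tok.string)
    return out


def classify(source_a, source_b):
    """Label a pair as 'type-1', 'type-2', or 'other'."""
    if tokens(source_a) == tokens(source_b):
        return "type-1"
    if tokens(source_a, normalise=True) == tokens(source_b, normalise=True):
        return "type-2"
    return "other"


for candidate in (REFORMATTED, RENAMED, CHANGED):
    print(classify(ORIGINAL, candidate))
```

Pairs that fall into `other` here, such as clones with small statement-level changes, are the ones for which exact token matching no longer suffices.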
