Source Code Authorship Identification Using Deep Neural Networks

Anna Kurtukova; Aleksandr Romanov; Alexander Shelupanov

doi:10.3390/sym12122044

Symmetry ◽

10.3390/sym12122044 ◽

2020 ◽

Vol 12 (12) ◽

pp. 2044 ◽

Cited By ~ 2

Author(s):

Anna Kurtukova ◽

Aleksandr Romanov ◽

Alexander Shelupanov

Keyword(s):

Software Engineering ◽

Programming Languages ◽

Language Processing ◽

Source Code ◽

Research Area ◽

The Public ◽

Common Basis ◽

Average Accuracy ◽

Authorship Identification ◽

Project Lifecycle

Many open-source projects are developed by the community and have a common basis. The more source code is open, the more the project is open to contributors. The possibility of accidental or deliberate use of someone else’s source code as closed functionality in another project (even a commercial one) is not excluded. This situation could create copyright disputes. Adding a plagiarism check to the project lifecycle during software engineering solves this problem. However, not all code samples for comparison can be found in the public domain. In this case, methods of identifying the source code author can be useful. Therefore, identifying the source code author is an important problem in software engineering, and it is also a research area in symmetry. This article discusses the problem of identifying the source code author and modern methods of solving it. Based on the experience of researchers in the field of natural language processing (NLP), the authors propose their technique based on a hybrid neural network and demonstrate its results both for simple cases of determining the authorship of the code and for those complicated by obfuscation and the use of coding standards. The results show that the authors’ technique successfully solves the essential problems of its analogs and can be effective even in cases where there are no obvious signs indicating authorship. The average accuracy obtained for all programming languages was 95% in the simple case and exceeded 80% in the complicated ones.

COMPARISON OF SOFTWARE COMPLEXITY OF SEARCH ALGORITHM USING CODE BASED COMPLEXITY METRICS

International Journal of Engineering Applied Sciences and Technology ◽

10.33564/ijeast.2021.v06i05.003 ◽

2021 ◽

Vol 6 (5) ◽

Author(s):

Bello Muriana ◽

Ogba Paul Onuh

Keyword(s):

Software Engineering ◽

Programming Languages ◽

Search Algorithm ◽

Source Code ◽

Search Algorithms ◽

Binary Search ◽

Software Systems ◽

Software Complexity ◽

Complexity Metrics ◽

Binary Search Algorithm

Measures of software complexity are an essential part of software engineering. Complexity metrics can be used to forecast key information regarding the testability, reliability, and manageability of software systems from study of the source code. This paper presents the results of three distinct software complexity metrics that were applied to two searching algorithms (linear and binary search). The goal is to compare the complexity of linear and binary search algorithms implemented in Python, Java, and C++ and to measure the sample algorithms using lines of code, McCabe, and Halstead metrics. The findings indicate that the Halstead program difficulty has its minimal value for both linear and binary search when implemented in Python. Analysis of Variance (ANOVA) was adopted to determine whether there are any statistically significant differences between the search algorithms when implemented in the three programming languages. It revealed that the three programming languages do not vary considerably for either linear or binary search, which implies that any of the three is suitable for coding linear and binary search algorithms.
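As a rough illustration of two of the metrics named in this abstract, the sketch below counts lines of code and a McCabe-style cyclomatic complexity for linear and binary search written in Python. It is not the paper's measurement procedure: the search implementations, the helper names `lines_of_code` and `cyclomatic_complexity`, and the counting rule (decision points plus one) are assumptions chosen for illustration.

```python
import ast

# Hypothetical sample implementations, kept as source text so they can be measured.
LINEAR_SEARCH_SRC = '''
def linear_search(items, target):
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1
'''

BINARY_SEARCH_SRC = '''
def binary_search(items, target):
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1
'''

# Node types treated as decision points (an assumption; tools differ on details).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def lines_of_code(source):
    """Count non-blank, non-comment lines."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def cyclomatic_complexity(source):
    """Approximate McCabe complexity as (number of decision points) + 1."""
    decisions = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DECISION_NODES):
            decisions += 1
        elif isinstance(node, ast.BoolOp):
            # Each extra operand of and/or adds one branch.
            decisions += len(node.values) - 1
    return decisions + 1


# Turn the source text into callable functions so the samples can also be run.
namespace = {}
exec(LINEAR_SEARCH_SRC, namespace)
exec(BINARY_SEARCH_SRC, namespace)
linear_search = namespace["linear_search"]
binary_search = namespace["binary_search"]

if __name__ == "__main__":
    for name, src in (("linear", LINEAR_SEARCH_SRC), ("binary", BINARY_SEARCH_SRC)):
        print(name, lines_of_code(src), cyclomatic_complexity(src))
```

Halstead metrics would additionally require counting distinct operators and operands, which this sketch leaves out.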

Statistical Unigram Analysis for Source Code Repository

International Journal of Semantic Computing ◽

10.1142/s1793351x18400123 ◽

2018 ◽

Vol 12 (02) ◽

pp. 237-260

Author(s):

Weifeng Xu ◽

Dianxiang Xu ◽

Abdulrahman Alatawi ◽

Omar El Ariss ◽

Yunkai Liu

Keyword(s):

Natural Language Processing ◽

Empirical Study ◽

Natural Language ◽

Programming Languages ◽

Language Processing ◽

Probabilistic Model ◽

Source Code ◽

Code Analysis ◽

Domain Specific ◽

Language Corpus

The unigram is a fundamental element of the n-gram in natural language processing. However, unigrams collected from a natural language corpus are unsuitable for solving problems in the domain of computer programming languages. In this paper, we analyze the properties of unigrams collected from an ultra-large source code repository. Specifically, we have collected 1.01 billion unigrams from 0.7 million open source projects hosted at GitHub.com. By analyzing these unigrams, we have discovered statistical properties regarding (1) how developers name variables, methods, and classes, and (2) how developers choose abbreviations. We describe a probabilistic model which relies on these properties for solving a well-known problem in source code analysis: how to expand a given abbreviation to its original intended word. Our empirical study shows that using the unigrams extracted from a source code repository outperforms using a natural language corpus by 21% when solving domain-specific problems.
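The abbreviation-expansion problem mentioned in this abstract can be sketched with a very simple frequency-based heuristic. This is not the paper's probabilistic model: the matching rule (same first letter, abbreviation letters appear in order), the function names, and the toy counts in `TOY_UNIGRAM_COUNTS` are all assumptions made up for illustration.

```python
from collections import Counter

# Toy unigram counts standing in for counts mined from source code (invented values).
TOY_UNIGRAM_COUNTS = Counter(
    {"message": 120, "manager": 40, "merge": 15, "string": 300, "storage": 30, "index": 200}
)


def is_subsequence(abbr, word):
    """True if the letters of abbr appear in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in abbr)


def expand(abbr, unigram_counts):
    """Return (best_word, relative_frequency) or None if nothing matches.

    Candidates share the abbreviation's first letter and contain its letters
    in order; the most frequent candidate wins.
    """
    abbr = abbr.lower()
    candidates = {
        word: count
        for word, count in unigram_counts.items()
        if word != abbr and word[:1] == abbr[:1] and is_subsequence(abbr, word)
    }
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best, candidates[best] / sum(candidates.values())


if __name__ == "__main__":
    for abbreviation in ("msg", "str", "idx", "xyz"):
        print(abbreviation, expand(abbreviation, TOY_UNIGRAM_COUNTS))
```

The point of the sketch is only that counts drawn from identifiers rank candidates such as "string" and "index" differently than counts drawn from ordinary prose would.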

Large-scale and Robust Code Authorship Identification with Deep Feature Learning

ACM Transactions on Privacy and Security ◽

10.1145/3461666 ◽

2021 ◽

Vol 24 (4) ◽

pp. 1-35

Author(s):

Mohammed Abuhamad ◽

Tamer Abuhmed ◽

David Mohaisen ◽

Daehun Nyang

Keyword(s):

Programming Languages ◽

Real World ◽

Large Scale ◽

Source Code ◽

Feature Learning ◽

Identification Accuracy ◽

Authorship Attribution ◽

Deep Feature ◽

Public Repositories ◽

Authorship Identification

Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires an efficient extraction of authorship attributes. The extraction of such attributes is very challenging, due to the variety of software code formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of attributes is bounded by the availability of software samples: a certain number of samples per author and a specific size for each sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach incorporates the process of learning deep authorship attribution using a recurrent neural network and an ensemble random forest classifier for scalability to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results of our work show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach allows us to identify 8,903 GCJ authors, the largest-scale dataset used by far, with an accuracy of 92.3%. Using the real-world dataset, we achieved an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics, and thus it can identify authors of four programming languages (C, C++, Java, and Python) and authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using C Tigress) with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves 95.74% for identifying 1,500 programmers of software binaries. Similar results were obtained when software binaries were generated with different compilation options and optimization levels, and with symbol information removed. Moreover, our approach achieves 93.86% for identifying 1,500 programmers of binaries obfuscated using all features adopted in the Obfuscator-LLVM tool.

A Review and evaluation of Machine Translation methods for Lumasaaba

Journal of Digital Science ◽

10.33847/2686-8296.2.1_1 ◽

2020 ◽

pp. 3-17

Author(s):

Peter Nabende

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Machine Translation ◽

Language Processing ◽

Research Area ◽

Data Driven ◽

East African ◽

Data Set ◽

African Languages ◽

Translation Methods

Natural Language Processing for under-resourced languages is now a mainstream research area. However, there are limited studies on Natural Language Processing applications for many indigenous East African languages. As a contribution to closing the current knowledge gap, this paper focuses on evaluating the application of well-established machine translation methods to one heavily under-resourced indigenous East African language called Lumasaaba. Specifically, we review the most common machine translation methods in the context of Lumasaaba, including both rule-based and data-driven methods. Then we apply a state-of-the-art data-driven machine translation method to learn models for automating translation between Lumasaaba and English using a very limited data set of parallel sentences. Automatic evaluation results show that a transformer-based Neural Machine Translation model architecture leads to consistently better BLEU scores than the recurrent neural network-based models. Moreover, the automatically generated translations can be comprehended to a reasonable extent and are usually associated with the source language input.

Deep Learning Based High-Resolution Remote Sensing Image classification

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse.v7i10.384 ◽

2017 ◽

Vol 7 (10) ◽

pp. 22

Author(s):

Sumit Kaur

Keyword(s):

Machine Learning ◽

Remote Sensing ◽

Deep Learning ◽

Image Classification ◽

Language Processing ◽

Object Perception ◽

Remote Sensing Image ◽

Research Area ◽

Remote Sensing Image Classification ◽

Unsupervised Algorithms

Deep learning is an emerging research area in the machine learning and pattern recognition field, presented with the goal of drawing machine learning nearer to one of its original objectives, artificial intelligence. It tries to mimic the human brain, which is capable of processing and learning from complex input data and solving different kinds of complicated tasks well. Deep learning (DL) is based on a set of supervised and unsupervised algorithms that attempt to model higher-level abstractions in data and learn hierarchical representations for classification. In recent years, it has attracted much attention due to its state-of-the-art performance in diverse areas like object perception, speech recognition, computer vision, collaborative filtering, and natural language processing. This paper presents a survey of different deep learning techniques for remote sensing image classification.

PERANGKAT LUNAK KOMPUTER

10.31219/osf.io/tjbfr ◽

2020 ◽

Author(s):

Cut Nabilah Damni

Keyword(s):

Programming Languages ◽

Programming Language ◽

Operating Systems ◽

Source Code ◽

Computer Software ◽

Computer Programs ◽

Application Systems ◽

Executable Programs

Computer software is a collection of instructions (programs or procedures) that carries out work automatically by processing the instructions (data) it is given (Yahfizham, 2019: 19). Most computer software is made by programmers using a programming language. Programmers write commands in a programming language much as people use ordinary language in conversation. These commands are called source code. Other computer programs, called compilers, are applied to the source code and translate the commands into a language the computer understands; the result is called an executable program (EXE). Basically, computers always have computer software consisting of operating systems, application systems, and programming languages.

Information technology. Programming languages, their environments and system software interfaces. Code signing for source code

10.3403/30278581u ◽

2015 ◽

Keyword(s):

Information Technology ◽

Programming Languages ◽

Source Code ◽

System Software ◽

Software Interfaces

Information technology. Programming languages, their environments and system software interfaces. Code signing for source code

10.3403/30278581 ◽

2015 ◽

Keyword(s):

Information Technology ◽

Programming Languages ◽

Source Code ◽

System Software ◽

Software Interfaces

Global and Latin American female participation in evidence-based software engineering: a systematic mapping study

Journal of the Brazilian Computer Society ◽

10.1186/s13173-021-00109-7 ◽

2021 ◽

Vol 27 (1) ◽

Author(s):

Katia Romero Felizardo ◽

Amanda Möhring Ramos ◽

Claudia de O. Melo ◽

Érica Ferreira de Souza ◽

Nandamudi L. Vijaykumar ◽

...

Keyword(s):

Software Engineering ◽

Latin American ◽

Gender Issue ◽

Research Area ◽

Systematic Mapping Study ◽

Evidence Based ◽

Mapping Study ◽

Systematic Mapping ◽

Participation Of Women ◽

New Generation

Context: While the digital economy requires a new generation of technology for scientists and practitioners, the software engineering (SE) field faces a gender crisis. SE research is a global enterprise that requires the participation of both genders for the advancement of science and evidence-based practice. However, women across the world tend to be significantly underrepresented in such research, receiving less funding and frequently participating less than men as authors of research publications. Data about this phenomenon is still sparse and incomplete; particularly in evidence-based software engineering (EBSE), there are no studies that analyze the participation of women in this research area. Objective: The objective of this work is to present the results of a systematic mapping study (SM) conducted to collect and evaluate evidence on female researchers who have contributed to the area of EBSE. Method: Our SM was performed by manually searching studies in the major conferences and journals of EBSE. We identified 981 studies, of which 183 were authored or co-authored by women and, therefore, included. Results: Contributions from women in secondary studies have globally increased over the years, but they are still concentrated in European countries. Additionally, collaboration among research groups is still fragile, relying on a few women as a bridge. Latin American researchers contribute a great deal to the field, although they do not collaborate as much within their region. Conclusions: The findings from this study are expected to add to the existing knowledge about women’s contribution to the EBSE area. We expect that our results prompt reflection on the gender issue and motivate actions and policies to attract female researchers to this area.

Deep Structured Learning for Natural Language Processing

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3433538 ◽

2021 ◽

Vol 20 (3) ◽

pp. 1-14

Author(s):

Yong Li ◽

Xiaojun Yang ◽

Min Zuo ◽

Qingyu Jin ◽

Haisheng Li ◽

...

Keyword(s):

Public Opinion ◽

Food Safety ◽

Language Processing ◽

Early Warning ◽

Conditional Random Field ◽

Semantic Features ◽

Related Sequence ◽

The Public ◽

Network Public Opinion

The real-time nature and rapid dissemination of network information make net-mediated public opinion an increasingly important resource for food safety early warning, but petabyte (PB)-scale data growth also makes it difficult to study and assess network public opinion, especially to extract event roles from these data and to analyze the sentiment tendency of public comments. First, this article takes food safety network public opinion as its research subject and proposes a BLSTM-CRF model that automatically labels event roles by combining BLSTM with a conditional random field. Second, an attention mechanism based on food safety vocabulary is introduced, distance-related sequence semantic features are extracted by BLSTM, and sentiment classification of those features is performed with a CNN, yielding an Att-BLSTM-CNN model for analyzing public opinion and sentiment tendency in the field of food safety. Finally, based on the time series, this article combines event role extraction with sentiment tendency analysis and constructs a net-mediated public opinion early warning model for food safety according to the heat of an event and the intensity of public sentiment toward it.
