A Context-Aware Neural Embedding for Function-Level Vulnerability Detection

Algorithms ◽  
2021 ◽  
Vol 14 (11) ◽  
pp. 335
Author(s):  
Hongwei Wei ◽  
Guanjun Lin ◽  
Lin Li ◽  
Heming Jia

Exploitable vulnerabilities in software systems are major security concerns. To date, machine learning (ML) based solutions have been proposed to automate and accelerate the detection of vulnerabilities. Most ML techniques aim to flag a unit of source code, be it a line or a function, as vulnerable. We argue that a code segment is vulnerable only within certain semantic contexts, such as its control flow and data flow; therefore, it is important for detection to be context aware. In this paper, we evaluate the performance of mainstream word embedding techniques in the scenario of software vulnerability detection. Based on this evaluation, we propose a supervised framework leveraging pre-trained context-aware embeddings from language models (ELMo) to capture deep contextual representations, further summarized by a bidirectional long short-term memory (Bi-LSTM) layer for learning long-range code dependencies. The framework takes a source code function directly as input and produces corresponding function embeddings, which can be treated as feature sets for conventional ML classifiers. Experimental results showed that the proposed framework yielded the best performance in its downstream detection tasks. Using the feature representations generated by our framework, random forest and support vector machine outperformed four baseline systems on our data sets, demonstrating that the framework incorporating ELMo can effectively capture vulnerable data flow patterns and facilitate the vulnerability detection task.
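A minimal sketch of the framework's output stage: per-token contextual vectors (as an ELMo encoder would produce for one function) are summarized into a single fixed-length function embedding, which then serves as the feature set for a conventional classifier such as random forest. Mean pooling stands in here for the paper's Bi-LSTM summarization, and the token vectors are random placeholders.

```python
import random

def pool_function_embedding(token_vectors):
    """Collapse per-token contextual vectors into one fixed-length
    function embedding. Mean pooling is a simple stand-in for the
    Bi-LSTM summarization layer described in the paper."""
    dim = len(token_vectors[0])
    n = len(token_vectors)
    return [sum(v[d] for v in token_vectors) / n for d in range(dim)]

random.seed(0)
# 6 tokens of a function, each a 4-dimensional placeholder vector.
tokens = [[random.random() for _ in range(4)] for _ in range(6)]
embedding = pool_function_embedding(tokens)
print(len(embedding))  # one 4-dimensional feature vector per function
```

The resulting vector is what a downstream classifier would consume; in the paper the summarization is learned rather than fixed.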

2020 ◽  
Author(s):  
Gleyberson Andrade ◽  
Elder Cirilo ◽  
Vinicius Durelli ◽  
Bruno Cafeo ◽  
Eiji Adachi

Configurable software systems offer a variety of benefits, such as supporting easy configuration of custom behaviours for distinctive needs. However, the presence of configuration options in source code is known to complicate maintenance tasks and to require additional effort from developers when adding or editing code statements: they need to consider multiple configurations when executing tests or performing static analysis to detect vulnerabilities. Consequently, vulnerabilities have been widely reported in configurable software systems. Unfortunately, the effectiveness of vulnerability detection depends on how the multiple configurations (i.e., sample sets) are selected. In this paper, we tackle the challenge of generating more adequate system configuration samples by taking into account the intrinsic characteristics of security vulnerabilities. We propose a new sampling heuristic based on data-flow analysis for recommending the subset of configurations that should be analyzed individually. Our results show that we can achieve high vulnerability-detection effectiveness with a small sample size.
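The flavor of such a data-flow-guided sampling heuristic can be sketched as follows. The option names and the set of data-flow-relevant options are hypothetical; the paper's heuristic derives relevance from an actual data-flow analysis of the code rather than a hard-coded set.

```python
from itertools import combinations

# Hypothetical option names; in the real heuristic, relevance comes
# from a data-flow analysis identifying options that guard tainted flows.
options = ["LOGGING", "CRYPTO", "NET", "CACHE"]
dataflow_relevant = {"CRYPTO", "NET"}

def sample_configurations(options, relevant):
    """Enumerate only combinations of data-flow-relevant options,
    fixing irrelevant options off. This yields a far smaller sample
    set than exhaustive 2^n enumeration of all options."""
    samples = []
    rel = sorted(relevant)
    for r in range(len(rel) + 1):
        for combo in combinations(rel, r):
            samples.append({o: (o in combo) for o in options})
    return samples

configs = sample_configurations(options, dataflow_relevant)
print(len(configs))  # 4 configurations instead of 2**4 == 16
```

Each returned configuration would then be analyzed individually by the vulnerability detector.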


2021 ◽  
Vol 3 (2(59)) ◽  
pp. 19-23
Author(s):  
Yevhenii Kubiuk ◽  
Gennadiy Kyselov

The object of research of this work is deep learning methods for source code vulnerability detection. One of the most problematic areas is the use of only one approach in the code analysis process: either the approach based on the abstract syntax tree (AST) or the approach based on the program dependence graph (PDG). In this paper, a comparative analysis of these two approaches to source code vulnerability detection was conducted, and the neural network topologies used with each were analyzed. As a result of the comparison, the advantages and disadvantages of each approach were determined and summarized in the corresponding comparison tables. The analysis determined that the use of BLSTM (Bidirectional Long Short-Term Memory) and BGRU (Bidirectional Gated Recurrent Unit) networks gives the best results for source code vulnerability detection. The analysis also showed that the most effective approach for source code vulnerability detection systems is a method that uses an intermediate representation of the code, which yields a language-independent tool. In addition, this work proposes our own algorithm for a source code analysis system able to perform the following operations: predict a source code vulnerability, classify it, and generate a corresponding patch for the found vulnerability. A detailed analysis of the proposed system's unresolved issues is provided, which we plan to investigate in future research. The proposed system could help speed up the software development process as well as reduce the number of vulnerabilities in software code. Software developers, as well as specialists in the field of cybersecurity, can be stakeholders of the proposed system.
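To illustrate what the AST-based representation looks like in practice, the sketch below uses Python's standard `ast` module to linearize a (hypothetical) code snippet into a node-type sequence, the kind of token stream an AST-based model such as a BLSTM would consume. PDG-based approaches would instead extract control- and data-dependence edges, which require a dedicated analysis framework.

```python
import ast

# A hypothetical snippet under analysis.
code = """
def copy_buf(src, n):
    buf = src[:n]
    return buf
"""

def ast_node_sequence(source):
    """Parse the source and linearize its AST into a node-type
    sequence via breadth-first traversal (ast.walk)."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]

seq = ast_node_sequence(code)
print(seq)
```

Such sequences are typically embedded token-by-token before being fed to the recurrent network.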


Author(s):  
Joni Salminen ◽  
Maximilian Hopf ◽  
Shammur A. Chowdhury ◽  
Soon-gyo Jung ◽  
Hind Almerekhi ◽  
...  

Abstract The proliferation of social media enables people to express their opinions widely online. However, at the same time, this has resulted in the emergence of conflict and hate, making online environments uninviting for users. Although researchers have found that hate is a problem across multiple platforms, there is a lack of models for online hate detection using multi-platform data. To address this research gap, we collect a total of 197,566 comments from four platforms: YouTube, Reddit, Wikipedia, and Twitter, with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. We then experiment with several classification algorithms (Logistic Regression, Naïve Bayes, Support Vector Machines, XGBoost, and Neural Networks) and feature representations (Bag-of-Words, TF-IDF, Word2Vec, BERT, and their combination). While all the models significantly outperform the keyword-based baseline classifier, XGBoost using all features performs the best (F1 = 0.92). Feature importance analysis indicates that BERT features are the most impactful for the predictions. Findings support the generalizability of the best model, as the platform-specific results from Twitter and Wikipedia are comparable to their respective source papers. We make our code publicly available for application in real software systems as well as for further development by online hate researchers.
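One of the feature representations the study combines with BERT embeddings is TF-IDF; a minimal stdlib-only version is sketched below on toy stand-in comments. Real pipelines would use a library vectorizer with smoothing and normalization.

```python
import math
from collections import Counter

# Toy stand-ins for platform comments.
docs = [
    "you are awful",
    "great point thanks",
    "you are great",
]

def tfidf(docs):
    """Bag-of-words TF-IDF: term frequency within a document,
    weighted by inverse document frequency across the corpus."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return vectors

vecs = tfidf(docs)
print(round(vecs[0]["awful"], 3))  # rare terms get the largest weights
```

These sparse vectors, concatenated with dense embeddings, form the combined feature set fed to classifiers such as XGBoost.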


2021 ◽  
Vol 2021 ◽  
pp. 1-12
Author(s):  
Gaigai Tang ◽  
Lin Yang ◽  
Shuangyin Ren ◽  
Lianxiao Meng ◽  
Feng Yang ◽  
...  

Traditional vulnerability detection mostly relies on rules or source code similarity with manually defined vulnerability features. In practice, these rules and features are difficult to define accurately, usually cost substantial expert labor, and perform weakly in real applications. To mitigate this issue, researchers introduced neural networks to extract features automatically and improve the intelligence of vulnerability detection. The Bidirectional Long Short-term Memory (Bi-LSTM) network has proved successful for software vulnerability detection. However, due to complex context-information processing and an iterative training mechanism, the training cost of Bi-LSTM is heavy. To improve training efficiency, we propose to use the Extreme Learning Machine (ELM). The training process of ELM is noniterative, so the network converges quickly. Because ELM usually shows weak precision owing to its simple network structure, we introduce the kernel method. In the preprocessing of this framework, we use doc2vec for vector representation and multilevel symbolization for program symbolization. Experimental results show that doc2vec vector representation brings faster training and better generalization performance than word2vec. ELM converges much more quickly than Bi-LSTM, and the kernel method effectively improves the precision of ELM while preserving training efficiency.
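The noniterative character of ELM training can be seen in a small self-contained sketch: hidden-layer weights are drawn at random and never updated, and only the output weights are obtained in closed form from a regularized least-squares solve. The hidden-layer size, ridge term, and toy inputs below are hypothetical; the paper's inputs would be doc2vec vectors of symbolized programs, and the kernel variant replaces the explicit hidden layer.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, b):
    """Gaussian elimination with partial pivoting for the small
    (hidden x hidden) normal-equations system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c]
                              for c in range(i + 1, n))) / M[i][i]
    return x

def elm_train(X, y, hidden=8, ridge=1e-3, seed=0):
    """One-shot ELM training: random fixed hidden layer, then a
    closed-form ridge least-squares solve for the output weights."""
    rng = random.Random(seed)
    d = len(X[0])
    W = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(d)]
    b = [rng.uniform(-1, 1) for _ in range(hidden)]
    H = [[math.tanh(s + bj) for s, bj in zip(row, b)]
         for row in matmul(X, W)]
    # beta = (H^T H + ridge*I)^-1 H^T y  (no iterative updates)
    HtH = matmul(list(map(list, zip(*H))), H)
    for i in range(hidden):
        HtH[i][i] += ridge
    Hty = [sum(H[r][i] * y[r] for r in range(len(y)))
           for i in range(hidden)]
    return W, b, solve(HtH, Hty)

def elm_predict(model, X):
    W, b, beta = model
    H = [[math.tanh(s + bj) for s, bj in zip(row, b)]
         for row in matmul(X, W)]
    return [sum(h * bt for h, bt in zip(row, beta)) for row in H]

# Toy XOR-like labels in place of real vulnerability labels.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]
model = elm_train(X, y)
print([round(p, 2) for p in elm_predict(model, X)])
```

The single linear solve is where ELM's training-speed advantage over iteratively trained Bi-LSTM comes from.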


2021 ◽  
Vol 12 ◽  
Author(s):  
Klára Jágrová ◽  
Michael Hedderich ◽  
Marius Mosbach ◽  
Tania Avgustinova ◽  
Dietrich Klakow

This contribution seeks to provide a rational probabilistic explanation for the intelligibility of words in a genetically related language that is unknown to the reader, a phenomenon referred to as intercomprehension. In this research domain, linguistic distance, among other factors, has been shown to correlate well with the mutual intelligibility of individual words. However, the role of context in the intelligibility of target words in sentences has been examined in very few studies. To address this, we analyze data from web-based experiments in which Czech (CS) respondents were asked to translate highly predictable target words in the final position of Polish sentences. We compare correlations of target word intelligibility with data from 3-gram language models (LMs) to their correlations with data obtained from context-aware LMs. More specifically, we evaluate two context-aware LM architectures: Long Short-Term Memory networks (LSTMs), which can, in theory, take infinitely long-distance dependencies into account, and Transformer-based LMs, which can access the whole input sequence at once. We investigate how their use of context affects surprisal and its correlation with intelligibility.
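Surprisal under a 3-gram LM is just the negative log probability of the target word given its two predecessors; a minimal unsmoothed version on a hypothetical toy Polish corpus looks like this (real models would add smoothing and far larger corpora):

```python
import math
from collections import Counter

# Toy corpus of tokenized Polish sentences (hypothetical).
corpus = "ona ma psa . ona ma kota . on ma psa .".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def surprisal(w1, w2, w3):
    """Surprisal -log2 P(w3 | w1 w2) under an unsmoothed 3-gram LM;
    the study compares such scores with LSTM/Transformer surprisal."""
    return -math.log2(trigrams[(w1, w2, w3)] / bigrams[(w1, w2)])

print(surprisal("ona", "ma", "psa"))  # P = 1/2, so 1 bit of surprisal
```

Context-aware LMs compute the same conditional probability but condition on the entire preceding sentence rather than two words.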


2021 ◽  
pp. 221-232
Author(s):  
Marco Pegoraro ◽  
Merih Seran Uysal ◽  
David Benedikt Georgi ◽  
Wil M. P. van der Aalst

The real-time prediction of business processes using historical event data is an important capability of modern business process monitoring systems. Existing process prediction methods are able to also exploit the data perspective of recorded events, in addition to the control-flow perspective. However, while well-structured numerical or categorical attributes are considered in many prediction techniques, almost no technique is able to utilize text documents written in natural language, which can hold information critical to the prediction task. In this paper, we illustrate the design, implementation, and evaluation of a novel text-aware process prediction model based on Long Short-Term Memory (LSTM) neural networks and natural language models. The proposed model can take categorical, numerical and textual attributes in event data into account to predict the activity and timestamp of the next event, the outcome, and the cycle time of a running process instance. Experiments show that the text-aware model is able to outperform state-of-the-art process prediction methods on simulated and real-world event logs containing textual data.
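The model's input side can be sketched as the encoding of one event's mixed attributes into a single feature vector. All names, the activity set, the vocabulary, and the normalization constant below are hypothetical; the paper's model additionally embeds the text with a learned language model rather than a bag-of-words.

```python
# A hypothetical event with categorical, numerical, and textual attributes.
event = {
    "activity": "Check Claim",            # categorical
    "amount": 1250.0,                     # numerical
    "note": "customer disputes invoice",  # free text
}

ACTIVITIES = ["Register Claim", "Check Claim", "Decide Claim"]
VOCAB = ["customer", "disputes", "invoice", "approved"]

def encode_event(event):
    """One-hot the activity, scale the number, and bag-of-words the
    text, then concatenate into the LSTM's per-event input vector."""
    one_hot = [1.0 if event["activity"] == a else 0.0 for a in ACTIVITIES]
    numeric = [event["amount"] / 10_000.0]  # toy normalization constant
    words = event["note"].split()
    text = [1.0 if w in words else 0.0 for w in VOCAB]
    return one_hot + numeric + text

vec = encode_event(event)
print(len(vec))  # 3 + 1 + 4 = 8 features per event
```

A sequence of such vectors, one per event of the running case, is what the LSTM consumes to predict the next activity and timestamp.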


2020 ◽  
Vol 12 (2) ◽  
pp. 84-99
Author(s):  
Li-Pang Chen

In this paper, we investigate the analysis and prediction of time-dependent data, focusing on four different stocks selected from the Yahoo Finance historical database. To build models and predict future stock prices, we consider three machine learning techniques: Long Short-Term Memory (LSTM), Convolutional Neural Networks (CNN), and Support Vector Regression (SVR). By treating the close price, open price, daily low, daily high, adjusted close price, and volume of trades as predictors, we show that prediction accuracy is improved.
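All three regressors train on the same supervised shape: lagged windows of past values paired with the next value. A stdlib sketch with hypothetical prices (real inputs would be the full Yahoo Finance predictor set, not just closes):

```python
# Hypothetical daily close prices.
closes = [101.2, 102.5, 101.8, 103.0, 104.1, 103.7, 105.2]

def make_supervised(series, window=3):
    """Turn a price series into (lagged-window, next-value) pairs,
    the supervised shape LSTM/CNN/SVR regressors train on."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return X, y

X, y = make_supervised(closes)
print(len(X), X[0], y[0])  # 4 samples; first target is 103.0
```

With multiple predictors, each window entry becomes a vector of that day's features rather than a single close price.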


2021 ◽  
Vol 186 (Supplement_1) ◽  
pp. 445-451
Author(s):  
Yifei Sun ◽  
Navid Rashedi ◽  
Vikrant Vaze ◽  
Parikshit Shah ◽  
Ryan Halter ◽  
...  

ABSTRACT Introduction Early prediction of the acute hypotensive episode (AHE) in critically ill patients has the potential to improve outcomes. In this study, we apply different machine learning algorithms to the MIMIC III Physionet dataset, containing more than 60,000 real-world intensive care unit records, to test commonly used machine learning technologies and compare their performances. Materials and Methods Five classification methods, including K-nearest neighbor, logistic regression, support vector machine, random forest, and a deep learning method called long short-term memory, are applied to predict an AHE 30 minutes in advance. An analysis comparing model performance when including versus excluding invasive features was conducted. To further study the pattern of the underlying mean arterial pressure (MAP), we apply a regression method to predict the continuous MAP values over the next 60 minutes using linear regression. Results Support vector machine yields the best performance in terms of recall (84%). Including the invasive features in the classification improves performance significantly, with both recall and precision increasing by more than 20 percentage points. We were able to predict the MAP 60 minutes into the future with a root mean square error (a frequently used measure of the difference between predicted and observed values) of 10 mmHg. After converting continuous MAP predictions into AHE binary predictions, we achieve 91% recall and 68% precision. In addition to predicting AHE, the MAP predictions provide clinically useful information regarding the timing and severity of the AHE occurrence. Conclusion We were able to predict AHE with precision and recall above 80% 30 minutes in advance on the large real-world dataset. The predictions of the regression model can provide a more fine-grained, interpretable signal to practitioners.
Model performance is improved by the inclusion of invasive features in predicting AHE, when compared to predicting the AHE based on only the available, restricted set of noninvasive technologies. This demonstrates the importance of exploring more noninvasive technologies for AHE prediction.
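The conversion from continuous MAP predictions to binary AHE predictions amounts to thresholding, after which recall and precision follow directly. The MAP values and the 60 mmHg cutoff below are hypothetical (60 mmHg is a commonly cited hypotension threshold, not necessarily the study's exact definition):

```python
# Hypothetical predicted vs. observed MAP values (mmHg).
predicted_map = [72, 65, 58, 55, 80, 59, 61]
observed_map  = [70, 62, 57, 54, 78, 63, 58]

def to_ahe(maps, threshold=60):
    """Binarize MAP values: 1 marks a hypotensive reading."""
    return [1 if m < threshold else 0 for m in maps]

pred, true = to_ahe(predicted_map), to_ahe(observed_map)
tp = sum(p and t for p, t in zip(pred, true))
fp = sum(p and not t for p, t in zip(pred, true))
fn = sum(t and not p for p, t in zip(pred, true))
recall = tp / (tp + fn)
precision = tp / (tp + fp)
print(recall, precision)
```

Because the underlying signal is the continuous MAP trajectory, the same predictions also convey how close a patient is to the threshold and how fast pressure is falling.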


AI ◽  
2021 ◽  
Vol 2 (2) ◽  
pp. 195-208
Author(s):  
Gabriel Dahia ◽  
Maurício Pamplona Segundo

We propose a method that can perform one-class classification given only a small number of examples from the target class and none from the others. We formulate the learning of meaningful features for one-class classification as a meta-learning problem in which the meta-training stage repeatedly simulates one-class classification, using the classification loss of the chosen algorithm to learn a feature representation. To learn these representations, we require only multiclass data from similar tasks. We show how the Support Vector Data Description method can be used with our method, and also propose a simpler variant based on Prototypical Networks that obtains comparable performance, indicating that learning feature representations directly from data may be more important than which one-class algorithm we choose. We validate our approach by adapting few-shot classification datasets to the few-shot one-class classification scenario, obtaining results similar to the state of the art of traditional one-class classification and improving upon one-class classification baselines employed in the few-shot setting.
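The prototypical variant reduces, at inference time, to computing the mean embedding of the few target-class examples and accepting queries that lie close to it. The 2-D embeddings and the distance radius below are hypothetical stand-ins for learned feature vectors and a tuned decision threshold:

```python
import math

def prototype(examples):
    """Mean embedding of the few target-class support examples."""
    dim = len(examples[0])
    return [sum(e[d] for e in examples) / len(examples)
            for d in range(dim)]

def is_target(x, proto, radius):
    """Accept a query if its Euclidean distance to the prototype is
    within `radius`, a one-class thresholded variant of the
    Prototypical Networks idea."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, proto)))
    return dist <= radius

# Toy 2-D embeddings for the target class only.
support = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
proto = prototype(support)
print(is_target([1.1, 0.9], proto, radius=0.5),
      is_target([3.0, 3.0], proto, radius=0.5))
```

What meta-training contributes is the embedding function itself, learned so that same-class points cluster tightly around their prototype.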

