Why My Code Summarization Model Does Not Work

2021 ◽  
Vol 30 (2) ◽  
pp. 1-29
Author(s):  
Qiuyuan Chen ◽  
Xin Xia ◽  
Han Hu ◽  
David Lo ◽  
Shanping Li

Code summarization aims at generating a code comment given a block of source code, and it is normally performed by training machine learning algorithms on existing code block–comment pairs. Code comments in practice have different intentions. For example, some code comments might explain how the methods work, while others explain why some methods are written. Previous works have shown that a relationship exists between a code block and the category of the comment associated with it. In this article, we investigate to what extent we can exploit this relationship to improve code summarization performance. We first classify comments into six intention categories and manually label 20,000 code-comment pairs. These categories are “what,” “why,” “how-to-use,” “how-it-is-done,” “property,” and “others.” Based on this dataset, we conduct an experiment to investigate the performance of different state-of-the-art code summarization approaches across the categories. We find that the performance of different code summarization approaches varies substantially across the categories. Moreover, the category for which a code summarization model performs best differs from model to model; in particular, no model performs best for “why” and “property” comments among the six categories. We design a composite approach to demonstrate that comment category prediction can boost code summarization. The approach uses the category-labeled data to train a classifier that infers a comment’s category, then selects the most suitable summarization model for the inferred category and outputs that model’s result. Our composite approach outperforms approaches that do not consider comment categories, with relative improvements of 8.57% and 16.34% in ROUGE-L and BLEU-4 scores, respectively.
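As an illustration of the routing idea, here is a minimal sketch assuming a pre-trained intent classifier and a pool of per-category summarization models; all names are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of a composite summarizer: route each code block to the model that
# performs best for its predicted comment-intent category.
# All objects (category_classifier, models, default_model) are hypothetical.

def composite_summarize(code_block, category_classifier, models, default_model):
    """Predict the comment-intent category of a code block, then delegate
    summarization to the model registered for that category."""
    category = category_classifier.predict(code_block)  # e.g. "what", "why", ...
    model = models.get(category, default_model)         # fall back if no match
    return model.summarize(code_block)

# Usage (with hypothetical pre-trained objects):
# summary = composite_summarize(code, intent_clf,
#                               {"why": model_a, "how-to-use": model_b},
#                               default_model=model_c)
```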

2020 ◽  
Vol 2020 ◽  
pp. 1-7
Author(s):  
Nalindren Naicker ◽  
Timothy Adeliyi ◽  
Jeanette Wing

Educational Data Mining (EDM) is a rich research field in computer science. Tools and techniques in EDM are useful to predict student performance, which gives practitioners useful insights for developing appropriate intervention strategies to improve pass rates and increase retention. The performance of state-of-the-art machine learning classifiers is very much dependent on the task at hand. Support vector machines have been used extensively in classification problems; however, the extant literature shows a gap in the application of linear support vector machines as predictors of student performance. The aim of this study was to compare the performance of linear support vector machines with that of state-of-the-art classical machine learning algorithms in order to determine which algorithm best improves the prediction of student performance. In this quantitative study, an experimental research design was used. Experiments were set up using feature selection on a publicly available dataset of 1,000 alphanumeric student records. Linear support vector machines, benchmarked against ten classical machine learning algorithms, showed superior performance in predicting student performance. The results of this research showed that features like race, gender, and lunch influence performance in mathematics, whilst access to lunch was the primary factor influencing reading and writing performance.
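A minimal sketch of such a benchmark using scikit-learn, assuming a synthetic stand-in for the 1,000-record student dataset; the comparison classifiers shown are illustrative, not the exact ten used in the study.

```python
# Benchmark a linear SVM against other classifiers with 5-fold cross-validation.
# The dataset is synthetic; swap in real student records for an actual study.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

for name, clf in [("LinearSVC", LinearSVC(max_iter=5000)),
                  ("RandomForest", RandomForestClassifier(random_state=0)),
                  ("NaiveBayes", GaussianNB())]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```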


Information ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 98 ◽  
Author(s):  
Tariq Ahmad ◽  
Allan Ramsay ◽  
Hanady Ahmed

Assigning sentiment labels to documents is, at first sight, a standard multi-label classification task. Many approaches have been used for this task, and the current state-of-the-art solutions use deep neural networks (DNNs), so it seems natural to expect such general-purpose machine learning algorithms to provide the most effective approach. We describe an alternative approach, which uses probabilities to construct a weighted lexicon of sentiment terms, then refines the lexicon and calculates an optimal threshold for each class. We show that this approach outperforms DNNs and other standard algorithms. We believe that DNNs are not a universal panacea and that paying attention to the nature of the data you are trying to learn from can be more important than trying out ever more powerful general-purpose machine learning algorithms.
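A minimal sketch of the weighted-lexicon idea, assuming a multi-label training corpus of token lists; the probability estimation and threshold tuning below are simplified stand-ins for the authors' method.

```python
# Build a lexicon weighting each term by P(class | term), estimated from
# training counts, then score documents by summing per-class term weights.
# A document is assigned every class whose score exceeds that class's
# threshold, which would be tuned on held-out data.
from collections import Counter, defaultdict

def build_lexicon(docs, labels, classes):
    """docs: list of token lists; labels: list of class-label lists."""
    term_class = defaultdict(Counter)
    for tokens, doc_labels in zip(docs, labels):
        for t in set(tokens):
            for c in doc_labels:
                term_class[t][c] += 1
    lexicon = {}
    for t, counts in term_class.items():
        total = sum(counts.values())
        lexicon[t] = {c: counts[c] / total for c in classes}
    return lexicon

def class_scores(tokens, lexicon, classes):
    """Sum the per-class weights of every known term in the document."""
    s = dict.fromkeys(classes, 0.0)
    for t in tokens:
        for c, w in lexicon.get(t, {}).items():
            s[c] += w
    return s

docs = [["great", "day"], ["awful", "day"]]
labels = [["joy"], ["anger"]]
lex = build_lexicon(docs, labels, ["joy", "anger"])
print(class_scores(["great", "day"], lex, ["joy", "anger"]))
```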


2019 ◽  
Author(s):  
Shufen Pan ◽  
Naiqing Pan ◽  
Hanqin Tian ◽  
Pierre Friedlingstein ◽  
Stephen Sitch ◽  
...  

Abstract. Evapotranspiration (ET) is a critical component of the global water cycle and links the terrestrial water, carbon, and energy cycles. Accurate estimates of terrestrial ET are important for hydrological, meteorological, and agricultural research and applications, such as quantifying surface energy and water budgets, weather forecasting, and irrigation scheduling. However, direct measurement of global terrestrial ET is not feasible. Here, we first give a retrospective introduction to the basic theory and recent developments of state-of-the-art approaches for estimating global terrestrial ET, including remote sensing-based physical models, machine learning algorithms, and land surface models (LSMs). We then use six remote sensing-based models (four physical models and two machine learning algorithms) and fourteen LSMs to analyze the spatial and temporal variations in global terrestrial ET. The results show that mean annual global terrestrial ET during 1982–2011 ranged from 50.7 × 10³ km³ yr⁻¹ (454 mm yr⁻¹) to 75.7 × 10³ km³ yr⁻¹ (678 mm yr⁻¹), with an average of 65.5 × 10³ km³ yr⁻¹ (588 mm yr⁻¹). LSMs had significant uncertainty in the ET magnitude in tropical regions, especially the Amazon Basin, while remote sensing-based ET products showed a larger inter-model range in arid and semi-arid regions than LSMs. LSMs and remote sensing-based physical models presented much larger inter-annual variability (IAV) of ET than machine learning algorithms in the southwestern U.S. and the Southern Hemisphere, particularly in Australia. LSMs suggested stronger control of precipitation on ET IAV than remote sensing-based models. The ensembles of remote sensing-based physical models and machine learning algorithms suggested significant increasing trends in global terrestrial ET at a rate of 0.62 mm yr⁻² (p < 0.05); in contrast, the ensemble of LSMs showed no significant trend (p > 0.05), even though most of the individual LSMs reproduced the increasing trend. Moreover, all models suggested a positive effect of vegetation greening on ET intensification. Spatially, all methods showed that ET significantly increased in western and southern Africa, western India, and northeastern Australia, but decreased severely in the southwestern U.S., southern South America, and Mongolia. Discrepancies in ET trends mainly appeared in tropical regions such as the Amazon Basin. The ensemble means of the three ET categories showed generally good consistency; however, considerable uncertainties remain in both the temporal and spatial variations in global ET estimates. These uncertainties are induced by multiple factors, including the parameterization of land processes, meteorological forcing, the lack of in situ measurements, remote sensing acquisition, and scaling effects. Improvements in the representation of water stress and canopy dynamics are essential to reduce uncertainty in LSM-simulated ET. Utilization of the latest satellite sensors and deep learning methods, theoretical advancements in nonequilibrium thermodynamics, and application of integrated methods that fuse different ET estimates or relevant key biophysical variables will improve the accuracy of remote sensing-based models.
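As a quick consistency check, the depth values in mm follow from dividing each volume by the global land area implied by the first pair of numbers (roughly 1.12 × 10⁸ km², an assumption inferred from the abstract itself rather than stated in it):

```python
# Volume <-> depth conversion check for the ET figures quoted above,
# assuming the land area implied by 50.7e3 km3/yr = 454 mm/yr.
land_area_km2 = 50.7e3 / 454e-6          # 1 mm = 1e-6 km; ~1.12e8 km2
for volume_km3 in (50.7e3, 65.5e3, 75.7e3):
    depth_mm = volume_km3 / land_area_km2 * 1e6   # km -> mm
    print(f"{volume_km3:.1f} km3/yr -> {depth_mm:.0f} mm/yr")
# Prints ~454, 587, 678 mm/yr; the one-millimetre offsets against the
# abstract come from rounding the 454 mm/yr reference value.
```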


2021 ◽  
pp. 1-15
Author(s):  
Mohammed Ayub ◽  
El-Sayed M. El-Alfy

Web technology has become an indispensable part of human life for almost all activities. At the same time, the trend of cyberattacks is on the rise in today's Web-driven world. Effective countermeasures for the analysis and detection of malicious websites are therefore crucial to combat the rising threats to cyber security. In this paper, we systematically reviewed the state-of-the-art techniques and identified a total of about 230 features of malicious websites, which we classify as internal and external features. We also developed a toolkit for the analysis and modeling of malicious websites. The toolkit implements several types of feature extraction methods and machine learning algorithms, which can be used to analyze and compare different approaches to detecting malicious URLs. It additionally incorporates options such as feature selection and imbalanced learning, with the flexibility to be extended with more functionality and generalization capabilities. Finally, some use cases are demonstrated on different datasets.
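A minimal sketch of the kind of pipeline such a toolkit automates, using scikit-learn; the handful of lexical URL features and toy labels below are illustrative, not the roughly 230 features catalogued in the paper.

```python
# Extract simple internal (URL-string) features and fit a classifier.
# Features, URLs, and labels are toy examples for illustration only.
from urllib.parse import urlparse
from sklearn.ensemble import RandomForestClassifier

def url_features(url):
    """A few simple lexical features computed from the URL string itself."""
    parsed = urlparse(url)
    return [
        len(url),                           # overall URL length
        url.count("."),                     # subdomain-heavy URLs are suspect
        url.count("-"),
        int(parsed.scheme == "https"),      # scheme as a binary feature
        sum(ch.isdigit() for ch in url),    # digit count (e.g. IP literals)
    ]

urls = ["https://example.com/login", "http://192.0.2.7/paypa1-secure-update"]
labels = [0, 1]  # 0 = benign, 1 = malicious (toy labels)
X = [url_features(u) for u in urls]
clf = RandomForestClassifier(random_state=0).fit(X, labels)
print(clf.predict([url_features("http://secure-login-update.example.net")]))
```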


2018 ◽  
Vol 7 (1.7) ◽  
pp. 179
Author(s):  
Nivedhitha G ◽  
Carmel Mary Belinda M.J ◽  
Rupavathy N

The growth of phishing sites is, by all accounts, striking. Even though web users are aware of these kinds of phishing attacks, many still fall victim to them. Numerous attacks are launched with the aim of making web users believe they are communicating with a trusted entity, and phishing is one of them. Phishing keeps evolving because it is easy to copy an entire website using its HTML source code; by making slight changes to the source code, it is possible to direct the victim to a phishing site. Phishers use a range of techniques to lure unsuspecting web users. Consequently, an efficient mechanism is required to distinguish phishing sites from legitimate sites in order to safeguard credential data. To detect phishing websites and identify them as information-leaking sites, the system proposes data mining algorithms. In this paper, machine learning algorithms have been utilized for modeling the prediction task. The processes of identity extraction and feature extraction are discussed, and the various experiments carried out to assess the performance of the models are demonstrated.
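One identity-related feature can be sketched under the assumption that a phishing page copied from a legitimate site tends to link back to the imitated brand's domain; this is an illustrative fragment, not the paper's full identity- and feature-extraction process.

```python
# Compare the page's own domain against the domains referenced in its HTML:
# a high ratio of "foreign" hyperlinks is one possible phishing indicator.
import re
from urllib.parse import urlparse

def foreign_link_ratio(page_url, html):
    """Fraction of absolute hyperlinks pointing away from the page's domain."""
    own = urlparse(page_url).netloc
    hrefs = re.findall(r'href=["\'](https?://[^"\']+)["\']', html)
    if not hrefs:
        return 0.0
    foreign = sum(urlparse(h).netloc != own for h in hrefs)
    return foreign / len(hrefs)

html = '<a href="https://www.example-bank.com/help">Help</a>'
print(foreign_link_ratio("http://examp1e-bank.phish.net/login", html))  # 1.0
```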


2021 ◽  
Vol 2021 ◽  
pp. 1-7
Author(s):  
Keqing Guan ◽  
Shah Nazir ◽  
Xianli Kong ◽  
Sadaqat ur Rehman

Source code transformation is a process in which the source code of a program is transformed through some operation to generate another, nearly identical program. This is mostly done in piracy situations, where pirates want to claim ownership of a software program. Various approaches are practiced for source code transformation and code obfuscation, and researchers have tried to prevent source code from being modified by people who want to change it. Among the existing approaches, the software birthmark is one developed with the aim of detecting piracy in software. Various features are extracted from software and are collectively termed a “software birthmark”; based on these extracted features, piracy in the software can be detected. Birthmarks are considered to persist in the source code and executables of certain programming languages. The usability of a software birthmark can protect software from modification and ultimately preserve its ownership. The proposed study used machine learning algorithms to classify the usability of existing software birthmarks under source code transformation. The K-nearest neighbors (K-NN) algorithm was used to classify the software birthmarks, and the decision rules, decomposition tree, and LTF-C algorithms were used for cross-validation. The experimental results show the effectiveness of the proposed research.
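A minimal sketch of K-NN classification over birthmark feature vectors, using scikit-learn; the feature values and labels below are synthetic placeholders, since the study's actual birthmark features are not reproduced here.

```python
# Classify software-birthmark feature vectors with K-nearest neighbors.
# Each row stands in for a feature vector extracted from one program
# (e.g. API-call or opcode frequencies); values here are synthetic.
from sklearn.neighbors import KNeighborsClassifier

X = [[0.9, 0.1, 0.3], [0.8, 0.2, 0.4],   # programs with intact birthmarks
     [0.1, 0.9, 0.7], [0.2, 0.8, 0.6]]   # source-transformed programs
y = ["original", "original", "transformed", "transformed"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[0.85, 0.15, 0.35]]))  # -> ['original']
```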

