Queen's School of Computing

Software Defect Prediction Using Rich Contextualized Language Use Vectors


Context. Software defect prediction, a part of the software quality assurance process, aims to find defect-prone source code and thereby reduce the effort, time, and cost of ensuring the quality of a software system. Code and non-code metrics are commonly used to train machine learning algorithms to predict software defects. Studies have shown that such metrics-based approaches fail to give satisfactory results and have reached a performance ceiling. This thesis explores the idea of using code profiles, which encode the programming language feature use of source files, as an alternative to traditional metrics for predicting software defects. The code profile-based method proves more promising than traditional metrics-based approaches.

Aims. This thesis aims to improve software defect prediction by using code profiles as the feature variable in place of traditional metrics. Software code profiles encode the density of language feature usage, and the context of that usage, in Rich Contextualized Language Use Vectors (RCLUVs), which are obtained by analyzing the parse tree of the source code. This thesis investigates whether code profiles can be used to train machine learning algorithms, and compares the performance of the resulting models with traditional metrics-based approaches.
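
As a rough illustration of what "density of language feature usage" and "context of such usage" can mean, the following minimal sketch derives two simple profiles from a parse tree. It is not the RCLUV encoding used in this thesis: it assumes Python source and Python's standard `ast` module purely for convenience, and the function names `feature_density` and `contextual_density` are hypothetical. Density is approximated by the relative frequency of each syntax-node type, and context by the relative frequency of (parent type, child type) pairs.

```python
import ast
from collections import Counter


def feature_density(source):
    """Relative frequency of each syntax-node type in the parse tree.

    Simplified stand-in for a code profile (an assumption, not the
    thesis's RCLUV definition).
    """
    tree = ast.parse(source)
    counts = Counter(type(node).__name__ for node in ast.walk(tree))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def contextual_density(source):
    """Relative frequency of (parent type, child type) pairs.

    The parent type serves as a crude notion of the context in which
    a language feature is used.
    """
    tree = ast.parse(source)
    pairs = Counter()
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            pairs[(type(parent).__name__, type(child).__name__)] += 1
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()}


if __name__ == "__main__":
    sample = "def f(x):\n    if x:\n        return 1\n    return 0\n"
    print(feature_density(sample))
    print(contextual_density(sample))
```

Under these assumptions, each source file yields a sparse mapping that can be laid out as a fixed-length numeric vector over a shared vocabulary of node types or pairs and then passed to a machine learning algorithm in place of traditional metrics.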

Methods. To achieve these aims, the learning curves of several machine learning algorithms have been analyzed, and the performance of the models has been evaluated against traditional metrics-based approaches. Two benchmark bug datasets have been used to train the algorithms: the Eclipse bug dataset and the GitHub bug database. While the Eclipse dataset contains different versions of the same project, the GitHub database contains different versions of several projects from various domains.

Results. The learning curves of the models show that machine learning algorithms can learn from RCLUV-based code profiles. Performance evaluation against existing metrics-based approaches reveals that the code profile-based approach is more promising. The predictive performance of both the metrics-based and the code profile-based approaches drops in cross-version predictions. However, in most cases, the performance of code profile-based defect prediction is higher than that of the metrics-based approaches.

Conclusions. This thesis has studied software code profiles as a feature variable for predicting software defects. Unlike traditional metrics-based approaches, it uses vectors generated by analyzing language feature usage in the parse trees of source code as the feature variable for training machine learning algorithms. Experimental results with common algorithms encourage the use of software code profiles as an alternative to traditional metrics for predicting software defects.