
An empirical assessment of the contributing factors for vulnerability detection using machine learning



Mouine, Esma ORCID: https://orcid.org/0000-0003-2463-9733 (2021) An empirical assessment of the contributing factors for vulnerability detection using machine learning. Masters thesis, Concordia University.

Text (application/pdf)
MOUINE_MASc_S2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.


There is an increasing trend to mine vulnerabilities from software repositories and use machine learning techniques to detect software vulnerabilities automatically. A fundamental but unresolved research question is: how do different factors in the mining and learning process impact the accuracy of identifying vulnerabilities in software projects of varying characteristics? Substantial research has been dedicated to this area, including source code static analysis, software repository mining, and NLP-based machine learning. However, practitioners lack experience regarding the key factors for building a state-of-the-art baseline model. In addition, they lack experience regarding how transferable vulnerability signatures are from one project to another. This study investigates how the combination of different vulnerability features and three representative machine learning models impacts vulnerability detection accuracy in 17 real-world projects.
This thesis proposes different machine learning methods to detect software vulnerabilities.
The first part of this work consists of establishing a baseline model for vulnerability prediction using NLP. For that, two types of vulnerability representations are examined: 1) code features extracted through NLP with varying tokenization strategies and three different embedding techniques (bag-of-words, word2vec, and fastText) and 2) a set of eight architectural metrics that capture the abstract design of the software systems. The three machine learning algorithms are a random forest model, a support vector machine model, and a residual neural network model.
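As an illustration of the bag-of-words representation named above, the following is a minimal sketch in plain Python. The tokenizer regex, the function names (`tokenize`, `build_vocabulary`, `bag_of_words`), and the two sample snippets are assumptions made for this example; the record does not describe the tokenization strategies actually used in the thesis.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the thesis's method):
# identifiers/keywords, integer literals, and single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")


def tokenize(code):
    """Split a source-code snippet into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def build_vocabulary(snippets):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for s in snippets for tok in tokenize(s)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(code, vocab):
    """Return a token-count vector for one snippet over a fixed vocabulary."""
    counts = Counter(tokenize(code))
    vec = [0] * len(vocab)
    for tok, n in counts.items():
        if tok in vocab:  # tokens outside the vocabulary are ignored
            vec[vocab[tok]] = n
    return vec


# Illustrative snippets only; not taken from the studied projects.
snippets = ["strcpy(buf, input);", "strncpy(buf, input, n);"]
vocab = build_vocabulary(snippets)
vectors = [bag_of_words(s, vocab) for s in snippets]
```

Each resulting vector could then be passed, together with a vulnerable/non-vulnerable label, to a classifier such as a random forest.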
The second part of the study is an effort to evaluate the baseline model sufficiently and fairly by comparing it against another model. Additional experiments are performed using a bidirectional long short-term memory (BiLSTM) network combined with word2vec, and the results are compared to the baseline results.
Overall, in the first set of experiments, the models returned the following results. Across 10 hypothesis tests and 408 experiments, 95% of the learning metrics (precision, recall, F1 score, etc.) are above 0.77. Further analysis shows that a recommended baseline model, with signatures extracted through bag-of-words embedding and combined with the random forest, consistently increases the detection accuracy by about 4% compared to other combinations in all 17 projects. The experiments also show the limitations of transferring vulnerability signatures across domains. Furthermore, the baseline model is shown to perform better than the BiLSTM model.
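For readers unfamiliar with the learning metrics mentioned above, the sketch below shows how precision, recall, and F1 score are conventionally computed for a binary vulnerable/non-vulnerable classifier. The function name and the label vectors are hypothetical and chosen for illustration; they are not results from the thesis.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Illustrative labels only: 1 = vulnerable, 0 = not vulnerable.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these made-up labels there are three true positives, one false positive, and one false negative, so all three metrics equal 0.75.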

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Mouine, Esma
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Electrical and Computer Engineering
Date: December 2021
Thesis Supervisor(s): Liu, Yan
ID Code: 989972
Deposited By: Esma Mouine
Deposited On: 16 Jun 2022 14:55
Last Modified: 01 Dec 2022 01:00
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
