On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

Title:

On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

Dib, Mirabelle ORCID: https://orcid.org/0000-0003-3134-1230 (2021) On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis. Masters thesis, Concordia University.

Preview

Text (application/pdf)
DIB_MSC_S2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.

5MB

Abstract

Recent years have witnessed the emergence of new and more sophisticated malware targeting insecure Internet of Things (IoT) devices, as part of orchestrated large-scale botnets. Moreover, the
public release of the source code of popular malware families such as Mirai [1] has spawned diverse variants, making it harder to disambiguate their ownership, lineage, and correct label. Such a rapidly
evolving landscape makes it also harder to deploy and generalize effective learning models against retired, updated, and/or new threat campaigns. To mitigate such threat, there is an utmost need for effective IoT malware detection, classification and family attribution, which provide essential steps towards initiating attack mitigation/prevention countermeasures, as well as understanding the evolutionary trajectories and tangled relationships of IoT malware. This is particularly challenging
due to the lack of fine-grained empirical data about IoT malware, the diverse architectures of IoT-targeted devices, and the massive code reuse between IoT malware families.

To address these challenges, in this thesis, we leverage the general lack of obfuscation in IoT malware to extract and combine static features from multi-modal views of the executable binaries (e.g., images, strings, assembly instructions), along with Deep Learning (DL) architectures for effective IoT malware classification and family attribution. Additionally, we aim to address concept drift and the limitations of inter-family classification due to the evolutionary nature of IoT malware, by detecting in-class evolving IoT malware variants and interpreting the meaning behind their mutations. To this end, we perform the following to achieve our objectives:

First, we analyze 70,000 IoT malware samples collected by a specialized IoT honeypot and popular malware repositories in the past 3 years. Consequently, we utilize features extracted from strings- and image-based representations of IoT malware to implement a multi-level DL architecture that fuses the learned features from each sub-component (i.e, images, strings) through a neural network classifier. Our in-depth experiments with four prominent IoT malware families highlight
the significant accuracy of the proposed approach (99.78%), which outperforms conventional single-level classifiers, by relying on different representations of the target IoT malware binaries that do not
require expensive feature extraction. Additionally, we utilize our IoT-tailored approach for labeling unknown malware samples, while identifying new malware strains.

Second, we seek to identify when the classifier shows signs of aging, by which it fails to effectively recognize new variants and adapt to potential changes in the data. Thus, we introduce a robust and effective method that uses contrastive learning and attentive Transformer models to learn and compare semantically meaningful representations of IoT malware binaries and codes without the need for expensive target labels. We find that the evolution of IoT binaries can be used as an augmentation strategy to learn effective representations to contrast (dis)similar variant pairs. We discuss the impact and findings of our analysis and present several evaluation studies to highlight the tangled relationships of IoT malware, as well as the efficiency of our contrastively learned fine-grained feature vectors in preserving semantics and reducing out-of-vocabulary size in cross-architecture IoT malware binaries.

We conclude this thesis by summarizing our findings and discussing research gaps that lay the way for future work.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (Masters)
Authors:	Dib, Mirabelle
Institution:	Concordia University
Degree Name:	M. Comp. Sc.
Program:	Computer Science
Date:	2 December 2021
Thesis Supervisor(s):	Assi, Chadi and Bou-Harb, Elias
ID Code:	990079
Deposited By:	Mirabelle Dib
Deposited On:	16 Jun 2022 14:35
Last Modified:	16 Jun 2022 14:35

Repository Staff Only: item control page

Download Statistics

Downloads per month over past year

Research related to the current document (at the CORE website)

Spectrum Research Repository

On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

On Leveraging Next-Generation Deep Learning Techniques for IoT Malware Classification, Family Attribution and Lineage Analysis

Abstract