Feature selection strategies for spam e-mail filtering

Wang, Ren (2006) Feature selection strategies for spam e-mail filtering. Masters thesis, Concordia University.

Text (application/pdf)
MR20756.pdf - Accepted Version
3MB

Abstract

The spam e-mail (also known as junk e-mail) problem is rapidly becoming unmanageable. According to a recent European Union study, junk e-mails cost all of us about 9.4 billion (US) dollars per year, and many major ISPs say that spam adds about 20% to the cost of their service. Feature selection is an important research problem in many text categorization applications, including spam e-mail filtering. In designing spam filters, we often represent e-mail using the vector space model (VSM), in which every e-mail is treated as a vector of word terms. Since e-mails contain many different terms, and not all classifiers can handle such a high-dimensional representation, only the most discriminative terms should be considered. Also, some of these features may not be influential and might carry redundant information that can confuse the classifier. Thus, feature selection, and hence dimensionality reduction, is a crucial step in getting the best out of the constructed features. Many feature selection strategies (FSS) can be applied to produce the desired feature set. In this thesis, we investigate the use of several classifier-dependent feature selection strategies. We cast our feature selection problem as a 0-1 optimization problem and compare different optimization techniques. These techniques include several local search optimization algorithms such as Hill Climbing, Simulated Annealing, Threshold Accepting and Tabu Search. We also examine other algorithms inspired by biological systems and artificial life techniques, such as Genetic Algorithm, Particle Swarm Optimization, Ant Colony Optimization and Artificial Immune Systems. The performance of all the above algorithms is compared with some traditional dimensionality reduction techniques such as Principal Component Analysis, Linear Discriminant Analysis and Singular Value Decomposition.
Our experimental results show that all these techniques can be used not only to reduce the dimensions of the e-mail VSM, but also to improve the performance of the spam filter.
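
To make the 0-1 formulation in the abstract concrete, here is a minimal sketch of classifier-dependent (wrapper) feature selection with Hill Climbing. It is not the thesis's implementation. The toy corpus, the nearest-centroid classifier, the validation-accuracy objective, and all names (`TRAIN`, `VALID`, `vectorize`, `accuracy`, `hill_climb`) are assumptions made for illustration only. Each candidate solution is a bit mask over the vocabulary, where 1 keeps a term and 0 drops it.

```python
import random

# Hypothetical toy corpus (not from the thesis): (text, label), 1 = spam.
TRAIN = [
    ("win free money now", 1),
    ("free prize claim now", 1),
    ("cheap money offer free", 1),
    ("meeting agenda for monday", 0),
    ("project report for review", 0),
    ("lunch meeting on monday", 0),
]
VALID = [
    ("claim free money", 1),
    ("free offer now", 1),
    ("monday project meeting", 0),
    ("agenda for review", 0),
]

VOCAB = sorted({w for text, _ in TRAIN for w in text.split()})


def vectorize(text):
    """Vector space model: one term-count entry per vocabulary word."""
    words = text.split()
    return [words.count(term) for term in VOCAB]


def accuracy(mask, train, valid):
    """Wrapper objective: validation accuracy of a nearest-centroid
    classifier that only sees features i with mask[i] == 1."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    correct = 0
    for x, y in valid:
        dist = {
            lab: sum((x[i] - c) ** 2 for i, c in zip(idx, cen))
            for lab, cen in centroids.items()
        }
        correct += min(dist, key=dist.get) == y
    return correct / len(valid)


def hill_climb(n_features, score, seed=0, max_iters=200):
    """Hill climbing over single-bit flips of a 0-1 mask.
    A flip is kept only if it strictly improves the score."""
    rng = random.Random(seed)
    mask = [rng.randint(0, 1) for _ in range(n_features)]
    best = score(mask)
    for _ in range(max_iters):
        improved = False
        for i in range(n_features):
            mask[i] ^= 1          # try flipping feature i
            s = score(mask)
            if s > best:
                best, improved = s, True
            else:
                mask[i] ^= 1      # undo the flip
        if not improved:
            break                 # local optimum: no single flip helps
    return mask, best


train = [(vectorize(t), y) for t, y in TRAIN]
valid = [(vectorize(t), y) for t, y in VALID]


def score(mask):
    return accuracy(mask, train, valid)


mask, best = hill_climb(len(VOCAB), score)
print([term for term, bit in zip(VOCAB, mask) if bit], best)
```

The other local search methods named in the abstract (Simulated Annealing, Threshold Accepting, Tabu Search) can be seen as variations on the acceptance rule in the inner loop: instead of rejecting every non-improving flip, they sometimes accept one in order to escape local optima.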

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (Masters)
Authors: Wang, Ren
Pagination: xi, 75 leaves : ill. ; 29 cm.
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Electrical and Computer Engineering
Date: 2006
Thesis Supervisor(s): Youssef, Amr and Elhakeem, Ahmed K.
Identification Number: LE 3 C66E44M 2006 W36
ID Code: 9137
Deposited By: Concordia University Library
Deposited On: 18 Aug 2011 18:45
Last Modified: 13 Jul 2020 20:06
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.