Messaging Forensic Framework for Cybercrime Investigation.
PhD thesis, Concordia University.
Accepted Version
Online predators, botmasters, and terrorists abuse the Internet and associated web technologies to conduct illegitimate activities such as bullying, phishing, and threatening. These activities often involve online messages between a criminal and a victim, or between criminals themselves. Forensic analysis of online messages, which collects empirical evidence that can be used to prosecute cybercriminals in a court of law, is one way to curb most cybercrimes. The challenge is to develop innovative tools and techniques that can precisely analyze large volumes of suspicious online messages. We develop a forensic analysis framework that helps an investigator analyze the textual content of online messages, with two main objectives. First, we apply our novel authorship analysis techniques to collect patterns of authorial attributes, addressing the problem of anonymity in online communication. Second, we apply the proposed knowledge discovery and semantic analysis techniques to identify criminal networks and their illegal activities. The framework focuses on collecting credible, intuitive, and interpretable evidence for both technical and non-technical professionals, including law enforcement personnel and jury members. To evaluate the proposed methods, we report on our collaborative work with a local law enforcement agency. Experimental results on real-life data suggest that the presented forensic analysis framework is effective for cybercrime investigation.
|Divisions:||Concordia University > Faculty of Engineering and Computer Science > Computer Science and Software Engineering|
|Item Type:||Thesis (PhD)|
|Degree Name:||Ph. D.|
|Date:||27 January 2011|
|Thesis Supervisor(s):||Debbabi, Mourad and Fung, Benjamin|
|Keywords:||cybercrime, messaging forensic, criminal networks, topic identification, authorship, anonymity, digital investigation, data mining, machine learning, chat mining, email analysis, social network|
|Deposited On:||13 Jun 2011 13:45|
|Last Modified:||04 Nov 2016 23:32|