Iqbal, Farkhund and Binsalleeh, Hamad and Fung, Benjamin C.M. and Debbabi, Mourad
A unified data mining solution for authorship analysis in anonymous textual communications.
Information Sciences, 231
- Accepted Version
Official URL: http://dx.doi.org/10.1016/j.ins.2011.03.006
The cyber world provides an anonymous environment for criminals to conduct malicious activities such as spamming, sending ransom e-mails, and spreading botnet malware. Often, these activities involve textual communication between a criminal and a victim, or between criminals themselves. The forensic analysis of online textual documents to address the anonymity problem, called authorship analysis, is the focus of most cybercrime investigations. Authorship analysis is the statistical study of the linguistic and computational characteristics of documents written by individuals. This paper is the first to present a unified data mining solution to authorship analysis problems, based on the concept of a frequent pattern-based writeprint. Extensive experiments on real-life data suggest that the proposed solution can precisely capture the writing styles of individuals. Furthermore, the writeprint is effective in identifying the author of an anonymous text from a group of suspects and in inferring sociolinguistic characteristics of the author.