M. A. Basher, Abdur Rahman and Fung, Benjamin C.M. (2013) Analyzing topics and authors in chat logs for crime investigation. Knowledge and Information Systems. ISSN 0219-1377
Text (application/pdf): fung2013.pdf - Accepted Version (1MB)
Official URL: http://dx.doi.org/10.1007/s10115-013-0617-y
Abstract
Cybercriminals have been using the Internet to accomplish illegitimate activities and to execute catastrophic attacks. Computer-Mediated Communication such as online chat provides an anonymous channel for predators to exploit victims. In order to prosecute criminals in a court of law, an investigator often needs to extract evidence from a large volume of chat messages. Most of the existing search tools are keyword-based, and the search terms are provided by an investigator. The quality of the retrieved results depends on the search terms provided. Due to the large volume of chat messages and the large number of participants in public chat rooms, the process is often time-consuming and error-prone. This paper presents a topic search model to analyze archives of chat logs for segregating crime-relevant logs from others. Specifically, we propose an extension of the Latent Dirichlet Allocation-based model to extract topics, compute the contribution of authors in these topics, and study the transitions of these topics over time. In addition, we present a special model for characterizing authors-topics over time. This is crucial for investigation because it provides a view of the activity in which authors are involved in certain topics. Experiments on two real-life datasets suggest that the proposed approach can discover hidden criminal topics and the distribution of authors to these topics.
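As a rough illustration of the pipeline the abstract describes (extract topics, then compute each author's contribution to them), the sketch below runs a plain collapsed Gibbs sampler for standard LDA on a few invented chat lines and aggregates the token-level topic assignments by author. It is not the paper's extended model and it leaves out the time dimension. The function names `gibbs_lda` and `author_topic_shares`, the hyperparameter values, and the toy messages are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA (not the paper's extension).

    docs: list of token lists.
    Returns (assignments, doc_topic, topic_word, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]        # n_{d,k}
    topic_word = [[0] * V for _ in range(num_topics)]   # n_{k,w}
    topic_total = [0] * num_topics                      # n_k
    assignments = []

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Full conditional p(z = t | all other assignments),
                # up to a normalising constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta)
                    / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word, vocab


def author_topic_shares(authors, assignments, num_topics):
    """Fraction of each author's tokens assigned to each topic."""
    counts = defaultdict(lambda: [0] * num_topics)
    for author, z_d in zip(authors, assignments):
        for k in z_d:
            counts[author][k] += 1
    return {a: [c / sum(row) for c in row] for a, row in counts.items()}


# Invented toy chat log: one (author, message) pair per line.
messages = [
    ("user_a", "game score team win game"),
    ("user_b", "price deal cash price sell"),
    ("user_a", "team score match win"),
    ("user_c", "sell deal cash buyer price"),
    ("user_b", "game team deal cash"),
]
authors = [a for a, _ in messages]
docs = [text.split() for _, text in messages]

NUM_TOPICS = 2
assignments, doc_topic, topic_word, vocab = gibbs_lda(docs, NUM_TOPICS)
shares = author_topic_shares(authors, assignments, NUM_TOPICS)

for author, row in sorted(shares.items()):
    print(author, [round(p, 2) for p in row])
```

Because the sampler is randomized, the topic indices and the exact shares depend on the seed. Only the invariants are stable: each author's shares sum to 1, and the count tables always account for every token.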
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
---|---|
Item Type: | Article |
Refereed: | Yes |
Authors: | M. A. Basher, Abdur Rahman and Fung, Benjamin C.M. |
Journal or Publication: | Knowledge and Information Systems |
Date: | 8 March 2013 |
Digital Object Identifier (DOI): | 10.1007/s10115-013-0617-y |
Keywords: | Latent Dirichlet Allocation (LDA), Topic modeling, Gibbs sampling, Topic evolution, Author-topics over time, Cybercrime |
ID Code: | 977221 |
Deposited By: | Danielle Dennie |
Deposited On: | 03 May 2013 12:45 |
Last Modified: | 18 Jan 2018 17:44 |