
Assembly Code Clone Detection for Malware Binaries



Farhadi, Mohammad Reza (2013) Assembly Code Clone Detection for Malware Binaries. Masters thesis, Concordia University.

Text (application/pdf)
Farhadi_MASc_S2013.pdf - Accepted Version
Available under License Spectrum Terms of Access.


Malware, such as a virus or trojan horse, is software designed specifically to gain unauthorized access to a computer system and perform malicious activities. To analyze a piece of malware, one may employ a reverse engineering approach and perform an in-depth analysis of its assembly code. Yet the reverse engineering process is tedious and time-consuming. One way to speed up the analysis is to compare the disassembled malware with previously analyzed malware, identify similar functions in the assembly code, and transfer the comments from the previously analyzed software to the new malware. The challenge is how to efficiently identify similar code fragments (i.e., clones) in a large repository of assembly code.

In this thesis, an assembly code clone detection system is presented. Its performance is evaluated in terms of accuracy, efficiency, scalability, and feasibility of finding clones in assembly code disassembled from both Microsoft Windows 7 DLL files and real-life malware binary files. Experimental results suggest that the proposed clone detection algorithm is effective. The system can serve as a basis for future development of assembly code clone detection.
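
To make the idea of matching a new disassembly against a repository of previously analyzed code more concrete, the following is a minimal sketch only. It is not the algorithm from the thesis, whose details are not given in this abstract. It assumes a simple scheme: registers and constants are replaced by placeholders, fixed-size sliding windows of instructions are hashed, and windows with equal hashes are reported as candidate clones. The normalization rules, the window size, and all names (`normalize`, `build_index`, `find_clones`, `decrypt_config`) are hypothetical and chosen for illustration.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical normalization rules (assumptions, not taken from the thesis):
# a small subset of x86 register names and common constant notations.
REGISTER = re.compile(r"\b(e?[abcd]x|e?[sd]i|e?[sb]p|[abcd][lh])\b")
CONSTANT = re.compile(r"\b(0x[0-9a-f]+|[0-9][0-9a-f]*h|\d+)\b")


def normalize(instruction):
    """Replace concrete registers and constants with generic placeholders."""
    text = instruction.strip().lower()
    text = REGISTER.sub("REG", text)
    text = CONSTANT.sub("VAL", text)
    return re.sub(r"\s+", " ", text)


def window_hashes(instructions, size=3):
    """Yield (start_index, digest) for each sliding window of instructions."""
    normalized = [normalize(i) for i in instructions]
    for start in range(len(normalized) - size + 1):
        chunk = "\n".join(normalized[start:start + size])
        yield start, hashlib.sha1(chunk.encode("utf-8")).hexdigest()


def build_index(repository, size=3):
    """Map each window digest to the (function name, offset) pairs containing it."""
    index = defaultdict(list)
    for name, instructions in repository.items():
        for start, digest in window_hashes(instructions, size):
            index[digest].append((name, start))
    return index


def find_clones(index, instructions, size=3):
    """Return (offset, known function, known offset) for every matching window."""
    matches = []
    for start, digest in window_hashes(instructions, size):
        for name, known_start in index.get(digest, []):
            matches.append((start, name, known_start))
    return matches


# Illustrative data: one previously analyzed function and one unknown function
# that differs only in register choice and constants.
repository = {
    "decrypt_config": [
        "push ebp", "mov ebp, esp", "mov eax, [ebp+8]",
        "xor eax, 0x5A", "pop ebp", "ret",
    ],
}
unknown = [
    "push ebp", "mov ebp, esp", "mov ecx, [ebp+12]",
    "xor ecx, 0x3C", "pop ebp", "ret",
]
clones = find_clones(build_index(repository), unknown)
for offset, name, known_offset in clones:
    print(f"window at {offset} matches {name} at {known_offset}")
```

Hashing normalized windows only finds fragments that are identical after normalization. Tolerating inserted, deleted, or reordered instructions would need an inexact similarity measure, which this sketch does not attempt.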

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type: Thesis (Masters)
Authors: Farhadi, Mohammad Reza
Institution: Concordia University
Degree Name: M.A.Sc.
Program: Information Systems Security
Date: 12 April 2013
Thesis Supervisor(s): Debbabi, Mourad and Fung, Benjamin C.M.
Keywords: Malware, Reverse Engineering, Code Clone, Assembly Code Clone Detection, Malware Analysis
ID Code: 977131
Deposited On: 07 Jun 2013 14:43
Last Modified: 18 Jan 2018 17:43

