Hedayati Somarin, Iraj (2016) DFA Minimization Algorithms in MapReduce. Masters thesis, Concordia University.

Text (application/pdf)
789kBHedayatiSomarin_MCompSc_S2016.pdf  Accepted Version Available under License Spectrum Terms of Access. 
Abstract
MapReduce has been a highly popular paralleldistributed programming model. In this thesis, we study the problem of minimizing Deterministic Finite State Automata (DFA). We focus our attention on two wellknown (serial) algorithms, namely the algorithms of Moore (1956) and of Hopcroft (1971). The central costparameter in MapReduce is that of communication cost i.e., the amount of data that has to be communicated between the processes. Using techniques from Communication Complexity we derive an O(kn log{n}) lower bound and O(kn^3 log{n}) upper bound for the problem, where n is the number of states in the DFA to be minimized,and k is the size of its alphabet. We then develop MapReduce versions of both Moore's and Hopcroft's algorithms, and show that their communication cost is O(kn^2 (log {n} + log {k})). Both methods have been implemented and tested on large DFA, with 131,072 states. The experiments verify our theoretical analysis, and also reveal that Hopcroft's algorithm  considered superior in the sequential framework  is very sensitive to skew in the topology of the graph of the DFA, whereas Moore's algorithm handles skew without major efficiency loss.
Divisions:  Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering 

Item Type:  Thesis (Masters) 
Authors:  Hedayati Somarin, Iraj 
Institution:  Concordia University 
Degree Name:  M. Comp. Sc. 
Program:  Computer Science 
Date:  17 January 2016 
Thesis Supervisor(s):  Grahne, Gösta K. 
Keywords:  MapReduce, BigData, Hadoop, Automata, DFA Minimization, Communication Complexity, Parallel, Distributed, Complexity Model 
ID Code:  980838 
Deposited By:  HEDAYATISOMARIN IRAJ 
Deposited On:  16 Jun 2016 14:40 
Last Modified:  18 Jan 2018 17:52 
References:
[1] Apache Hadoop. Available at http://hadoop.apache.org/.[2] Foto N. Afrati, Anish Das Sarma, Semih Salihoglu, and Jeffrey D. Ullman. Upper and lower bounds on the cost of a mapreduce computation. PVLDB, 6(4):277–288, 2013.
[3] Fr´ed´erique Bassino, Julien David, and Cyril Nicaud. On the average complexity of Moore’s state minimization algorithm. CoRR, abs/0902.1048, 2009.
[4] Paul Beame, Paraschos Koutris, and Dan Suciu. Communication steps for parallel query processing. In Proceedings of the 32Nd Symposium on Principles of Database Systems, PODS ’13, pages 273–284, New York, NY, USA, 2013. ACM.
[5] Jacek Becla and Daniel L. Wang. Lessons learned from managing a petabyte. In Proceedings, 2nd Biennial Conference on Innovative Data Systems Research, Stanford, CA, Jan 2005.6] Jean Berstel, Luc Boasson, Olivier Carton, and Isabelle Fagnot. Minimization of automata. CoRR, abs/1010.5318, 2010.
[7] Ronald V. Book, Christopher B. Wilson, and Xu MeiRui. Relativizing time, space, and timespace. SIAM Journal on Computing, 11(3):571–581, 1982.
[8] Kirk D. Borne. Scientific data mining in astronomy. CoRR, abs/0911.0505, 2009.
[9] Wilfried Brauer. On minimizing finite automata. Bulletin of the EATCS, 35:113–116, 1988.
[10] Parris M. Caulk. The design of a petabyte archive and distribution system for the NASA ECS project. In Fourth NASA Goddard Conference on Mass Storage Systems and Technologies, pages 7–17. NASA, 1995.
[11] Sang Cho and Dung T. Huynh. The parallel complexity of coarsest set partition problems. Inf. Process. Lett., 42(2):89–94, 1992.
[12] Sang Cho and Dung T. Huynh. The parallel complexity of finitestate automata problems. Information and Computation, 97(1):1–22, 1992.
[13] Julien David. The average complexity of Moore’s state minimization algorithm is O(n log log n). In MFCS, pages 318–329. Springer, 2010.
[14] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[15] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, 21(9):948–960, 1972.
[16] G¨osta Grahne, Shahab Harrafi, Ali Moallemi, and Adrian Onet. Computing NFA intersections in mapreduce. In Proceedings of the Workshops of the EDBT/ICDT Joint Conference (EDBT/ICDT), pages 42–45, 2015.
[17] David Gries. Describing an algorithm by Hopcroft. Acta Informatica, 2(2):97–109, 1973.
[18] Shahab Harrafi Saveh. Finite automata algorithms in mapreduce. Master’s thesis, Concordia University, April 2015.
[19] John E. Hopcroft. An n log n algorithm for minimizing states in a finite automaton. In Theory of Machines and Computations, pages 189–196. Academic Press, New York, 1971.
[20] John E. Hopcroft and Jeffrey D. Ullman. Introduction to Automata Theory, Languages and Computation. AddisonWesley, 1979.
[21] Joseph F. J´aJ´a and Kwan Woo Ryu. An efficient parallel algorithm for the single function coarsest partition problem. In Proceedings of the fifth annual ACM symposium on Parallel algorithms and architectures, pages 230–239. ACM, 1993.
[22] Howard Karloff, Siddharth Suri, and Sergei Vassilvitskii. A model of computation for mapreduce. In Proceedings of the TwentyFirst Annual ACMSIAM Symposium on Discrete Algorithms, pages 938–948. Society for Industrial and Applied Mathematics, 2010.
[23] Eyal Kushilevitz. Communication complexity. Advances in Computers, 44:331–360, 1997.
[24] Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
[25] Richard J. Lipton and Robert Sedgewick. Lower bounds for VLSI. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 300–307. ACM, 1981.
[26] Yossi Matias and Uzi Vishkin. On parallel hashing and integer sorting. Journal of Algorithms, 12(4):573–606, 1991.
[27] Edward F. Moore. Gedankenexperiments on sequential machines. Automata studies, 34:129–153, 1956.
[28] Bala Ravikumar and Xuanxing Xiong. A parallel algorithm for minimization of finite automata. In The 10th International Parallel Processing Symposium, pages 187–191, April 1996.
[29] Y. N. Srikant. A parallel algorithm for the minimization of finite state automata. International Journal of Computer Mathematics, 32(12):1–11, 1990.
[30] Siddharth Suri and Sergei Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World wide web, pages 607–614. ACM, 2011.
[31] Deian Tabakov and Moshe Y. Vardi. Experimental evaluation of classical automata constructions. In Logic for Programming, Artificial Intelligence, and Reasoning, 12th International Conference, LPAR 2005, Montego Bay, Jamaica, December 26, 2005, Proceedings, pages 396–411, 2005.
[32] H. M. M. ten Eikelder. Some algorithms to decide the equivalence of recursive types. Technical Report 31, Faculty of Computing Science, Eindhoven University of Technology, 1991.
[33] Ambuj Tewari, Utkarsh Srivastava, and Phalguni Gupta. A parallel DFA minimization algorithm. In High Performance ComputingHiPC, pages 34–40. Springer, 2002.
[34] Gy¨orgy Tur´an. On the computational complexity of mapreduce. In Distributed Computing: 29th International Symposium, volume 9363 of DISC 2015, page 1, Tokyo, Japan, October 2015. Springer.
[35] Friedrich J. Urbanek. On minimizing finite automata. Bulletin of the EATCS, 39:205–206, 1989.
[36] Moshe Y. Vardi. Nontraditional applications of automata theory. In Theoretical Aspects of Computer Software, pages 575–597. Springer, 1994.
[37] Bruce W. Watson. A taxonomy of finite automata minimization algorithms. Computing Science Note 93/44, Eindhoven University of Technology, The Netherlands, 1993.
[38] Bruce W. Watson and Jan Daciuk. An efficient incremental DFA minimization algorithm. Natural Language Engineering, 9(01):49–64, 2003.
[39] Tom White. Hadoop: The Definitive Guide. O’Reilly Media, Inc, 3rd edition, May 2012.
[40] Derick Wood. Theory of Computation. John Wiley, 1987.
[41] Andrew ChiChih Yao. Some complexity questions related to distributive computing. In Proceedings of the Eleventh Annual ACM Symposium on Theory of Computing, STOC ’79, pages 209–213, New York, NY, USA, 1979. ACM.
[42] Michael Yoeli. Canonical representations of chain events. Information and Control, 8(2):180 – 189, 1965.
[43] Paul C. Zikopoulos, Chris Eaton, Drik deRoos, Thomas Deutsch, and George Lapis. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data. McGrawHill Osborne Media, 1st edition, 2011.
Repository Staff Only: item control page