Login | Register

Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU


Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU

Manoochehri, Shayan (2017) Efficient Transactional-Memory-based Implementation of Morph Algorithms on GPU. Masters thesis, Concordia University.

[thumbnail of Manoochehri_MSc_F2017.pdf]
Text (application/pdf)
Manoochehri_MSc_F2017.pdf - Accepted Version
Available under License Spectrum Terms of Access.


General Purpose GPUs (GPGPUs) are ideal platforms for parallel execution of applications with regular shared memory access patterns. However, majority of real world multithreaded applications require access to shared memory with irregular patterns. The morph algorithms, which arise in many real world applications, change their graph data structures in unpredictable ways, thus, leading to irregular access patterns to shared data. Such irregularity makes morph algorithms more challenging to be implemented on GPUs which favor regularity. The Borouvka’s algorithm for calculating Minimum Spanning Forest (MSF), and multilevel graph partitioning are two examples of morph algorithms with varied levels of expressed parallelism. In this work we show that a transactional-memory-based design and implementation of the morph algorithms on GPUs can handle some of the challenges arising due to irregularities such as complexity of code and overhead of synchronization. First, we identify the major phases of the algorithm which requires synchronization of the shared data. If the algorithm exhibits certain algebraic properties (e.g., monotonicity, idempotency, associativity), we can use lock-free synchronizations for performance; otherwise we utilize a Software Transactional Memory (STM) based synchronization method. Experimental results show that our GPU-based implementation of Borouvka’s algorithm outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities. Moreover, to show the applicability of our approach to other morph algorithms, we do a pen-and-paper implementation and complexity analysis of multilevel graph partitioning.

Divisions:Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:Thesis (Masters)
Authors:Manoochehri, Shayan
Institution:Concordia University
Degree Name:M. Sc.
Program:Computer Science
Date:31 August 2017
Thesis Supervisor(s):Goswami, Dhrubajyoti
Keywords:Minimum Spanning Forest, Graph Partitioning, Software Transactional Memory, lock-free, GPU
ID Code:982993
Deposited On:20 Nov 2017 13:43
Last Modified:18 Jan 2018 17:56


[1] J. Hennessy, D. Patterson, “Computer architecture: a quantitative approach” Elsevier, 2011.
[2] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, X. Sui, “The tao of parallelism in algorithms,” In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, pp. 12–25, ACM, 2011.
[3] O. Borůvka, “O jistém problému minimálním (About a certain minimal problem),” Práce mor. přírodověd. spol. v Brně III 3, pp. 37-58, 1926.
[4] V. Vineet, P. Harish, S. Patidar, P. J. Narayanan, “Fast minimum spanning tree for large graphs on the GPU,” In Proceeding of the Conference on High Performance Graphics, pp. 167- 171, ACM, 2009.
[5] R. Nasre, M. Burtscher, K. Pingali, “Morph algorithms on GPUs,” In Proceedings of ACM SIGPLAN Notices, 48(8), pp. 147-156, ACM, 2013.
[6] R. Nasre, M. Burtscher, K. Pingali, “Atomic-free irregular computations on GPUs,” In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, pp. 96-107, ACM, 2013.
[7] N. Shavit, D. Touitou, “Software transactional memory”, Distributed Computing, 10(2), pp. 99-116, 1997.
[8] D. Cederman, P. Tsigas, M.T. Chaudhry, “Towards a software transactional memory for graphics processors,” In Proceedings of Euro-graphics Symposium on Parallel Graphics and Visualization, pp. 121-129, 2010.
[9] Y. Xu, R. Wang, N. Goswami, T. Li, L, Gao, D. Qian, ”Software transactional memory for GPU architectures,” In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, ACM, 2014.
[10] A. Holey, A. Zhai, “Lightweight software transactions on GPUs, ” In Proceedings of 43rd International Conference on Parallel Processing, pp. 461-470, IEEE, 2014.
[11] Q. Shen, C. Sharp, W. Blewitt, G. Ushaw, G. Morgan, “PR-STM: Priority Rule Based Software Transactions for the GPU,” European Conference on Parallel Processing, pp. 361–372, 2015.
[12] F. Khorasani, R. Gupta Laxmi, N. Bhuyan, “Scalable SIMD-Efficient Graph Processing on GPUs,” In Proceedings of International Conference on Parallel Architecture and Compilation, pp. 39-50, IEEE, 2015.
[13] Nvidia. CUDA. Retrieved September 7, 2016 from http://www.nvidia.com/cuda.
[14] L. G. Valiant. “A bridging model for parallel computation,” Communications of the ACM, 33(8), pp. 103-111, 1990.
[15] M. Herlihy, J. E. B. Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” In Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 289-300, ACM, 1993.
[16] C. J. Rossbach, O. S. Hofmann, E. Witchel, “Is transactional programming actually easier?” In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 45(5), pp. 47-56, 2010.
[17] V. Pankratius, A.-R. Adl-Tabatabai, “A study of transactional memory vs. locks in practice,” In Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp 43-52, ACM, 2011.
[18] R. Ennals, “Efficient software transactional memory,” Technical report, Intel Research Cambridge, UK, 2005.
[19] T. Harris, J. Larus, R. Rajwar, “Transactional Memory,” Synthesis Lectures on Computer Architecture, 5(1), pp. 1-263, 2010.
[20] T. Harris, K. Fraser, “Language Support for Lightweight Transactions,” In ACM SIGPLAN Notices, 38(11), pp. 388-402, ACM, 2003.
[21] Y.C. Tseng, T.T.Y. Juang, M.C. Du, “Building a multicasting tree in a high-speed network,” IEEE Concurrency, 6(4), pp. 57-67, 1998.
[22] L. An, Q.S. Xiang, S. Chavez, “A fast implementation of the minimum spanning tree method for phase unwrapping,” IEEE Transactions on Medical Imaging, 19(8), pp. 805-808, 2000.
[23] S. Kang, D.A. Bader, “An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs,” In Proceedings of the 14th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming, 44(4), pp. 15-24, 2009.
[24] S.V. Adve, K. Gharachorloo, “Shared Memory Consistency Models: A Tutorial”, IEEE Computer Journal 29(12), pp. 66-76, 1995.
[25] DIMACS 10 challenge graph collection – Graph Partitioning and Graph Clustering, Retrieved January 25, 2016 from http://www.cc.gatech.edu/dimacs10/downloads.shtml.
[26] D. A. Bader, K. Madduri, "GTgraph: A Synthetic Graph Generator Suite," Technical Report, 2006.
[27] J. Siek, L. Lee, A. Lumsdaine, "The Boost Graph Library: User Guide and Reference Manual," Addison-Wesley, 2002.
[28] D. Lasalle , G. Karypis, “Multi-threaded Graph Partitioning,” In Proceedings of 27th International Symposium on Parallel and Distributed Processing, pp. 225-236, IEEE, 2013.
[29] T. Bui, C. Jones, “A heuristic for reducing fill in sparse matrix factorization”, In Proceedings of the 6th SIAM Conference, Parallel Processing for Scientific Computing, pp. 445-452, 1993.
[30] R. Leland, B. Hendrickson, “A Multilevel Algorithm for Partitioning Graphs, Technical Report SAND93-1301, Sandia National Laboratories, 1993.
[31] G. Karypis, V. Kumar, “Analysis of Multilevel Graph Partitioning,” Technical Report, TR 95-037, Department of Computer Science, University of Minnesota, 1995.
[32] G. Karypis, V. Kumar, “A fast and highly quality multilevel scheme for partitioning irregular graphs”, SIAM Journal on Scientific Computing, pp. 359-392, 1998.
[33] B. W. Kernighan, S. Lin, “An efficient heuristic procedure for partitioning graphs”, Bell System. Technical Journal, 49, pp. 291-307, 1970.
[34] C. M. Fiduccia, R. M. Mattheyses, “A linear time heuristic for improving network partitions”, In Proceedings of 19th IEEE Design Automation Conference, pp. 175-181, 1982.
[35] D. LaSalle, G Karypis, “A parallel hill-climbing refinement algorithm for graph partitioning,” 45th International Conference on Parallel Processing, pp. 236-241,IEEE, 2016.
[36] M. Harris, S. Sengupta, J.D. Owens, “Parallel prefix sum (scan) with CUDA,” GPU gems, 3(39), pp. 851-876, 2007.
[37] J.T., Schwartz, “Ultracomputers,” ACM Transactions on Programming Languages and Systems , 2(4), pp. 484-521, 1980.
[38] S. Sengupta, M. Harris, Y. Zhang, J.D. Owens, "Scan primitives for GPU computing," In Graphics hardware, vol. 2017, pp. 97-106, 2007.
[39] L. Ma, K. Agrawal, R. D. Chamberlain, "A memory access model for highly-threaded many-core architectures," Future Generation Computer Systems, vol. 30, pp. 202-215, 2014.
[40] M. Lin, R.D. Chamberlain, K. Agrawal, "Analysis of classic algorithms on GPUs," In Proceedings of High Performance Computing & Simulation, pp. 65-73, IEEE, 2014.
[41] B. Goodarzi, M. Burtscher, D. Goswami, "Parallel Graph Partitioning on a CPU-GPU Architecture," In Proceedings of Parallel and Distributed Processing Symposium Workshops, pp. 58-66. IEEE, 2016.
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.

Repository Staff Only: item control page

Downloads per month over past year

Research related to the current document (at the CORE website)
- Research related to the current document (at the CORE website)
Back to top Back to top