Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

Title:

Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

Gopala Krishnan, Arjun (2021) Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster. Masters thesis, Concordia University.

[thumbnail of GopalaKrishnan_MSC_F2021.pdf]

Preview

Text (application/pdf)
GopalaKrishnan_MSC_F2021.pdf - Accepted Version
Available under License Spectrum Terms of Access.

1MB

Abstract

Prior implementations of Strassen's matrix multiplication algorithm on GPUs traded additional workspace in the form of global memory or registers for time. Although Strassen's algorithm offers a reduction in computational complexity as compared to the classical algorithm, the memory overhead associated with the algorithm limits its practical utility. While there were past attempts at reducing the memory footprint of Strassen's algorithm by compromising parallelism, no prior implementation, to our knowledge, was able to hide the workspace requirement successfully. This thesis presents an implementation of Strassen's matrix multiplication in CUDA, titled Multi-Stage Memory Efficient Strassen (MSMES), that eliminates additional workspace requirements by reusing and recovering input matrices. MSMES organizes the steps involved in Strassen's algorithm into five stages where multiple steps in the same stage can be executed in parallel. Two additional stages are also discussed in the thesis that allows the recovery of the input matrices. Unlike previous works, MSMES has no additional memory requirements irrespective of the level of recursion of Strassen's algorithm. Experiments performed with MSMES (with the recovery stages) on NVIDIA Tesla V100 GPU and NVIDIA GTX 1660ti GPU yielded higher compute performance and lower memory requirements as compared to the NVIDIA library function for double precision matrix multiplication, cublasDgemm. In the multi-GPU adaptation of matrix multiplication, we explore the performance of a Strassen-based and a tile-based global decomposition scheme. We also checked the performance of using MSMES and cublasDgemm for performing local matrix multiplication with each of the global decomposition schemes. From the experiments, it was identified that the combination of using Strassen-Winograd decomposition with MSMES yielded the highest speedup among all the tested combinations.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (Masters)
Authors:	Gopala Krishnan, Arjun
Institution:	Concordia University
Degree Name:	M. Comp. Sc.
Program:	Computer Science
Date:	30 August 2021
Thesis Supervisor(s):	Goswami, Dhrubajyoti
Keywords:	Strassen's algorithm based memory optimized matrix multiplication on GPU
ID Code:	988999
Deposited By:	Arjun Gopala Krishnan
Deposited On:	29 Nov 2021 16:44
Last Modified:	29 Nov 2021 16:44

References:

[1] J. Huang, C. Yu, and R. Van de Geijn, “Strassen's algorithm reloaded on GPUs,” Strassen's Algorithm Reloaded on GPUs | ACM Transactions on Mathematical Software, 01-Apr-2020.
[2] G. Ballard, J. Demmel, O. Holtz, B. Lipshitz and O. Schwartz, "Communication-optimal parallel algorithm for strassen's matrix multiplication", Proceedinbgs of the 24th ACM symposium on Parallelism in algorithms and architectures - SPAA '12, 2012.
[3] C. Lawson, R. Hanson, D. Kincaid and F. Krogh, "Basic Linear Algebra Subprograms for Fortran Usage", ACM Transactions on Mathematical Software, vol. 5, no. 3, pp. 308-323, 1979.
[4] "BLAS Techical Forum", Netlib.org. [Online]. Available: http://netlib.org/blas/blast-forum/.
[5] V. Strassen, "Gaussian elimination is not optimal", Numerische Mathematik, vol. 13, no. 4, pp. 354-356, 1969.
[6] Nvidia, "NVIDIA TESLA V100 GPU ARCHITECTURE", Nvidia, 2017.
[7] J. Fung, F. Tang and S. Mann, "Mediated Reality using Computer Graphics Hardware for Computer Vision", in International Symposium on Wearable Computing, Seattle, Washington, USA, 2002, pp. 7-10.
[8] M. Romero and R. Urra, "CUDA Overview", Cuda.ce.rit.edu, 2012.
[9] "Programming Guide :: CUDA Toolkit Documentation", Docs.nvidia.com, 2021.
[10] A. Kerr, D. Merrill, J. Demouth and J. Tran, "CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Developer Blog", NVIDIA Developer Blog, 2018.
[11] M. Harris, "Unified Memory for CUDA Beginners | NVIDIA Developer Blog", NVIDIA Developer Blog, 2017.
[12] J. Li, S. Ranka and S. Sahni, "Strassen's Matrix Multiplication on GPUs", 2011 IEEE 17th International Conference on Parallel and Distributed Systems, 2011.
[13] S. Ranka and S. Sahni, "SIMD Matrix Multiplication", Bilkent University Lecture Series, pp. 95-110, 1990.
[14] C. Douglas, M. Heroux, G. Slishman and R. Smith, "GEMMW: A Portable Level 3 BLAS Winograd Variant of Strassen's Matrix-Matrix Multiply Algorithm", Journal of Computational Physics, vol. 110, no. 1, pp. 1-10, 1994.
[15] S. Huss-Lederman, E. Jacobson, A. Tsao, T. Turnbull and J. Johnson, "Implementation of Strassen's algorithm for matrix multiplication", Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '96, 1996.
[16] S. Huss-Lederman, E. Jacobson, J. Johnson, A. Tsao and T. Turnbull, Strassen’s algorithm for matrix multiplication: Modeling, analysis, and implementation, CCS-TR-96-17, Center for Computing Sciences, 1996.
[17] P. Lai, H. Arafat, V. Elango and P. Sadayappan, "Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs", 20th Annual International Conference on High Performance Computing, 2013.
[18] B. Boyer, J. Dumas, C. Pernet and W. Zhou, "Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm", Proceedings of the 2009 international symposium on Symbolic and algebraic computation - ISSAC '09, 2009.
[19] P. Zhang and Y. Gao, “Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations,” Lecture Notes in Computer Science, pp. 17–30, 2015.
[20] P. Zhang, Y. Gao and M. Qiu, "A Data-Oriented Method for Scheduling Dependent Tasks on High-Density Multi-GPU Systems," 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems, 2015, pp. 694-699, doi: 10.1109/HPCC-CSS-ICESS.2015.314.
[21] T. Herault, Y. Robert, G. Bosilca and J. Dongarra, "Generic Matrix Multiplication for Multi-GPU Accelerated Distributed-Memory Platforms over PaRSEC," 2019 IEEE/ACM 10th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), 2019, pp. 33-41, doi: 10.1109/ScalA49573.2019.00010.
[22] D. Coppersmith and S. Winograd, On the asymptotic complexity of matrix multiplication. SIAM Journal on Computing 11, 3 (1982), 472–492.
[23] In high-performance computing, what are Rmax and Rpeak? [Online]. Available: https://kb.iu.edu/d/bbzo.
[24] June 2021 | TOP500, Top500.org, 2021. [Online]. Available: https://www.top500.org/lists/top500/2021/06/.
[25] Parsec (parser) - Wikipedia, En.wikipedia.org, 2021. [Online]. Available: https://en.wikipedia.org/wiki/Parsec_(parser).
[26] “Cray XT4,” Wikipedia, 15-Sep-2020. [Online]. Available: https://en.wikipedia.org/wiki/Cray_XT4.
[27] “Memory management,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html.
[28] “CUDA C++ programming guide,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[29] “Host api,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/curand/group__HOST.html.
[30] “std::time,” cppreference.com. [Online]. Available: https://en.cppreference.com/w/cpp/chrono/c/time.
[31] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv.org, 22-Feb-2017. [Online]. Available: https://arxiv.org/abs/1609.02907.
[32] Virya: CSSE GPU server user manual, edition 2.5, Concordia University, 2019, pp. 14
[33] T. Bakker, “Supercomputer,” ECMWF, 07-Jan-2019. [Online]. Available: https://www.ecmwf.int/en/computing/our-facilities/supercomputer.
[34] A. Grama, “The Optimal Matrix-Parenthesization Problem,” in Introduction to parallel computing, Harlow, England: Addison-Wesley, 2013.
[35] “Basic linear algebra subprograms,” Wikipedia, 31-Jul-2021. [Online]. Available: https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms.
[36] “General-purpose computing on graphics processing units,” Wikipedia, 15-Jun-2021. [Online]. Available: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units#Implementations.
[37] J. Fang, A. L. Varbanescu and H. Sips, "A Comprehensive Performance Comparison of CUDA and OpenCL," 2011 International Conference on Parallel Processing, 2011, pp. 216-225, doi: 10.1109/ICPP.2011.45.
[38] “Thread block (CUDA programming),” Wikipedia, 25-Jun-2021. [Online]. Available: https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)#cite_note-6.
[39] “CUDA occupancy Calculator,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html.
[40] “Parallel thread Execution Isa Version 7.4,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
[41] “Volta Architecture Whitepaper.” NVIDIA. [Online]. Available: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[42] R. Smith and N. Oh, “The NVIDIA GeForce Gtx 1660 Ti Review, Feat. Evga XC GAMING: Turing Sheds RTX for the mainstream market,” RSS, 22-Feb-2019. [Online]. Available: https://www.anandtech.com/show/13973/nvidia-gtx-1660-ti-review-feat-evga-xc-gaming/2.
[43] P. Lai, H. Arafat, V. Elango and P. Sadayappan, "Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs," 20th Annual International Conference on High Performance Computing, 2013, pp. 139-148, doi: 10.1109/HiPC.2013.6799109.
[44] “Unified memory for cuda beginners,” NVIDIA Developer Blog, 05-Jan-2021. [Online]. Available: https://developer.nvidia.com/blog/unified-memory-cuda-beginners/.
[45] “cublas,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/cublas/index.html.
[46] “Thread block (CUDA programming),” Wikipedia, 25-Jun-2021. [Online]. Available: https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming).
[47] “Achieved Occupancy,” NVIDIA docs. [Online]. Available: https://docs.nvidia.com/gameworks/content/developertools/desktop/analysis/report/cudaexperiments/kernellevel/achievedoccupancy.htm.
[48] “Profiler user's Guide,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
[49] “Host api,” NVIDIA Developer Documentation. [Online]. Available: https://docs.nvidia.com/cuda/curand/group__HOST.html.

Repository Staff Only: item control page

Download Statistics

Downloads per month over past year

Research related to the current document (at the CORE website)

Spectrum Research Repository

Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

Fast and Memory Efficient Strassen’s Matrix Multiplication on GPU Cluster

Abstract

References: