Guédon, Timothée (2020) A heuristic to repartition large multi-dimensional arrays with reduced disk seeking. Masters thesis, Concordia University.
Preview |
Text (application/pdf)
579kBGuédon_MCompSc_S2021.pdf - Accepted Version Available under License Spectrum Terms of Access. |
Abstract
Multi-dimensional arrays have become critical scientific data structures, but their manipulation raises performance issues when they exceed memory capacity. In particular, accessing specific array regions can require millions to billions of disk seek operations, with important consequences on I/O performance. While traditional approaches to address this problem focus on file format optimizations, we are searching for algorithmic solutions where applications control I/O to reduce seeking. In this thesis, we propose the keep heuristic to minimize the number of seeks required for the repartitioning of large multi-dimensional arrays. The keep heuristic uses a memory cache to reconstruct contiguous data sections in memory. We evaluate it on arrays of size 85.7 GiB with memory amounts ranging from 4 to 275 GiB. Repartitioning time is reduced by a factor of up to 2.5, and seeking is reduced by four orders of magnitude. Due to the effect of asynchronous writes to memory page cache enabled by the Linux kernel, speed up is only observed for arrays that exceed the size of working memory. The keep heuristic could be applied in platforms that manipulate large data arrays, as is commonly the case in scientific imaging.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Guédon, Timothée |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 2 December 2020 |
Thesis Supervisor(s): | Glatard, Tristan |
Keywords: | big data, repartitioning, rechunking, chunking, IO, seeking, optimization |
ID Code: | 987785 |
Deposited By: | Timothée Guédon |
Deposited On: | 23 Jun 2021 16:39 |
Last Modified: | 23 Jun 2021 16:39 |
Repository Staff Only: item control page