Login | Register

A heuristic to repartition large multi-dimensional arrays with reduced disk seeking

Title:

A heuristic to repartition large multi-dimensional arrays with reduced disk seeking

Guédon, Timothée (2020) A heuristic to repartition large multi-dimensional arrays with reduced disk seeking. Masters thesis, Concordia University.

[thumbnail of Guédon_MCompSc_S2021.pdf]
Preview
Text (application/pdf)
Guédon_MCompSc_S2021.pdf - Accepted Version
Available under License Spectrum Terms of Access.
579kB

Abstract

Multi-dimensional arrays have become critical scientific data structures, but their manipulation raises performance issues when they exceed memory capacity. In particular, accessing specific array regions can require millions to billions of disk seek operations, with important consequences on I/O performance. While traditional approaches to address this problem focus on file format optimizations, we are searching for algorithmic solutions where applications control I/O to reduce seeking. In this thesis, we propose the keep heuristic to minimize the number of seeks required for the repartitioning of large multi-dimensional arrays. The keep heuristic uses a memory cache to reconstruct contiguous data sections in memory. We evaluate it on arrays of size 85.7 GiB with memory amounts ranging from 4 to 275 GiB. Repartitioning time is reduced by a factor of up to 2.5, and seeking is reduced by four orders of magnitude. Due to the effect of asynchronous writes to memory page cache enabled by the Linux kernel, speed up is only observed for arrays that exceed the size of working memory. The keep heuristic could be applied in platforms that manipulate large data arrays, as is commonly the case in scientific imaging.

Divisions:Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:Thesis (Masters)
Authors:Guédon, Timothée
Institution:Concordia University
Degree Name:M. Comp. Sc.
Program:Computer Science
Date:2 December 2020
Thesis Supervisor(s):Glatard, Tristan
Keywords:big data, repartitioning, rechunking, chunking, IO, seeking, optimization
ID Code:987785
Deposited By: Timothée Guédon
Deposited On:23 Jun 2021 16:39
Last Modified:23 Jun 2021 16:39
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.

Repository Staff Only: item control page

Downloads per month over past year

Research related to the current document (at the CORE website)
- Research related to the current document (at the CORE website)
Back to top Back to top