Data management without reinstrumentation: How to speed up existing Big Data neuroimaging workflows

Hayot-Sasson, Valerie ORCID: https://orcid.org/0000-0002-4830-4535 (2022) Data management without reinstrumentation: How to speed up existing Big Data neuroimaging workflows. PhD thesis, Concordia University.

Text (application/pdf)
Hayot-Sasson_PhD_S2022.pdf - Accepted Version
Available under License Spectrum Terms of Access.

2MB

Abstract

Neuroimaging has entered the Big Data era through the adoption of data sharing practices and improved data collection infrastructure that enables higher-resolution imaging. Although pipelines have become increasingly data-intensive, neuroimaging software has adapted only minimally to this shift. Instead, scientific software has focused primarily on ease of use, reproducibility, portability and parallelism. While the goals of Big Data and scientific frameworks differ, their strategies can be combined to make scientific frameworks better suited to processing increasingly prominent scientific Big Data.

The objectives of this thesis are twofold: 1) determine whether neuroimaging frameworks benefit from incorporating Big Data management strategies and investigate how to adapt existing solutions, and 2) develop new tools to enable data management within neuroimaging workflows. Our performance analysis determined that neuroimaging frameworks can benefit significantly from the incorporation of data management strategies, with speedups of up to 5.3X in the most data-intensive case. While we found that Big Data frameworks (e.g. Apache Spark) significantly speed up data-intensive neuroimaging workflows, our analysis of overlay pilot scheduling with Spark determined that large-scale Spark workflows would be difficult to run on HPC systems. Furthermore, while alternative hardware solutions such as Intel Optane DCPMM produce speedups similar to in-memory processing with Spark and could be used as an alternative, they remain inaccessible to many researchers.

To bring data-management solutions to neuroimaging applications, we developed two libraries, namely, Rolling Prefetch and Sea. Rolling Prefetch is our data-management solution for cloud-based applications that enables the sequential prefetching of data located on Amazon S3 storage. Experimental results demonstrate that Rolling Prefetch can speed up experiments by a factor of 1.8X and has a theoretical bound of 2X.
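
The 2X bound is consistent with a simple overlap model, offered here as an illustration and not necessarily the derivation used in the thesis: if a block takes t seconds to transfer and c seconds to process, sequential execution costs t + c per block, while perfectly overlapped execution costs max(t, c), so the ratio can never exceed 2. The sketch below is a minimal Python illustration of sequential prefetching under that assumption. It is not the Rolling Prefetch API: `fetch_block`, `process_block` and the `depth` parameter are hypothetical stand-ins, and remote reads are simulated with `time.sleep`.

```python
import queue
import threading
import time


def fetch_block(index, transfer_s):
    """Hypothetical stand-in for downloading one block from remote storage."""
    time.sleep(transfer_s)
    return f"block-{index}"


def process_block(block, compute_s):
    """Hypothetical stand-in for the computation applied to one block."""
    time.sleep(compute_s)
    return block.upper()


def run_sequential(n_blocks, transfer_s, compute_s):
    """Fetch and process each block in turn, with no overlap."""
    return [
        process_block(fetch_block(i, transfer_s), compute_s)
        for i in range(n_blocks)
    ]


def run_prefetched(n_blocks, transfer_s, compute_s, depth=2):
    """Fetch blocks in a background thread while the main thread computes."""
    # Bounded queue, so the prefetcher cannot run arbitrarily far ahead.
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for i in range(n_blocks):
            buffer.put(fetch_block(i, transfer_s))
        buffer.put(None)  # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (block := buffer.get()) is not None:
        results.append(process_block(block, compute_s))
    return results


def ideal_speedup(transfer_s, compute_s):
    """Speedup of perfect overlap over sequential execution (at most 2)."""
    return (transfer_s + compute_s) / max(transfer_s, compute_s)
```

Under this model the bound is reached only when transfer and compute times are equal; when one dominates, the achievable speedup falls toward 1.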

Sea targets standard neuroimaging workflows executed on HPC systems. It brings prefetching, data locality and in-memory computing to POSIX-based command-line tools through the interception of glibc calls. With this approach, researchers can benefit from data-management-related speedups by incorporating Sea into their standard analyses. Our results on standard neuroimaging pipelines show that Sea can speed up execution by an average of 11X for large datasets when writing to a degraded shared file system.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (PhD)
Authors:	Hayot-Sasson, Valerie
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Software Engineering
Date:	6 January 2022
Thesis Supervisor(s):	Glatard, Tristan
ID Code:	990269
Deposited By:	VALERIE HAYOT-SASSON
Deposited On:	16 Jun 2022 15:09
Last Modified:	16 Jun 2022 15:09
