Barghi, Soudabeh (2018) Predicting Computational Reproducibility of Data Analysis Pipelines in Large Population Studies Using Collaborative Filtering. Masters thesis, Concordia University.
Text (application/pdf): Barghi_MCompSc_S2019.pdf (1MB) - Accepted Version
Abstract
Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, because of their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the problem as a collaborative filtering process, with constraints on the construction of the training set. We propose six strategies to build the training set, which we evaluate on two datasets: a synthetic one modeling a population with a growing number of subject types, and a real one obtained with neuroinformatics pipelines. Results show that one sampling method, “Random File Numbers (Uniform)”, predicts computational reproducibility with good accuracy. We also analyse the relevance of including file and subject biases in the collaborative filtering model. We conclude that the proposed method can substantially speed up reproducibility evaluations, with only a small loss in accuracy.
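To make the collaborative filtering formulation more concrete, the following is a minimal sketch of a matrix factorization model with subject and file biases. It is not the thesis's implementation. It assumes, for illustration only, that reproducibility is encoded as a binary subject-by-file matrix (1 if a file is identical across runs, 0 otherwise) and that the training set is an arbitrary subset of its entries. The function names, hyperparameters, and the toy data are all hypothetical, and the sampling strategies evaluated in the thesis are not reproduced here.

```python
import random


def train_biased_mf(ratings, n_subjects, n_files, n_factors=2,
                    lr=0.05, reg=0.02, n_epochs=200, seed=0):
    """Fit score = mu + b_s + b_f + p_s . q_f by SGD on the observed entries.

    ratings: list of (subject, file, value) tuples, value in {0, 1}.
    The binary encoding and all hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    mu = sum(v for _, _, v in ratings) / len(ratings)  # global mean
    b_s = [0.0] * n_subjects                           # subject biases
    b_f = [0.0] * n_files                              # file biases
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_subjects)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_files)]

    samples = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(samples)
        for s, f, v in samples:
            dot = sum(a * b for a, b in zip(p[s], q[f]))
            err = v - (mu + b_s[s] + b_f[f] + dot)
            # Regularized gradient steps on biases and latent factors.
            b_s[s] += lr * (err - reg * b_s[s])
            b_f[f] += lr * (err - reg * b_f[f])
            for k in range(n_factors):
                ps, qf = p[s][k], q[f][k]
                p[s][k] += lr * (err * qf - reg * ps)
                q[f][k] += lr * (err * ps - reg * qf)
    return mu, b_s, b_f, p, q


def predict(model, s, f):
    """Return 1 (predicted reproducible) if the score reaches 0.5, else 0."""
    mu, b_s, b_f, p, q = model
    score = mu + b_s[s] + b_f[f] + sum(a * b for a, b in zip(p[s], q[f]))
    return 1 if score >= 0.5 else 0


if __name__ == "__main__":
    # Toy data (hypothetical): files 0-2 are reproducible for every subject,
    # files 3-5 are not. Two entries are held out and then predicted.
    truth = [[1 if f < 3 else 0 for f in range(6)] for _ in range(8)]
    held_out = {(0, 0), (1, 4)}
    train = [(s, f, truth[s][f]) for s in range(8) for f in range(6)
             if (s, f) not in held_out]
    model = train_biased_mf(train, n_subjects=8, n_files=6)
    for s, f in sorted(held_out):
        print((s, f), "predicted:", predict(model, s, f), "actual:", truth[s][f])
```

In this sketch the bias terms capture files that tend to differ for most subjects and subjects whose outputs tend to differ across most files, while the latent factors capture any remaining subject-file interaction. Dropping `b_s` and `b_f` from the score would give an unbiased variant for comparison.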
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Barghi, Soudabeh |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science |
Date: | 19 November 2018 |
Thesis Supervisor(s): | Glatard, Tristan |
ID Code: | 984854 |
Deposited By: | SOUDABEH BARGHI |
Deposited On: | 27 Oct 2022 13:48 |
Last Modified: | 27 Oct 2022 13:48 |