Deep learning (DL) models for Human Activity Recognition (HAR) using wearable inertial measurement unit (IMU) sensors have shown great promise in applications such as continuous healthcare monitoring and early disease prediction. However, most DL HAR models remain untested in real-world scenarios laden with variabilities; instead, they are trained and tested on constrained, closely curated HAR datasets that assume an ideal setting. This thesis examines the effects of real-world variabilities such as subject, device, position, and orientation on the performance of DL HAR models. Because existing datasets cannot isolate these variabilities, we collected our own, the HARVAR dataset. Using it, we isolated the effect of each variability and developed a nuanced understanding of how each affects DL HAR model performance. We used Maximum Mean Discrepancy (MMD) to quantify the shift in data distribution caused by each isolated variability and related the drop in performance to the change in data distribution. We then used the REALDISP dataset in a case study to understand the effects of compounded, unisolated variabilities in the real world. We found that different variabilities have varying effects on DL HAR model performance, ranging from insignificant to detrimental. Results from both the HARVAR and REALDISP datasets show a negative correlation between the MMD and the performance drop of the DL HAR models. The study emphasizes the need for more robust models and for pre-processing methodologies that optimize IMU data for training robust DL HAR models.
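
To make the MMD measure mentioned above concrete, the following is a minimal sketch of a biased squared-MMD estimate between two sets of feature vectors. It is an illustration only, not the thesis's implementation: the RBF kernel, the bandwidth `gamma`, the function names, and the synthetic six-dimensional data standing in for IMU feature windows are all assumptions made for this example.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    # Pairwise squared Euclidean distances via the expansion
    # ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u.v
    sq = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2_biased(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x and y."""
    kxx = rbf_kernel(x, x, gamma)
    kyy = rbf_kernel(y, y, gamma)
    kxy = rbf_kernel(x, y, gamma)
    return float(kxx.mean() + kyy.mean() - 2.0 * kxy.mean())


# Synthetic stand-ins for feature windows (assumption: 6 features per window).
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=(200, 6))   # reference condition
same = rng.normal(0.0, 1.0, size=(200, 6))     # same distribution
shifted = rng.normal(1.5, 1.0, size=(200, 6))  # simulated distribution shift

mmd_same = mmd2_biased(source, same, gamma=0.1)
mmd_shifted = mmd2_biased(source, shifted, gamma=0.1)
print(f"MMD^2 (same distribution):    {mmd_same:.4f}")
print(f"MMD^2 (shifted distribution): {mmd_shifted:.4f}")
```

In this sketch the shifted sample yields a clearly larger value than the sample drawn from the same distribution, which is the property that lets MMD serve as a measure of distribution shift. The kernel and bandwidth choices affect the magnitude of the estimate, so values are only comparable when computed with the same settings.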