Efficient Fine-Tuning Strategies for Federated Learning: Optimizing Model Performance Across Distributed Networks

Bernier, Nicolas (2024) Efficient Fine-Tuning Strategies for Federated Learning: Optimizing Model Performance Across Distributed Networks. Masters thesis, Concordia University.

Text (application/pdf)
Bernier_MA_F2024.pdf - Accepted Version (7MB)
Available under License Spectrum Terms of Access.

Abstract

Federated Learning (FL) allows a global model to be trained collaboratively by a number of clients without sharing data. This setting is often characterized by resource-constrained clients connected over a low-bandwidth network. Hence, algorithms designed for the setting must account for important factors such as compute and memory requirements, robustness under changing data distributions, and communication costs. Recent works have started demonstrating the benefits of using pretrained models over random initialization with respect to these considerations. We cover these recent advancements before introducing methods conceived along the same lines. We show that in the FL setting, fitting a classifier using Nearest Class Means (NCM) can be done exactly. We demonstrate its efficiency and combine it with full fine-tuning to produce stronger performance. Then, we introduce an adapted zeroth-order method capable of bringing a model to convergence with a minimal per-round compute budget while reducing the memory burden for clients during training down to that of inference. This work presents several experiments demonstrating the effectiveness of the proposed methods and highlights the importance of additional work on the application of pretrained models in the FL setting.
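The two methods summarized above can be illustrated briefly. Because class means are linear statistics, an NCM classifier can be fit exactly in FL: each client sends per-class feature sums and counts computed with a frozen pretrained backbone, and the server aggregates them before dividing. The following is a minimal sketch of that aggregation, not the thesis's code; all names (client_class_stats, fit_global_ncm, ncm_predict) are hypothetical.

import numpy as np

def client_class_stats(features, labels, num_classes):
    # On-device: per-class feature sums and counts from frozen-backbone features.
    dim = features.shape[1]
    sums = np.zeros((num_classes, dim))
    counts = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        sums[c] = features[mask].sum(axis=0)
        counts[c] = mask.sum()
    return sums, counts

def fit_global_ncm(client_stats):
    # Server: summing clients' sums and counts yields the exact global class means,
    # identical to fitting NCM on the pooled (never shared) data.
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    return total_sums / np.maximum(total_counts, 1)[:, None]

def ncm_predict(features, class_means):
    # Assign each sample to the nearest class mean (Euclidean distance).
    dists = np.linalg.norm(features[:, None, :] - class_means[None, :, :], axis=-1)
    return dists.argmin(axis=1)

The memory claim for the zeroth-order method rests on the fact that a gradient can be estimated from forward passes alone, so no activations need to be stored for backpropagation. A generic SPSA-style step, assuming the parameters are flattened into a single vector and not the thesis's exact adaptation, looks roughly like:

def spsa_step(params, loss_fn, lr=1e-4, eps=1e-3, rng=np.random.default_rng()):
    # One zeroth-order update: two forward passes, no backprop, no stored activations.
    z = rng.standard_normal(params.shape)            # shared random direction
    loss_plus = loss_fn(params + eps * z)            # forward pass 1
    loss_minus = loss_fn(params - eps * z)           # forward pass 2
    grad_est = (loss_plus - loss_minus) / (2 * eps)  # projected directional derivative
    return params - lr * grad_est * z                # update along the same direction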

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Bernier, Nicolas
Institution: Concordia University
Degree Name: M.A. Sc.
Program: Computer Science
Date: 24 November 2024
Thesis Supervisor(s): Belilovsky, Eugene
ID Code: 994829
Deposited By: Nicolas Bernier
Deposited On: 17 Jun 2025 17:32
Last Modified: 17 Jun 2025 17:32
