
Facial Attractiveness Prediction Using a Single and Multi-Task Vision Transformer Framework


Ghorbanimehr, Mohammad Soroush ORCID: https://orcid.org/0000-0002-4196-9358 (2025) Facial Attractiveness Prediction Using a Single and Multi-Task Vision Transformer Framework. Masters thesis, Concordia University.

Text (application/pdf)
Ghorbanimehr_MSc_F2025.pdf - Accepted Version
Available under License Spectrum Terms of Access.
1MB

Abstract

Facial attractiveness prediction is a challenging and inherently subjective task in computer vision, with applications spanning social media, cosmetic technology, and aesthetic medicine. While convolutional neural networks (CNNs) have driven significant advances in this area, recent developments in transformer-based architectures, such as the Vision Transformer (ViT), offer new opportunities by capturing global feature relationships and long-range dependencies within images. This thesis explores the use of Vision Transformers for predicting facial attractiveness on the SCUT-FBP5500 dataset, where beauty scores are computed as the average ratings of multiple human annotators. The task is formulated as a regression problem that predicts continuous attractiveness scores. To enhance the learned feature representations, a multi-task learning framework is introduced that jointly performs gender and ethnicity classification alongside beauty prediction. The methodology includes systematic image preprocessing, transfer learning with a ViT pretrained on large-scale facial recognition data, and fine-tuning for both the primary and auxiliary tasks. Model performance is evaluated using Pearson correlation (PC), mean absolute error (MAE), and root mean squared error (RMSE) for the regression task, and classification accuracy for the auxiliary tasks. Comparative experiments with CNN-based baselines demonstrate that transformer architectures capture more holistic and subtle aesthetic cues, resulting in improved prediction consistency. Experimental results show that the proposed ViT-based approach achieves superior accuracy and robustness compared to conventional CNNs, even with limited training data. These findings highlight the potential of Vision Transformers as an effective and data-efficient alternative for facial aesthetic analysis. The thesis concludes by emphasizing the value of multi-task learning in enriching feature representations and by encouraging future research toward interpretable and scalable beauty prediction systems.
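The abstract summarizes the multi-task formulation without implementation detail. The sketch below is a minimal illustration of that idea, not the thesis's actual code: it assumes PyTorch with the timm library, uses an ImageNet-pretrained ViT (vit_base_patch16_224) as a stand-in for the face-recognition-pretrained backbone described above, and the ethnicity class count and loss weights are hypothetical placeholders.

    import torch
    import torch.nn as nn
    import timm  # assumption: timm supplies the pretrained ViT backbone


    class MultiTaskViT(nn.Module):
        """Illustrative multi-task head on a shared ViT backbone:
        beauty score regression plus gender and ethnicity classification."""

        def __init__(self, backbone_name: str = "vit_base_patch16_224",
                     n_ethnicities: int = 2):
            super().__init__()
            # num_classes=0 makes timm return pooled features instead of logits
            self.backbone = timm.create_model(backbone_name, pretrained=True,
                                              num_classes=0)
            dim = self.backbone.num_features
            self.beauty_head = nn.Linear(dim, 1)          # continuous attractiveness score
            self.gender_head = nn.Linear(dim, 2)          # binary gender classification
            self.ethnicity_head = nn.Linear(dim, n_ethnicities)

        def forward(self, images: torch.Tensor):
            feats = self.backbone(images)                 # (B, dim) pooled ViT features
            return (
                self.beauty_head(feats).squeeze(-1),      # (B,) regression output
                self.gender_head(feats),                  # (B, 2) logits
                self.ethnicity_head(feats),               # (B, n_ethnicities) logits
            )


    def multitask_loss(beauty_pred, gender_logits, eth_logits,
                       beauty_true, gender_true, eth_true,
                       w_beauty=1.0, w_gender=0.1, w_eth=0.1):
        """Weighted sum of regression and classification losses;
        the weights here are illustrative, not the thesis's values."""
        mse = nn.functional.mse_loss(beauty_pred, beauty_true)
        ce_gender = nn.functional.cross_entropy(gender_logits, gender_true)
        ce_eth = nn.functional.cross_entropy(eth_logits, eth_true)
        return w_beauty * mse + w_gender * ce_gender + w_eth * ce_eth

Under such a formulation, the shared backbone is fine-tuned jointly, so gradients from the auxiliary gender and ethnicity heads regularize the features used for beauty regression.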

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (Masters)
Authors: Ghorbanimehr, Mohammad Soroush
Institution: Concordia University
Degree Name: M. Comp. Sc.
Program: Computer Science
Date: 12 September 2025
Thesis Supervisor(s): Suen, Ching Y.
ID Code: 996307
Deposited By: Mohammad Soroush Ghorbanimehr
Deposited On: 04 Nov 2025 15:37
Last Modified: 04 Nov 2025 15:37
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.


