Integrating Vision-Language Models with Reinforcement Learning for Human-Aligned Decision-Making of Autonomous Vehicles

Doroudian, Erfan (2024) Integrating Vision-Language Models with Reinforcement Learning for Human-Aligned Decision-Making of Autonomous Vehicles. Masters thesis, Concordia University.

Doroudian_MASc_S2025.pdf - Accepted Version
Available under License Spectrum Terms of Access.

Abstract

This thesis develops a new approach to improving decision-making for autonomous vehicles (AVs)
in complex urban driving scenarios, particularly at unsignalized intersections, within the
reinforcement learning (RL) framework. A primary difficulty in RL is designing a suitable reward
model, which is often hard to do manually because of the complexity of the interactions and the
driving scenarios. To address this challenge, this work uses Vision-Language Models (VLMs),
specifically CLIP (Contrastive Language-Image Pretraining), to build an additional reward model
based on visual and textual cues. CLIP’s ability to align image and text embeddings makes it
possible to translate human-like instructions into reward signals that guide the AV’s
decision-making. We apply two RL algorithms, Proximal Policy Optimization (PPO) and Deep
Q-Network (DQN), to train an agent in complex unsignalized intersection environments. We compare
the performance of each algorithm with and without the CLIP-based reward model, which shows the
impact of CLIP on the agent’s ability to learn and optimize behavior that aligns with desired
driving actions. The results demonstrate the capability of VLMs to improve RL-based
decision-making in autonomous driving. We use the Highway-env simulation package, built on the
OpenAI Gym framework, to test and validate the proposed framework. Our simulation experiments
indicate that the framework is effective in optimizing traffic flow, minimizing collisions, and
balancing individual and collective benefits among road users. The results highlight the
potential of integrating VLMs to provide human-aligned instructions, which could guide
autonomous vehicle actions toward safer and more socially acceptable behaviors and eventually
promote the safe and trustworthy deployment of AVs in future intelligent transportation systems.
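
The abstract does not give the exact form of the CLIP-based reward, so the following is only a minimal sketch of one plausible reading: the cosine similarity between an image embedding of the current scene and a text embedding of a desired-behavior instruction is added to the environment reward. The function names, the additive combination, and the `weight` parameter are all assumptions made for illustration, and the embeddings are assumed to have been computed beforehand by CLIP's image and text encoders.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def clip_shaped_reward(env_reward, image_embedding, text_embedding, weight=0.5):
    """Hypothetical shaping rule: environment reward plus a weighted alignment term.

    image_embedding: assumed CLIP embedding of the rendered driving scene.
    text_embedding:  assumed CLIP embedding of an instruction describing the
                     desired driving behavior.
    weight:          assumed scaling factor; not taken from the thesis.
    """
    alignment = cosine_similarity(image_embedding, text_embedding)
    return env_reward + weight * alignment


# Toy usage with made-up 3-dimensional vectors standing in for CLIP embeddings.
scene = [0.2, 0.9, 0.1]
instruction = [0.1, 0.8, 0.3]
shaped = clip_shaped_reward(env_reward=1.0, image_embedding=scene,
                            text_embedding=instruction, weight=0.5)
```

Under this assumed rule, setting `weight` to zero recovers the plain environment reward, which corresponds to the "without CLIP" baseline that the abstract compares against.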

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Mechanical, Industrial and Aerospace Engineering
Item Type:	Thesis (Masters)
Authors:	Doroudian, Erfan
Institution:	Concordia University
Degree Name:	M.A.Sc.
Program:	Mechanical Engineering
Date:	29 November 2024
Thesis Supervisor(s):	Taghavifar, Hamid
ID Code:	995019
Deposited By:	Erfan Doroudian
Deposited On:	17 Jun 2025 17:11
Last Modified:	17 Jun 2025 17:11
