Instruction-Hierarchy Violations and Robustness of Large Language Models to Prompt Injection Attacks

Title:

Instruction-Hierarchy Violations and Robustness of Large Language Models to Prompt Injection Attacks

Acquah, Mary (2026) Instruction-Hierarchy Violations and Robustness of Large Language Models to Prompt Injection Attacks. Masters thesis, Concordia University.

Preview

Text (application/pdf)
Acquah_MA_F2026.pdf - Accepted Version
Available under License Spectrum Terms of Access.

214kB

Abstract

This thesis presents a systematic evaluation of the robustness of modern Large Language Models
(LLMs) against prompt injection attacks. Prompt injection is formalized as a violation of instruc
tion hierarchy, wherein untrusted user-provided input overrides, modi es, or circumvents trusted
system-level instructions. Rather than relying on informal demonstrations of adversarial prompting,
this work establishes a reproducible benchmarking framework for measuring instruction-separation
failures in a controlled and quanti able fashion. The study builds upon a recent benchmarking
methodology and extends it to newer generations of open-weight LLMs, speci cally LLaMA (Meta
AI) and unexplored Qwen (Alibaba), which currently rival proprietary systems such as GPT models
in reasoning, coding, and multilingual performance. The thesis focuses on simple yet representative
attack classes, including Naive Prompt Injection, Fake Completion attacks, and Escape Character
injection. This deliberate methodological choice enables standardized, interpretable, and repro
ducible robustness evaluation across models and tasks. By demonstrating that even trivial adver
sarial instructions can systematically override trusted prompts, the thesis highlights structural vul
nerabilities in instruction-following architectures rather than weaknesses tied to highly specialized
exploits. Evaluation is conducted across ve task families representing common LLM deployment
settings: (i) Constrained news summarization with strict output-length requirements; (ii) Binary
sentiment classi cation; (iii) Paraphrase identi cation; (iv) SMS spam detection, and; (v) Hate and
offensive content detection. These tasks span both generative settings and structured-output classi
cation scenarios. Different datasets and attack variants are considered to assess consistency and
cross-task robustness patterns. The results reveal that both LLaMA and Qwen exhibit measurable
iii
susceptibility to instruction-hierarchy violations across tasks, even under minimal adversarial per
turbations. These ndings suggest that prompt injection remains a fundamental reliability challenge
in contemporary instruction-tuned LLMs.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering
Item Type:	Thesis (Masters)
Authors:	Acquah, Mary
Institution:	Concordia University
Degree Name:	M.A. Sc.
Program:	Information Systems Security
Date:	11 March 2026
Thesis Supervisor(s):	Mohammadi, Dr. Arash
ID Code:	996874
Deposited By:	Mary Acquah
Deposited On:	29 Jun 2026 14:42
Last Modified:	29 Jun 2026 14:42

Repository Staff Only: item control page

Download Statistics

Downloads per month over past year

Research related to the current document (at the CORE website)

Spectrum Research Repository

Instruction-Hierarchy Violations and Robustness of Large Language Models to Prompt Injection Attacks

Instruction-Hierarchy Violations and Robustness of Large Language Models to Prompt Injection Attacks

Abstract