Jafarpour, Hamed
ORCID: https://orcid.org/0009-0007-9410-7675
(2025)
An Effective Large Language Model-based Pipeline to Preprocess Narrative Electronic Medical Records Data for Hospital Adverse Events Detection.
PhD thesis, Concordia University.
Text (application/pdf), 3MB: Jafarpour_PhD_F2025.pdf - Accepted Version. Restricted to Repository staff only until 1 June 2027. Available under License Spectrum Terms of Access.
Abstract
Narrative Electronic Medical Record (EMR) data is a valuable but challenging resource for analysis because it requires preprocessing, which comprises three essential tasks: section detection, text normalization, and feature engineering. This thesis establishes a pipeline that leverages Large Language Models (LLMs) to preprocess narrative EMR data with the objective of detecting Hospital Adverse Events (HAEs). The proposed pipeline aims to improve the efficiency of HAE detection while reducing reliance on labor-intensive, time-consuming, and costly procedures. HAE detection is typically performed through manual chart reviews, discharge diagnostic coding, prevalence surveys, and incident reporting systems. Recently, researchers have shown growing interest in leveraging narrative EMR data together with Natural Language Processing (NLP), Machine Learning (ML), and LLM techniques. A significant challenge with these techniques is the need to preprocess narrative EMR data; moreover, the existing preprocessing tools for narrative EMR are predominantly designed for general applications rather than being optimized for HAE detection.
This thesis examines the preprocessing of narrative EMR data for HAE detection by developing an LLM-based pipeline. First, given the increasing use of NLP and, consequently, LLMs for HAE detection, a systematic scoping review is conducted to summarize the existing literature and to identify overlooked research gaps and challenges in using narrative EMR data to detect HAEs. The review also underscores the essential role of preprocessing in HAE detection, and the results emphasize that text normalization and feature engineering are preprocessing tasks that significantly affect detection performance.
Second, the LLM-based pipeline tackles the section detection task by designing and implementing a novel multi-head attention mechanism for training LLMs to accurately identify section headers within clinical notes. In contrast to standard attention mechanisms, which attend to all tokens in the input sentence, the proposed customized multi-head attention mechanism selectively directs attention toward tokens that denote section header titles during training. The results show that this approach improves performance across three distinct LLMs: the Text-to-Text Transfer Transformer (T5), Generative Pre-trained Transformer (GPT)-2, and Bidirectional Encoder Representations from Transformers (BERT). Notably, consistent improvements were observed in T5, which is a smaller model.
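
The abstract does not specify how attention is restricted to section-header tokens; the sketch below is one plausible reading, not the thesis implementation. It adds a positive bias to attention logits for key positions flagged as header tokens, so every query attends preferentially to section-header titles during training. All names, shapes, and the bias strategy are assumptions for illustration.

```python
# Hypothetical sketch: multi-head attention biased toward section-header tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeaderFocusedAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=4, header_bias=2.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        self.header_bias = header_bias  # additive logit bonus for header tokens

    def forward(self, x, header_mask):
        # x: (batch, seq, d_model); header_mask: (batch, seq) bool,
        # True where a token belongs to a section-header title.
        b, s, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):  # reshape to (batch, heads, seq, d_head)
            return t.view(b, s, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, s, s)
        # Bias key positions that are header tokens so attention concentrates there.
        bias = header_mask[:, None, None, :].float() * self.header_bias
        attn = F.softmax(scores + bias, dim=-1)
        ctx = (attn @ v).transpose(1, 2).contiguous().view(b, s, -1)
        return self.out(ctx)

# Toy usage: a 6-token "note" where tokens 0-1 form the header "HISTORY:".
x = torch.randn(1, 6, 64)
header_mask = torch.tensor([[True, True, False, False, False, False]])
print(HeaderFocusedAttention()(x, header_mask).shape)  # torch.Size([1, 6, 64])
```

An additive logit bias is only one way to "selectively direct attention"; hard masking of non-header keys or an auxiliary loss on attention weights would be alternative readings of the same idea.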
Third, to address the text normalization challenges, a framework is proposed for detecting and deciphering abbreviations in clinical text using LLMs. The framework is structured into four phases: task definition, properties identification, example selection, and application of LLMs through either fine-tuning or an optimized example-based prompting method. The results demonstrate that fine-tuning LLMs yields superior performance at lower cost than the optimized example-based prompting method, indicating that fine-tuning effectively and efficiently supports the detection and deciphering of abbreviations in clinical notes.
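
To make the prompting branch of the framework concrete, here is a minimal sketch of assembling an example-based (few-shot) prompt from the four phases the abstract names. The prompt wording, example sentences, and function names are hypothetical; the thesis does not publish its prompts here.

```python
# Illustrative only: composing a few-shot prompt for abbreviation expansion.
TASK_DEFINITION = (
    "Identify every abbreviation in the clinical sentence and expand it "
    "using the surrounding context."
)
PROPERTIES = "Abbreviations may be ambiguous; prefer the clinically likely sense."

FEW_SHOT_EXAMPLES = [
    ("Pt c/o SOB on exertion.",
     "Pt = patient; c/o = complains of; SOB = shortness of breath"),
    ("Hx of CHF, on ACE-I.",
     "Hx = history; CHF = congestive heart failure; ACE-I = ACE inhibitor"),
]

def build_prompt(sentence, examples=FEW_SHOT_EXAMPLES):
    """Join task definition, properties, and selected examples into one prompt."""
    parts = [TASK_DEFINITION, PROPERTIES, ""]
    for text, expansion in examples:
        parts.append(f"Sentence: {text}\nExpansions: {expansion}")
    parts.append(f"Sentence: {sentence}\nExpansions:")
    return "\n".join(parts)

print(build_prompt("Pt w/ HTN and DM2, denies CP."))
```

The resulting string would be sent to an LLM; the fine-tuning alternative instead trains the model directly on (sentence, expansions) pairs, which the thesis reports as both more effective and cheaper than optimized prompting.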
In conclusion, this thesis posits that directing customized attention toward the specific target task in LLMs significantly enhances both the effectiveness and efficiency of task performance. This customization can be achieved through various approaches, including designing a customized multi-head attention mechanism during training, formulating engineered prompts, or systematically fine-tuning LLMs.
|   |   |
|---|---|
| Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Concordia Institute for Information Systems Engineering |
| Item Type: | Thesis (PhD) |
| Authors: | Jafarpour, Hamed |
| Institution: | Concordia University |
| Degree Name: | Ph.D. |
| Program: | Information and Systems Engineering |
| Date: | 24 April 2025 |
| Thesis Supervisor(s): | Yan, Jun and Zeng, Yong and Quan, Hude |
| ID Code: | 995663 |
| Deposited By: | Hamed Jafarpour |
| Deposited On: | 04 Nov 2025 16:45 |
| Last Modified: | 04 Nov 2025 16:45 |