Ghazikhani, Hamed (2024) Membrane Protein Classification with Protein Language Models. PhD thesis, Concordia University.
Text (application/pdf), 18MB: Ghazikhani_PhD_S2024 (for Spring).pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
This thesis investigates the application of Protein Language Models (PLMs) to enhance the classification of membrane proteins, which are crucial for cellular functions and pharmacological targeting but challenging to characterize due to their context within a membrane. We employ PLMs adapted from the Large Language Models used in natural language processing, including ProtBERT, ProtT5, ESM-1b, ESM-2, and Ankh. These PLMs are pretrained using self-supervised learning on extensive datasets such as UniRef50 (40 million proteins) and BFD (2 billion proteins).
Our research comprises four interconnected projects focused on discriminating membrane proteins, transport proteins, and ion channels from proteins not in those classes. We use established state-of-the-art (SOTA) tools with standard datasets for training and testing as a baseline for evaluating our work.
The first project demonstrates that fine-tuning is beneficial in classifying membrane proteins, with a fine-tuned combination of ProtBERT-BFD and logistic regression (LR) outperforming the SOTA. The second project shows that Convolutional Neural Networks (CNNs) are superior to traditional classifiers when used with PLMs for membrane protein, transport protein, and ion channel classification, again surpassing SOTA performance.
In the third project, we evaluate six PLMs and six downstream classifiers across three tasks, considering fine-tuned and frozen representations, dataset balance, and floating-point precision. ESM-1b emerges as the top performer across most tasks and metrics. We confirm that fine-tuning outperforms frozen representations, imbalanced datasets work best, and there is no statistically significant difference between half- and full-precision computations.
The fourth project incorporates secondary structure information into Ankh. Evaluation across multiple tasks shows little statistically significant difference between Ankh and the modified PLM with secondary structure information.
The tools developed in this research now represent the state of the art in membrane protein classification. Our methodological findings provide insights into PLM applications for protein classification in general, and for membrane proteins in particular, which are highly relevant to drug discovery.
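The general shape of the pipeline named in the first project (a PLM representation passed to a logistic regression classifier) can be illustrated with a minimal sketch. This is not the thesis's implementation: the per-residue vectors below are random toy data standing in for PLM output, and the mean-pooling step, the embedding dimension, the hyperparameters, and all function names are assumptions made for illustration only.

```python
import math
import random


def mean_pool(residue_embeddings):
    """Average per-residue vectors into one fixed-length protein vector."""
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(vec[d] for vec in residue_embeddings) / n for d in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(X[0])
    n = len(X)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def toy_protein(label, rng, dim=8):
    """Hypothetical stand-in for PLM output: random per-residue vectors
    whose mean depends on the class label (toy data, not real embeddings)."""
    length = rng.randint(20, 60)
    shift = 1.0 if label == 1 else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(length)]


rng = random.Random(0)
labels = [i % 2 for i in range(40)]  # 1 = membrane, 0 = non-membrane (toy labels)
X = [mean_pool(toy_protein(label, rng)) for label in labels]
w, b = train_logistic_regression(X, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(X, labels)) / len(labels)
print(f"training accuracy on toy data: {accuracy:.2f}")
```

In a real setting the toy vectors would be replaced by representations extracted from a pretrained PLM, and evaluation would use held-out test data rather than training accuracy.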
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Ghazikhani, Hamed |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Computer Science |
Date: | 11 July 2024 |
Thesis Supervisor(s): | Butler, Gregory |
ID Code: | 994084 |
Deposited By: | Hamed Ghazikhani |
Deposited On: | 24 Oct 2024 16:27 |
Last Modified: | 24 Oct 2024 16:27 |