
Membrane Protein Classification with Protein Language Models


Ghazikhani, Hamed (2024) Membrane Protein Classification with Protein Language Models. PhD thesis, Concordia University.

Text (application/pdf)
Ghazikhani_PhD_S2024 (for Spring).pdf - Accepted Version
Available under License Spectrum Terms of Access.
18MB

Abstract

This thesis investigates the application of Protein Language Models (PLMs) to the classification of membrane proteins, which are crucial for cellular functions and pharmacological targeting but challenging to characterize because of their context within a membrane. We employ PLMs adapted from the Large Language Models of natural language processing, including ProtBERT, ProtT5, ESM-1b, ESM-2, and Ankh. These PLMs are pretrained using self-supervised learning on extensive datasets such as UniRef50 (40 million proteins) and BFD (2 billion proteins).

Our research comprises four interconnected projects focused on discriminating membrane proteins, transport proteins, and ion channels from proteins not in those classes. We use established state-of-the-art (SOTA) tools with standard datasets for training and testing as a baseline for evaluating our work.

The first project demonstrates that fine-tuning is beneficial in classifying membrane proteins, with a fine-tuned combination of ProtBERT-BFD and logistic regression (LR) outperforming the SOTA tools. The second project shows that Convolutional Neural Networks (CNNs) are superior to traditional classifiers when used with PLMs for membrane protein, transport protein, and ion channel classification, again surpassing SOTA performance.

In the third project, we evaluate six PLMs and six downstream classifiers across three tasks, considering fine-tuned and frozen representations, dataset balance, and floating-point precision. ESM-1b emerges as the top performer across most tasks and metrics. We confirm that fine-tuning outperforms frozen representations, imbalanced datasets work best, and there is no statistically significant difference between half- and full-precision computations.

The fourth project incorporates secondary structure information into Ankh. Evaluation across multiple tasks shows little statistically significant difference between Ankh and the modified PLM with secondary structure information.

The tools developed in this research now represent the state-of-the-art in membrane protein classification. Our methodological findings provide insights into PLM applications for protein classification in general, and for membrane proteins in particular, which are highly relevant to drug discovery.
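
As an illustration of how two classifiers evaluated on the same test set can be checked for a statistically significant difference (for example, the half- versus full-precision comparison in the third project), the sketch below implements an exact McNemar test. The abstract does not state which statistical test the thesis uses, so the choice of McNemar's test is an assumption, and the function name and the correctness flags are hypothetical and carry no results from the thesis.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-example correctness flags.

    Assumption: this test is chosen for illustration only; the thesis
    abstract does not say which significance test was applied.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both classifiers must be scored on the same examples")
    # Discordant pairs: exactly one of the two classifiers is right.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree, so no evidence of a difference
    k = min(only_a, only_b)
    # Under the null hypothesis each discordant pair favours either side with p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-protein correctness flags, not data from the thesis.
full_precision_correct = [True, True, False, True, True, False, True, True]
half_precision_correct = [True, True, False, True, False, False, True, True]
p_value = mcnemar_exact(full_precision_correct, half_precision_correct)
print(f"p = {p_value:.3f}")
```

A large p-value here would be read as no detectable difference between the two settings on this test set; it does not show that the settings are equivalent.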

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type: Thesis (PhD)
Authors: Ghazikhani, Hamed
Institution: Concordia University
Degree Name: Ph. D.
Program: Computer Science
Date: 11 July 2024
Thesis Supervisor(s): Butler, Gregory
ID Code: 994084
Deposited By: Hamed Ghazikhani
Deposited On: 24 Oct 2024 16:27
Last Modified: 24 Oct 2024 16:27
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
