Prediction of Indel Flanking Regions and Its Application in the Alignment of Multiple Protein Sequences

Al-Shatnawi, Mufleh Saleh (2015) Prediction of Indel Flanking Regions and Its Application in the Alignment of Multiple Protein Sequences. PhD thesis, Concordia University.

Al-Shatnawi_PhD_F2015.pdf - Accepted Version
5MB

Abstract

Proteins are the most important molecules in living organisms and are involved in every cellular function, such as signal transmission, metabolic regulation, molecular transport, and defense mechanisms. As new protein sequences are discovered every day and protein databases continue to grow exponentially, analyzing protein families, understanding their evolutionary trends, and detecting remote homologues have become extremely important. Traditional laboratory techniques for studying these proteins are slow. Therefore, biologists have turned to automated methods that are fast, capable of analyzing large amounts of data, and able to determine relationships between proteins that would be difficult, if not impossible, for humans to identify through the traditional techniques.

Insertion/deletion (indel) and substitution of amino acids are two common events that lead to the evolution of, and variation in, protein sequences. Further, many human diseases and much of the functional divergence between homologous proteins are related more to indel mutations than to substitution mutations, even though the former occur less often than the latter. Reliable detection of indels and their flanking regions is a major challenge in research on protein evolution, structure, and function.

The first and most important step in studying a newly discovered protein sequence is to search protein databases for proteins that are similar or closely related to the new protein, and then to align the new sequence to these proteins. Thus, the alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics, and it has been used in many applications, including sequence annotation, phylogenetic tree estimation, evolutionary analysis, secondary structure prediction, and protein database search. Despite considerable recent research effort devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment of multiple protein sequences remains a challenging problem.

The objectives of this thesis are to develop a novel scheme to predict indel flanking regions (IndelFRs) in a protein sequence and to develop an efficient algorithm for the alignment of multiple protein sequences incorporating the information on the predicted IndelFRs.

In the first part of the thesis, a variable-order Markov model-based scheme to predict indel flanking regions in a protein sequence for a given protein fold is proposed. In this scheme, two predictors, referred to as the PPM IndelFR and PST IndelFR predictors, are designed based on prediction by partial match and probabilistic suffix tree, respectively. The performance of the proposed IndelFR predictors is evaluated in terms of the commonly used metrics, namely, accuracy of prediction and F1-measure. It is shown through extensive performance evaluation that the proposed predictors are able to predict IndelFRs in the protein sequences with high values of accuracy and F1-measure. It is also shown that if one is interested only in predicting IndelFRs in protein sequences, it would be preferable to use the proposed predictors instead of HMMER 3.0 in view of the substantially superior performance of the former.
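As an illustration of the two evaluation metrics named above, the following is a minimal sketch of how accuracy and F1-measure could be computed from binary labels. It assumes a per-residue labelling (1 = residue lies inside an IndelFR, 0 = otherwise); the abstract does not state whether the thesis scores predictions per residue or per region, and the function name `accuracy_and_f1` is hypothetical.

```python
def accuracy_and_f1(predicted, actual):
    """Return (accuracy, F1) for two equal-length sequences of 0/1 labels.

    Assumption: labels are per residue, with 1 marking a residue inside
    an indel flanking region. This is an illustrative convention, not
    necessarily the one used in the thesis.
    """
    if len(predicted) != len(actual):
        raise ValueError("label sequences must have the same length")

    # Tally the four cells of the confusion matrix.
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1


if __name__ == "__main__":
    acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1])
    print(f"accuracy={acc:.3f} F1={f1:.3f}")
```

Reporting F1 alongside accuracy matters when flanking-region residues are a small fraction of a sequence, since accuracy alone can be high for a predictor that rarely labels anything as an IndelFR.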

In the second part of the thesis, a novel and efficient algorithm for the alignment of multiple protein sequences, incorporating the information on the predicted IndelFRs, is proposed. A new variable gap penalty function is introduced, which makes gap placement in the aligned protein sequences more accurate. The performance of the proposed alignment algorithm, named MSAIndelFR, is evaluated in terms of two standard metrics, the sum-of-pairs (SP) and total-column (TC) scores. It is shown through extensive performance evaluation using four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABmark 1.65, that the performance of MSAIndelFR is superior to that of six of the most widely used alignment algorithms, namely, Clustal W2, Clustal Omega, MSAProbs, Kalign2, MAFFT and MUSCLE.
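The SP and TC scores mentioned above compare a test alignment against a reference alignment of the same sequences. The sketch below shows one common way to compute them, under stated assumptions: alignments are lists of equal-length gapped strings with `-` as the gap character, SP is the fraction of residue pairs aligned in the reference that are also aligned in the test, and TC is the fraction of reference columns (with at least two residues) that appear identically in the test. Benchmark scoring tools may restrict scoring to annotated core blocks, which this sketch does not do; all function names are hypothetical.

```python
from itertools import combinations


def column_indices(alignment):
    """Turn gapped rows into columns of residue indices (None marks a gap)."""
    counters = [0] * len(alignment)
    columns = []
    for col in zip(*alignment):
        entry = []
        for row, ch in enumerate(col):
            if ch == "-":
                entry.append(None)
            else:
                entry.append(counters[row])
                counters[row] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Collect every pair of residues that share a column."""
    pairs = set()
    for col in columns:
        for i, j in combinations(range(len(col)), 2):
            if col[i] is not None and col[j] is not None:
                pairs.add((i, col[i], j, col[j]))
    return pairs


def sp_tc_scores(test, reference):
    """Return (SP, TC) of a test alignment relative to a reference."""
    ref_cols = column_indices(reference)
    test_cols = column_indices(test)

    ref_pairs = aligned_pairs(ref_cols)
    shared = ref_pairs & aligned_pairs(test_cols)
    sp = len(shared) / len(ref_pairs) if ref_pairs else 0.0

    # Assumption: only reference columns aligning two or more residues count.
    scored = [c for c in ref_cols if sum(x is not None for x in c) >= 2]
    test_set = set(test_cols)
    tc = sum(c in test_set for c in scored) / len(scored) if scored else 0.0
    return sp, tc


if __name__ == "__main__":
    reference = ["AC-GT", "ACAGT"]
    test = ["ACG-T", "ACAGT"]
    print(sp_tc_scores(test, reference))
```

TC is the stricter of the two scores: a single misplaced gap breaks the whole column for TC, whereas SP still credits the residue pairs in that column that remain correctly aligned.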

The study undertaken in this thesis shows that reliable detection of indels and their flanking regions can be achieved by using the proposed IndelFR predictors, and that a substantial improvement in protein alignment accuracy can be achieved by using the proposed variable gap penalty function. Thus, it is anticipated that this investigation will not only facilitate future studies on the modeling of indel mutations and protein sequence alignment, but will also open up new avenues for research concerning protein evolution, structures, and functions.

Divisions: Concordia University > Gina Cody School of Engineering and Computer Science > Electrical and Computer Engineering
Item Type: Thesis (PhD)
Authors: Al-Shatnawi, Mufleh Saleh
Institution: Concordia University
Degree Name: Ph.D.
Program: Electrical and Computer Engineering
Date: 10 September 2015
Thesis Supervisor(s): Ahmad, M. Omair and Swamy, M.N.S.
Keywords: Indel Flanking Regions (IndelFRs), IndelFR Predictor, Variable Gap Penalty, Multiple Sequence Alignment (MSA), Protein Sequence Alignment
ID Code: 980541
Deposited By: Mufleh Saleh Al-Shatnawi
Deposited On: 28 Oct 2015 12:21
Last Modified: 18 Jan 2018 17:51
All items in Spectrum are protected by copyright, with all rights reserved. The use of items is governed by Spectrum's terms of access.
