Kehyayan, Christine Houry (2014) Using Synteny in Phylogenomics Algorithms to Cluster Protein Sequences. PhD thesis, Concordia University.
Text (application/pdf), 4MB: Kehyayan_PhD_S2014.pdf - Accepted Version. Available under License Spectrum Terms of Access.
Abstract
With the rapid development of genome sequencing technologies, complete genomes are becoming increasingly available, and the need for computational methods for protein functional annotation is becoming more pressing. A long-standing problem in protein functional annotation is distinguishing orthologs from paralogs. Several academic efforts have recently emerged to cluster proteins automatically, on the premise that proteins in the same cluster are likely to be orthologs or to have similar functions. The effectiveness of these protein clustering algorithms is fundamental to building accurate functional annotation pipelines.
This dissertation first presents a study of how effectively the similarity graph-based Markov CLuster algorithm (MCL) detects protein families and subfamilies when it is used to cluster experimentally characterized enzymes from fungal genomes in the mycoCLAP database. Our study shows that the MCL algorithm successfully clusters proteins such that proteins in the same cluster always belong to the same family. However, in most cases, the MCL algorithm does not separate subfamilies. We evaluate the clusters with several cluster quality metrics and show that these metrics can be used to spot outliers.
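For readers unfamiliar with MCL, the following is a minimal sketch of its core loop, which alternates expansion (matrix squaring) and inflation (element-wise powering followed by column normalization) on a similarity graph. It is an illustration only, not the implementation or the parameter settings used in the thesis: the function name `mcl`, the self-loop weight of 1.0, the inflation value of 2.0, the fixed iteration count, and the toy graph are all assumptions.

```python
def mcl(n, edges, inflation=2.0, iterations=50):
    """Cluster nodes 0..n-1 of a weighted undirected similarity graph.

    edges maps (i, j) node pairs to positive similarity weights.
    All defaults are illustrative assumptions, not values from the thesis.
    """
    # Build a symmetric weight matrix and add self-loops (assumed weight 1.0).
    m = [[0.0] * n for _ in range(n)]
    for (i, j), w in edges.items():
        m[i][j] = m[j][i] = w
    for i in range(n):
        m[i][i] = 1.0

    def normalize(mat):
        # Scale every column so that it sums to 1 (column-stochastic matrix).
        for j in range(n):
            total = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= total
        return mat

    m = normalize(m)
    for _ in range(iterations):
        # Expansion: square the matrix, spreading flow along longer walks.
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize, which
        # strengthens strong flows and weakens weak ones.
        m = normalize([[x ** inflation for x in row] for row in m])

    # Assign each node (column) to the row that holds most of its flow.
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())


# Toy graph: two tightly connected triangles joined by one weak edge.
toy_edges = {
    (0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
    (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
    (2, 3): 0.1,
}
toy_clusters = mcl(6, toy_edges)
```

On this toy graph the weak bridge carries too little flow to survive inflation, so the two triangles end up in separate clusters. A higher inflation value generally yields finer clusters.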
This dissertation then introduces SynAPhy, a novel graph-based approach for clustering proteins that leverages the global context of complete genomes to predict functional similarity. SynAPhy integrates genomic neighborhood information with sequence similarity to better predict clusters of functionally similar proteins. It computes the "syntenic reciprocal best hits" of proteins across genomes and uses this information to produce protein sequence similarity graphs with modified edge weights. The similarity graphs are used as input to the MCL algorithm to determine orthologous clusters across genomes. The results of applying SynAPhy to eight fungal genomes show that SynAPhy successfully generates clusters with more similar members than the MCL algorithm. However, there is no gold standard genome-scale dataset for evaluating the capability of SynAPhy to generate orthologous clusters.
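The abstract does not define how syntenic reciprocal best hits are computed or how edge weights are modified, so the sketch below is one plausible reading, not SynAPhy itself. It assumes two genomes, each given as a single ordered list of genes; it treats a reciprocal best hit as syntenic when a neighboring gene pair within a small window is also a reciprocal best hit; and it multiplies the similarity of syntenic pairs by a bonus factor. The function names, the `window` and `bonus` parameters, and the toy data are hypothetical.

```python
def reciprocal_best_hits(scores):
    """scores maps (a, b) to a similarity, with a in genome A and b in genome B."""
    best_for_a, best_for_b = {}, {}
    for (a, b), s in scores.items():
        if a not in best_for_a or s > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or s > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    # Keep pairs where each protein is the other's best hit.
    return {(a, b) for a, b in best_for_a.items() if best_for_b[b] == a}


def syntenic_rbh(rbh, order_a, order_b, window=1):
    """Keep reciprocal best hits supported by a neighboring reciprocal best hit.

    order_a and order_b list the genes of each genome in chromosomal order
    (assumed here to be a single contig per genome).
    """
    pos_a = {gene: i for i, gene in enumerate(order_a)}
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    supported = set()
    for a, b in rbh:
        for a2, b2 in rbh:
            near_a = 0 < abs(pos_a[a] - pos_a[a2]) <= window
            near_b = 0 < abs(pos_b[b] - pos_b[b2]) <= window
            if near_a and near_b:
                supported.add((a, b))
                break
    return supported


def reweight(scores, supported, bonus=1.5):
    """Boost the edge weight of syntenic pairs (the bonus is an assumed value)."""
    return {pair: s * bonus if pair in supported else s
            for pair, s in scores.items()}


# Toy data: a1/b1 and a2/b2 are adjacent in both genomes; a5/b3 are not.
order_a = ["a1", "a2", "a3", "a4", "a5"]
order_b = ["b1", "b2", "b3", "b4", "b5"]
scores = {("a1", "b1"): 90.0, ("a2", "b2"): 80.0,
          ("a1", "b2"): 40.0, ("a5", "b3"): 70.0}

rbh = reciprocal_best_hits(scores)
supported = syntenic_rbh(rbh, order_a, order_b)
weights = reweight(scores, supported)
```

In this toy example only the two adjacent pairs receive the bonus, while the isolated reciprocal best hit and the non-reciprocal hit keep their original similarity. The reweighted edges could then serve as input to a graph clustering step such as MCL.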
We introduce SynAVal, an evaluation framework that can be applied to an orthology prediction technique. SynAVal first detects paralogs within each input genome, and then uses the synteny knowledge of SynAPhy to detect conserved connections between genomes that are highly likely to be orthologs. It uses these data to identify and report confusions raised by paralogs. The results of applying SynAVal to eight fungal genomes show that SynAVal with synteny resolution can successfully resolve potential confusions raised by 9.1% of all the proteins of the eight fungal genomes, and by 23.33% of the subset of those proteins that are likely to raise confusions.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Kehyayan, Christine Houry |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Computer Science and Software Engineering |
Date: | 30 September 2014 |
Thesis Supervisor(s): | Butler, Greg |
ID Code: | 979109 |
Deposited By: | CHRISTINE HOURY KEHYAYAN |
Deposited On: | 27 Oct 2022 13:47 |
Last Modified: | 27 Oct 2022 13:47 |