Using Synteny in Phylogenomics Algorithms to Cluster Protein Sequences

Kehyayan, Christine Houry (2014) Using Synteny in Phylogenomics Algorithms to Cluster Protein Sequences. PhD thesis, Concordia University.

Text (application/pdf)
Kehyayan_PhD_S2014.pdf - Accepted Version
Available under License Spectrum Terms of Access.

4MB

Abstract

With the rapid development of genome sequencing technologies, complete genomes are becoming more widely available, and the need for computational methods for protein functional annotation is becoming more pressing. A long-standing problem in protein functional annotation is distinguishing orthologs from paralogs. Several academic efforts have recently emerged to cluster proteins automatically, based on the premise that proteins in the same cluster are likely to have similar functions or to be orthologs. The effectiveness of these protein clustering algorithms is fundamental to building accurate functional annotation pipelines.

This dissertation first presents a study of how effectively the similarity graph-based Markov CLuster algorithm (MCL) detects protein families and subfamilies when it is used to cluster experimentally characterized enzymes from fungal genomes in the mycoCLAP database. Our study shows that the MCL algorithm clusters proteins such that proteins in the same cluster always belong to the same family. However, in most cases, the MCL algorithm does not separate subfamilies. We evaluate the clusters with several cluster quality metrics and show that these metrics can be used to spot outliers.
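For readers unfamiliar with MCL, the following is a minimal sketch of the general algorithm (alternating expansion and inflation on a column-stochastic matrix), not the thesis's implementation or the reference MCL program. The function names, the self-loop weight of 1.0, the inflation value of 2.0, and the toy graph are all assumptions chosen for illustration.

```python
def normalize_columns(matrix):
    """Rescale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            out[i][j] = matrix[i][j] / total if total else 0.0
    return out


def mcl(weights, inflation=2.0, max_iterations=50, tol=1e-9):
    """Cluster a weighted similarity graph given as a symmetric matrix.

    Illustrative only: dense pure-Python matrices, so suitable for
    tiny graphs. Parameter defaults are assumptions, not thesis values.
    """
    n = len(weights)
    m = [[float(x) for x in row] for row in weights]
    for i in range(n):
        m[i][i] = max(m[i][i], 1.0)  # self-loops (assumed weight 1.0)
    m = normalize_columns(m)
    for _ in range(max_iterations):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: elementwise power, then renormalize the columns.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row that keeps non-negligible mass is an attractor; the
    # columns it covers form one cluster. Duplicate rows are merged.
    clusters, seen = [], set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members and members not in seen:
            seen.add(members)
            clusters.append(sorted(members))
    return clusters


# Toy graph: two tightly connected triples joined by one weak edge.
toy_weights = [
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0,   1, 0, 1],
    [0, 0, 0,   1, 1, 0],
]
toy_clusters = mcl(toy_weights)
```

In this sketch, a higher inflation value tends to produce smaller, more numerous clusters, which is the usual knob for cluster granularity in MCL.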

This dissertation then introduces SynAPhy, a novel graph-based approach for clustering proteins that leverages the global context of complete genomes to predict functional similarity. SynAPhy integrates genomic neighborhood information into sequence similarity for better prediction of functionally similar protein clusters. It computes the "syntenic reciprocal best hits" of proteins across genomes and uses this information to produce protein sequence similarity graphs with modified edge weights. The similarity graphs are used as input to the MCL algorithm to determine orthologous clusters across genomes. The results of applying SynAPhy to eight fungal genomes show that SynAPhy generates clusters with more similar members than the MCL algorithm alone. However, there is no gold-standard, genome-scale dataset with which to evaluate the capability of SynAPhy to generate orthologous clusters.
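The abstract does not define how syntenic reciprocal best hits are computed or how edge weights are modified, so the sketch below is only one plausible reading, not SynAPhy itself. It assumes each genome is a single linear gene order, treats a reciprocal best hit as syntenic when another reciprocal best hit lies within a small window of it in both genomes, and multiplies the weight of syntenic pairs by a fixed factor. The function names, `window`, `boost`, and the toy data are all hypothetical.

```python
def reciprocal_best_hits(scores):
    """scores maps (a, b) -> similarity between protein a (genome A)
    and protein b (genome B). Returns pairs that are each other's
    best-scoring hit. Ties keep the first pair encountered."""
    best_for_a, best_for_b = {}, {}
    for (a, b), s in scores.items():
        if a not in best_for_a or s > scores[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or s > scores[(best_for_b[b], b)]:
            best_for_b[b] = a
    return {(a, b) for a, b in best_for_a.items() if best_for_b[b] == a}


def syntenic_pairs(rbh, order_a, order_b, window=1):
    """Keep reciprocal best hits supported by a neighboring reciprocal
    best hit within `window` genes in both genomes (assumed criterion)."""
    pos_a = {p: i for i, p in enumerate(order_a)}
    pos_b = {p: i for i, p in enumerate(order_b)}
    supported = set()
    for a, b in rbh:
        for a2, b2 in rbh:
            if (a2, b2) == (a, b):
                continue
            if (abs(pos_a[a2] - pos_a[a]) <= window
                    and abs(pos_b[b2] - pos_b[b]) <= window):
                supported.add((a, b))
                break
    return supported


def reweight(scores, syntenic, boost=2.0):
    """Multiply the weight of syntenic pairs by `boost` (hypothetical)."""
    return {pair: s * boost if pair in syntenic else s
            for pair, s in scores.items()}


# Toy data: gene order along each genome and cross-genome scores.
order_a = ["a1", "a2", "a3", "a4"]
order_b = ["b1", "b2", "b9", "b8", "b4"]
scores = {("a1", "b1"): 90, ("a2", "b2"): 85,
          ("a1", "b2"): 40, ("a4", "b4"): 70}

rbh = reciprocal_best_hits(scores)
syntenic = syntenic_pairs(rbh, order_a, order_b)
new_weights = reweight(scores, syntenic)
```

Here the pairs (a1, b1) and (a2, b2) support each other because they are adjacent in both genomes, while (a4, b4) is a reciprocal best hit with no conserved neighbor and keeps its original weight. The reweighted scores could then serve as the edge weights of a similarity graph passed to a clustering step such as MCL.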

We introduce SynAVal, an evaluation framework that can be applied to an orthology prediction technique. SynAVal first detects paralogs within each input genome, and then uses the synteny knowledge of SynAPhy to detect conserved connections between genomes that are highly likely to be orthologs. It uses these data to identify and report confusions raised by paralogs. The results of applying SynAVal to eight fungal genomes show that SynAVal with synteny resolution can resolve potential confusions raised by 9.1% of all the proteins of the eight fungal genomes, and by 23.33% of the subset of those proteins that are likely to raise confusions.
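The two percentages describe the same resolved proteins against two different denominators, so together they imply how large the confusion-prone subset is. The short calculation below makes that explicit. It assumes both figures refer to exactly the same set of resolved proteins, and the variable names are illustrative.

```python
# Figures quoted in the abstract.
resolved_share_of_all = 0.091          # 9.1% of all proteins
resolved_share_of_confusable = 0.2333  # 23.33% of the confusion-prone subset

# If both shares count the same resolved proteins R, then
# R / N = 0.091 and R / C = 0.2333, so C / N = 0.091 / 0.2333.
confusable_share_of_all = resolved_share_of_all / resolved_share_of_confusable
print(f"{confusable_share_of_all:.1%}")
```

Under that assumption, roughly 39% of all proteins in the eight genomes fall into the subset likely to raise confusions.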

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (PhD)
Authors:	Kehyayan, Christine Houry
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Computer Science and Software Engineering
Date:	30 September 2014
Thesis Supervisor(s):	Butler, Greg
ID Code:	979109
Deposited By:	CHRISTINE HOURY KEHYAYAN
Deposited On:	27 Oct 2022 13:47
Last Modified:	27 Oct 2022 13:47
