Halachev, Mihail (2009) Management of biological sequences using suffix trees. PhD thesis, Concordia University.
Preview |
Text (application/pdf)
2MBNR63445.pdf - Accepted Version |
Abstract
The amount of available biological sequences, represented as strings over the DNA and protein alphabets, grows at phenomenal rate. Supporting various search tasks over such data efficiently requires development of sophisticated indexing techniques. Recently, suffix tree (ST) and suffix array (SA) received considerable attention as suitable data structures in this context. However, existing solutions often focus on either efficiency or scalability, but not both. Further, some of the solutions require advanced computational resources or are tailored towards a specific application. We investigate, both theoretically and experimentally, ways to improve efficiency and scalability in management of biological sequence data. Our goal is to develop an indexing technique that is reasonable in construction time and space utilization, and supports efficiently versatile search applications in biological sequences of various sizes, running on a typical desktop computer. The contributions of this research include development of a ST based indexing technique, called HST, together with exact and approximate search algorithms that use the index. The results of our experiments indicate that the index construction cost is comparable to other ST based techniques, such as TDD and Trellis, in terms of construction time and main memory requirement. While HST exhibits slower construction time than Vmatch, the best known SA based solution, with the same amount of main memory HST can handle sequences that are an order of magnitude longer. In terms of the index size, HST is comparable to TDD and Vmatch, which is half of the Trellis index size. We also develop efficient and scalable search applications using HST, including exact match, k-mismatch, and structured motif search. Our experiments using real-life sequences indicated that for short sequences (e.g., human chromosomes), our exact match search is comparable to Vmatch, about 3 times faster than TDD, and more than 10 times faster than Trellis. Further, HST can be used to search directly in longer DNA sequences, as opposed to partitioning such a sequence and search in the parts - the only option to follow with Vmatch. We found that a direct exact match search using HST is twice faster when searching in the entire human genome, compared to using Vmatch on parts. Compared to Trellis, which can handle direct search in human genome, HST was more than 20 times faster. To further compare performance of HST and Vmatch, we considered k-mismatch search. Our results indicated significant improvement of the HST based solution over Vmatch, ranging from 2 to 9 times faster k-mismatch search on average, for short and long sequences, respectively. For structured motif search, HST was about 6 times faster than SMOTIF1, the best known structured motif search tool.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (PhD) |
Authors: | Halachev, Mihail |
Pagination: | xiii, 131 leaves ; 29 cm. |
Institution: | Concordia University |
Degree Name: | Ph. D. |
Program: | Computer Science and Software Engineering |
Date: | 2009 |
Thesis Supervisor(s): | Shiri, N |
Identification Number: | LE 3 C66C67P 2009 H35 |
ID Code: | 976387 |
Deposited By: | lib-batchimporter |
Deposited On: | 22 Jan 2013 16:24 |
Last Modified: | 13 Jul 2020 20:10 |
Related URLs: |
Repository Staff Only: item control page