Thamildurai, Anand (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University.
Preview |
Text (application/pdf)
3MBMR34725.pdf - Accepted Version |
Abstract
Sequence data is one of the rapidly growing types of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support efficient index construction for long sequences or support fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff between scalable index construction, index size, and support of fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that while our proposed techniques have a slightly slower construction time for small sequences and larger index size compared to TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3GB), our techniques are superior to TDD due to the use of dynamic buffering and better index representation. Compared to the most search efficient in-memory indexing technique, ESA, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences.
Divisions: | Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering |
---|---|
Item Type: | Thesis (Masters) |
Authors: | Thamildurai, Anand |
Pagination: | x, 98 leaves : ill. ; 29 cm. |
Institution: | Concordia University |
Degree Name: | M. Comp. Sc. |
Program: | Computer Science and Software Engineering |
Date: | 2007 |
Thesis Supervisor(s): | Shiri, Nematollaah |
Identification Number: | LE 3 C66C67M 2007 T53 |
ID Code: | 975347 |
Deposited By: | Concordia University Library |
Deposited On: | 22 Jan 2013 16:06 |
Last Modified: | 13 Jul 2020 20:07 |
Related URLs: |
Repository Staff Only: item control page