Enhanced Suffix Trees for Very Large DNA Sequences

Title:

Enhanced Suffix Trees for Very Large DNA Sequences

Fan, Si Ai (2011) Enhanced Suffix Trees for Very Large DNA Sequences. Masters thesis, Concordia University.

Preview

Text (application/pdf)
Fan_MCompSc_F2011.pdf - Accepted Version

3MB

Abstract

Recent advances in bio-technology have provided rapid accumulation of biological DNA sequence data. New techniques are required for fast, scalable, and versatile processing of such data.
Suffix tree (ST) is a data structure used for indexing genome data. This, however, comes with a price: it occupies a space that is about 10 times more than the input size. Existing disk-based ST index techniques either suffer from data skew problem, like TDD and HST, or are not space efficient for very large sequences, like TRELLIS and B2ST. We propose a new disk-based ST index, called Compact Binary Suffix Tree (CBST), together with a construction algorithm, which can support DNA sequences of size up to 256 terabyte. The results of our numerous experiments indicated that, compared to existing ST and suffix array techniques, CBST is superior in speed, space requirement, and scalability. It is the fastest among the disk-based techniques for very large sequences.

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (Masters)
Authors:	Fan, Si Ai
Institution:	Concordia University
Degree Name:	M. Comp. Sc.
Program:	Computer Science
Date:	August 2011
Thesis Supervisor(s):	Shiri, Nematollaah
Keywords:	Suffix Tree, DNA, Sequence, Suffix Array, External, Binary Suffix Tree
ID Code:	35799
Deposited By:	SI AI FAN
Deposited On:	21 Nov 2011 16:50
Last Modified:	18 Jan 2018 17:35

References:

[1000 Genomes Project, 2011] 1000 Genomes Project, http://en.wikipe- dia.org/wiki/The_1000_Genomes_Project, last accessed, Jan., 2011
[Abouelhoda et al., 2004] Abouelhoda, M.I., Kurtz, S., and Ohlebusch, E. Replacing Suffix Trees with Enhanced Suffix Arrays. In Journal of Discrete Algorithms, Vol. 2(1), pp53-86, 2004
[Apostolico and Galil, 1985] Apostolico, A. and Galil, Z., The Myriad Virtues of Subword Trees. In: Combinatorial Algorithms on Words, Vol. 12 of NATO Advance Science Institute Series. Series F: Computer and Systems Sciences. Springer Verlag, Berlin, pp85-95, 1985
[Barsky et al., 2008] Barsky, M., Stege, U., Thomo, A., A New Method for Indexing Genomes Using On-Disk Suffix Trees. Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM, pp649–658, 2008.
[Barsky et al., 2009] Barsky, M., Stege, U., Thomo, A., and Upton, C., Suffix Trees for Very Large Genomic Sequences. CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009
[Barsky, 2011] Barsky, M., Research at UVic, http://webhome.cs.uvic.ca/ ~mgbarsky/publications.html, last accessed, Jan. 2011
[Bedathur and Haritsa, 2004] Bedathur, S.J., Haritsa, J.R., Engineering a fast online persistent suffix tree construction. Proceedings of the 20th International Conference on Data Engineering pp. 720-731, 2004
[Bentley and Mcilroy, 1993] Bentley, J. L. and Mcilroy, M. D., Engineering a sort function. Software-Practice and Experience 23, 11, p1249-1265, 1993
[Burrows and Wheeler, 1994] Burrows, M. and Wheeler, D. J., A block sorting lossless data compression algorithms. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, CA, 1994
[Cameron, 2006] Cameron, M. Efficient Homology Search for Genomic Sequence Databases. PhD Thesis, RMIT University, Melbourne, Victoria, Australia, 2006
[Cheung et al., 2005] Cheung, C., Yu, J., and Lu, H. Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering, 17 (1), pp90-105, 2005
[Dementiev et al., 2005] Dementiev, R., Karkkainen, J., Mehnert, J., and Sanders,P. Better external memory suffix array construction. Proc. Of Algorithm Engineering and Experiments, ALENEX’05, pp86-97, 2005
[EST, 2011] Expressed sequence tag, http://en.wikipedia.org/wiki/ Expressed_sequence_tag, last accessed, Jan., 2011
[Farach and Muthukrishnan, 1996] Farach, M., and Muthukrishnan, S., Optimal Logarithmic Time Randomized Suffix Tree Construction. Proceedings of the 23rd international Colloquium on Automata, Languages and Programming, LNCS, 1099, pp550-561, 1996.
[Farach et al., 2000] Farach, M., Ferragina, P., and Muthukrishnan, S., On the sorting complexity of suffix tree construction. Journal of the ACM, 47 (6), pp987-1011, 2000
[Garcia-Molina et al., 1999] Garcia-Molina, H., Ullman, J. D., Widon J. D., Database System Implementation. Prentice-Hall inc., 1999
[GenBank, 2011] GenBank, http://en.wikipedia.org/wiki/GenBank, last accessed, March, 2011
[Giegerich and Kurtz, 1997] Giegerich, R. and Kurtz, S., From Ukkonen to McCreight and Weiner: A Unifying View of Linear-time Suffix Tree Construtcion. Algorithmica, 19(3), pp331-353, 1997
[Giegerich, et al., 2003] Giegerich, R., Kurtz, S., and Stoye, J., Efficient implementation of lazy suffix trees. Software Practice & Experience, 33(11), pp1035–1049, 2003.
[Gusfield, 1997] Gusfield, D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, 1997
[Gusfield, 2004] Gusfield, D., Introduction to the IEEE/ACM Transactions on Computational Biology and Bioinformatics, IEEE Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, Jan.-Mar., 2004
[Halachev et al., 2005] Halachev, M., Shiri, N., A. Thamildurai, Exact Match Search in Sequence Data Using Suffix Trees. The ACM Conference on Information and Knowledge Management, CIKM’05, Oct. 31-Nov.5, 2005
[Halachev et al., 2007] Halachev, M., Shiri, N., A. Thamildurai, Efficient and Scalable Indexing Techniques for Biological Sequence Data, Bioinformatics Research and Development, BIRD'07, pp464-479, March 2007
[Halachev, 2009] Halachev, M., Management of Biological Sequences Using Suffix Trees. A Thesis In the Department of Engineering and Computer Science (ENCS), Concordia, May 2009
[HGP, 2011] Human Genome Project, http://en.wikipedia.org/ wiki/Human_Genome_Projec, last accessed, Jan., 2011
[Hunt et al., 2002] Hunt, E., Atkinson, M.P., and Irving, R.W., Database indexing for large DNA and protein sequence collections. VLDB Journal, 11, pp256-271, 2002
[Irving and Love, 2003] Irving, R.W. and Love, L., The Suffix Binary Search Tree a Suffix AVL Tree, In Journal of Discrete Algorithms, 1 (2003) pp387–408, 2003.
[Larsson and Sdakane, 1999] Larsson N. J., Sadakane K. Faster Suffix Sorting, Tech. Rep. LUCS-TR: 99-214, Computer Science Department, Lund University, Sweden, 1999
[Dayhoff, 1965] Dayhoff, M., O., Atlas of Protein Sequence and Structure, National Biomedical Research Foundation, 1965
[McCreight, 1976] McCreight, E.M., A Space-economical Suffix Tree Construction Algorithm, Journal of ACM, 23 (2), pp262-272, 1976
[Michael and Simon, 2008] Michael A. M., Simon J. P., An Efficient, Versatile Approach to Suffix Sorting, Journal of Experimental Algorithmics (JEA), Vol 12, June 2008
[Msufsort, 2011] The Msufsort Algorithm, http://www.michael-maniscalco.com /msufsort.htm, Jan. Last accessed, 2011
[NCBI, 2011] NCBI Genomic Biology, Human Genome Resources, http://www.ncbi.nlm.nih.gov/genome/gui-de/human/index.shtml, last accessed, 2011
[Phoophakdee and Zaki, 2007] Phoophakdee B., and Zaki, M. J., Genome-scale Disk-based Suffix Tree Indexing, ACM International Conference on Management of Data, 2007.
[Phoophakdee and Zaki, 2008] Phoophakdee B., and Zaki M. J., TRELLIS+: An Effective Approach for Indexing Massive Sequence, Pacific Symposium on Biocomputing, 2008.
[SACA_Benchmarks, 2011] The benchmark results of implementations of various, latest suffix array construction algorithms, http://code.google.com/p/libdivsufsort/ wiki/SACA_Benchmarks, last accessed, 2011
[Seward, 2011] Seward, J., The bzip2 and libbzip2 homepage, http://sources.red- hat.com/bzip2/, Mar., 2011
[Simon, 2005] Simon J. Puglisi, Exposition and Analysis of a Suffix Sorting Algorithm, Technical Report Number CAS-05-02-WS, Dept of Computing and Software, McMaster University, May, 2005
[Sinha et al., 2008] Sinha, R., Puglisi, S., Moffat, A., and Turpin, A., Improving Suffix Array Locality for Fast Pattern Matching on Disk. Proc. 28th ACM SIGMOD Intl. Conf., pp. 661-671, 2008
[Thamildurai, 2007] Thamildurai, A., Efficient and Scalable Indexing Techniques for Sequence Data Management. A Thesis In the Department of Engineering and Computer Science (ENCS), Concordia, April 2007
[Tian et al., 2005] Tian, Y., Tata, H., Hankins, R., Patel, J., Practical methods for constructing suffix trees. The VLDB Journal, 14(3), pp281–299, 2005.
[Ukkonen, 1995] Ukkonen, E., On-line Construction of Suffix Trees. Algorithmica, 14 (3), 1995
[USCS, 2011] USCS Genome Browser, hgdownload.cse.ucsc.edu /downloads.html, last accessed 2011
[Vmatch, 2011] The Vmatch large scale sequence analysis software, http://www.vmatch.de/, last accessed, Jan. 2011
[Weiner, 1973] Weiner, P., Linear pattern matching algorithms. Proc. 14th Annual Symposium on Switching and Automata Theory, 1973
[Wikipedia, 2011] Suffix tree, http://en.wikipedia.org/wiki/Suffix_tree, last accessed July, 2011

Repository Staff Only: item control page

Download Statistics

Downloads per month over past year

Research related to the current document (at the CORE website)

Spectrum Research Repository

Enhanced Suffix Trees for Very Large DNA Sequences

Enhanced Suffix Trees for Very Large DNA Sequences

Abstract

References: