DM823 String Algorithms

Fall 2016
Rolf Fagerberg


Official Course Description

See the course description at the web pages of the faculty.

Time and Place

The course runs over the entire semester, but is only 5 ECTS. Most weeks we will meet only once, a few weeks we will meet twice (this includes the first week). We have been given three slots: Mon 10-12, Tue 15-17 and Wed 14-16, but we will mainly use the Wed 14-16 slot. Actual times will appear below at least one week ahead. The room will often be the IMADA seminar room, but not always.

Teaching Material

The course material will be handouts (no textbook).

Examination

The exam is oral, with grades on the 7-point marking scale. A programming project must be passed in order to attend the exam.

The exam date is January 26.

At the oral exam, you will draw an exam topic delineating a part of curriculum which you are to present in the first half of the examination. More details of the exam form are described at the bottom of the list of exam topics.

Lectures

Date Time Room Contents Reading
Monday, September 5 10-12 Imada Computer Lab (section at end, past folding wall, which used to be U49D) Introduction to course. String terminology. Borders. Naive algorithm for exact pattern matching, and some heuristics for it. Course introduction slides. The teacher's notes on basic terminology, simple solution to exact pattern matching problem.
Wednesday, September 7 14-16 Imada seminar room The Knuth-Morris-Pratt (KMP) linear time algorithm for exact pattern matching. The teacher's notes on the KMP algorithm. The teacher's notes on finding the border table used in the KMP algorithm.
Monday, September 19 10-12 U61 Periods. Primitivity, cyclic shifts. The Boyer-Moore (BM) algorithm. Proof of linearity (for finding first match) of BM algorithm. The teacher's notes on the BM algorithm. The teacher's notes on a property of primitive strings. The proof of linearity (for finding first match) is removed from curriculum (as I have not found the time to write notes for it).
Wednesday, September 21 14-16 Imada seminar room The Galil variant of the BM algoritm with linear time for finding all occurrences. First part of finding the shift table for the BM algorithm: finding the prefix tabel. The teacher's notes on the Galil variant of the BM algorithm. The teacher's notes on the prefix table.
Wednesday, September 28 14-16 Imada seminar room End of finding the shift table for the BM algorithm. Searching in a sorted set of strings. The teacher's notes on finding the BM shift table. Start of the teacher's notes on searching a sorted set of strings.
Wednesday, October 5 14-16 Imada seminar room Handout of the programming project. Associated file with data sets. More on searching in a sorted set of strings. End of the teacher's notes on searching a sorted set of strings.
Wednesday, October 12 14-16 Imada seminar room Suffix trees and suffix arrays. O(n log n) time construction of suffix arrays by the doubling algorithm. The teacher's notes on suffix arrays and suffix trees. The doubling algorithm is described in Scalable Construction of Text Indexes, T. Bingmann, S. Gog, F. Kurpicz, arXiv:1610.03007, pages 4-6.
Monday, October 24 10-12 U12 Linear time construction of suffix arrays. Linear work suffix array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of the ACM, vol 53 (6), 918 -936, 2005 (pdf-link available from an SDU IP-address), sections 1.0, 1.1, 2, and 3.
Wednesday, November 9 14-16 Imada seminar room Creation of the lcp-array from the suffix array in linear time. Section 3 (two pages) in Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in Proceedings of Combinatorial Pattern Matching 2001, Lecture Notes in Computer Science, vol 2089, 181-192, Springer 2001 (pdf-link available from an SDU IP-address. Note a few errors in the paper: In the statements of Fact 2, Fact 3, Lemma 3, and Theorem 1, ">1" should be "=>1", and in the code in Fig. 3, the last IF-statement should be executed at each iteration of the FOR-loop (so it should be moved out of the first IF-statement)).
Monday, November 14 10-12 U31 Suffix + LCP arrays vs. suffix trees. Uses of suffix trees. Pages 566-576 of the book Data Structures and Algorithms in Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN 0471193089), given as handout. The teacher's notes on suffix arrays and suffix trees.
Monday, November 28 10-12 U146 String compression: LZ78, LZ77, the Burrows-Wheeler transform, the move-to-front heuristic. The description of the Burrows-Wheeler transform and the move-to-front heuristic on this project description from a Princeton course. You may also look at Sections 1-3 in the original paper by Burrows and Wheeler. Note that the encoding essentially can be found by constructing the suffix array on the string (with '$' appended). The information on the LZ78 method in Section 14.1.6 (pages 576-578) in the book Data Structures and Algorithms in Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN 0471193089), given as handout. The LZ77 method differs from LZ78 in that the substrings pointed to by the generated codewords may be any substring appearing in the part of the string scanned so far. To find such maximal substrings equal to a prefix of the remaining part of the string, one can use a suffix tree on the scanned part of the string. This needs to be updated/extended after scanning the next part of the string, which can be achieved by a linear time construction algorithm for suffix trees by Ukkonen (not covered in course).
Wednesday, November 30 14-16 Imada seminar room Sorting and searching strings. The teacher's notes on the RadixQuicksort algorithm. You may also look at the original paper: Fast Algorithms for Sorting and Searching Strings, J. Bentley and R. Sedgewick, Proceedings of the 8th annual ACM-SIAM symposium on Discrete algorithms, 360-369, 1997.
Monday December 5 10-12 U10 Dynamic programming algorithms: Hirschbergs method for longest common subsequence in linear space. The teacher's notes on Hirschberg's method.
Wednesday December 14 14-16 Imada seminar room More on dynamic programming algorithms: Hunt-Szymanski algorithm for longest common subsequence, start on edit distance. The teacher's notes on the Hunt-Szymanski algorithm, and first part of the teacher's notes on Edit Distance.
Tuesday December 20 15-17 Imada seminar room More on edit distance (global alignment). Alignments with gaps. Discussion of oral exam. Rest of the teacher's notes on Edit Distance. Exam topics


Maintained by Rolf Fagerberg (rolf@imada.sdu.dk)