DM823 String Algorithms

Fall 2013, 2nd quarter
Rolf Fagerberg


Official Course Description

See the course description at the web pages of the faculty.

Time and Place

The course starts Monday, November 4.

Textbook

As textbook, we will use the following:

Algorithms on Strings
By M. Crochemore, C. Hancart, and Thierry Lecroq
Published by Cambridge University Press, 2007
ISBN 978-0-521-84899-2

Examination

The exam is oral, with grades on the 7-point marking scale. The exam date is January 23 (was January 21, but is now moved to make rational use of external examinator time), in rooms O95 and O96. A programming project must be passed in order to attend the exam. There is a separate document describing the experiments to be performed on the data sets provided.

There will be a spørgetime (session for asking questions on the exam and the curriculum) Wednesday, January 22, at 10:15 in U49.

At the oral exam, you will draw an exam topic delineating a part of curriculum which you are to present in the first half of the examination. More details of the exam form are described at the bottom of the list of exam topics.

The grades at the exam ended up with the following distribution.

Lectures

Date Time Room Contents Reading
Monday, November 4 12-14 Imada seminar room Introduction to course. String terminology. Borders. Naive algorithm for exact pattern matching, and some heuristics for it. The Knuth-Morris-Pratt (KMP) linear time algorithm for exact pattern matching. Pages 2-4, 12 (top half), 28-32 (top half), 40 in textbook. The teachers notes on the KMP algorithm.
Thursday, November 7 12-14 Imada seminar room Complexity of the KMP algorithm. Computing the border table used in the KMP algorithm. Pages 40 to 42 in textbook.
Monday, November 11 12-14 Imada seminar room Periods, primitivity. The Boyer-Moore (BM) algorithm. Proof of linearity (for finding first match) of BM algorithm. Note: the Boyer-Moore algorithm is called W-Memoryless-Suffix-Search in the textbook (page 107). Pages 10 (bottom), 11 (top), 16 (middle) in textbook. Pages 102-113 in textbook.
Thursday, November 14 12-14 Imada seminar room End of proof of linearity (for finding first match) of BM algorithm. Proposition on primitive words and cyclic shifts. Same as last time. The teachers notes on a property of primitive strings.
Monday, November 18 12-14 Imada seminar room The Galil variant of the BM algoritm having linear time for finding all occurrences. Note: the algorithm is called WL-Memoryless-Suffix-Search in the textbook (page 112). Start of preprocessing for the BM algorithm: finding the prefix table in linear time. Page 112 in textbook. The teachers notes on the Galil variant of the BM algorithm. Pages 42 (middle) to 45 (bottom) in textbook.
Thursday, November 21 12-14 Imada seminar room Handout of the programming project. End of preprocessing for the BM algorithm. Section 3.3 (pages 113-118) in textbook.
Monday, November 25 12-14 Imada seminar room Searching in a sorted set of strings. Sections 4.1-3 (pages 146-158) in textbook.
Thursday November 28 12-14 Imada seminar room Further details or searching in a sorted set of strings. Start on suffix arrays. Sections 4.3 (pages 155-158) and start of 4.4 (pages 158-159) in textbook.
Monday December 2 12-14 Imada seminar room Suffix arrays and their construction. Handout of separate document describing the experiments to be performed on the data sets provided. Sections 4.4-5 (pages 158-169) in textbook. As an alternative (or compliment) to Section 4.5, you can use the very readable original paper: Linear work suffix array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of the ACM, vol 53 (6), 918 -936, 2005 (pdf-link available from an SDU IP-address) - it is enough to read Sections 1.0, 1.1, 2, and 3 (less than four pages).
Thursday December 5 12-14 Imada seminar room End of suffix array construction in linear time. Creation of the lcp-array from the suffix array in linear time. Section 4.5 (pages 164-169) in textbook (or, as last time, the original paper). Section 4.6 (pages 169-174) in textbook. As an alternative/complement, you may read Section 3 (two pages) in the original paper Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in Proceedings of Combinatorial Pattern Matching 2001, Lecture Notes in Computer Science, vol 2089, 181-192, Springer 2001 (pdf-link available from an SDU IP-address, note a few errors in the paper: In the statements of Fact 2, Fact 3, Lemma 3, and Theorem 1, ">1" should be "=>1", and in the code in Fig. 3, the last IF-statement should be executed at each iteration of the FOR-loop (so it should be moved out of the first IF-statement)).
Monday December 9 12-14 Imada seminar room End of creation of the lcp-array from the suffix array in linear time. Tries and suffix trees. Section 4.6 (pages 169-174) in textbook (or, as last time, Section 3 in the original paper). Pages 566-576 of the book Data Structures and Algorithms in Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN 0471193089), given as handout. Pages 177-186 and 219-223 in the textbook (Answering the four queries on page 222 via a suffix tree is easy by appropriate annotation (preprocessing via a DFS traversal of the tree) of the nodes of the tree, e.g. by the number of leaves in the subtree of a node, or by the minimal index in a leaf in the subtree of the node. This remark replaces pages 224-227, which discusses how to do it in various "automaton" versions of the suffix tree). Section 6.4 (pages 230-231) in the textbook.
Thursday December 12 12-14 Imada seminar room More uses of suffix trees. String compression: LZ78, LZ77, the Burrows-Wheeler transform, the move-to-front heuristic. Section 6.4 (pages 230-231) in the textbook. The description of the Burrows-Wheeler transform and the move-to-front heuristic on this project description from a Princeton course. You may also look at Sections 1-3 in the original paper by Burrows and Wheeler. Note that the encoding essentially can be found by constructing the suffix array on the string (with '$' appended). The information on the LZ78 method in Section 14.1.6 (pages 576-578) in the book Data Structures and Algorithms in Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN 0471193089), previously given as handout. The LZ77 method was only given on the blackboard. It differs from LZ78 in that the substrings pointed to by the generated codewords may be any substring appearing in the part of the string scanned so far. To find such maximal substrings equal to a prefix of the remaining part of the string, one can use a suffix tree on the scanned part of the string. This needs to be updated/extended after scanning the next part of the string, which can be achieved by a linear time construction algorithm for suffix trees by Ukkonen (not covered in course).
Monday December 16 12-14 Imada seminar room Sorting and searching strings. Start on dynamic programming algorithms (recap of longest common subsequence). Fast Algorithms for Sorting and Searching Strings, J. Bentley and R. Sedgewick, Proceedings of the 8th annual ACM-SIAM symposium on Discrete algorithms, 360-369, 1997. Section 7.3 (pages 262-267) in the textbook.
Thursday December 19 12-14 Imada seminar room Handout of exam topics. More dynamic programming algorithms: Hirschbergs method for longest common subsequence in linear space, edit distance (alignment), alignment with gaps. Rest of sections 7.1-4 (pages 243-276) in the textbook.


Maintained by Rolf Fagerberg (rolf@imada.sdu.dk)