DM823 String Algorithms
Fall 2013, 2nd quarter Rolf Fagerberg
Official Course Description
See the
course
description at the web pages of the faculty.
Time and Place
The course starts Monday, November 4.
Textbook
As textbook, we will use the following:
Algorithms
on Strings
By M. Crochemore, C. Hancart, and Thierry Lecroq
Published by Cambridge University Press, 2007
ISBN 9780521848992
Examination
The exam is oral, with grades on the 7point marking scale. The exam
date is January 23 (was January 21, but is now moved to make rational
use of external examinator time), in rooms
O95 and
O96. A
programming project must be passed in order to
attend the exam. There is a separate document
describing the experiments to be performed on the
data sets provided.
There will be a spørgetime (session for asking questions on the exam and
the curriculum) Wednesday, January 22, at 10:15 in
U49.
At the oral exam, you will draw an exam topic
delineating a part of curriculum which you are to present in the first half
of the examination. More details of the exam form are described at the
bottom of the list of exam topics.
The grades at the exam ended up with the following
distribution.
Lectures
Date 
Time 
Room 
Contents 
Reading 
Monday, November 4 
1214 
Imada seminar room 
Introduction to course. String
terminology. Borders. Naive algorithm for exact pattern
matching, and some heuristics for it. The KnuthMorrisPratt (KMP)
linear time algorithm for exact pattern matching. 
Pages 24, 12 (top half), 2832 (top half), 40 in textbook. The teachers
notes on the KMP algorithm.

Thursday, November 7 
1214 
Imada seminar room 
Complexity of the KMP algorithm. Computing the border table used in
the KMP algorithm.

Pages 40 to 42 in textbook.

Monday, November 11 
1214 
Imada seminar room 
Periods, primitivity. The BoyerMoore (BM) algorithm. Proof of linearity
(for finding first match) of BM algorithm. Note: the BoyerMoore
algorithm is called WMemorylessSuffixSearch in the textbook
(page 107).

Pages 10 (bottom), 11 (top), 16 (middle) in textbook. Pages 102113
in textbook.

Thursday, November 14 
1214 
Imada seminar room 
End of proof of linearity (for finding first match) of BM algorithm.
Proposition on primitive words and cyclic shifts.

Same as last time. The teachers notes on a
property of primitive strings.

Monday, November 18 
1214 
Imada seminar room 
The Galil variant of the BM algoritm having linear time for finding all
occurrences. Note: the algorithm is called WLMemorylessSuffixSearch
in the textbook (page 112). Start of preprocessing for the BM algorithm:
finding the prefix table in linear time.

Page 112 in textbook. The teachers notes on the
Galil variant of the BM algorithm. Pages 42 (middle) to 45 (bottom) in
textbook.

Thursday, November 21 
1214 
Imada seminar room 
Handout of the programming project. End of
preprocessing for the BM algorithm.

Section 3.3 (pages 113118) in textbook.

Monday, November 25 
1214 
Imada seminar room 
Searching in a sorted set of strings.

Sections 4.13 (pages 146158) in textbook.

Thursday November 28 
1214 
Imada seminar room 
Further details or searching in a sorted set of strings. Start on
suffix arrays.

Sections 4.3 (pages 155158) and start of 4.4 (pages 158159) in textbook.

Monday December 2 
1214 
Imada seminar room 
Suffix arrays and their construction. Handout of
separate document describing the experiments to
be performed on the data sets provided.

Sections 4.45 (pages 158169) in textbook. As an alternative (or
compliment) to Section 4.5, you can use the very readable original paper:
Linear work suffix
array construction, J. Kärkkäinen, P. Sanders, S. Burkhardt G, Journal of
the ACM, vol 53 (6), 918 936, 2005 (pdflink available from an SDU
IPaddress)  it is enough to read Sections 1.0, 1.1, 2, and 3 (less
than four pages).

Thursday December 5 
1214 
Imada seminar room 
End of suffix array construction in linear time. Creation of the
lcparray from the suffix array in linear time.

Section 4.5 (pages 164169) in textbook (or, as last time, the
original paper).
Section 4.6 (pages 169174) in textbook. As an alternative/complement,
you may read Section 3 (two pages) in the original paper
LinearTime
LongestCommonPrefix Computation in Suffix Arrays and Its Applications,
T. Kasai, G. Lee, H. Arimura, S. Arikawa and K. Park, in Proceedings of
Combinatorial Pattern Matching 2001, Lecture Notes in Computer Science,
vol 2089, 181192, Springer 2001 (pdflink available from an SDU
IPaddress,
note a few errors in the paper: In the statements of Fact 2, Fact 3, Lemma 3,
and Theorem 1, ">1" should be "=>1", and in the code in Fig. 3, the last
IFstatement should be executed at each iteration of the FORloop (so it
should be moved out of the first IFstatement)).

Monday December 9 
1214 
Imada seminar room 
End of creation of the lcparray from the suffix array in linear
time. Tries and suffix trees.

Section 4.6 (pages 169174) in textbook (or, as last time, Section 3 in
the
original
paper). Pages 566576 of the book Data Structures and Algorithms in
Java (1st ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons,
1998, ISBN 0471193089), given as handout. Pages 177186 and 219223 in
the textbook (Answering the four queries on page 222 via a suffix tree
is easy by appropriate annotation (preprocessing via a DFS traversal of
the tree) of the nodes of the tree, e.g. by the number of leaves in the
subtree of a node, or by the minimal index in a leaf in the subtree of
the node. This remark replaces pages 224227, which discusses how to do
it in various "automaton" versions of the suffix tree). Section 6.4
(pages 230231) in the textbook.

Thursday December 12 
1214 
Imada seminar room 
More uses of suffix trees. String compression: LZ78, LZ77, the
BurrowsWheeler transform, the movetofront heuristic.

Section 6.4 (pages 230231) in the textbook. The description of the
BurrowsWheeler transform and the movetofront heuristic on this
project
description from a Princeton course. You may also look at Sections 13
in the
original
paper by Burrows and Wheeler. Note that the encoding essentially can be
found by constructing the suffix array on the string (with '$'
appended). The information on the LZ78 method in Section 14.1.6 (pages
576578) in the book Data Structures and Algorithms in Java (1st
ed.) by M.T. Goodrich and R. Tamassia (John Wiley & Sons, 1998, ISBN
0471193089), previously given as handout. The LZ77 method was only given
on the blackboard. It differs from LZ78 in that the substrings pointed
to by the generated codewords may be any substring appearing in the
part of the string scanned so far. To find such maximal substrings equal
to a prefix of the remaining part of the string, one can use a suffix
tree on the scanned part of the string. This needs to be
updated/extended after scanning the next part of the string, which can
be achieved by a linear time construction algorithm for suffix trees
by Ukkonen (not covered in course).

Monday December 16 
1214 
Imada seminar room 
Sorting and searching strings. Start on dynamic programming
algorithms (recap of longest common subsequence).

Fast Algorithms for
Sorting and Searching Strings, J. Bentley and R. Sedgewick,
Proceedings of the 8th annual ACMSIAM symposium on Discrete algorithms,
360369, 1997. Section 7.3 (pages 262267) in the textbook.

Thursday December 19 
1214 
Imada seminar room 
Handout of exam topics. More dynamic programming
algorithms: Hirschbergs method for longest common subsequence in linear
space, edit distance (alignment), alignment with gaps.

Rest of sections 7.14 (pages 243276) in the textbook.

Maintained by Rolf Fagerberg
(rolf@imada.sdu.dk)

