DM79 Algorithms for Web Indexing and Searching

Fall 2004
Rolf Fagerberg


Official Course Description

See the course description at the web pages of the faculty.

Time and Place

The course runs from September 2 to December 16, except for the fall break in week 42 (October 11-15).

The schedule is as follows:

  • Tuesdays (some) 14.15 to 16.00. We will use approximately every second Tuesday. These will be announced one week ahead.
  • Thursdays 14.15 to 16.00.

All lessons take place in Imadas seminar room (across room U49A).

Most lessons will be lectures, but a few may be used for exercises.

Exam

The exam is a 30 minutes oral exam (no preparation time) and will take place on Thursday, January 27. A list of questions (ps, pdf), from which you will draw one at the exam, is now available. Curriculum (ps, pdf) is also available.

There is a "spørgetime" (session allowing you to ask questions to the lecturer on the curriculum and the exam) on January 25 at 10:15 in room U49D.

The grades at the exam ended up as follows: ps, pdf.

Lectures

  • September 2. Introduction to course (slides: ps.gz, pdf). Some Google facts (slides: ps.gz, pdf). Overview of search engines (slides: ps.gz, pdf).

    Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine. Sergey Brin and Lawrence Page. Proceedings of the 7th International WWW conference, 1998.

  • September 7. Discussion of project (slides: ps.gz, pdf). Further discussion of PageRank.

    We also decided to do double lectures (14.15 to 16.00) on Tuesdays, approximately every second Tuesday (instead of one lecture every Tuesday). Tuesday lectures will be announce the week before.

    Reading: The slides.

  • September 9. Dealing with data on disk: the I/O-model (slides: ps.gz, pdf), external sorting (slides: ps.gz, pdf). Start on crawling.

    Reading: The slides (Note: the lower bound for external sorting is just optional reading and is NOT part of the curriculum. The same goes for the paper mentioned at the start of the slides (which was not handed out)). High-Performance Web Crawling. Marc Najork and Allan Heydon. Compaq SRC Research Report 173, 2001. Also appeared in Handbook of Massive Data Sets, Kluwer, 2001.

    We also settled the project groups, and discussed the process a bit more: An obvious placement of global team discussions (i.e. all sub-groups of a team) is the Tuesdays where we are not having lectures.

  • September 14. Crawling (slides: ps.gz, pdf).

    Reading: Breadth-First Search Crawling Yields High-Quality Pages. Marc Najork and Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web Conference, pages 114-118, May 2001.

  • September 16. More on crawling. Start on indexing (slides: ps.gz, pdf).

    Reading:

    Pages 239-241 in HTTP: The Definitive Guide. David Gourley, Brian Totty. O'Reilly, 2002. ISBN: 1-56592-509-2.

    Breadth-First Search Crawling Yields High-Quality Pages. Marc Najork and Janet L. Wiener. In Proceedings of the Tenth Internal World Wide Web Conference, pages 114-118, May, 2001.

  • September 21. The will be no lectures this Tuesday. Instead, global team meetings should be arranged (possible at another day/time if this suits the team better).

  • September 23 More on indexing.

    Reading:

    All except Sections 4.2-3 of: Building a Distributed Full-Text Index for the Web. Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina. Stanford Technical Report 2000-55. Short version of paper appeared in the proceedings of the 10th International WWW conference, May 2-5, 2001, Hong Kong.

    Pages 109-122, 145-151, 156-161, 169-170 and 175-179 of: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition. Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Morgan Kaufmann Publishing, San Francisco, 1999. ISBN 1-55860-570-3.

  • September 28 More on indexing: index compression. PageRank computation.

    Reading: Efficient Computation of PageRank. Taher H. Haveliwala. Stanford Technical Report 1999-31.

  • September 30 More on computing PageRank. Start on querying: The vector space model.

    Reading: Pages 31-41 in Understanding Search Engines: Mathematical Modeling and Text Retrieval. M.W. Berry and M. Browne. SIAM Book Series: Software, Environments, and Tools, (June 1999), ISBN: 0-89871-437-0.

  • October 5 The will be no lectures this Tuesday.

  • October 7 More on the vector space model. More bits and pieces on querying. Retrieval performance evaluation.

    Reading: Pages 24-30, 74-81, 99-106, 118-119, and 207-208 in Modern Information Retrieval. Ricardo Baeza-Yates, Berthier Ribiero-Neto. Addison Wesley Higher Education, 1999. ISBN: 020139829X.

  • No lectures during the fall break in week 42 (Oct. 11 to 15).

  • October 19. There will be no lectures this Tuesday (as the seminar room is used for a computer science colloquium).

  • October 21. Tries. Sorting strings. Ternary search trees. It was agreed that the dates for remaining hand-ins of meeting minutes from the project work are: October 28, November 25, and December 20 (all of which are Thursdays).

    Reading:

    Pages 566-576 in Data Structures and Algorithms in Java. Michael T. Goodrich and Roberto Tamassia. Wiley, 1998. ISBN: 0-471-19308-9.

    Pages 116-119 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

    Fast Algorithms for Sorting and Searching Strings. Jon Bentley and Robert Sedgewick. Proceedings of the eighth annual ACM-SIAM symposium on Discrete algorithms. New Orleans, January, 1997. Pages 360-369.

  • October 26. Lectures this day was cancelled (in last moment) because a kandidateksamen had been scheduled in the room (actions are now being taken to better prevent double-booking for the rest of the semester).

  • October 28. Suffix tree construction. We will also gather information in order to settle exam dates for DM79 - bring you existing exam dates for other courses.

    Reading: Pages 94-107 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

  • November 2. No lectures this Tuesday.

  • November 4. End of proof of Ukkonens suffix tree construction algorithm. Second lesson was a guest lecture by Lawrence Lessig on copyright issues relating to the Internet.

  • November 9. There will be no lectures this Tuesday, as the seminar room is used for a computer science colloquium.

  • November 11. Suffix arrays. Latent Semantic Indexing.

    Reading:

    Pages 149-155 in Algorithms on Strings, Trees, and Sequences. Dan Gusfield. Cambridge University Press, 1997. ISBN: 0521585198.

    Pages 33, 41-45, and 53-58 in Understanding Search Engines: Mathematical Modeling and Text Retrieval. M.W. Berry and M. Browne. SIAM Book Series: Software, Environments, and Tools, (June 1999), ISBN: 0-89871-437-0.

  • November 16. More on latent semantic indexing.

    Reading: Sections 1, 2, 4, and 5 in Spectral Analysis of Data. Yossi Azar, Amos Fiat, Anna R. Karlin, Frank McSherry, and Jared Saia. Proceedings of the 33rd Annual ACM Symposium on Theory of Computing (STOC), pages 619--626, 2001.

  • November 18. Other types of link based ranking: HITS and (briefly) SALSA.

    Reading: A Survey of Eigenvector Methods of Web Information Retrieval. Amy N. Langville and Carl D. Meyer, 2004. To appear in SIAM Review.

  • November 23. No lectures this Tuesday.

  • November 25. Structure of the Web Graph. The date for the exam was moved to either January 26 or January 27 (depending on the forthcoming teaching schedule of censor, which will be settled during December).

    Reading: Graph structure in the web . Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. In proceedings of Ninth International World Wide Web Conference (WWW9), 2000.

  • November 30. Details of SALSA.

    Reading: A Survey of Eigenvector Methods of Web Information Retrieval. Amy N. Langville and Carl D. Meyer, 2004. To appear in SIAM Review.

  • December 2. Models for the web graph.

    Reading: Non-proof parts of Stochastic models for the Web graph . R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Proceedings of the 41th IEEE Symp. on Foundations of Computer Science. November 2000, pp. 57-65.

  • December 7. No lectures this Tuesday.

    Note that you need to sign up for next semesters courses at the administration web pages. If you are taking elective courses at Imada next semester, you should also give information on your selected courses to Imada if you wish to influence the schedule of the elective courses. Deadline for both is Friday, December 10. As a new thing, signing up for a course also means signing up for its exam (i.e. there will not be any signing up for exams in March, and you will loose an exam attempt if you sign up for a course but do not show up for the exam, unless you in the mean time sign off for the exam).

  • December 9. Bipartite cliques in the web graph. Document similarity.

    Reading:

    Trawling the Web for emerging cyber-communities. Ravi Kumar, Prabhakar Raghavan, Rajagopalan Rajagopalan, and Andrew Tomkins. Computer Networks, 31 (11-16), pages 1481-1493, 1999. Also in Proceedings of the 8th WWW conference.

    On the resemblance and containment of documents. Andrei Broder. In Compression and Complexity of Sequences (SEQUENCES'97), pages 21-29. IEEE Computer Society, 1998. (Section 4 can be read lightly).

  • December 14 Document similarity. Google cluster architecture. The exam date has now been finally settled to Thursday, January 27. The deadline for the project has been moved to Tuesday, January 11, 11:00, where a small inauguration reception for the crawler projects will take place (room to be announced later).

    Reading: Web Search for a Planet: The Google Cluster Architecture. Luiz André Barroso, Jeffrey Dean, and Urs Hölzle. IEEE Micro, 23(2), pp. 22-28, 2003.

  • December 16. Game theory and the Internet. There will be a "spørgetime" (session allowing you to ask questions to the lecturer on the curriculum and the exam) on January 25 at 10:15 in room U49D.

    Reading: Algorithms, games, and the Internet. Christos Papadimitriou. Proceedings of the 33rd STOC, pp. 749-753, ACM Press, 2001.


Maintained by Rolf Fagerberg (rolf@imada.sdu.dk)