Unsupervised Learning
DM843

General Information

One trend can be observed over almost all fields of informatics: we have to cope with an ever-increasing amount of available data of all kinds. This amount of data renders it impossible to inspect the dataset "by hand", or even deduce knowledge from the given data, without sophisticated computer aided help. In this course we will discuss one of the most common mechanism of unsupervised machine learning for investigating datasets: Clustering. Clustering separates a given dataset into groups of similar objects, the clusters, and thus allows for a better understanding of the data and their structure. We discuss a number of clustering methods and their application to various different fields such as biology, economics or sociology.

Evaluation

According to SDU regulations, the course was evaluated. Please find the evaluation here and the according action plan here.

Procedure of the oral exam

The exam will last about 25 minutes. At the beginning, one topic from the list below will be drawn randomly. For each topic the examinee should be prepared to make a short presentation of about 8 minutes. It is allowed to bring one page of hand-written notes (DIN A4 or US-Letter, one-sided) for each of the topics. The examinee will have 2 minutes to briefly study the notes for the drawn topic before the presentation. The notes may be consulted during the exam if needed but it will negatively influence the evaluation of the examinee's performance. During the presentation, only the blackboard can be used (you cannot use overhead transparencies, for instance).

After the short presentation, additional question about the presentation's topic but also about other topics in the curriculum will be asked.

Below is the list of possible topics and some suggested content. The listed content are only suggestions and is not necessarily complete nor must everything be covered in the short presentation. It is the responsibility of the examinee to gather and select among all relevant information for each topic from the course material. On the course website you can find suggested readings for each of these topics.

Topics for the Oral Exam:

  1. Graphical Analysis of Clusters
    • Histograms, Scatter Plots
    • Density Estimation
    • Principal Component Analysis
    • Principal Coordinate Analysis
  2. Proximity Measures
    • Different datatypes
    • Common measures for various data types
    • Metrics
    • Sequence similarity (BLAST)
  3. Hierarchical Clustering
    • Function principles / linking functions
    • Dendrograms
    • Comparison to crisp clusterings
    • Function principle of BIRCH
  4. Optimization Based Clustering
    • Dissection of the co-variance matrix (W and B)
    • Cluster criteria
    • K-means
    • GAP statistic
  5. Cluster Analysis
    • Steps of a cluster analysis
    • Data preprocessing
    • Cluster Validation
  6. Modern Clustering Algorithms
    • Graph clustering
    • Tools discussed in the lecture
  7. Expectation Maximization
    • Function principle
    • Gaussian Mixture Models
    • Maximum Likelihood Estimators
    • Similarity to k-means

Lectures

# Date Content Slides Readings
1 Wed, 15.04.2015 General Introduction / Mathematical Foundation here -
2 Thu, 16.04.2015 Graphical Analysis of Datasets / PCA / PCoA here Book, Chapter 2
Interesting read about PCA
3 Wed, 22.04.2015 Proximity Measures 1/2 here Book, Chapter 3
Russell, David J, Multiple Sequence Alignment Methods, Chapter 1
BLAST: Phil McClean Lecture Notes on BLAST
4 Thu, 23.04.2015 Proximity Measures 2/2 see above -
5 Wed, 29.04.2015 Hierarchical Clustering here Book, Chapter 4
Aggarwal and Reddy Data Clustering - Algorithms and Applications, Chapter 11.2
6 Thu, 30.04.2015 Optimization Based Clustering here Book, Chapter 5
7 Wed, 06.05.2015 Performing a Cluster Analysis here Book Chapter 9.1, 9.2
Aggarwal and Reddy Clustering - Algorithms and Applications, Chapter 23
additionally, see readings of the project
8 Thu, 07.05.2015 Popular Clustering Tools here Spectral clustering
Affinity propagation
Transitivity clustering / FORCE
Markov clustering
9 Tue, 12.05.2015, 8am Mixture Models & EM here Book, Chapter 6.1 & 6.2
Bishop, Pattern Recognition and Machine Learning, Chapter 9
Please also watch this video
10 Wed, 13.05.2015 Wrap-Up / Q&A here -

Project & Preparation

Project Material

  • Please find the project desciption here. If there is anything unclear or you think I didn't have captured the protocol as we discussed, please let me know. If you have any further questions, don't hesitate to write me an email or stop by at my office.
  • Protein sequences here. And the partial gold standard here.
  • A very short description how to perform a BLAST all-vs.-all run: https://www.biostars.org/p/6541/ Don't forget to set the E-value cut-off to 100.
  • You can download BLAST from the NCBI website: here. There you find the source code as well as precompiled versions for all usual operating systems.

Some Remarks regarding BLAST

If you run blastp including the parameter -outfmt 6 (as stated in the little tutorial) you will receive a file with 12 columns. The meaning of them is for example described here.

If you look at the link, you will see that you have to use the 11th column (the second last) in oder to get an E-value. With the -log(E-value) conversion, we are calculating a similarities. Now it might happen that some E-values are zero, that means the proteins are extremely similar and thus should also receive a very high similarity. Nevertheless, -log(0) equals infinity which is technically very hihg but not really practical. This can be repaired, e.g., by one of the following ways:

  • add some small pseudo count to each similarity (e.g., the smallest real float or double value > 0 before you calculate the log)
  • set all zeros to the lowest otherwise observed value >0.

With that you will receive a valid similarity matrix. If your tool requires distances, you need to convert these similarities to distances in a second preprocessing step.

Summary of the Project

In this project, we aim to separate a set of proteins into their protein families (it is not of utmost importance to know what exactly a protein family is). Without any prior biological knowledge, it is very hard to identify suitable parameters in order to identify protein families. Luckily, we have a partial gold-standard dataset consisting of 27 families, all belonging to one superfamily. The overall dataset consists of several superfamilies with an unknown number of families.

After selecting three clustering methods, we will use the gold standard in order to train the parameter sets. The best clustering is determined by using the discussed evaluation protocol of last class. These parameters are used in order to cluster the entire dataset. The results of clustering the entire dataset are also evaluated according to the protocol defined in class. At the end, you are supposed to summarize all your findings in a protocol, convincing me to favor one tool over another.

Introduction Slides

Your Prepartation

On Wednesday (May 6), we will have a discussion session in class. We will mainly discuss how a good cluster analysis looked like, how to judge the quality of a clustering and how to compare clusterings. At the end of the class we will have a protocol for performing a cluster analysis, for training the clustering tools' parameter and a meaningful evaluation. Based on this discussion, I will compile the final project description, which can be downloaded here after the discussion. As you will decide on what course the project should take, it is of crucial importance that you will show up prepared to class. For that, please read the following two papers (and an optional, rather technical third)

  • Rendón, E. et al. (2011). Internal versus External cluster validation indexes. International Journal of computers and communications, 5(1), 27-34. download
  • Handl, J. et al. (2005). Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15), 3201-3212. download
  • (Optional) Halkidi, M. et al. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3), 107-145. download
  • (Optional) Xu, Rui, et al. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 2005, 16. Jg., Nr. 3, S. 645-678. download

Available Clustering Tools

As already announced in the lecture, you have to choose three of the tools listed below. If available, you can download the corresponding paper (either the original publication, or a good overview article) of the tool/method. We also will have a session explaining these tools next week.

Materials

  • All lecture slides are relevant for the exams.
  • All readings noted in the lecture list are relevant for the exam.
  • Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl, Cluster Analysis, 5th Edition, ISBN: 9780470749913