# DM825 - Introduction to Machine Learning Sheet 11, Spring 2013 [pdf format]

Exercise 1 – Probability theory

Prove the following rule:

p(xi|xi)=
p(x1,…,xN)
 ∫ p(x1,…,xN) dxi

where xi={x1,…,xN} ∖ xi.

Exercise 2 – Naive Bayes

Consider the binary classification problem of spam email in which a binary label Y ∈ {0, 1} is to be predicted from a feature vector X = (X1, X2, …, Xn), where Xi=1 if the word i is present in the email and 0 otherwise. Consider a naive Bayes model, in which the components Xi are assumed mutually conditionally independent given the class label Y.

1. [a] Draw a directed graphical model corresponding to the naive Bayes model.
2. Find a mathematical expression for the posterior class probability p(Y = 1 | x), in terms of the prior class probability p(Y = 1) and the class-conditional densities p(xi | y).
3. Make now explicit the hyperparameters of the Bernoulli distributions for Y and Xi. Call them, µ and θi, respectively. Assume a beta distribution for the prior of these hyperparameters and show how to learn the hyperparameters from a set of training data (yj,xj)j=1m using a Bayesian approach. Compare this solution with the one developed in class via maximum likelihood.

Exercise 3 – Directed Graphical Models

Consider the graph in Figure left.

• Write down the standard factorization for the given graph.
• For what pairs (i, j) does the statement Xi is independent of Xj hold? (Don’t assume any conditioning in this part.)
• Suppose that we condition on {X2, X9}, shown shaded in the graph. What is the largest set A for which the statement X1 is conditionally independent of XA given {X2, X9} holds?
• What is the largest set B for which X8 is conditionally independent of XB given {X2, X9} holds?
• Suppose that I wanted to draw a sample from the marginal distribution p(x5) = Pr[X5 = x5]. (Don’t assume that X2 and X9 are observed.) Describe an efficient algorithm to do so without actually computing the marginal.