Export performed by marco at 1:14pm, 01 May 2013.
Export of all questions in chronological order (most recent first).
Field  Value 
ID  530770 
Created  20130318 11:28:26 
Question  Consider the graphical model in the figure. Each node represents a Gaussian variable. The mean of each variable is assumed to depend linearly on its parent variables, so the conditional distributions can be written as: Which of the statements about the covariance matrix of the Gaussian that describes the joint distribution of the variables is not true? 
A  
*B*  
C  
D  
Explanation  The mean and covariance matrix can be found by applying (8.16) from Bishop recursively: the rest of the components follow by symmetry. 
Augmented explanation 1  Note that each computation of will be converted to if
This corresponds to the cryptic explanation in the book "...and so the covariance can similarly be evaluated recursively starting from the lowest numbered node." (by: larsgmathiasen [lamat10]) 
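For reference, the recursions the explanation refers to (Bishop, PRML, section 8.1.4, eqs. (8.15)-(8.16) for a linear-Gaussian network, with weights w_ij, biases b_i and conditional variances v_i) can be written out as:

```latex
\mathbb{E}[x_i] = \sum_{j \in \mathrm{pa}(i)} w_{ij}\,\mathbb{E}[x_j] + b_i
\qquad
\operatorname{cov}[x_i, x_j] = \sum_{k \in \mathrm{pa}(j)} w_{jk}\operatorname{cov}[x_i, x_k] + I_{ij}\,v_j
```

Both are evaluated in increasing node order, which is exactly the "starting from the lowest numbered node" remark quoted above.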
Tags  lecture_11 
Author  ggbn (glnie07) 
Avg Rating  3.3300 
Avg Difficulty  2.0000 
Total ratings  3 
Field  Value 
ID  530853 
Created  20130318 08:57:06 
Question  Consider the following training data, with input vectors of four features and an output that can assume either the classification A or B:
x1 x2 x3 x4 y
3.99 1.65 6.52 2.78 B
You are given a new input vector: x = (3,3,4,5)^{T}
Your task is to calculate the probabilities of x belonging to the two classes respectively, i.e. P(Y=A | x) and P(Y=B | x),
using Gaussian discriminant analysis. 
A  
*B*  
C  
D  
E  
Explanation  First, the mean vectors, μ_A and μ_B, are calculated. This is done by using the formula on page 8 of the slides from lecture 7. I get the following vectors:
The probabilities P(Y=A) and P(Y=B) are calculated as the frequencies of these classes in the training data:
I now need to calculate the posterior probability via Bayes' rule as: P(Y=A | x) = p(x | Y=A) P(Y=A) / (p(x | Y=A) P(Y=A) + p(x | Y=B) P(Y=B)). Since I use Gaussian discriminant analysis and the covariance matrix is given as the identity matrix, I have that p(x | Y=c) = N(x; μ_c, I).
Given that the covariance matrix is the 4x4 identity matrix, I can use the two facts |I| = 1 and I^{-1} = I to write the density as: N(x; μ_c, I) = (2π)^{-k/2} exp(-||x - μ_c||^2 / 2),
where k is the number of features (in this case k = 4). The same calculation is done for P(Y=B | x), using μ_B instead of μ_A.
I thus get the posterior as:
Inserting my values I get the results: 
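A minimal sketch of the computation in code; only one row of the original training table is shown above, so the matrices X_A and X_B below are hypothetical stand-ins:

```python
import numpy as np

def gda_posterior(x, X_A, X_B):
    """P(Y=A|x) and P(Y=B|x) under GDA with identity covariance."""
    mu_A, mu_B = X_A.mean(axis=0), X_B.mean(axis=0)
    p_A = len(X_A) / (len(X_A) + len(X_B))   # class frequencies as priors
    p_B = 1.0 - p_A
    # With Sigma = I, N(x; mu, I) is a constant times exp(-||x - mu||^2 / 2);
    # the (2*pi)^{-k/2} constant cancels in the posterior.
    lik_A = np.exp(-0.5 * np.sum((x - mu_A) ** 2))
    lik_B = np.exp(-0.5 * np.sum((x - mu_B) ** 2))
    evidence = lik_A * p_A + lik_B * p_B
    return lik_A * p_A / evidence, lik_B * p_B / evidence

# hypothetical data with the same shape as the question's table
X_A = np.array([[2.9, 3.1, 4.2, 5.0], [3.2, 2.8, 3.9, 4.8]])
X_B = np.array([[3.99, 1.65, 6.52, 2.78], [4.1, 1.8, 6.3, 3.0]])
print(gda_posterior(np.array([3.0, 3.0, 4.0, 5.0]), X_A, X_B))
```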
Tags  lecture_7 
Author  tvh10 (tomha10) 
Avg Rating  4.5000 
Avg Difficulty  1.0000 
Total ratings  4 
Field  Value 
ID  534876 
Created  20130318 07:33:06 
Question  Suppose you want to predict whether the movie you are currently watching is a Star Wars movie, using a multinomial event model. To this end you have classified a number of previously watched movies as Star Wars movies or not, based on the number of Star Wars-related props used in each movie. You choose to represent each movie as a vector, where each prop used is discretized into one of three buckets (few, some or many) depending on how many times the prop occurs in the movie: You assume that the probability of a prop being discretized to a bucket, k, given that the movie is a Star Wars movie, is the same for all props.
Your previously watched movies are used as training data, represented below as two matrices where each row represents a movie:
Given that the movie you are currently watching has the input vector: what classification will be predicted for the movie? 
A  The movie will be predicted to not be a starwars movie 
*B*  The movie will be predicted to be a starwars movie 
Explanation  We represent the discretized buckets few, some and many as 0, 1 and 2 respectively. Let x_j^{(i)} represent the number of occurrences of the prop at position j in the i-th movie. Let y=0 be the case where the movie was not a Star Wars movie and y=1 the case where it was. We estimate the parameters of the multinomial distribution as done in the slides of lecture 7. This can be done because we assume that the bucket probability is the same for all j's, as stated in the description. Thus we end up with the following solution, where m is the number of observations, n is the number of props in each observation, and k is a specific bucket: Given that m=8 and n=5 we can calculate all parameters:
To predict our input vector, x, we choose the y that maximizes the posterior: We use logarithms to prevent underflow: Thus we predict the movie to be a Star Wars movie.
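A runnable sketch of the estimation and prediction steps under the stated shared-bucket assumption; the original training matrices are not shown above, so the data below (and the Laplace smoothing) are assumptions:

```python
import numpy as np

def bucket_probs(X, n_buckets=3):
    """phi_k = P(bucket k | class), shared across all props (Laplace-smoothed)."""
    counts = np.bincount(X.ravel(), minlength=n_buckets)
    return (counts + 1) / (counts.sum() + n_buckets)

def predict(x, X_sw, X_other):
    phi_sw, phi_other = bucket_probs(X_sw), bucket_probs(X_other)
    prior_sw = len(X_sw) / (len(X_sw) + len(X_other))
    # work in log space to prevent underflow, as in the explanation
    log_sw = np.log(prior_sw) + np.log(phi_sw[x]).sum()
    log_other = np.log(1 - prior_sw) + np.log(phi_other[x]).sum()
    return 1 if log_sw > log_other else 0   # 1 = Star Wars

# hypothetical training data: m = 8 movies, n = 5 props, buckets 0/1/2
X_sw = np.array([[2, 1, 2, 2, 1], [1, 2, 2, 1, 2], [2, 2, 1, 2, 2], [1, 1, 2, 2, 1]])
X_other = np.array([[0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [1, 0, 0, 1, 0], [0, 0, 0, 0, 1]])
print(predict(np.array([2, 1, 2, 2, 2]), X_sw, X_other))
```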

Tags  lecture_7 
Author  nnoej10 (nnoej10) 
Avg Rating  5.0000 
Avg Difficulty  1.0000 
Total ratings  1 
Field  Value 
ID  530518 
Created  20130318 04:53:36 
Question  Consider an electron emitter which emits electrons with some interarrival time between each emission. Suppose the interarrival times are independent and identically exponentially distributed. The rate depends on a constant and on the temperature of the cathode, which is determined by the electric current:
Suppose the relationship is .
We want to learn the constant using the Bayesian approach, so we assume a conjugate prior. The gamma distribution is conjugate to the exponential likelihood, so it will work fine. Thus we define the prior:
Suppose, based on previous work, you choose the parameters to be .
You now observe 5 interarrival times at different currents:
Derive the posterior. What is the expected value of the constant given the new observations? 
*A*  41.8798 
B  43.0205 
C  42.0000 
D  43.0384 
E  41.0273 
Explanation  The posterior is given by:
Since the observations are independent and identically exponentially distributed, we have:
Thus we have:
From this we see that the posterior is a new gamma distribution given by:
The expected value of a gamma-distributed variable is simply the shape parameter divided by the rate parameter, so in this case we have:
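For reference, the standard conjugate update this derivation instantiates, written for a plain rate λ (in this question the rate is additionally scaled by a function of the current, which changes what enters the sum):

```latex
\lambda \sim \mathrm{Gamma}(\alpha, \beta),\quad
t_1,\dots,t_n \mid \lambda \overset{\text{iid}}{\sim} \mathrm{Exp}(\lambda)
\;\Longrightarrow\;
\lambda \mid t \sim \mathrm{Gamma}\Big(\alpha + n,\; \beta + \textstyle\sum_{i=1}^n t_i\Big),
\qquad
\mathbb{E}[\lambda \mid t] = \frac{\alpha + n}{\beta + \sum_i t_i}
```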

Tags  lecture_2 
Author  troelsmn (trnie09) 
Avg Rating  4.6000 
Avg Difficulty  1.8000 
Total ratings  5 
Field  Value 
ID  523866 
Created  20130313 03:35:34 
Question  Suppose we have the Bernoulli distribution p(x; φ) = φ^x (1-φ)^{1-x}, and x_1, ..., x_m are iid with x_i ~ Bernoulli(φ). Your task is now to derive the maximum likelihood estimate of φ. 
A  
*B*  
C  
D  
E  
Explanation  We have the likelihood function L(φ) = Π_{i=1}^{m} φ^{x_i} (1-φ)^{1-x_i}. Taking the natural logarithm of the likelihood function yields ℓ(φ) = Σ_{i=1}^{m} [x_i ln φ + (1-x_i) ln(1-φ)]. Now, taking the derivative: dℓ/dφ = Σ_i x_i / φ - Σ_i (1-x_i) / (1-φ). To find the maximum we set it equal to zero, which gives φ = (1/m) Σ_{i=1}^{m} x_i. 
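A quick numerical check of the closed form on a made-up sample:

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 0])  # hypothetical iid Bernoulli sample
phi_hat = x.mean()                       # MLE is the sample mean: 5/8 = 0.625
print(phi_hat)
```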
Author  acarbalacar (daand09) 
Avg Rating  3.0000 
Avg Difficulty  0.2000 
Total ratings  5 
Comment 1  This is at the limit of what would be defined as a too easy question... Do not expect something as easy at the exam. (by: marco [marco]) 
Field  Value 
ID  522394 
Created  20130312 00:48:49 
Question  Let
be a Poisson distribution with parameter and .
In this exercise we know that the parameter is either or , with subjective prior probabilities and assigned to the two possible values.
(Note that it might be unrealistic that the parameter can only take one of two values instead of a continuous range.) 
A  
B  
C  
*D*  
E  
Explanation 
So the posterior probability of that value of the parameter is smaller than its prior probability. 
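A sketch of the two-hypothesis Bayesian update; the concrete parameter values, priors and observation are not shown above, so the numbers below are placeholders:

```python
import math

def poisson_pmf(k, lam):
    return lam ** k * math.exp(-lam) / math.factorial(k)

def posterior(k, lam1, lam2, prior1):
    """P(lambda = lam1 | X = k) when lambda is known to be lam1 or lam2."""
    p1 = poisson_pmf(k, lam1) * prior1
    p2 = poisson_pmf(k, lam2) * (1.0 - prior1)
    return p1 / (p1 + p2)

# placeholder values: lambda is 2 or 5, prior 0.7 on lambda = 2, observe X = 6
print(posterior(6, 2.0, 5.0, 0.7))
```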
Tags  lecture_2, lecture_12 
Author  valdemar (chha309) 
Avg Rating  4.2000 
Avg Difficulty  1.2000 
Total ratings  5 
Field  Value 
ID  523937 
Created  20130311 16:44:43 
Question  The figure below shows 4 different graphical models. Each node is a binary variable, and the maximal width of the networks is given by an even number M (so each row with 3 "dots" contains M nodes).
Select the answer below which does not correspond to the number of parameters needed to describe one of the models above. 
A  
B  
C  
D  
*E*  
Explanation  For each node, the number of parameters needed is given by the number of parameters needed to describe the node itself times the number of different states its parent nodes can take.
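In symbols: a binary node needs one parameter per configuration of its parents, so the total count for a network over binary variables is

```latex
\#\text{parameters} = \sum_{i} 2^{|\mathrm{pa}(i)|}
```

For example, a binary node with three binary parents contributes 2^3 = 8 parameters.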

Tags  lecture_11 
Author  ggbn (glnie07) 
Avg Rating  3.4000 
Avg Difficulty  0.4000 
Total ratings  5 
Field  Value 
ID  520454 
Created  20130309 02:06:07 
Question  You are given the following training set, X, containing six inputs of a single feature, along with the corresponding observed outputs, Y:
X = (10,3,1,8,4,9)^{T} Y = (9,4,2,6,3,5)^{T}
Your task is to predict the outcome, y, of a new input, x = 6, using the locally weighted linear regression model with a bandwidth of τ = 2. This entails minimizing over the parameter θ (in this case a scalar), and computing y = θx. 
Answer which one of the following statements is false. 
*A*  
B  
C  
D  
Explanation  Calculation of the weights yields the following vector, W:
W = (0.135.., 0.324.., 0.043.., 0.606.., 0.606.., 0.324..)^{T}
I rewrite the expression in the following manner: firstly, the objective can be expressed in the form J(θ) = Σ_i w_i (y_i - θ x_i)^2, where w_i = exp(-(x_i - x)^2 / (2τ^2)).
With concrete values this becomes:
To minimize with respect to θ, I compute the derivative and set it equal to 0:
dJ/dθ = -2 Σ_i w_i x_i (y_i - θ x_i) = 0 <=> θ Σ_i w_i x_i^2 = Σ_i w_i x_i y_i <=> θ = (Σ_i w_i x_i y_i) / (Σ_i w_i x_i^2)
The prediction for x can now be computed: y = θx.
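A numerical sketch, assuming the Gaussian kernel and the no-intercept model that the scalar θ implies; the query point x = 6 and bandwidth τ = 2 are recovered from the weight vector above:

```python
import numpy as np

X = np.array([10, 3, 1, 8, 4, 9], dtype=float)
Y = np.array([9, 4, 2, 6, 3, 5], dtype=float)
x_new, tau = 6.0, 2.0

w = np.exp(-(X - x_new) ** 2 / (2 * tau ** 2))  # reproduces W above
theta = np.sum(w * X * Y) / np.sum(w * X * X)   # closed-form minimizer
y_hat = theta * x_new
print(w, theta, y_hat)                          # theta ~ 0.735, y_hat ~ 4.41
```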

Tags  lecture_2 
Author  tvh10 (tomha10) 
Avg Rating  4.2000 
Avg Difficulty  1.0000 
Total ratings  5 
Field  Value 
ID  518688 
Created  20130308 11:01:13 
Question  Suppose you have been studying a dripping water tap. It turns out that the time intervals between drops are independently and identically distributed according to the distribution:
From the study you found out that the parameter is distributed according to:
where the parameters are .
You now observe the next 5 interval times:
Derive the posterior distribution. What is the expected value of the parameter given the observations? 
A  1.870726 
B  1.582982 
C  2.419451 
*D*  1.818182 
E  2.499716 
Explanation  The posterior must again be a gamma distribution and satisfies the proportionality:
Since the observations are independent and identically exponentially distributed, we have:
Thus we have:
From this we see that the posterior is a new gamma distribution given by:
The expected value of a gamma-distributed variable is simply the shape parameter divided by the rate parameter, so in this case we have:
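A minimal sketch of the update; the prior parameters and the observed interval times are not shown above, so the numbers below are hypothetical placeholders:

```python
import numpy as np

def exp_gamma_update(alpha, beta, t):
    """Posterior Gamma(alpha + n, beta + sum(t)) for iid Exponential data."""
    alpha_post = alpha + len(t)
    beta_post = beta + float(np.sum(t))
    return alpha_post, beta_post, alpha_post / beta_post  # last value: posterior mean

# placeholder prior and observations
print(exp_gamma_update(alpha=2.0, beta=1.0, t=[0.5, 1.2, 0.8, 0.3, 1.05]))
```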

Tags  lecture_2 
Author  troelsmn (trnie09) 
Avg Rating  4.5000 
Avg Difficulty  1.7500 
Total ratings  4 
Field  Value 
ID  517698 
Created  20130308 04:46:57 
Question  Given the poisson distribution:
which of the following statements is the conclusion of the proof showing that the distribution is a member of the exponential family of distributions. 
*A*  
B  
C 

Explanation 
p(x; λ) = λ^x e^{-λ} / x! = (1/x!) exp(x ln λ - λ) = b(x) exp(η T(x) - a(η)), with natural parameter η = ln λ, sufficient statistic T(x) = x, log-partition function a(η) = e^η = λ, and base measure b(x) = 1/x!. 
Tags  lecture_4 
Author  nnoej10 (nnoej10) 
Avg Rating  3.5000 
Avg Difficulty  0.7500 
Total ratings  4 
Field  Value 
ID  515538 
Created  20130307 02:57:44 
Question  What is the difference between generative classifier and discriminative classifier models? Generally, let x be the input parameters and y the class. 
A  Bayes' theorem represents an example of discriminative modeling, whereas in the generative approach we maximize the likelihood function for the conditional distribution p(y|x). 
*B*  Generative classifiers learn a model of the joint probability p(x,y), and make their predictions by using Bayes' rule to calculate p(y|x) and then choosing the most likely class y. Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels. 
C  Only discriminative training is used in supervised learning, since the generative model can only learn from p(x|y) and thus cannot tell us anything useful about the posterior. 
D  The generative and discriminative classifier models learn from data and predict in the same way, and differ only in that in the discriminative model y (the class) is considered the input and x (the parameters) the output, contrary to the generative model. 
Explanation  See the notes from Andrew Ng, part IV, or book B1 page 204, or the slides from lecture 7. 
Tags  lecture_7 
Author  valdemar (chha309) 
Avg Rating  3.5000 
Avg Difficulty  0.5000 
Total ratings  4 
Field  Value 
ID  515518 
Created  20130307 02:20:40 
Question  Consider the following categorical data and assume that a logistic regression is appropriate:
Obs Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Group
1 25.53 300 0.001 0 0.164 0 1.65 0.36 0
2 12.98 387 0.786 0.34 0.600 0.002 1.15 0.60 0
3 29.27 182 0.08 0.2 0.386 0.175 0.45 0.04 0
4 23.67 367 0.001 0 0 0 0.56 0.97 0
.....
27 21.54 312 0.651 0.834 0.084 0 1.04 0.83 1
28 17.45 242 1.337 0.060 0.724 0.38 0.89 0.86 1
29 19.45 140 0.453 0 0.194 0 1.23 1.66 1
30 24.94 303 0.541 0.484 0.534 0 0.86 0.87 1
The data are fictitious and do not represent anything real. As can be seen, we have 30 observations and 8 variables.
You are to consider how many of these variables should be included in the model, and why. (You should not consider which exact variables you would include, just the number of variables.) 
A  All 8 variables should be used, because more variables will always give a more accurate model, and with as few data points as we have in this case, the model construction will be fast no matter what, so there is no need to consider simplified models. 
*B*  With only 30 data points one should use 3-6 variables: having less than 1/10 of the number of data points as variables will make it hard to get a good fit, and more than 1/5 of the data points in variables gives rise to the risk of overfitting. To verify which variables to include, one should check the p-values testing whether each parameter estimate differs from zero. 
C  As few variables as possible should be chosen: 1 or 2, still verified by the p-values as in answer B. With only 1 or 2 variables it will also be possible to visualize the grouping structure produced by the model, simply by making a 2- or 3-dimensional plot, so one should always consider not using more than 2 variables. 
Explanation  Having too many variables will often result in overfitting, which is illustrated in the slides of lecture 2. Essentially, the problem is that there is an infinite number of ways the model can choose its parameters while still covering the points in the data. If this occurs, the model will only describe the training data, and will likely not be usable for other data, since it does not describe the behavior of the data in general. This corresponds to the warning message R prints in such cases. 
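A minimal sketch of the verification step in answer B, using hypothetical stand-in data (the table above is fictitious anyway) and statsmodels for the p-values:

```python
import numpy as np
import statsmodels.api as sm

# hypothetical stand-in for the 30x8 table above
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=30) > 0).astype(int)

# fit on a subset of 3-6 variables rather than all 8
model = sm.Logit(y, sm.add_constant(X[:, :3])).fit(disp=0)
print(model.pvalues)  # check whether each parameter estimate differs from zero
```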
Tags  lecture_2 
Author  acarbalacar (daand09) 
Avg Rating  4.0000 
Avg Difficulty  0.5000 
Total ratings  4 
Field  Value 
ID  513131 
Created  20130305 10:34:53 
Question  Consider the following neural network.
The weights, including bias, are defined as follows.
Furthermore, the activation function at the hidden layer, as well as the activation function at the output layer, are given by the logistic sigmoid function;
Your task is to use forward propagation to calculate the estimate,
that the neural network produces on the following input.

*A*  0.003309 
B  0.213426 
C  0.013655 
D  0.748314 
E  0.508049 
Explanation  The formula for forward propagation is given as follows:
y = σ( Σ_{j=1}^{M} w_j^{(2)} f( Σ_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} ) + w_0^{(2)} ),
where M = D = 2 and f is the logistic sigmoid function in our case.
We evaluate the sum with our weights;
Introducing the input yields the following;
f was the logistic sigmoid function;
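A sketch of the computation; the concrete weight and input values are not shown above, so the ones below are placeholders:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """Forward propagation: sigmoid hidden layer, sigmoid output layer."""
    z = sigmoid(W1 @ x + b1)      # hidden activations, M = 2 units
    return sigmoid(W2 @ z + b2)   # scalar output estimate

# placeholder weights and input (D = M = 2)
W1 = np.array([[0.5, -1.0], [1.5, 2.0]]); b1 = np.array([0.1, -0.2])
W2 = np.array([-2.0, 1.0]);               b2 = -0.5
x = np.array([1.0, 0.0])
print(forward(x, W1, b1, W2, b2))
```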

Tags  lecture_5 
Author  larsgmathiasen (lamat10) 
Avg Rating  3.5000 
Avg Difficulty  1.0000 
Total ratings  6 
Field  Value 
ID  511287 
Created  20130303 12:54:57 
Question  Suppose you have implemented your favorite learning algorithm, namely logistic regression, to solve a given learning problem. Unfortunately, you are getting an intolerable test error with the parameters learned from your training data. Analysis of the situation showed that it is a problem of either high bias or high variance. State, for each of the following approaches whether it will solve high bias, high variance or both.
a) Acquire more training data b) Reduce the set of features c) Increase the set of features 
A  a) Solves both high bias and high variance b) Solves high variance c) Solves high bias 
B  a) Solves high bias b) Solves high bias c) Solves high variance 
C  a) Solves high variance b) Solves high variance c) Solves both high bias and high variance 
D  a) Solves high bias b) Solves high variance c) Solves high bias 
*E*  a) Solves high variance b) Solves high variance c) Solves high bias 
Explanation  If the bias is high, then we are consistently learning the same wrong thing, regardless of the amount of training data we have. E.g. trying to fit a linear function to quadratic data (underfitting). This can often be seen if high training error is observed together with high test error. Thus, the only remedy is increasing the set of features; more training examples will not help and reducing the set of features will definitely not help.
If the variance is high, then we are overfitting the data, i.e. fitting to a training set that is too small to reflect the true pattern of the data. E.g. fitting a 9th-order polynomial to 10 training points. This can be seen when the training error is low but the test error is high. The conclusion is that we have too many features, so we should reduce the set of features, not increase it. Furthermore, we could address the high variance by acquiring more training examples to get rid of the overfitting. 
Tags  lecture_10 
Author  larsgmathiasen (lamat10) 
Avg Rating  3.4000 
Avg Difficulty  0.6000 
Total ratings  5 