DM825-2013

PeerWise question export

Export performed by marco at 1:14pm, 01 May 2013.
Exporting All questions in chronological order (most recent first).

Field	Value
ID	530770
Created	2013-03-18 11:28:26
Question	Consider the graphical model in the figure. Each node represents a Gaussian variable. The mean of each variable is assumed to depend linearly on its parent variables, so the conditional distributions can be written as: Where and describe the linear connection and the variance. Which of the statements about the covariance matrix of the Gaussian that describes the joint distribution of the variables is not true?
A
B
C
D
Explanation	The mean and covariance matrix can be found by using (8.16) from Bishop recursively: The rest of the components follow by symmetry.
Augmented explanation 1	Note that each computation of will be converted to if This corresponds to the cryptic explanation in the book "...and so the covariance can similarly be evaluated recursively starting from the lowest numbered node." (by: larsgmathiasen* [lamat10]*)
Tags	lecture_11
Author	ggbn (glnie07)
Avg Rating	3.3300
Avg Difficulty	2.0000
Total ratings	3

Field	Value
ID	530853
Created	2013-03-18 08:57:06
Question	Consider the following training data with four input features and an output that is classified as either A or B.
x1    x2     x3    x4    y
---------------------------------
3.99  -1.65  6.52  2.78  B
4.45  1.68   7.54  1.59  B
1.61  5.44   3.79  7.51  A
1.61  4.83   2.92  9.50  A
2.74  0.66   5.17  4.12  B
4.26  -0.20  5.43  1.11  B
3.30  4.85   3.95  9.23  A
3.34  0.77   5.73  2.21  B
2.04  5.70   3.97  7.55  A
You are given a new input vector: $\hat{x} = (3,3,4,5)^T$. Your task is to calculate the probabilities of $\hat{x}$ belonging to each of the two classes, i.e. $p(Y=A \mid \hat{x})$ and $p(Y=B \mid \hat{x})$, using Gaussian discriminant analysis. To simplify the calculations, you should assume that the covariance matrix, $\Sigma$, is the 4x4 identity matrix.
A	$p(Y=A \mid \hat{x}) = 0.543$ $p(Y=B \mid \hat{x}) = 0.457$
B	$p(Y=A \mid \hat{x}) = 0.663$ $p(Y=B \mid \hat{x}) = 0.337$
C	$p(Y=A \mid \hat{x}) = 0.82$ $p(Y=B \mid \hat{x}) = 0.18$
D	$p(Y=A \mid \hat{x}) = 0.337$ $p(Y=B \mid \hat{x}) = 0.663$
E	$p(Y=A \mid \hat{x}) = 0.457$ $p(Y=B \mid \hat{x}) = 0.543$
Explanation	First, the mean vectors $\vec{\mu}_A$ and $\vec{\mu}_B$ are calculated using the formula on page 8 of the slides from lecture 7. I get the following vectors: $\vec{\mu}_A = (2.14,\ 5.205,\ 3.6575,\ 8.4475)^T$ $\vec{\mu}_B = (3.756,\ 0.252,\ 6.078,\ 2.362)^T$ The probabilities P(Y=A) and P(Y=B) are calculated as the frequencies of these classes in the training data: $\phi_A = \frac{4}{9}$ $\phi_B = \frac{5}{9}$ I now need to calculate the posterior probability as: $p(Y=A \mid \hat{x}) = \frac{p(\hat{x} \mid Y=A)p(Y=A)}{p(\hat{x})}$ Since I use Gaussian discriminant analysis with the covariance matrix given as the identity matrix, I have that $\hat{x} \mid Y = A \sim \mathcal{N}(\vec{\mu}_A, I)$. Because the covariance matrix is the 4x4 identity matrix, I can use the two facts $|I| = 1$ and $I^{-1} = I$ to write the density as: $p(\hat{x} \mid Y=A) = \frac{1}{(2\pi)^{k/2}}\exp\left(-\frac{1}{2}(\hat{x}-\vec{\mu}_A)^T(\hat{x}-\vec{\mu}_A)\right)$ where k is the number of features (in this case k = 4). The same calculation is done for $p(\hat{x} \mid Y=B)$, using $\vec{\mu}_B$ instead of $\vec{\mu}_A$. I thus get the posterior as: $p(Y=A \mid \hat{x}) = \frac{p(\hat{x} \mid Y=A)\phi_A}{p(\hat{x} \mid Y=A)\phi_A+p(\hat{x} \mid Y=B)\phi_B}$ Inserting my values I get the results: $p(Y=A \mid \hat{x}) = 0.663$ $p(Y=B \mid \hat{x}) = 0.337$
Tags	lecture_7
Author	tvh10 (tomha10)
Avg Rating	4.5000
Avg Difficulty	1.0000
Total ratings	4
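
The calculation in the explanation above can be checked with a short script. This is a minimal sketch in plain Python, assuming the identity covariance matrix stated in the question; the function names (class_mean, log_density_identity_cov, gda_posterior) are illustrative choices and do not come from the course material.

```python
import math

# Training data from the question: (features, label).
TRAINING_DATA = [
    ((3.99, -1.65, 6.52, 2.78), "B"),
    ((4.45, 1.68, 7.54, 1.59), "B"),
    ((1.61, 5.44, 3.79, 7.51), "A"),
    ((1.61, 4.83, 2.92, 9.50), "A"),
    ((2.74, 0.66, 5.17, 4.12), "B"),
    ((4.26, -0.20, 5.43, 1.11), "B"),
    ((3.30, 4.85, 3.95, 9.23), "A"),
    ((3.34, 0.77, 5.73, 2.21), "B"),
    ((2.04, 5.70, 3.97, 7.55), "A"),
]


def class_mean(rows):
    """Component-wise mean of a list of feature tuples."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def log_density_identity_cov(x, mu):
    """Log of N(x | mu, I); with Sigma = I the determinant is 1 and the inverse is I."""
    squared_distance = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return -0.5 * len(x) * math.log(2 * math.pi) - 0.5 * squared_distance


def gda_posterior(data, x_new):
    """Posterior class probabilities p(Y = c | x_new) via Bayes' rule."""
    rows_by_class = {}
    for features, label in data:
        rows_by_class.setdefault(label, []).append(features)
    log_joint = {
        label: math.log(len(rows) / len(data))  # class prior phi_c
        + log_density_identity_cov(x_new, class_mean(rows))
        for label, rows in rows_by_class.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_joint.values())
    unnormalized = {c: math.exp(v - shift) for c, v in log_joint.items()}
    evidence = sum(unnormalized.values())
    return {c: u / evidence for c, u in unnormalized.items()}


posterior = gda_posterior(TRAINING_DATA, (3, 3, 4, 5))
print({c: round(p, 3) for c, p in posterior.items()})
```

Rounded to three decimals, the two posteriors should agree with answer B (0.663 for class A and 0.337 for class B). Working with log densities is a design choice made here to avoid underflow; it is not required for data of this size.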

Field	Value
ID	534876
Created	2013-03-18 07:33:06
Question	Suppose you want to predict whether the movie you are currently watching is a Star Wars movie using a multinomial event model. To this end you have classified a number of previously watched movies as Star Wars movies or not, based on the number of Star Wars related props used in the movie. You choose to represent each movie as a vector, where each prop used is discretized into one of three buckets (few, some or many) depending on how many times the prop occurs in the movie: You assume that the probability for a prop to be discretized to a bucket, k, given that the movie is a Star Wars movie, is the same for all props. Your previously watched movies are used as training data, and are represented below as two matrices where each row represents a movie: Given that the movie you are currently watching has the input vector: which class will the movie be predicted to belong to?
A	The movie will be predicted to not be a Star Wars movie
B	The movie will be predicted to be a Star Wars movie
Explanation	We represent the discretized buckets few, some and many as 0, 1 and 2 respectively. Let represent the number of occurrences for the prop at position j in the ith movie. Let y=0 be the case that the movie was not a Star Wars movie and y=1 the case where it was. We estimate the parameters of the multinomial distribution, as done in the slides of lecture 7. This can be done because we assume that is the same for all j, as stated in the description. Thus we end up with the following solution, where m is the number of observations, n is the number of props in each observation and k is a specific bucket: Given that m=8 and n=5 we can calculate all parameters: To classify our input vector, x, we maximize over y: We use logarithms to prevent underflow: Thus we predict the movie to be a Star Wars movie.
Tags	lecture_7
Author	nnoej10 (nnoej10)
Avg Rating	5.0000
Avg Difficulty	1.0000
Total ratings	1
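
The training matrices and the input vector of this question did not survive the export, so the procedure can only be illustrated on made-up data. The sketch below assumes 8 hypothetical movies with 5 props each (matching m=8 and n=5), uses plain frequency estimates without smoothing, and compares log scores as the explanation describes. All data values and function names are assumptions for illustration only.

```python
import math

# HYPOTHETICAL training data (the original matrices are not in the export).
# Each row is a movie, each entry a bucket: 0 = few, 1 = some, 2 = many.
NOT_STAR_WARS = [[0, 0, 1, 0, 2], [0, 1, 0, 0, 0], [1, 0, 0, 2, 0], [0, 0, 1, 0, 1]]
STAR_WARS = [[2, 2, 1, 2, 0], [1, 2, 2, 2, 1], [2, 1, 2, 0, 2], [2, 2, 2, 1, 2]]


def fit_bucket_probs(movies, num_buckets=3):
    """Estimate p(bucket = k | class), shared across all prop positions."""
    counts = [0] * num_buckets
    for movie in movies:
        for bucket in movie:
            counts[bucket] += 1
    total = sum(counts)
    # No smoothing: every bucket must occur at least once in each class.
    return [count / total for count in counts]


def log_score(x, bucket_probs, class_prior):
    """log p(y) + sum_j log p(x_j | y); logs are used to prevent underflow."""
    return math.log(class_prior) + sum(math.log(bucket_probs[b]) for b in x)


def predict(x):
    """Return 1 if x is predicted to be a Star Wars movie, otherwise 0."""
    m = len(NOT_STAR_WARS) + len(STAR_WARS)
    scores = {
        0: log_score(x, fit_bucket_probs(NOT_STAR_WARS), len(NOT_STAR_WARS) / m),
        1: log_score(x, fit_bucket_probs(STAR_WARS), len(STAR_WARS) / m),
    }
    return max(scores, key=scores.get)


x_new = [2, 1, 2, 2, 0]  # hypothetical input vector
prediction = predict(x_new)
print(prediction)
```

With this made-up data the "many" bucket dominates the Star Wars class, so an input with mostly "many" entries is assigned to y=1. The numbers do not reproduce the parameters of the original question.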

Field	Value
ID	530518
Created	2013-03-18 04:53:36
Question	Consider an electron emitter which emits electrons with some interarrival time between each emission. Suppose the interarrival times are independent and identically exponentially distributed. The rate depends on a constant and on the temperature of the cathode which is determined by the electric current : Suppose the relationship is . We want to learn the constant using the Bayesian approach, so we assume a conjugate prior. The exponential distribution is a special case of the gamma distribution, so the gamma distribution will work fine. Thus we define the prior: Suppose, based on previous work, you choose the parameters to be . You now observe 5 interarrival times at different currents: Derive the posterior. What is the expected value of given the new observations?
A	41.8798
B	43.0205
C	42.0000
D	43.0384
E	41.0273
Explanation	The posterior is given by: Since the observations are independent and identically exponentially distributed we have: Thus we have: From this we see that the posterior is a new gamma distribution given by: The expected value of a gamma distributed variable is simply so in this case we have:
Tags	lecture_2
Author	troelsmn (trnie09)
Avg Rating	4.6000
Avg Difficulty	1.8000
Total ratings	5

Field	Value
ID	523866
Created	2013-03-13 03:35:34
Question	Suppose we have the Bernoulli distribution and are iid with Your task is now to derive the maximum likelihood estimate of
A
B
C
D
E
Explanation	We have the likelihood function Taking the natural logarithm of the likelihood function yields Now, taking the derivative To find the maximum of it, we set it equal to zero
Author	acarbalacar (daand09)
Avg Rating	3.0000
Avg Difficulty	0.2000
Total ratings	5
Comment 1	This is at the limit of what would be defined as a too easy question... Do not expect something as easy at the exam. (by: marco* [marco]*)
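
The formulas of this question were lost in the export, but the standard result is that the maximum likelihood estimate of a Bernoulli parameter is the sample mean. The sketch below checks that numerically by comparing the sample mean with a grid search over the log-likelihood. The symbol theta, the sample values and the function names are assumptions for illustration.

```python
import math


def bernoulli_log_likelihood(theta, xs):
    """Sum over i of x_i * log(theta) + (1 - x_i) * log(1 - theta)."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)


def bernoulli_mle(xs):
    """Closed-form maximizer of the log-likelihood: the sample mean."""
    return sum(xs) / len(xs)


sample = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical iid Bernoulli observations
theta_hat = bernoulli_mle(sample)

# Brute-force check on a grid over the open interval (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, sample))
print(theta_hat, theta_grid)
```

The grid search is only a sanity check; setting the derivative of the log-likelihood to zero, as the explanation does, gives the same value analytically.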

Field	Value
ID	522394
Created	2013-03-12 00:48:49
Question	Let be a Poisson distribution with parameter and . In this exercise we know that either or . In Bayesian inference, the parameter is treated as a random variable . We are given that and for the subjective prior probabilities of the two possible values. Now suppose that we are given a random sample of size n=2, with observations and . What are the posterior probabilities of and given the data? (Note that it might be unrealistic that can only take one of the two values, instead of a )
A
B
C
D
E
Explanation	So the posterior probability of was smaller than the prior probability. Similarly, the posterior probability of was greater than the corresponding prior. So the observations seemed to favor , which agrees with our intuition, since for a Poisson distribution the expected value is , which here should be close to the mean of the observations .
Tags	lecture_2, lecture_12
Author	valdemar (chha309)
Avg Rating	4.2000
Avg Difficulty	1.2000
Total ratings	5
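
The concrete parameter values, priors and observations of this question are missing from the export. The sketch below therefore uses made-up numbers to show the mechanics of a two-point prior on a Poisson parameter: multiply each prior probability by the likelihood of the sample and normalize. The candidate values 1.0 and 1.5, the prior 0.4/0.6 and the observations 2 and 3 are all hypothetical.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass function p(x | lam)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def discrete_posterior(prior, observations):
    """Posterior over a finite set of candidate parameter values.

    prior maps each candidate lam to its prior probability.
    """
    unnormalized = {
        lam: p * math.prod(poisson_pmf(x, lam) for x in observations)
        for lam, p in prior.items()
    }
    evidence = sum(unnormalized.values())
    return {lam: u / evidence for lam, u in unnormalized.items()}


prior = {1.0: 0.4, 1.5: 0.6}  # hypothetical prior probabilities
observations = [2, 3]         # hypothetical sample of size n = 2
posterior = discrete_posterior(prior, observations)
print(posterior)
```

In this made-up example the sample mean is 2.5, so the larger candidate value gains probability mass, which mirrors the reasoning in the explanation.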

Field	Value
ID	523937
Created	2013-03-11 16:44:43
Question	The figure below shows 4 different graphical models. Each node in these is a binary variable and the maximal width of the networks is given by an even number M (so each row with 3 "dots" contains M nodes). Select the answer below which does not correspond to the number of parameters needed to describe one of the models above.
A
B
C
D
E
Explanation	For each node the number of needed parameters is given by the number of parameters needed to describe the node itself times the number of different states its parent nodes can take. If one counts from the top row and down, and from left to right, then the number of parameters N for each of the models can be found as:
Tags	lecture_11
Author	ggbn (glnie07)
Avg Rating	3.4000
Avg Difficulty	0.4000
Total ratings	5
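
The counting rule in the explanation can be written as a small function. Since the figure is not part of the export, the two graphs below are hypothetical three-node examples, not the models from the question. For binary variables each node needs one parameter per joint state of its parents, i.e. 2 to the power of the number of parents.

```python
def count_parameters(parents, num_states=2):
    """Number of free parameters of a discrete Bayesian network.

    parents maps each node to the list of its parent nodes. Every node
    needs (num_states - 1) parameters for each joint state of its parents.
    """
    total = 0
    for node_parents in parents.values():
        total += (num_states - 1) * num_states ** len(node_parents)
    return total


# HYPOTHETICAL graphs, for illustration only.
chain = {"a": [], "b": ["a"], "c": ["b"]}                  # a -> b -> c
fully_connected = {"a": [], "b": ["a"], "c": ["a", "b"]}   # a -> b, a -> c, b -> c

print(count_parameters(chain))            # 1 + 2 + 2 = 5
print(count_parameters(fully_connected))  # 1 + 2 + 4 = 7
```

The fully connected graph reaches 2^3 - 1 = 7 parameters, the size of an unrestricted joint distribution over three binary variables.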

Field	Value
ID	520454
Created	2013-03-09 02:06:07
Question	You are given the following training set, X, containing six inputs of a single feature, along with the corresponding observed outputs, Y: X = (10,3,1,8,4,9)^T Y = (9,4,2,6,3,5)^T Your task is to predict the outcome, $\hat{y}$, of a new input, $\hat{x} = 6$, using the locally weighted linear regression model with a bandwidth of $\tau = 2$. This entails finding the parameter $\theta$ (in this case a scalar) that minimizes the weighted squared error, and computing $\hat{y} = \theta\hat{x}$. The weight function to be used in the case of single feature inputs is: $w_i = \exp\left(-\frac{(x_i-\hat{x})^2}{2\tau^2}\right)$ Answer which one of the following statements is false.
A	$\sum_iw_i(y_i-\theta x_i)^2 \approx 40.75\theta^2 - 59.47\theta + 22.37$
B	$\hat{y} \approx 4.41$
C	$\sum_iw_i \approx 2.04$
D	$\theta \approx 0.74$
Explanation	Calculation of the weights yields the following vector, W: W = (0.135.., 0.324.., 0.043.., 0.606.., 0.606.., 0.324..)^T $\sum_iw_i \approx 2.04$ I rewrite the expression $\sum_iw_i(y_i-\theta x_i)^2$ in the following manner: First, $(y_i-\theta x_i)^2$ can be expressed in the form $a\theta^2 + b\theta + c$, where $a=x_i^{2}$ $b=-2x_iy_i$ $c=y_i^{2}$ I can thus rewrite the sum as: $\sum_i \left( w_ix_i^2\theta^2 - 2w_ix_iy_i\theta + w_iy_i^2 \right)$ With concrete values this becomes: $91.319\theta^2 - 134.33\theta + 51.743$ To minimize with respect to $\theta$ I compute the derivative and set it equal to 0: $\frac{d}{d\theta}\left(91.319\theta^2 - 134.33\theta + 51.743\right) = 0$ <=> $182.638\theta - 134.33 = 0$ <=> $\theta = \frac{134.33}{182.638} \approx 0.7355$ The prediction for $\hat{x} = 6$ can now be computed: $\hat{y} \approx 0.7355 \cdot 6 = 4.413$
Tags	lecture_2
Author	tvh10 (tomha10)
Avg Rating	4.2000
Avg Difficulty	1.0000
Total ratings	5
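
The numbers in the explanation can be reproduced with a few lines of plain Python. This is a minimal sketch for the single-feature model without intercept used in the question; the function name lwr_predict and its return layout are illustrative choices.

```python
import math


def lwr_predict(xs, ys, x_query, tau):
    """Locally weighted linear regression for y = theta * x (no intercept).

    Returns the weights, the quadratic coefficients (a, b, c) of the
    weighted error a*theta^2 + b*theta + c, the minimizer theta and the
    prediction theta * x_query.
    """
    weights = [math.exp(-((x - x_query) ** 2) / (2 * tau ** 2)) for x in xs]
    a = sum(w * x * x for w, x in zip(weights, xs))
    b = -2 * sum(w * x * y for w, x, y in zip(weights, xs, ys))
    c = sum(w * y * y for w, y in zip(weights, ys))
    theta = -b / (2 * a)  # root of the derivative 2*a*theta + b = 0
    return weights, (a, b, c), theta, theta * x_query


xs = [10, 3, 1, 8, 4, 9]
ys = [9, 4, 2, 6, 3, 5]
weights, (a, b, c), theta, y_hat = lwr_predict(xs, ys, x_query=6, tau=2)
print(round(sum(weights), 2), round(a, 3), round(b, 2), round(c, 3))
print(round(theta, 4), round(y_hat, 3))
```

The coefficients come out near 91.319, -134.33 and 51.743, which is why statement A with coefficients 40.75, -59.47 and 22.37 is the false one.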

Field	Value
ID	518688
Created	2013-03-08 11:01:13
Question	Suppose you have been studying a dripping water tap. It turns out that the time intervals between drops are independently and identically distributed according to the distribution: From the study you found out that the parameter is distributed according to: where the parameters are . You now observe the next 5 interval times: Derive the posterior distribution. What is the expected given ?
A	1.870726
B	1.582982
C	2.419451
D	1.818182
E	2.499716
Explanation	The posterior must again be a gamma distribution and satisfies the proportionality: Since the observations are independent and identically exponentially distributed we have: Thus we have: From this we see that the posterior is a new gamma distribution given by: The expected value of a gamma distributed variable is simply $\alpha / \beta$ so in this case we have:
Tags	lecture_2
Author	troelsmn (trnie09)
Avg Rating	4.5000
Avg Difficulty	1.7500
Total ratings	4
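
The prior parameters and the observed intervals are missing from the export, so the update can only be sketched on made-up numbers. The sketch assumes exponentially distributed intervals with rate lambda and a gamma prior in the shape/rate parametrization (mean alpha / beta), as the explanation indicates. Under those assumptions the posterior is Gamma(alpha + n, beta + sum of the observations). All numeric values below are hypothetical.

```python
def gamma_exponential_posterior(alpha, beta, observations):
    """Conjugate update for an exponential likelihood with a gamma prior.

    Prior: lambda ~ Gamma(alpha, beta) with mean alpha / beta.
    Likelihood: x_i ~ Exponential(lambda), independent.
    Returns the posterior shape and rate.
    """
    return alpha + len(observations), beta + sum(observations)


alpha0, beta0 = 2.0, 1.0                 # hypothetical prior parameters
intervals = [0.5, 0.25, 0.75, 0.5, 1.0]  # hypothetical interval times

alpha_n, beta_n = gamma_exponential_posterior(alpha0, beta0, intervals)
posterior_mean = alpha_n / beta_n
print(alpha_n, beta_n, posterior_mean)
```

With these made-up values the posterior is Gamma(7, 4) with mean 1.75; the answer options of the question depend on the original values and are not reproduced here.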

Field	Value
ID	517698
Created	2013-03-08 04:46:57
Question	Given the Poisson distribution: which of the following statements is the conclusion of the proof showing that the distribution is a member of the exponential family of distributions?
A
B
C
Explanation	= = = =
Tags	lecture_4
Author	nnoej10 (nnoej10)
Avg Rating	3.5000
Avg Difficulty	0.7500
Total ratings	4
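
The steps of the proof were lost in the export. For reference, the standard derivation is sketched below. It assumes the exponential family notation $p(y;\eta) = b(y)\exp(\eta\,T(y) - a(\eta))$ used in Andrew Ng's lecture notes; the notation in the original answer options may differ.

```latex
\begin{align*}
p(y;\lambda) &= \frac{\lambda^{y} e^{-\lambda}}{y!} \\
             &= \frac{1}{y!}\exp\left(\log \lambda^{y} - \lambda\right) \\
             &= \frac{1}{y!}\exp\left(y \log\lambda - \lambda\right) \\
             &= b(y)\exp\left(\eta\,T(y) - a(\eta)\right)
\end{align*}
% Identified components (assumed notation):
%   b(y) = 1/y!,   eta = log(lambda),   T(y) = y,   a(eta) = e^{eta} = lambda
```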

Field	Value
ID	515538
Created	2013-03-07 02:57:44
Question	What is the difference between generative and discriminative classifier models? In general, let x be the input parameters and y be the class.
A	Bayes' theorem represents an example of discriminative modeling, whereas in the generative approach we maximize the likelihood function for the conditional distribution p(y|x).
B	Generative classifiers learn a model of the joint probability p(x,y), and make their predictions by using Bayes' rule to calculate p(y|x), and then choosing the most likely class y. Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels.
C	Only discriminative training is used in supervised learning, since the generative model can only learn from p(x|y) and thus cannot tell us anything useful about the posterior.
D	The generative and discriminative classifier models learn the same way from data and predict the same way, and differ only in that, in the discriminative model, the input is considered to be y (the class) and the output to be x (the parameters), contrary to the generative model.
Explanation	See the notes from Andrew Ng, Part IV, book B1 page 204, or the slides from lecture 7.
Tags	lecture_7
Author	valdemar (chha309)
Avg Rating	3.5000
Avg Difficulty	0.5000
Total ratings	4

Field	Value
ID	515518
Created	2013-03-07 02:20:40
Question	Consider the following categorical data and assume that a logistic regression is appropriate.
Obs  Var1   Var2  Var3   Var4   Var5    Var6   Var7  Var8  Group
1    25.53  300   0.001  0      -0.164  0      1.65  0.36  0
2    12.98  387   0.786  0.34   0.600   0.002  1.15  0.60  0
3    29.27  182   -0.08  -0.2   -0.386  0.175  0.45  0.04  0
4    23.67  367   0.001  0      0       0      0.56  0.97  0
.....
27   21.54  312   0.651  0.834  -0.084  0      1.04  0.83  1
28   17.45  242   1.337  0.060  0.724   0.38   0.89  0.86  1
29   19.45  140   0.453  0      -0.194  0      1.23  1.66  1
30   24.94  303   0.541  0.484  0.534   0      0.86  0.87  1
The data are fictitious and do not represent anything. As can be seen, we have 30 observations and 8 variables. You are to consider how many of these variables should be included in the model and why. (You should not consider exactly which variables you would include, just the number of variables.)
A	All 8 variables should be used, because more variables will always give a more accurate model, and when we have as few data points as we do in this case, the model construction will be fast no matter what, so there is no need to consider simplified models.
B	With only 30 data points one should use 3-6 variables, since having fewer variables than 1/10 of the number of data points will make it hard to get a good fit, and having more than 1/5 gives rise to the risk of overfitting. To decide which variables to include, one should check the p-values of the tests for whether each parameter estimate is different from zero.
C	As few variables as possible should be chosen. It should be 1 or 2, which is still verified by the p-values as in answer B. With only 1 or 2 variables it will also be possible to visualize the grouping structure produced by the model, by simply making a 2- or 3-dimensional plot, so one should always consider not using more than 2 variables.
Explanation	Having too many variables will often result in overfitting, which is illustrated in the slides of lecture 2. Essentially, the problem is that there is an infinite number of ways in which the model can choose the parameters so that it still covers the points in the data. If this occurs the model will only describe the given data, but will likely not be usable for other data, since it does not describe the behavior of the data in general. This corresponds to the warning message in R saying: Warning message: glm.fit: fitted probabilities numerically 0 or 1 occurred
Tags	lecture_2
Author	acarbalacar (daand09)
Avg Rating	4.0000
Avg Difficulty	0.5000
Total ratings	4

Field	Value
ID	513131
Created	2013-03-05 10:34:53
Question	Consider the following neural network. The weights, including bias, are defined as follows. Furthermore, the activation function at the hidden layer, as well as the activation function at the output layer, are given by the logistic sigmoid function; Your task is to use forward propagation to calculate the estimate that the neural network produces on the following input.
A	0.003309
B	0.213426
C	0.013655
D	0.748314
E	0.508049
Explanation	The formula for forward propagation is given as follows. , where M = D = 2 and f is the logistic sigmoid function in our case. We evaluate the sum with our weights; Introducing the input yields the following; f was the logistic sigmoid function;
Tags	lecture_5
Author	larsgmathiasen (lamat10)
Avg Rating	3.5000
Avg Difficulty	1.0000
Total ratings	6
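
The weights and the input of this question were lost in the export. The sketch below shows forward propagation for a network of the described shape (D = 2 inputs, M = 2 hidden units, one output, logistic sigmoid at both layers) using made-up weights. Every numeric value and the function names are hypothetical, so the result does not correspond to any of the answer options.

```python
import math


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass through a single-hidden-layer network.

    hidden_weights[j][i] connects input i to hidden unit j.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * z for w, z in zip(output_weights, hidden)) + output_bias)


# HYPOTHETICAL weights and input, for illustration only.
x = [1.0, -1.0]
hidden_weights = [[0.5, -0.5], [1.0, 1.0]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]
output_bias = 0.0

y_hat = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
print(y_hat)
```

Because the output unit is a sigmoid, the estimate always lies strictly between 0 and 1, which is consistent with the range of the answer options.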

Field	Value
ID	511287
Created	2013-03-03 12:54:57
Question	Suppose you have implemented your favorite learning algorithm, namely logistic regression, to solve a given learning problem. Unfortunately, you are getting an intolerable test error with the parameters learned from your training data. Analysis of the situation showed that it is a problem of either high bias or high variance. State, for each of the following approaches whether it will solve high bias, high variance or both. a) Acquire more training data b) Reduce the set of features c) Increase the set of features
A	a) Solves both high bias and high variance b) Solves high variance c) Solves high bias
B	a) Solves high bias b) Solves high bias c) Solves high variance
C	a) Solves high variance b) Solves high variance c) Solves both high bias and high variance
D	a) Solves high bias b) Solves high variance c) Solves high bias
E	a) Solves high variance b) Solves high variance c) Solves high bias
Explanation	If the bias is high, then we are consistently learning the same wrong thing, regardless of the amount of training data we have, e.g. trying to fit a linear function to quadratic data (underfitting). This can often be seen if high training error is observed together with high test error. Thus, the only remedy is increasing the set of features; more training examples will not help and reducing the set of features will definitely not help. If the variance is high, then we are overfitting the data, i.e. fitting on a training set that is too small to reflect the true pattern of the data, e.g. fitting a 9th-order polynomial to 10 training points. This can be seen if the training error is low but the test error is high. The conclusion is that we have too many features, so we should reduce the set of features and not increase it. Furthermore, we could solve the high variance by acquiring more training examples to get rid of the overfitting.
Tags	lecture_10
Author	larsgmathiasen (lamat10)
Avg Rating	3.4000
Avg Difficulty	0.6000
Total ratings	5