DM825-2013

PeerWise question export

Export performed by marco at 1:14pm, 01 May 2013.
Exporting all questions in chronological order (most recent first).

ID: 530770
Created: 2013-03-18 11:28:26
Question

Consider the graphical model in the figure. Each node represents a Gaussian variable. The mean of each variable is assumed to depend linearly on its parent variables, so the conditional distributions can be written as:

$$p(x_i \mid \mathrm{pa}_i) = \mathcal{N}\!\Big(x_i \;\Big|\; \sum_{j \in \mathrm{pa}_i} w_{ij} x_j + b_i,\; v_i\Big)$$

where $w_{ij}$ and $b_i$ describe the linear connection and $v_i$ the variance.

[Figure of the graphical model omitted from the export.]

Which of the following statements about the covariance matrix of the Gaussian describing the joint distribution of the variables is not true?

A

*B*

C

D

Explanation

The mean and covariance matrix can be found by using (8.15) and (8.16) from Bishop recursively:

$$\mathbb{E}[x_i] = \sum_{j \in \mathrm{pa}_i} w_{ij}\,\mathbb{E}[x_j] + b_i, \qquad \mathrm{cov}[x_i, x_j] = \sum_{k \in \mathrm{pa}_j} w_{jk}\,\mathrm{cov}[x_i, x_k] + I_{ij} v_j$$

The rest of the components follow by symmetry.

Augmented explanation 1

Note that each computation of $\mathrm{cov}[x_i, x_j]$ with $j < i$ will be converted to $\mathrm{cov}[x_j, x_i]$ by symmetry, so that the recursion always expands the parents of the higher-numbered node.

This corresponds to the cryptic explanation in the book "...and so the covariance can similarly be evaluated recursively starting from the lowest numbered node."

(by: larsgmathiasen [lamat10])
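To make the recursion concrete, below is a small Python sketch of how (8.15) and (8.16) can be applied mechanically. Since the actual network is in the omitted figure, the structure, weights, biases and variances used here are hypothetical placeholders.

import numpy as np

# Hypothetical linear-Gaussian network standing in for the one in the figure.
# Nodes are numbered in topological order (parents before children).
pa = {0: [], 1: [0], 2: [0, 1]}                  # pa[i]: parents of node i
w  = {0: {}, 1: {0: 0.5}, 2: {0: 1.0, 1: -0.3}}  # w[i][j]: weight of parent j
b  = {0: 1.0, 1: 0.0, 2: 2.0}                    # b[i]: bias of node i
v  = {0: 1.0, 1: 0.5, 2: 2.0}                    # v[i]: conditional variance

n = len(pa)
mu  = np.zeros(n)
cov = np.zeros((n, n))

for i in range(n):
    # Mean recursion (8.15): E[x_i] = sum_j w_ij E[x_j] + b_i
    mu[i] = b[i] + sum(w[i][j] * mu[j] for j in pa[i])
    # Covariance recursion (8.16), starting from the lowest numbered node:
    # cov[x_k, x_i] = sum_j w_ij cov[x_k, x_j] + I_{ki} v_i
    for k in range(i + 1):
        c = sum(w[i][j] * cov[k, j] for j in pa[i])
        if k == i:
            c += v[i]
        cov[k, i] = cov[i, k] = c                # the rest by symmetry

print(mu)
print(cov)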
Tags: lecture_11
Author: ggbn (glnie07)
Avg Rating: 3.3300
Avg Difficulty: 2.0000
Total ratings: 3

ID: 530853
Created: 2013-03-18 08:57:06
Question

Consider the following training data, with input vectors of four features and an output that can assume either the classification A or B.

 

x1      x2      x3      x4      y

---------------------------------

3.99   -1.65    6.52    2.78    B
4.45    1.68    7.54    1.59    B
1.61    5.44    3.79    7.51    A
1.61    4.83    2.92    9.50    A
2.74    0.66    5.17    4.12    B
4.26   -0.20    5.43    1.11    B
3.30    4.85    3.95    9.23    A
3.34    0.77    5.73    2.21    B
2.04    5.70    3.97    7.55    A


You are given a new input vector:

$x = (3, 3, 4, 5)^T$

Your task is to calculate the probabilities of $x$ belonging to the two classes respectively, i.e.

$P(Y = A \mid X = x)$ and $P(Y = B \mid X = x)$,

using Gaussian discriminant analysis.
In order to simplify the calculations, you should assume that the covariance matrix, $\Sigma$, is the 4x4 identity matrix.

A

*B*

C

D

E

Explanation

First, the mean vectors, $\mu_A$ and $\mu_B$, are calculated. This is done by using the formula on page 8 of the slides from lecture 7. I get the following vectors:

$\mu_A = (2.14,\; 5.205,\; 3.6575,\; 8.4475)^T$

$\mu_B = (3.756,\; 0.252,\; 6.078,\; 2.362)^T$

The probabilities P(Y=A) and P(Y=B) are calculated as the relative frequencies of these classes in the training data:

$P(Y = A) = 4/9$

$P(Y = B) = 5/9$

I now need to calculate the posterior probability as:

$$P(Y = A \mid X = x) = \frac{P(X = x \mid Y = A)\,P(Y = A)}{P(X = x \mid Y = A)\,P(Y = A) + P(X = x \mid Y = B)\,P(Y = B)}$$

Since I use Gaussian discriminant analysis with the covariance matrix given as the identity matrix, I have that $P(X = x \mid Y = c) = \mathcal{N}(x \mid \mu_c, \Sigma)$ with $\Sigma = I$.

Given that the covariance matrix is the 4x4 identity matrix, I can use the two facts $|\Sigma| = 1$ and $\Sigma^{-1} = I$ to write the formula as:

$$P(X = x \mid Y = c) = \frac{1}{(2\pi)^{k/2}} \exp\!\Big(-\tfrac{1}{2}\,\lVert x - \mu_c \rVert^2\Big)$$

where k is the number of features (in this case k = 4).

The same calculation is done for $P(X = x \mid Y = B)$, by using $\mu_B$ instead of $\mu_A$.

 

I thus get the posterior as:

 

Inserting my values I get the results:

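Since the final numbers did not survive the export, here is a minimal numerical check of the whole calculation in Python, using only the training table and the formulas above (the $(2\pi)^{-k/2}$ factor is common to both classes and cancels in the posterior):

import numpy as np

# Training data from the question: 9 inputs with 4 features, classes A/B.
X = np.array([[3.99, -1.65, 6.52, 2.78],
              [4.45,  1.68, 7.54, 1.59],
              [1.61,  5.44, 3.79, 7.51],
              [1.61,  4.83, 2.92, 9.50],
              [2.74,  0.66, 5.17, 4.12],
              [4.26, -0.20, 5.43, 1.11],
              [3.30,  4.85, 3.95, 9.23],
              [3.34,  0.77, 5.73, 2.21],
              [2.04,  5.70, 3.97, 7.55]])
y = np.array(list("BBAABBABA"))
x_new = np.array([3.0, 3.0, 4.0, 5.0])

mu = {c: X[y == c].mean(axis=0) for c in "AB"}   # class means
prior = {c: np.mean(y == c) for c in "AB"}       # 4/9 and 5/9

# With Sigma = I only the squared distance to the class mean matters.
score = {c: prior[c] * np.exp(-0.5 * np.sum((x_new - mu[c])**2)) for c in "AB"}
Z = sum(score.values())
for c in "AB":
    print("P(Y=%s | x) = %.4f" % (c, score[c] / Z))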
Tags: lecture_7
Author: tvh10 (tomha10)
Avg Rating: 4.5000
Avg Difficulty: 1.0000
Total ratings: 4

ID: 534876
Created: 2013-03-18 07:33:06
Question

Suppose you want to predict whether the movie you are currently watching is a Star Wars movie, using a multinomial event model. To this end you have classified a number of previously watched movies as Star Wars movies or not, based on the number of Star Wars related props used in each movie. You choose to represent each movie as a vector, where each prop used is discretized into one of three buckets (few, some, or many) depending on how many times the prop occurs in the movie:

You assume that the probability for a prop to be discretized to a bucket, k, given that the movie is a Star Wars movie, is the same for all props.

 

Your previously watched movies are used as training data, represented below as two matrices where each row represents a movie:

       

Given that the movie you are currently watching has the input vector:

which classification will be predicted for the movie?

A

The movie will be predicted not to be a Star Wars movie

*B*

The movie will be predicted to be a Star Wars movie

Explanation

We represent the discretized buckets few, some and many as 0, 1 and 2 respectively.

Let $x_j^{(i)}$ represent the (discretized) number of occurrences of the prop at position j in the ith movie. Let y=0 be the case that the movie was not a Star Wars movie and y=1 the case where it was. We estimate the parameters of the multinomial distribution as done in the slides of lecture 7. This can be done because we assume that $p(x_j = k \mid y)$ is the same for all j, as stated in the description. Thus we end up with the following solution, where m is the number of observations, n is the number of props in each observation and k is a specific bucket:

Given that m=8 and n=5 we can calculate all parameters:

 

To predict our input vector, x, we maximize over y:

We use logarithms to prevent underflow:

Thus we predict the movie to be a Star Wars movie.

 

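The training matrices were lost in the export, so the following Python sketch uses made-up bucket matrices of the stated shape (8 movies, 5 props) purely to illustrate the estimation and the log-space prediction; Laplace smoothing is added so that no bucket gets probability zero.

import numpy as np

# Hypothetical training data: 8 movies (rows) x 5 props (columns),
# buckets few/some/many encoded as 0/1/2.
X = np.array([[2, 1, 2, 2, 1],    # rows 0-3: Star Wars movies (y = 1)
              [1, 2, 2, 1, 2],
              [2, 2, 1, 2, 2],
              [1, 1, 2, 2, 1],
              [0, 0, 1, 0, 0],    # rows 4-7: other movies (y = 0)
              [0, 1, 0, 0, 1],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 1, 0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
x_new = np.array([2, 1, 2, 1, 2])   # the movie being watched (hypothetical)

# Bucket probabilities shared over all props, with Laplace smoothing.
phi = {}
for c in (0, 1):
    rows = X[y == c]
    for k in (0, 1, 2):
        phi[(k, c)] = (np.sum(rows == k) + 1) / (rows.size + 3)

prior = {c: np.mean(y == c) for c in (0, 1)}

# Predict in log space to prevent underflow, as in the explanation.
def log_post(c):
    return np.log(prior[c]) + sum(np.log(phi[(k, c)]) for k in x_new)

print("Star Wars" if log_post(1) > log_post(0) else "not Star Wars")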
Tags: lecture_7
Author: nnoej10 (nnoej10)
Avg Rating: 5.0000
Avg Difficulty: 1.0000
Total ratings: 1

ID: 530518
Created: 2013-03-18 04:53:36
Question

Consider an electron emitter which emits electrons with some interarrival time between emissions. Suppose the interarrival times are independent and identically exponentially distributed. The rate depends on a constant and on the temperature of the cathode, which is determined by the electric current, I:

 

   

 

Suppose the relationship is .

 

We want to learn the constant using the Bayesian approach, so we assume a conjugate prior. The gamma distribution is conjugate to the exponential likelihood, so it will work fine. Thus we define the prior:

 

   

 

Suppose, based on previous work, you choose the parameters to be  .

 

You now observe 5 interarrival times at different currents:

 

     

 

Derive the posterior. What is the expected value of the constant given the new observations?

*A*

41.8798

B

43.0205

C

42.0000

D

43.0384

E

41.0273

Explanation

The posterior is given by:

 

   

 

Since the observations are independent and identically exponentially distributed, we have:

 

   

 

Thus we have:

 

   

    

 

From this we see that the posterior is a new gamma distribution given by:

 

   

 

The expected value of a gamma distributed variable is simply the ratio of its shape and rate parameters, so in this case we have:

 

   

 

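As a minimal sketch of the update in Python, with hypothetical numbers, since the prior parameters, the current/rate relationship and the observations did not survive the export. Assume the rate is $\lambda_i = c \cdot h(I_i)$ for a known function h, with a Gamma(a, b) prior on the constant c:

import numpy as np

a, b = 2.0, 0.05                             # hypothetical prior Gamma(a, b)
h = lambda I: I**2                           # hypothetical current -> rate factor
I = np.array([1.0, 1.5, 2.0, 2.5, 3.0])      # currents (hypothetical)
t = np.array([0.8, 0.3, 0.2, 0.1, 0.05])     # interarrival times (hypothetical)

# Likelihood: prod_i c*h(I_i)*exp(-c*h(I_i)*t_i), proportional (in c) to
# c^n * exp(-c * sum_i h(I_i)*t_i), so the Gamma prior updates to:
a_post = a + len(t)
b_post = b + np.sum(h(I) * t)
print(a_post, b_post, a_post / b_post)       # posterior and E[c | data]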
Tags: lecture_2
Author: troelsmn (trnie09)
Avg Rating: 4.6000
Avg Difficulty: 1.8000
Total ratings: 5

ID: 523866
Created: 2013-03-13 03:35:34
Question

Suppose we have the Bernoulli distribution and $X_1, \dots, X_n$ are iid with

$$P(X_i = x_i \mid \theta) = \theta^{x_i} (1 - \theta)^{1 - x_i}, \qquad x_i \in \{0, 1\}.$$

Your task is now to derive the maximum likelihood estimate of $\theta$.

A

*B*

C

D

E

Explanation

We have the likelihood function

$$L(\theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_i x_i} (1 - \theta)^{\,n - \sum_i x_i}.$$

Taking the natural logarithm of the likelihood function yields

$$\ell(\theta) = \Big(\sum_i x_i\Big) \ln \theta + \Big(n - \sum_i x_i\Big) \ln(1 - \theta).$$

Now, taking the derivative,

$$\ell'(\theta) = \frac{\sum_i x_i}{\theta} - \frac{n - \sum_i x_i}{1 - \theta}.$$

To find the maximum, we set it equal to zero and solve, which gives

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

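As a quick symbolic check of the derivation, assuming s denotes the number of successes:

import sympy as sp

theta, n, s = sp.symbols('theta n s', positive=True)   # s = sum of the x_i

# Log-likelihood of n iid Bernoulli(theta) observations with s successes.
loglik = s * sp.log(theta) + (n - s) * sp.log(1 - theta)

# The stationary point of the log-likelihood is the MLE.
print(sp.solve(sp.Eq(sp.diff(loglik, theta), 0), theta))   # [s/n]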
Author: acarbalacar (daand09)
Avg Rating: 3.0000
Avg Difficulty: 0.2000
Total ratings: 5
Comment 1

This is at the limit of what would be defined as too easy a question... Do not expect something as easy as this at the exam. (by: marco [marco])

ID: 522394
Created: 2013-03-12 00:48:49
Question

Let

$$p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$$

be a Poisson distribution with parameter $\lambda$ and $x \in \{0, 1, 2, \dots\}$.

 

In this exercise we know that the parameter takes one of only two possible values, $\lambda_1$ or $\lambda_2$.
In Bayesian inference, the parameter is treated as a random variable.
We are given that

    and     

for the subjective prior probabilities of the two possible values.
Now suppose that we are given a random sample of size n=2, with observations $x_1$ and $x_2$.


What are the posterior probabilities of $\lambda_1$ and $\lambda_2$ given the data?

 

(note that it might be unrealistic that the parameter can only take one of two values instead of a continuous range)

A

B

C

*D*

E

Explanation


 

 

 

So the posterior probability of one of the values came out smaller than its prior probability, while the posterior probability of the other came out greater than the corresponding prior.
So the observations favored the latter value, which agrees with our intuition, since for a Poisson distribution the expected value is $\lambda$, which should be close to the mean of the observations.

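The actual rates, priors and observations were lost in the export, but the two-hypothesis update itself is mechanical; the following Python sketch uses hypothetical stand-in values:

from math import exp, factorial

# Hypothetical stand-ins: two candidate rates, their subjective priors,
# and the n = 2 observations.
lam = [2.0, 4.0]
prior = [0.7, 0.3]
obs = [3, 4]

def poisson_pmf(x, rate):
    return rate**x * exp(-rate) / factorial(x)

# Likelihood of the whole sample under each candidate rate.
lik = [poisson_pmf(obs[0], r) * poisson_pmf(obs[1], r) for r in lam]

# Bayes' rule over the two hypotheses.
evidence = sum(p * l for p, l in zip(prior, lik))
posterior = [p * l / evidence for p, l in zip(prior, lik)]
print(posterior)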
Tags: lecture_2, lecture_12
Author: valdemar (chha309)
Avg Rating: 4.2000
Avg Difficulty: 1.2000
Total ratings: 5

ID: 523937
Created: 2013-03-11 16:44:43
Question

The figure below shows 4 different graphical models. Each node in these is a binary variable, and the maximal width of the networks is given by an even number M (so each row with 3 "dots" contains M nodes).

 

[Figure with the four graphical models omitted from the export.]

 

Select the answer below which does not correspond to the number of parameters needed to describe one of the models above.

A

B

C

D

*E*

Explanation

For each node, the number of needed parameters is given by the number of parameters needed to describe the node itself, times the number of different states its parent nodes can take.
Counting from the top row down, and from left to right, the number of parameters N for each of the models can be found as:

 

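The counting rule can be made concrete in a few lines of Python; the structure below is a hypothetical stand-in for one of the models in the omitted figure:

# Each node of a discrete network needs (#states - 1) free parameters for
# every joint configuration of its parents.
states = {'a': 2, 'b': 2, 'c': 2, 'd': 2}             # all nodes binary
parents = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}

def n_params(node):
    configs = 1
    for p in parents[node]:
        configs *= states[p]                          # parent configurations
    return (states[node] - 1) * configs

print(sum(n_params(v) for v in states))               # 1 + 2 + 2 + 4 = 9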
Tags: lecture_11
Author: ggbn (glnie07)
Avg Rating: 3.4000
Avg Difficulty: 0.4000
Total ratings: 5

ID: 520454
Created: 2013-03-09 02:06:07
Question

You are given the following training set, X, containing six inputs of a single feature, along with the corresponding observed outputs, Y:

 

X = (10,3,1,8,4,9)T

Y = (9,4,2,6,3,5)T

 

Your task is to predict the outcome, $\hat{y}$, of a new input, $x = 6$, using the locally weighted linear regression model, with a bandwidth of $\tau = 2$. This entails minimizing the parameter $\theta$ (in this case a scalar), and computing $\hat{y} = \theta x$.
The weight function to be used in the case of single-feature inputs is:

$$w^{(i)} = \exp\!\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)$$

Answer which one of the following statements is false.

*A*

B

C

D

Explanation

Calculation of the weights yields the following vector, W:

 

W = (0.135.., 0.324.., 0.043.., 0.606.., 0.606.., 0.324..)T

 

I rewrite the expression to be minimized in the following manner:

Firstly, the cost function can be expressed in the following form, where

 

I can thus rewrite the sum as:

 

 

With concrete values this becomes:

 

 

To minimize with respect to $\theta$, I compute the derivative and set it equal to 0:

 

<=>

<=>

 

The prediction for $x = 6$ can now be computed:

 

 

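As a numerical check of the whole computation in Python: the query point x = 6 and bandwidth tau = 2 are the values consistent with the weight vector above, and the through-origin model y = theta*x is the one implied by the scalar theta.

import numpy as np

X = np.array([10, 3, 1, 8, 4, 9], dtype=float)
Y = np.array([9, 4, 2, 6, 3, 5], dtype=float)
x_query, tau = 6.0, 2.0

# Gaussian weights for a single-feature input.
w = np.exp(-(X - x_query)**2 / (2 * tau**2))
print(np.round(w, 3))          # ~ [0.135 0.325 0.044 0.607 0.607 0.325]

# Minimising sum_i w_i (y_i - theta*x_i)^2 and setting the derivative to
# zero gives theta = sum_i w_i x_i y_i / sum_i w_i x_i^2.
theta = np.sum(w * X * Y) / np.sum(w * X**2)
print(theta, theta * x_query)  # the scalar theta and the prediction y-hat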
Tags: lecture_2
Author: tvh10 (tomha10)
Avg Rating: 4.2000
Avg Difficulty: 1.0000
Total ratings: 5

ID: 518688
Created: 2013-03-08 11:01:13
Question

Suppose you have been studying a dripping water tap. It turns out that the time intervals between drops are independently and identically distributed according to the distribution:

 

      

 

From the study you found out that the parameter  is distributed according to:

 

      

 

where the parameters are  .

 

You now observe the next 5 interval times:

 

      

 

Derive the posterior distribution. What is the expected value of the parameter given the observed intervals?

A

1.870726

B

1.582982

C

2.419451

*D*

1.818182

E

2.499716

Explanation

The posterior must again be a gamma distribution and satisfies the proportionality:

 

      

 

Since the observations are independent and identically exponentially distributed, we have:

 

      

 

Thus we have:

 

      

 

From this we see that the posterior is a new gamma distribution given by:

 

      

 

 

The expected value of a gamma distributed variable is simply the ratio of its shape and rate parameters, so in this case we have:

 

      

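With hypothetical prior parameters and observations (the actual numbers did not survive the export), one can also verify the conjugate update numerically against a grid approximation of the posterior:

import numpy as np

a, b = 3.0, 1.0                             # hypothetical prior Gamma(a, b)
t = np.array([0.4, 0.7, 0.2, 0.9, 0.55])    # hypothetical interval times

# Closed form: the posterior is Gamma(a + n, b + sum(t)).
print("closed form:", (a + len(t)) / (b + t.sum()))

# Numerical check: evaluate prior * likelihood on a grid and normalise.
lam = np.linspace(1e-6, 50.0, 200001)
dlam = lam[1] - lam[0]
unnorm = lam**(a - 1) * np.exp(-b * lam) * lam**len(t) * np.exp(-lam * t.sum())
post = unnorm / (unnorm.sum() * dlam)
print("grid check: ", (lam * post).sum() * dlam)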
Tags: lecture_2
Author: troelsmn (trnie09)
Avg Rating: 4.5000
Avg Difficulty: 1.7500
Total ratings: 4

ID: 517698
Created: 2013-03-08 04:46:57
Question

Given the Poisson distribution:

$$p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x \in \{0, 1, 2, \dots\},$$

which of the following statements is the conclusion of the proof showing that the distribution is a member of the exponential family of distributions?

*A*

B

C

 

Explanation

$$p(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$$

$$= \frac{1}{x!}\exp\big(x \ln \lambda - \lambda\big)$$

$$= b(x)\exp\big(\eta\,T(x) - a(\eta)\big),$$

with $b(x) = 1/x!$, natural parameter $\eta = \ln \lambda$, sufficient statistic $T(x) = x$, and log-partition function $a(\eta) = e^{\eta} = \lambda$.

Tags: lecture_4
Author: nnoej10 (nnoej10)
Avg Rating: 3.5000
Avg Difficulty: 0.7500
Total ratings: 4

ID: 515538
Created: 2013-03-07 02:57:44
Question

What is the difference between generative and discriminative classifier models?

Generally, let x be the input parameters and y the class.

A

Bayes' theorem represents an example of discriminative modeling, whereas in the generative approach we maximize the likelihood function for the conditional distribution p(y|x).

*B*

Generative classifiers learn a model of the joint probability p(x,y), and make their predictions by using Bayes rules to calculate p(y|x), and then choosing the most likely class y. Discriminative classifiers model the posterior p(y|x) directly, or learn a direct map from inputs x to the class labels.

C

Only discriminative training is used in supervised learning, since the generative model can only learn from p(x|y) and thus cannot tell us anything useful about the posterior.

D

The generative and discriminative classifier models learn from data and predict in the same way; they differ only in that in the discriminative model y (the class) is considered the input and x (the parameters) the output, contrary to the generative model.

Explanation

See the notes from Andrew Ng, Part IV.

Or book B1, page 204.

Or the slides from lecture 7.

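As a minimal illustration of the distinction in option B, the Python sketch below uses made-up 1-D data and takes the generative route: fit p(x|y) and p(y), then obtain p(y|x) via Bayes' rule. A discriminative model would instead fit p(y|x) directly.

import numpy as np

# Made-up 1-D data with two classes, just to illustrate the two routes.
x = np.array([1.0, 1.2, 0.8, 3.1, 2.9, 3.3])
y = np.array([0, 0, 0, 1, 1, 1])

# Generative route: model p(x, y) = p(x | y) p(y) (here p(x | y) is taken
# to be Gaussian with unit variance), then get p(y | x) via Bayes' rule.
mu = [x[y == c].mean() for c in (0, 1)]
prior = [np.mean(y == c) for c in (0, 1)]

def p_class1_given(x_new):
    lik = [np.exp(-0.5 * (x_new - mu[c])**2) * prior[c] for c in (0, 1)]
    return lik[1] / (lik[0] + lik[1])

print(p_class1_given(2.0))
# A discriminative model (e.g. logistic regression) would instead fit
# p(y | x) = sigmoid(theta0 + theta1 * x) directly from the data.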
Tags: lecture_7
Author: valdemar (chha309)
Avg Rating: 3.5000
Avg Difficulty: 0.5000
Total ratings: 4

ID: 515518
Created: 2013-03-07 02:20:40
Question

Consider the following categorical data, and assume that a logistic regression model is appropriate.

Obs    Var1    Var2    Var3    Var4    Var5    Var6    Var7    Var8    Group
  1    25.53   300     0.001   0      -0.164   0       1.65    0.36    0
  2    12.98   387     0.786   0.34    0.600   0.002   1.15    0.60    0
  3    29.27   182    -0.08   -0.2    -0.386   0.175   0.45    0.04    0
  4    23.67   367     0.001   0       0       0       0.56    0.97    0
...
 27    21.54   312     0.651   0.834  -0.084   0       1.04    0.83    1
 28    17.45   242     1.337   0.060   0.724   0.38    0.89    0.86    1
 29    19.45   140     0.453   0      -0.194   0       1.23    1.66    1
 30    24.94   303     0.541   0.484   0.534   0       0.86    0.87    1

 

The data are fictitious and do not represent anything real. As can be seen, we have 30 observations and 8 variables.

 

You are to consider how many of these variables should be included in the model, and why.

(You should not consider which exact variables you would include, just the number of variables.)

A

All 8 variables should be used, because more variables always give a more accurate model, and with as few data points as we have in this case, model construction will be fast no matter what, so there is no need to consider simplified models.

*B*

With only 30 data points one should use 3-6 variables: using fewer variables than 1/10 of the number of data points makes it hard to get a good fit, while using more than 1/5 of the number of data points gives rise to the risk of overfitting. To verify which variables to include, one should check the p-values testing whether each parameter estimate differs from zero.

C

As few variables as possible should be chosen: 1 or 2, still verified by the p-values as in answer B. With only 1 or 2 variables it will also be possible to visualize the grouping structure produced by the model, by simply making a 2- or 3-dimensional plot, so one should always consider not using more than 2 variables.

Explanation

Having too many variables will often result in overfitting, which is illustrated in the slides of lecture 2. Essentially, the problem is that there are infinitely many ways the model can choose its parameters such that they still cover the points in the data. If this occurs, the model will only describe the training data and will likely not be usable for other data, since the parameters do not capture the behavior of the data in general. This corresponds to the warning message in R saying:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

Tags: lecture_2
Author: acarbalacar (daand09)
Avg Rating: 4.0000
Avg Difficulty: 0.5000
Total ratings: 4

ID: 513131
Created: 2013-03-05 10:34:53
Question

Consider the following neural network.

[Figure of the neural network omitted from the export.]

 

The weights, including bias, are defined as follows.

 

 

 

Furthermore, the activation function at the hidden layer, as well as the activation function at the output layer, is given by the logistic sigmoid function:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

Your task is to use forward propagation to calculate the estimate, $\hat{y}$, that the neural network produces on the following input.

 

*A*

0.003309

B

0.213426

C

0.013655

D

0.748314

E

0.508049

Explanation

The formula for forward propagation is given as follows:

$$\hat{y} = f\!\left(\sum_{j=0}^{M} w^{(2)}_{j}\, f\!\Big(\sum_{i=0}^{D} w^{(1)}_{ji}\, x_i\Big)\right),$$

where M = D = 2 and f is the logistic sigmoid function in our case.

 

We evaluate the sum with our weights;

 

 

Introducing the input yields the following;

 

 

f was the logistic sigmoid function;

 

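Since the actual weights and input were lost in the export, here is a small Python sketch of the forward pass with hypothetical values, with the bias absorbed as a leading weight on a constant input of 1:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights for a 2-input, 2-hidden-unit, 1-output network.
W1 = np.array([[0.5, -1.0, 0.2],     # hidden unit 1: [bias, w_x1, w_x2]
               [1.5,  0.3, -0.7]])   # hidden unit 2
W2 = np.array([-0.4, 2.0, -1.2])     # output unit: [bias, w_z1, w_z2]

x = np.array([1.0, 0.5])             # hypothetical input vector

z = sigmoid(W1 @ np.concatenate(([1.0], x)))       # hidden activations
y_hat = sigmoid(W2 @ np.concatenate(([1.0], z)))   # network output
print(y_hat)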
Tags: lecture_5
Author: larsgmathiasen (lamat10)
Avg Rating: 3.5000
Avg Difficulty: 1.0000
Total ratings: 6

ID: 511287
Created: 2013-03-03 12:54:57
Question

Suppose you have implemented your favorite learning algorithm, namely logistic regression, to solve a given learning problem.

Unfortunately, you are getting an intolerable test error with the parameters learned from your training data.

Analysis of the situation showed that it is a problem of either high bias or high variance.

State, for each of the following approaches, whether it will solve high bias, high variance, or both.

 

a) Acquire more training data

b) Reduce the set of features

c) Increase the set of features

A

a) Solves both high bias and high variance

b) Solves high variance

c) Solves high bias

B

a) Solves high bias

b) Solves high bias

c) Solves high variance

C

a) Solves high variance

b) Solves high variance

c) Solves both high bias and high variance

D

a) Solves high bias

b) Solves high variance

c) Solves high bias

*E*

a) Solves high variance

b) Solves high variance

c) Solves high bias

Explanation

If the bias is high, then we are consistently learning the same wrong thing, regardless of the amount of training data we have, e.g. trying to fit a linear function to quadratic data (underfitting). This can often be recognized when high training error is observed together with high test error.

Thus, the only remedy is increasing the set of features; more training examples will not help and reducing the set of features will definitely not help.

 

If the variance is high, then we are overfitting the data, i.e. fitting to a training set that is too small to reflect the true pattern of the data, e.g. fitting a 9th-order polynomial to 10 training points. This can be recognized when the training error is low but the test error is high.

The conclusion is that we have too many features, so we should reduce the set of features rather than increase it.

Furthermore, we could solve the high variance by acquiring more training examples to get rid of the overfitting.

Tags: lecture_10
Author: larsgmathiasen (lamat10)
Avg Rating: 3.4000
Avg Difficulty: 0.6000
Total ratings: 5