DM825 - Introduction to Machine Learning
Sheet 6, Spring 2013



Exercise 1 – Neural Networks for Time Series Prediction

A common data analysis task is time series prediction, where we have a set of data that show something varying over time, and we want to predict how the data will vary in the future. Examples are stock markets, river levels and house prices.

The data set PNoz.dat contains daily measurements of the thickness of the ozone layer above Palmerston North in New Zealand between 1996 and 2004. Ozone thickness is measured in Dobson units, where one Dobson unit corresponds to a 0.01 mm thick layer of ozone at 0 degrees Celsius and 1 atmosphere of pressure. The reduction in stratospheric ozone is partly responsible for global warming and the increased incidence of skin cancer. The thickness of the ozone layer varies naturally over the year, as you can see from the plot. (There are four fields in the data, and the ozone level is the third.)

K <- read.table("PNoz.dat")
names(K) <- c("year","day","ozone.level","sulphur.dioxide.level")
plot(K$ozone.level,xlab="Time (Days)",ylab="Ozone (Dobson units)",pch=".",cex=1.5)

Your task is to use a multi-layer perceptron to predict future ozone levels and to see whether you can detect an overall drop in the mean ozone level. Plot 400 predicted values together with the actual values.
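One possible way to set up the prediction is sketched below; it is a starting point under stated assumptions, not a reference solution. It assumes the nnet package (a recommended package shipped with standard R distributions), uses a synthetic seasonal series as a stand-in for K$ozone.level so that it runs without the data file, and makes one-step-ahead predictions from the previous k values. The number of lags k, the hidden layer size and the weight decay are illustrative choices.

```r
library(nnet)  # single-hidden-layer perceptron; assumed to be installed

set.seed(1)
# Synthetic stand-in for K$ozone.level: yearly cycle plus noise (NOT the real data)
n <- 1500
ozone <- 300 + 40 * sin(2 * pi * (1:n) / 365) + rnorm(n, sd = 10)

# Rescale to [0, 1], which usually helps MLP training
lo <- min(ozone); hi <- max(ozone)
z <- (ozone - lo) / (hi - lo)

# Lagged embedding: column 1 is the target z[t],
# columns 2..(k+1) are the inputs z[t-1], ..., z[t-k]
k <- 4
E <- embed(z, k + 1)
y <- E[, 1]
X <- E[, -1, drop = FALSE]

# Hold out the last 400 time points for testing
n.test <- 400
test <- (nrow(E) - n.test + 1):nrow(E)

# Linear output unit (linout = TRUE) because this is a regression task
fit <- nnet(X[-test, ], y[-test], size = 5, linout = TRUE,
            decay = 1e-4, maxit = 500, trace = FALSE)

# Predict and map back to Dobson units
pred <- as.vector(predict(fit, X[test, ])) * (hi - lo) + lo
actual <- y[test] * (hi - lo) + lo

plot(actual, pch = ".", cex = 1.5,
     xlab = "Time (Days)", ylab = "Ozone (Dobson units)")
lines(pred, col = "red")
```

With the real data, ozone would be replaced by K$ozone.level. Because each prediction here uses the true previous values, this is one-step-ahead prediction; forecasting further ahead would require feeding predictions back in as inputs.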


The following is a reminder of the steps to carry out in the analysis:



Exercise 2 – Prediction of count outcomes

The software engineers of an online shopping portal wish to predict the number of clicks that a new product will receive in their online price search engine. New products are those that are not yet in the database of the portal and for which no historical data are available. The engineers would like to charge the vendors of the products on the basis of these predictions.

When a vendor inserts a product into the database, the product is classified as belonging to an existing category or as initiating a new category. In the following, we will assume that the category of a new product is known and that it comprises products with similar names and characteristics.

The engineers collected historical data including the following variables:

The historical data refer to a single day, and the prediction is required for the same unit of time.

A useful nonlinear regression model when the outcome is a count, with large-count outcomes being rare events, is the Poisson model. The Poisson probability distribution is given by

\[
p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots
\]
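As a quick sanity check of the formula, the following sketch evaluates the Poisson probabilities by hand and compares them with R's built-in dpois. The rate lambda = 2.5 and the range of x are arbitrary illustrative choices.

```r
lambda <- 2.5   # illustrative rate, not taken from the exercise
x <- 0:10

# Direct evaluation of p(x) = lambda^x * exp(-lambda) / x!
p.manual <- lambda^x * exp(-lambda) / factorial(x)

# Built-in Poisson probability mass function
p.builtin <- dpois(x, lambda)

print(cbind(x, p.manual, p.builtin))
```

The two columns should agree up to floating point error, and the probabilities fall off quickly for large x, which is why the model suits outcomes where large counts are rare.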

Your tasks:

  1. Model the prediction task by means of a generalized linear model (GLM) with a Poisson-distributed response. More specifically, indicate the resulting link function for the GLM.
  2. Indicate an appropriate measure for a quantitative assessment of the predictor.
  3. Which other predictive model studied in class can be applied to this task?
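
To experiment with the first two tasks, a Poisson GLM can be fitted in R with glm. The sketch below is only an illustration under assumptions: the data are simulated, and the variable names price and category.clicks are hypothetical placeholders, not the variables collected by the engineers. It uses the log link, which is the canonical link for the Poisson family, and computes two candidate held-out measures (mean Poisson deviance and root mean squared error).

```r
set.seed(42)
# Simulated stand-in data; variable names are hypothetical, not from the sheet
n <- 500
d <- data.frame(price = runif(n, 5, 100),
                category.clicks = rpois(n, 20))
d$clicks <- rpois(n, exp(0.5 - 0.01 * d$price + 0.05 * d$category.clicks))

train <- d[1:400, ]
test <- d[401:500, ]

# Poisson GLM with log link: log(E[clicks]) is linear in the predictors
fit <- glm(clicks ~ price + category.clicks,
           family = poisson(link = "log"), data = train)

# Predicted mean counts on the response scale
mu <- predict(fit, newdata = test, type = "response")
y <- test$clicks

# Mean Poisson deviance on held-out data (lower is better);
# the term y * log(y / mu) is taken to be 0 when y == 0
dev <- 2 * mean(ifelse(y == 0, 0, y * log(y / mu)) - (y - mu))

# Root mean squared error as a simpler alternative
rmse <- sqrt(mean((y - mu)^2))

print(c(deviance = dev, rmse = rmse))
```

Which measure is appropriate depends on how the predictions are used; the deviance matches the Poisson likelihood, while the RMSE is easier to interpret in units of clicks.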