# DM825 - Introduction to Machine Learning Sheet 6, Spring 2013

Exercise 1 – Neural Networks for Time Series Prediction

A common data analysis task is time series prediction, where we have a set of data that show something varying over time, and we want to predict how the data will vary in the future. Examples are stock markets, river levels and house prices.

The data set PNoz.dat contains daily measurements of the thickness of the ozone layer above Palmerston North in New Zealand between 1996 and 2004. Ozone thickness is measured in Dobson units, where one unit corresponds to a layer 0.01 mm thick at 0 degrees Celsius and 1 atmosphere of pressure. The reduction in stratospheric ozone is partly responsible for global warming and the increased incidence of skin cancer. The thickness of the ozone layer varies naturally over the year, as you can see from the plot. (There are four fields in the data, and the ozone level is the third.)

    K <- read.table("PNoz.dat")
    names(K) <- c("year", "day", "ozone.level", "sulphur.dioxide.level")
    plot(K$ozone.level, xlab = "Time (Days)", ylab = "Ozone (Dobson units)", pch = ".", cex = 1.5)

Your task is to use a multi-layer perceptron to predict future ozone levels and to see whether you can detect an overall drop in the mean ozone level. Plot 400 predicted values together with the corresponding actual values.

The following is a reminder of the steps to carry out in the analysis:

• Select inputs and outputs for your problem and consequently the input and output nodes for the network.
• Normalize the data by rescaling.
• Split the data into training, validation and test sets (use a 50/25/25 split if there is enough data, or cross-validation if data are scarce).
• Identify the main parameters to configure, e.g., the network architecture and others.
• Train the network and compare the results for different parameter settings.
• Assess the performance on the test data.
• Analyse the bias-variance trade-off.
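The first steps of this checklist can be sketched in R as follows. This is only a minimal sketch under stated assumptions, not a reference solution: the series is a synthetic stand-in for the ozone data, the number of lagged inputs `k`, the helper `make_lagged` and the network size are arbitrary illustrative choices, and the split is made chronologically so that the test set lies in the future. The network fit uses `nnet` from the `nnet` package if it is installed; any other multi-layer perceptron implementation could be used instead.

```r
# Synthetic stand-in for the ozone series: a yearly cycle plus noise.
# (Assumption: replace `ozone` with K$ozone.level once PNoz.dat is loaded.)
set.seed(1)
n <- 2000
ozone <- 300 + 40 * sin(2 * pi * (1:n) / 365) + rnorm(n, sd = 10)

# Inputs and outputs: predict the value at time t from the k previous values.
k <- 3  # number of input nodes (arbitrary, to be tuned)
make_lagged <- function(x, k) {
  m <- embed(x, k + 1)  # row i holds x[i+k], x[i+k-1], ..., x[i]
  list(inputs = m[, -1, drop = FALSE], target = m[, 1])
}
d <- make_lagged(ozone, k)
N <- length(d$target)

# Chronological 50/25/25 split (no shuffling).
n_train <- floor(0.50 * N)
n_valid <- floor(0.25 * N)
idx_train <- 1:n_train
idx_valid <- (n_train + 1):(n_train + n_valid)
idx_test  <- (n_train + n_valid + 1):N

# Rescale with the range of the training targets only, so that no
# information from the validation or test period leaks into training.
lo <- min(d$target[idx_train])
hi <- max(d$target[idx_train])
rescale <- function(x) (x - lo) / (hi - lo)
X <- rescale(d$inputs)
y <- rescale(d$target)

# Fit one candidate network and score it on the validation set.
# (Skipped if the nnet package is not available.)
if (requireNamespace("nnet", quietly = TRUE)) {
  fit <- nnet::nnet(X[idx_train, , drop = FALSE], y[idx_train],
                    size = 5, linout = TRUE, decay = 1e-3,
                    maxit = 200, trace = FALSE)
  pred_valid <- predict(fit, X[idx_valid, , drop = FALSE])
  mse_valid <- mean((pred_valid - y[idx_valid])^2)
}
```

Repeating the last step for several values of `k`, `size` and `decay` and comparing the validation errors gives the parameter comparison asked for above; the test set should be touched only once, for the final model.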

Exercise 2 – Prediction of count outcomes

The software engineers of an online shopping portal wish to predict the number of clicks that a new product will receive in their online price search engine. New products are those that are not yet in the database of the portal and for which no historical data are available. The engineers would like to charge the vendors of the products on the basis of these predictions.

When a vendor inserts a product into the database, the product is classified either as belonging to an existing category or as initiating a new category. In the following, we will assume that the category of a new product is known and that it comprises products with similar names and characteristics.

The engineers collected historical data including the following variables:

• offer_id: a numerical identifier of the entry
• product_name: the name of the product (this field is currently problematic because it follows no standard format)
• category_id: a numerical identifier for the category
• price: the price of the product in euro (-1 if not given)
• deliver_price: the price for delivery (-1 if not given)
• total_price: the sum of the previous two prices (-1 if not given)
• availability: -1 see merchant site, 0 not available, 1 available, 2 available soon, 3 limited availability
• merchant: an encrypted name of the vendor
• number_reviews: the number of reviewers who rated the vendor
• average_rating: the average rating of the reviews
• number_clicks: the response variable indicating the number of clicks.
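The coding of the variables above matters when the data are loaded: in the three price fields, -1 is a sentinel for a missing value, whereas in availability it is a genuine level. A minimal sketch of one way to handle this in R is given below. The data frame `offers` and its rows are made-up toy values used only for illustration, and only a subset of the columns is shown.

```r
# Toy rows with made-up values, only to illustrate the recoding.
offers <- data.frame(
  offer_id      = 1:4,
  price         = c(19.99, -1, 5.50, 120.00),
  deliver_price = c(3.00, 4.50, -1, 0.00),
  total_price   = c(22.99, -1, -1, 120.00),
  availability  = c(1, -1, 0, 3),
  number_clicks = c(12, 0, 3, 1)
)

# In the price fields, -1 means "not given": recode it as NA so that
# it is not mistaken for a real (negative) price.
price_cols <- c("price", "deliver_price", "total_price")
offers[price_cols] <- lapply(offers[price_cols],
                             function(v) replace(v, v == -1, NA))

# In availability, -1 is a regular level, so treat the field as a factor.
offers$availability <- factor(
  offers$availability,
  levels = c(-1, 0, 1, 2, 3),
  labels = c("see merchant site", "not available", "available",
             "available soon", "limited availability")
)
```

Treating availability as a factor avoids imposing an artificial numeric ordering on its codes when it is later used as a predictor.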

The historical data refer to a single day, and the prediction is required for the same unit of time.

A useful nonlinear regression model when the outcome is a count, with large-count outcomes being rare events, is the Poisson model. The Poisson probability distribution is given by

$$p(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$
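As a quick check of the Poisson distribution given above, the probabilities can be evaluated directly and compared with R's built-in `dpois`. This is a small illustrative sketch: the function name `poisson_pmf`, the rate `lambda = 2.5` and the truncation of the support at 50 are arbitrary choices made here.

```r
# Poisson probability mass function, written out term by term.
poisson_pmf <- function(x, lambda) lambda^x * exp(-lambda) / factorial(x)

lambda <- 2.5  # arbitrary rate, for illustration only
x <- 0:50      # truncated support; the remaining tail is negligible here

p_manual  <- poisson_pmf(x, lambda)
p_builtin <- dpois(x, lambda)

# The hand-written formula agrees with dpois up to rounding error.
agree <- isTRUE(all.equal(p_manual, p_builtin))

# The probabilities sum to (almost exactly) 1, and the mean equals lambda.
total  <- sum(p_manual)
mean_x <- sum(x * p_manual)
```

The single parameter λ is therefore the expected count, which is the quantity a Poisson regression model relates to the predictors.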