Monday, February 16, 2009

Lec2 - Linear Regression - Gradient Descent

Video Lecture - 2

This is the first of a few posts on linear regression, which we will use to make some of the concepts from the introduction concrete.

Regression problems refer to problems where the target variable y is a continuous variable (this is in contrast to a classification problem, where the target variable is discrete), and we seek to model the relationship between this dependent target variable and one or more independent variables (aka regressors or features).

For example, take a look at the following housing dataset, a description for which can be found here. This Boston Housing dataset from the CMU StatLib Library concerns housing prices in Boston suburbs. Each sample consists of 13 attribute values (indicating parameters like crime rate, accessibility to major highways, etc.) and the median value of housing in thousands of dollars, which we would like to be able to predict. Specifically, given the particular feature values below, which do not appear in the dataset, can we estimate the median value of the house?

1. CRIM: 0.04741
2. ZN: 0.00
3. INDUS: 11.930
4. CHAS: 0
5. NOX: 0.5730
6. RM: 6.0300
7. AGE: 80.80
8. DIS: 2.5050
9. RAD: 1
10. TAX: 273.0
11. PTRATIO: 21.00
12. B: 396.90
13. LSTAT: 7.88
14. MEDV: y = ?

This is an example of supervised learning, also referred to as predictive modeling, where we seek to learn a mapping from an input vector x to a scalar (or vector) output y, by analyzing a bunch of examples where the output for a given input is known.
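
To make the input vector x concrete, here is a minimal sketch of how the query above could be encoded. The feature values are copied from the list above; the variable names (query, x) are just illustrative choices.

```python
# Feature names and values copied from the query listed above.
# MEDV is left out because it is the unknown target y.
query = {
    "CRIM": 0.04741, "ZN": 0.00, "INDUS": 11.930, "CHAS": 0,
    "NOX": 0.5730, "RM": 6.0300, "AGE": 80.80, "DIS": 2.5050,
    "RAD": 1, "TAX": 273.0, "PTRATIO": 21.00, "B": 396.90,
    "LSTAT": 7.88,
}

# The input vector x: one number per feature, in a fixed order.
# (Dicts preserve insertion order in Python 3.7+.)
x = [float(value) for value in query.values()]

print(len(x), x)
```

A learned model would then be a function that takes this x and returns an estimate of MEDV.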

Following Tom Mitchell's recipe, we could frame this task of learning to predict housing prices as follows:
  1. Choose a performance measure: we will use the mean squared error.
  2. Choose the training experience: this will consist of our training dataset.
  3. Choose the target function: this will be a real-valued function which takes as input a vector x of n features, f(x1, ..., xn) = y.
  4. Choose a representation for the target function: for this we will use a function which is linear in the features, with one weight/coefficient per feature, f(x0, ..., xn) = x0 * w0 + ... + xn * wn. By convention, x0 = 1, so that w0 acts as the intercept term.
  5. Choose a function approximation algorithm: to start off we will use an iterative algorithm to approximate our target function, gradient descent.
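
The recipe above can be sketched in a few lines of plain Python. This is only a rough illustration, not the implementation from the lecture: the data is a made-up toy set with a single feature (not the Boston Housing data), and the function names, learning rate, and iteration count are all assumptions chosen so the example converges.

```python
# Toy data only: one made-up feature, NOT the Boston Housing data.
# Each row starts with x0 = 1 so that w0 acts as the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2 * x1


def predict(w, x):
    """Linear representation: f(x) = x0*w0 + ... + xn*wn."""
    return sum(wj * xj for wj, xj in zip(w, x))


def mean_squared_error(w, X, y):
    """Performance measure: average squared prediction error."""
    return sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)


def batch_gradient_descent(X, y, learning_rate=0.1, n_iters=2000):
    """Batch mode: every update uses the gradient over the whole dataset."""
    w = [0.0] * len(X[0])
    m = len(y)
    for _ in range(n_iters):
        errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
        # Partial derivative of the mean squared error w.r.t. each weight w_j.
        grad = [2.0 / m * sum(e * xi[j] for e, xi in zip(errors, X))
                for j in range(len(w))]
        # Step against the gradient to reduce the error.
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad)]
    return w


w = batch_gradient_descent(X, y)
print(w, mean_squared_error(w, X, y))
```

On this toy data the weights should approach w0 = 1 and w1 = 2. With real features on very different scales (compare TAX and NOX above), a fixed learning rate like this one would likely need tuning, or the features would need rescaling first.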
As an exercise, let us also map this task's components into Hand, Mannila, and Smyth's bins:
  1. Task: regression.
  2. Structure: linear function.
  3. Score function: squared error.
  4. Search/optimization method: gradient descent on weights/parameters.
  5. Data management technique: batch mode (nothing fancy needed as our dataset is small and fits comfortably in RAM). Later on we will also examine a common variant, online mode.
Pretty much the same thing... pick the one you like best, or modify as you please... it will help to have this sort of mental map as we learn more algorithms.
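
As a preview of the online mode mentioned in the data management step, here is a minimal sketch in which the weights are updated after every single example instead of after a full pass over the dataset. Again the data is a made-up toy set, and the function name, learning rate, and epoch count are assumptions for illustration only.

```python
# Toy data only: one made-up feature, NOT the Boston Housing data.
# Each row starts with x0 = 1 so that w0 acts as the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]  # generated from y = 1 + 2 * x1


def predict(w, x):
    """Linear function: f(x) = x0*w0 + ... + xn*wn."""
    return sum(wj * xj for wj, xj in zip(w, x))


def online_gradient_descent(X, y, learning_rate=0.05, n_epochs=500):
    """Online mode: update the weights after each individual example."""
    w = [0.0] * len(X[0])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            error = predict(w, xi) - yi
            # Gradient of this one example's squared error is 2 * error * x.
            w = [wj - learning_rate * 2.0 * error * xj
                 for wj, xj in zip(w, xi)]
    return w


w_online = online_gradient_descent(X, y)
print(w_online)
```

Because this toy data is noise-free, the online updates settle on the same weights as a batch fit would. On noisy data with a fixed learning rate, the weights would typically keep jittering around the minimum rather than converging exactly.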
