Friday, February 27, 2009

Lec3 - Overfitting vs Underfitting

Video Lecture - 3

Lecture 3 starts with a brief discussion of the concept of [over|under]fitting and of [non]parametric algorithms. This relates to the so-called bias-variance trade-off (which will be discussed in more [mathematical] detail when we cover learning theory, in lectures 9-11), but I would like to add an additional perspective to the example given in the lecture, that of fitting a polynomial to a dataset. One key concept we need to understand is that the fundamental problem of inductive reasoning is how to generalize to unseen instances.

Without getting too philosophical or dogmatic, what does it actually mean to reason? We don't exactly know yet (not in the way brains do it), but we can make some qualitative statements about it. For example, we distinguish between deductive reasoning and inductive reasoning, and one easy way to relate the two is that deductive reasoning goes from general to specific, while inductive reasoning goes from specific to general, as illustrated here. It's easy to relate that idea to the following diagram (which I've seen in a couple of places):



As the diagram suggests, given a data generation process (or a model of such a process), probability theory enables us to reason about the outcome (the observed data), and I can think of no better example than doing inference on Bayesian networks. Although we won't really discuss Bayesian networks during our coverage of cs229, we will explore them in detail later, when we cover cs228. In the meantime, besides mentioning that this kind of inference generalizes deductive inference, I will leave you with a nice introduction.

Right now, we are more interested in the bottom part of the diagram, where statistical inference is depicted as the path that allows us to build models from data; in other words, to generalize from the specific. In fact, machine learning, data mining, pattern recognition, inductive reasoning, and statistical inference are all different facets of the same thing: using observed data to estimate unknown things.
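To make the two directions concrete, here is a minimal sketch using a coin flip as the data generation process. The coin, the function names, and the numbers are my own illustrative assumptions, not something from the lecture. The first function goes from a known model to the probability of the data; the second goes from observed data back to an estimate of the model's parameter (the maximum-likelihood estimate for this simple model).

```python
from math import comb

def prob_of_data(p_heads, n_flips, n_heads):
    """Probability direction: known model (p_heads) -> probability of the data."""
    return comb(n_flips, n_heads) * p_heads**n_heads * (1 - p_heads)**(n_flips - n_heads)

def estimate_p_heads(n_flips, n_heads):
    """Inference direction: observed data -> estimate of the unknown p_heads.

    For a coin this is the maximum-likelihood estimate, the observed fraction.
    """
    return n_heads / n_flips

# Hypothetical numbers: 7 heads in 10 flips.
forward = prob_of_data(0.5, 10, 7)    # how likely is this data if the coin is fair?
backward = estimate_p_heads(10, 7)    # what bias does this data suggest?
print(forward, backward)
```

The "unknown thing" here is a single number, the coin's bias; in the rest of the course the unknowns will be the parameters of richer models.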

Before we talk about what these unknown things might be, let us briefly and informally examine whether or not we are even justified in attempting to generalize from specific instances. It turns out we humans do it constantly; we couldn't survive otherwise. Imagine, for example, a person - let's call her Jane - who just became allergic to dogs and does not yet know it (she does know, however, about the possibility of being allergic to animals or things). One day, while visiting a friend who owns both a cat and a dog, Jane gets sick. The mild reaction subsides after Jane gets back home, and so she thinks nothing of it. A few days later, while visiting another friend who also owns a cat and a dog, she develops the same reaction, only this time it is more severe (not severe enough, though, to warrant a trip to the doctor, who might order some tests and explain everything). After Jane gets home and gets better, she starts to think about her experiences and reaches the conclusion that cats are the culprits, since both houses had cats, and the cats (unlike the dogs) were generally indoors.

It would not be unreasonable for Jane to (1) generalize that all cats might cause her an allergic reaction, and so choose to avoid them until she could see a doctor and know for sure. An alternative to this generalization would be to (2) consider all animals as dangerous, dogs and cats. Another alternative would be to (3) only consider dangerous those dogs and cats present when she actually had an allergic reaction. Yet another alternative would be to (4) consider everything about the houses she visited as features indicative of whether or not she would have a reaction (number of pets, quality of ventilation, presence of carpeted floors, kinds of pets, breeds, house color*, etc).

We can characterize the hypotheses available to Jane by the so-called inductive bias of each. The inductive bias of hypothesis (1) is the assumption that the presence of cats is all one needs to know to predict whether or not an allergic reaction is likely. That bias is high compared with the bias of hypothesis (4), which makes almost no assumptions (it does not even assume, for example, that house color isn't a factor)**.

Too high a bias, though simpler, may lead to underfitting: the inability to successfully classify either seen or unseen instances (Jane might have since visited a house with cats but no dogs, where she did not have a reaction). Too low a bias, on the other hand, is more complex and thus requires more computational/informational resources, and it may lead to overfitting: excellent classification of seen instances (once they have been seen) but poor classification of unseen instances (since the chance of a particular configuration repeating itself decreases as we increase the number of features we consider). The topic of inductive bias of learning algorithms is further explained here, but the sweet spot is somewhere in between, where we make neither too many nor too few assumptions***. Such is the case with hypothesis (2), which, if adhered to, will keep Jane from having another [potentially fatal] allergic reaction, even though it will cause her to avoid cats unnecessarily.
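Jane's situation is small enough to sketch in code. What follows is only a toy under stated assumptions: every house, feature value, and label is made up for illustration, the hidden truth is that Jane reacts to dogs only, and hypothesis (4) is modeled as pure memorization of the exact configurations in which a reaction occurred. Hypothesis (3) is left out for brevity.

```python
# Hypothetical visits: (features of the house, did Jane have a reaction?).
# All values are invented; the hidden truth is "reaction iff a dog is present".
SEEN = [
    ({"cat": True, "dog": True, "carpet": True, "color": "blue"}, True),
    ({"cat": True, "dog": True, "carpet": False, "color": "white"}, True),
    ({"cat": False, "dog": False, "carpet": True, "color": "green"}, False),  # her own home
    ({"cat": True, "dog": False, "carpet": False, "color": "gray"}, False),   # cats, no dogs
]
UNSEEN = [
    ({"cat": False, "dog": True, "carpet": False, "color": "red"}, True),
    ({"cat": True, "dog": True, "carpet": True, "color": "yellow"}, True),
    ({"cat": False, "dog": False, "carpet": False, "color": "white"}, False),
    ({"cat": True, "dog": False, "carpet": True, "color": "blue"}, False),
]

def cats_only(house):
    """Hypothesis (1): high bias, only the presence of cats matters."""
    return house["cat"]

def any_animal(house):
    """Hypothesis (2): any cat or dog is treated as dangerous."""
    return house["cat"] or house["dog"]

def make_memorizer(examples):
    """Hypothesis (4), modeled as memorization: predict a reaction only for
    a house whose every feature matches one where a reaction was seen."""
    bad_houses = [house for house, reaction in examples if reaction]
    def predict(house):
        return house in bad_houses
    return predict

def accuracy(predict, examples):
    """Fraction of examples for which the prediction matches the label."""
    hits = sum(predict(house) == label for house, label in examples)
    return hits / len(examples)

memorizer = make_memorizer(SEEN)
for name, h in [("(1) cats only", cats_only),
                ("(2) any animal", any_animal),
                ("(4) memorizer", memorizer)]:
    print(name, accuracy(h, SEEN), accuracy(h, UNSEEN))
```

On this made-up data the memorizer is perfect on the seen houses but right only half the time on the unseen ones, the cats-only rule makes mistakes on both, and the any-animal rule does best on the unseen houses. Its only errors are false alarms in houses with cats and no dogs, which cost Jane some unnecessary caution but never a missed reaction.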

This leads us into a discussion of parametric (more assumptions) and nonparametric (fewer assumptions) methods, which we will leave for next time.

*Not being an allergist, it is understandable for Jane not to have, a priori (see here and here), a completely accurate causal model of allergic mechanisms.

**If you look closely, you will notice that even the most unbiased of hypotheses is not without bias, for the mere belief that there may be a hypothesis at all is itself bias. Luckily for us, there are enough stable and persistent causal mechanisms out there, generating data we can perceive, that such belief is not unfounded. You might also find this to be very interesting.

***Can't we just make the right [and only those right] assumptions?
