From: Brian Junker [brian@stat.cmu.edu]
Sent: Tuesday, April 25, 2006 2:06 PM
To: bj20@andrew.cmu.edu
Subject: 36-724 hw06 hint #1

Dear [copy for BJ],

This is the first of two or three emails offering hints to help you finish hw06. This one has to do with problem 5 on the hw sheet. The next one will come around 5:30 or 6:00 this afternoon, and will deal with some conceptual and computational issues people are having with problems 1, 2, and 3.

I have uploaded into the hw06 area the file adaboost.r, which contains basic adaboost code (the basic code was written by Dan Ting, the grader, and then much-adulterated by me to illustrate behavior with different linear classifiers).

If you run tryit(fitnext1,400), you will see the behavior of adaboost with a linear classifier fitted by minimizing weighted classification error (with the weights calculated as in step 2(d), p. 301 of HTF). There are two graphical outputs. First, the sequence of linear classifiers is plotted, with line thickness related to the size of the coefficient alpha_m in step 3, p. 301, so you can see which classifiers get a lot of weight (thick lines) and which get only a little (thin lines). Second, an error plot similar to Fig. 10.2 of HTF is drawn, showing both training error (solid line) and test error (dashed line). The behavior is not quite as smooth as with tree stumps (as used in Fig. 10.2), but it is similar. Note that on repeated runs you will get somewhat different classifiers, due to the naive random starting values used to optimize each linear classifier.

I am not quite sure why the error/learning curves are not as smooth as in Fig. 10.2. I suspect it is a combination of the random starting coefficients for each individual classifier and some deficiency in the "optim" function for optimizing a stepwise-constant loss function (weighted misclassification rate is stepwise constant in the coefficients of the linear classifier).

This gives you enough material to answer parts a, b, and c quickly. You can do part d on your own, by suitably modifying the simulation that generates X.test, y.test, X.train, and y.train.

If you run tryit(fitnext2,400), you will get the same demonstration of adaboost, but now the linear classifier is fitted by weighted least squares using lm. You will see immediately from the plotted individual classifiers what is going wrong (can you explain why, in terms of what least-squares fitting is doing here?), and the error/learning plot at the end looks terrible.

If you run tryit(fitnext3,400), you get the same demo again, but now with linear classifiers with *totally random* coefficients. On many runs this does almost as well as the first demo (fitnext1), and it always seems to do better than the lm-based demo. (There is quite a lot of variation run to run, due to the randomness of the individual classifiers.)
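In case the three fits are hard to picture, here is a rough sketch (NOT the actual code in adaboost.r; the function names here are made up) of how the three base classifiers might be fitted. X is the n x p matrix of predictors, y is in {-1,+1}, and w holds the adaboost weights from step 2(d):

  ## weighted misclassification rate: stepwise constant in beta
  wmce <- function(beta, X, y, w) {
    yhat <- sign(beta[1] + X %*% beta[-1])
    sum(w * (yhat != y)) / sum(w)
  }

  ## (1) fitnext1-style: minimize wmce with optim, from naive random starts
  fit.opt <- function(X, y, w)
    optim(rnorm(ncol(X) + 1), wmce, X = X, y = y, w = w)$par

  ## (2) fitnext2-style: weighted least squares on the +/-1 labels via lm
  fit.lm <- function(X, y, w) coef(lm(y ~ X, weights = w))

  ## (3) fitnext3-style: totally random coefficients
  fit.rand <- function(X, y, w) rnorm(ncol(X) + 1)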
These runs seem to be good empirical evidence that:

- contrary to the advice I gave in class, the loss function you use to fit the linear classifier does NOT have to be exponential loss. [My poor advice here came from a too-quick look at the proof, on p. 305 of HTF, that adaboost with weighted misclassification loss is equivalent to stagewise additive modeling with exponential loss!]

- still, you can choose a bad loss function; e.g., lm() is bad for this configuration of two classes because it always leads to classifiers that have about a 50% classification error rate on these data.

- in addition to reweighting the data by the w's for each fit, adaboost "wins" by weighting the individual classifiers by the alphas, emphasizing classifiers that do well and de-emphasizing classifiers that do badly (the sketch in the P.S. below shows exactly where the alphas enter). The problem with lm() is that it never produces any classifiers that are worth "weighting up" with the alphas in adaboost.

It seems worthwhile at this point to read original source material on adaboost (e.g., Freund & Schapire, 1997; or the Friedman papers listed in Sections 10.1--10.3).

see you in class,

-BJ
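P.S. To make the role of the alphas concrete, here is a minimal sketch of the adaboost loop itself (steps 1 through 3 on p. 301 of HTF). Again, this is NOT the code in adaboost.r, and "fitter" stands for any of the made-up fitting functions sketched above:

  adaboost.sketch <- function(X, y, M, fitter) {
    n <- nrow(X)
    w <- rep(1/n, n)                          # step 1: uniform weights
    betas <- vector("list", M)
    alpha <- numeric(M)
    for (m in 1:M) {
      betas[[m]] <- fitter(X, y, w)           # step 2(a): fit G_m to weighted data
      yhat <- sign(betas[[m]][1] + X %*% betas[[m]][-1])
      err  <- sum(w * (yhat != y)) / sum(w)   # step 2(b): weighted error of G_m
      alpha[m] <- log((1 - err) / err)        # step 2(c): classifier weight
      w <- w * exp(alpha[m] * (yhat != y))    # step 2(d): upweight the mistakes
    }
    list(betas = betas, alpha = alpha)        # G(x) = sign(sum_m alpha_m G_m(x))
  }

  ## e.g.: fit <- adaboost.sketch(X.train, y.train, 400, fit.rand)
  ## There are no guards for err = 0 or err = 1/2 (it is only a sketch), but
  ## step 2(c) shows why lm()'s ~50% error rates give alphas near zero.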