Finger Exercises to be discussed Wed Sept 25 2002
-------------------------------------------------

Consider the following data sets:

A. The Longley data (Monahan, p. 86), which is highly collinear.

   * Splus:  X <- longley.x
             y <- longley.y
   * R:      data(longley)
             X <- longley[,1:6]
             y <- longley[,7]

B. The Steel and Torre data (Monahan, p. 109 ex 5.2)

   X <- 1:6
   y <- c(4.5, 5.5, 6.5, 8.0, 10.0, 12.0)

-----------------------------------------------------------------------

1. Compute the least-squares regression coefficients for the Longley
   and Steel/Torre datasets, using each of the following methods.
   N.b., don't forget to say X <- cbind(1,X) to get the intercept in
   the model before using methods (a), (b), (c).

   (a) Inverting X'X directly:

       XTX <- t(X) %*% X
       XTXinv <- solve(XTX)

   (b) Cholesky decomposition [use chol() to find the Cholesky
       decomposition]

   (c) QR decomposition [use qr(), qr.Q() and qr.R() to find the QR
       decomposition]

   (d) lm in Splus or R [don't say X <- cbind(1,X) in this case]

   and compare the results. [There will not be much difference,
   because Splus does all its calculations in double precision and
   typically shows only 7 or so significant decimal digits. How could
   you find differences in the lower-order digits? Single-precision
   calculations might not be able to handle the Longley data, for
   example.]

   [If you are using Splus, explore help(lm.fit) as well; this is the
   fitting function that lm uses. lm.fit in R is less versatile.]

2. As described on pp. 84--85 of Monahan, the condition number for a
   linear regression bounds relative changes in the estimated
   regression coefficients in terms of relative changes in the
   response variable y, or in the predictors X, in a regression
   problem.
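   [A sketch (not part of the handout) of the four fits in exercise 1,
   run on the small Steel/Torre data; the object names b.inv, b.chol,
   b.qr, b.lm are my own. All four columns of coefficients should
   agree to many digits on this tiny, well-conditioned problem:

       x <- 1:6
       y <- c(4.5, 5.5, 6.5, 8.0, 10.0, 12.0)
       X <- cbind(1, x)                 # design matrix with intercept

       ## (a) invert X'X directly
       XTX <- t(X) %*% X
       b.inv <- solve(XTX) %*% t(X) %*% y

       ## (b) Cholesky: X'X = R'R, so solve R'R b = X'y with
       ##     two triangular solves
       R <- chol(XTX)
       b.chol <- backsolve(R, forwardsolve(t(R), t(X) %*% y))

       ## (c) QR: X = QR, so solve R b = Q'y
       qrX <- qr(X)
       b.qr <- backsolve(qr.R(qrX), t(qr.Q(qrX)) %*% y)

       ## (d) lm on the raw regressor (lm adds the intercept itself)
       b.lm <- coef(lm(y ~ x))

   Differences among the methods show up only in the low-order digits,
   and become visible on ill-conditioned problems like the Longley
   data.]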
   Computing the exact condition number is not hard for small
   problems. If X is the matrix of regressors (without the intercept),
   the condition number may be computed as:

       XTX <- t(cbind(1,X)) %*% cbind(1,X)
       v <- eigen(XTX)$values
       k <- sqrt(max(v)/min(v))

   Two alternatives:

       XTX <- t(cbind(1,X)) %*% cbind(1,X)
       v <- svd(XTX)$d
       k <- sqrt(max(v)/min(v))

       v <- svd(cbind(1,X))$d
       k <- max(v)/min(v)

   Argue that these are all the same (cf. Monahan p. 123). (There is
   also a function kappa() that computes a fast approximate condition
   number, but the exact calculation is usually fast enough.)

3. Compute exact (as in #2) and approximate (using kappa()) condition
   numbers for the following X's [don't forget to cbind with 1!]:

   (a) X <- rnorm(6)          # cbind(1,X) full rank
   (b) X <- 1:6               # cbind(1,X) full rank, very mild collinearity
   (c) X <- cbind(1:6,2:7)    # cbind(1,X) not full rank
   (d) X <- longley.x         # cbind(1,X) full rank, highly collinear

4. For 3(a), (b) and (c), use the y vector for the Steel/Torre data;
   for 3(d) use the y vector from the Longley data. For all four
   examples, compare the regression coefficients in

       lm(y ~ X)

   vs.

       y1 <- y + rnorm(length(y));  lm(y1 ~ X)

   and relate to the results in #3.
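   [A sketch (not part of the handout) of the perturbation experiment
   in #4, run on the Steel/Torre data; the seed and object names are
   my own additions:

       set.seed(1)                      # so the perturbation is reproducible
       X <- 1:6
       y <- c(4.5, 5.5, 6.5, 8.0, 10.0, 12.0)

       ## exact condition number, as in #2 (third variant)
       v <- svd(cbind(1, X))$d
       k <- max(v) / min(v)

       ## perturb y and compare the two coefficient vectors side by side
       y1 <- y + rnorm(length(y))
       cbind(original  = coef(lm(y ~ X)),
             perturbed = coef(lm(y1 ~ X)))

   For a well-conditioned X (small k) the two columns stay close,
   relative to the size of the perturbation; for the highly collinear
   Longley data the condition number is many orders of magnitude
   larger, and a perturbation of the same size can move the
   coefficients far more.]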