Beyond Linear Models

Statistical Computing, 36-350

Friday November 5, 2016

What other kinds of models are there?

Linear regression models, as we’ve said, are useful and ubiquitous. But there’s a lot else out there. What else?

Today we’ll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they’re quite different, and you’ll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course

Logistic regression modeling

Given response \(Y\) and predictors \(X_1,\ldots,X_p\), where \(Y \in \{0,1\}\) is a binary outcome. In a logistic regression model, we posit the relationship:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). Goal is to estimate parameters, also called coefficients \(\beta_0,\beta_1,\ldots,\beta_p\)

Prostate cancer data, once again

Recall, data set from 97 men who have prostate cancer (from the book The Elements of Statistical Learning):

pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
## [1] 97  9
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Fitting a logistic regression model with glm()

We can use glm() to fit a logistic regression model. The arguments are very similar to lm()

The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, …, Xp are the predictors. These refer to column names of variables in a data frame, that we pass in through the data argument. We must also specify family="binomial" to get logistic regression

E.g., for the prostate data, suppose we add a column lpsa.high to our data frame pros.df, as the indicator of whether the lpsa variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable lpsa.high onto the predictor variables lcavol and lweight:

pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
## 
##  0  1 
## 41 56
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
## [1] "glm" "lm"
pros.glm # It has a special way of printing
## 
## Call:  glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Coefficients:
## (Intercept)       lcavol      lweight  
##     -12.551        1.520        3.034  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  94 Residual
## Null Deviance:       132.1 
## Residual Deviance: 75.44     AIC: 81.44

Utility functions work as before

For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions coef(), fitted(), residuals(), summary(), plot(), predict() work pretty much just as with lm(). E.g.,

coef(pros.glm) # Logisitic regression coefficients 
## (Intercept)      lcavol     lweight 
##  -12.551478    1.520006    3.034018
summary(pros.glm) # Special way of summarizing
## 
## Call:
## glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7957  -0.6413   0.2192   0.5608   2.2209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.5515     3.2422  -3.871 0.000108 ***
## lcavol        1.5200     0.3604   4.218 2.47e-05 ***
## lweight       3.0340     0.8615   3.522 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.142  on 96  degrees of freedom
## Residual deviance:  75.436  on 94  degrees of freedom
## AIC: 81.436
## 
## Number of Fisher Scoring iterations: 5
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
##      y.true
## y.hat  0  1
##     0 33  5
##     1  8 51

What does a logistic regression coefficient mean?

How do you interpret a logistic regression coefficient? Note, from our logistic model:

\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]

So, increasing predictor \(X_j\) by one unit, while holding all other predictor fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,

coef(pros.glm)  
## (Intercept)      lcavol     lweight 
##  -12.551478    1.520006    3.034018

So, increasing lcavol (log cancer volume) by one unit (one cc), while holding lweight (log cancer weight) fixed, multiplies the odds of lpsa.high (high PSA score, over 10) by \(\approx e^{1.52} \approx 4.57\)

Creating a binary variable “on-the-fly”

We can easily create a binary variable “on-the-fly” by using the I() function inside a call to glm() (same would work with lm()):

pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df, 
               family="binomial")
summary(pros.glm) # Same as before
## 
## Call:
## glm(formula = I(lpsa > log(10)) ~ lcavol + lweight, family = "binomial", 
##     data = pros.df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7957  -0.6413   0.2192   0.5608   2.2209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -12.5515     3.2422  -3.871 0.000108 ***
## lcavol        1.5200     0.3604   4.218 2.47e-05 ***
## lweight       3.0340     0.8615   3.522 0.000429 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 132.142  on 96  degrees of freedom
## Residual deviance:  75.436  on 94  degrees of freedom
## AIC: 81.436
## 
## Number of Fisher Scoring iterations: 5

Generalized additive modeling

Generalized additive models allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:

\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\)

Fitting a generalized additive model with gam()

We can use the gam() function, from the gam package, to fit a generalized additive model. The arguments are similar to glm() (and to lm()), with a key distinction

The formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp), where Y is the response and X1, …, Xp are the predictors. The notation s() is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect

So, e.g., to fit the model

\[ \mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight}) = \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight} \]

we use:

library(gam, quiet=TRUE)
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
## [1] "gam" "glm" "lm"
pros.gam # It has a special way of printing
## Call:
## gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## 
## Degrees of Freedom: 96 total; 91.00006 Residual
## Residual Deviance: 49.40595

Most utility functions work as before

Most of our utility functions work just as before. E.g.,

summary(pros.gam)
## 
## Call: gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## Deviance Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.6343202 -0.4601627  0.0004965  0.5121757  1.8516801 
## 
## (Dispersion Parameter for gaussian family taken to be 0.5429)
## 
##     Null Deviance: 127.9177 on 96 degrees of freedom
## Residual Deviance: 49.406 on 91.0001 degrees of freedom
## AIC: 223.8339 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## s(lcavol)  1 69.003  69.003 127.095 < 2.2e-16 ***
## lweight    1  7.181   7.181  13.227 0.0004571 ***
## Residuals 91 49.406   0.543                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##             Npar Df Npar F  Pr(F)
## (Intercept)                      
## s(lcavol)         3 1.4344 0.2379
## lweight

But now, plot(), instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()):

plot(pros.gam)

We can see that, even though we allowed lcavol to have a nonlinear effect, this didn’t seem to matter much as its estimated effect came out to be pretty close to linear anyway!