Statistical Computing, 36-350

Friday November 5, 2016

Linear regression models, as we've said, are useful and ubiquitous. But there's a lot else out there. What else?

- Logistic regression
- Poisson regression
- Generalized linear models
- Generalized additive models
- Many, many others (generative models, Bayesian models, hidden Markov models, nonparametric models, machine learning "models", etc.)

Today we'll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they're quite different, and you'll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course.

Suppose we are given a response \(Y\) and predictors \(X_1,\ldots,X_p\), where \(Y \in \{0,1\}\) is a **binary outcome**. In a **logistic regression model**, we posit the relationship:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p \]

(where \(Y|X\) is shorthand for \(Y|X_1,\ldots,X_p\)). The goal is to estimate the parameters, also called coefficients, \(\beta_0,\beta_1,\ldots,\beta_p\).
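
Equivalently, exponentiating and solving for the probability (a standard rearrangement, spelled out here for reference):

\[ \mathbb{P}(Y=1|X) = \frac{\exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)}{1 + \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)} \]

so the fitted probabilities always lie between 0 and 1.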

Recall our data set from 97 men who have prostate cancer (from the book The Elements of Statistical Learning):

```
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
```

`## [1] 97 9`

`head(pros.df)`

```
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
```

`glm()`

We can use `glm()` to fit a logistic regression model. The arguments are very similar to `lm()`.

The first argument is a formula, of the form `Y ~ X1 + X2 + ... + Xp`, where `Y` is the response and `X1`, ..., `Xp` are the predictors. These refer to column names of variables in a data frame that we pass in through the `data` argument. We must also specify `family="binomial"` to get logistic regression.

E.g., for the prostate data, suppose we add a column `lpsa.high` to our data frame `pros.df`, as the indicator of whether the `lpsa` variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable `lpsa.high` onto the predictor variables `lcavol` and `lweight`:

```
pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
```

```
##
## 0 1
## 41 56
```

```
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
```

`## [1] "glm" "lm"`

`pros.glm # It has a special way of printing`

```
##
## Call: glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Coefficients:
## (Intercept) lcavol lweight
## -12.551 1.520 3.034
##
## Degrees of Freedom: 96 Total (i.e. Null); 94 Residual
## Null Deviance: 132.1
## Residual Deviance: 75.44 AIC: 81.44
```

For retrieving coefficients, fitted values, residuals, summarizing, plotting, and making predictions, the utility functions `coef()`, `fitted()`, `residuals()`, `summary()`, `plot()`, and `predict()` work pretty much just as with `lm()`. E.g.,

`coef(pros.glm) # Logistic regression coefficients`

```
## (Intercept) lcavol lweight
## -12.551478 1.520006 3.034018
```

`summary(pros.glm) # Special way of summarizing`

```
##
## Call:
## glm(formula = lpsa.high ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7957 -0.6413 0.2192 0.5608 2.2209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.5515 3.2422 -3.871 0.000108 ***
## lcavol 1.5200 0.3604 4.218 2.47e-05 ***
## lweight 3.0340 0.8615 3.522 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.142 on 96 degrees of freedom
## Residual deviance: 75.436 on 94 degrees of freedom
## AIC: 81.436
##
## Number of Fisher Scoring iterations: 5
```

```
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
```

```
## y.true
## y.hat 0 1
## 0 33 5
## 1 8 51
```
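
The utility functions above also include `predict()`; as a small sketch (the new data frame below is hypothetical, made up for illustration), predictions at new predictor values come out on the probability scale with `type="response"`:

```
new.df = data.frame(lcavol=c(1.0, 3.0), lweight=c(3.5, 4.0)) # Hypothetical new men
predict(pros.glm, newdata=new.df, type="response") # Predicted probabilities of lpsa.high
predict(pros.glm, newdata=new.df)                  # Default type="link" gives the log odds
```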

How do you interpret a logistic regression coefficient? Note, from our logistic model:

\[ \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p) \]

So, increasing predictor \(X_j\) by one unit, while holding all other predictors fixed, multiplies the odds by \(e^{\beta_j}\). E.g.,

`coef(pros.glm) `

```
## (Intercept) lcavol lweight
## -12.551478 1.520006 3.034018
```

So, increasing `lcavol` (log cancer volume) by one unit (one cc), while holding `lweight` (log cancer weight) fixed, multiplies the odds of `lpsa.high` (high PSA score, over 10) by \(\approx e^{1.52} \approx 4.57\).
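
We can check this multiplier directly (a quick sketch, reusing the fit from above):

```
exp(coef(pros.glm))           # Odds multipliers e^beta for each coefficient
exp(coef(pros.glm)["lcavol"]) # Roughly 4.57, the factor quoted above
```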

We can easily create a binary variable "on-the-fly" by using the `I()` function inside a call to `glm()` (same would work with `lm()`):

```
pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df,
               family="binomial")
summary(pros.glm) # Same as before
```

```
##
## Call:
## glm(formula = I(lpsa > log(10)) ~ lcavol + lweight, family = "binomial",
## data = pros.df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7957 -0.6413 0.2192 0.5608 2.2209
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.5515 3.2422 -3.871 0.000108 ***
## lcavol 1.5200 0.3604 4.218 2.47e-05 ***
## lweight 3.0340 0.8615 3.522 0.000429 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 132.142 on 96 degrees of freedom
## Residual deviance: 75.436 on 94 degrees of freedom
## AIC: 81.436
##
## Number of Fisher Scoring iterations: 5
```
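
As the parenthetical above notes, the same trick works inside `lm()`; a minimal sketch (not part of the original analysis) fits an ordinary linear model, a so-called linear probability model, to the 0/1 indicator:

```
# The I() trick inside lm(): the TRUE/FALSE indicator is treated as 0/1,
# giving a linear probability model rather than a logistic regression
pros.lm.bin = lm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df)
coef(pros.lm.bin)
```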

**Generalized additive models** allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response \(Y\), our model is:

\[ \mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is to estimate \(\beta_0,f_1,\ldots,f_p\). For a binary response \(Y\), our model is:

\[ \log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p) \]

and the goal is again to estimate \(\beta_0,f_1,\ldots,f_p\)

`gam()`

We can use the `gam()` function, from the `gam` package, to fit a generalized additive model. The arguments are similar to `glm()` (and to `lm()`), with a key distinction.

The formula is now of the form `Y ~ s(X1) + X2 + ... + s(Xp)`, where `Y` is the response and `X1`, ..., `Xp` are the predictors. The notation `s()` is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect.

So, e.g., to fit the model

\[ \mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight}) = \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight} \]

we use:

```
library(gam, quietly=TRUE) # Load the gam package
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
```

`## [1] "gam" "glm" "lm"`

`pros.gam # It has a special way of printing`

```
## Call:
## gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
##
## Degrees of Freedom: 96 total; 91.00006 Residual
## Residual Deviance: 49.40595
```

Most of our utility functions work just as before. E.g.,

`summary(pros.gam)`

```
##
## Call: gam(formula = lpsa ~ s(lcavol) + lweight, data = pros.df)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6343202 -0.4601627 0.0004965 0.5121757 1.8516801
##
## (Dispersion Parameter for gaussian family taken to be 0.5429)
##
## Null Deviance: 127.9177 on 96 degrees of freedom
## Residual Deviance: 49.406 on 91.0001 degrees of freedom
## AIC: 223.8339
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(lcavol) 1 69.003 69.003 127.095 < 2.2e-16 ***
## lweight 1 7.181 7.181 13.227 0.0004571 ***
## Residuals 91 49.406 0.543
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(lcavol) 3 1.4344 0.2379
## lweight
```

But now, `plot()`, instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used `s()`):

`plot(pros.gam)`

We can see that, even though we allowed `lcavol` to have a nonlinear effect, this didn't seem to matter much as its estimated effect came out to be pretty close to linear anyway!
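
As a final connection (a sketch, not fit above): the binary-response additive model written earlier combines the two ideas, and is fit by passing `family="binomial"` to `gam()`, e.g., using the `lpsa.high` column we created:

```
# A logistic GAM: smooth (nonlinear) effect for lcavol, linear effect for lweight
pros.gam.bin = gam(lpsa.high ~ s(lcavol) + lweight, data=pros.df, family="binomial")
summary(pros.gam.bin)
```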