---
title: "Beyond Linear Models"
author: "Statistical Computing, 36-350"
date: "Friday November 5, 2016"
---
What other kinds of models are there?
===
Linear regression models, as we've said, are useful and ubiquitous. But there's a lot else out there. What else?
- Logistic regression
- Poisson regression
- Generalized linear models
- Generalized additive models
- Many, many others (generative models, Bayesian models, hidden Markov models, nonparametric models, machine learning "models", etc.)
Today we'll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they're quite different. You'll learn a lot more about them in 36-402 (Advanced Methods for Data Analysis)
Logistic regression modeling
===
Suppose we are given a response $Y$ and predictors $X_1,\ldots,X_p$, where $Y \in \{0,1\}$ is a **binary outcome**. In a **logistic regression model**, we posit the relationship:
$$
\log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} =
\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p
$$
(where $Y|X$ is shorthand for $Y|X_1,\ldots,X_p$). The goal is to estimate the parameters, also called coefficients, $\beta_0,\beta_1,\ldots,\beta_p$
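Equivalently, inverting the log odds gives $\mathbb{P}(Y=1|X) = 1/(1+e^{-(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)})$. As a quick sketch of this mapping (using made-up coefficient and predictor values, not estimates from any data), we can convert a linear predictor into a probability with `plogis()`, R's built-in inverse logit:

```{r}
b0 = -1; b1 = 0.5   # Hypothetical coefficients, for illustration only
x = 3               # A hypothetical predictor value
eta = b0 + b1*x     # Linear predictor, i.e., the log odds
plogis(eta)         # P(Y=1|X): inverse logit, 1/(1+exp(-eta))
1/(1+exp(-eta))     # Same thing, computed manually
```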
Prostate cancer data, once again
===
Recall, data set from 97 men who have prostate cancer (from the book [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/)):
```{r}
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
head(pros.df)
```
Fitting a logistic regression model with `glm()`
===
We can use `glm()` to fit a logistic regression model. The arguments are very similar to `lm()`
The first argument is a formula, of the form `Y ~ X1 + X2 + ... + Xp`, where `Y` is the response and `X1`, ..., `Xp` are the predictors. These refer to column names of variables in a data frame that we pass in through the `data` argument. We must also specify `family="binomial"` to get logistic regression
E.g., for the prostate data, suppose we add a column `lpsa.high` to our data frame `pros.df`, as the indicator of whether the `lpsa` variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable `lpsa.high` onto the predictor variables `lcavol` and `lweight`:
```{r}
pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
pros.glm # It has a special way of printing
```
Utility functions work as before
===
For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions `coef()`, `fitted()`, `residuals()`, `summary()`, `plot()`, `predict()` work pretty much just as with `lm()`. E.g.,
```{r}
coef(pros.glm) # Logistic regression coefficients
summary(pros.glm) # Special way of summarizing
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
```
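As with `lm()`, we can also use `predict()` to make predictions at new predictor values; for `glm()` objects, passing `type="response"` returns predicted probabilities rather than log odds. A quick sketch (the new predictor values below are made up for illustration):

```{r}
pros.new = data.frame(lcavol=c(1, 3), lweight=c(3.5, 4)) # Hypothetical new men
predict(pros.glm, newdata=pros.new)                  # Predicted log odds
predict(pros.glm, newdata=pros.new, type="response") # Predicted probabilities
```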
What does a logistic regression coefficient mean?
===
How do you interpret a logistic regression coefficient? Note, from our logistic model:
$$
\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} =
\exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)
$$
So, increasing predictor $X_j$ by one unit, while holding all other predictors fixed, multiplies the odds by $e^{\beta_j}$. E.g.,
```{r}
coef(pros.glm)
```
So, increasing `lcavol` (log cancer volume) by one unit, while holding `lweight` (log cancer weight) fixed, multiplies the odds of `lpsa.high` (high PSA score, over 10) by $\approx e^{1.52} \approx 4.57$
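We can get all of these odds multipliers at once by exponentiating the coefficient vector:

```{r}
exp(coef(pros.glm)) # Multiplicative change in odds, per unit increase in each predictor
```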
Creating a binary variable "on-the-fly"
===
We can easily create a binary variable "on-the-fly" by using the `I()` function inside a call to `glm()` (same would work with `lm()`):
```{r}
pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df,
family="binomial")
summary(pros.glm) # Same as before
```
Generalized additive modeling
===
**Generalized additive models** allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response $Y$, our model is:
$$
\mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p)
$$
and the goal is to estimate $\beta_0,f_1,\ldots,f_p$. For a binary response $Y$, our model is:
$$
\log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)}
= \beta_0 + f_1(X_1) + \ldots + f_p(X_p)
$$
and the goal is again to estimate $\beta_0,f_1,\ldots,f_p$. (Note that, as usual, an additive model in either case: each predictor contributes its own separate effect)
Fitting a generalized additive model with `gam()`
===
We can use the `gam()` function, from the `gam` package, to fit a generalized additive model. The arguments are similar to `glm()` (and to `lm()`), with a key distinction
The formula is now of the form `Y ~ s(X1) + X2 + ... + s(Xp)`, where `Y` is the response and `X1`, ..., `Xp` are the predictors. The notation `s()` is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect
So, e.g., to fit the model
$$
\mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight})
= \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight}
$$
we use:
```{r}
library(gam, quietly=TRUE)
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
pros.gam # It has a special way of printing
```
Most utility functions work as before
===
Most of our utility functions work just as before. E.g.,
```{r}
summary(pros.gam)
```
But now, `plot()`, instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used `s()`):
```{r}
plot(pros.gam)
```
We can see that, even though we allowed `lcavol` to have a nonlinear effect, this didn't seem to matter much as its estimated effect came out to be pretty close to linear anyway!
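One informal way to double-check this (a sketch, not part of the original analysis): fit the purely linear model with `lm()` and compare the two fits, e.g., via `AIC()`, which works on both kinds of model objects. Similar AIC values would agree with the plot's suggestion that the linear fit is adequate here:

```{r}
pros.lm = lm(lpsa ~ lcavol + lweight, data=pros.df) # Purely linear fit
AIC(pros.lm)  # Compare: similar values suggest the smooth
AIC(pros.gam) # term for lcavol isn't buying us much
```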