---
title: "Beyond Linear Models"
author: "Statistical Computing, 36-350"
date: "Friday November 5, 2016"
---

What other kinds of models are there?
===

Linear regression models, as we've said, are useful and ubiquitous. But there's a lot else out there. What else?

- Logistic regression
- Poisson regression
- Generalized linear models
- Generalized additive models
- Many, many others (generative models, Bayesian models, hidden Markov models, nonparametric models, machine learning "models", etc.)

Today we'll quickly visit logistic regression and generalized additive models. In some ways, they are similar to linear regression; in others, they're quite different, and you'll learn a lot more about them in the Advanced Methods for Data Analysis 36-402 course.

Logistic regression modeling
===

Given a response $Y$ and predictors $X_1,\ldots,X_p$, where $Y \in \{0,1\}$ is a **binary outcome**. In a **logistic regression model**, we posit the relationship:

$$\log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$$

(where $Y|X$ is shorthand for $Y|X_1,\ldots,X_p$). The goal is to estimate the parameters, also called coefficients, $\beta_0,\beta_1,\ldots,\beta_p$.

Prostate cancer data, once again
===

Recall, the data set from 97 men who have prostate cancer (from the book [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/)):

```{r}
pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
head(pros.df)
```

Fitting a logistic regression model with glm()
===

We can use glm() to fit a logistic regression model. The arguments are very similar to lm().

The first argument is a formula, of the form Y ~ X1 + X2 + ... + Xp, where Y is the response and X1, ..., Xp are the predictors. These refer to column names of variables in a data frame, that we pass in through the data argument.
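(Aside: the log-odds transform on the left-hand side of the model above, and its inverse the logistic function, are available in base R as qlogis() and plogis(). This is just a small numerical check for intuition, not part of the fitting workflow; the probability 0.8 is an arbitrary example value:)

```{r}
# The log-odds of a probability p is log(p/(1-p)); qlogis() computes it
p = 0.8
qlogis(p)         # Same as log(0.8/0.2)
log(p/(1-p))
# plogis() inverts the transform: log-odds back to a probability
plogis(qlogis(p)) # Recovers 0.8
```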
We must also specify family="binomial" to get logistic regression.

E.g., for the prostate data, suppose we add a column lpsa.high to our data frame pros.df, as the indicator of whether the lpsa variable is larger than log(10) (equivalently, whether the PSA score is larger than 10). To regress the binary response variable lpsa.high onto the predictor variables lcavol and lweight:

```{r}
pros.df$lpsa.high = as.numeric(pros.df$lpsa > log(10)) # New binary outcome
table(pros.df$lpsa.high) # There are 56 men with high PSA scores
pros.glm = glm(lpsa.high ~ lcavol + lweight, data=pros.df, family="binomial")
class(pros.glm) # Really, a specialized list
pros.glm # It has a special way of printing
```

Utility functions work as before
===

For retrieving coefficients, fitted values, residuals, summarizing, plotting, making predictions, the utility functions coef(), fitted(), residuals(), summary(), plot(), predict() work pretty much just as with lm(). E.g.,

```{r}
coef(pros.glm) # Logistic regression coefficients
summary(pros.glm) # Special way of summarizing
p.hat = fitted(pros.glm) # These are probabilities! Not binary outcomes
y.hat = round(p.hat) # This is one way we'd compute fitted outcomes
table(y.hat, y.true=pros.df$lpsa.high) # This is a 2 x 2 "confusion matrix"
```

What does a logistic regression coefficient mean?
===

How do you interpret a logistic regression coefficient? Note, from our logistic model:

$$\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \exp(\beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p)$$

So, increasing predictor $X_j$ by one unit, while holding all other predictors fixed, multiplies the odds by $e^{\beta_j}$.
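(To see this odds-multiplier interpretation concretely, here is a small sketch on simulated data; the true coefficient 1.2 and the setup are made up for illustration:)

```{r}
set.seed(1)
x = rnorm(500)
# Simulate a binary outcome whose true log-odds are 0.5 + 1.2*x
y = rbinom(500, size=1, prob=plogis(0.5 + 1.2*x))
sim.glm = glm(y ~ x, family="binomial")
exp(coef(sim.glm)) # Estimated odds multipliers; the x entry should be near exp(1.2)
```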
E.g.,

```{r}
coef(pros.glm)
```

So, increasing lcavol (log cancer volume) by one unit (one cc), while holding lweight (log cancer weight) fixed, multiplies the odds of lpsa.high (high PSA score, over 10) by $\approx e^{1.52} \approx 4.57$.

Creating a binary variable "on-the-fly"
===

We can easily create a binary variable "on-the-fly" by using the I() function inside a call to glm() (the same would work with lm()):

```{r}
pros.glm = glm(I(lpsa > log(10)) ~ lcavol + lweight, data=pros.df, family="binomial")
summary(pros.glm) # Same as before
```

Generalized additive modeling
===

**Generalized additive models** allow us to do something that is like linear regression or logistic regression, but with a more flexible way of modeling the effects of predictors (rather than limiting their effects to be linear). For a continuous response $Y$, our model is:

$$\mathbb{E}(Y|X) = \beta_0 + f_1(X_1) + \ldots + f_p(X_p)$$

and the goal is to estimate $\beta_0,f_1,\ldots,f_p$. For a binary response $Y$, our model is:

$$\log\frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=0|X)} = \beta_0 + f_1(X_1) + \ldots + f_p(X_p)$$

and the goal is again to estimate $\beta_0,f_1,\ldots,f_p$.

Fitting a generalized additive model with gam()
===

We can use the gam() function, from the gam package, to fit a generalized additive model. The arguments are similar to glm() (and to lm()), with a key distinction.

The formula is now of the form Y ~ s(X1) + X2 + ... + s(Xp), where Y is the response and X1, ..., Xp are the predictors.
The notation s() is used around a predictor name to denote that we want to model this as a smooth effect (nonlinear); without this notation, we simply model it as a linear effect.

So, e.g., to fit the model

$$\mathbb{E}(\mathrm{lpsa}\,|\,\mathrm{lcavol},\mathrm{lweight}) = \beta_0 + f_1(\mathrm{lcavol}) + \beta_2 \mathrm{lweight}$$

we use:

```{r}
library(gam, quietly=TRUE)
pros.gam = gam(lpsa ~ s(lcavol) + lweight, data=pros.df)
class(pros.gam) # Again, a specialized list
pros.gam # It has a special way of printing
```

Most utility functions work as before
===

Most of our utility functions work just as before. E.g.,

```{r}
summary(pros.gam)
```

But now, plot(), instead of producing a bunch of diagnostic plots, shows us the effects that were fit to each predictor (nonlinear or linear, depending on whether or not we used s()):

```{r}
plot(pros.gam)
```

We can see that, even though we allowed lcavol to have a nonlinear effect, this didn't seem to matter much as its estimated effect came out to be pretty close to linear anyway!
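(One predict() detail worth knowing for the binary-response models above: for a glm() fit, predict() returns values on the link (log-odds) scale by default, and type="response" gives probabilities; the gam package's predict method accepts the same type argument. A minimal sketch on simulated data, with the setup made up for illustration:)

```{r}
set.seed(1)
x = rnorm(100)
y = rbinom(100, size=1, prob=plogis(x))
fit = glm(y ~ x, family="binomial")
new.df = data.frame(x=c(-2, 0, 2))
predict(fit, newdata=new.df)                  # Log-odds scale (the default)
predict(fit, newdata=new.df, type="response") # Probability scale, in (0,1)
```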