---
title: 'Lecture 9: Models With Data'
author: "36-350"
date: "24 September 2014"
output: beamer_presentation
font-family: Garamond
transition: none
---
## In Previous Episodes
- Until now: processing existing data into R
- String manipulation, scraping and collecting data
## Today
- Using data frames for statistical purposes
- Manipulation of data into more convenient forms
- (Re-)Introduction to linear models and the model space
## So You've Got A Data Frame
What can we do with it?
- Plot it: examine multiple variables and distributions
- Test it: compare groups of individuals to each other
- Check it: does it conform to what we'd like for our needs?
## Test Case: Birth weight data
Included in R already:
```{r}
library(MASS)
data(birthwt)
summary(birthwt)
```
## From R help
Go to R help for more info, because someone documented this (thanks, someone!)
```
help(birthwt)
```
## Make it readable!
```{r}
colnames(birthwt)
colnames(birthwt) <- c("birthwt.below.2500", "mother.age",
"mother.weight", "race",
"mother.smokes", "previous.prem.labor",
"hypertension", "uterine.irr",
"physician.visits", "birthwt.grams")
```
## Make it readable, again!
Let's make all the factors more descriptive.
```{r}
birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])
```
## Make it readable, again!
```{r}
summary(birthwt)
```
## Explore it!
R's basic plotting functions go a long way.
```{r}
plot (birthwt$race)
title (main = "Count of Mother's Race in
Springfield MA, 1986")
```
## Explore it!
R's basic plotting functions go a long way.
```{r}
plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986")
```
## Explore it!
R's basic plotting functions go a long way.
```{r}
plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986")
```
## Explore it!
R's basic plotting functions go a long way.
```{r}
plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986")
```
## Basic statistical testing
Let's fit some models to the data pertaining to our outcome(s) of interest.
```{r}
plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")
```
## Basic statistical testing
Tough to tell! Simple two-sample t-test:
```{r}
t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"],
birthwt$birthwt.grams[birthwt$mother.smokes == "No"])
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
summary(linear.model.1)
```
## Basic statistical testing
Does this difference match the linear model?
```{r}
linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2
```
## Basic statistical testing
```{r}
summary(linear.model.2)
```
## Basic statistical testing
Diagnostics: R tries to make it as easy as possible (but no easier). Try in R proper:
```{r}
plot(linear.model.2)
```
## Detecting Outliers
These are the default diagnostic plots for the analysis. Note that our oldest mother and her heaviest child are greatly skewing this analysis as we suspected.
```{r}
birthwt.noout <- birthwt[birthwt$mother.age <= 40,]
linear.model.3 <- lm (birthwt.grams ~ mother.age, data=birthwt.noout)
linear.model.3
```
## Detecting Outliers
```{r}
summary(linear.model.3)
```
## More complex models
Add in smoking behavior:
```{r}
linear.model.3a <- lm (birthwt.grams ~ + mother.smokes + mother.age, data=birthwt.noout)
summary(linear.model.3a)
```
## More complex models
```{r}
plot(linear.model.3a)
```
## More complex models
Add in smoking behavior:
```{r}
linear.model.3b <- lm (birthwt.grams ~ mother.age + mother.smokes*race, data=birthwt.noout)
summary(linear.model.3b)
```
## More complex models
```{r}
plot(linear.model.3b)
```
## Everything Must Go (In)
Let's do a kitchen sink model on this new data set:
```{r}
linear.model.4 <- lm (birthwt.grams ~ ., data=birthwt.noout)
linear.model.4
```
## Everything Must Go (In), Except What Must Not
Whoops! One of those variables was `birthwt.below.2500` which is a function of the outcome.
```{r}
linear.model.4a <- lm (birthwt.grams ~ . - birthwt.below.2500, data=birthwt.noout)
summary(linear.model.4a)
```
## Everything Must Go (In), Except What Must Not
Whoops! One of those variables was `birthwt.below.2500` which is a function of the outcome.
```{r}
plot(linear.model.4a)
```
## Generalized Linear Models
Maybe a linear increase in birth weight is less important than if it's below a threshold like 2500 grams (5.5 pounds). Let's fit a generalized linear model instead:
```{r}
glm.0 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout)
plot(glm.0)
```
## Generalized Linear Models
The default value is a Gaussian model (a standard linear model). Change this:
```{r}
glm.1 <- glm (birthwt.below.2500 ~ . - birthwt.grams, data=birthwt.noout, family=binomial(link=logit))
```
## Generalized Linear Models
```{r}
summary(glm.1)
```
## Generalized Linear Models
```{r}
plot(glm.1)
```
## What Do We Do With This, Anyway?
Let's take a subset of this data to do predictions.
```{r}
odds <- seq(1, nrow(birthwt.noout), by=2)
birthwt.in <- birthwt.noout[odds,]
birthwt.out <- birthwt.noout[-odds,]
linear.model.half <-
lm (birthwt.grams ~
. - birthwt.below.2500, data=birthwt.in)
```
## What Do We Do With This, Anyway?
```{r}
summary (linear.model.half)
```
## What Do We Do With This, Anyway?
```{r}
birthwt.predict <- predict (linear.model.half)
cor (birthwt.in$birthwt.grams, birthwt.predict)
plot (birthwt.in$birthwt.grams, birthwt.predict)
```
## What Do We Do With This, Anyway?
```{r}
birthwt.predict.out <- predict (linear.model.half, birthwt.out)
cor (birthwt.out$birthwt.grams, birthwt.predict.out)
plot (birthwt.out$birthwt.grams, birthwt.predict.out)
```