24 September 2014

In Previous Episodes

  • Until now: processing existing data into R
  • String manipulation, scraping and collecting data

Today

  • Using data frames for statistical purposes
  • Manipulation of data into more convenient forms
  • (Re-)Introduction to linear models and the model space

So You've Got A Data Frame

What can we do with it?

  • Plot it: examine multiple variables and distributions
  • Test it: compare groups of individuals to each other
  • Check it: does it conform to what we'd like for our needs?

Test Case: Birth weight data

Included in R already:

library(MASS)
data(birthwt)
summary(birthwt)
##       low             age            lwt           race     
##  Min.   :0.000   Min.   :14.0   Min.   : 80   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:19.0   1st Qu.:110   1st Qu.:1.00  
##  Median :0.000   Median :23.0   Median :121   Median :1.00  
##  Mean   :0.312   Mean   :23.2   Mean   :130   Mean   :1.85  
##  3rd Qu.:1.000   3rd Qu.:26.0   3rd Qu.:140   3rd Qu.:3.00  
##  Max.   :1.000   Max.   :45.0   Max.   :250   Max.   :3.00  
##      smoke            ptl              ht               ui       
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.000   Median :0.000   Median :0.0000   Median :0.000  
##  Mean   :0.392   Mean   :0.196   Mean   :0.0635   Mean   :0.148  
##  3rd Qu.:1.000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.000   Max.   :3.000   Max.   :1.0000   Max.   :1.000  
##       ftv             bwt      
##  Min.   :0.000   Min.   : 709  
##  1st Qu.:0.000   1st Qu.:2414  
##  Median :0.000   Median :2977  
##  Mean   :0.794   Mean   :2945  
##  3rd Qu.:1.000   3rd Qu.:3487  
##  Max.   :6.000   Max.   :4990

From R help

Go to R help for more info, because someone documented this (thanks, someone!)

help(birthwt)

Make it readable!

colnames(birthwt)
##  [1] "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"   
##  [9] "ftv"   "bwt"
colnames(birthwt) <- c("birthwt.below.2500", "mother.age", 
                       "mother.weight", "race",
                       "mother.smokes", "previous.prem.labor", 
                       "hypertension", "uterine.irr",
                       "physician.visits", "birthwt.grams")

Make it readable, again!

Let's make all the factors more descriptive.

birthwt$race <- factor(c("white", "black", "other")[birthwt$race])
birthwt$mother.smokes <- factor(c("No", "Yes")[birthwt$mother.smokes + 1])
birthwt$uterine.irr <- factor(c("No", "Yes")[birthwt$uterine.irr + 1])
birthwt$hypertension <- factor(c("No", "Yes")[birthwt$hypertension + 1])

Make it readable, again!

summary(birthwt)
##  birthwt.below.2500   mother.age   mother.weight    race    mother.smokes
##  Min.   :0.000      Min.   :14.0   Min.   : 80   black:26   No :115      
##  1st Qu.:0.000      1st Qu.:19.0   1st Qu.:110   other:67   Yes: 74      
##  Median :0.000      Median :23.0   Median :121   white:96                
##  Mean   :0.312      Mean   :23.2   Mean   :130                           
##  3rd Qu.:1.000      3rd Qu.:26.0   3rd Qu.:140                           
##  Max.   :1.000      Max.   :45.0   Max.   :250                           
##  previous.prem.labor hypertension uterine.irr physician.visits
##  Min.   :0.000       No :177      No :161     Min.   :0.000   
##  1st Qu.:0.000       Yes: 12      Yes: 28     1st Qu.:0.000   
##  Median :0.000                                Median :0.000   
##  Mean   :0.196                                Mean   :0.794   
##  3rd Qu.:0.000                                3rd Qu.:1.000   
##  Max.   :3.000                                Max.   :6.000   
##  birthwt.grams 
##  Min.   : 709  
##  1st Qu.:2414  
##  Median :2977  
##  Mean   :2945  
##  3rd Qu.:3487  
##  Max.   :4990

Explore it!

R's basic plotting functions go a long way.

plot (birthwt$race)
title (main = "Count of Mother's Race in 
       Springfield MA, 1986")

plot of chunk unnamed-chunk-5

Explore it!

R's basic plotting functions go a long way.

plot (birthwt$mother.age)
title (main = "Mother's Ages in Springfield MA, 1986")

plot of chunk unnamed-chunk-6

Explore it!

R's basic plotting functions go a long way.

plot (sort(birthwt$mother.age))
title (main = "(Sorted) Mother's Ages in Springfield MA, 1986")

plot of chunk unnamed-chunk-7

Explore it!

R's basic plotting functions go a long way.

plot (birthwt$mother.age, birthwt$birthwt.grams)
title (main = "Birth Weight by Mother's Age in Springfield MA, 1986")

plot of chunk unnamed-chunk-8

Basic statistical testing

Let's fit some models to the data pertaining to our outcome(s) of interest.

plot (birthwt$mother.smokes, birthwt$birthwt.grams, main="Birth Weight by Mother's Smoking Habit", ylab = "Birth Weight (g)", xlab="Mother Smokes")

plot of chunk unnamed-chunk-9

Basic statistical testing

Tough to tell! Simple two-sample t-test:

t.test (birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"], 
        birthwt$birthwt.grams[birthwt$mother.smokes == "No"])
## 
##  Welch Two Sample t-test
## 
## data:  birthwt$birthwt.grams[birthwt$mother.smokes == "Yes"] and birthwt$birthwt.grams[birthwt$mother.smokes == "No"]
## t = -2.73, df = 170.1, p-value = 0.007003
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -488.98  -78.57
## sample estimates:
## mean of x mean of y 
##      2772      3056

Basic statistical testing

Does this difference match the linear model?

linear.model.1 <- lm (birthwt.grams ~ mother.smokes, data=birthwt)
linear.model.1
## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Coefficients:
##      (Intercept)  mother.smokesYes  
##             3056              -284

Basic statistical testing

Does this difference match the linear model?

summary(linear.model.1)
## 
## Call:
## lm(formula = birthwt.grams ~ mother.smokes, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2062.9  -475.9    34.3   545.1  1934.3 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3055.7       66.9   45.65   <2e-16 ***
## mother.smokesYes   -283.8      107.0   -2.65   0.0087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 718 on 187 degrees of freedom
## Multiple R-squared:  0.0363, Adjusted R-squared:  0.0311 
## F-statistic: 7.04 on 1 and 187 DF,  p-value: 0.00867

Basic statistical testing

Does this difference match the linear model?

linear.model.2 <- lm (birthwt.grams ~ mother.age, data=birthwt)
linear.model.2
## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Coefficients:
## (Intercept)   mother.age  
##      2655.7         12.4

Basic statistical testing

summary(linear.model.2)
## 
## Call:
## lm(formula = birthwt.grams ~ mother.age, data = birthwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2294.8  -517.6    10.5   530.8  1774.9 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2655.7      238.9   11.12   <2e-16 ***
## mother.age      12.4       10.0    1.24     0.22    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 728 on 187 degrees of freedom
## Multiple R-squared:  0.00816,    Adjusted R-squared:  0.00285 
## F-statistic: 1.54 on 1 and 187 DF,  p-value: 0.216

Basic statistical testing

Diagnostics: R tries to make it as easy as possible (but no easier). Try in R proper:

plot(linear.model.2)

plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15plot of chunk unnamed-chunk-15