Fitting Models to Data

Statistical Computing, 36-350

Tuesday November 2, 2021

Last week: Simulation

Part I

Exploratory data analysis

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using functions like read.table() and read.csv() (and readLines() for unstructured text).

This is usually a precursor to modeling! And it is not always a trivial first step. To learn more about these functions and related considerations, you can check out the notes from a previous offering of this course
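For instance, a minimal sketch (the URL is the data file used later in these notes; the read.csv() line is a hypothetical example):

pros.df = read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
# read.csv() works analogously for comma-separated files, e.g.:
# my.df = read.csv("my_data.csv") # Hypothetical local file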

Why fit statistical (regression) models?

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them.

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called a regression model for \(Y\) given \(X_1,\ldots,X_p\). The goal is to estimate the parameters \(\theta\). Why? Broadly, for inference (assessing whether the model is valid, and which predictors matter) and for prediction (predicting future responses from future values of the predictors)
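For a concrete, standard example (not tied to any particular data set): the classical linear regression model posits that \(Y|X_1,\ldots,X_p \sim N(\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p, \sigma^2)\), so that here \(\theta = (\beta_0, \beta_1, \ldots, \beta_p, \sigma^2)\)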

Prostate cancer data

Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables: lcavol (log cancer volume), lweight (log prostate weight), age (age of patient), lbph (log of the amount of benign prostatic hyperplasia), svi (seminal vesicle invasion), lcp (log of capsular penetration), gleason (Gleason score), pgg45 (percent of Gleason scores that were 4 or 5), and lpsa (log PSA score)

pros.df = 
  read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
## [1] 97  9
head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Some example questions we might be interested in: Which variables are related to lpsa, the log PSA score? Which of them would make good predictors of lpsa? How does svi, the presence of seminal vesicle invasion, relate to the volume and weight of the cancer?

Exploratory data analysis

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate: the distributions of the individual variables, the correlations between pairs of variables, and the relationships between variables over interesting subsets of the observations
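A quick numeric first pass (a minimal sketch, not part of the original slides) is to call summary(), which reports the min, quartiles, mean, and max of each column:

summary(pros.df) # Five-number summary plus the mean, for each column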

Distributions of prostate cancer variables

colnames(pros.df) # These are the variables
## [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason" "pgg45"  
## [9] "lpsa"
par(mfrow=c(3,3), mar=c(4,4,2,0.5)) # Setup grid, margins
for (j in 1:ncol(pros.df)) {
  hist(pros.df[,j], xlab=colnames(pros.df)[j],
       main=paste("Histogram of", colnames(pros.df)[j]),
       col="lightblue", breaks=20)
}

What did we learn? A bunch of things! E.g., svi is binary, gleason takes just a few integer values, and lcp has a big spike at its smallest value (more on this shortly)

After asking our doctor friends some questions, we learn: svi indicates the presence of seminal vesicle invasion, a sign that the cancer has spread beyond the prostate (more on this later); and pgg45 summarizes a patient’s history of Gleason scores (the percent that were 4 or 5), so we’d expect it to be closely related to gleason, the current Gleason score

Correlations between prostate cancer variables

pros.cor = cor(pros.df)
round(pros.cor,3) 
##         lcavol lweight   age   lbph    svi    lcp gleason pgg45  lpsa
## lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434 0.734
## lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107 0.433
## age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276 0.170
## lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078 0.180
## svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458 0.566
## lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632 0.549
## gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752 0.369
## pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000 0.422
## lpsa     0.734   0.433 0.170  0.180  0.566  0.549   0.369 0.422 1.000
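As an optional aside (a sketch, not in the original notes), a heat map can make the strong correlations easier to spot. Base R’s image() works, though it draws matrices rotated from the usual reading orientation, hence the transpose and row flip below:

p = ncol(pros.cor) # Number of variables
# image() displays a matrix rotated 90 degrees counterclockwise, so we
# transpose and reverse the row order to keep the usual orientation
image(1:p, 1:p, t(pros.cor[p:1,]), xlab="", ylab="", axes=FALSE,
      main="Correlation matrix")
axis(1, at=1:p, labels=colnames(pros.cor), las=2)
axis(2, at=1:p, labels=rev(rownames(pros.cor)), las=2)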

Some strong correlations! Let’s find the biggest (in absolute value):

pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor), decreasing=TRUE)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]), 
                        dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor] 
## [1] "gleason" "pgg45"

This is not surprising, given what we know about pgg45 and gleason; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores.
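Since arrayInd() may be unfamiliar, here is a toy illustration (not from the original notes): it converts a “flat” vector index into a matrix position, i.e., a (row, column) pair:

m = matrix(1:6, nrow=2) # 2 x 3 matrix, filled column by column
which(m == 5) # Flat index of the entry equal to 5: returns 5
arrayInd(which(m == 5), dim(m)) # Matrix position: row 1, column 3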

Let’s find the second biggest correlation (in absolute value):

pros.cor.sorted[2]
## [1] 0.7344603
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]), 
                        dim(pros.cor))
colnames(pros.df)[vars.big.cor] 
## [1] "lcavol" "lpsa"

This is more interesting! If we wanted to predict lpsa from the other variables, then it seems like we should at least include lcavol as a predictor.

Visualizing relationships among variables, with pairs()

We can easily look at multiple scatter plots at once using the pairs() function. The first argument is written like a formula, with no response variable. We’ll hold off on describing more about formulas until we learn lm(), shortly.

pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df)

Inspecting relationships over a subset of the observations

As we’ve seen, lcp takes a bunch of really low values that don’t appear to have strong relationships with other variables. Let’s get rid of them and see what the relationships look like.

pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp),]
nrow(pros.df.subset) # Beware, we've lost nearly half of our data!
## [1] 52
pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df.subset)

Testing means between two different groups

Recall that svi, the presence of seminal vesicle invasion, is binary:

table(pros.df$svi)
## 
##  0  1 
## 76 21

From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1476128/:

“When the pathologist’s report following radical prostatectomy describes seminal vesicle invasion (SVI) … prostate cancer in the areolar connective tissue around the seminal vesicles and outside the prostate … generally the outlook for the patient is poor.”

Does seminal vesicle invasion relate to the volume of cancer? Weight of cancer?

Let’s do some plotting first:

pros.df$svi = factor(pros.df$svi) 
par(mfrow=c(1,2))
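# Note: with a factor on the x-axis, plot() draws side-by-side boxplots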
plot(pros.df$svi, pros.df$lcavol, main="lcavol versus svi",
     xlab="SVI (0=no, 1=yes)", ylab="Log cancer volume")
plot(pros.df$svi, pros.df$lweight, main="lweight versus svi",
     xlab="SVI (0=no, 1=yes)", ylab="Log cancer weight")
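
To follow up the plots with a formal comparison (a sketch of the natural next step, not shown above), we can use R’s built-in t.test(); the formula interface splits the numeric variable by the two levels of svi, and Welch’s two-sample t-test is the default:

# Test whether mean lcavol differs between the svi = 0 and svi = 1 groups
t.test(lcavol ~ svi, data=pros.df)
# Similarly for lweight
t.test(lweight ~ svi, data=pros.df)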