Last week: Simulation

• Running simulations is an integral part of being a statistician in the 21st century
• R provides us with a utility functions for simulations from a wide variety of distributions
• To make your simulation results reproducible, you must set the seed, using set.seed()
• There is a natural connection between iteration, functions, and simulations
• Saving and loading results can be done in two formats: rds and rdata formats

Part I

Exploratory data analysis

Reading in data from the outside

All along, we’ve already been reading in data from the outside, using:

• readLines(): reading in lines of text from a file or webpage; returns vector of strings
• read.table(): read in a data table from a file or webpage; returns data frame
• read.csv(): like the above, but assumes comma separated data; returns data frame

This is usually a precursor to modeling! And it is not always a trivial first step. To learn more about the above functions and related considerations, you can check out the notes from a previous offering of this course

Why fit statistical (regression) models?

You have some data $$X_1,\ldots,X_p,Y$$: the variables $$X_1,\ldots,X_p$$ are called predictors, and $$Y$$ is called a response. You’re interested in the relationship that governs them

So you posit that $$Y|X_1,\ldots,X_p \sim P_\theta$$, where $$\theta$$ represents some unknown parameters. This is called regression model for $$Y$$ given $$X_1,\ldots,X_p$$. Goal is to estimate parameters. Why?

• To assess model validity, predictor importance (inference)
• To predict future $$Y$$’s from future $$X_1,\ldots,X_p$$’s (prediction)

Prostate cancer data

Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables:

• lpsa: log PSA score
• lcavol: log cancer volume
• lweight: log prostate weight
• age: age of patient
• lbph: log of the amount of benign prostatic hyperplasia
• svi: seminal vesicle invasion
• lcp: log of capsular penetration
• gleason: Gleason score
• pgg45: percent of Gleason scores 4 or 5
pros.df =
dim(pros.df)
## [1] 97  9

head(pros.df)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Some example questions we might be interested in:

• What is the relationship between lcavol and lweight?
• What is the relationship between svi and lcavol, lweight?
• Can we predict lpsa from the other variables?
• Can we predict whether lpsa is high or low, from other variables?

Exploratory data analysis

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called exploratory data analysis. E.g., you might investigate:

• What are the distributions of the variables?
• Are there distinct subgroups of samples?
• Are there any noticeable outliers?
• Are there interesting relationship/trends to model?

Distributions of prostate cancer variables

colnames(pros.df) # These are the variables
## [1] "lcavol"  "lweight" "age"     "lbph"    "svi"     "lcp"     "gleason" "pgg45"
## [9] "lpsa"
par(mfrow=c(3,3), mar=c(4,4,2,0.5)) # Setup grid, margins
for (j in 1:ncol(pros.df)) {
hist(pros.df[,j], xlab=colnames(pros.df)[j],
main=paste("Histogram of", colnames(pros.df)[j]),
col="lightblue", breaks=20)
}

What did we learn? A bunch of things! E.g.,

• svi, the presence of seminal vesicle invasion, is binary
• lcp, the log amount of capsular penetration, is very skewed, a bunch of men with little (or none?), then a big spread; why is this?
• gleason, takes integer values of 6 and larger; how does it relate to pgg45, the percentage of Gleason scores 4 or 5?
• lpsa, the log PSA score, is close-ish to normally distributed

After asking our doctor friends some questions, we learn:

• When the actual capsular penetration is very small, it can’t be properly measured, so it just gets arbitrarily set to 0.25 (and we can check that min(pros.df\$lcp) $$\approx \log{0.25}$$)
• The variable pgg45 measures the percentage of 4 or 5 Gleason scores that were recorded over their visit history before their final current Gleason score, stored in gleason; a higher Gleason score is worse, so pgg45 tells us something about the severity of their cancer in the past

Correlations between prostate cancer variables

pros.cor = cor(pros.df)
round(pros.cor,3) 
##         lcavol lweight   age   lbph    svi    lcp gleason pgg45  lpsa
## lcavol   1.000   0.281 0.225  0.027  0.539  0.675   0.432 0.434 0.734
## lweight  0.281   1.000 0.348  0.442  0.155  0.165   0.057 0.107 0.433
## age      0.225   0.348 1.000  0.350  0.118  0.128   0.269 0.276 0.170
## lbph     0.027   0.442 0.350  1.000 -0.086 -0.007   0.078 0.078 0.180
## svi      0.539   0.155 0.118 -0.086  1.000  0.673   0.320 0.458 0.566
## lcp      0.675   0.165 0.128 -0.007  0.673  1.000   0.515 0.632 0.549
## gleason  0.432   0.057 0.269  0.078  0.320  0.515   1.000 0.752 0.369
## pgg45    0.434   0.107 0.276  0.078  0.458  0.632   0.752 1.000 0.422
## lpsa     0.734   0.433 0.170  0.180  0.566  0.549   0.369 0.422 1.000

Some strong correlations! Let’s find the biggest (in absolute value):

pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
## [1] 0.7519045
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]),
dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor] 
## [1] "gleason" "pgg45"

This is not surprising, given what we know about pgg45 and gleason; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores

Let’s find the second biggest correlation (in absolute value):

pros.cor.sorted[2]
## [1] 0.7344603
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]),
dim(pros.cor))
colnames(pros.df)[vars.big.cor] 
## [1] "lcavol" "lpsa"

This is more interesting! If we wanted to predict lpsa from the other variables, then it seems like we should at least include lcavol as a predictor

Visualizing relationships among variables, with pairs()

Can easily look at multiple scatter plots at once, using the pairs() function. The first argument is written like a formula, with no response variable. We’ll hold off on describing more about formulas until we learn lm(), shortly

pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df)