- Running simulations is an integral part of being a statistician in the 21st century
- R provides us with a utility functions for simulations from a wide variety of distributions
- To make your simulation results reproducible, you must set the seed, using
`set.seed()`

- There is a natural connection between iteration, functions, and simulations
- Saving and loading results can be done in two formats: rds and rdata formats

*Exploratory data analysis*

All along, weâ€™ve already been reading in data from the outside, using:

`readLines()`

: reading in lines of text from a file or webpage; returns vector of strings`read.table()`

: read in a data table from a file or webpage; returns data frame`read.csv()`

: like the above, but assumes comma separated data; returns data frame

This is usually a precursor to modeling! And it is not always a trivial first step. To learn more about the above functions and related considerations, you can check out the notes from a previous offering of this course

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. Youâ€™re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called **regression model** for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?

- To assess model validity, predictor importance (
**inference**) - To predict future \(Y\)â€™s from future \(X_1,\ldots,X_p\)â€™s (
**prediction**)

Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables:

`lpsa`

: log PSA score`lcavol`

: log cancer volume`lweight`

: log prostate weight`age`

: age of patient`lbph`

: log of the amount of benign prostatic hyperplasia`svi`

: seminal vesicle invasion`lcp`

: log of capsular penetration`gleason`

: Gleason score`pgg45`

: percent of Gleason scores 4 or 5

```
pros.df =
read.table("http://www.stat.cmu.edu/~ryantibs/statcomp/data/pros.dat")
dim(pros.df)
```

`## [1] 97 9`

`head(pros.df)`

```
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
```

Some example questions we might be interested in:

- What is the relationship between
`lcavol`

and`lweight`

? - What is the relationship between
`svi`

and`lcavol`

,`lweight`

? - Can we predict
`lpsa`

from the other variables? - Can we predict whether
`lpsa`

is high or low, from other variables?

Before pursuing a specific model, itâ€™s generally a good idea to look at your data. When done in a structured way, this is called **exploratory data analysis**. E.g., you might investigate:

- What are the distributions of the variables?
- Are there distinct subgroups of samples?
- Are there any noticeable outliers?
- Are there interesting relationship/trends to model?

`colnames(pros.df) # These are the variables`

```
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason" "pgg45"
## [9] "lpsa"
```

```
par(mfrow=c(3,3), mar=c(4,4,2,0.5)) # Setup grid, margins
for (j in 1:ncol(pros.df)) {
hist(pros.df[,j], xlab=colnames(pros.df)[j],
main=paste("Histogram of", colnames(pros.df)[j]),
col="lightblue", breaks=20)
}
```

What did we learn? A bunch of things! E.g.,

`svi`

, the presence of seminal vesicle invasion, is binary`lcp`

, the log amount of capsular penetration, is very skewed, a bunch of men with little (or none?), then a big spread; why is this?`gleason`

, takes integer values of 6 and larger; how does it relate to`pgg45`

, the percentage of Gleason scores 4 or 5?`lpsa`

, the log PSA score, is close-ish to normally distributed

After asking our doctor friends some questions, we learn:

- When the actual capsular penetration is very small, it canâ€™t be properly measured, so it just gets arbitrarily set to 0.25 (and we can check that
`min(pros.df$lcp)`

\(\approx \log{0.25}\)) - The variable
`pgg45`

measures the percentage of 4 or 5 Gleason scores that were recorded over their visit history*before*their final current Gleason score, stored in`gleason`

; a higher Gleason score is worse, so`pgg45`

tells us something about the severity of their cancer in the past

```
pros.cor = cor(pros.df)
round(pros.cor,3)
```

```
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## lcavol 1.000 0.281 0.225 0.027 0.539 0.675 0.432 0.434 0.734
## lweight 0.281 1.000 0.348 0.442 0.155 0.165 0.057 0.107 0.433
## age 0.225 0.348 1.000 0.350 0.118 0.128 0.269 0.276 0.170
## lbph 0.027 0.442 0.350 1.000 -0.086 -0.007 0.078 0.078 0.180
## svi 0.539 0.155 0.118 -0.086 1.000 0.673 0.320 0.458 0.566
## lcp 0.675 0.165 0.128 -0.007 0.673 1.000 0.515 0.632 0.549
## gleason 0.432 0.057 0.269 0.078 0.320 0.515 1.000 0.752 0.369
## pgg45 0.434 0.107 0.276 0.078 0.458 0.632 0.752 1.000 0.422
## lpsa 0.734 0.433 0.170 0.180 0.566 0.549 0.369 0.422 1.000
```

Some strong correlations! Letâ€™s find the biggest (in absolute value):

```
pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
```

`## [1] 0.7519045`

```
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]),
dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor]
```

`## [1] "gleason" "pgg45"`

This is not surprising, given what we know about `pgg45`

and `gleason`

; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores

Letâ€™s find the second biggest correlation (in absolute value):

`pros.cor.sorted[2]`

`## [1] 0.7344603`

```
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]),
dim(pros.cor))
colnames(pros.df)[vars.big.cor]
```

`## [1] "lcavol" "lpsa"`

This is more interesting! If we wanted to predict `lpsa`

from the other variables, then it seems like we should at least include `lcavol`

as a predictor

`pairs()`

Can easily look at multiple scatter plots at once, using the `pairs()`

function. The first argument is written like a **formula**, with no response variable. Weâ€™ll hold off on describing more about formulas until we learn `lm()`

, shortly

`pairs(~ lpsa + lcavol + lweight + lcp, data=pros.df)`