Statistical Computing, 36-350

Tuesday November 2, 2021

- Running simulations is an integral part of being a statistician in the 21st century
- R provides us with a utility functions for simulations from a wide variety of distributions
- To make your simulation results reproducible, you must set the seed, using
`set.seed()`

- There is a natural connection between iteration, functions, and simulations
- Saving and loading results can be done in two formats: rds and rdata formats

*Exploratory data analysis*

All along, we’ve already been reading in data from the outside, using:

`readLines()`

: reading in lines of text from a file or webpage; returns vector of strings`read.table()`

: read in a data table from a file or webpage; returns data frame`read.csv()`

: like the above, but assumes comma separated data; returns data frame

This is usually a precursor to modeling! And it is not always a trivial first step. To learn more about the above functions and related considerations, you can check out the notes from a previous offering of this course

You have some data \(X_1,\ldots,X_p,Y\): the variables \(X_1,\ldots,X_p\) are called predictors, and \(Y\) is called a response. You’re interested in the relationship that governs them

So you posit that \(Y|X_1,\ldots,X_p \sim P_\theta\), where \(\theta\) represents some unknown parameters. This is called **regression model** for \(Y\) given \(X_1,\ldots,X_p\). Goal is to estimate parameters. Why?

- To assess model validity, predictor importance (
**inference**) - To predict future \(Y\)’s from future \(X_1,\ldots,X_p\)’s (
**prediction**)

Recall the data set on 97 men who have prostate cancer (from the book The Elements of Statistical Learning). The measured variables:

`lpsa`

: log PSA score`lcavol`

: log cancer volume`lweight`

: log prostate weight`age`

: age of patient`lbph`

: log of the amount of benign prostatic hyperplasia`svi`

: seminal vesicle invasion`lcp`

: log of capsular penetration`gleason`

: Gleason score`pgg45`

: percent of Gleason scores 4 or 5

`## [1] 97 9`

```
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
```

Some example questions we might be interested in:

- What is the relationship between
`lcavol`

and`lweight`

? - What is the relationship between
`svi`

and`lcavol`

,`lweight`

? - Can we predict
`lpsa`

from the other variables? - Can we predict whether
`lpsa`

is high or low, from other variables?

Before pursuing a specific model, it’s generally a good idea to look at your data. When done in a structured way, this is called **exploratory data analysis**. E.g., you might investigate:

- What are the distributions of the variables?
- Are there distinct subgroups of samples?
- Are there any noticeable outliers?
- Are there interesting relationship/trends to model?

```
## [1] "lcavol" "lweight" "age" "lbph" "svi" "lcp" "gleason" "pgg45"
## [9] "lpsa"
```

What did we learn? A bunch of things! E.g.,

`svi`

, the presence of seminal vesicle invasion, is binary`lcp`

, the log amount of capsular penetration, is very skewed, a bunch of men with little (or none?), then a big spread; why is this?`gleason`

, takes integer values of 6 and larger; how does it relate to`pgg45`

, the percentage of Gleason scores 4 or 5?`lpsa`

, the log PSA score, is close-ish to normally distributed

After asking our doctor friends some questions, we learn:

- When the actual capsular penetration is very small, it can’t be properly measured, so it just gets arbitrarily set to 0.25 (and we can check that
`min(pros.df$lcp)`

\(\approx \log{0.25}\)) - The variable
`pgg45`

measures the percentage of 4 or 5 Gleason scores that were recorded over their visit history*before*their final current Gleason score, stored in`gleason`

; a higher Gleason score is worse, so`pgg45`

tells us something about the severity of their cancer in the past

```
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## lcavol 1.000 0.281 0.225 0.027 0.539 0.675 0.432 0.434 0.734
## lweight 0.281 1.000 0.348 0.442 0.155 0.165 0.057 0.107 0.433
## age 0.225 0.348 1.000 0.350 0.118 0.128 0.269 0.276 0.170
## lbph 0.027 0.442 0.350 1.000 -0.086 -0.007 0.078 0.078 0.180
## svi 0.539 0.155 0.118 -0.086 1.000 0.673 0.320 0.458 0.566
## lcp 0.675 0.165 0.128 -0.007 0.673 1.000 0.515 0.632 0.549
## gleason 0.432 0.057 0.269 0.078 0.320 0.515 1.000 0.752 0.369
## pgg45 0.434 0.107 0.276 0.078 0.458 0.632 0.752 1.000 0.422
## lpsa 0.734 0.433 0.170 0.180 0.566 0.549 0.369 0.422 1.000
```

Some strong correlations! Let’s find the biggest (in absolute value):

```
pros.cor[lower.tri(pros.cor,diag=TRUE)] = 0 # Why only upper tri part?
pros.cor.sorted = sort(abs(pros.cor),decreasing=T)
pros.cor.sorted[1]
```

`## [1] 0.7519045`

```
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[1]),
dim(pros.cor)) # Note: arrayInd() is useful
colnames(pros.df)[vars.big.cor]
```

`## [1] "gleason" "pgg45"`

This is not surprising, given what we know about `pgg45`

and `gleason`

; essentially this is saying: if their Gleason score is high now, then they likely had a bad history of Gleason scores

Let’s find the second biggest correlation (in absolute value):

`## [1] 0.7344603`

```
vars.big.cor = arrayInd(which(abs(pros.cor)==pros.cor.sorted[2]),
dim(pros.cor))
colnames(pros.df)[vars.big.cor]
```

`## [1] "lcavol" "lpsa"`

This is more interesting! If we wanted to predict `lpsa`

from the other variables, then it seems like we should at least include `lcavol`

as a predictor

`pairs()`

Can easily look at multiple scatter plots at once, using the `pairs()`

function. The first argument is written like a **formula**, with no response variable. We’ll hold off on describing more about formulas until we learn `lm()`

, shortly

As we’ve seen, the `lcp`

takes a bunch of really low values, that don’t appear to have strong relationships with other variables. Let’s get rid of them and see what the relationships look like

```
pros.df.subset = pros.df[pros.df$lcp > min(pros.df$lcp),]
nrow(pros.df.subset) # Beware, we've lost a half of our data!
```

`## [1] 52`

Recall that `svi`

, the presence of seminal vesicle invasion, is binary:

```
##
## 0 1
## 76 21
```

From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1476128/:

“When the pathologist’s report following radical prostatectomy describes seminal vesicle invasion (SVI) … prostate cancer in the areolar connective tissue around the seminal vesicles and outside the prostate …generally the outlook for the patient is poor.”

Does seminal vesicle invasion relate to the volume of cancer? Weight of cancer?

Let’s do some plotting first: