Name:

Andrew ID:

Collaborated with:

This lab is to be done in class (completed outside of class time if need be). You can collaborate with your classmates, but you must identify their names above, and you must submit **your own** lab as an knitted PDF file on Gradescope, by Saturday 6pm, this week.

**This week’s agenda**: getting familiar with basic plotting tools; understanding the way layers work; recalling basic text manipulations; producing histograms and overlaid histograms; heatmaps.

**1a.**Below is some code that is very similar to that from the lecture, but with one key difference. Explain: why does the`plot()`

result with with`type="p"`

look normal, but the`plot()`

result with`type="l"`

look abnormal, having crossing lines? Then modify the code below (hint: modify the definition of`x`

), so that the lines on the second plot do not cross.

```
n = 50
set.seed(0)
x = runif(n, min=-2, max=2)
y = x^3 + rnorm(n)
plot(x, y, type="p")
```

`plot(x, y, type="l")`

`# YOUR CODE GOES HERE`

**1b.**The`cex`

argument can used to shrink or expand the size of the points that are drawn. Its default value is 1 (no shrinking or expansion). Values between 0 and 1 will shrink points, and values larger than 1 will expand points. Plot`y`

versus`x`

, first with`cex`

equal to 0.5 and then 2 (so, two separate plots). Give titles “Shrunken points”, and “Expanded points”, to the plots, respectively.

`# YOUR CODE GOES HERE`

**1c.**The`xlim`

and`ylim`

arugments can be used to change the limits on the x-axis and y-axis, repsectively. Each argument takes a vector of length 2, as in`xlim = c(-1, 0)`

, to set the x limit to be from -1 to 0. Plot`y`

versus`x`

, with the x limit set to be from -1 to 1, and the y limit set to be from -5 to 5. Assign x and y labels “Trimmed x” and “Trimmed y”, respectively.

`# YOUR CODE GOES HERE`

**1d.**Again plot`y`

versus`x`

, only showing points whose x values are between -1 and 1. But this time, define`x.trimmed`

to be the subset of`x`

between -1 and 1, and define`y.trimmed`

to be the corresponding subset of`y`

. Then plot`y.trimmed`

versus`x.trimmed`

without setting`xlim`

and`ylim`

: now you should see that the y limit is (automatically) set as “tight” as possible. Hint: use logical indexing to define`x.trimmed`

,`y.trimmed`

.

`# YOUR CODE GOES HERE`

**1e.**The`pch`

argument, recall, controls the point type in the display. In the lecture examples, we set it to a single number. But it can also be a vector of numbers, with one entry per point in the plot. So, e.g.,`plot(1:10, 1:10, pch=1:10)`

displays the first 10 point types. If

`pch`

is a vector whose length is shorter than the total number of points to be plotted, then its entries are recycled, as appropriate. Plot`y`

versus`x`

, with the point type alternating in between an empty circle and a filled circle.

`# YOUR CODE GOES HERE`

**1f.**The`col`

argument, recall, controls the color the points in the display. It operates similar to`pch`

, in the sense that it can be a vector, and if the length of this vector is shorter than the total number of points, then it is recycled appropriately. Plot`y`

versus`x`

, and repeat the following pattern for the displayed points: a black empty circle, a blue filled circle, a black empty circle, a red filled circle.

`# YOUR CODE GOES HERE`

**2a.**Produce a scatter plot of`y`

versus`x`

, and set the title and axes labels as you see fit. Then overlay on top a scatter plot of`y2`

versus`x2`

, using the`points()`

function, where`x2`

and`y2`

are as defined below. In the call to`points()`

, set the`pch`

and`col`

arguments appropriately so that the overlaid points are drawn as filled blue circles.

```
x2 = sort(runif(n, min=-2, max=2))
y2 = x^2 + rnorm(n)
```

`# YOUR CODE GOES HERE`

**2b.**Starting with your solution code from the last question, overlay a line plot of`y2`

versus`x2`

on top of the plot (which contains empty black circles of`y`

versus`x`

, and filled blue circles of`y2`

versus`x2`

), using the`lines()`

function. In the call to`lines()`

, set the`col`

and`lwd`

arguments so that the line is drawn in red, with twice the normal thickness. Look carefully at your resulting plot. Does the red line pass overtop of or underneath the blue filled circles? What do you conclude about the way R*layers*these additions to your plot?

`# YOUR CODE GOES HERE`

**2c.**Starting with your solution code from the last question, add a legend to the bottom right corner of the the plot using`legend()`

. The legend should display the text: “Cubic” and “Quadratic”, with corresponding symbols: an empty black circle and a filled blue circle, respectively. Hint: it will help to look at the documentation for`legend()`

.

`# YOUR CODE GOES HERE`

**2d.**Produce a plot of`y`

versus`x`

, but with a gray rectangle displayed underneath the points, which runs has a lower left corner at`c(-2, qnorm(0.1))`

, and an upper right corner at`c(2, qnorm(0.9))`

. Hint: use`rect()`

and consult its documentation. Also, remember how layers work; call`plot()`

, with`type="n"`

or`col="white"`

in order to refrain from drawing any points in the first place, then call`rect()`

, then call`points()`

.

`# YOUR CODE GOES HERE`

**Challenge.**Produce a plot of`y`

versus`x`

, but with a gray tube displayed underneath the points. Specifically, this tube should fill in the space between the two curves defined by \(y=x^3 \pm q\), where \(q\) is the 90th percentile of the standard normal distribution (i.e., equal to`qnorm(0.90)`

). Hint: use`polygon()`

and consult its documentation; this function requires that the x coordinates of the polygon be passed in an appropriate order; you might find it useful to use`c(x, rev(x))`

for the x coordinates. Lastly, add a legend to the bottom right corner of the plot, with the text: “Data”, “Confidence band”, and corresponding symbols: an empty circle, a very thick gray line, respectively.

`# YOUR CODE GOES HERE`

Below, we read in two data sets of the 1000 fastest times ever recorded for the 100m sprint, in men’s and women’s track., as seen in previous labs.

```
sprint.m.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.m.txt",
sep="\t", quote="", header=TRUE)
sprint.w.df = read.table(
file="http://www.stat.cmu.edu/~ryantibs/statcomp/data/sprint.w.txt",
sep="\t", quote="", header=TRUE)
```

**3a.**Define`sprint.m.times`

to be the first 4 characters of the`Time`

column of`sprint.m.df`

, and`sprint.m.years`

to be the last 4 characters of the`Date`

column of`sprint.m.df`

. Hint: use`substr()`

. Convert both to numeric vectors, and print the first 10 entries of each.

`# YOUR CODE GOES HERE`

**3b.**Plot`sprint.m.times`

versus`sprint.m.years`

. For the point type, use small, filled black circles. Label the x-axis “Year” and the y-axis “Time (seconds)”. Title the plot “Fastest men’s 100m sprint times”. Using`abline()`

, draw a dashed blue horizontal line at 9.95 seconds. Using`text()`

, draw below this line, in text on the plot, the string “N men”, replacing “N” here by the number of men who have run under 9.95 seconds. Your code should programmatically determine the correct number here, and use`paste()`

to form the string. Comment on what you see visually, as per the sprint times across the years. What does the trend look like for the fastest time in any given year?

`# YOUR CODE GOES HERE`

**3c.**Reproduce the previous plot, but this time, draw a light blue rectangle underneath all of the points below the 9.95 second mark. The rectangle should span the entire region of the plot below the horizontal line at \(y=9.95\). And not only the points of sprint times, but the blue dashed line, and the text “N men” (with “N” replaced by the appropriate number) should appear*on top*of the rectangle. Hint: use`rect()`

and layering as appropriate.

`# YOUR CODE GOES HERE`

**3d.**Repeat Q3a but for the women’s sprint data, with one critical difference—to form the vector of times, take the**first 5 characters**of the`Time`

column—and arrive at vectors`sprint.w.times`

and`sprint.w.years`

. Then repeat Q3c for this data, but with the 9.95 second cutoff being replaced by 10.95 seconds, the rectangle colored pink, and the dashed line colored red. Comment on the differences between this plot for the women and your plot for the men, from Q4c. In particular, is there any apparent difference in the trend for the fastest sprint time in any given year?

`# YOUR CODE GOES HERE`

**4a.**Extract the birth years of the sprinters from the data frame`sprint.m.df`

. To do so, define a character vector`sprint.m.byears`

to contain the last 2 characters of the`Birthdate`

column of`sprint.m.df`

. Then convert`sprint.m.byears`

into a numeric vector, add 1900 to each entry, and redefine`sprint.m.byears`

to be the result. Repeat the same, but for the women’s data, arriving at a vector called`sprint.w.byears`

.

`# YOUR CODE GOES HERE`

**4b.**What are the ranges of birth years you computed in the last part, for the men’s and women’s data? Do you see any problems with this? You should! Fix the errant entries in`sprint.m.byears`

and`sprint.w.byears`

using simple indexing and arithmetic. Hint: none of these athletes were born before 1921.

`# YOUR CODE GOES HERE`

**4c.**Now that you have fixed their birthdates, create a vector`sprint.m.ages`

containing the age (in years) of each male sprinter when their sprint time was recorded. Do the same for the female sprinters, resulting in`sprint.w.ages`

. Hint: use`sprint.m.years`

and`sprint.w.years`

.

`# YOUR CODE GOES HERE`

**4d.**Using one of the apply functions, compute the average sprint time for each age in`sprint.m.ages`

, calling the result`time.m.avg.by.age`

. Similarly, compute the analogous quantity for the women, calling the result`time.w.avg.by.age`

. Are there any ages for which the men’s average time is faster than 9.9 seconds, and if so, which ones? Are there any ages for which the women’s average time is faster than 10.9 seconds, and if so, which ones?

`# YOUR CODE GOES HERE`

**4e.**Plot a histogram of`sprint.m.ages`

, with break locations occuring at every age in between 17 and 40. Color the histogram to your liking; label the x-axis, and title the histogram appropriately. What is the mode, i.e., the most common age? Also, describe what you see around the mode: do we see more sprinters who are younger, or older?

`# YOUR CODE GOES HERE`

**Challenge.**Plot a histogram of`sprint.m.ages`

, now with`probability=TRUE`

(so it is on the probability scale, rather than raw frequency scale). Overlay a histogram of`sprint.w.ages`

, also with`probability=TRUE`

. Set the break locations so that the plot captures the full range of the very youngest to the very oldest sprinter present among both men and women. Your code should determine these limits programmatically. Choose colors of your liking, but use transparency as appropriate so that the shapes of both histograms are visible; label the x-axis, and title the histogram appropriately. Add a legend to the histogram, identifying the histogram bars from the men and women. Compare, roughly, the shapes of the two histograms: is there a difference between the age distributions of the world’s fastest men and fastest women?

`# YOUR CODE GOES HERE`

**Challenge.**The`volcano`

object in R is a matrix of dimension 87 x 61. It is a digitized version of a topographic map of the Maungawhau volcano in Auckland, New Zealand. Plot a heatmap of the volcano using`image()`

, with 25 colors from the terrain color palette.

`# YOUR CODE GOES HERE`

**Challenge.**Each row of`volcano`

corresponds to a grid line running east to west. Each column of`volcano`

corresponds to a grid line running south to north. Define a matrix`volcano.rev`

by reversing the order of the rows, as well as the order of the columns, of`volcano`

. Therefore, each row`volcano.rev`

should now correspond to a grid line running west to east, and each column of`volcano.rev`

a grid line running north to south.

`# YOUR CODE GOES HERE`

**Challenge.**If we printed out the matrix`volcano.rev`

to the console, then the elements would follow proper geographic order: left to right means west to east, and top to bottom means north to south. Now, produce a heatmap of the volcano that follows the same geographic order. Hint: recall that the`image()`

function rotates a matrix 90 degrees counterclockwise before displaying it; and recall the function`clockwise90()`

from the lecture, which you can copy and paste into your code here. Label the x-axis “West –> East”, and the y-axis “South –> North”. Title the plot “Heatmap of Maungawhau volcano”.

`# YOUR CODE GOES HERE`

**Challenge.**Reproduce the previous plot, and now draw contour lines on top of the heatmap.

`# YOUR CODE GOES HERE`

**Challenge.**The function`filled.contour()`

provides an alternative way to create a heatmap with contour lines on top. It uses the same orientation as`image()`

when plotting a matrix. Use`filled.contour()`

to plot a heatmap of the volcano, with (light) contour lines automatically included. Make sure the orientation of the plot matches proper geographic orientation, as in the previous question. Use a color scale of your choosing, and label the axes and title the plot appropriately. It will help to consult the documentation for`filled.contour()`

.

`# YOUR CODE GOES HERE`