STATISTICS 545 - DATA ANALYSIS

(97/98 Term 1)

The introductory handout is here (LaTeX file, readable as text). End-of-term comments from last year's course are also available.

The text for the course is

Modern Applied Statistics with S-Plus, Second Edition, by Bill Venables and Brian Ripley, 1997.

The book has a web page, containing useful information and links. Check out the extra material in the "complements".

To hone our data analysis skills, we will need lots of data sets to analyze! Here are some sources of data sets:

The software which accompanies the textbook includes all the data sets mentioned in the book, as well as some others. See Appendix A of the text for a list.
StatLib is a general-purpose server for the statistical community. Data sets are available from several sites on StatLib, including:
- The datasets archive contains many varied data sets, as well as collections of data sets from some books.
- The Data and Story Library is an "online library of datafiles and stories that illustrate the use of basic statistics methods."
- There are several other sites on StatLib which contain data sets. Check out the index on the StatLib home page.
A Data Sources page is maintained at the University of Nevada.
UBC Statistics members can access some data sets available on the department network. Check the department gopher under computing.

If you are looking for S software, try the S-archive.

As they become available, course materials are posted below.

The galaxy velocity examples were used to illustrate histograms and kernel density estimation (S-code).
The baseball salary examples were used to illustrate bootstrapping (S-code, data). I found these data at the Chance Database.
There were a couple of graphical output examples, one based on the Stormer Viscometer data (S-code), and the other based on the Boston housing data (S-code).
The Mercury in Bass data (found at DASL) were analyzed with weighted linear regression (S-code). We also had some examples of the S-Plus model formula syntax (S-code).
During our discussion of generalized linear models, we used simulations to learn about residuals (S-code) and overdispersion (S-code).
We analyzed the CPU data (available in the MASS library) using additive models and projection pursuit regression (S-code), and then later we tried fitting neural networks to these data (S-code).
We used a small simulation study to see how cross-validation helps guard against overfitting (S-code).
We examined how well regression trees worked with some simulated data (S-code).
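
For readers without S-Plus at hand, here is a minimal sketch in Python of the kind of simulation mentioned in the cross-validation item above. It is not the course's S-code: the sine-curve truth, the noise level, the k-nearest-neighbour smoother, and all function names are assumptions chosen for illustration. The point it tries to show is that training error always favours the most flexible fit (k = 1), while leave-one-out cross-validation favours a smoother one.

```python
import math
import random


def knn_predict(x0, xs, ys, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def training_mse(xs, ys, k):
    """Mean squared error when each point is predicted from the full sample."""
    errors = [(y - knn_predict(x, xs, ys, k)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


def loo_cv_mse(xs, ys, k):
    """Leave-one-out CV: predict each point from all of the other points."""
    total = 0.0
    for i in range(len(xs)):
        xs_rest = xs[:i] + xs[i + 1:]
        ys_rest = ys[:i] + ys[i + 1:]
        total += (ys[i] - knn_predict(xs[i], xs_rest, ys_rest, k)) ** 2
    return total / len(xs)


# Simulated data (assumed setup): noisy observations of a sine curve.
random.seed(545)
n = 200
xs = [random.random() for _ in range(n)]
ys = [math.sin(2 * math.pi * x) + random.gauss(0, 0.5) for x in xs]

# Small k means a wiggly, flexible fit; large k means a smooth fit.
ks = [1, 3, 5, 9, 15, 25]
train = {k: training_mse(xs, ys, k) for k in ks}
cv = {k: loo_cv_mse(xs, ys, k) for k in ks}
best_k = min(ks, key=lambda k: cv[k])

for k in ks:
    print(f"k={k:2d}  training MSE={train[k]:.3f}  LOO-CV MSE={cv[k]:.3f}")
print("k chosen by cross-validation:", best_k)
```

With k = 1 each point is its own nearest neighbour, so the training error is exactly zero even though the fit simply reproduces the noise. The cross-validated error for k = 1 is much larger, because each point must then be predicted from a different noisy neighbour.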