Advanced Data Analysis I

CMU 36-401 (Fall 2000)

Course Policies

Prerequisites:

I will assume that you have had a course in statistical methods, including confidence intervals, hypothesis testing, maximum likelihood estimation, and basic distributions, such as the normal, t, chi-squared and F distributions. If you have taken the CMU courses 36-226, 36-326 or 36-310, then you have the appropriate statistical background. You will need to know some linear algebra, such as might be covered in a one-semester matrix algebra course. You will make extensive use of computer packages during the semester, and you will need to use S-plus on the campus Unix system. If I assume that you know something that you have never learned, stop me in class or see me after class as soon as possible.

Objectives:

  1. To develop skills in exploring data, building and fitting models, investigating model assumptions, and interpreting results from statistical models.
  2. To learn the theory of linear regression analysis and apply it to real problems, in order to investigate model assumptions and take appropriate action if the assumptions do not hold.
  3. To develop skills in the use of a modern statistical software package.
  4. To learn to write cogent and concise data analysis reports.

Textbooks:

Class Organization:

Class time will include lectures highlighting important or difficult concepts from the Regression with Graphics textbook, demonstrations of S-plus, additional examples of real data analysis, small group work with homework-type problems, and discussion of data analysis issues. You will be expected to learn the concepts found in both the lectures and assigned readings, even if the latter are not discussed in class.

Course Overview:

This course is intended to be an introduction to the real world of statistics, focusing on perhaps the most widely used tool in statistical science, namely linear regression. From econometrics to medical imaging, linear regression provides a language and a body of techniques for specifying and measuring the quantitative relationships between inputs and outputs in a system. We will look at real data, try various models for the data, assess the validity of assumptions, and try to reach conclusions. Computer programs will do most of the calculations, and we will be able to concentrate on the thinking. More broadly, it can be fairly said that most parametric statistical models are in some sense a generalization of the linear regression model. If you understand the issues in linear regression, you understand, at least qualitatively, the issues in all statistical modeling and estimation problems. So we will spend some time on the theory underlying linear regression, as well as some generalizations and extensions, but only so far as it aids your ability to analyze real data.

I will stress the fact that linear regression, like all statistical techniques, involves assumptions. A typical statistical analysis begins with simple graphical and descriptive analyses, followed by the application of formal statistical techniques. The analysis must then be followed up by diagnostics that check whether the assumptions have been violated. Most good analyses include a heavy dose of graphical techniques. The golden rule in this course is: Never do a statistical analysis without first plotting the data. Also bear in mind that there is never one right way to do an analysis; typically, many analyses are performed.
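The plot-fit-diagnose sequence described above can be sketched in a few lines. This is only an illustration under stated assumptions: it is written in Python rather than the S-plus used in the course, the data are made up, and the helper names (text_scatter, fit_line, residuals) are hypothetical, not part of any course software.

```python
# Made-up illustration data (an assumption, not course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]


def text_scatter(x, y, width=40, height=10):
    """Return a crude text scatterplot (assumes x and y each vary)."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - yi) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    return "\n".join("".join(line) for line in grid)


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed minus fitted values, the raw material for diagnostics."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Step 1: look at the data before fitting anything.
print(text_scatter(x, y))

# Step 2: apply the formal technique (here, a straight-line fit).
b0, b1 = fit_line(x, y)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")

# Step 3: diagnostics. Plot the residuals against x and look for
# curvature or changing spread, which would cast doubt on the model.
res = residuals(x, y, b0, b1)
print(text_scatter(x, res))
```

In practice the two text plots would be replaced by proper graphics from a statistical package; the point of the sketch is only the order of the steps.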

You will also get plenty of practice writing reports of the types expected of you in the real world. We will primarily focus on reports for intelligent non-statisticians (e.g. a company officer) and brief documentation of what you did for a statistically competent supervisor.

Homework and Assessment:

There will be weekly assignments, a mid-term in-class exam, a project, and an end-of-term final exam. Do not schedule your plane home on or before 12/21 until we know what day our final will be! The homeworks will include both mathematical problems and data analyses. I encourage you to discuss the assignments with each other, but the work that you hand in should be your own. This means that each student must perform all analyses on her/his own computer, and must independently write up the analysis. Because homework will be an important component of the final grade, you must not copy mathematical derivations, computer output and input, or written descriptions from anyone or anywhere else, without reporting the source with your work. Plagiarism will be swiftly dealt with to the full extent allowed under CMU policies. Please review the CMU policies on cheating and plagiarism.

The assignments will make use of computer packages, especially S-plus. I will teach you the minimum you need to know about S-plus: enough to do the work, but not enough to call yourself an expert. Never hand in raw computer output. Cut out (either electronically, or with scissors) plots, tables, etc. from the output and include them in your report as needed.

Assignments will request one or more of the following components: