Statistics 36-490: Undergraduate Research
Spring 2014
MW 10:30--11:50, Wean Hall 8427
36-490 is a semester-long course in applied statistics. Students will work in
teams of about three to solve problems facing actual scientific investigators
with real data. The goal is to learn how to translate scientific questions
into statistical problems, develop and assess solutions to those problems, and
translate the statistical solutions back into scientific answers. Students
will build on the skills of data exploration, model development, model fitting
and checking, and interpretation that they began in earlier classes, but also
practice working with subject-area scientists, collaborative research, and both
written and oral scientific communication.
At the end of the semester, each team will present a poster at the Meeting of
the Minds undergraduate research symposium and submit a written report in the
style of a scientific paper.
Pre-requisites
Students must have passed 36-401, Modern Regression, and either have passed or
be enrolled in 36-402, Advanced Methods of Data Analysis. Admission to the
class is by special application and consent of the instructor only.
Projects
In addition to the projects provided by the instructor, students are invited to
come up with their own, subject to instructor approval. Any project must
involve both real data and an outside investigator.
Please read the handout on interacting with your investigator.
Course Mechanics
Each group will make multiple in-class presentations on their progress to date,
and submit drafts of their written report and poster presentation. There will
also be homework assignments connected with the lectures. See below for
details of deadlines and grading.
We will meet twice a week. Mondays will usually be a lecture on a
relevant methodological topic or aspect of the research process; teams will
meet separately with Prof. Shalizi on Wednesdays during class time.
Office hours are by appointment; please see Prof. Shalizi's public calendar.
Grades will be available through the class Blackboard site.
Lectures
Approximately once a week (most Mondays) there will be a lecture on a topic
which will be useful for your projects. Usually these topics will be
statistical ones (previous topics have included categorical data analysis,
missing data and non-response bias, clustering, factor analysis, Markov models,
etc.). Most lectures will come with short homework assignments: install a
package in R, try a small data analysis on a particular data set, read a paper
and discuss in the next class, etc.
You are encouraged to discuss the homework assignments with each other, but the
work you hand in must be your own. You must not copy mathematical derivations,
computer output and input, or written descriptions from anyone or anywhere
else, without reporting the source within your work. Please review the CMU
Policy on Academic Integrity.
Lecture Schedule
Subject to revision as we go along.
- Statistical consulting and statistical collaboration (15 January)
- Modeling count data (27 January)
- Decision trees, bagging, random forests (3 February)
- Missing data (10 February)
- Causal inference: identification and estimation (17 February)
- Causal inference: partial identification and discovery (24 February)
- Resampling dependent data (17 March)
- Model checking: residuals, calibration, simulation tests (24 March)
- Poisson process and other models for events over time (7 April)
- Mixed-membership models (14 April)
- Writing papers (21 April)
- Giving talks (28 April)
Notes and associated assignments will be posted here after the lectures.
Project Meetings, Presentations, and Reports
The projects will consume the majority of your time in this class. Instead of
lectures, most Wednesdays you will have a group meeting with the professor.
You should also plan on meeting at least once a week within your project group,
and at least once a month with your faculty investigator.
During the semester, each group will make brief presentations to the whole
class on the progress of their projects. Each group member must participate
in each of these presentations. The complete project work will be presented
in an end-of-the-year poster session.
Each group must turn in a formal, written report on the last day of class.
A draft of the written report is due in early April. There will be no exams
for this class, but several of the lectures will have associated, written
homework assignments.
Two or three times during the semester, each student will be asked to assess
the contribution of each group member to the team effort, and this will be
factored into your project grade.
Assignments
Unless you are told otherwise, all electronic assignments should be submitted
as either plain text files or PDFs. Do not send Word files; if you
want to write in Word, convert to PDF before turning it in. (This ensures that
we can read your file, and that it appears exactly the same to us as it does to
you.)
Key Dates
Slide Presentation I | March 3 and 5 |
Slide Presentation II | March 31 and April 2 |
Meeting of the Minds Registration | April 2 |
Draft Paper | April 14 at 10:30 AM |
Draft Poster | April 28 at 10:30 AM |
Final Paper | May 7 |
Final Poster | May 7, Meeting of the Minds |
Grading
Homework | 15% |
Participation during class discussion | 10% |
Participation during group project meetings | 10% |
Oral presentations | 15% |
Written report | 30% |
Poster presentation | 20% |
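As a rough illustration of how the weights above combine, here is a minimal Python sketch of a weighted average. It assumes each component is scored on a 0-100 scale and that components are combined linearly; the syllabus does not spell out the actual computation or how peer assessments enter the project grade. The function name `overall_score` and the example scores are hypothetical.

```python
# Component weights copied from the grading table (percent of the final grade).
WEIGHTS = {
    "Homework": 15,
    "Participation during class discussion": 10,
    "Participation during group project meetings": 10,
    "Oral presentations": 15,
    "Written report": 30,
    "Poster presentation": 20,
}


def overall_score(scores):
    """Weighted average of component scores, each assumed to be on a 0-100 scale."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical scores, for illustration only.
example_scores = {
    "Homework": 90,
    "Participation during class discussion": 100,
    "Participation during group project meetings": 100,
    "Oral presentations": 85,
    "Written report": 80,
    "Poster presentation": 90,
}

print(overall_score(example_scores))
```

Because the written report and poster together carry half of the weight, the project deliverables dominate the result under this assumed linear scheme.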
Resources
Texts
These books are required:
- Michael Alley, The Craft of Scientific Writing (3rd edition, Berlin: Springer, 1996, ISBN 0-387-94766-3)
- D. R. Cox and Christl Donnelly, Principles of Applied Statistics (Cambridge: Cambridge University Press, 2011, ISBN 978-1-107-64445-8)
- George Polya, How to Solve It: A New Aspect of Mathematical Method (2nd edition, Princeton: Princeton University Press, 1957, ISBN 0-691-02356-5)
These books are optional but recommended:
- Wayne C. Booth, Gregory G. Colomb and Joseph M. Williams, The Craft of Research (3rd edition, Chicago: University of Chicago Press, 2008, ISBN 0-226-06566-9)
- W. N. Venables and Brian D. Ripley, Modern Applied Statistics with S (4th edition, Berlin: Springer, 2002, ISBN 978-1-441-93008-8)
- Joseph Williams, Style: Toward Clarity and Grace (Chicago: University of Chicago Press, 1990, ISBN 0-226-89915-2)
Handouts
- Interacting with your faculty investigator
R
You don't have to use R, but you should think hard before doing otherwise. By
this point, students in the class are expected to be fairly familiar with at
least the basics of the language and of R programming.
Some useful online resources:
- The official intro, "An Introduction to R", available online in HTML and PDF
- John Verzani, "simpleR", in PDF
- Google R Style Guide offers some rules for naming, spacing, etc., which are generally good ideas
- Quick-R. This is primarily aimed at those who already know a commercial
statistics package like SAS, SPSS or Stata, but it's very clear and
well-organized, and others may find it useful as well.
- Patrick Burns, The R Inferno. "If you are using R and you think you're in
hell, this is a map for you."
- Thomas Lumley, "R Fundamentals and Programming Techniques" (large PDF)
- The website Software Carpentry is not specifically R-related, but contains a
lot of valuable advice and information on scientific programming.
- RStudio is an "integrated
development environment" (IDE) for R. It's designed to make the common tasks
of writing and running R code more efficient, easier, and more reproducible.
- Minimal Advice on Programming, Especially in R, and the lecture notes for
36-350, Statistical Computing, may also be helpful.
There are also some handy books:
- Venables and Ripley's Modern Applied Statistics with S
is one of our recommended texts; it covers the implementation of a lot
of standard statistical methods. (R is a dialect or descendant of the S
language.) It does tend to presume both some knowledge of the language and
some knowledge of the methods, however. (It answers "How do I do X in
S?", not "What is X, anyway?")
- Paul Teetor, The R Cookbook (Sebastopol, California: O'Reilly, 2011) and
Winston Chang, The R Graphics Cookbook (O'Reilly, 2012) are good references on
the day-to-day basics of getting stuff done in R; they're organized by task
rather than command.
- John M. Chambers, Software for Data Analysis: Programming with R (New York:
Springer, 2008, ISBN 978-0-387-75935-7) is the best book on writing programs
in R.
Scientific Writing, Statistical Consulting, Professional Ethics
Alley's Craft of Scientific Writing is one of our required
texts; it's got a lot of sound advice and information on what you need to do to
write a readable scientific paper. Booth et al.'s Craft of
Research (recommended) is not so specifically focused on scientific
work, but is very sound on the process of figuring out what it is you actually
want to research, refining it into a series of manageable problems, and
assembling compelling arguments. Williams's Style (recommended) is the best
book of writing advice I've ever found.
Further resources on scientific writing:
- G. D. Gopen and J. A. Swan, "The Science of Scientific Writing", American Scientist 78 (1990): 550--558
- Peter B. Medawar, "Is the Scientific Paper a Fraud?", The Listener 70 (12 September 1963): 377--378. Reprinted in many collections (e.g., Pluto's Republic [Oxford: Oxford University Press, 1982]), and available online in various versions.
On statistical consulting:
- C. Chatfield, "Avoiding statistical pitfalls", Statistical Science 6 (1991): 249--252 [JSTOR]
- D. J. Finney, "Ethical aspects of statistical practice", Biometrics 47 (1991): 331--339 [JSTOR]
- W. G. Hunter, "The practice of statistics: The real world is an idea whose time has come", American Statistician 35 (1981): 72--76 [JSTOR]
- R. E. Kirk, "Statistical consulting in a university: Dealing with people and other challenges", American Statistician 45 (1991): 28--34 [JSTOR]
- R. Tweedie, "Consulting: Real problems, real interactions, real outcomes", Statistical Science 13 (1998): 1--29 [JSTOR]
- D. A. Zahn and D. J. Isenberg, "Nonstatistical aspects of statistical consulting", American Statistician 37 (1983): 297--302 [JSTOR]
Useful References on Statistical Models, Statistical Methods, and Statistical Modeling
- Robert P. Abelson, Statistics as Principled Argument (Hillsdale, New Jersey:
Lawrence Erlbaum Associates, 1995). A wise and witty guide to using statistics
in making an honest case for or against some proposition; it would have been a
required text, if it weren't out of print.
- A. C. Davison, Statistical Models (Cambridge,
England: Cambridge University Press, 2003). Massive reference on the most
common statistical models, what they really assume, and how they really work.
Includes just enough theory to be helpful, and good practical examples.
- Julian J. Faraway, Linear Models with R (Boca Raton, Florida: Chapman and Hall/CRC, 2005)
- Julian J. Faraway, Extending the Linear Model with R: Generalized Linear, Mixed Effects, and Nonparametric Regression Models (Boca Raton, Florida: Chapman and Hall/CRC, 2006)
- Peter Guttorp, Stochastic Modeling of Scientific Data
(London: Chapman and Hall, 1995). Statistical inference for stochastic
processes, and building stochastic process models from scientific theories.
- Trevor Hastie, Robert Tibshirani and Jerome Friedman, The Elements of
Statistical Learning: Data Mining, Inference, and Prediction (2nd edition,
Berlin: Springer, 2009). A deservedly-standard textbook on modern,
computer-intensive statistical methods.
- Jeffrey S. Simonoff, Smoothing Methods in Statistics
(Berlin: Springer-Verlag, 1996). A gentle introduction to nonparametric
smoothing and its uses.