Cosma Shalizi > 36-757 (Fall 2010)

Some Advice on Process for ADA, or, Riding the Big Hairy Research Project

ADA is supposed to take a large part of your thought and energy for the next year or so. It is (most likely) bigger and more complicated than anything you've done before. Staying on top of a piece of work like this is harder than staying on top of more limited assignments. Many graduate programs teach this only through attrition, i.e., people who don't figure it out on their own leave the program. Here, however, are some pointers, based on what's worked for me and my friends. Your mileage will vary.

(Some of this will carry over to your dissertation; some of it is even more general advice for scholarly practice.)

Have a plan

Right now, what you should do for ADA is probably quite vague; the only clear thing might be that it will be a lot of work. This makes it hard to know what to do. Clarifying what you should be doing, and developing a plan, can help lift vagueness-induced paralysis. Don't worry about getting the plan exactly right at first. ("Truth comes more readily out of error than out of confusion".)
  1. Figure out what your data is (you've all already done this)
  2. Figure out what your investigator's scientific problem is (they've told you this in what they think is adequate detail)
  3. Translate the scientific problem into a statistical problem — i.e., what analysis you can do to the data which would solve the real-world problem.
  4. Figure out what a solution to the statistical problem would look like, how you'd know that you've solved it.
  5. Identify the major components of the solution, say 2--4 big pieces which, fitted together, would solve your problem. These will probably themselves be kind of vague, so —
  6. Recurse: take those components as statistical problems themselves and break them down into simpler sub-problems. Keep going until you have turned everything into sub-sub-(sub-)problems which are simple enough that you have an idea of how to attack each one of them. It can help to draw this out as a tree, of problems branching into sub-problems. (If you can't figure out how to break up a part of the problem, don't worry about that right now; just get at least one part of the tree down in enough detail that you can begin.)
You have now reduced your problem to a collection of sub-problems which are small enough to attack; list them and get started.
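
If it helps to see the recursion concretely, here is a minimal sketch of the problem tree as a data structure, with the list of attackable sub-problems read off its leaves. The class, its fields, and every problem name below are made up for illustration; a drawing on paper does the same job.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    """One node in the plan: a problem and the sub-problems it breaks into."""
    name: str
    subproblems: list["Problem"] = field(default_factory=list)

    def leaves(self) -> list[str]:
        """Return the names of all problems with no further breakdown."""
        if not self.subproblems:
            return [self.name]
        found = []
        for sub in self.subproblems:
            found.extend(sub.leaves())
        return found


# Hypothetical plan: every name here is a placeholder, not a real project.
plan = Problem("statistical problem", [
    Problem("component A", [
        Problem("sub-problem A1"),
        Problem("sub-problem A2"),
    ]),
    Problem("component B"),  # not yet broken down; that is fine for now
])

# The leaves are the list of things small enough to start on.
todo = plan.leaves()
for item in todo:
    print("-", item)
```

A branch that has not been broken down yet simply shows up as one big leaf, which is a reminder to come back and split it later.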

Write as you go

Remember that ADA culminates in producing a paper which should be good enough to submit to a journal. Typically, people produce a report in a last-minute burst of writing, but this usually leads to both stress and poor writing. It is better to start now and keep working on the writing as you go. You will produce a better piece of work with less stress, and your ideas will be clearer.

You have likely had the experience of going to a teacher or class-mate for help with a problem, and finding that as you explained what you needed help with, you realized what you needed to do. Writing as you go means explaining what you are doing to yourself, which has many of the same benefits.

Write a draft right now
You've just worked out a detailed description of what your problem is, what a solution would look like, and your plan for getting from the problem to the solution. Turn this into a draft report, with paragraphs and complete sentences and so forth. Leave place-holders for the stuff you've still got to do.
If, in writing your draft, you realize that your plan is not as clear as you thought it was, go back and revise the plan.
Update your draft regularly
Writing a draft now and not touching it until April is still better than not starting writing until April. But better still is the slow steady application of time to the writing. You might try setting aside, say, 30 minutes once a week to revise your draft — incorporate what you have learned and done that week, revise the prose, make or improve figures, etc. Even if you feel like you haven't done enough new work that week to spend 30 minutes writing it up, you can certainly spend the time improving your write-up.
Integrating writing and coding
Lots of what will be in your report (figures, tables, etc.) will be the result of computational analysis. While documenting what you did to get your results is a good practice anyway (see below), it can be especially useful to embed your analysis code into your write-up, so that the associations are clear. The easy way, which is what I do, is to simply paste the relevant parts of your R (or whatever) session/code into the LaTeX file of your document, and then comment out the code. Sweave is a more elaborate system which will actually re-run your R code, re-doing the analysis live. This provides a stronger guarantee that you are writing about what you actually did, but it is more work and more run-time.
(Most of you are probably better programmers than I am, but on the off-chance that you're not, you might find this helpful.)

Track your work in writing

Your memory is small, unreliable and fleeting, the project is large and complicated, distractions are numerous and contextual cues are weak. You will come back to things after a month, sometimes a day, and not remember what you did, or how, or even perhaps why. Fortunately, the wonderful technology of "writing" will remember for you.
Write down everything you do
As much as possible, try to write down everything you did for the project; how exactly you did it; and save the output. I recommend buying a physical notebook, which you can keep on your person, and use for tracking work by hand, as well as maintaining an electronic copy of what you are doing — either by updating your report, or by keeping some sort of project log file.
The standard isn't quite "Notes or it didn't happen", but close.
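
One possible shape for the electronic project log is a plain text file that you only ever append to. The sketch below is just that, a sketch: the function name, the entry layout, the file location, and the example entry are all assumptions made for illustration, not a prescribed format.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def log_entry(logfile, what, how="", output=""):
    """Append one timestamped entry to a plain-text project log."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    lines = [f"[{stamp}] {what}"]
    if how:
        lines.append(f"    how: {how}")
    if output:
        lines.append(f"    output saved in: {output}")
    # Open in append mode so earlier entries are never overwritten.
    with Path(logfile).open("a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return lines


# Hypothetical example: the log location and entry text are placeholders.
log_path = Path(tempfile.gettempdir()) / "ada_log_example.txt"
entry = log_entry(
    log_path,
    "Refit model without outliers",
    how="script fit_v2, seed 1",
    output="results/fit_v2.txt",
)
print("\n".join(entry))
```

Recording what you did, how, and where the output went in one entry is the point; the exact fields matter much less than using them every time.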
Write down what you want to do
Lots of ideas will occur to you for things which might be useful or interesting to do in connection with the project. Write them down, again either in the notebook or in a special file of ideas for the project; revisit those periodically to see if any are worth pursuing. Writing them down means that you don't have to devote attention to remembering them!
Consider a version-control system
There are lots of free software systems for tracking revisions to multiple files associated with projects (CVS, SVN/Subversion, the unfortunately named Git, etc., etc.). These are mostly designed for software, and have features to make it easier for people to work on small parts of big projects without getting in each other's way, but they also make it easy to keep track of revisions to your documents (and code!), compare changes, and roll back some or all of your work if it turns out to have been a blind alley. I use Git, but it's probably not very important which one you use.

Revise your plan

There is, or should be, a feedback between your plan and your work. Your goals constrain what you do, but as you do your work, you learn more about the problem, and this knowledge tells you about what you can and should do. In other words, you ought to revise your goals in light of your actions. Many people have trouble with the idea of deviating from the initial plan, or wonder what good a revisable plan could be. But the plan was something you made up to help yourself, a guess about how to guide yourself to a place where you could better see which way you should go, not a fixed path you must follow without swerving or meet your doom.

Set aside time periodically to revisit your plan, see what has been accomplished, what no longer makes sense, and what needs to be changed. I suggest doing this less often than updating your draft, say every two weeks or once a month (perhaps around the weeks when you'll present in class).


Get a starting point
Get initial references from the faculty members you're working with, both on the scientific problem and on the statistical aspects. Ask them for reviews, introductions, background, etc., as well as any specific contributions you are building on, paralleling, debunking, etc. Try to read broadly and soak in information. You are supposed to be a collaborator in a process of scientific investigation, not a human interface to R, so you need to really understand your problem domain and what is already known about it.
Work references backwards
As you read, note what's cited in connection with topics which puzzle or interest you. Write down the references, track them down, and start reading the ones which look promising. Also, ask your advisers about what to read on those topics. Recurse.
Work references forward
Use the citation databases to see what's been done which builds on stuff you've read and found interesting; scan it to see which parts of it might be useful to you. — Now would be a good time to check out Mathematical Reviews, if you haven't already. It exclusively publishes short summaries and critiques of papers in other journals and books, explaining their contribution and their connection to the rest of the literature, and providing, on the website, really outstanding linkage. It's most useful for work which is of mathematical interest (as the name suggests), so better for say theoretical statistics than really applied papers, but definitely worth exploring.
Expect to have to keep reading
You should identify the journals which publish relevant work (they're the ones you're seeing the most citations to), and start getting their tables of contents in electronic form, if you're not already. (Basically all journals now offer alerting e-mails, and most have RSS feeds, if you use those.) Likewise, sign up for alerts from the preprint server. (Actually, you should plan to post your papers there, once they are written.) Spend a little time keeping up with these. (It's easy to spend too much, which brings me to the next point.)
Filter ruthlessly
Almost all academic work you run across will be completely irrelevant (and/or hopelessly bad). You need to develop skill in triage: stopping with the title (and/or authors); stopping with the abstract; stopping with the introduction; stopping with the conclusion; stopping with a skim over the paper. Only a tiny fraction of papers will be worth reading in depth.
Track references
Develop a system for keeping track of bibliographic information, both for things you have read and might use, and for things which look like they might be interesting one day.
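
Any system will do, so long as it separates what you have read from what merely looks interesting. As a minimal sketch (the record fields and the status labels are my own invention for illustration, and the statuses assigned below are placeholders; a BibTeX file with a notes field would serve equally well):

```python
from dataclasses import dataclass


@dataclass
class Reference:
    """One bibliographic record plus where it stands in your reading."""
    author: str
    title: str
    status: str = "to skim"  # hypothetical labels: "to skim", "reading", "read", "dropped"
    note: str = ""


def by_status(refs, status):
    """Return the titles of all references with the given status."""
    return [r.title for r in refs if r.status == status]


# The two works recommended at the end of this page, with made-up statuses.
refs = [
    Reference("Herbert Simon", "The Sciences of the Artificial", status="reading"),
    Reference("Kieran Healy", "Choosing Your Workflow Applications"),
]

print(by_status(refs, "to skim"))
```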

Reading More

Kieran Healy's Choosing Your Workflow Applications is worth reading, but requires two caveats: some of the advice is targeted at social scientists rather than statisticians, and under no circumstances should you ever write a paper in Word. (Really, nothing should ever be written in Word, but the larger battle was lost long ago, alas.)

I also strongly, strongly, strongly recommend reading Herbert Simon's The Sciences of the Artificial; the connection should be clear by the time you're done.