Not to make excuses, but this is why stopping the oil is hard.

I subscribe to R Bloggers, a nifty service that aggregates blog posts that have to do with R. The top post just now was "The Deepwater Horizon, in context," which shows an infographic of the scale of the Earth from Mt. Everest down to the Mariana Trench. The Deepwater Horizon rig covers a good chunk of the span from sea level downwards.

What the graph makes clear is that Deepwater Horizon isn't a poetic name; it is a description. We are dealing with oil gushing from a point as far below the ocean's surface as Denver is above sea level. I wonder if the regulators who approved the project had a sense of how remote the "top" of the well was. I have great faith in the ability of engineers to eventually stop the flow, but the fact that BP was "exempted in 2008 from filing a plan on how they would clean up a major spill" seems laughable given the context.

Pegging your multicore CPU in Revolution R, Good and Bad

Seven of eight cores at maximum usage

I take an almost unhealthy pleasure in pushing my computer to its limits. This has become easier with Revolution R and its free license for academic use. One of its best features is a debugger that lets you step through R code interactively, much as you can with Python in PyDev. The other useful thing it packages is a simple way to run embarrassingly parallel jobs on a multicore box with the doSMP package.


library(doSMP)

# This declares how many processors to use. Since I still wanted to be able to
# use my laptop during the simulation, I chose cores - 1.
workers <- startWorkers(7)
registerDoSMP(workers)

# Keep Revolution R's math library single-threaded, since we're already
# explicitly running in parallel.
# Tip from: http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
setMKLthreads(1)

# runs is the total number of simulation runs; it, like the arguments passed
# to repeatExperiment below, is defined earlier in the script and omitted here.
chunkSize <- ceiling(runs / getDoParWorkers())
smpopts <- list(chunkSize=chunkSize)

# This just lets me see how long the simulation ran
beginTime <- Sys.time()

# This is the crucial piece. It parallelizes a for loop among the workers and
# aggregates their results with cbind. Since my function returns
# c(result1, result2, result3), r becomes a matrix with 3 rows and "runs" columns.
r <- foreach(icount(runs), .combine=cbind, .options.smp=smpopts) %dopar% {
  # repeatExperiment is just a wrapper function, defined elsewhere, that
  # returns c(result1, result2, result3)
  repeatExperiment(N, ratingsPerQuestion, minRatings, trials, cutoff, studentScores)
}

runTime <- Sys.time() - beginTime

# Shut down the worker pool now that the parallel section is done.
stopWorkers(workers)

# So now I can do something like this:
boxplot(r[1,], r[2,], r[3,],
        main=paste("Distribution of Percent of rmse below ", cutoff,
                   "\n Runs=", runs, " Trials=", trials, " Time=", round(runTime,2), " mins\n",
                   "scale: ", scaleLow, "-", scaleHigh,
                   sep=""),
        names=c("Ave3","Ave5","Ave7"))
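
If you want to see how the pieces fit together, here is a toy version of the loop. It is just a sketch: it uses the sequential %do% operator and a made-up three-number result in place of repeatExperiment, so it will run anywhere, even without doSMP installed.

library(foreach)
library(iterators)

# Each iteration returns a named 3-vector; .combine=cbind stacks the results
# side by side, one column per iteration.
toy <- foreach(i=icount(4), .combine=cbind) %do% {
  c(result1=i, result2=i^2, result3=i^3)
}
dim(toy)  # 3 4 -- three rows, one column per iteration, just like r above

Swapping %do% for %dopar% (with a parallel backend registered) is all it takes to run the same loop across the workers.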

If you are interested in finding out more about this, their docs are pretty good.

The only drawback is that Revolution R is a bit rough around the edges and crashes much more than it should. Worse, for me at least, the support forum doesn't show any posts when I'm logged in, and I can't post anything. Although I've filled out (what I think is) the appropriate web form, no one has gotten back to me about fixing my account. I'm going to try Twitter in a bit. Your mileage may vary.

Update: 6/9/2010 22:03 EST

Revolution Analytics responded to my support request after I mentioned it on Twitter. Apparently, they had done something to the forums which corrupted my account. Creating a new account fixed the problem, so now I can report the bugs that I find and get some help.

Update: 6/11/2010 16:03 EST

It turns out that you get a small speed improvement by calling setMKLthreads(1). Apparently, the math libraries Revolution R links against attempt to use multiple cores by default. If you are explicitly programming in parallel, this means that your code is competing with itself for resources. Thanks for the tip!
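
To see the effect for yourself, here is a minimal sketch of the comparison, assuming Revolution R with doSMP as above. The worker count, matrix size, and the thread count of 4 are placeholders I picked for illustration, not settings from my simulation.

library(doSMP)

workers <- startWorkers(4)
registerDoSMP(workers)

# Time a batch of largish matrix multiplies spread across the workers.
timeBatch <- function() {
  m <- matrix(rnorm(1000 * 1000), nrow=1000)
  system.time(
    foreach(icount(20)) %dopar% (m %*% m)
  )[["elapsed"]]
}

setMKLthreads(1)  # one math thread per worker: no competition
timeBatch()

setMKLthreads(4)  # now each worker's math library also spawns threads,
timeBatch()       # so your code competes with itself for the cores

stopWorkers(workers)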

CMU Web Publishing is Strange

The My Andrew web publishing workflow is a little strange, and even once you get everything set up, maintenance is a pain. After you copy updated files to ~/www, you must visit a webpage to "publish" those files. Presumably the script that copies the files from ~/www to somewhere else in the infrastructure does some security checking or something.

Visiting a webform is a pain. Luckily, it is unauthenticated and uses GET, so it is simplicity itself to "publish" from the command line. A quick script:

#!/bin/sh

# Change these to where your local copy of www sits and your Andrew username
SOURCE=/home/nathanvan/winhome/workspace/www/
USER=nmv

# Push the local copy up to the web space
rsync -rv "$SOURCE" "$USER@unix.andrew.cmu.edu:www"

# Trigger the publish step. Since I'm just triggering an event, I'm not
# worried about certificate integrity.
wget --no-check-certificate 'https://www.andrew.cmu.edu/cgi-bin/publish?FLAG=0&NAME='$USER

# Clean up the file that wget left behind.
rm 'publish@FLAG=0&NAME='$USER

Clearly there are other ways to do this, but this is simple and compatible with my workflow. Since I don't publish often, I'm okay with typing my password. I could use ssh-agent and such, but I haven't set that up on this machine yet and probably never will.

Rules

This is where I’ll post small things that might be useful to others as I play around with code.