Monthly Archives: June 2010

Messing with R packages

This was really frustrating. I’m trying to modify a package from Matt Johnson and although I could get the package he sent me to install flawlessly, I couldn’t un-tar it, make a change, re-tar it, and then R CMD INSTALL it. I was about to pull out my hair. The error I got was:
ERROR: cannot extract package from ‘hrm-rev9.tar.gz’

The secret: you have to have the name correct.
R CMD INSTALL hrm-rev9.tar.gz
barfs. But
R CMD INSTALL hrm_0.1-9.tar.gz
works fine. I’m sure it’s somewhere in the docs. I just couldn’t find it.
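
As far as I can tell, the name R wants is the package name, an underscore, the version, and then .tar.gz. Here is a minimal sketch that works out that name from the Package and Version fields of a DESCRIPTION file. The helper expected_tarball_name is my own illustration, not part of R, and it assumes each field sits on a single line.

```shell
# Sketch: print "<Package>_<Version>.tar.gz" for a given DESCRIPTION file.
# expected_tarball_name is a hypothetical helper, not something R ships.
expected_tarball_name() {
    desc="$1"
    # Pull out each field, trimming whitespace on both sides of the value.
    pkg=$(sed -n '/^Package:/{s/^Package:[[:space:]]*//;s/[[:space:]]*$//;p;}' "$desc" | head -n 1)
    ver=$(sed -n '/^Version:/{s/^Version:[[:space:]]*//;s/[[:space:]]*$//;p;}' "$desc" | head -n 1)
    printf '%s_%s.tar.gz\n' "$pkg" "$ver"
}
```

Calling expected_tarball_name hrm/DESCRIPTION from the parent directory prints the name to give the tarball. Running R CMD build hrm should also write a tarball that is already named this way.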

As always, I made a script to do it for me: (Updated 6/17/2010 15:41)

#!/bin/bash
# Quick script to tar & gzip the package, remove the old one, and install the new one
# I'll add options to automatically tag and release it later.

#Set the library that I'm using
LIB="/home/vanhoudn/R/i486-pc-linux-gnu-library/2.10/"

#Commit
svn commit -m "Build commit"

#get the revision number from svn
REV=`svn info -R | grep Revision | cut -d: -f 2 | sort -g -r | head -n 1 | sed 's/ //g'`

#Build the filename
FILENAME="hrm_0.1-$REV.tar.gz"

# I need to tar up the pkg so I can install it.
# Jump to the parent directory and work from there.
cd ..
# Exclude any hidden files under the directories (svn has a bunch)
# and add the named files
tar czf "$FILENAME" --exclude '.*' hrm/DESCRIPTION hrm/NAMESPACE hrm/src hrm/R

# Remove the old version of the package
R CMD REMOVE -l "$LIB" hrm

# Install the new package
R CMD INSTALL "$FILENAME"

# Clean up
rm "$FILENAME"

# Go back to our previous directory
cd hrm
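
The revision pipeline in the script is the fiddly part, so here it is on its own as a small sketch. The function name max_revision is made up for illustration. It reads text shaped like the output of svn info -R on standard input and prints the largest revision number it finds.

```shell
# Sketch: print the highest "Revision:" number found on standard input.
# max_revision is a hypothetical helper name, not an svn command.
max_revision() {
    grep '^Revision:' |   # keep only the Revision lines
        cut -d: -f 2 |    # take the part after the colon
        tr -d ' ' |       # drop the padding spaces
        sort -n -r |      # largest number first
        head -n 1
}
```

In the working copy it would be used as REV=$(svn info -R | max_revision).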

StatEt in Ubuntu 10.04

I wanted a “lightweight” version of Eclipse to run R from Ubuntu. (I installed eclipse-pde using apt-get. It worked fine.) Once it was running, I installed StatEt via the “Install new software” feature from http://download.walware.de/eclipse-3.5. While it was downloading, I opened up an R console and ran install.packages("rJava"). When the installation of both StatEt and rJava finished, I restarted Eclipse. This is when things stopped working, and I couldn’t find any step-by-step directions on how to proceed. Here is what I did:

  1. Run -> Run Configurations
  2. Click on R-Console in the left pane. This will create a new run configuration. Change the name to “R 2.10”
  3. Click on the “R_Config” tab. Choose “Selected Configuration:” and then hit the “Configure…” button.
  4. Click “Add”. Change “Location (R_Home):” to “/usr/lib/R” and click “Detect Default Properties/Settings”. Click “Ok” until you are back at the “Run Configurations” window.
  5. This is the important step. Without it you will get

    Launching the R Console was cancelled, because it seems starting the Java process/R engine failed.
    Please make sure that R package 'rJava' with JRI is installed and look into the Troubleshooting section on the homepage.

    Click on the JRE tab. In the “VM Arguments” box add
    -Drjava.path=/home/<username>/R/i486-pc-linux-gnu-library/2.10/rJava

    Where <username> is your username. (You are providing the path to rJava. For some reason, even though Eclipse detects it during the setup in the “R_Config” step, it doesn’t seem to share that information with the JRE.)

  6. Click Run. It should work.
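
If you are not sure which library rJava landed in, this rough sketch looks through a list of candidate library directories and prints the VM argument for the first one that contains rJava with a jri subdirectory. The function name rjava_vm_arg is hypothetical, and the rJava/jri layout is what I see on my machine, so treat it as an assumption.

```shell
# Sketch: print the -Drjava.path argument for the first library directory
# that holds rJava/jri. rjava_vm_arg is a hypothetical helper name.
rjava_vm_arg() {
    for lib in "$@"; do
        if [ -d "$lib/rJava/jri" ]; then
            # Strip one trailing slash so the path has no doubled separator.
            printf '%s\n' "-Drjava.path=${lib%/}/rJava"
            return 0
        fi
    done
    echo "rJava with JRI not found in: $*" >&2
    return 1
}
```

For example, rjava_vm_arg "$HOME/R/i486-pc-linux-gnu-library/2.10" checks the personal library from step 5. Add any other library directories you use as extra arguments.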

Not to make excuses, but this is why stopping the oil is hard.

I subscribe to R Bloggers, a nifty service that aggregates blog posts about R. The top post just now was “The Deepwater Horizon, in context”, which shows an infographic of the scale of the Earth from Mt. Everest to the Mariana Trench. The Deepwater Horizon rig covers a good chunk of the span from sea level downwards.

What the graph made clear is that the name Deepwater Horizon isn’t poetic; it is a description. We are dealing with oil gushing from a point as far below the ocean surface as Denver is above sea level. I wonder if the regulators who approved the project had a sense of how remote the “top” of the well was. I have great faith in the ability of engineers to eventually stop the flow, but that BP was “exempted in 2008 from filing a plan on how they would clean up a major spill” seems laughable given the context.

Pegging your multicore CPU in Revolution R, Good and Bad

Seven of eight cores at maximum usage

I take an almost unhealthy pleasure in pushing my computer to its limits. This has become easier with Revolution R and its free license for academic use. One of its best features is a debugger that lets you step through R code interactively, as you can with Python in PyDev. The other useful thing it packages is a simple way to run embarrassingly parallel jobs on a multicore box with the doSMP package.


library(doSMP)

# This declares how many processors to use.
# Since I still wanted to use my laptop during the simulation, I chose cores-1.
workers <- startWorkers(7)
registerDoSMP(workers)

# Make Revolution R not try to go multi-core since we're already explicitly running in parallel
# Tip from: http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
setMKLthreads(1)

chunkSize <- ceiling(runs / getDoParWorkers())
smpopts <- list(chunkSize=chunkSize)

#This just lets me see how long the simulation ran
beginTime <- Sys.time()

#This is the crucial piece. It parallelizes a for loop among the workers and aggregates their results
#with cbind. Since my function returns c(result1, result2, result3), r becomes a matrix with 3 rows and
# "runs" columns.
r <- foreach(icount(runs), .combine=cbind, .options.smp=smpopts) %dopar% {
    # repeatExperiment is just a wrapper function that returns a c(result1, result2, result3)
    tmp <- repeatExperiment(N, ratingsPerQuestion, minRatings, trials, cutoff, studentScores)
}

runTime <- Sys.time() - beginTime

#So now I can do something like this:
boxplot(r[1,], r[2,], r[3,],
        main=paste("Distribution of Percent of rmse below ", cutoff,
                   "\n Runs=", runs, " Trials=", trials, " Time=", round(runTime,2), " mins\n",
                   "scale: ", scaleLow, "-", scaleHigh,
                   sep=""),
        names=c("Ave3","Ave5","Ave7"))
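
The chunkSize line above is just a ceiling division: it splits the runs so that each worker gets one roughly equal block. Here is the same arithmetic as a tiny shell sketch. The function name chunk_size and the 1000 runs in the example are made-up values for illustration.

```shell
# Sketch: ceiling(runs / workers) using only integer arithmetic.
# chunk_size is a hypothetical helper; its arguments are runs and workers.
chunk_size() {
    runs="$1"
    workers="$2"
    # Adding (workers - 1) before dividing rounds any remainder up.
    echo $(( (runs + workers - 1) / workers ))
}
```

With 1000 runs and 7 workers, chunk_size 1000 7 gives 143, so six workers take 143 runs each and the seventh takes the remaining 142.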

If you are interested in finding out more about this, their docs are pretty good.

The only drawback is that Revolution R is a bit rough around the edges and crashes much more than it should. Worse, for me at least, the support forum doesn’t show any posts when I’m logged in and I can’t post anything. Although I’ve filled out (what I think is) the appropriate web form, no one has gotten back to me about fixing my account. I’m going to try Twitter in a bit. Your mileage may vary.

Update: 6/9/2010 22:03 EST

Revolution Analytics responded to my support request after I mentioned it on Twitter. Apparently, they had done something to the forums which corrupted my account. Creating a new account fixed the problem, so now I can report the bugs that I find and get some help.

Update: 6/11/2010 16:03 EST

It turns out that you get a small speed improvement by calling setMKLthreads(1). Apparently, the libraries Revolution R links against attempt to use multiple cores by default. If you are explicitly programming in parallel, this means that your code is competing with itself for resources. Thanks for the tip!

CMU Web Publishing is Strange

The My Andrew web publishing workflow is a little strange. Once you get everything set up, maintenance is a pain: every time you copy updated files to ~/www, you must visit a webpage to “publish” them. Presumably the script that copies the files from ~/www to somewhere else in the infrastructure does some security checking or something.

Visiting a webform is a pain. Luckily, it is unauthenticated and uses GET so it is simplicity itself to “publish” from the command line. A quick script:

#!/bin/sh

#Change these to where your local copy of www sits and your username
SOURCE=/home/nathanvan/winhome/workspace/www/
USER=nmv

rsync -rv $SOURCE $USER@unix.andrew.cmu.edu:www
#Since I'm triggering an event, I'm not worried about certificate integrity.
wget --no-check-certificate 'https://www.andrew.cmu.edu/cgi-bin/publish?FLAG=0&NAME='$USER

#Clean up what wget left behind.
rm 'publish@FLAG=0&NAME='$USER
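
A variation I haven’t fully settled on: instead of deleting the file wget leaves behind, tell wget to throw the response away with -O /dev/null. This is a sketch under that assumption. The helper names publish_url and publish are mine, and the URL is the same one used in the script above.

```shell
# Sketch: build the publish URL for a given Andrew username.
# publish_url and publish are hypothetical helper names.
publish_url() {
    printf '%s\n' "https://www.andrew.cmu.edu/cgi-bin/publish?FLAG=0&NAME=$1"
}

# Trigger the publish step and discard the response, so there is no
# leftover file to clean up afterwards.
publish() {
    wget --no-check-certificate -q -O /dev/null "$(publish_url "$1")"
}
```

After the rsync, publish "$USER" would replace both the wget line and the rm line.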

Clearly there are other ways to do this, but this is simple and compatible with my workflow. Since I don’t do this often, I’m okay with typing my password. I could use ssh-agent and such, but I haven’t got that set up on this machine yet and doubt I ever will.

Rules

This is where I’ll post small things that might be useful to others as I play around with code.