
Revolution R with Eclipse Helios

One of the reasons I don’t often take advantage of the cool features in Revolution R is that I absolutely can’t stand their Visual Studio interface. Previously, if I wanted to run something in RevoR, I fired up the RGui.exe buried in their distribution and used R’s built-in script editor. My normal workflow is to use StatEt inside Eclipse, so dealing with R’s meager editor was always painful. (Although less painful than the bloated VS-standalone alternative.)

Over the break, I ran across Luke Miller’s excellent post on getting Eclipse set up with StatEt the right way. I was able to follow his tutorial to get vanilla 64-bit R working on a new installation of 64-bit Eclipse Helios. Once that was working, I changed two things to add a second shortcut for Revo R.

First, I followed his directions to install rJava in RevoR:

C:\Users\nathanvan>cd C:\Revolution\Revo-4.0\RevoEnt64\R-2.11.1\bin
C:\Revolution\Revo-4.0\RevoEnt64\R-2.11.1\bin>R.exe

R version 2.11.1 (2010-05-31)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
...
Type 'revo()' to visit www.revolutionanalytics.com for the latest
Revolution R news, 'forum()' for the community forum, or 'readme()'
for release notes.

> install.packages("rJava")
...
package 'rJava' successfully unpacked and MD5 sums checked

The downloaded packages are in
C:\Users\nathanvan\AppData\Local\Temp\RtmpG3tMzb\downloaded_packages

And then I installed the rj package in RevoR, once again following his directions.

C:\Revolution\Revo-4.0\RevoEnt64\R-2.11.1\bin>R CMD INSTALL --no-test-load "C:\Users\nathanvan\Downloads\rj_0.5.2-1.tar.gz"
...
* DONE (rj)

And finally, I set up Eclipse with a second Run Configuration, which I named Revo-R-x64-2.11.1. Now I can run the 64-bit version of RevoR without having to deal with the Visual Studio interface. If I get around to it, I’ll post some performance numbers. (The last time I used the VS interface, it was noticeably slower than calling RGui.exe directly.)
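
As a quick sanity check that a given Run Configuration really launched the Revolution build rather than vanilla R, you can ask the new console directly (base R functions only, nothing Revolution-specific):

# Confirm which R this console is actually attached to.
R.version.string      # the version banner of the running R
Sys.getenv("R_HOME")  # should point into the Revo-4.0 install directory
.libPaths()           # rJava and rj need to be on one of these paths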

A common one

After talking with Cosma Shalizi, I’ve decided to set my work-related New Year’s resolution as blogging once a week. This is more of a goal than a commitment; if I don’t have anything useful to share, or I don’t have time to share it, there won’t be a post. The hope is high quality at low volume. I intend to blog the afternoon after my Advanced Probability problem set is due.

I’ll try to post (in roughly increasing frequency):

  1. General summaries of things that I finish (papers, released code, file-drawer projects).
  2. Small, self-contained bits of research when I can. These will be tidbits too small to go up on arXiv, but likely to be of use to at least someone. First up: subsetting rating data.
  3. Reviews of research aimed at a general audience with lots of links and as much context as I can muster.
  4. Ideas that I don’t have time to pursue but would rather like someone else to actualize so that I could purchase their product/service.
  5. Reactions to scholarly ideas in the popular press. (If I’m training to become a public intellectual, I might as well start.)
  6. Snippets of code that may be of general use. (Including benchmarking and howto posts.)

Let’s see if I can keep this up for at least a semester.

hobble hobble

It’s been a while.

I’ve got some preliminary results that I’ll be presenting to the PIER EdBag on Thursday. If there is interest, I’ll post my slides for those of you who are not able to make it.

I need to redo my computing setup; the easiest way to do this is to nuke everything and restore my files. Unfortunately, that would mean I’d have to get my application stack set up again (albeit as a better setup, hence the exercise). But I don’t have the time to take the time to get set up in a way that would save me time. So, hobble, hobble.

At least it’s almost Thanksgiving.

A memory leak in getResults from AMT CLT? Nope.

A strange error kept popping up: every time I tried to use the getResults.sh script outside of a project in the samples directory, an unidentified process would eat up all of the memory on my machine. Worse, when I used strace, the process seemed to be just hanging there.

I tried a bunch of things:

  1. Moving the directory back to samples. No dice.
  2. Switching from OpenJDK to Sun’s JDK. Since the CLT doesn’t support 1.6, I followed the directions here to install sun-java5-jdk and set JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/jre/
  3. Taking another close look at the script.

Here’s an excerpt from my getResults.sh script:

JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun/jre/
export JAVA_HOME
DIR=`pwd`
./getResults.sh $1 $2 $3 $4 $5 $6 $7 $8 $9 -successfile $DIR/rate-3-4-36.success -outputfile $DIR/rate-3-4-36.results
cd $DIR

Notice that I forgot to change to the bin directory of the CLT. (I don’t know why, but you seemingly have to be in the bin directory for the CLT to work properly. I haven’t messed with it much yet; I might have screwed up my path somehow when I first tried.)

If you add the line:

cd ~/workspace/aws-mturk-clt-*/bin

before the ./getResults.sh call, it works. My script was recursing on itself: without the cd, the ./getResults.sh line launched my wrapper again instead of the CLT’s script of the same name. That’s what was eating the memory. And it didn’t show up in top because each individual invocation used only a little bit of memory. Man, I feel dumb.

I only post this in case someone else runs into the same trouble.

XML for AMT ExternalQuestion

If you are trying to use Amazon Mechanical Turk for something and you don’t know XML very well, you might run into this error when trying to write an ExternalQuestion with the command line tools.

vanhoudn@gauze:~/workspace/aws-mturk-clt-1.3.0/samples/rate_tests$ ./run.sh
--[Initializing]----------
Input: ../samples/rate_tests/rate_tests.input.csv
Properties: ../samples/rate_tests/rate_tests.properties
Question File: ../samples/rate_tests/rate_tests.question
Preview mode disabled
--[Loading HITs]----------
Start time: Thu Aug 05 16:12:33 EDT 2010
[Fatal Error] :6:111: The reference to entity "student2" must end with the ';' delimiter.
[ERROR] Error creating HIT 1 (3001): [6,111] The reference to entity "student2" must end with the ';' delimiter.
...

The issue is that I’m trying to pass more than one variable via the URL, and this:

<ExternalURL>
http://www.contrib.andrew.cmu.edu/~nmv/amt/test.php?
student1=${helper.urlencode($student1)}
&student2=${helper.urlencode($student2)}
&student3=${helper.urlencode($student3)}
</ExternalURL>

Should instead be:

<ExternalURL>
http://www.contrib.andrew.cmu.edu/~nmv/amt/test.php?
student1=${helper.urlencode($student1)}
&amp;student2=${helper.urlencode($student2)}
&amp;student3=${helper.urlencode($student3)}
</ExternalURL>

Each bare & that separates the variables needs to be replaced with the XML entity &amp;. Then it works just fine.
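
If you generate these question files programmatically, the escaping is easy to automate. A minimal sketch in R (escapeAmpersands is my own helper name, not part of the AMT tools), assuming the URL contains no already-escaped entities:

# Escape bare ampersands so the URL is legal inside the XML question file.
# (A general solution would also handle <, >, quotes, and existing entities.)
escapeAmpersands <- function(url) {
  gsub("&", "&amp;", url, fixed = TRUE)
}

escapeAmpersands("http://example.com/test.php?student1=a&student2=b")
# [1] "http://example.com/test.php?student1=a&amp;student2=b"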

Better ways to access NCES data

Since I’m supported by an Institute of Education Sciences (IES) pre-doctoral training fellowship (ours is called PIER), I had the opportunity to attend the 2010 IES Research Conference: Connecting Research, Policy and Practice. On the morning of the last day of the conference, John Easton, the director of IES, held a round-table discussion with all of the pre-doc students that IES supports.

I asked him why the National Center for Education Statistics (NCES), a sub-agency of IES, only makes its data available as flat files. I compared their offerings with the World Bank Data Catalog, which goes so far as to offer an API to access the data in addition to CSV and XML flat files. I also mentioned that even though they have a lot of GIS information (such as the boundaries of every school district since the mid-90s), they don’t make it easy to mash that information up onto, say, a Google Maps layer.
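
To make the contrast concrete, here is the kind of one-line access the World Bank side already supports. This sketch uses the third-party WDI package, which is my choice of wrapper around their API, not something discussed at the conference:

# install.packages("WDI")  # third-party wrapper around the World Bank data API
library(WDI)

# One call returns a tidy data frame: total US population, 2000-2009.
pop <- WDI(country = "US", indicator = "SP.POP.TOTL", start = 2000, end = 2009)
head(pop)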

Dr. Easton replied that he was aware of the problem and of the limitations of the current set of data tools offered by NCES. He asked for concrete use cases that he could forward to the appropriate people, so that they can develop toward actual requests from users instead of guesses about what users might want.

I’d like to get back to him in about a week and I thought I’d open the discussion to others. What would you like to see NCES offer in terms of data access?

StatEt in Ubuntu 10.04

I wanted a “lightweight” version of Eclipse to run R from Ubuntu. (I installed eclipse-pde using apt-get. It worked fine.) Once it was running, I installed StatEt via the “Install New Software” feature from http://download.walware.de/eclipse-3.5. While it was downloading, I opened up an R console and ran install.packages("rJava"). When the installation of both StatEt and rJava finished, I restarted Eclipse. This is when things stopped working, and I couldn’t really find any step-by-step directions on how to proceed. Here is what I did:

  1. Run -> Run Configurations
  2. Click on R-Console in the left pane. This will create a new run configuration. Change the name to “R 2.10”.
  3. Click on the “R_Config” tab. Choose “Selected Configuration:” and then hit the “Configure…” button.
  4. Click “Add”. Change “Location (R_Home):” to “/usr/lib/R”, click “Detect Default Properties/Settings”, and then click “OK” until you are back at the “Run Configurations” window.
  5. This is the important step. Without it, you will get:

    Launching the R Console was cancelled, because it seems starting the Java process/R engine failed.
    Please make sure that R package 'rJava' with JRI is installed and look into the Troubleshooting section on the homepage.

    Click on the JRE tab. In the “VM Arguments” box, add

    -Drjava.path=/home/<username>/R/i486-pc-linux-gnu-library/2.10/rJava

    where <username> is your username. (You are providing the path to the rJava package; for some reason, even though Eclipse detects rJava during the “R_Config” setup, it doesn’t seem to share that information with the JRE. If you don’t know where rJava was installed, see the snippet just after this list.)

  6. Click Run. It should work.
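
If you aren’t sure where install.packages() put rJava, you can ask R for the exact path to paste after -Drjava.path= (base R only):

# Prints the installed location of the rJava package, i.e. the
# directory to pass via -Drjava.path in the VM arguments.
system.file(package = "rJava")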

Not to make excuses, but this is why stopping the oil is hard.

I subscribe to R Bloggers, a nifty service that aggregates blog posts having to do with R. The top post just now was “The Deepwater Horizon, in context”, which shows an infographic of the scale of the Earth from Mt. Everest down to the Mariana Trench. The Deepwater Horizon rig covers a good chunk of the span from sea level downward.

What the graphic made clear is that Deepwater Horizon isn’t a poetic name; it is a description. We are dealing with oil gushing from a point as far below the ocean’s surface as Denver is above sea level. I wonder if the regulators who approved the project had a sense of how remote the “top” of the well was. I have great faith in the ability of engineers to eventually stop the flow, but the fact that BP was “exempted in 2008 from filing a plan on how they would clean up a major spill” seems laughable given the context.

Pegging your multicore CPU in Revolution R, Good and Bad

[Figure: seven of eight cores at maximum usage]

I take an almost unhealthy pleasure in pushing my computer to its limits. This has become easier with Revolution R and its free license for academic use. One of its best features is a debugger that lets you step through R code interactively, the way you can with Python in PyDev. The other useful thing it packages is a simple way to run embarrassingly parallel jobs on a multicore box with the doSMP package.


library(doSMP)

# Declare how many worker processes to use.
# Since I still wanted to use my laptop during the simulation, I chose cores - 1.
workers <- startWorkers(7)
registerDoSMP(workers)

# Keep Revolution R from going multi-core on its own, since we're already
# explicitly running in parallel.
# Tip from: http://blog.revolutionanalytics.com/2010/06/performance-benefits-of-multithreaded-r.html
setMKLthreads(1)

chunkSize <- ceiling(runs / getDoParWorkers())
smpopts <- list(chunkSize = chunkSize)

# This just lets me see how long the simulation ran.
beginTime <- Sys.time()

# This is the crucial piece. It parallelizes a for loop among the workers and
# aggregates their results with cbind. Since my function returns
# c(result1, result2, result3), r becomes a matrix with 3 rows and "runs" columns.
r <- foreach(icount(runs), .combine = cbind, .options.smp = smpopts) %dopar% {
  # repeatExperiment is just a wrapper function that returns c(result1, result2, result3)
  repeatExperiment(N, ratingsPerQuestion, minRatings, trials, cutoff, studentScores)
}

runTime <- Sys.time() - beginTime

# Shut the worker processes down once the simulation is finished.
stopWorkers(workers)

# So now I can do something like this:
boxplot(r[1,], r[2,], r[3,],
        main = paste("Distribution of percent of rmse below ", cutoff,
                     "\n Runs=", runs, " Trials=", trials,
                     " Time=", round(runTime, 2), " mins\n",
                     "scale: ", scaleLow, "-", scaleHigh, sep = ""),
        names = c("Ave3", "Ave5", "Ave7"))

If you are interested in finding out more about this, their docs are pretty good.

The only drawback is that Revolution R is a bit rough around the edges and crashes much more than it should. Worse, for me at least, the support forum doesn’t show any posts when I’m logged in, and I can’t post anything. Although I’ve filled out (what I think is) the appropriate web form, no one has gotten back to me about fixing my account. I’m going to try Twitter in a bit. Your mileage may vary.

Update: 6/9/2010 22:03 EST

Revolution Analytics responded to my support request after I mentioned it on Twitter. Apparently, they had done something to the forums which corrupted my account. Creating a new account fixed the problem, so now I can report the bugs that I find and get some help.

Update: 6/11/2010 16:03 EST

It turns out that you get a small speed improvement by calling setMKLthreads(1). Apparently, the libraries Revolution R links against attempt to use multiple cores by default. If you are explicitly parallel programming, this means that your code is competing with itself for resources. Thanks for the tip!
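
If you want to see the effect on your own machine, timing a big matrix multiply at different MKL thread counts is a quick check. A rough sketch, assuming Revolution R’s setMKLthreads() is available (the matrix size and thread counts are arbitrary):

# Time the same matrix multiply single-threaded and multi-threaded.
A <- matrix(rnorm(2000 * 2000), nrow = 2000)

setMKLthreads(1)                 # pin MKL to one core
single <- system.time(A %*% A)["elapsed"]

setMKLthreads(4)                 # let MKL use four cores
multi <- system.time(A %*% A)["elapsed"]

c(single.threaded = single, multi.threaded = multi)

Inside a %dopar% loop, though, you want MKL pinned at one thread so the explicit workers aren’t competing with the implicit ones for the same cores.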

Rules

This is where I’ll post small things that might be useful to others as I play around with code.