Category Archives: rstats

Writing a Metropolis-Hastings within Gibbs sampler in R for a 2PL IRT model (9 posts)

Last year, Brian Junker, Richard Patz, and I wrote an invited chapter for the (soon to be released) update of the classic text Handbook of Modern Item Response Theory (1996). The chapter itself is meant to be an update of the classic IRT in MCMC papers Patz & Junker (1999a, 1999b).

To support the chapter, I have put together an online supplement which gives a detailed walk-through of how to write a Metropolis-Hastings sampler for a simple psychometric model (in R, of course!). The table of contents is below:

I will continue to add to the online supplement over time. The next few posts will be:

  • Post 10: Over dispersion and multi-core parallelism
  • Post 11: Replacing R with C
  • Post 12: Adaptive tuning of the Metropolis-Hastings proposals

I would be grateful for any feedback. Feel free to either leave it here or at the online supplement itself.

For faster R use OpenBLAS instead: better than ATLAS, trivial to switch to on Ubuntu

R speeds up when the Basic Linear Algebra System (BLAS) it uses is well tuned. The reference BLAS that comes with R and Ubuntu isn’t very fast. On my machine, it takes 9 minutes to run a well known R benchmarking script. If I use ATLAS, an optimized BLAS that can be easily installed, the same script takes 3.5 minutes. If I use OpenBLAS, yet another optimized BLAS that is equally easy to install, the same script takes 2 minutes. That’s a pretty big improvement!

In this post, I’ll show you how to install ATLAS and OpenBLAS, demonstrate how you can switch between them, and let you pick which you would like to use based on benchmark results. Before we get started, one quick shout out to Felix Riedel: thanks for encouraging me to look at OpenBLAS instead of ATLAS in your comment on my previous post.

Update for Mac OS X users: Zachary Meyer’s comment gives bare bones details for how to accomplish a similar BLAS switch. He has a few more details on his blog. Thanks Zachary!

Update for R multicore users: According to this comment and this comment, OpenBLAS does not play well with one of R’s other multicore schemes. It appears to be a bug, so perhaps it will get fixed in the future. See the comment stream for further details.

Installing additional BLAS libraries on Ubuntu

For Ubuntu, there are currently three different BLAS options that can be easily chosen: “libblas” the reference BLAS, “libatlas” the ATLAS BLAS, and “libopenblas” the OpenBLAS. Their package names are

$ apt-cache search libblas
libblas-dev - Basic Linear Algebra Subroutines 3, static library
libblas-doc - Basic Linear Algebra Subroutines 3, documentation
libblas3gf - Basic Linear Algebra Reference implementations, shared library
libatlas-base-dev - Automatically Tuned Linear Algebra Software, generic static
libatlas3gf-base - Automatically Tuned Linear Algebra Software, generic shared
libblas-test - Basic Linear Algebra Subroutines 3, testing programs
libopenblas-base - Optimized BLAS (linear algebra) library based on GotoBLAS2
libopenblas-dev - Optimized BLAS (linear algebra) library based on GotoBLAS2

Since libblas already comes with Ubuntu, we only need to install the other two for our tests. (NOTE: In the following command, delete ‘libatlas3gf-base’ if you don’t want to experiment with ATLAS.):

$ sudo apt-get install libopenblas-base libatlas3gf-base

Switching between BLAS libraries

Now we can switch between the different BLAS options that are installed:

$ sudo update-alternatives --config
There are 3 choices for the alternative (providing /usr/lib/

Selection Path Priority Status
* 0 /usr/lib/openblas-base/ 40 auto mode
1 /usr/lib/atlas-base/atlas/ 35 manual mode
2 /usr/lib/libblas/ 10 manual mode
3 /usr/lib/openblas-base/ 40 manual mode

Press enter to keep the current choice[*], or type selection number:
    Side note: If the above returned:

    update-alternatives: error: no alternatives for


    $ sudo update-alternatives --config

    instead. See the comments at the end of the post for further details.

From the selection menu, I picked 3, so it now shows that choice 3 (OpenBLAS) is selected:

$ sudo update-alternatives --config
There are 3 choices for the alternative (providing /usr/lib/

Selection Path Priority Status
0 /usr/lib/openblas-base/ 40 auto mode
1 /usr/lib/atlas-base/atlas/ 35 manual mode
2 /usr/lib/libblas/ 10 manual mode
* 3 /usr/lib/openblas-base/ 40 manual mode

And we can pull the same trick to choose between LAPACK implementations. From the output we can see that OpenBLAS does not provide a new LAPACK implementation, but ATLAS does:

$ sudo update-alternatives --config
There are 2 choices for the alternative (providing /usr/lib/

Selection Path Priority Status
* 0 /usr/lib/atlas-base/atlas/ 35 auto mode
1 /usr/lib/atlas-base/atlas/ 35 manual mode
2 /usr/lib/lapack/ 10 manual mode

So we will do nothing in this case, since OpenBLAS is supposed to use the reference implementation (which is already selected).

    Side note: If the above returned:

    update-alternatives: error: no alternatives for


    $ sudo update-alternatives –config

    instead. See the comments at the end of the post for further details.

Checking that R is using the right BLAS

Now we can check that everything is working by starting R in a new terminal:

$ R

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Type 'q()' to quit R.


Great. Let’s see if R is using the BLAS and LAPACK libraries we selected. To do so, we open another terminal so that we can run a few more shell commands. First, we find the PID of the R process we just started. Your output will look something like this:

$ ps aux | grep exec/R
1000 18065 0.4 1.0 200204 87568 pts/1 Sl+ 09:00 0:00 /usr/lib/R/bin/exec/R
root 19250 0.0 0.0 9396 916 pts/0 S+ 09:03 0:00 grep --color=auto exec/R

The PID is the second number on the ‘/usr/lib/R/bin/exec/R’ line. To see
which BLAS and LAPACK libraries are loaded with that R session, we use the “list open files” command:

$ lsof -p 18065 | grep 'blas\|lapack'
R 18065 nathanvan mem REG 8,1 9342808 12857980 /usr/lib/lapack/
R 18065 nathanvan mem REG 8,1 19493200 13640678 /usr/lib/openblas-base/

As expected, the R session is using the reference LAPACK (/usr/lib/lapack/ and OpenBLAS (/usr/lib/openblas-base/

Testing the different BLAS/LAPACK combinations

I used Simon Urbanek’s most recent benchmark script. To follow along, first download it to your current working directory:

$ curl -O

And then run it:

$ cat R-benchmark-25.R | time R --slave
Loading required package: Matrix
Loading required package: lattice
Loading required package: SuppDists
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘SuppDists’

Ooops. I don’t have the SuppDists package installed. I can easily load it via Michael Rutter’s ubuntu PPA:

$ sudo apt-get install r-cran-suppdists

Now Simon’s script works wonderfully. Full output

$ cat R-benchmark-25.R | time R --slave
Loading required package: Matrix
Loading required package: lattice
Loading required package: SuppDists
Warning messages:
1: In remove("a", "b") : object 'a' not found
2: In remove("a", "b") : object 'b' not found

R Benchmark 2.5
Number of times each test is run__________________________: 3

I. Matrix calculation
Creation, transp., deformation of a 2500x2500 matrix (sec): 1.36566666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 0.959
Sorting of 7,000,000 random values__________________ (sec): 1.061
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 1.777
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 1.00866666666667
Trimmed geom. mean (2 extremes eliminated): 1.13484335940626

II. Matrix functions
FFT over 2,400,000 random values____________________ (sec): 0.566999999999998
Eigenvalues of a 640x640 random matrix______________ (sec): 1.379
Determinant of a 2500x2500 random matrix____________ (sec): 1.69
Cholesky decomposition of a 3000x3000 matrix________ (sec): 1.51366666666667
Inverse of a 1600x1600 random matrix________________ (sec): 1.40766666666667
Trimmed geom. mean (2 extremes eliminated): 1.43229160585452

III. Programmation
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.10533333333333
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 1.169
Grand common divisors of 400,000 pairs (recursion)__ (sec): 2.267
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.213
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 1.32600000000001
Trimmed geom. mean (2 extremes eliminated): 1.23425893178325

Total time for all 15 tests_________________________ (sec): 19.809
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 1.26122106386747
--- End of test ---

134.75user 16.06system 1:50.08elapsed 137%CPU (0avgtext+0avgdata 1949744maxresident)k
448inputs+0outputs (3major+1265968minor)pagefaults 0swaps

Where the elapsed time at the very bottom is the part that we care about. With OpenBLAS and the reference LAPACK, the script took 1 minute and 50 seconds to run. By changing around the selections with update-alternatives, we can test out R with ATLAS (3:21) or R with the reference BLAS (9:13). For my machine, OpenBLAS is a clear winner.

Give it a shot yourself. If you find something different, let me know.

My Stat Bytes talk, with slides and code

On Thursday of last week I gave a short informal talk to Stat Bytes, the CMU Statistics department‘s twice a month computing seminar.

Quick tricks for faster R code: Profiling to Parallelism

I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code.

The slides are here. Code is here.

This was an informal talk. If you would like to dig into these topics more, some more references:

Update: 6/25/2013 For the Windows users out there, Felix Reidel has some notes about upgrading your BLAS. It is easier than I thought!

Update: 7/9/2013 Felix pointed out that OpenBLAS is faster than ATLAS. He is right. See my new blog post for details and proof.

Managing memory in a list of lists data structure

First, a confession: instead of using classes and defining methods for them, I build a lot of ad hoc data structures out of lists and then build up one-off methods that operate on those lists of lists. I think this is a perl-ism that has transferred into my R code. I might eventually learn how to do classes, but this hack has been working well enough.

One issue I ran into today is that it was getting tedious to find out which objects stored in the list of lists was taking up the most memory. I ended up writing this rather silly recursive function that may be of use to you if you also have been scarred by perl.

# A hacked together function for exploring these structures
get.size <- function(, units='Kb') {
  # Check if the object we were passed is a list
  # N.B. Since is(list()) returns c('list', 'vector') we need a
  #      multiple value comparison like all.equal
  # N.B. Since all.equal will either return TRUE or a vector of 
  #      differences wrapping it in is.logical is the same as 
  #      checking if it returned TRUE. 
  if ( is.logical( all.equal( is( , is(list())))) {
    # Iterate over each element of the list
    lapply( ,
        # Calculate the size of the current element of the list
        # N.B. object.size always returns bytes, but its print 
        #      allows different units. Using capture.output allows
        #      us to do the conversion with the print method
        the.size <- capture.output(print(object.size(xx), units=units))
        # This object may itself be a list...
        if( is.logical( all.equal( is(xx), is(list())))) {
           # if so, recurse if we aren't already at zero size 
           if( the.size != paste(0, units) ) {
    <- get.size( xx , units)
             return( list(the.size, )
           }else {
             # Or just return the zero size
             return( the.size )             
        } else {
           # the element isn't a list, just return its size
           return( the.size)
  } else {
    # If the object wasn't a list, return an error.
    stop("The object passed to this function was not a list.")

The output looks something like this

[1] "2487.7 Kb"

[1] "871 Kb"

[1] "664.5 Kb"

[1] "951.9 Kb"

[1] "4628.2 Kb"

[1] "1.2 Kb"

[1] "2.6 Kb"

where the first element of the list is the sum of everything below it in the hierarchy. Therefore, the whole “models” is 2487.7 Kb and “models$naive.model” is only 871 Kb of that total.

Getting R2WinBUGS to talk to WinBUGS 1.4 on Ubuntu 12.04 LTS

Disclaimer 1: WinBUGS is old and not maintained. There are other packages to use, if you would like to take advantage of more modern developments in MCMC such as:

  • PyMC which transparently implements adaptive Metropolis-Hastings proposals (among other great features), or
  • the LaplacesDemon R package, which dispenses guidance on whether or not your chain converged, or
  • the as of yet released STAN, which will use an automatically tuned Hamiltonian Monte Carlo sampler when it can and (presumably) a WinBUGS like Gibbs sampler when it can’t.

Disclaimer 2: There are also WinBUGS alternatives, like JAGS and OpenBUGS, that are both currently maintained and cross platform (Windows, Mac, and linux). They are worth checking out if you want to maintain some legacy BUGS code.

If you are set on using WinBUGS, the installation is remarkably easy on Ubuntu (easier than Windows 7, in fact).The steps are as follows:

1. Install R. (R 2.14.2)
2. Install wine.  (wine-1.4)
3. Install WinBUGS via wine and setup R2WinBugs. That guide was written for Ubuntu 10.04. Some modifications for Ubuntu 12.04:

  • Ignore the bits about wine 1.0. Wine 1.4 works great.
  • The R2WinBUGS example won’t work. When you run this:
    &gt; schools.sim &lt;- bugs( data, inits, parameters, model.file, n.chains=3, n.iter=5000)

    WinBUGS will pop-up, but it will get stuck at its license screen. If you close the license screen, nothing happens. If you close the WinBUGS window, you get:

    schools.sim p11-kit: couldn't load module: /usr/lib/i386-linux-gnu/pkcs11/ /usr/lib/i386-linux-gnu/pkcs11/ cannot open shared object file: No such file or directory
    err:ole:CoGetClassObject class {0003000a-0000-0000-c000-000000000046} not registered
    err:ole:CoGetClassObject class {0003000a-0000-0000-c000-000000000046} not registered
    err:ole:CoGetClassObject no class object {0003000a-0000-0000-c000-000000000046} could be created for context 0x3&amp;lt;/code&amp;gt;
    Error in,, WINE = WINE, useWINE = useWINE, : Look at the log file and try again with 'debug=TRUE' to figure out what went wrong within Bugs.

    Which isn’t a particularly helpful error message.

  • The error is that the intermediate files that R2WinBUGS produces are not getting shared with WinBUGS, so WinBUGS thinks it doesn’t have to do anything. As mentioned by ‘zcronix’ in the comment thread for the instructions, it is a two step fix: (1) create a temporary directory to store those files and (2) tell R2WinBugs about it with the and clearWD options.

    In your shell:

    nathanvan@nathanvan-N61Jq:~$ cd .wine/drive_c/
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c$ mkdir temp
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c$ cd temp
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c/temp$ mkdir Rtmp

    In R:

    &gt; schools.sim &lt;- bugs( data, inits, parameters, model.file, n.chains=3, n.iter=5000,'~/.wine/drive_c/temp/Rtmp/', clearWD=TRUE)

Hopefully that will work for you too.

R is not C

I keep trying to write R code like it was C code. It is a habit I’m trying to break myself of.

For example, the other day I need to construct a model matrix of 1′s and 0′s in the standard, counting in binary, pattern. My solution was:

n <- 8
powers <- 2^(0:(n-1))
NN <- (max(powers)*2)
designMatrix <- matrix( NA, nrow=NN, ncol=n)
for( ii in 0:(NN-1) ) {
     leftOver <- ii
     for ( jj in 1:n ) {
          largest <- rev(powers)[jj]
          if ( leftOver != 0 && largest <= leftOver ) {
               designMatrix[ii+1,jj] <- 1	
               leftOver <- leftOver - largest
          } else {
               designMatrix[ii+1,jj] <- 0

The code works, but it is a low-level re-implementation of something that already exists in base R. R is not C, because base R has pieces that implement statistical ideas for you. Consider:

expand.grid                package:base                R Documentation

Create a Data Frame from All Combinations of Factors


     Create a data frame from all combinations of the supplied vectors
     or factors.  See the description of the return value for precise
     details of the way this is done.

So then instead of writing (and debugging!) a function to make a binary model matrix, I could have simply used a one-liner:

# Note that c(0,1) is encased in list() so that
# rep(..., n) will repeat the object c(0,1) n 
# times instead of its default behavior of 
# concatenating the c(0,1) objects. 
designMatrix_R <- as.matrix( expand.grid( rep( list(c(0,1) ), n) ) )

I like it. It is both shorter and easier to debug. Now I just need to figure out how to find these base R functions before I throw up my hands and re-implement them in C.

Revolution R with Eclipse Helios

One of the reasons that I don’t often take advantage of the cool features in Revolution R is that I absolutely can’t stand their Visual Studio interface. Previously, if I wanted to run something in RevoR, I fired up the RGui.exe that comes buried in their distribution and used R’s built in script editor. My normal workflow is to use StatEt inside of Eclipse, so dealing with R’s meager editor was always painful. (Although less painful than the bloated VS-standalone alternative.)

Over the break, I ran across Luke Miller’s excellent post on getting Eclipse setup with StatEt the right way. I was able to follow his tutorial to get vanilla 64-bit R setup on a new installation of 64-bit Eclipse Helios. Once that was working, I changed two things to add a second shortcut for Revo R.

First, I followed his directions to install rJava in RevoR:

C:Usersnathanvan>cd C:RevolutionRevo-4.0RevoEnt64R-2.11.1bin

R version 2.11.1 (2010-05-31)
Copyright (C) 2010 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Type 'revo()' to visit for the latest
Revolution R news, 'forum()' for the community forum, or 'readme()'
for release notes.

> install.packages("rJava")
package 'rJava' successfully unpacked and MD5 sums checked

The downloaded packages are in

And then installed rj in RevoR, once again using his directions.

C:RevolutionRevo-4.0RevoEnt64R-2.11.1bin>R CMD INSTALL --no-test-load "C:UsersnathanvanDownloadsrj_0.5.2-1.tar.gz"
* DONE (rj)

And finally setup Eclipse with a second Run Configuration which I named Revo-R-x64-2.11.1. Now I can run the 64bit version of RevoR without having to deal with the VisualStudio interface. If I get around to it, I’ll post some performance numbers. (The last time I used the VS interface, it was noticeably slower than calling RGui.exe directly.)

Messing with R packages

This was really frustrating. I’m trying to modify a package from Matt Johnson and although I could get the package he sent me to install flawlessly, I couldn’t un-tar it, make a change, re-tar it, and then R CMD INSTALL it. I was about to pull out my hair. The error I got was:
ERROR: cannot extract package from ‘hrm-rev9.tar.gz’

The secret: you have to have the name correct.
R CMD INSTALL hrm-rev9.tar.gz
barfs. But
R CMD INSTALL hrm_0.1-9.tar.gz
works fine. I’m sure it’s somewhere in the docs. I just couldn’t find it.

As always, I made a script to do it for me: (Updated 6/17/2010 15:41)

# Quick script to tar & gzip the package, remove the old one, and install the new one
# I'll add options automatically tag and release it later.

#Set the library that I'm using

svn commit -m "Build commit"

#get the revision number from svn
REV=`svn info -R | grep Revision | cut -d: -f 2 | sort -g -r | head -n 1 | sed 's/ //g'`

#Build the filename

# I need to tar up the pkg so I can install it.
# Jump to the parent directory and work from there.
cd ..
# Exclude any hidden files under the directories (svn has a bunch)
# and add the named files
tar czf $FILENAME --exclude '.*' hrm/DESCRIPTION hrm/NAMESPACE hrm/src hrm/R

# Remove the old version of the package

# Install the new package

# Clean up

# Go back to our previous directory
cd hrm

StatEt in Ubuntu 10.04

I wanted a “lightweight” version of Eclipse to run R from Ubuntu. (I installed eclipse-pde using apt-get. It worked fine.) Once it was running, I installed StatEt via the “Install new software” feature from While it was downloading, I opened up an R console and ran install.packages("rJava"). When the installation of both StatEt and rJava finished I restarted Eclipse. This is when things stopped working and I couldn’t really find any step-by step directions on how to proceed. Here is what I did:

  1. Run -> Run Configurations
  2. Click on R-Console in the left pane. This will create a new run configuration. Change the name to “R 2.10″
  3. Click on the “R_Config” tab. Choose “Selected Configuration:” and then hit the “Configure…” button.
  4. Click “Add”. Change “Location (R_Home):” to “/usr/lib/R” and click “Detect Default Properties/Settings” Click “Ok” until you are back to the “Run Configurations” window
  5. This is the important step. Without it you will get

    Launching the R Console was cancelled, because it seems starting the Java process/R engine failed.
    Please make sure that R package 'rJava' with JRI is installed and look into the Troubleshooting section on the homepage.

    Click on the JRE tab. In the “VM Arguments” box add

    Where <username> is your username. (You are providing the path to rJava, for some reason, even though Eclipse will detect it during the setup in the “R_Config” step, it doesn’t seem to share that information with the JRE.)

  6. Click Run. It should work.

Pegging your multicore CPU in Revolution R, Good and Bad

Seven of eight cores at maximum usage

I take an almost unhealthy pleasure in pushing my computer to its limits. This has become easier with Revolution R and its free license for academic use. One of its best features is debugger that allows you to step through R code interactively like you can with python on PyDev. The other useful thing it packages is a simple way to run embarrassingly parallel jobs on a multicore box with the doSMP package.


# This declares how many processors to use.
# Since I still wanted to use my laptop, during the simulation I chose cores-1.
workers <- startWorkers(7)

# Make Revolution R not try to go multi-core since we're already explicitly running in parallel
# Tip from:

chunkSize <- ceiling(runs / getDoParWorkers())
smpopts <- list(chunkSize=chunkSize)

#This just let's me see how long the simulation ran
beginTime <- Sys.time()

#This is the crucial piece. It parallelizes a for loop among the workers and aggregates their results
#with cbind. Since my function returns c(result1, result2, result3), r becomes a matrix with 3 rows and
# "runs" columns.
r <- foreach(icount(runs), .combine=cbind, .options.smp=smpopts) %dopar% {
# repeatExperiment is just a wrapper function that returns a c(result1, result2, result3)
tmp <- repeatExperiment(N,ratingsPerQuestion, minRatings, trials, cutoff, studentScores)

runTime <- Sys.time() - beginTime

#So now I can do something like this:
boxplot(r[1,], r[2,], r[3,],
main=paste("Distribution of Percent of rmse below ", cutoff,
"n Runs=", runs, " Trials=",trials, " Time=",round(runTime,2)," minsn",
"scale: ",scaleLow,"-",scaleHigh,

If you are intersested in finding out more of about this, their docs are pretty good.

The only drawback is that Revolution R is a bit rough around the edges and crashes much more than it should. Worse, for me at least the support forum doesn’t show any posts when I’m logged in and I can’t post anything. Although I’ve filled out (what I think is) the appropriate web-form no one has gotten back to me about fixing my account. I’m going to try twitter in a bit. Your mileage may vary.

Update: 6/9/2010 22:03 EST

Revolution Analytics responded to my support request after I mentioned it on twitter. Apparently, they had done something to the forums which corrupted my account. Creating a new account fixed the problem, so now I can report the bugs that I
find and get some help.

Update: 6/11/2010 16:03 EST

It turns out that you get a small speed improvement by setting setMKLthreads(1). Apparently, the libraries Revolution R links against attempt to use multiple cores by default. If you are explicitly parrallel programing, this means that your code is competing with itself for resources. Thanks for the tip!