Managing memory in a list of lists data structure

First, a confession: instead of using classes and defining methods for them, I build a lot of ad hoc data structures out of lists and then write one-off methods that operate on those lists of lists. I think this is a perl-ism that has transferred into my R code. I might eventually learn how to do classes, but this hack has been working well enough.

One issue I ran into today is that it was getting tedious to find out which objects stored in the list of lists were taking up the most memory. I ended up writing this rather silly recursive function that may be of use to you if you have also been scarred by perl.

# A hacked together function for exploring these structures
get.size <- function( obj.to.size, units='Kb') {
  # Check if the object we were passed is a list
  # N.B. Since is(list()) returns c('list', 'vector') we need a
  #      multiple value comparison like all.equal
  # N.B. Since all.equal will either return TRUE or a vector of 
  #      differences wrapping it in is.logical is the same as 
  #      checking if it returned TRUE. 
  if ( is.logical( all.equal( is(obj.to.size) , is(list())))) {
    # Iterate over each element of the list
    lapply( obj.to.size ,
      function(xx){
        # Calculate the size of the current element of the list
        # N.B. object.size always returns bytes, but its print 
        #      allows different units. Using capture.output allows
        #      us to do the conversion with the print method
        the.size <- capture.output(print(object.size(xx), units=units))
        # This object may itself be a list...
        if( is.logical( all.equal( is(xx), is(list())))) {
           # if so, recurse if we aren't already at zero size 
           if( the.size != paste(0, units) ) {
             the.rest <- get.size( xx , units)
             return( list(the.size, the.rest) )
           }else {
             # Or just return the zero size
             return( the.size )             
           }
        } else {
           # the element isn't a list, just return its size
           return( the.size)
        }
      })
  } else {
    # If the object wasn't a list, return an error.
    stop("The object passed to this function was not a list.")
  }
}
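As an aside, base R's is.list() performs much the same check as the all.equal() trick above, a bit more directly. One caveat: it also returns TRUE for data frames, since those are lists under the hood.

```r
# is.list() is TRUE for plain lists (and data frames),
# FALSE for atomic vectors
is.list( list(a = 1, b = 2) )  # TRUE
is.list( 1:10 )                # FALSE
is.list( data.frame(x = 1) )   # TRUE: data frames are lists internally
```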

The output looks something like this:

$models
$models[[1]]
[1] "2487.7 Kb"

$models[[2]]
$models[[2]]$naive.model
[1] "871 Kb"

$models[[2]]$clustered.model
[1] "664.5 Kb"

$models[[2]]$gls.model
[1] "951.9 Kb"



$V
[1] "4628.2 Kb"

$fixed.formula
[1] "1.2 Kb"

$random.formula
[1] "2.6 Kb"

where the first element of the list is the sum of everything below it in the hierarchy. Therefore, the whole "models" is 2487.7 Kb and "models$naive.model" is only 871 Kb of that total.
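To see the capture.output(print(object.size(...))) trick from the function in isolation: object.size() always reports bytes, but its print method will do the unit conversion for us, and capture.output() turns that printed result into a string.

```r
# object.size() reports bytes; its print method converts units,
# and capture.output() grabs the printed string
x <- rnorm(1e4)   # 10,000 doubles, roughly 80,000 bytes
object.size(x)                                         # printed in bytes
capture.output( print( object.size(x), units='Kb' ) )  # e.g. "78.2 Kb"
```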

Managing Latex packages manually in Ubuntu 12.04

Ubuntu is great, but making it play nice with latex is a bit of a pain. There are three parts to this:

  1. Ubuntu comes with APT, a really nice package management system that lets you easily install, update, and remove software that has been helpfully packaged by both Canonical and the wider Debian community.
  2. Tex Live, the official release of latex, comes with tlmgr, an equally great package manager for managing all of the latex packages on CTAN.
  3. Ubuntu's distribution of latex omits tlmgr and forces developers to repackage the latex packages to fit into the APT scheme. (source)

This seems to be why my previous post about fixing moderncv for Ubuntu was so popular. It is not obvious to most users that to fix the

LaTeX Error: File `marvosym.sty' not found.

error, the user has to both (1) find the Ubuntu package that provides marvosym.sty and then (2) install that Ubuntu package along with every other latex package that happens to be bundled with it.

All of that is fine if a kind-hearted developer has had the foresight to bundle the latex package you want or need in a convenient form for installation with APT. If not, you have two options:

  1. Keep Tex Live under the control of Ubuntu's package management and manually install the Latex packages you need. An easy way to do this is described below.
  2. Break out Tex Live from Ubuntu's package manager and use tlmgr for Latex package management. This gives you MiKTeX-style latex package management for Ubuntu, but you are responsible for keeping Tex Live up to date. See the answers to this Stack Exchange question for details of how to do it.

For now I'm sticking with Option 1. Here is a worked example to install the Latex package outlines for Ubuntu:

  1. Look at the path Latex searches to find packages with 'kpsepath tex', which should give output similar to:
    nathanvan@nathanvan-N61Jq:~$ kpsepath tex | sed -e 's/:/\n:/g'
    .
    :/home/nathanvan/.texmf-config/tex/kpsewhich//
    :/home/nathanvan/.texmf-var/tex/kpsewhich//
    :/home/nathanvan/texmf/tex/kpsewhich//
    :/etc/texmf/tex/kpsewhich//
    :!!/var/lib/texmf/tex/kpsewhich//
    :!!/usr/local/share/texmf/tex/kpsewhich//
    :!!/usr/share/texmf/tex/kpsewhich//
    :!!/usr/share/texmf-texlive/tex/kpsewhich//
    :/home/nathanvan/.texmf-config/tex/generic//
    :/home/nathanvan/.texmf-var/tex/generic//
    :/home/nathanvan/texmf/tex/generic//
    :/etc/texmf/tex/generic//
    :!!/var/lib/texmf/tex/generic//
    :!!/usr/local/share/texmf/tex/generic//
    :!!/usr/share/texmf/tex/generic//
    :!!/usr/share/texmf-texlive/tex/generic//
    :/home/nathanvan/.texmf-config/tex///
    :/home/nathanvan/.texmf-var/tex///
    :/home/nathanvan/texmf/tex///
    :/etc/texmf/tex///
    :!!/var/lib/texmf/tex///
    :!!/usr/local/share/texmf/tex///
    :!!/usr/share/texmf/tex///
    :!!/usr/share/texmf-texlive/tex///
    
  2. Note the entry ':/home/nathanvan/texmf/tex///', which tells latex to search every subdirectory under '/home/nathanvan/texmf/tex' to find packages that haven't been found yet. You'll have something similar for your home directory.
  3. Make a 'texmf/tex/latex' directory under your home directory:
    nathanvan@nathanvan-N61Jq:~$ mkdir -p ~/texmf/tex/latex
    
  4. Find the package you want on CTAN, say outlines, because you read this blog post and want to try it out.
  5. Download 'the contents of this directory bundled as a zip file', as CTAN likes to say it, and save it to '~/texmf/tex/latex'
  6. Unzip it right there:
    nathanvan@nathanvan-N61Jq:~$ cd texmf/tex/latex
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ ls
    outlines.zip
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ unzip outlines.zip
    Archive: outlines.zip
    creating: outlines/
    inflating: outlines/outlines.pdf
    inflating: outlines/outlines.sty
    inflating: outlines/outlines.tex
    inflating: outlines/README
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ ls
    outlines outlines.zip
    

And then you are done installing the latex package. It works great without any big hassles.

Edit: If the package you were installing contains fonts, this won't quite work. See Steve Kroon's comment below for details of how to fix it.

Edit: Thanks to jon for pointing out the correct directory structure for ~/texmf in the first comment to this answer. For the curious, more details, including why the directory is called texmf, can be found here.

Getting R2WinBUGS to talk to WinBUGS 1.4 on Ubuntu 12.04 LTS

Disclaimer 1: WinBUGS is old and not maintained. There are other packages to use if you would like to take advantage of more modern developments in MCMC, such as:

  • PyMC which transparently implements adaptive Metropolis-Hastings proposals (among other great features), or
  • the LaplacesDemon R package, which dispenses guidance on whether or not your chain converged, or
  • the not-yet-released STAN, which will use an automatically tuned Hamiltonian Monte Carlo sampler when it can and (presumably) a WinBUGS-like Gibbs sampler when it can't.

Disclaimer 2: There are also WinBUGS alternatives, like JAGS and OpenBUGS, that are both currently maintained and cross platform (Windows, Mac, and linux). They are worth checking out if you want to maintain some legacy BUGS code.

If you are set on using WinBUGS, the installation is remarkably easy on Ubuntu (easier than Windows 7, in fact). The steps are as follows:

1. Install R. (R 2.14.2)
2. Install wine.  (wine-1.4)
3. Install WinBUGS via wine and set up R2WinBUGS. That guide was written for Ubuntu 10.04. Some modifications for Ubuntu 12.04:

  • Ignore the bits about wine 1.0. Wine 1.4 works great.
  • The R2WinBUGS example won't work. When you run this:
    > schools.sim <- bugs( data, inits, parameters, model.file, n.chains=3, n.iter=5000)
    

    WinBUGS will pop-up, but it will get stuck at its license screen. If you close the license screen, nothing happens. If you close the WinBUGS window, you get:

    schools.sim p11-kit: couldn't load module: /usr/lib/i386-linux-gnu/pkcs11/gnome-keyring-pkcs11.so: /usr/lib/i386-linux-gnu/pkcs11/gnome-keyring-pkcs11.so: cannot open shared object file: No such file or directory
    err:ole:CoGetClassObject class {0003000a-0000-0000-c000-000000000046} not registered
    err:ole:CoGetClassObject class {0003000a-0000-0000-c000-000000000046} not registered
    err:ole:CoGetClassObject no class object {0003000a-0000-0000-c000-000000000046} could be created for context 0x3
    
    Error in bugs.run(n.burnin, bugs.directory, WINE = WINE, useWINE = useWINE, : Look at the log file and try again with 'debug=TRUE' to figure out what went wrong within Bugs.
    

    Which isn't a particularly helpful error message.

  • The error is that the intermediate files that R2WinBUGS produces are not getting shared with WinBUGS, so WinBUGS thinks it doesn't have to do anything. As mentioned by 'zcronix' in the comment thread for the instructions, it is a two-step fix: (1) create a temporary directory to store those files and (2) tell R2WinBUGS about it with the working.directory and clearWD options.

    In your shell:

    nathanvan@nathanvan-N61Jq:~$ cd .wine/drive_c/
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c$ mkdir temp
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c$ cd temp
    nathanvan@nathanvan-N61Jq:~/.wine/drive_c/temp$ mkdir Rtmp
    

    In R:

    > schools.sim <- bugs( data, inits, parameters, model.file, n.chains=3, n.iter=5000, working.directory='~/.wine/drive_c/temp/Rtmp/', clearWD=TRUE)
    

Hopefully that will work for you too.

R is not C

I keep trying to write R code like it was C code. It is a habit I'm trying to break myself of.

For example, the other day I needed to construct a model matrix of 1's and 0's in the standard counting-in-binary pattern. My solution was:

n <- 8
powers <- 2^(0:(n-1))
NN <- max(powers) * 2
designMatrix <- matrix( NA, nrow=NN, ncol=n)
for( ii in 0:(NN-1) ) {
     leftOver <- ii
     for ( jj in 1:n ) {
          largest <- rev(powers)[jj]
          if ( leftOver != 0 && largest <= leftOver ) {
               designMatrix[ii+1,jj] <- 1
               leftOver <- leftOver - largest
          } else {
               designMatrix[ii+1,jj] <- 0
          }
     }
}
print(designMatrix)

The code works, but it is a low-level re-implementation of something that already exists in base R. R is not C, because base R has pieces that implement statistical ideas for you. Consider:

expand.grid                package:base                R Documentation

Create a Data Frame from All Combinations of Factors

Description:

     Create a data frame from all combinations of the supplied vectors
     or factors.  See the description of the return value for precise
     details of the way this is done.

So then instead of writing (and debugging!) a function to make a binary model matrix, I could have simply used a one-liner:

# Note that c(0,1) is encased in list() so that
# rep(..., n) will repeat the object c(0,1) n 
# times instead of its default behavior of 
# concatenating the c(0,1) objects. 
designMatrix_R <- as.matrix( expand.grid( rep( list(c(0,1) ), n) ) )

I like it. It is both shorter and easier to debug. Now I just need to figure out how to find these base R functions before I throw up my hands and re-implement them in C.
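As a quick sanity check, here is a sketch for a small n. Both approaches enumerate all 2^n distinct 0/1 rows; only the ordering of the rows and columns differs (expand.grid varies its first column fastest).

```r
# For n = 3, expand.grid should enumerate all 2^3 = 8 binary patterns
n <- 3
designMatrix_R <- as.matrix( expand.grid( rep( list(c(0,1)), n ) ) )
dim( designMatrix_R )                  # 8 rows, 3 columns
nrow( unique(designMatrix_R) ) == 2^n  # TRUE: every pattern appears exactly once
```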

My first citation in a published work

Back in undergrad, I wrote a term paper for May Berenbaum's Insects & People honors seminar. The assignment was to relate something about insects to people's daily lives. At the time, I was finishing up the first course in the year long classical mechanics sequence, so my daily life was physics problem sets. For the first time, I noticed something. There are a lot of insects in mechanics word problems (if a roach is walking on a turntable...; if a bee flies in a spiral parametrized by... ). It seems they were always there, but take a class on insects and all of a sudden you notice them making cameos. So, that's what I wrote about. It was a fun paper.

Seven years later, that paper made it into the first chapter of her book: The Earwig's Tail: A Modern Bestiary of Multi-legged Legends.

It's kinda funny reading it on Google Books. My only complaint is that she misspelled my last name: it's not Van Houdnos; it's VanHoudnos. That's why I didn't know about it; I had never google-stalked myself with the alternate spelling of "nathan van houdnos" before. (Update: A bit more detail about my last name.)

I'm pretty excited about this. I'll have to see if she'll send me a signed copy. :)

UPDATE 5/9/2011
This essay in the American Entomologist is actually my first citation in a published work. It's from Spring 2003, so it beats the book by several years. (Of course, the essay is the relevant part of the book; it's not like I have two citations or anything.)

Is my bounce rate due to my HIT display?

My HITs have a rather high bounce rate: between 40% and 50% of the Turkers who preview my HIT choose not to accept it. I previously posted a histogram of the screen widths that I observed from workers who had accepted at least one HIT. That is very clearly a biased sample; it could be that only workers with screens large enough to comfortably display my HIT choose to accept it. I was curious to see if there was another population of Turkers who chose not to accept my HIT because their screens were too small.

I made the necessary modifications to my webapp and then generated the following graph:

[Figure: histogram of screen resolutions observed for Experiment 15, split by whether the worker accepted the HIT]

You'll notice that there isn't much of a practical difference between the workers who accept the HIT and those who do not. This makes me feel a little better. I'm not worried that my bounce rate is due to a display artifact. It does make me wonder, though: is my bounce rate typical?

General Purpose AMT webapp based on web2py

I wrote in my previous post about some scripts I developed to make interacting with Amazon Mechanical Turk a bit easier from the command line. I didn't talk much about the web2py + Google AppEngine piece that actually serves the ExternalQuestion HITs. I realized after talking to another PhD student that people might be interested in a way to host general-purpose HITs for free. My web2py application does that.

Getting it running is pretty simple. Sign up for an AppEngine account, install their SDK, install web2py into the SDK, and copy my application (really a directory) into your web2py installation. Use the development webserver included with the SDK to test that your installation is sane. Push it to google to make sure your account is working right. Once you have it running on appspot, install the Amazon Command line tools, install my wrapper scripts to make them behave better, and run the test experiment on the sandbox to verify that AMT + GAE + web2py are talking nicely in the clouds.

Once that's done, defining your own rubrics for HITs is pretty easy:

  1. Define a new controller to handle your custom rubrics. Example: grade6.py to hold grade6_test2_problem35_version1
  2. Copy the provided method template and modify it to handle a given rubricCode. The template builds a SQLFORM.factory to present the questions to Turkers and then validates the form input. Once the form is accepted, the method processes the result (scores it) and forwards it to a generic method that writes it to the GAE datastore and sends it back to AMT. Example: def grade6_test2_problem35_version1() ....
  3. Copy the provided view template and modify it to ask the question you want for a given rubric code. The template extends a view class that knows how to display an informed consent in preview mode and track whenever a Turker clicks on a form element. It uses standard web2py tricks to protect you against injection and give you form validation for free. Example: grade6_test2_problem35_version1.html
  4. Prep the HITs by using my scripts to cross the rubricCode with the image (or whatever) you want to display. Run it on the sandbox and test it out. Promote the experiment and run it on production when you are ready.

Okay. Well, maybe the setup doesn't look easy, but a complete definition of a HIT is just over 125 lines of code, including comments. That's really not that bad. It's a heck of a lot easier than trying to put an ExternalQuestion together from scratch. If the internet is interested, I'll clean up the application code (read: remove the parts pertinent to my research and IRB restrictions) and post it on GitHub. Leave a comment or send me mail if you are interested.

A better way to interact with AMT from the command line

Amazon Mechanical Turk is a useful thing. Interacting with it can be a giant pain.

My "standard" approach is to use web2py running on top of Google AppEngine to serve ExternalQuestion HITs. This allows me to have quite a bit of control over the Turker's experience and collect useful data like when they click on each button in the webapp. Although my current project doesn't use this click history, a related, later project will. It also lets me do fun things like use a bit of JavaScript to figure out the distribution of screen widths across workers so that I can optimize their viewing experience.

[Figure: histogram of worker screen widths. Anything under 700 pixels is fair game.]

But that's not what this post is about. I'd like to eventually use boto to build the control of AMT directly into the webapp itself. For now, I'm using the command line interface that Amazon provides. The CLT is an ugly hack that implements the Java API in a bunch of shell scripts. I have written my own wrapper around their scripts that enforces a certain amount of sanity. You can get the scripts over on GitHub.

There isn't really any documentation beyond the scripts themselves. The idea is that you create an amt-script directory where the new-and-improved scripts live. Under that directory, you create several exp# directories that hold the info you need for experiments. Even-numbered ones are for production runs; odd-numbered ones are for sandbox runs. Once you get a sandbox run working, you run buildGoLiveExp.sh and it makes a copy from the staged experiment to a new go-live experiment. It's a bit of a hack at the moment, but it works for me. I like it because it gives me an audit trail for each thing I run on AMT. Feel free to use them yourself. (Or use them as inspiration for something better that you can write yourself!)