Stochastic regressors and exogeneity for statisticians

Since my Ph.D. training was split between the Statistics Department and the Heinz School of Public Policy, I can attest that statisticians and economists sometimes talk past each other. This blog post explains the econometric terms exogenous and endogenous in a framework that is perhaps more natural for a statistician. The post ends with a realization that I should pay more attention to exogeneity when including covariates to increase the power of an experiment.

Implementing mclapply() on Windows: a primer on embarrassingly parallel computation on multicore systems with R

An easy way to run R code in parallel on a multicore system is with the mclapply() function. Unfortunately, mclapply() does not work on Windows machines because the mclapply() implementation relies on forking and Windows does not support forking.

For me, this is somewhat of a headache because I am used to using mclapply(), and yet I need to support Windows users for one of my projects.

My hackish solution is to implement a fake mclapply() for Windows users with one of the Windows-compatible parallel R strategies. For the impatient, it works like this:

require(parallel) 

## On Windows, the following line will take about 40 seconds to run
## because by default, mclapply is implemented as a serial function
## on Windows systems.
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.00    0.00   40.06 

## If we try to force  mclapply() to use multiple cores on Windows, it doesn't work:
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }, mc.cores=4) )
## Error in mclapply(1:4, function(xx) { Sys.sleep(10) }, mc.cores = 4) : 
##   'mc.cores' > 1 is not supported on Windows

## Using the ideas developed in this post, we can implement
## a parallel (as it should be!) mclapply() on Windows. 
source("http://www.stat.cmu.edu/~nmv/setup/mclapply.hack.R")
## 
##     *** Microsoft Windows detected ***
##     
##     For technical reasons, the MS Windows version of mclapply()
##     is implemented as a serial function instead of a parallel
##     function.    
## 
##     As a quick hack, we replace this serial version of mclapply()
##     with a wrapper to parLapply() for this R session. Please see
## 
##       http://www.stat.cmu.edu/~nmv/2014/07/14/implementing-mclapply-on-windows 
## 
##     for details.

## And now the code from above will take about 10 seconds (plus overhead). 
system.time( mclapply(1:4, function(xx){ Sys.sleep(10) }) )
##    user  system elapsed 
##    0.01    0.06   11.25 

As we will see, however, there are a few reasons why no one has done this in the past.

Our goal: easy, Linux/Mac-like parallelization

On Linux or Mac, it is very simple to parallelize R code across multiple cores. Consider the following function:

wait.then.square <- function(xx){
    # Wait for one second
    Sys.sleep(1);
    # Square the argument 
    xx^2 } 

If we want to run it on the integers from 1 to 4 in serial, it will take about 4 seconds

## Run in serial 
system.time( serial.output <- lapply( 1:4, wait.then.square ) )
##  user  system elapsed 
## 0.000   0.000   4.004 

If we run it in parallel, it will take about 1 second:

## Run in parallel
require(parallel) 
## Note two changes: 
##   (1)  lapply to mclapply 
##   (2)  mc.cores (the number of processors to use in parallel) 
system.time( par.output <- mclapply( 1:4, wait.then.square,
                                     mc.cores=4             ) )
##  user  system elapsed 
## 0.008   0.000   1.008 

And we can verify that the output is, in fact, the same:

## Check if the output is the same
all.equal( serial.output, par.output )
## [1] TRUE

This toy example is a little unrealistic. It is often the case, at least for the work that I do, that the parallelized function either (i) uses an R library that isn't loaded at startup by default, e.g. the Matrix library for sparse matrices, or (ii) needs to access an object in the global environment, e.g. a variable.

The magic of mclapply() is that it uses fork to replicate the R process into several child processes, tells the children to do the work, and then aggregates the children's results for you. Since it uses forking, the entire R session -- all of its variables, functions, and packages -- is replicated among the children. Therefore, you can do things like this:

## Setup a global variable that uses a non-base package
require(Matrix)
( a.global.variable <- Diagonal(3) )
## 3 x 3 diagonal matrix of class "ddiMatrix"
##      [,1] [,2] [,3]
## [1,]    1    .    .
## [2,]    .    1    .
## [3,]    .    .    1

## Write a proof-of-concept lapply
serial.output <- lapply( 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    })

## Parallelize it
par.output <- mclapply( 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    }, mc.cores=4)

## Check that they are equal 
all.equal(serial.output, par.output)
## [1] TRUE

It is, at least to me, a little magical! I don't have to think much.

The problem: Windows parallelization requires more setup

Windows doesn't fork. It is a limitation of the operating system that there is no easy way to replicate the parent R session to create new child R sessions that can do the work.

R gets around this by pretending that each core on the machine is an entirely separate machine. This makes the setup a little more involved because the user must:

  1. create a "cluster" of child processes,
  2. load the necessary R packages on the cluster,
  3. copy the necessary R objects to the cluster,
  4. distribute work to the cluster, and finally
  5. stop the cluster.

Recall that the setup of the example is as follows:

## Load packages 
   require(parallel)
   require(Matrix)
##
## Define example function and the global variable 
   wait.then.square <- function(xx){
       # Wait for one second
       Sys.sleep(1);
       # Square the argument 
      xx^2 } 
   a.global.variable <- Diagonal(3) 

and the serial version of the code is:

serial.output <- lapply( 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    })

Parallelizing this code requires more setup with the "cluster" approach.

## Step 1: Create a cluster of child processes 
cl <- makeCluster( 4 )

## Step 2: Load the necessary R package(s)
## N.B. length(cl) is the number of child processes
##      in the cluster 
par.setup <- parLapply( cl, 1:length(cl),
    function(xx) {
        require(Matrix) 
    })

## Step 3: Distribute the necessary R objects 
clusterExport( cl, c('wait.then.square', 'a.global.variable') )

## Step 4: Do the computation
par.output <- parLapply(cl, 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    })

## Step 5: Remember to stop the cluster!
stopCluster(cl)

## Check that the parallel and serial output are the same
all.equal(serial.output, par.output)
## [1] TRUE

This approach works on Windows, Linux, and Mac, but it requires a bit more bookkeeping.

The hack: implementing mclapply() with parLapply()

Even though Windows doesn't fork, I'd like to pretend that it does so that I can use the simpler syntax of mclapply(). My approach is to wrap the bookkeeping code for parLapply() into a single function: mclapply.hack().

This is likely a bad idea for general use. Creating and destroying clusters for every mclapply.hack() call defeats the advantages of having a persistent cluster to farm out work to. Copying every R object from the parent session to all of the cluster sessions takes much more memory (and time!) than simply forking processes. Use this approach with caution!
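
For comparison, the persistent-cluster pattern that this hack gives up looks roughly like this. It is a minimal sketch of my own (not part of the final script): the setup cost is paid once and the same workers are reused across many calls.

require(parallel)
require(Matrix)

## Pay the setup cost once...
cl <- makeCluster( detectCores() )
par.setup <- parLapply( cl, 1:length(cl),
    function(xx) { require(Matrix) })
clusterExport( cl, c('wait.then.square', 'a.global.variable') )

## ... then reuse the same workers for as many calls as you like.
run.one <- parLapply( cl, 1:4,
    function(xx) { wait.then.square(xx) + a.global.variable })
run.two <- parLapply( cl, 5:8,
    function(xx) { wait.then.square(xx) + a.global.variable })

## Shut the workers down once, at the very end.
stopCluster(cl)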

The final code is as follows.

mclapply.hack <- function(...) {
    ## Create a cluster
    ## ... How many workers do you need?
    ## ... N.B. list(...)[[1]] returns the first 
    ##          argument passed to the function. In
    ##          this case it is the list to iterate over
    size.of.list <- length(list(...)[[1]])
    cl <- makeCluster( min(size.of.list, detectCores()) )

    ## Find out the names of the loaded packages 
    loaded.package.names <- c(
        ## Base packages
        sessionInfo()$basePkgs,
        ## Additional packages
        names( sessionInfo()$otherPkgs ))

    ## N.B. tryCatch() allows us to properly shut down the 
    ##      cluster if an error in our code halts execution
    ##      of the function. For details see: help(tryCatch)
    tryCatch( {

       ## Copy over all of the objects within scope to
       ## all clusters. 
       ## 
       ## The approach is as follows: Beginning with the 
       ## current environment, copy over all objects within
       ## the environment to all clusters, and then repeat
       ## the process with the parent environment. 
       ##
       this.env <- environment()
       while( identical( this.env, globalenv() ) == FALSE ) {
           clusterExport(cl,
                         ls(all.names=TRUE, env=this.env),
                         envir=this.env)
           this.env <- parent.env(this.env)
       }
       ## repeat for the global environment
       clusterExport(cl,
                     ls(all.names=TRUE, env=globalenv()),
                     envir=globalenv())
       
       ## Load the libraries on all the clusters
       ## N.B. length(cl) returns the number of workers in the cluster
       parLapply( cl, 1:length(cl), function(xx){
           lapply(loaded.package.names, function(yy) {
               ## N.B. the character.only option of 
               ##      require() allows you to give the 
               ##      name of a package as a string. 
               require(yy , character.only=TRUE)})
       })
       
       ## Run the lapply in parallel 
       return( parLapply( cl, ...) )
    }, finally = {        
       ## Stop the cluster
       stopCluster(cl)
    })
}

We can test it as follows:

system.time( serial.output <- lapply( 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    })) 
##    user  system elapsed 
##   0.020   0.000   4.022 

system.time( par.output <- mclapply.hack( 1:4,
    function(xx) {
       return( wait.then.square(xx) + a.global.variable )
    })) 
##    user  system elapsed 
##   0.024   0.012   3.683 

all.equal( serial.output, par.output )
## [1] TRUE

In this case, it works, but we don't save much time because of the bookkeeping required to set up the cluster for parLapply(). If we run a more expensive function, say one that takes 10 seconds per iteration, then we begin to see gains:

wait.longer.then.square <- function(xx){
    ## Wait for ten seconds
    Sys.sleep(10);
    ## Square the argument
    xx^2 } 

system.time( serial.output <- lapply( 1:4,
    function(xx) {
       return( wait.longer.then.square(xx) + a.global.variable )
    })) 
##    user  system elapsed 
##   0.020   0.000  40.059

system.time( par.output <- mclapply.hack( 1:4,
    function(xx) {
       return( wait.longer.then.square(xx) + a.global.variable )
    })) 
##    user  system elapsed 
##  0.024   0.008  12.794 

all.equal( serial.output, par.output )
## [1] TRUE

The final answer: A multi-platform wrapper for mclapply()

My motivation for implementing mclapply() on Windows is so that code I wrote on Linux will "just work" on Windows.

I wrote a quick script to implement mclapply.hack() as mclapply() on Windows machines and leave mclapply() alone on Linux and Mac machines. The code is as follows:

##
## mclapply.hack.R
##
## Nathan VanHoudnos
## nathanvan AT northwestern FULL STOP edu
## July 14, 2014
##
## A script to implement a hackish version of 
## parallel::mclapply() on Windows machines.
## On Linux or Mac, the script has no effect
## beyond loading the parallel library. 

require(parallel)    

## Define the hack
mclapply.hack <- function(...) {
    ## Create a cluster
    size.of.list <- length(list(...)[[1]])
    cl <- makeCluster( min(size.of.list, detectCores()) )

    ## Find out the names of the loaded packages 
    loaded.package.names <- c(
        ## Base packages
        sessionInfo()$basePkgs,
        ## Additional packages
        names( sessionInfo()$otherPkgs ))

    tryCatch( {

       ## Copy over all of the objects within scope to
       ## all clusters. 
       this.env <- environment()
       while( identical( this.env, globalenv() ) == FALSE ) {
           clusterExport(cl,
                         ls(all.names=TRUE, env=this.env),
                         envir=this.env)
           this.env <- parent.env(this.env)
       }
       clusterExport(cl,
                     ls(all.names=TRUE, env=globalenv()),
                     envir=globalenv())
       
       ## Load the libraries on all the clusters
       ## N.B. length(cl) returns the number of workers in the cluster
       parLapply( cl, 1:length(cl), function(xx){
           lapply(loaded.package.names, function(yy) {
               require(yy , character.only=TRUE)})
       })
       
       ## Run the lapply in parallel 
       return( parLapply( cl, ...) )
    }, finally = {        
       ## Stop the cluster
       stopCluster(cl)
    })
}

## Warn the user if they are using Windows
if( Sys.info()[['sysname']] == 'Windows' ){
    message(paste(
      "\n", 
      "   *** Microsoft Windows detected ***\n",
      "   \n",
      "   For technical reasons, the MS Windows version of mclapply()\n",
      "   is implemented as a serial function instead of a parallel\n",
      "   function.",
      "   \n\n",
      "   As a quick hack, we replace this serial version of mclapply()\n",
      "   with a wrapper to parLapply() for this R session. Please see\n\n",
      "     http://www.stat.cmu.edu/~nmv/2014/07/14/implementing-mclapply-on-windows \n\n",
      "   for details.\n\n"))
}

## If the OS is Windows, set mclapply to the
## the hackish version. Otherwise, leave the
## definition alone. 
mclapply <- switch( Sys.info()[['sysname']],
   Windows = {mclapply.hack}, 
   Linux   = {mclapply},
   Darwin  = {mclapply})

## end mclapply.hack.R

I posted the script at http://www.stat.cmu.edu/~nmv/setup/mclapply.hack.R. You can use it with

source('http://www.stat.cmu.edu/~nmv/setup/mclapply.hack.R')

as shown in the beginning of the post.

I would be grateful for any comments or suggestions for improving it. If there is sufficient interest, I can wrap it into a simple R package.

Writing a Metropolis-Hastings within Gibbs sampler in R for a 2PL IRT model (9 posts)

Last year, Brian Junker, Richard Patz, and I wrote an invited chapter for the (soon to be released) update of the classic text Handbook of Modern Item Response Theory (1996). The chapter itself is meant to be an update of the classic IRT in MCMC papers Patz & Junker (1999a, 1999b).

To support the chapter, I have put together an online supplement which gives a detailed walk-through of how to write a Metropolis-Hastings sampler for a simple psychometric model (in R, of course!). The table of contents is below:

I will continue to add to the online supplement over time. The next few posts will be:

  • Post 10: Overdispersion and multi-core parallelism
  • Post 11: Replacing R with C
  • Post 12: Adaptive tuning of the Metropolis-Hastings proposals

I would be grateful for any feedback. Feel free to either leave it here or at the online supplement itself.

For faster R use OpenBLAS instead: better than ATLAS, trivial to switch to on Ubuntu

R speeds up when the Basic Linear Algebra Subprograms (BLAS) library it uses is well tuned. The reference BLAS that comes with R and Ubuntu isn't very fast. On my machine, it takes 9 minutes to run a well-known R benchmarking script. If I use ATLAS, an optimized BLAS that can be easily installed, the same script takes 3.5 minutes. If I use OpenBLAS, yet another optimized BLAS that is equally easy to install, the same script takes 2 minutes. That's a pretty big improvement!

In this post, I'll show you how to install ATLAS and OpenBLAS, demonstrate how you can switch between them, and let you pick which you would like to use based on benchmark results. Before we get started, one quick shout out to Felix Riedel: thanks for encouraging me to look at OpenBLAS instead of ATLAS in your comment on my previous post.

Update for Mac OS X users: Zachary Meyer's comment gives bare bones details for how to accomplish a similar BLAS switch. He has a few more details on his blog. Thanks Zachary!

Update for R multicore users: According to this comment and this comment, OpenBLAS does not play well with one of R's other multicore schemes. It appears to be a bug, so perhaps it will get fixed in the future. See the comment stream for further details.

Update for the adventurous: According to Joe Herman: "OpenBLAS isn't faster than ATLAS, but it is much easier to install OpenBLAS via apt-get than it is to compile ATLAS and R manually from source." See Joe's comment for details on the benefits of compiling ATLAS and R from scratch.
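
Regarding the multicore update above: one workaround I have seen suggested, sketched below rather than verified, is to pin OpenBLAS to a single thread before forking R workers. The RhpcBLASctl package used here is my own suggestion, not something from the comment stream.

## A hedged sketch: restrict OpenBLAS to one thread so that forked R
## workers (e.g. from mclapply) do not fight with the BLAS's own threads.
## Assumes the CRAN package RhpcBLASctl is installed.
require(RhpcBLASctl)
blas_get_num_procs()      ## how many threads the BLAS currently uses
blas_set_num_threads(1)   ## restrict the BLAS to a single thread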

Installing additional BLAS libraries on Ubuntu

For Ubuntu, there are currently three different BLAS options that can be easily chosen: "libblas" the reference BLAS, "libatlas" the ATLAS BLAS, and "libopenblas" the OpenBLAS. Their package names are

$ apt-cache search libblas
libblas-dev - Basic Linear Algebra Subroutines 3, static library
libblas-doc - Basic Linear Algebra Subroutines 3, documentation
libblas3gf - Basic Linear Algebra Reference implementations, shared library
libatlas-base-dev - Automatically Tuned Linear Algebra Software, generic static
libatlas3gf-base - Automatically Tuned Linear Algebra Software, generic shared
libblas-test - Basic Linear Algebra Subroutines 3, testing programs
libopenblas-base - Optimized BLAS (linear algebra) library based on GotoBLAS2
libopenblas-dev - Optimized BLAS (linear algebra) library based on GotoBLAS2

Since libblas already comes with Ubuntu, we only need to install the other two for our tests. (NOTE: In the following command, delete 'libatlas3gf-base' if you don't want to experiment with ATLAS.):

$ sudo apt-get install libopenblas-base libatlas3gf-base

Switching between BLAS libraries

Now we can switch between the different BLAS options that are installed:

$ sudo update-alternatives --config libblas.so.3gf
There are 3 choices for the alternative libblas.so.3gf (providing /usr/lib/libblas.so.3gf).

  Selection    Path                                         Priority   Status
------------------------------------------------------------
* 0            /usr/lib/openblas-base/libopenblas.so.0      40         auto mode
  1            /usr/lib/atlas-base/atlas/libblas.so.3gf     35         manual mode
  2            /usr/lib/libblas/libblas.so.3gf              10         manual mode
  3            /usr/lib/openblas-base/libopenblas.so.0      40         manual mode

Press enter to keep the current choice[*], or type selection number:
    Side note: If the above returned:

    update-alternatives: error: no alternatives for libblas.so.3gf

    Try

    $ sudo update-alternatives --config libblas.so.3

    instead. See the comments at the end of the post for further details.

From the selection menu, I picked 3, so it now shows that choice 3 (OpenBLAS) is selected:

$ sudo update-alternatives --config libblas.so.3gf
There are 3 choices for the alternative libblas.so.3gf (providing /usr/lib/libblas.so.3gf).

  Selection    Path                                         Priority   Status
------------------------------------------------------------
  0            /usr/lib/openblas-base/libopenblas.so.0      40         auto mode
  1            /usr/lib/atlas-base/atlas/libblas.so.3gf     35         manual mode
  2            /usr/lib/libblas/libblas.so.3gf              10         manual mode
* 3            /usr/lib/openblas-base/libopenblas.so.0      40         manual mode

And we can pull the same trick to choose between LAPACK implementations. From the output we can see that OpenBLAS does not provide a new LAPACK implementation, but ATLAS does:

$ sudo update-alternatives --config liblapack.so.3gf
There are 2 choices for the alternative liblapack.so.3gf (providing /usr/lib/liblapack.so.3gf).

  Selection    Path                                           Priority   Status
------------------------------------------------------------
* 0            /usr/lib/atlas-base/atlas/liblapack.so.3gf     35         auto mode
  1            /usr/lib/atlas-base/atlas/liblapack.so.3gf     35         manual mode
  2            /usr/lib/lapack/liblapack.so.3gf               10         manual mode

So we will do nothing in this case, since OpenBLAS is supposed to use the reference implementation (which is already selected).

    Side note: If the above returned:

    update-alternatives: error: no alternatives for liblapack.so.3gf

    Try

    $ sudo update-alternatives --config liblapack.so.3

    instead. See the comments at the end of the post for further details.

Checking that R is using the right BLAS

Now we can check that everything is working by starting R in a new terminal:

$ R

R version 3.0.1 (2013-05-16) -- "Good Sport"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
...snip...
Type 'q()' to quit R.

>

Great. Let's see if R is using the BLAS and LAPACK libraries we selected. To do so, we open another terminal so that we can run a few more shell commands. First, we find the PID of the R process we just started. Your output will look something like this:

$ ps aux | grep exec/R
1000 18065 0.4 1.0 200204 87568 pts/1 Sl+ 09:00 0:00 /usr/lib/R/bin/exec/R
root 19250 0.0 0.0 9396 916 pts/0 S+ 09:03 0:00 grep --color=auto exec/R

The PID is the second number on the '/usr/lib/R/bin/exec/R' line. To see
which BLAS and LAPACK libraries are loaded with that R session, we use the "list open files" command:

$ lsof -p 18065 | grep 'blas\|lapack'
R 18065 nathanvan mem REG 8,1 9342808 12857980 /usr/lib/lapack/liblapack.so.3gf.0
R 18065 nathanvan mem REG 8,1 19493200 13640678 /usr/lib/openblas-base/libopenblas.so.0

As expected, the R session is using the reference LAPACK (/usr/lib/lapack/liblapack.so.3gf.0) and OpenBLAS (/usr/lib/openblas-base/libopenblas.so.0)
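
As a small aside (my own sketch, not from the original post), you can skip the ps step by asking the R session for its own PID and passing that to lsof directly:

## Run inside the R session you want to inspect.
Sys.getpid()
## [1] 18065    <- the number to pass to 'lsof -p'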

Testing the different BLAS/LAPACK combinations

I used Simon Urbanek's most recent benchmark script. To follow along, first download it to your current working directory:

$ curl http://r.research.att.com/benchmarks/R-benchmark-25.R -O

And then run it:

$ cat R-benchmark-25.R | time R --slave
Loading required package: Matrix
Loading required package: lattice
Loading required package: SuppDists
Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called ‘SuppDists’
...snip...

Oops. I don't have the SuppDists package installed. I can easily install it via Michael Rutter's Ubuntu PPA:

$ sudo apt-get install r-cran-suppdists

Now Simon's script works wonderfully. Full output:

$ cat R-benchmark-25.R | time R --slave
Loading required package: Matrix
Loading required package: lattice
Loading required package: SuppDists
Warning messages:
1: In remove("a", "b") : object 'a' not found
2: In remove("a", "b") : object 'b' not found

R Benchmark 2.5
===============
Number of times each test is run__________________________: 3

I. Matrix calculation
---------------------
Creation, transp., deformation of a 2500x2500 matrix (sec): 1.36566666666667
2400x2400 normal distributed random matrix ^1000____ (sec): 0.959
Sorting of 7,000,000 random values__________________ (sec): 1.061
2800x2800 cross-product matrix (b = a' * a)_________ (sec): 1.777
Linear regr. over a 3000x3000 matrix (c = a \ b')___ (sec): 1.00866666666667
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.13484335940626

II. Matrix functions
--------------------
FFT over 2,400,000 random values____________________ (sec): 0.566999999999998
Eigenvalues of a 640x640 random matrix______________ (sec): 1.379
Determinant of a 2500x2500 random matrix____________ (sec): 1.69
Cholesky decomposition of a 3000x3000 matrix________ (sec): 1.51366666666667
Inverse of a 1600x1600 random matrix________________ (sec): 1.40766666666667
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.43229160585452

III. Programmation
------------------
3,500,000 Fibonacci numbers calculation (vector calc)(sec): 1.10533333333333
Creation of a 3000x3000 Hilbert matrix (matrix calc) (sec): 1.169
Grand common divisors of 400,000 pairs (recursion)__ (sec): 2.267
Creation of a 500x500 Toeplitz matrix (loops)_______ (sec): 1.213
Escoufier's method on a 45x45 matrix (mixed)________ (sec): 1.32600000000001
--------------------------------------------
Trimmed geom. mean (2 extremes eliminated): 1.23425893178325

Total time for all 15 tests_________________________ (sec): 19.809
Overall mean (sum of I, II and III trimmed means/3)_ (sec): 1.26122106386747
--- End of test ---

134.75user 16.06system 1:50.08elapsed 137%CPU (0avgtext+0avgdata 1949744maxresident)k
448inputs+0outputs (3major+1265968minor)pagefaults 0swaps

The elapsed time at the very bottom is the part we care about. With OpenBLAS and the reference LAPACK, the script took 1 minute and 50 seconds to run. By changing the selections with update-alternatives, we can test R with ATLAS (3:21) or with the reference BLAS (9:13). For my machine, OpenBLAS is the clear winner.
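
If you want a quick sanity check before committing to the full benchmark, a single large cross-product exercises the BLAS heavily. This is a minimal sketch of my own, not part of Simon's script, and the timings will vary by machine:

## Quick-and-dirty BLAS check: time one large matrix cross-product.
## Re-run it after each update-alternatives selection and compare
## the elapsed times.
set.seed(1)
a <- matrix( rnorm(2000 * 2000), nrow=2000 )
system.time( crossprod(a) )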

Give it a shot yourself. If you find something different, let me know.

My Stat Bytes talk, with slides and code

On Thursday of last week, I gave a short informal talk at Stat Bytes, the CMU Statistics department's twice-a-month computing seminar.

Quick tricks for faster R code: Profiling to Parallelism

Abstract:
I will present a grab bag of tricks to speed up your R code. Topics will include: installing an optimized BLAS, how to profile your R code to find which parts are slow, replacing slow code with inline C/C++, and running code in parallel on multiple cores. My running example will be fitting a 2PL IRT model with a hand coded MCMC sampler. The idea is to start with naive, pedagogically clear code and end up with fast, production quality code.

The slides are here. Code is here.
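
Since profiling is the first topic in the abstract, here is a minimal base-R sketch of it (not taken from the talk's code): Rprof() samples the call stack while your code runs, and summaryRprof() reports where the time went.

## Profile a deliberately slow toy function with base R's sampling profiler.
slow.example <- function(n) {
    x <- matrix( rnorm(n * n), nrow=n )
    s <- svd(x)   ## the expensive step
    sum(s$d)
}

Rprof("profile.out")                   ## start collecting samples
invisible( slow.example(1000) )
Rprof(NULL)                            ## stop collecting samples
summaryRprof("profile.out")$by.self    ## which calls used the most time?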

This was an informal talk. If you would like to dig into these topics more, some more references:

Update: 6/25/2013 For the Windows users out there, Felix Riedel has some notes about upgrading your BLAS. It is easier than I thought!

Update: 7/9/2013 Felix pointed out that OpenBLAS is faster than ATLAS. He is right. See my new blog post for details and proof.

HSP Talk: On Correcting a Significance Test for Model Misspecification

Heinz Second Paper* presentation by Nathan VanHoudnos
Monday, June 10, 2013
Noon - 1:30 PM
Room 237, Hamburg Hall
Title: On Correcting a Significance Test for Model Misspecification**

* The Heinz Second Paper (HSP) is a PhD qualifier for public policy students. Since I am in the joint Statistics and Public Policy program, mine is mix of math and policy.
** Contact me for a copy of the paper or slides.

Abstract:
Learning about whether interventions improve student learning is sometimes more complicated than it needs to be because of errors in the specification of statistical models for the analysis of educational intervention data. Recently, a series of papers in the education research literature (Hedges, 2007a, 2009; Hedges and Rhoads, 2011) have derived post-hoc corrections to misspecified test statistics so that the corrected versions can be used in a meta-analysis. However, these corrections are currently limited to special cases of simple models.

The purpose of this paper is to extend these corrections to models that include covariates and more general random effect structures. We develop a sufficient condition such that the distribution of the corrected test statistic asymptotically converges to the distribution of the standard statistical test that accounts for random effects, and we examine the finite sample performance of these approximations using simulation and real data from the Tennessee STAR experiment (Word et al., 1990). The What Works Clearinghouse, a division of the US Department of Education that rates the quality of educational interventions, has a policy that applies a simplified version of the Hedges (2007a) correction to any study which randomized by group but does not account for the group membership in the original analysis. We discuss the implications of this policy in practice.

Managing memory in a list of lists data structure

First, a confession: instead of using classes and defining methods for them, I build a lot of ad hoc data structures out of lists and then build up one-off methods that operate on those lists of lists. I think this is a perl-ism that has transferred into my R code. I might eventually learn how to do classes, but this hack has been working well enough.

One issue I ran into today is that it was getting tedious to find out which objects stored in the list of lists were taking up the most memory. I ended up writing this rather silly recursive function that may be of use to you if you also have been scarred by perl.

# A hacked together function for exploring these structures
get.size <- function( obj.to.size, units='Kb') {
  # Check if the object we were passed is a list
  # N.B. Since is(list()) returns c('list', 'vector') we need a
  #      multiple value comparison like all.equal
  # N.B. Since all.equal will either return TRUE or a vector of 
  #      differences wrapping it in is.logical is the same as 
  #      checking if it returned TRUE. 
  if ( is.logical( all.equal( is(obj.to.size) , is(list())))) {
    # Iterate over each element of the list
    lapply( obj.to.size ,
      function(xx){
        # Calculate the size of the current element of the list
        # N.B. object.size always returns bytes, but its print 
        #      allows different units. Using capture.output allows
        #      us to do the conversion with the print method
        the.size <- capture.output(print(object.size(xx), units=units))
        # This object may itself be a list...
        if( is.logical( all.equal( is(xx), is(list())))) {
           # if so, recurse if we aren't already at zero size 
           if( the.size != paste(0, units) ) {
             the.rest <- get.size( xx , units)
             return( list(the.size, the.rest) )
           }else {
             # Or just return the zero size
             return( the.size )             
           }
        } else {
           # the element isn't a list, just return its size
           return( the.size)
        }
      })
  } else {
    # If the object wasn't a list, return an error.
    stop("The object passed to this function was not a list.")
  }
}

The output looks something like this:

$models
$models[[1]]
[1] "2487.7 Kb"

$models[[2]]
$models[[2]]$naive.model
[1] "871 Kb"

$models[[2]]$clustered.model
[1] "664.5 Kb"

$models[[2]]$gls.model
[1] "951.9 Kb"



$V
[1] "4628.2 Kb"

$fixed.formula
[1] "1.2 Kb"

$random.formula
[1] "2.6 Kb"

where the first element of the list is the sum of everything below it in the hierarchy. Therefore, the whole "models" is 2487.7 Kb and "models$naive.model" is only 871 Kb of that total.
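
For a self-contained illustration (a made-up structure, not the one from my project), calling get.size() on a small list of lists looks like this:

## A made-up nested structure, just to show the call.
example.structure <- list(
    models = list( big.matrix   = matrix( rnorm(250000), nrow=500 ),
                   small.vector = rnorm(100) ),
    notes  = "a short string" )

get.size( example.structure, units='Kb' )
## Each list-valued element comes back as its total size followed by a
## breakdown of its children; non-list elements come back as a single size.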

Managing Latex packages manually in Ubuntu 12.04

Ubuntu is great, but making it play nice with latex is a bit of a pain. There are three parts to this:

  1. Ubuntu comes with APT, a really nice package management system that lets you easily install, update, and remove software that has been helpfully packaged by both Canonical and the wider Debian community.
  2. Tex Live, the official release of latex, comes with tlmgr, an equally great package manager for managing all of the latex packages on CTAN.
  3. Ubuntu's distribution of latex omits tlmgr and forces developers to repackage the latex packages to fit into the APT scheme. (source)

This seems to be why my previous post about fixing moderncv for Ubuntu was so popular. It is not obvious to most users that to fix the

LaTeX Error: File `marvosym.sty' not found.

error, the user has to both (1) find the Ubuntu package that provides marvosym.sty and then (2) install that Ubuntu package along with every other latex package that happens to be bundled with it.

All of that is fine if a kind-hearted developer had the foresight to bundle the latex package you want/need in a convenient form for installation with APT. If not, you have two options:

  1. Keep Tex Live under the control of Ubuntu's package management and manually install the Latex packages you need. An easy way to do this is described below.
  2. Break out Tex Live from Ubuntu's package manager and use tlmgr for Latex package management. This gives you MikTex style latex package management for Ubuntu, but you are responsible for keeping Tex Live up to date. See the answers to this Stack Exchange question for details of how to do it.

For now I'm sticking with Option 1. Here is a worked example to install the Latex package outlines for Ubuntu:

  1. Look at the path Latex searches to find packages with 'kpsepath tex' which should give output similar to:
    nathanvan@nathanvan-N61Jq:~$ kpsepath tex | sed -e 's/:/\n:/g'
    .
    :/home/nathanvan/.texmf-config/tex/kpsewhich//
    :/home/nathanvan/.texmf-var/tex/kpsewhich//
    :/home/nathanvan/texmf/tex/kpsewhich//
    :/etc/texmf/tex/kpsewhich//
    :!!/var/lib/texmf/tex/kpsewhich//
    :!!/usr/local/share/texmf/tex/kpsewhich//
    :!!/usr/share/texmf/tex/kpsewhich//
    :!!/usr/share/texmf-texlive/tex/kpsewhich//
    :/home/nathanvan/.texmf-config/tex/generic//
    :/home/nathanvan/.texmf-var/tex/generic//
    :/home/nathanvan/texmf/tex/generic//
    :/etc/texmf/tex/generic//
    :!!/var/lib/texmf/tex/generic//
    :!!/usr/local/share/texmf/tex/generic//
    :!!/usr/share/texmf/tex/generic//
    :!!/usr/share/texmf-texlive/tex/generic//
    :/home/nathanvan/.texmf-config/tex///
    :/home/nathanvan/.texmf-var/tex///
    :/home/nathanvan/texmf/tex///
    :/etc/texmf/tex///
    :!!/var/lib/texmf/tex///
    :!!/usr/local/share/texmf/tex///
    :!!/usr/share/texmf/tex///
    :!!/usr/share/texmf-texlive/tex///
    
  2. Note the entry '/home/nathanvan/texmf/tex//' near the end of the list, which tells latex to search every subdirectory under '/home/nathanvan/texmf/tex' for packages that haven't been found yet. You'll have something similar for your home directory.
  3. Make a 'texmf/tex/latex' directory under your home directory:
    nathanvan@nathanvan-N61Jq:~$ mkdir -p ~/texmf/tex/latex
    
  4. Find the package you want on CTAN, say outlines, because you read this blog post and want to try it out.
  5. Download 'the contents of this directory bundled as a zip file', as CTAN likes to say it, and save it to '~/texmf/tex/latex'.
  6. Unzip it right there:
    nathanvan@nathanvan-N61Jq:~$ cd texmf/tex/latex
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ ls
    outlines.zip
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ unzip outlines.zip
    Archive: outlines.zip
    creating: outlines/
    inflating: outlines/outlines.pdf
    inflating: outlines/outlines.sty
    inflating: outlines/outlines.tex
    inflating: outlines/README
    nathanvan@nathanvan-N61Jq:~/texmf/tex/latex$ ls
    outlines outlines.zip
    

And then you are done installing the latex package. It works great without any big hassles.

Edit: If the package you were installing contains fonts, this won't quite work. See Steve Kroon's comment below for details of how to fix it.

Edit: Thanks to jon for pointing out the correct directory structure for ~/texmf in the first comment to this answer. For the curious, more details, including why the directory is called texmf, can be found here.