Random Subset of a Large Text Datafile

It is common to want a random subset of a large text dataset, e.g., to work with while prototyping an R or C program. If the file is too large to load into R in the first place, another method of subsetting is required. The example motivating this page was a 188MB file with 280,000 rows and 48 columns separated by spaces.

No-programming method

This section shows a Linux solution; Windows does not come with the tools needed for this solution. It takes only a few tens of seconds to run on the example mentioned above.

The goal is to obtain a random subset of n rows from a file of N rows without loading the whole dataset into memory at once. The solution is to use R to select the random rows and various built-in Linux "filter" programs to do the subsetting.

In more detail, we create a file in R containing N lines: a random n of these lines are "Y" and the remaining N-n are "N". (We assume the file contains only numeric data; otherwise use some other unique codes instead of Y and N.) Then we "paste" this file to the left of the data file, use "grep" to take only the "Y" lines, and use "sed" to remove the Y code from the selected lines.

Assuming the big file is called mydata.dat, in Linux, use "wc mydata.dat" to get the number of lines, number of words, and number of characters. (The number of fields per line is the number of words divided by the number of lines.)

In R use code like this:

N=280000 # exact total number of data lines in big input file
n=1000   # exact number of subset lines desired
write(sample(c(rep("Y",n),rep("N",N-n))), "random.dat", ncol=1)
Then in Linux, type this to get the subset:
paste random.dat mydata.dat | grep Y | sed -e s/Y// > subset.dat
The paste command joins the two files line by line, placing each line of random.dat to the left of the corresponding line of mydata.dat (separated by a tab). The grep command keeps only the lines containing the letter "Y". The sed command substitutes nothing for the first occurrence of "Y" in each line, i.e., deletes it. (The tab inserted by paste remains at the start of each line; this is harmless for whitespace-separated data.) The use of pipes (|) to connect these steps lets the operating system pass the intermediate results directly from one program to the next, with no temporary files to manage or clean up. The output redirection symbol (>) sends the final result to a file whose name you can freely choose.

Method based on writing a small program (which you can use without any programming)

An even better solution is to write a small C program that reads one line at a time and writes it to the output a random fraction of the time. The program subset.c works as a filter, i.e., it is used in the form
 subset 0.01 <mydata.dat >subset.dat
The details of how it works are in the .c file. A Linux/PC version of the executable file is here: subset. You can compile this program under Windows or other platforms if needed. (Note that the maximum line length for your large input file is 1024 characters unless you change the .c file.)

This method produces an output file that contains approximately the specified fraction of the total lines. If you need an exact number of lines, generate a file with slightly more than the required number of lines (e.g., by slightly increasing the fraction specified and/or by running subset repeatedly and checking the length with wc), then use something like

head -n# subset.dat > exactsubset.dat
to trim the file to exactly the first # lines, where "#" is the required number.