The goal is to obtain a random subset of n rows from a file of N rows without loading the whole dataset into memory at once. The solution is to use R to select the random rows and various built-in Linux "filter" programs to do the subsetting.
In more detail, we create a file in R containing N lines: a random n of these lines are "Y" and the remaining N-n are "N". (We assume the file contains only numeric data; otherwise use some other unique codes instead of Y and N.) Then we "paste" this file to the left of the data file, use "grep" to take only the "Y" lines, and use "sed" to remove the Y code from the selected lines.
Assuming the big file is called mydata.dat, in Linux, use "wc mydata.dat" to get the number of lines, number of words and number of characters. (The number of fields per line is number of words divided by number of lines.)
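The wc step can be sketched as follows. The file contents here are a small illustrative stand-in for the real mydata.dat; the arithmetic for fields per line follows the rule above.

```shell
# Create a small demo file (stands in for the big mydata.dat) with 5 lines of 3 fields
printf '1 2 3\n4 5 6\n7 8 9\n10 11 12\n13 14 15\n' > mydata.dat
wc mydata.dat                    # prints lines, words, characters, then the file name
lines=$(wc -l < mydata.dat)      # number of lines (this is N)
words=$(wc -w < mydata.dat)      # number of words
echo $((words / lines))          # fields per line = words / lines
```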
In R use code like this:
N=280000 # exact total number of data lines in big input file
n=1000   # exact number of subset lines desired
write(sample(c(rep("Y",n),rep("N",N-n))), "random.dat", ncol=1)

Then in Linux, type this to get the subset:
paste random.dat mydata.dat | grep Y | sed -e s/Y// > subset.dat

The paste command joins the two files line by line, side by side. The grep command keeps only the lines containing the letter "Y". The sed command substitutes nothing for the first occurrence of "Y" in each line, removing the marker. The pipes (|) connecting these steps let the operating system stream the data from one command to the next, so no intermediate files are created or need cleaning up. The output redirection symbol (>) sends the final result to a file whose name you can freely choose.
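The whole pipeline can be tried on a tiny file; the data below are purely illustrative. One detail worth knowing: paste separates its inputs with a tab, and the sed command above removes only the "Y", so each selected line keeps a leading tab (harmless for most numeric readers, but strip it with cut or a wider sed pattern if it matters).

```shell
# Demo of the paste | grep | sed pipeline on a tiny numeric file
printf '10 20\n30 40\n50 60\n70 80\n' > mydata.dat   # N = 4 data lines
printf 'Y\nN\nY\nN\n' > random.dat                   # n = 2 lines marked Y
paste random.dat mydata.dat | grep Y | sed -e s/Y// > subset.dat
cat subset.dat    # the 1st and 3rd data lines, each with a leading tab left by paste
wc -l < subset.dat
```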
An alternative is a small C program, subset, which keeps a randomly chosen fraction of the input lines. For example, to keep about 1% of the lines:

subset 0.01 <mydata.dat >subset.dat

The details of how it works are in the .c file. A Linux/PC version of the executable file is here: subset. You can compile this program under Windows or other platforms if needed. (Note that the maximum line length for your large input file is 1024 characters unless you change the .c file.)
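The subset program's source is not shown here, but the same per-line filtering idea can be sketched with awk (this is an assumed equivalent, not the author's program): each line is kept independently with probability p.

```shell
# Hedged sketch: approximate-fraction filtering with awk instead of the C program.
# Each input line is kept independently with probability p.
seq 1 10000 > mydata.dat                      # demo stand-in for the big file
awk -v p=0.01 'BEGIN{srand()} rand() < p' mydata.dat > subset.dat
wc -l < subset.dat                            # roughly 100 lines; varies run to run
```

Like the C program, this yields only approximately the requested fraction, since each line is an independent coin flip.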
This method produces an output file that contains approximately the specified fraction of the total lines. If you need an exact number of lines, generate a file with slightly more lines than required (e.g., by slightly increasing the fraction specified and/or by running subset repeatedly and checking the length with wc), then use something like
head -n# subset.dat > exactsubset.dat

to trim the output to an exact line count equal to the number "#".
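The trimming step looks like this with a concrete count (1000 here is just an example value for "#"):

```shell
# Demo: oversample slightly, then trim to the exact count with head
seq 1 1100 > subset.dat                     # pretend subset.dat came out 1100 lines
head -n 1000 subset.dat > exactsubset.dat   # keep exactly the first 1000 lines
wc -l < exactsubset.dat                     # 1000
```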