
Automated Culling of Data from the Internet

The Oxford English Dictionary includes this definition of "cull": "To gather, pick, pluck (flowers, fruit, etc.)." Here I describe how to pick the fruits found in databases on the Internet to collect data for statistical analysis.

The basic idea is to automate any manual process you might use to obtain information from the Internet. The specific case of interest is one where you enter information into a form on a web page and receive the results on a new web page. This tutorial assumes that you are using a Unix (or Linux) operating system. It may be possible to perform the same steps on a Windows/DOS system, but I probably can't help you with the details.
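
To see what you will be automating, consider how a form submission looks to the server. Suppose (a made-up example; the host and field name here are hypothetical) a page asks for a ZIP code and, after you press Submit, your browser lands on a URL like

    http://www.example.com/lookup?zip=80303

The form is passing its input in the URL itself (an HTTP GET request), so the same result page can be fetched from the command line with no browser at all:

    # Hypothetical URL; substitute whatever your own browser shows.
    wget -q -O result.html 'http://www.example.com/lookup?zip=80303'

If no inputs appear in the URL, the form is submitting by POST instead, and you will need to read the page's HTML source to find the form's action address and field names.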

Special thanks to Paula Pfleiger for introducing me to this whole concept.

The basic steps are:

  1. Decode how and where the information that you manually request is passed (as in the GET example above).
  2. Create list(s) of the various inputs you would otherwise have to type manually to get all the data you want.
  3. Write scripts to automate the data-collection procedure (a sketch follows this list).
  4. Extract the data from the web pages retrieved.
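
As a concrete illustration of Steps 2 through 4, here is a minimal Bourne shell sketch, assuming the GET-style query shown earlier and assuming the inputs (one ZIP code per line) have been saved in a file called zips.txt. Every host, field name, and file name in it is hypothetical and must be adapted to the actual database you are querying:

    #!/bin/sh
    # Step 2: the inputs live in zips.txt, one per line.
    for zip in `cat zips.txt`
    do
        # Step 3: fetch the result page for this input
        # (hypothetical host and field name).
        wget -q -O result_$zip.html "http://www.example.com/lookup?zip=$zip"
        # Step 4: keep only the lines containing the data of interest,
        # stripping HTML tags along the way.
        grep 'Latitude' result_$zip.html | sed 's/<[^>]*>//g' >> latlong.txt
        # Pause between queries so as not to hammer the server.
        sleep 2
    done

The grep pattern ('Latitude' here) is found by retrieving one page by hand and inspecting its source for a string that reliably marks the lines you want.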

Some examples include collecting stock prices from a financial database, collecting water-source information from the EPA, compiling a price list from a list of ISBNs, and finding latitude and longitude from a list of ZIP codes. The only limitation is your imagination and ingenuity (plus, occasionally, limits set by database owners on the number of queries allowed).

