Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions

General instructions for labs: Upload an R Markdown file, named with your Andrew ID, to Blackboard. Give the commands to answer each question in its own code block, which will also produce plots that are automatically embedded in the output file. Each answer must be supported by written statements as well as any code used. Include the name of your lab partner (if you have one) in the file. Do not include the text of the questions or any Markdown template you might use.

The file rich.html on the class website is a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from Web pages.

Part I

Use the readLines command to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?
Open the file in a text editor (not as a web page). Find the entries for Bill Gates and for Stanley Kroenke. Give the text of the lines from the file which record their net worths.
Write a regular expression which should capture a person’s net worth. Write code, using the grep function, to check that this has exactly 100 matches in richhtml, and that the expression is matching the actual net worths (and not just some bit of text associated with them).
Write code, using your regular expression from problem 3 and the functions regexpr and regmatches, to extract all the net worths from richhtml. Check the following:
1. There should be 100 net worths.
2. The largest net worth should be that of Bill Gates, and there should be only one person worth that much.
3. There should be exactly one person whose net worth matches what you observed for Stanley Kroenke.
4. There should be at least two values which appear more than once.
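The workflow in Part I can be sketched on a toy input. This is a minimal illustration, not a solution: the vector toy_html, its markup, and the pattern worth_pattern are all invented here, and the real rich.html will need a pattern fitted to its actual lines.

```r
# Toy stand-in for richhtml; the markup is invented for illustration
# and is NOT the actual structure of rich.html.
toy_html <- c(
  "<html><body>",
  "<td class=\"name\">Person A</td>",
  "<td class=\"worth\">$72 B</td>",
  "<td class=\"name\">Person B</td>",
  "<td class=\"worth\">$5,3 B</td>",
  "<p>Founded with $3 in savings</p>",
  "</body></html>"
)

length(toy_html)       # number of lines
sum(nchar(toy_html))   # total characters, not counting newlines

# Hypothetical pattern: dollar sign, digits, optional ",digit" part, space, B
worth_pattern <- "\\$[0-9]+(,[0-9])? B"

# grep returns the indices of the lines that contain a match
worth_lines <- grep(worth_pattern, toy_html)
toy_html[worth_lines]  # inspect the matching lines by eye

# regexpr finds the first match on each line (-1 where there is none);
# regmatches keeps only the matched text
toy_worths <- regmatches(toy_html, regexpr(worth_pattern, toy_html))
toy_worths
table(toy_worths)      # shows which values repeat
```

Note that the line mentioning "$3 in savings" is not matched, because the pattern requires the trailing " B". That is the kind of false match the checks above are meant to catch.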

Part II

The Forbes website writes net worths in the form “$7,7 B” to mean $7.7 \times {10}^{9}$ dollars. Write code to convert from the Forbes format to floating-point numbers, and run it to create a vector of net worths, called networths. Check the following:
1. networths is indeed a vector, of length 100 and type double.
2. All of the entries in networths are greater than 1 billion.
3. The largest entry in networths matches the net worth of Bill Gates.
4. There is exactly one entry in networths matching the net worth of Stanley Kroenke.
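One possible way to do the conversion is sketched below on made-up strings. The names toy_worths and toy_networths are placeholders, and the approach (strip the symbols, swap the decimal comma, scale by 1e9) is only one of several that would work.

```r
# Made-up strings in the Forbes format; not values taken from rich.html
toy_worths <- c("$72 B", "$5,3 B", "$5,3 B", "$2,1 B")

# Drop the leading "$" and the trailing " B"
stripped <- gsub("^\\$| B$", "", toy_worths)

# Turn the decimal comma into a decimal point, convert, and scale to dollars
toy_networths <- as.numeric(sub(",", ".", stripped, fixed = TRUE)) * 1e9

is.vector(toy_networths)
length(toy_networths)
typeof(toy_networths)
all(toy_networths > 1e9)
sum(toy_networths == max(toy_networths))  # how many entries share the maximum
```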
Skew. Answer the following using the networths vector from the previous problem:
1. What is the median net worth of these 100 people?
2. What is the mean net worth of these 100 people?
3. How many of these 100 individuals were worth at least 5 billion dollars? 10 billion? 25 billion?
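To see why the median and mean can differ sharply for skewed data, here is a small sketch on invented numbers. The vector toy_networths is hypothetical and much shorter than the real one.

```r
# Invented, heavily right-skewed net worths in dollars; not the Forbes data
toy_networths <- c(72, 41, 12, 6, 5, 4, 3, 3, 2, 2) * 1e9

toy_median <- median(toy_networths)  # middle of the sorted values
toy_mean <- mean(toy_networths)      # pulled upward by the few largest values
toy_median
toy_mean

# Number of entries at or above each threshold
thresholds <- c(5, 10, 25) * 1e9
toy_counts <- sapply(thresholds, function(th) sum(toy_networths >= th))
toy_counts
```

In this toy vector the mean (15 billion) is more than three times the median (4.5 billion), which is typical of a distribution with a long right tail.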
Concentrate. Again, answer using the networths vector.
1. What is the total net worth of the 100 richest people?
2. What fraction of that total is held by the five richest people?
3. What fraction of that total wealth is held by the richest 20 individuals?
4. What is the smallest number of people who together hold at least 80 percent of that total wealth?
5. There are about 118 million households in the US, with a total net worth of about 82 trillion dollars (http://www.federalreserve.gov/releases/z1/current/z1.pdf). What fraction of that total wealth is held by the 100 richest people? What is the ratio of the mean net worth of the richest 100 to the mean net worth of a US household?
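The concentration questions mostly come down to sorting and cumulative sums. The sketch below uses an invented ten-element vector; toy_networths, cum_share, and the other names are placeholders, and only the household figures (118 million, 82 trillion) come from the problem statement.

```r
# Invented net worths in dollars; not the Forbes data
toy_networths <- c(72, 41, 12, 6, 5, 4, 3, 3, 2, 2) * 1e9

total <- sum(toy_networths)

# Sort from richest to poorest, then take running shares of the total
sorted <- sort(toy_networths, decreasing = TRUE)
cum_share <- cumsum(sorted) / total

cum_share[2]                           # share held by the two richest
n_for_80 <- min(which(cum_share >= 0.8))  # fewest people holding at least 80%
n_for_80

# Figures quoted in the problem statement
us_total <- 82e12        # total household net worth, dollars
us_households <- 118e6   # number of households

total / us_total                                   # fraction of US wealth
mean(toy_networths) / (us_total / us_households)   # ratio of means
```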

Lab 3: Scrape the Rich!

36-350

12 September 2014
