Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions
General instructions for labs: Upload an R Markdown file, named with your Andrew ID, to Blackboard. Give the commands that answer each question in their own code block, which will also produce plots that are automatically embedded in the output file. Each answer must be supported by written statements as well as any code used. Include the name of your lab partner (if you have one) in the file. Do not include the text of the questions or any Markdown template you might use.
The file rich.html on the class website is a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from Web pages.
Use the readLines command to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?
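As a rough sketch of the mechanics, here is the same idea on a small temporary file rather than rich.html itself. The file contents and the names tmp, toy, n_lines and n_chars are invented for illustration. readLines returns one string per line, so length counts lines and summing nchar counts characters.

```r
# Write a tiny stand-in file; its contents are invented for illustration only
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<p>hello</p>", "</html>"), tmp)

# One element of the character vector per line of the file
toy <- readLines(tmp)

n_lines <- length(toy)      # number of lines
n_chars <- sum(nchar(toy))  # total characters, not counting newline characters

n_lines
n_chars
```

Note that readLines drops the line terminators, so a count made this way excludes newline characters.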
Open the file in a text editor (not as a web page). Find the entries for Bill Gates and for Stanley Kroenke. Give the text of the lines from the file which record their net worths.
Write a regular expression which should capture a person’s net worth. Write code, using the grep function, to check that this has exactly 100 matches in richhtml, and that the expression is matching the actual net worths (and not just some bit of text associated with them).
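One possible way to run this kind of check is sketched below on toy data. The line format, the pattern and the names toy, pattern and hits are all hypothetical; the real file will need a different expression. grep returns the indices of the matching elements, and value = TRUE returns the matching elements themselves so they can be inspected by eye.

```r
# Toy lines in an invented format; the real file will look different
toy <- c("<td>Alice</td>", "<td>$3.5 M</td>", "<td>Bob</td>", "<td>$12 M</td>")

# Hypothetical pattern: a dollar sign, digits with an optional decimal part, then " M"
pattern <- "\\$[0-9]+(\\.[0-9]+)? M"

hits <- grep(pattern, toy)        # indices of the lines that match
length(hits)                      # how many lines match
grep(pattern, toy, value = TRUE)  # the matching lines themselves, for inspection
```

Keep in mind that grep counts matching lines, not matches, so a line containing two matches is still counted once.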
Write code, using regexpr and regmatches, to extract all the net worths from richhtml and convert them to numbers, stored in a vector called networths. Check the following:
networths is indeed a vector, of length 100 and type double.
All entries in networths are greater than 1 billion.
An entry in networths matches the net worth of Bill Gates.
There is an entry in networths matching the net worth of Stanley Kroenke.
Using the networths vector from problem 4:
… networths vector.
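The extraction and conversion steps asked for earlier might look roughly like the following on toy data. The line format, the pattern, the assumption that "M" means millions, and the names toy, matched and toy_values are all made up for illustration; this is a sketch of how regexpr and regmatches fit together, not a solution for rich.html.

```r
# Toy lines in an invented format; the real file will look different
toy <- c("<td>Alice</td>", "<td>$3.5 M</td>", "<td>Bob</td>", "<td>$12 M</td>")
pattern <- "\\$[0-9]+(\\.[0-9]+)? M"

# regexpr gives the first match position in each element (-1 where there is none);
# regmatches then pulls out the matched substrings, skipping non-matching elements
matched <- regmatches(toy, regexpr(pattern, toy))

# Strip everything except digits and the decimal point, then rescale
# ("M" is taken to mean millions in this toy format)
toy_values <- as.numeric(gsub("[^0-9.]", "", matched)) * 1e6

# Checks in the spirit of the ones listed above
is.vector(toy_values)
length(toy_values) == 2
typeof(toy_values) == "double"
all(toy_values > 1e6)
```

regexpr finds at most one match per element; if a line could hold several values, gregexpr would be the function to look at instead.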