Today’s agenda: Using regular expressions to extract data from text; text manipulations; getting used to very skewed distributions
General instructions for labs: Upload an R Markdown file, named with your andrew ID, to Blackboard. You will give the commands to answer each question in its own code block, which will also produce plots that will be automatically embedded in the output file. Each answer must be supported by written statements as well as any code used. Include the name of your lab partner (if you have one) in the file. Do not include the text of the questions or any Markdown template you might use.
rich.html on the class website is a listing of the 100 richest people in America, according to Forbes magazine. We will use the file to practice extracting information from Web pages.
Use the readLines command to load the file into a character vector called richhtml. How many lines does it contain? What is the total number of characters in the file?
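As a minimal sketch of the mechanics involved, the following uses a made-up three-line file rather than rich.html itself. The names toy_path and toyhtml and the file contents are hypothetical, chosen only for illustration.

```r
# Toy stand-in for rich.html: three made-up lines written to a temporary file.
toy_path <- tempfile(fileext = ".html")
writeLines(c("<html>", "<td>Example</td>", "</html>"), toy_path)

# readLines returns a character vector with one element per line of the file.
toyhtml <- readLines(toy_path)

# Number of lines is the length of the vector.
n_lines <- length(toyhtml)

# nchar is vectorized, so summing it gives the total character count.
# Line terminators are not included, because readLines strips them.
n_chars <- sum(nchar(toyhtml))
```

Note that the total from nchar will be smaller than the file size on disk, since the newline at the end of each line is not counted.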
Open the file in a text editor (not as a web-page). Find the entries for Bill Gates and for Stanley Kroenke. Give the text of the lines from the file which record their net worths.
Write a regular expression which should capture a person’s net worth. Write code, using the grep function, to check that this has exactly 100 matches in richhtml, and that the expression is matching the actual net worths (and not just some bit of text associated with them).
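The kind of check described here might look like the sketch below. The line format (a table cell holding a dollar sign, a number, and the letter B) is an assumption made for illustration, not the actual layout of rich.html, and the names toyhtml and worth_pattern are hypothetical. The pattern would need to be adapted after inspecting the real file.

```r
# Made-up lines in an ASSUMED format; the real rich.html may look different.
toyhtml <- c(
  '<td class="name">Person A</td>',
  '<td class="worth">$72 B</td>',
  '<td class="name">Person B</td>',
  '<td class="worth">$5.3 B</td>'
)

# Dollar sign, digits, optional decimal part, a space, then "B".
# The backslashes are doubled because this is an R string literal.
worth_pattern <- "\\$[0-9]+(\\.[0-9]+)? B"

# grep returns the indices of the elements (lines) that contain a match.
hit_lines <- grep(worth_pattern, toyhtml)
n_hits <- length(hit_lines)

# value = TRUE returns the matching lines themselves, for inspection by eye.
hit_text <- grep(worth_pattern, toyhtml, value = TRUE)
```

Keep in mind that grep counts matching lines, not matches: a line containing two net worths would still count once, so the count only equals the number of matches if each net worth sits on its own line.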
Write code, using regmatches, to extract all the net worths from richhtml and store them in a vector called networths. Check the following:
networths is indeed a vector, of length 100 and type
All the entries in networths are greater than 1 billion.
An entry in networths matches the net worth of Bill Gates.
There is at least one entry in networths matching the net worth of Stanley Kroenke.
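A minimal sketch of extraction followed by sanity checks, again on made-up lines in an assumed format. Pairing regmatches with regexpr is one standard way to pull out matched text, but the choice of regexpr, the conversion to numeric dollars, the toy values, and the names toyhtml, worth_pattern and toy_networths are all assumptions for illustration.

```r
# Made-up lines in an ASSUMED format; the real rich.html may look different.
toyhtml <- c(
  '<td class="name">Person A</td>',
  '<td class="worth">$72 B</td>',
  '<td class="name">Person B</td>',
  '<td class="worth">$5.3 B</td>'
)
worth_pattern <- "\\$[0-9]+(\\.[0-9]+)? B"

# regexpr finds the first match in each element (-1 where there is none);
# regmatches then returns only the matched substrings, dropping non-matches.
matched <- regmatches(toyhtml, regexpr(worth_pattern, toyhtml))

# Strip the dollar sign, space and "B", then convert billions to dollars.
toy_networths <- as.numeric(gsub("[$ B]", "", matched)) * 1e9

# Checks in the spirit of the list above (numeric type is an assumption here).
is_vec      <- is.vector(toy_networths) && is.numeric(toy_networths)
right_len   <- length(toy_networths) == 2          # 100 for the real file
all_big     <- all(toy_networths > 1e9)
has_largest <- any(abs(toy_networths - 72e9) < 1)  # tolerance, not ==
has_other   <- any(abs(toy_networths - 5.3e9) < 1)
```

The comparisons use a small tolerance rather than == because values such as 5.3 are not represented exactly in floating point.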
Using the networths vector from problem 4: