Lecture 5, Regular Expressions

36-350
10 September 2014

In Our Last Thrilling Episode

  • Characters and strings
  • Matching strings, splitting on strings, counting strings
  • We need a ways to compute with patterns of strings

Agenda

  • Patterns of strings: regular expressions
  • Grammar of regular expressions
  • Splitting, searching, replacing
  • Capture groups

Why We Need String Patterns

Split entries in a data file separate by commas:

strsplit(some_text, split=",")

Split entries in a data file separated by one space:

strsplit(some_text, split= " ")

Split entries in a data file separated by a comma, then a space:

strsplit(text, split=", ")

Split entries in a data file separated by a comma, then optionally some spaces:

???????

Regular Expressions

  • We need a language for telling R about patterns of strings
  • The most basic such language is that of regular expressions
  • Regular expressions match sets of strings
  • Start with string constants, and build up by allowing “this and then that”, “either this or that”, “repeat this
  • These rules get expressed in a grammar, with special symbols

Grammar of Regular Expressions

  • Every string is a valid regexp
    fly matches end of fruitfly, why walk when you can fly
    does not match time flies like an arrow; fruit flies like a banana; a banana flies poorly
  • OR of two regexps is a regexp, write with |
    fly|flies
  • Concatenation of two regexps is a regexp
    time|fruit fly|flies
  • Parentheses create groups: (time|fruit) (fly|flies)

Escaping, Ranges

  • Escape special characters with a leading \ to match them
  • Use braces [] to indicate character ranges
    [a-z], [0-9], many pre-named ones like [:punct:] for punctuation marks
  • Negate a character range with a leading ^
    [^aeiou] = anything except a lower-case vowel
  • The period . stands for any character, no brackets needed

Quantifiers in Regexps

How often?

  • + after a regexp means “1 or more times”
  • * means “0 or more times”
  • ? means “0 or 1 times” (optional, once)
  • {n} means “exactly n times”
  • {n,} means “n or more times”
  • {n,m} means “between n and m times (inclusive)”
    some redundancy, e.g., can fake + with *

Quantifier Scope

  • By default, quantifiers are “greedy”, match as many repetitions as they can
  • Following a quantifier by ? makes it match as few as possible
    \[.+\] matches all of [i][j], but \[.+?\] just matches [i]
  • By default, quantifiers apply to last character; use parentheses
    H(TT)+ vs. (HH|TT)+

Anchoring

  • $ means a pattern can only match at the beginning of a line or string
  • ^ means (outside of braces) the end of a line or string
  • < and > anchor to beginning or ending of words
  • \b anchors boundary (beginning or ending) of words, \B anywhere else
  • e.g. [a-z,]$ matches lines ending in a lower-case letter or comma
  • e.g., \B[A-Z] matches capital letters not at the beginning or ending of a word

Back-References

  • Use \1, \2, etc., to refer to whatever matched the 1st, 2nd, etc. parenthesized sub-expression
  • The matching strings are captures, capture-groups or captured strings
  • [HT]+ matches any sequence of heads and tails
  • ([HT]+)\1 matches any sequence of heads and tails that exactly repeats

Self-Referentially

  • Regular expressions are strings
  • \( \therefore \) a regexp can be stored in a character variable
  • regexps can be built up and changed using string-manipulating functions

Splitting on a Regexp

  • strsplit will take a regexp as its split argument

  • Splits a string into new strings at each instance of the regexp, just like it would if split were a string

Last time:

al2 <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/04/al2.txt")
al2 <- paste(al2, collapse=" ")
al2.words1 <- strsplit(al2, split=" ")

Weird results (e.g., punctuation marks as parts of wordss)

head(sort(table(al2.words1)))
al2.words1
      -    "the    "Woe absorbs  accept achieve 
      1       1       1       1       1       1 

Better:

al2.words2 <- strsplit(al2, split="(\\s|[[:punct:]])+")[[1]]
head(sort(table(al2.words2)))
al2.words2
absorbs  accept achieve against  agents     aid 
      1       1       1       1       1       1 

Closer examination shows there's still a problem:
“men's” \( \rightarrow \) “men”, “s”

Handle possessives: look for any number of white spaces, or at least one puncutation mark followed by at least one space

al2.words3 <- strsplit(al2, split="\\s+|([[:punct:]]+[[:space:]]+)")[[1]]

grep() and grepl()

grep() scans a character vector for matches to a regexp
returns either indices of matches, or matching strings

grep(x, pattern, value)

Example: scanning data files

ANSS.csv.html catalogs earthquakes of magnitude 6+, 1/1/2002–1/1/2012

<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>
<li>catalog=ANSS
<li>start_time=2002/01/01,00:00:00
<li>end_time=2012/01/01,00:00:00
<li>minimum_magnitude=6.0
<li>maximum_magnitude=10
<li>event_type=E
</ul>
<PRE>
DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID
2002/01/01 10:39:06.82,-55.2140,-129.0000,10.00,6.00,Mw,78,,,1.07,NEI,2002010140

Now: extract just the data, not the search parameters and so forth

Notice: every line of data begins with a date, YYYY/MM/DD

anss <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/05/ANSS.csv.html", warn=FALSE)
head(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}"))
[1] 11 12 13 14 15 16

Getting the value of the matches

head(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}",value=TRUE))
[1] "2002/01/01 10:39:06.82,-55.2140,-129.0000,10.00,6.00,Mw,78,,,1.07,NEI,2002010140" 
[2] "2002/01/01 11:29:22.73,6.3030,125.6500,138.10,6.30,Mw,236,,,0.90,NEI,2002010140"  
[3] "2002/01/02 14:50:33.49,-17.9830,178.7440,665.80,6.20,Mw,215,,,1.08,NEI,2002010240"
[4] "2002/01/02 17:22:48.76,-17.6000,167.8560,21.00,7.20,Mw,427,,,0.90,NEI,2002010240" 
[5] "2002/01/03 07:05:27.67,36.0880,70.6870,129.30,6.20,Mw,431,,,0.87,NEI,2002010340"  
[6] "2002/01/03 10:17:36.30,-17.6640,168.0040,10.00,6.60,Mw,386,,,1.14,NEI,2002010340" 

Storing a regexp in a variable

initial_date <- "^[0-9]{4}/[0-9]{2}/[0-9]{2}"
all.equal(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}"),
   grep(x=anss,pattern=initial_date))
[1] TRUE

Finding _non_-matches

The invert option:

grep(x=anss,pattern=initial_date,invert=TRUE,value=TRUE)
 [1] "<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>"   
 [2] "<li>catalog=ANSS"                                                                              
 [3] "<li>start_time=2002/01/01,00:00:00"                                                            
 [4] "<li>end_time=2012/01/01,00:00:00"                                                              
 [5] "<li>minimum_magnitude=6.0"                                                                     
 [6] "<li>maximum_magnitude=10"                                                                      
 [7] "<li>event_type=E"                                                                              
 [8] "</ul>"                                                                                         
 [9] "<PRE>"                                                                                         
[10] "DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID"
[11] "</PRE>"                                                                                        
[12] "</BODY></HTML>"                                                                                

grepl()

When you just want a Boolean vector saying where the matches are:

grepl(x=anss,pattern=initial_date)[1:20]
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[12]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

More information about the matches

  • regexpr() returns location of first match in the target string, plus attributes like length of matching substring
  • gregexpr() returns a list of this for all matches
  • A location of -1 means no match
  • Neither returns the text of the match

Getting the matching text

  • regmatches() takes the output of regexpr() or gregexpr() and a string, and returns the matching strings
  • Why separate regexpr() from regmatches?
    • Lets us do things like count the number or length of matches with less work
    • Lets us see what text in one file corresponds to matching locations in another file

Example: Extracting earthquake locations

Get the (latitude, longitude) pair for each earthquake:

one_geo_coord <- paste("-?[0-9]+\\.[0-9]{4}")
pair_geo_coords <- paste(rep(one_geo_coord,2),collapse=",")
have_coords <- grepl(x=anss,pattern=pair_geo_coords)
coord.matches <- gregexpr(pattern=pair_geo_coords,text=anss[have_coords])
coords <- regmatches(x=anss[have_coords],m=coord.matches)
coord.matches[1]
[[1]]
[1] 24
attr(,"match.length")
[1] 18
attr(,"useBytes")
[1] TRUE

useBytes: The default is to assume the ASCII encoding of characters for English, 1 character per byte. Other alphabets need longer encodings and forcing useBytes=FALSE

head(coords)
[[1]]
[1] "-55.2140,-129.0000"

[[2]]
[1] "6.3030,125.6500"

[[3]]
[1] "-17.9830,178.7440"

[[4]]
[1] "-17.6000,167.8560"

[[5]]
[1] "36.0880,70.6870"

[[6]]
[1] "-17.6640,168.0040"

Earthquake coordinates (cont'd)

You thought we'd forgotten data frames, didn't you?

coords <- do.call(c,coords)  # De-list-ify to vector
coord.pairs <- strsplit(coords,",")  # Break apart latitude and longitude
coord.df <- do.call(rbind, coord.pairs) # De-list-ify to array
coord.df <- apply(coord.df,2,as.numeric) # Character to numeric
coord.df <- as.data.frame(coord.df)
colnames(coord.df) <- c("Latitude","Longitude")
head(coord.df)
  Latitude Longitude
1  -55.214   -129.00
2    6.303    125.65
3  -17.983    178.74
4  -17.600    167.86
5   36.088     70.69
6  -17.664    168.00
library(maps)
map("world")
points(x=coord.df$Longitude, y=coord.df$Latitude, pch=19, col="red")

plot of chunk unnamed-chunk-16

Replacements

Assigning to regmatches() changes the matched string, just like substr()

sub() and gsub() work like regexpr() and gregexpr(), but with an extra replace argument

sub() produces a new string, assigning to regmatches() modifies the original one
Really, assigning to regmatches() creates a new string, destroys the old one, and assigns the new string the old name

Summary

  • Regexps are text patterns built up from strings by alternation and repetition
  • Mastering the syntax of regexps lets us scan text for complicated patterns
  • Many string-based functions work with regexps as well
  • Special functions exist to scan vectors for matches, to extract regexp matches, and to do substitutions