# Why do we need regex patterns?

In last week’s lectures, we computed word tables by splitting text into words and counting the unique words in documents of interest. For example:

> clinton.wordtab[1:5]
—   …  “a “do “go
37  26   1   1   1


Not all of these are actual words: several are punctuation marks, or words with punctuation attached. To split text more cleanly, we need regular expressions. These will also help us search text more effectively.
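A minimal sketch of how such entries arise, assuming the text is split on single spaces only. The sentence below is made up for illustration; it is not the document used in lecture.

```r
# Hypothetical sentence, standing in for a real document
text = "\"Go ahead,\" she said ... \"do it.\""

# Naive first attempt: split on single spaces only
words = strsplit(text, split=" ")[[1]]

# Word table; several "words" still carry quotes, commas, or periods
wordtab = table(words)
names(wordtab)
```

Tokens such as `"Go` and `it."` keep their punctuation, and `...` is counted as a word of its own.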

# What are regex patterns?

• A regular expression or regex is a specially structured string that allows us to match certain patterns occurring in text
• (Note: regexes follow a well-defined set of rules, independent of the R language)
• Any string of ordinary characters (letters, digits, spaces) defines a valid regex. To get us started, we’ll consider literals, which are just strings that we want to match, literally. E.g.,
• “fly” matches “superfly”, “why walk when you can fly”
• “fly” does not match “time flies like an arrow”, “fruit flies like bananas”
• OR of two regexes is a regex. E.g.,
• “fly|flies” tries to match “fly” or “flies”
• Concatenation of regexes is a regex. E.g.,
• “(time|fruit) (fly|flies)” tries to match “time” or “fruit”, then a space, then “fly” or “flies”
• Parentheses define groups; more on this later
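The rules above can be checked directly in R. This sketch uses grepl(), a base R function that returns TRUE or FALSE for each string according to whether the regex matches; the string “fruit bats” is made up for illustration.

```r
# Literal: "fly" matches anywhere inside a string
lit.yes = grepl("fly", c("superfly", "why walk when you can fly"))
lit.no  = grepl("fly", c("time flies like an arrow", "fruit flies like bananas"))

# OR: matches "fly" or "flies"
or.match = grepl("fly|flies", "time flies like an arrow")

# Concatenation with groups: "time" or "fruit", a space, then "fly" or "flies"
cat.match = grepl("(time|fruit) (fly|flies)",
                  c("fruit flies like bananas", "fruit bats"))

lit.yes    # TRUE TRUE
lit.no     # FALSE FALSE
or.match   # TRUE
cat.match  # TRUE FALSE
```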

# Scanning for matches to a regex

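To scan a vector of strings for matches to a regex, use grep(). By default it returns the indices of the matching strings; with value=TRUE it returns the matching strings themselves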

str.vec = c("time flies when you're having fun in 350",
"time does not fly in 350, because it's not fun",
"Flyers suck, Penguins rule")
grep("fly", str.vec) 
## [1] 2
grep("fly", str.vec, value=TRUE)
## [1] "time does not fly in 350, because it's not fun"
grep("fly|flies", str.vec, value=TRUE)
## [1] "time flies when you're having fun in 350"
## [2] "time does not fly in 350, because it's not fun"
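Note that “Flyers” was never matched above, because matching is case sensitive by default. As a small sketch, the ignore.case argument of grep() relaxes this; the vector is repeated here so that the block runs on its own.

```r
str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")

# Case-sensitive (default): only the second string contains "fly"
idx.default = grep("fly", str.vec)

# Case-insensitive: "Fly" in "Flyers" now matches too
idx.nocase = grep("fly", str.vec, ignore.case=TRUE)

idx.default  # 2
idx.nocase   # 2 3
```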

# More examples

str.vec.2 = c("time flies when you're having fun in 350",
"fruit flies when you throw it",
"a fruit fly is a beautiful creature",
"how do you spell fruitfly?")
grep("(time|fruit)(fly|flies)", str.vec.2, value=TRUE)
## [1] "how do you spell fruitfly?"
grep("(time|fruit) (fly|flies)", str.vec.2, value=TRUE)
## [1] "time flies when you're having fun in 350"
## [2] "fruit flies when you throw it"
## [3] "a fruit fly is a beautiful creature"
grep("(time|fruit)  (fly|flies)", str.vec.2, value=TRUE)
## character(0)

# Metacharacters

• Metacharacters are characters that have a special meaning in a regex, and are not interpreted literally
• Important example: square brackets, used to indicate that we want to match any one of the characters inside the square brackets, for one character position. E.g.,
• “[abcde]” matches the “a” in “Ryan”
• “[123]” matches the “3” in “StatComp 350”
• “[aeiou]” tries to match any lower case vowel
• A dash inside square brackets is used to denote a range. E.g.,
• “[a-e]” is the same as “[abcde]”
• “[0-9]” is the same as “[0123456789]”
• Rules for combining regexes apply as before. E.g.,
• “(Baker|Porter) 229[A-J]” tries to match “Baker” or “Porter”, then a space, then “229”, then any upper case letter between A and J
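A short sketch of these rules in R, using grepl(), which returns TRUE or FALSE for each string. The room labels and the name “Solo” are made up for illustration.

```r
# A range and the equivalent explicit set give the same answers
names.vec = c("Ryan", "Solo")
range.match = grepl("[a-e]", names.vec)
set.match   = grepl("[abcde]", names.vec)

# Hypothetical room labels, made up for illustration
rooms = c("Baker 229B", "Porter 229J", "Baker 229K", "Porter 129A", "baker 229A")
room.match = grepl("(Baker|Porter) 229[A-J]", rooms)

range.match  # TRUE FALSE
room.match   # TRUE TRUE FALSE FALSE FALSE
```

“Baker 229K” fails because K lies outside the range A to J, and “baker 229A” fails because the literal “Baker” is case sensitive.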

# More metacharacters

• “[:alnum:]” tries to match any alphanumeric character (same as “[a-zA-Z0-9]”)
• “[:punct:]” tries to match any punctuation mark
• “[:space:]” tries to match any white space character (including tabs and line breaks)
• A caret at the start of square brackets negates what follows. E.g.,
• “[^0-9]” tries to match any character that is not a digit from 0 to 9
• “[^aeiou]” tries to match anything but a lower case vowel
• A period “.” tries to match any character (don’t even need square brackets)
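As a quick sketch of negation and the period, again with grepl(). The strings are made up for illustration; note that the second entry of codes ends in the letter O, not a zero.

```r
# Hypothetical codes; "35O" ends in the letter O
codes = c("350", "35O", "3 50")

# TRUE whenever the string contains at least one non-digit character
has.nondigit = grepl("[^0-9]", codes)

# The period stands for exactly one character, whatever it is
dot.match = grepl("f.y", c("fly", "fry", "fy"))

has.nondigit  # FALSE TRUE TRUE
dot.match     # TRUE TRUE FALSE
```

A negated set still has to match one character somewhere: “[^0-9]” asks whether some character is not a digit, not whether the string is free of digits.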

# More examples

str.vec.3 = c("R2D2","r2d2","RJD2","RT85")
grep("[A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "RJD2" "RT85"
grep("[A-Z][0-9][A-Z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2"
grep("[A-Za-z][0-9][A-Za-z][0-9]", str.vec.3, value=TRUE)
## [1] "R2D2" "r2d2"
grep("[A-Z][^0-9][^0-9][0-9]", str.vec.3, value=TRUE)
## [1] "RJD2"
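It may not be obvious why “RJD2” and “RT85” match “[A-Z][0-9]”: the regex only has to match somewhere inside the string. One way to see which part matched, sketched here with the base R functions regexpr() and regmatches():

```r
str.vec.3 = c("R2D2","r2d2","RJD2","RT85")

# regexpr() locates the first match in each string (-1 if there is none)
m = regexpr("[A-Z][0-9]", str.vec.3)

# regmatches() extracts the matched pieces, skipping strings with no match
pieces = regmatches(str.vec.3, m)
pieces  # "R2" "D2" "T8"
```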

# More examples

In R, abbreviated character classes like “[:punct:]” must themselves be placed inside square brackets, giving double brackets: “[[:punct:]]”. With single brackets, “[:punct:]” is read as an ordinary character set that matches any one of “:”, “p”, “u”, “n”, “c”, “t”

str.vec.4 = c("im simple i dont like punctuation",
"I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")
grep("[:punct:]", str.vec.4, value=TRUE)
## [1] "im simple i dont like punctuation"
## [2] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
grep("[[:punct:]]", str.vec.4, value=TRUE)
## [1] "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n."
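To see why the single-bracket version matched both strings, here is a sketch that spells out the same set of characters explicitly as “[punct:]”. The second string is taken from the output above. The final line uses gsub(), a base R function not covered so far, to strip the punctuation marks.

```r
str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")

# An ordinary set: matches any string containing p, u, n, c, t, or a colon
set.match = grepl("[punct:]", str.vec.4)

# The punctuation class proper, inside its own square brackets
punct.match = grepl("[[:punct:]]", str.vec.4)

# Replace every punctuation mark with the empty string
stripped = gsub("[[:punct:]]", "", str.vec.4[2])

set.match    # TRUE TRUE
punct.match  # FALSE TRUE
stripped     # "Im all about punctuation"
```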

# Escape sequences

• In regexes, the characters “.”, “$”, “^”, “{”, “[”, “(”, “|”, “)”, “]”, “}”, “*”, “+”, “?”, “\” are all metacharacters and have special meaning
• An escape sequence is a way of turning them into literals: simply place a backslash in front. E.g.,
• “\[” tries to match a left square bracket
• “\?” tries to match a question mark
• “\\” tries to match a backslash
• “\\\\” tries to match two backslashes

# More examples

In R, we always have to double the number of backslashes in a regex, because the backslash is itself an escape character inside an R string: the R string "\\" contains a single backslash

str.vec.5 = c("Stat + Computing = Magic",
"Stat - Computing = Boring Theorems",
"Do you have the time?")
grep("Stat \\+|Stat -", str.vec.5, value=TRUE)
## [1] "Stat + Computing = Magic"
## [2] "Stat - Computing = Boring Theorems"
grep("time\\?", str.vec.5, value=TRUE)
## [1] "Do you have the time?"
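A small sketch of what the doubling does: nchar() and writeLines() show the string that the regex engine actually receives. The test strings are made up for illustration.

```r
pattern = "time\\?"
n.chars = nchar(pattern)   # 6 characters: t, i, m, e, \, ?
writeLines(pattern)        # prints time\?

# Unescaped, the period matches any character; escaped, only a literal period
dot.any     = grepl(".", "no period here")
dot.literal = grepl("\\.", c("no period here", "p.u.n.c.t"))

# Four backslashes in R code = two in the regex = one literal backslash matched
slash.match = grepl("\\\\", c("a\\b", "a/b"))

n.chars      # 6
dot.any      # TRUE
dot.literal  # FALSE TRUE
slash.match  # TRUE FALSE
```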