---
title: "Regex Patterns"
author: "Statistical Computing, 36-350"
date: "Monday September 12, 2016"
---

Why do we need regex patterns?
===

In last week's lectures, we computed word tables by splitting up the text of documents of interest and counting the unique words. Snippet:

```
> clinton.wordtab[1:5]
  —   …  “a “do “go 
 37  26   1   1   1 
```

These are not all actual words (they include punctuation marks). We need to learn how to **better split text**, and for this we need regular expressions. This will also help us **better search text**.

What are regex patterns?
===

- A **regular expression** or **regex** is a specially structured string that allows us to match certain patterns occurring in text
- (Note: regexes follow a well-defined set of rules, independent of the R language)
- A plain string with no special characters is already a valid regex. To get us started, we'll consider such **literals**, which are just strings that we want to match, literally. E.g.,
    - "fly" matches "superfly", "why walk when you can fly"
    - "fly" does not match "time flies like an arrow", "fruit flies like bananas"
- OR of two regexes is a regex. E.g.,
    - "fly|flies" tries to match "fly" or "flies"
- Concatenation of regexes is a regex.
  E.g.,
    - "(time|fruit) (fly|flies)" tries to match "time" or "fruit", then a space, then "fly" or "flies"
    - Parentheses define groups; more on this later

Scanning for matches to a regex
===

Scan a vector of strings for matches to a regex, using `grep()`

```{r}
str.vec = c("time flies when you're having fun in 350",
            "time does not fly in 350, because it's not fun",
            "Flyers suck, Penguins rule")
# Matching is case sensitive, so "fly" does not match "Flyers"
grep("fly", str.vec)                    # indices of the matching strings
grep("fly", str.vec, value=TRUE)        # the matching strings themselves
grep("fly|flies", str.vec, value=TRUE)  # match "fly" or "flies"
```

More examples
===

```{r}
str.vec.2 = c("time flies when you're having fun in 350",
              "fruit flies when you throw it",
              "a fruit fly is a beautiful creature",
              "how do you spell fruitfly?")
grep("(time|fruit)(fly|flies)", str.vec.2, value=TRUE)   # no space between the groups
grep("(time|fruit) (fly|flies)", str.vec.2, value=TRUE)  # one space between the groups
```

Metacharacters
===

- **Metacharacters** are characters that have a special meaning in a regex, and are not interpreted literally
- Important example: square brackets, used to indicate that we want to match any one of the characters inside the brackets, for one character position. E.g.,
    - "[abcde]" matches the "a" in "Ryan"
    - "[123]" matches the "3" in "StatComp 350"
    - "[aeiou]" tries to match any lower case vowel
- A dash inside square brackets is used to denote a range. E.g.,
    - "[a-e]" is the same as "[abcde]"
    - "[0-9]" is the same as "[0123456789]"
- Rules for combining regexes apply as before. E.g.,
    - "(Baker|Porter) 229[A-J]" tries to match "Baker" or "Porter", then a space, then "229", then any upper case letter between A and J

More metacharacters
===

- "[:alnum:]" tries to match any alphanumeric character (same as "[a-zA-Z0-9]")
- "[:punct:]" tries to match any punctuation mark
- "[:space:]" tries to match any white space character (including tabs and line breaks)
- A caret inside square brackets negates what follows.
  E.g.,
    - "[^0-9]" tries to match anything but a digit between 0 and 9
    - "[^aeiou]" tries to match anything but a lower case vowel
- A period "." tries to match any character (don't even need square brackets)

More examples
===

```{r}
str.vec.3 = c("R2D2","r2d2","RJD2","RT85")
# An upper case letter followed by a digit, anywhere in the string
grep("[A-Z][0-9]", str.vec.3, value=TRUE)
# Upper case letter, digit, upper case letter, digit
grep("[A-Z][0-9][A-Z][0-9]", str.vec.3, value=TRUE)
# Same pattern, but letters of either case are allowed
grep("[A-Za-z][0-9][A-Za-z][0-9]", str.vec.3, value=TRUE)
# Upper case letter, two non-digits, then a digit
grep("[A-Z][^0-9][^0-9][0-9]", str.vec.3, value=TRUE)
```

More examples
===

In R, we need to use double brackets for the special abbreviated metacharacter classes, as in "[[:punct:]]". With single brackets, "[:punct:]" has its own interpretation: it is read as an ordinary set of characters, matching any one of ":", "p", "u", "n", "c", "t".

```{r}
str.vec.4 = c("im simple i dont like punctuation",
              "I'm, all; about! p.u.n.c.t.u.a.t.i.o.n.")
grep("[:punct:]", str.vec.4, value=TRUE)    # single brackets: an ordinary character set
grep("[[:punct:]]", str.vec.4, value=TRUE)  # double brackets: the punctuation class
```

Escape sequences
===

- In regexes, the characters ".", "$", "^", "{", "[", "(", "|", ")", "]", "}", "*", "+", "?", "\\" are all metacharacters and have special meaning
- An **escape sequence** is a way of turning them into literals: simply place a backslash in front. E.g., written as R strings,
    - "\\[" tries to match a left square bracket
    - "\\?" tries to match a question mark
    - "\\\\" tries to match a backslash
    - "\\\\\\\\" tries to match two backslashes

More examples
===

In R, we always have to use double the number of backslashes (because the backslash itself is a special character in an R string)

```{r}
str.vec.5 = c("Stat + Computing = Magic",
              "Stat - Computing = Boring Theorems",
              "Do you have the time?")
grep("Stat \\+|Stat -", str.vec.5, value=TRUE)  # "+" must be escaped, "-" need not be
grep("time\\?", str.vec.5, value=TRUE)          # "?" must be escaped
```
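
To see what escaping changes, here is a minimal sketch that compares an escaped and an unescaped period, and matches a literal backslash. The vector `str.vec.6` and its strings are made up for illustration and are not from the lecture. The last call uses the `fixed=TRUE` argument of `grep()`, which treats the pattern as a plain string, as one possible alternative to escaping.

```{r}
# Hypothetical example strings (not from the lecture).
# Note that "C:\\temp" in R code is the string C:\temp, with one backslash.
str.vec.6 = c("version 3.5", "version 305", "C:\\temp", "no specials here")

# Unescaped "." matches any character, so "3.5" and "305" both match
dot.any = grep("3.5", str.vec.6, value=TRUE)

# Escaped "\\." matches only a literal period
dot.literal = grep("3\\.5", str.vec.6, value=TRUE)

# Four backslashes in R code give a regex that matches one literal backslash
backslash = grep("\\\\", str.vec.6, value=TRUE)

# fixed=TRUE treats the pattern as a literal string, so no escaping is needed
dot.fixed = grep("3.5", str.vec.6, value=TRUE, fixed=TRUE)
```

Escaping keeps the rest of the pattern available for other metacharacters, whereas `fixed=TRUE` turns off regex interpretation for the whole pattern, so it only suits purely literal searches.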