36-350
10 September 2014
Split entries in a data file separated by commas:
strsplit(some_text, split=",")
Split entries in a data file separated by one space:
strsplit(some_text, split= " ")
Split entries in a data file separated by a comma, then a space:
strsplit(some_text, split=", ")
Split entries in a data file separated by a comma, then optionally some spaces:
???????
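`fly` matches the end of “fruitfly” and of “why walk when you can fly”
`fly` does not match “time flies like an arrow; fruit flies like a banana; a banana flies poorly”
`|` means “or”: `fly|flies` matches either `fly` or `flies`
Parentheses group alternatives: compare `time|fruit fly|flies` with `(time|fruit) (fly|flies)`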
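A minimal sketch of the difference grouping makes, using `grepl()` on a few test strings adapted from the examples above (the vector `x` and the result names are illustrative assumptions, not part of the lecture code):

```r
# Test strings adapted from the examples above
x <- c("fruitfly",
       "why walk when you can fly",
       "time flies like an arrow",
       "a banana flies poorly")

# A plain string matches wherever it occurs as a substring
plain <- grepl("fly", x)                          # TRUE TRUE FALSE FALSE

# Alternation: "fly" or "flies"
either <- grepl("fly|flies", x)                   # TRUE TRUE TRUE TRUE

# Without parentheses, | separates three whole alternatives:
# "time", "fruit fly", or "flies"
ungrouped <- grepl("time|fruit fly|flies", x)     # FALSE FALSE TRUE TRUE

# With parentheses: (time or fruit), one space, then (fly or flies)
grouped <- grepl("(time|fruit) (fly|flies)", x)   # FALSE FALSE TRUE FALSE
```

The last string separates the two patterns: it contains “flies”, but not preceded by “time ” or “fruit ”.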
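Precede a special character with `\` to match it literally (inside an R string, write `\\`)
`[]` to indicate character ranges: `[a-z]`, `[0-9]`, many pre-named ones like `[:punct:]` for punctuation marks (used inside brackets, as `[[:punct:]]`)
`^` at the start of a range negates it: `[^aeiou]` = anything except a lower-case vowel
`.` stands for any character, no brackets needed
How often?
`+` after a regexp means “1 or more times”
`*` means “0 or more times”
`?` means “0 or 1 times” (optional, once)
`{n}` means “exactly n times”
`{n,}` means “n or more times”
`{n,m}` means “between n and m times (inclusive)”
Following `+` or `*` with `?` makes it match as few as possible
`\[.+\]` matches all of `[i][j]`, but `\[.+?\]` just matches `[i]`
Quantifiers apply to the preceding character or parenthesized group: `H(TT)+` vs. `(HH|TT)+`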
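The rules above can be tried out with a short sketch. It uses `sub()` and `gsub()` only to make visible what each pattern matched, by replacing the match with a marker. The input strings and variable names are made up for illustration.

```r
# Inside R string literals, each backslash of the regexp is doubled.

# Character classes
vowels_only <- gsub("[^aeiou]", "", "regular expression")  # delete non-vowels
no_punct    <- gsub("[[:punct:]]", "", "Hello, world!")    # delete punctuation
any_char    <- gsub(".", "-", "a.b")     # unescaped . matches every character
literal_dot <- gsub("\\.", "-", "a.b")   # escaped . matches only the period

# Quantifiers
year_masked <- gsub("[0-9]{4}", "YYYY", "10 September 2014")  # exactly 4 digits
optional_u  <- gsub("colou?r", "X", "color colour")           # the u is optional
commas      <- gsub(", *", ";", "a,b, c,   d")                # comma, then 0+ spaces

# Greedy vs. non-greedy matching
greedy <- sub("\\[.+\\]", "<>", "x[i][j]")    # consumes all of [i][j]
lazy   <- sub("\\[.+?\\]", "<>", "x[i][j]")   # consumes only [i]

# What the quantifier applies to
h_then_tt <- gsub("H(TT)+", "*", "HTTTTHH")    # one H, then repeated TT
hh_or_tt  <- gsub("(HH|TT)+", "*", "HTTTTHH")  # repeated HH-or-TT pairs

print(c(greedy, lazy, h_then_tt, hh_or_tt))
```

With the default regexp engine this should print `"x<>"`, `"x<>[j]"`, `"*HH"` and `"H*"`.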
`$` means a pattern can only match at the end of a line or string
`^` means (outside of square brackets) the beginning of a line or string
`\<` and `\>` anchor to beginning or ending of words
`\b` anchors boundary (beginning or ending) of words, `\B` anywhere else
`[a-z,]$` matches lines ending in a lower-case letter or comma
`\B[A-Z]` matches capital letters not at the beginning of a word
`\1`, `\2`, etc., refer to whatever matched the 1st, 2nd, etc. parenthesized sub-expression
`[HT]+` matches any sequence of heads and tails
`([HT]+)\1` matches any sequence of heads and tails that exactly repeats
In R, a regexp is stored as a `character` variable
`strsplit` will take a regexp as its `split` argument
Splits a string into new strings at each instance of the regexp, just like it would if `split` were a string
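A short sketch of anchors and back-references, using `grepl()` on made-up strings (all vector names here are illustrative assumptions). The back-reference pattern is wrapped in `^` and `$` so that the whole string has to consist of a repeated block. Without the anchors, `([HT]+)\1` would also match the `TT` inside `HTTH`.

```r
# Anchoring to the end and the beginning of a string
ends <- c("ends with comma,", "ends with letter", "Ends With DIGIT 7", "STOP")
ends_lower_or_comma <- grepl("[a-z,]$", ends)   # TRUE TRUE FALSE FALSE
starts_upper        <- grepl("^[A-Z]", ends)    # FALSE FALSE TRUE TRUE

# Word boundaries: \b is a boundary, \B is a non-boundary
words <- c("iPhone", "Apple", "NASA")
inner_capital  <- grepl("\\B[A-Z]", words)      # TRUE FALSE TRUE
whole_word_fly <- grepl("\\bfly\\b",
                        c("fruitfly", "why walk when you can fly"))  # FALSE TRUE

# Back-reference: \1 must repeat exactly what the first group matched
flips   <- c("HTHT", "HTTH", "HH", "HTH")
repeats <- grepl("^([HT]+)\\1$", flips)         # TRUE FALSE TRUE FALSE
```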
Last time:
al2 <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/04/al2.txt")
al2 <- paste(al2, collapse=" ")
al2.words1 <- strsplit(al2, split=" ")
Weird results (e.g., punctuation marks as parts of words)
head(sort(table(al2.words1)))
al2.words1
- "the "Woe absorbs accept achieve
1 1 1 1 1 1
Better:
al2.words2 <- strsplit(al2, split="(\\s|[[:punct:]])+")[[1]]
head(sort(table(al2.words2)))
al2.words2
absorbs accept achieve against agents aid
1 1 1 1 1 1
Closer examination shows there's still a problem:
“men's” → “men”, “s”
Handle possessives: look for at least one white space, or at least one punctuation mark followed by at least one space
al2.words3 <- strsplit(al2, split="\\s+|([[:punct:]]+[[:space:]]+)")[[1]]
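To see the effect of these split patterns without downloading the text, here is a small sketch on made-up strings (`some_text`, `s` and the result names are assumptions for illustration). It also shows one possible pattern for the earlier question of splitting on a comma followed by optional spaces.

```r
# A comma, then zero or more spaces
some_text <- "alpha,beta,  gamma, delta"
fields <- strsplit(some_text, split=", *")[[1]]
# "alpha" "beta" "gamma" "delta"

s <- "The men's hats, and the women's coats."

# Any run of white space or punctuation: breaks the possessives apart
w2 <- strsplit(s, split="(\\s|[[:punct:]])+")[[1]]
# "The" "men" "s" "hats" "and" "the" "women" "s" "coats"

# White space, or punctuation followed by space: possessives survive
w3 <- strsplit(s, split="\\s+|([[:punct:]]+[[:space:]]+)")[[1]]
# "The" "men's" "hats" "and" "the" "women's" "coats."
```

Note that the final period stays attached to “coats.” under the last pattern, because punctuation at the very end of the string is not followed by a space.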
`grep()` scans a character vector for matches to a regexp
Returns either indices of matches, or matching strings
grep(x, pattern, value)
ANSS.csv.html catalogs earthquakes of magnitude 6+, 1/1/2002–1/1/2012
<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>
<li>catalog=ANSS
<li>start_time=2002/01/01,00:00:00
<li>end_time=2012/01/01,00:00:00
<li>minimum_magnitude=6.0
<li>maximum_magnitude=10
<li>event_type=E
</ul>
<PRE>
DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID
2002/01/01 10:39:06.82,-55.2140,-129.0000,10.00,6.00,Mw,78,,,1.07,NEI,2002010140
Now: extract just the data, not the search parameters and so forth
Notice: every line of data begins with a date, YYYY/MM/DD
anss <- readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/05/ANSS.csv.html", warn=FALSE)
head(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}"))
[1] 11 12 13 14 15 16
head(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}",value=TRUE))
[1] "2002/01/01 10:39:06.82,-55.2140,-129.0000,10.00,6.00,Mw,78,,,1.07,NEI,2002010140"
[2] "2002/01/01 11:29:22.73,6.3030,125.6500,138.10,6.30,Mw,236,,,0.90,NEI,2002010140"
[3] "2002/01/02 14:50:33.49,-17.9830,178.7440,665.80,6.20,Mw,215,,,1.08,NEI,2002010240"
[4] "2002/01/02 17:22:48.76,-17.6000,167.8560,21.00,7.20,Mw,427,,,0.90,NEI,2002010240"
[5] "2002/01/03 07:05:27.67,36.0880,70.6870,129.30,6.20,Mw,431,,,0.87,NEI,2002010340"
[6] "2002/01/03 10:17:36.30,-17.6640,168.0040,10.00,6.60,Mw,386,,,1.14,NEI,2002010340"
initial_date <- "^[0-9]{4}/[0-9]{2}/[0-9]{2}"
all.equal(grep(x=anss,pattern="^[0-9]{4}/[0-9]{2}/[0-9]{2}"),
grep(x=anss,pattern=initial_date))
[1] TRUE
The `invert` option:
grep(x=anss,pattern=initial_date,invert=TRUE,value=TRUE)
[1] "<HTML><HEAD><TITLE>NCEDC_Search_Results</TITLE></HEAD><BODY>Your search parameters are:<ul>"
[2] "<li>catalog=ANSS"
[3] "<li>start_time=2002/01/01,00:00:00"
[4] "<li>end_time=2012/01/01,00:00:00"
[5] "<li>minimum_magnitude=6.0"
[6] "<li>maximum_magnitude=10"
[7] "<li>event_type=E"
[8] "</ul>"
[9] "<PRE>"
[10] "DateTime,Latitude,Longitude,Depth,Magnitude,MagType,NbStations,Gap,Distance,RMS,Source,EventID"
[11] "</PRE>"
[12] "</BODY></HTML>"
When you just want a Boolean vector saying where the matches are:
grepl(x=anss,pattern=initial_date)[1:20]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
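The relationship between the index, value, inverted and Boolean forms can be checked on a toy vector that mimics the layout of the ANSS file. This is a sketch: `toy_lines` is made up, and only the date pattern is taken from above.

```r
# A few made-up lines in the same layout as the real file
toy_lines <- c("<PRE>",
               "DateTime,Latitude,Longitude",
               "2002/01/01 10:39:06.82,-55.2140,-129.0000",
               "2002/01/02 14:50:33.49,-17.9830,178.7440",
               "</PRE>")
initial_date <- "^[0-9]{4}/[0-9]{2}/[0-9]{2}"

idx     <- grep(x=toy_lines, pattern=initial_date)    # positions: 3 4
is_data <- grepl(x=toy_lines, pattern=initial_date)   # FALSE FALSE TRUE TRUE FALSE

# value=TRUE is the same as indexing by the positions
same_values <- identical(toy_lines[idx],
                         grep(x=toy_lines, pattern=initial_date, value=TRUE))

# invert=TRUE is the same as negating the Boolean vector
same_inverted <- identical(toy_lines[!is_data],
                           grep(x=toy_lines, pattern=initial_date,
                                invert=TRUE, value=TRUE))

n_data <- sum(is_data)   # a Boolean vector also makes counting matches easy
```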
`regexpr()` returns location of first match in the target string, plus attributes like length of matching substring
`gregexpr()` returns a list of this for all matches
`-1` means no match
`regmatches()` takes the output of `regexpr()` or `gregexpr()` and a string, and returns the matching strings
Why separate `regexpr()` from `regmatches()`?
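A small sketch of how these three functions fit together, on two made-up strings (the names `x`, `num`, `first` and `every` are illustrative, not from the lecture):

```r
x   <- c("lat 36.0880 lon 70.6870", "no numbers here")
num <- "[0-9]+\\.[0-9]+"        # digits, a literal period, digits

# First match only: one position per input string, -1 where nothing matched
first     <- regexpr(pattern=num, text=x)
first_pos <- as.vector(first)                # 5 -1
first_len <- attr(first, "match.length")     # 7 -1

# regmatches() with regexpr() output drops the strings that had no match
first_text <- regmatches(x, first)           # "36.0880"

# All matches: a list with one element per input string
every      <- gregexpr(pattern=num, text=x)
every_text <- regmatches(x, every)
# [[1]] "36.0880" "70.6870"    [[2]] character(0)
```

One possible reason for keeping the two steps apart: the match positions from `regexpr()` can be used on their own for counting or locating matches, even when the matched text itself is never needed.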
Get the (latitude, longitude) pair for each earthquake:
one_geo_coord <- paste("-?[0-9]+\\.[0-9]{4}")
pair_geo_coords <- paste(rep(one_geo_coord,2),collapse=",")
have_coords <- grepl(x=anss,pattern=pair_geo_coords)
coord.matches <- gregexpr(pattern=pair_geo_coords,text=anss[have_coords])
coords <- regmatches(x=anss[have_coords],m=coord.matches)
coord.matches[1]
[[1]]
[1] 24
attr(,"match.length")
[1] 18
attr(,"useBytes")
[1] TRUE
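`useBytes`: TRUE here because this text is plain ASCII, 1 byte per character, so byte positions and character positions coincide. Other alphabets need longer (multi-byte) encodings, where matching has to go character by character, i.e. `useBytes=FALSE`, which is the default value of the argument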
head(coords)
[[1]]
[1] "-55.2140,-129.0000"
[[2]]
[1] "6.3030,125.6500"
[[3]]
[1] "-17.9830,178.7440"
[[4]]
[1] "-17.6000,167.8560"
[[5]]
[1] "36.0880,70.6870"
[[6]]
[1] "-17.6640,168.0040"
You thought we'd forgotten data frames, didn't you?
coords <- do.call(c,coords) # De-list-ify to vector
coord.pairs <- strsplit(coords,",") # Break apart latitude and longitude
coord.df <- do.call(rbind, coord.pairs) # De-list-ify to array
coord.df <- apply(coord.df,2,as.numeric) # Character to numeric
coord.df <- as.data.frame(coord.df)
colnames(coord.df) <- c("Latitude","Longitude")
head(coord.df)
Latitude Longitude
1 -55.214 -129.00
2 6.303 125.65
3 -17.983 178.74
4 -17.600 167.86
5 36.088 70.69
6 -17.664 168.00
library(maps)
map("world")
points(x=coord.df$Longitude, y=coord.df$Latitude, pch=19, col="red")
Assigning to `regmatches()` changes the matched string, just like `substr()`
`sub()` and `gsub()` work like `regexpr()` and `gregexpr()`, but with an extra `replacement` argument
`sub()` produces a new string; assigning to `regmatches()` modifies the original one
Really, assigning to `regmatches()` creates a new string, destroys the old one, and assigns the new string the old name
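A minimal sketch contrasting the two approaches on one made-up timestamp (the variable names are assumptions for illustration):

```r
s <- "2002/01/01 10:39:06.82"

# sub() replaces the first match, gsub() all of them; both return new strings
first_only <- sub("/", "-", s)     # "2002-01/01 10:39:06.82"
all_of_it  <- gsub("/", "-", s)    # "2002-01-01 10:39:06.82"
unchanged  <- identical(s, "2002/01/01 10:39:06.82")   # s itself is untouched

# Back-references also work in the replacement string
reordered <- sub("([0-9]{4})/([0-9]{2})/([0-9]{2})", "\\3.\\2.\\1", s)
# "01.01.2002 10:39:06.82"

# Assigning to regmatches() changes the variable it is applied to
s2 <- s
regmatches(s2, regexpr("[0-9]{4}", s2)) <- "YYYY"
# s2 is now "YYYY/01/01 10:39:06.82"
```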