36-350
8 September 2014
Most data we deal with is in character form!
Even if you only care about numbers, it helps to be able to extract them from text and manipulate them easily.
Characters are single symbols: 'L', 'i', 'n', 'c', 'o', 'l', 'n'
A string is a sequence of characters bound together: Lincoln
Note: R does not have a separate type for characters and strings
mode("L")
[1] "character"
mode("Lincoln")
[1] "character"
class("Lincoln")
[1] "character"
Use single or double quotes to construct a string; use nchar() to get the length of a single string. Why do we prefer double quotes?
"Lincoln"
[1] "Lincoln"
"Abraham Lincoln"
[1] "Abraham Lincoln"
"Abraham Lincoln's Hat"
[1] "Abraham Lincoln's Hat"
"As Lincoln never said, \"Four score and seven beers ago\""
[1] "As Lincoln never said, \"Four score and seven beers ago\""
The space, " ", is a character string; so are multiple spaces, "   ", and the empty string, "".
Some characters are special, so we have “escape characters” to specify them in strings:
double quote \"
tab \t
new line \n
carriage return \r (use \n rather than \r when possible)
Characters are one of the atomic data types, like numeric or logical.
They can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame.
length("Abraham Lincoln's beard")
[1] 1
length(c("Abraham", "Lincoln's", "beard"))
[1] 3
nchar("Abraham Lincoln's beard")
[1] 23
nchar(c("Abraham", "Lincoln's", "beard"))
[1] 7 9 5
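Escape sequences take two keystrokes to type but are stored as a single character, which nchar() makes visible. A small sketch; the variable name quote.str is only for illustration and is not from the slides:

```r
# Each escape sequence counts as one character
nchar("\t")        # a tab: 1
nchar("a\nb")      # "a", newline, "b": 3

quote.str <- "Lincoln said \"hi\""   # illustrative name
nchar(quote.str)   # 17: the backslashes themselves are not counted
```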
Character data works just like the other types, e.g., in vectors:
president <- "Lincoln"
nchar(president) # NOT 9
[1] 7
presidents <- c("Fillmore","Pierce","Buchanan","Davis","Johnson")
presidents[3]
[1] "Buchanan"
presidents[-(1:3)]
[1] "Davis" "Johnson"
We know print(), of course; cat() writes the string directly to the console. If you're debugging, message() is R's preferred function, since it writes to stderr rather than standard output.
print("Abraham Lincoln")
[1] "Abraham Lincoln"
cat("Abraham Lincoln")
Abraham Lincoln
cat(presidents)
Fillmore Pierce Buchanan Davis Johnson
message(presidents)
FillmorePierceBuchananDavisJohnson
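The differences among the three show up most clearly with a vector and with escape characters. A short sketch using only base R; the vector is re-created here so that the block runs on its own:

```r
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

# cat() separates elements with a space by default; sep= changes that
cat(presidents, sep = ", ")
cat("\n")                      # cat() does not add a newline on its own

# print() shows escapes as typed; cat() interprets them
print("tab\there")
cat("tab\there\n")

# message() pastes its arguments together with no separator,
# appends a newline, and writes to stderr rather than stdout
message("Checking ", length(presidents), " names")
```

Because message() output goes to stderr, it stays out of the way when the ordinary output of a script is redirected to a file.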
Substring: a smaller string from the big string, but still a string in its own right.
A string is not a vector or a list, so we cannot use subscripts like [[ ]] or [ ] to extract substrings; we use substr() instead.
phrase <- "Christmas Bonus"
substr(phrase, start=8, stop=12)
[1] "as Bo"
We can also use substr() to replace elements:
substr(phrase, 13, 13) <- "g"
phrase
[1] "Christmas Bogus"
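Replacement through substr() never changes the length of the string: it overwrites characters in place and uses only as much of the replacement as fits. A brief sketch of both cases; the replacement values are made up for illustration:

```r
phrase <- "Christmas Bonus"

# Replacement longer than the target range: only its first character is used
substr(phrase, 13, 13) <- "ggg"
phrase   # "Christmas Bogus", still 15 characters

# Replacement shorter than the target range: only that many characters change
phrase <- "Christmas Bonus"
substr(phrase, 11, 15) <- "Gift"
phrase   # "Christmas Gifts": the final "s" is left alone
```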
substr() vectorizes over all its arguments:
presidents
[1] "Fillmore" "Pierce" "Buchanan" "Davis" "Johnson"
substr(presidents,1,2) # First two characters
[1] "Fi" "Pi" "Bu" "Da" "Jo"
substr(presidents,nchar(presidents)-1,nchar(presidents)) # Last two
[1] "re" "ce" "an" "is" "on"
substr(presidents,20,21) # No such substrings so return the null string
[1] "" "" "" "" ""
substr(presidents,7,7) # Explain!
[1] "r" "" "a" "" "n"
strsplit() divides a string according to key characters, by splitting each element of the character vector x at appearances of the pattern split.
scarborough.fair <- "parsley, sage, rosemary, thyme"
strsplit(scarborough.fair, ",")
[[1]]
[1] "parsley" " sage" " rosemary" " thyme"
strsplit(scarborough.fair, ", ")
[[1]]
[1] "parsley" "sage" "rosemary" "thyme"
Pattern is recycled over elements of the input vector:
strsplit(c(scarborough.fair, "Garfunkel, Oates", "Clement, McKenzie"), ", ")
[[1]]
[1] "parsley" "sage" "rosemary" "thyme"
[[2]]
[1] "Garfunkel" "Oates"
[[3]]
[1] "Clement" "McKenzie"
Note that it outputs a list of character vectors. Why should this be the default?
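One reason for the list: each input string can split into a different number of pieces, so the results will not generally fit into a matrix, and a single flat vector would lose track of which piece came from which input. A sketch of working with the list output; the names inputs and pieces are only for illustration:

```r
inputs <- c("parsley, sage, rosemary, thyme",
            "Garfunkel, Oates",
            "Clement, McKenzie")
pieces <- strsplit(inputs, ", ")

# How many pieces did each string produce?
sapply(pieces, length)               # 4 2 2

# First piece of each string
sapply(pieces, function(p) p[1])     # "parsley" "Garfunkel" "Clement"

# Flatten to one character vector when the grouping does not matter
unlist(pieces)
```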
Converting one variable type to another is called casting:
as.character(7.2) # Obvious
[1] "7.2"
as.character(7.2e12) # Obvious
[1] "7.2e+12"
as.character(c(7.2,7.2e12)) # Obvious
[1] "7.2" "7.2e+12"
as.character(7.2e5) # Not quite so obvious
[1] "720000"
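Casting also runs the other way, which is what lets us pull numbers out of text. A small sketch; strings that do not look like numbers become NA, with a warning that is suppressed here only to keep the output clean:

```r
as.numeric("7.2")                 # 7.2
as.numeric("7.2e12")              # scientific notation is understood
as.numeric("7.2") + 1             # arithmetic works after casting

# Non-numeric text becomes NA (R also issues a warning)
suppressWarnings(as.numeric(c("41", "forty-two", "43")))

# The round trip need not reproduce the original text
as.character(as.numeric("7.2e5")) # "720000", not "7.2e5"
```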
The paste() function is very flexible!
With one vector argument, it works like as.character():
paste(41:45)
[1] "41" "42" "43" "44" "45"
With 2 or more vector arguments, combines them with recycling:
paste(presidents,41:45)
[1] "Fillmore 41" "Pierce 42" "Buchanan 43" "Davis 44" "Johnson 45"
paste(presidents,c("R","D")) # Not historically accurate!
[1] "Fillmore R" "Pierce D" "Buchanan R" "Davis D" "Johnson R"
paste(presidents,"(",c("R","D"),41:45,")")
[1] "Fillmore ( R 41 )" "Pierce ( D 42 )" "Buchanan ( R 43 )"
[4] "Davis ( D 44 )" "Johnson ( R 45 )"
Changing the separator between pasted-together terms:
paste(presidents, " (", 41:45, ")", sep="_")
[1] "Fillmore_ (_41_)" "Pierce_ (_42_)" "Buchanan_ (_43_)"
[4] "Davis_ (_44_)" "Johnson_ (_45_)"
paste(presidents, " (", 41:45, ")", sep="")
[1] "Fillmore (41)" "Pierce (42)" "Buchanan (43)" "Davis (44)"
[5] "Johnson (45)"
Exercise: what happens if you give sep a vector?
Exercise: Convince yourself of why this works as it does
paste(c("HW","Lab"),rep(1:11,times=rep(2,11)))
[1] "HW 1" "Lab 1" "HW 2" "Lab 2" "HW 3" "Lab 3" "HW 4"
[8] "Lab 4" "HW 5" "Lab 5" "HW 6" "Lab 6" "HW 7" "Lab 7"
[15] "HW 8" "Lab 8" "HW 9" "Lab 9" "HW 10" "Lab 10" "HW 11"
[22] "Lab 11"
Producing one big string:
paste(presidents, " (", 41:45, ")", sep="", collapse="; ")
[1] "Fillmore (41); Pierce (42); Buchanan (43); Davis (44); Johnson (45)"
The default value of collapse is NULL; that is, paste() does not collapse the result unless asked.
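The two arguments do different jobs: sep goes between the arguments inside each pasted element, while collapse goes between the elements of the result. A sketch contrasting them on a toy vector, also using the base R shorthand paste0(), which is paste() with sep="":

```r
x <- c("a", "b", "c")   # toy data

paste(x, 1:3)                              # "a 1" "b 2" "c 3"
paste(x, 1:3, sep = "-")                   # "a-1" "b-2" "c-3"
paste(x, 1:3, sep = "-", collapse = "+")   # one string: "a-1+b-2+c-3"

# paste0() is shorthand for sep=""
paste0(x, 1:3)                             # "a1" "b2" "c3"
paste0(x, 1:3, collapse = "")              # "a1b2c3"
```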
R has a standard syntax for models: a formula, with the outcome to the left of ~ and the predictors to the right. We can build one as a string:
my.formula <- function(dep,indeps,df) {
  # Join the predictor column names with "+"
  rhs <- paste(colnames(df)[indeps], collapse="+")
  # sep="" so that the only spaces are the ones around "~"
  return(paste(colnames(df)[dep], " ~ ", rhs, sep=""))
}
my.formula(2,c(3,5,7),df=state.x77)
[1] "Income ~ Illiteracy+Murder+Frost"
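The string only becomes useful to modeling functions once it is converted to a formula object. A sketch of that last step with as.formula() and lm(); state.x77 is a matrix, so it is converted to a data frame first, and the names predictors, f and fit are illustrative:

```r
# Build the formula string with paste(), in the same spirit as above
predictors <- c("Illiteracy", "Murder", "Frost")
f <- paste("Income", "~", paste(predictors, collapse = "+"))
f                                   # "Income ~ Illiteracy+Murder+Frost"

# Convert the string to a formula and fit a linear model
fit <- lm(as.formula(f), data = as.data.frame(state.x77))
names(coef(fit))                    # intercept plus the three predictors
```

Building the formula from strings pays off when the set of predictors is chosen by code, for instance inside a loop over candidate models.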
If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman's two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said “the judgments of the Lord are true and righteous altogether.”
al2 <- readLines("al2.txt")
length(al2)
[1] 58
head(al2)
[1] "Fellow-Countrymen:"
[2] ""
[3] "At this second appearing to take the oath of the Presidential office there is"
[4] "less occasion for an extended address than there was at the first. Then a"
[5] "statement somewhat in detail of a course to be pursued seemed fitting and"
[6] "proper. Now, at the expiration of four years, during which public declarations"
al2 is a vector, one element per line of text
Narrowing down entries: use grep() to find which strings have a matching search term:
grep("God", al2)
[1] 34 35 41 45 47 54
grepl("God", al2)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[45] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
[56] FALSE FALSE FALSE
al2[grep("God", al2)]
[1] "God, and each invokes His aid against the other. It may seem strange that any"
[2] "men should dare to ask a just God's assistance in wringing their bread from the"
[3] "offenses which, in the providence of God, must needs come, but which, having"
[4] "attributes which the believers in a living God always ascribe to Him? Fondly"
[5] "pass away. Yet, if God wills that it continue until all the wealth piled by"
[6] "God gives us to see the right, let us strive on to finish the work we are in,"
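grep() and grepl() take a few arguments that are handy even before we get to patterns. A sketch on a three-line stand-in vector, named txt here, which is not the real al2 (so the block runs without al2.txt):

```r
# Stand-in lines of text; NOT the actual contents of al2.txt
txt <- c("God gives us to see the right",
         "with malice toward none",
         "a just God's assistance")

grep("God", txt)                  # indices of matching lines: 1 3
grep("God", txt, value = TRUE)    # the matching lines themselves
sum(grepl("God", txt))            # how many lines match: 2

# Matching is case-sensitive unless told otherwise
grepl("god", txt)                       # FALSE FALSE FALSE
grepl("god", txt, ignore.case = TRUE)   # TRUE FALSE TRUE
```

With value = TRUE, grep() does in one call what al2[grep("God", al2)] does in two steps.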
Make one long string, then split the words
al2 <- paste(al2, collapse=" ")
al2.words <- strsplit(al2, split=" ")[[1]]
head(al2.words)
[1] "Fellow-Countrymen:" "" "At"
[4] "this" "second" "appearing"
Tabulate how often each word appears, put in order:
wc <- table(al2.words)
wc <- sort(wc,decreasing=TRUE)
head(wc,20)
al2.words
  the    to         and    of  that   for    be    in    it     a  this
   54    26    25    24    22    11     9     8     8     8     7     7
  war which   all    by    we  with    as   but
    7     7     6     6     6     6     5     5
names(wc) gives all the distinct words in al2.words (types); wc counts how often they appear (tokens)
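The type/token distinction is easy to check numerically: the number of types is the length of the table, and the number of tokens is the sum of its counts. A sketch on a toy word vector, made up for illustration:

```r
toy.words <- c("the", "war", "the", "of", "the", "war")   # toy data
toy.wc <- sort(table(toy.words), decreasing = TRUE)

length(toy.wc)        # number of types: 3
sum(toy.wc)           # number of tokens: 6
names(toy.wc)[1]      # most common type: "the"
toy.wc["war"]         # count for one type: 2
```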
The null string is the third-most-common word:
names(wc)[3]
[1] ""
Punctuation:
wc["years"]
years
3
wc["years,"]
years,
1
Capitalization:
wc["that"]
that
11
wc["That"]
That
1
All of this can be fixed if we learn how to work with text patterns and not just constants.
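Two of these problems already yield to tools that work with constants: empty strings can be dropped by direct comparison, and capitalization can be normalized with tolower(). Punctuation is the part that really needs patterns. A sketch on a toy vector, which is not the real al2.words:

```r
toy <- c("That", "", "that", "years", "years,", "", "That")   # toy data

toy <- toy[toy != ""]       # drop the empty strings
toy <- tolower(toy)         # fold case so "That" and "that" agree
table(toy)
# "years" and "years," are still counted as separate types
```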
substr() extracts and substitutes
strsplit() turns strings into vectors
paste() turns vectors into strings
table() for counting how many tokens belong to each type
Next time: searching for text patterns using regular expressions