# Information Retrieval 2 — Queries and Relevance

Lecture 3, 4 September 2019

# What we know how to do

• Take a big collection of items (e.g., documents)
• Represent each item as a vector of features (e.g., bag of words)
• Calculate distances between items (e.g., Euclidean distance with normalization and IDF)
• Find nearest items to a given item
source("http://www.stat.cmu.edu/~cshalizi/dm/19/hw/01/nytac-and-bow.R")
art.BoW.list <- lapply(art.stories, table)
music.BoW.list <- lapply(music.stories, table)
nyt.BoW.frame <- make.BoW.frame(c(art.BoW.list, music.BoW.list),
    row.names = c(paste("art", 1:length(art.BoW.list), sep = "."),
        paste("music", 1:length(music.BoW.list), sep = ".")))
dim(nyt.BoW.frame)
## [1]  102 4431
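The helper functions above come from the sourced course file. A self-contained toy version of the same idea, with two made-up documents (the names `docs`, `bow.list`, `lexicon`, and `bow.matrix` are mine, not course helpers), might look like:

```r
# Toy bag-of-words pipeline: each document becomes a row of word counts
# over the shared lexicon, zero-filling words the document never uses.
docs <- list(art.1 = c("the", "gallery", "show", "the", "painting"),
             music.1 = c("the", "jazz", "band", "played"))
bow.list <- lapply(docs, table)  # per-document word counts
lexicon <- sort(unique(unlist(lapply(bow.list, names))))
bow.matrix <- t(sapply(bow.list, function(b) {
    counts <- rep(0, length(lexicon))
    names(counts) <- lexicon
    counts[names(b)] <- b  # fill in the words this document does use
    counts
}))
dim(bow.matrix)
## [1] 2 7
```

Two documents, seven distinct words; the real corpus above is the same structure at 102 by 4431.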

# The trick: Queries are documents

• Turn the query string into a bag of words vector
• Find distances to other vectors in the data base
• Return the closest items
• Optional: weigh distance against other measures of quality

# The trick in action

query.by.similarity <- function(query, BoW.frame) {
    # Turn the query string into a bag-of-words vector
    query.vec = strip.text(query)
    query.BoW = table(query.vec)
    # Restrict the query to the corpus lexicon, zero-filling missing words
    lexicon = colnames(BoW.frame)
    query.vocab = names(query.BoW)
    query.lex = query.BoW[intersect(query.vocab, lexicon)]
    query.lex[setdiff(lexicon, query.vocab)] = 0
    query.lex = query.lex[lexicon]  # same column order as the corpus
    q = t(as.matrix(query.lex))
    # IDF-weight and length-normalize the corpus and the query alike
    idf = get.idf.weights(BoW.frame)
    BoW = scale.cols(BoW.frame, idf)
    q = q * idf
    BoW = div.by.euc.length(BoW)
    q = q/sqrt(sum(q^2))
    best.index = nearest.points(q, BoW)$which
    best.name = rownames(BoW)[best.index]
    return(list(best.index = best.index, best.name = best.name))
}

get.idf.weights <- function(x) {
    doc.freq <- colSums(x > 0)
    doc.freq[doc.freq == 0] <- 1
    w <- log(nrow(x)/doc.freq)
    return(w)
}

query.by.similarity("jazz lincoln center", nyt.BoW.frame)
## $best.index
## [1] 96
##
## $best.name
## [1] "music.39"

paste(music.stories[[39]][1:25])
##  [1] "perched"     "five"        "stories"     "above"       "columbus"
##  [6] "circle"      "in"          "the"         "time"        "warner"
## [11] "center"      "rafael"      "vi"          "olys"        "new"
## [16] "design"      "for"         "jazz"        "at"          "lincoln"
## [21] "center"      "has"         "a"           "cool"        "ethereality"

query.by.similarity("painting sale", nyt.BoW.frame)
## $best.index
## [1] 30
##
## $best.name
## [1] "art.30"

paste(art.stories[[30]][1:25])
##  [1] "xl"          "xavier"      "laboulbenne" "gallery"     "#"
##  [6] "west"        "#nd"         "street"      "chelsea"     "through"
## [11] "feb"         "#popular"    "culture"     "may"         "be"
## [16] "the"         "mainspring"  "for"         "a"           "lot"
## [21] "of"          "new"         "art"         "but"         "it"

# Evaluating queries

• Usually done in terms of relevance
• Something is relevant (to the user) if it makes a difference to what they think, how they act, etc.
• Want all the relevant items, and only relevant items
• Precision: What fraction of returned items are relevant
$$= \frac{\mathrm{number\ of\ hits}}{\mathrm{number\ of\ items\ returned}}$$
• Recall: What fraction of relevant items are returned
$$= \frac{\mathrm{number\ of\ hits}}{\mathrm{number\ of\ relevant\ items}}$$

# Precision-recall curve

• Trade-off:
• Returning everything guarantees 100% recall
• If you’re any good at all, returning fewer, higher-ranked items improves precision
• The precision-recall curve:
• Threshold in terms of number of items returned, or confidence in the relevance, etc.
• For each value of the threshold, calculate precision and recall of the query
• Plot precision vs. recall
• Connect the dots

# Expanding on the query: Rocchio’s algorithm

• Get users to show you what’s relevant, instead of trying to get them to tell you
• Run a query with vector $$\vec{q}_t$$, show the user the results
• User marks results as relevant ($$R$$) or not-relevant ($$N$$)
• $$\vec{q}_{t+1} = \alpha \vec{q}_t + \frac{\beta}{|R|}\sum_{\vec{x} \in R}{\vec{x}} - \frac{\gamma}{|N|}\sum_{\vec{y}\in N}{\vec{y}}$$
• $$\alpha$$: continuity between old query and new
• $$\beta$$: amps up recall (more like the relevant stuff!)
• $$\gamma$$: amps up precision (less like the irrelevant stuff!)
• $$\alpha, \beta, \gamma$$ are control settings, not parameters to estimate
• Iterate with the new query vector $$\vec{q}_{t+1}$$

# Rocchio’s algorithm and adaptation

• Basic strategy: “do more of what worked and less of what didn’t”
• Needs feedback about what worked and what didn’t
• Other instances of the basic strategy:
• Lots of online / incremental estimation procedures
• Psychological conditioning
• Reinforcement learning (Sutton and Barto 1998)
• Natural selection and evolutionary optimization procedures (Mitchell 1996)
• Bayesian inference (Shalizi 2009)

# Evaluating relevance is hard

• Conceptually: it’s not just binary but at least scalar
• Excellent if subtle psychology: Sperber and Wilson (1995)
• Practically: how do you tell whether document X was relevant to query Q?
• Users don’t usually tell you!
• You could ask them (but that costs time, money, trouble…)
• Substitutes for relevance: engagement, clicks, payments, …

# Classification

• Assign $$X$$ a binary label, “positive” or “negative”:
• Relevant / irrelevant
• Spam / not-spam
• Cancerous cell / healthy cell
• Fraudulent transaction / genuine transaction
• (3+ classes are similar but need extra notation)
• Two kinds of errors:
• False positive: Classifier says “positive” when it’s not (false alarm)
• False negative: Classifier misses a true positive (miss)
• False positive = lack of precision
• False negative = lack of recall
• Confusion matrix: $$2\times 2$$ table of true class vs. guessed class

# Some classification methods

• Nearest neighbors
• Guess the same class as the most similar (labeled) item
• $$k$$ nearest neighbors
• Majority vote among the $$k$$ most similar items
• Higher $$k$$ $$\Rightarrow$$ less noise, more (possible) bias
• Linear classifiers
• Everything on one side of a hyperplane is in one class
• Mathematically: is $$\vec{x} \cdot \vec{b} + b_0 \geq 0$$?
• The hyperplane $$\vec{x} \cdot \vec{b} + b_0 = 0$$ is the decision boundary
• Prototype method
• Each class has a prototype point (often the class average)
• Assign new points to the class with the closer prototype
• A linear classifier (see backup)

# Nearest-neighbor-method demo:

# Which story is in which class?
story.classes <- c(rep("art", times = length(art.stories)),
    rep("music", times = length(music.stories)))
nyt.similarities <- distances(div.by.euc.length(idf.weight(nyt.BoW.frame)))
NNs <- nearest.points(nyt.BoW.frame, d = nyt.similarities)$which
NN.classes <- story.classes[NNs]
# Average error rate
mean(NN.classes != story.classes)
## [1] 0.1862745
# 'Confusion matrix'
table(story.classes, NN.classes)
##              NN.classes
## story.classes art music
##         art    45    12
##         music   7    38

# Exercise:

Say “art” is positive and “music” is negative

What’s the false positive rate, i.e., the probability that a story about music will be falsely classified as about art?

What’s the false negative rate?

What’s the positive predictive value, i.e., the probability that a story classified as “art” is actually about art?
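One way to check your answers afterwards, reading the rates straight off the confusion matrix above (the names `hits`, `fpr`, `fnr`, and `ppv` are mine):

```r
# Confusion matrix from the nearest-neighbor demo, with "art" as positive
hits <- c(art.art = 45, art.music = 12, music.art = 7, music.music = 38)
fpr <- hits["music.art"] / (hits["music.art"] + hits["music.music"])  # = 7/45
fnr <- hits["art.music"] / (hits["art.art"] + hits["art.music"])      # = 12/57
ppv <- hits["art.art"] / (hits["art.art"] + hits["music.art"])        # = 45/52
unname(round(c(fpr, fnr, ppv), 3))
## [1] 0.156 0.211 0.865
```

Note that the false positive rate conditions on the true class (music), while the positive predictive value conditions on the guessed class (art).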

# Prototype-method demo:

nyt.BoW.normed.idf <- div.by.euc.length(idf.weight(nyt.BoW.frame))
dim(nyt.BoW.normed.idf)
## [1]  102 4431
prototypical.art <- colMeans(nyt.BoW.normed.idf[story.classes == "art", ])
prototypical.music <- colMeans(nyt.BoW.normed.idf[story.classes == "music", ])
prototypes <- rbind(prototypical.art, prototypical.music)
prototype.matches <- nearest.points(nyt.BoW.normed.idf, prototypes)$which
prototype.classes <- c("art", "music")[prototype.matches]
mean(prototype.classes != story.classes)
## [1] 0
table(story.classes, prototype.classes)
##              prototype.classes
## story.classes art music
##         art    57     0
##         music   0    45

(Don’t expect the prototype method to always out-perform nearest neighbors!)
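The demo above uses the single nearest neighbor; the slides also mention majority voting over the $$k$$ nearest neighbors. A generic sketch on a precomputed distance matrix, with toy one-dimensional data (`knn.classify` is my name, not a course helper):

```r
# Majority-vote k-NN from a precomputed distance matrix
knn.classify <- function(dist.mat, labels, k) {
    apply(dist.mat, 1, function(d) {
        d[d == 0] <- Inf  # crude way to keep a point from voting for itself
        neighbors <- order(d)[1:k]
        votes <- table(labels[neighbors])
        names(votes)[which.max(votes)]
    })
}
# Two well-separated 1-D classes as a sanity check
x <- c(1, 1.1, 1.2, 5, 5.1, 5.2)
labels <- c("art", "art", "art", "music", "music", "music")
preds <- knn.classify(as.matrix(dist(x)), labels, k = 3)
```

With $$k = 3$$, every point’s two same-class neighbors outvote the one cross-class neighbor, so every point gets its own class back.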

# Summing up

• We represent our database of items as feature vectors
• We do similarity search by looking at distance in feature space
• Queries are also (represented by) feature vectors, so similar items are (represented by) near-by vectors
• Good searches have high precision (everything they return is relevant) and high recall (they return everything relevant), but there’s usually a trade-off
• Users are bad at describing what they want ($$\Rightarrow$$ adapt) and we’re bad at evaluating actual relevance ($$\Rightarrow$$ proxies)
• Search is a kind of classification; there are others

# Looking forward

dim(nyt.BoW.frame)
## [1]  102 4431
• 4431 features
• Too many for us to grasp
• Lots of parameters for any model
• Many features are useless
• Many are redundant
• And real data sets have even larger lexicons
• In other settings we may start with only weak representations with even more features
• e.g., pixels
• Next up: dimension reduction
• Which means: linear algebra

# Backup: The Prototype method is a linear classifier

• Say the two prototypes are $$\vec{p}_0$$ and $$\vec{p}_1$$
• We assign $$\vec{x}$$ to class 1 if $$\| \vec{x} - \vec{p}_1\| \leq \|\vec{x}-\vec{p}_0\|$$; otherwise it is assigned to class 0
• Same inequality applies to squared distances:
$$\begin{aligned}
\| \vec{x} - \vec{p}_1\| & \leq \|\vec{x}-\vec{p}_0\|\\
\| \vec{x} - \vec{p}_1\|^2 & \leq \|\vec{x}-\vec{p}_0\|^2\\
\| \vec{x}\|^2 - 2 \vec{x}\cdot\vec{p}_1 + \|\vec{p}_1\|^2 & \leq \|\vec{x}\|^2 - 2 \vec{x}\cdot\vec{p}_0 + \|\vec{p}_0\|^2\\
0 & \leq \vec{x}\cdot 2(\vec{p}_1-\vec{p}_0) + \|\vec{p}_0\|^2 - \|\vec{p}_1\|^2
\end{aligned}$$
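A quick numerical check of the algebra, comparing nearest-prototype assignment against the sign test with $$\vec{w} = 2(\vec{p}_1 - \vec{p}_0)$$ and offset $$\|\vec{p}_0\|^2 - \|\vec{p}_1\|^2$$ (random data; the names `by.distance` and `by.hyperplane` are mine):

```r
# Nearest-prototype assignment should match the linear sign test exactly
set.seed(42)
p0 <- rnorm(3)
p1 <- rnorm(3)
w <- 2 * (p1 - p0)
b0 <- sum(p0^2) - sum(p1^2)
x <- matrix(rnorm(300), ncol = 3)  # 100 random test points
by.distance <- apply(x, 1, function(v) {
    sum((v - p1)^2) <= sum((v - p0)^2)  # TRUE means class 1
})
by.hyperplane <- as.vector((x %*% w + b0) >= 0)
agree <- all(by.distance == by.hyperplane)
```

The two rules agree on all 100 points (`agree` is `TRUE`), as the derivation says they must.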
• Query: Can every linear classifier be written as a prototype method for some choice of prototypes?

# Backup: Time complexity of nearest neighbor vs. prototype methods

$$n$$ data points, 2 classes

• Nearest neighbors:
• Each prediction needs finding the distance to all $$n$$ points, so each prediction takes $$O(n)$$ operations
• No set-up cost
• Prototype method:
• Each prediction requires calculating only 2 distances
• Set-up requires two averages, each of which takes $$O(n)$$ to compute

# References

Mitchell, Melanie. 1996. An Introduction to Genetic Algorithms. Cambridge, Massachusetts: MIT Press.

Shalizi, Cosma Rohilla. 2009. “Dynamics of Bayesian Updating with Dependent Data and Misspecified Models.” Electronic Journal of Statistics 3:1039–74. https://doi.org/10.1214/09-EJS485.

Sperber, Dan, and Deirdre Wilson. 1995. Relevance: Communication and Cognition. 2nd ed. Oxford: Basil Blackwell.

Sutton, Richard S., and Andrew G. Barto. 1998. Reinforcement Learning: An Introduction. Cambridge, Massachusetts: MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html.