---
title: Recommender Systems I --- The What and How
author: 36-462/662, Data Mining, Fall 2019
date: Lecture 12 (7 October 2019)
bibliography: locusts.bib
output: slidy_presentation
---
\[
\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!
\]
## Recommender systems
"You may also like", "Customers also bought", feeds in social media, ...
Generically, two stages:
- _Predict_ some outcome for user / item interactions
+ Ratings (a la Netflix)
+ Purchases
+ Clicks
+ "Engagement"
- _Maximize_ the prediction
+ Don't bother telling people what they won't like
+ (Usually)
- Subtle issues with treatment effects which we'll get to next time
## Very simple (dumb) baselines
- The best-seller / most-popular list
+ Prediction is implicit: everyone's pretty much like everyone else, so
use average ratings
+ We've been doing this for at least 100 years
* An interesting exercise: go to the historical best-seller lists (e.g., <https://lithub.com/here-are-the-biggest-fiction-bestsellers-of-the-last-100-years/>) and see how long it takes to find a book or even an author you recognize (whether or not you've read it)
+ Good experimental evidence that it really does alter what (some) people do [@Salganik-Dodds-Watts-artificial-cultural-market; @Salganik-Watts-leading-the-herd-astray]
- Co-purchases, association lists
+ Not much user modeling
+ Problems of really common items
* (For a while, Amazon recommended Harry Potter books to _everyone_ after _everything_)
+ Also problems for really rare items
* (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
* (You can imagine your own privacy-destroying nightmare here)
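Both baselines fit in a few lines. A minimal sketch, with invented purchase records standing in for real data:

```python
from collections import Counter
from itertools import combinations

# Invented user -> purchased-items records, for illustration only
purchases = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C"},
    "u4": {"A"},
}

# Best-seller list: rank items by how many users bought them
popularity = Counter(item for basket in purchases.values() for item in basket)
best_sellers = [item for item, _ in popularity.most_common()]

# Co-purchase counts: how often each pair of items shares a basket
co_counts = Counter()
for basket in purchases.values():
    for pair in combinations(sorted(basket), 2):
        co_counts[pair] += 1

def also_bought(item):
    """Items most often co-purchased with `item`, most frequent first."""
    scores = Counter()
    for (a, b), n in co_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]
```

Note the rare-item problem is visible even here: a pair bought together once by one user already puts it on the list.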
## Two immediate baselines, which we've seen
- Nearest neighbors
+ Content-based
+ Item-based
- Factor models / matrix factorization
## Nearest neighbors: content-based
- Represent each item as a $p$-dimensional feature vector
- Take the items user $i$ has liked
- Treat the user as a query:
+ Find the average item vector for user $i$
+ What are the items closest to that average?
- Refinements:
+ Find nearest neighbors for each liked item, prioritize anything that's a match to multiple items
+ Use dis-likes to filter
+ Do a more general regression of ratings on features
- Drawback: need features on the items which track what users care about
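The recipe above, sketched under the assumption that someone has already built item feature vectors (the features and "liked" set here are invented):

```python
import numpy as np

# Invented p = 2 dimensional feature vectors for four items
item_features = {
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.9, 0.1]),
    "C": np.array([0.0, 1.0]),
    "D": np.array([0.1, 0.9]),
}

def recommend(liked, n=1):
    """Average the liked items' vectors, return the unseen items nearest it."""
    query = np.mean([item_features[k] for k in liked], axis=0)
    candidates = [k for k in item_features if k not in liked]
    # Rank candidates by Euclidean distance to the query vector
    candidates.sort(key=lambda k: np.linalg.norm(item_features[k] - query))
    return candidates[:n]
```

A user who liked only "A" gets "B" (the nearest unseen item); the refinements above would replace the single averaged query with one query per liked item.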
## Nearest neighbors: item-based
- Items are features
- For user $i$ and potential item $k$, in principle we use all other users $j$ and all other items $l$ to predict $x_{ik}$
- With a few million users and ten thousand features, we don't want this
to be $O(np^2)$
+ All our tricks from last time for fast NN
+ More hashing tricks to quickly find the users with more ratings in
common with user $i$
+ Only make predictions for items highly similar to items $i$ has already rated
* Items are similar when they get similar ratings from different users (i.e., users are features for items)
* Or even: only make predictions for items highly similar to items $i$ has already liked
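A bare-bones sketch of the item-based idea, on an invented ratings matrix with `NaN` for unrated entries; items count as similar when different users rate them similarly, i.e., the users serve as the items' features:

```python
import numpy as np

# Invented ratings: rows = users, columns = items, NaN = unrated
X = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [np.nan, 4.0, 2.0],   # we'll predict item 0 for user 3
])

def item_sim(a, b):
    """Cosine similarity of items a and b, over users who rated both."""
    mask = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
    u, v = X[mask, a], X[mask, b]
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def predict(i, k):
    """Similarity-weighted average of user i's ratings of the other items."""
    rated = [l for l in range(X.shape[1]) if l != k and not np.isnan(X[i, l])]
    sims = np.array([item_sim(k, l) for l in rated])
    return sims @ X[i, rated] / sims.sum()
```

Here user 3's prediction for item 0 lands near the rating of the most-similar item; the speed tricks in the bullets above are about avoiding the all-pairs `item_sim` computation.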
## Factor models
- Again, items are features
- Minimize
\[
\sum_{(i,k)\ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}
\]
+ $r$ runs along the latent dimensions/factors
+ $f_{ir}$ is how high user $i$ scores on factor $r$
+ $g_{rk}$ is how much item $k$ loads on factor $r$
+ Could tweak this to let each item have its own variance
- **Matrix factorization** because we're saying $\mathbf{x} \approx \mathbf{f} \mathbf{g}$, where $\mathbf{x}$ is $[n\times p]$, $\mathbf{f}$ is $[n\times q]$ and $\mathbf{g}$ is $[q \times p]$
- Practical minimization: gradient descent, alternating between $\mathbf{f}$ and $\mathbf{g}$
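A minimal sketch of that alternating gradient descent, on an invented $3\times 3$ ratings matrix; the step size, iteration count, and random seed are arbitrary choices, not tuned values:

```python
import numpy as np

# Invented ratings matrix; NaN marks unobserved entries
X = np.array([[5.0, 4.0, np.nan],
              [4.0, np.nan, 1.0],
              [1.0, 2.0, 5.0]])
obs = ~np.isnan(X)                    # mask of observed (i, k) pairs
n, p, q = 3, 3, 2                     # q = number of latent factors

rng = np.random.default_rng(0)
f = 0.1 * rng.standard_normal((n, q))   # user scores
g = 0.1 * rng.standard_normal((q, p))   # item loadings
lr = 0.01
for step in range(20000):
    resid = np.where(obs, X - f @ g, 0.0)  # squared-error gradient uses
    if step % 2 == 0:                      # observed cells only
        f += lr * resid @ g.T              # gradient step in f, g held fixed
    else:
        g += lr * f.T @ resid              # gradient step in g, f held fixed

rmse = np.sqrt(np.mean((X - f @ g)[obs] ** 2))
```

Because the sum only runs over observed $(i,k)$ pairs, missing entries simply contribute nothing to the gradient, which is how factorization handles missing values later in these slides.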
## Social recommendations
- Suppose we persuade the users to give us a social network
+ $a_{ij} = 1$ if user $i$ follows user $j$ (and $=0$ otherwise)
- Presume that people are similar to those they follow
- So estimate:
\[
\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}}
\]
- **Exercise**: What's $\widehat{x_{ik}}$ in terms of the neighbors?
- Refinements:
+ Some information from neighbors' neighbors, etc.
+ Some information from neighbors' ratings of similar items
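A sketch of the network-smoothed estimate, on an invented follower graph and ratings table; solving the least-squares problem above gives an average over the followed users who rated the item (which is the answer to the exercise):

```python
# Invented follower graph and (user, item) -> rating table
follows = {"i": {"j1", "j2"}, "j1": set(), "j2": set()}
ratings = {("j1", "k"): 4.0, ("j2", "k"): 2.0}

def predict(user, item):
    """Mean rating of `item` among the users that `user` follows."""
    vals = [ratings[(j, item)] for j in follows[user] if (j, item) in ratings]
    return sum(vals) / len(vals)
```

So `predict("i", "k")` is just the neighbors' average, here $3.0$; the refinements above would extend the average out to neighbors' neighbors with smaller weights.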
## Combining approaches
- Nothing says you have to pick just one method!
- Fit multiple models and predict a weighted average of the models
- E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
- Or: use one model as a base, then fit a second model to its residuals and
add the residual-model's predictions to the base model's
* E.g., use factor model as a base and then kNN on its residuals
* Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
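The residual-stacking idea can be sketched with the simplest possible pair of models, a global-average base plus a per-item model fit to its residuals (data invented; real stacks would use the models named above):

```python
import numpy as np

# Invented fully-observed ratings: rows = users, columns = items
X = np.array([[5.0, 1.0],
              [4.0, 2.0],
              [5.0, 1.0]])

base = X.mean()                     # stage 1: one global average rating
resid = X - base                    # what the base model missed
item_offsets = resid.mean(axis=0)   # stage 2: per-item model on the residuals
pred = base + item_offsets          # combined prediction for each item
```

Each stage only has to explain what the previous stages got wrong, which is why a simple base plus a simple residual model can beat either alone.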
## Some obstacles to all approaches
- The "cold start" problem: what to do for new users/items?
+ New users: global averages, or social averaging if that's available
* Maybe present them with items with high information first?
+ New items: content-based predictions, _or_ hope that everyone isn't relying on your system completely
- Missing values are informative
- Tastes change
## Missing values are informative
- Both factorization and NNs can handle missing values
+ Factorization: We saw how to do this
+ NNs: only use the variables with observed values for the user we want
to make predictions for
- BUT not rating something is informative
+ You may not have heard of it...
+ ... or it may be the kind of thing you don't like
* I rate mystery novels, not Christian parenting guides or how-to books on accounting software
- Often substantial improvements from explicitly modeling missingness [@Marlin-et-al-collaborative-filtering-and-missing-at-random]
## Tastes change
- Down-weight old data
+ Easy but abrupt: don't care about ratings more than, say, 100 days old
+ Or: only use the last 100 ratings
+ Or: make weights on items a gradually-decaying function of age
- Could also try to explicitly model change in tastes, but that adds to the
computational burden
+ One simple approach for factor models: $\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)$
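That drift update is a one-line exponentially-weighted average; a sketch with invented factor vectors and an arbitrary $\alpha$:

```python
import numpy as np

alpha = 0.9                              # memory parameter (invented value)
f_old = np.array([1.0, 0.0])             # user's factor scores at time t
f_batch = np.array([0.0, 1.0])           # fresh estimate from data at t+1
f_new = alpha * f_old + (1 - alpha) * f_batch
```

Iterating this geometrically down-weights old data, so it's the smooth counterpart of the hard 100-day cutoff above.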
## Maximization
- Once you have predicted ratings, pick the highest-predicted ratings
+ Finding the maximum of $p$ predicted ratings takes $O(p)$ time,
so it helps if you can cut this down
* Sorting
* Early stopping if it looks like the predicted rating will be low
+ We've noted some tricks for only predicting ratings for items likely to be good
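In Python, the standard-library heap avoids a full sort when you only want the top few items, $O(p \log n_{\mathrm{top}})$ rather than $O(p \log p)$ (the predicted ratings below are invented):

```python
import heapq

# Invented predicted ratings for one user
predicted = {"A": 3.1, "B": 4.8, "C": 2.2, "D": 4.5, "E": 1.0}

# Top 2 items by predicted rating, without sorting all of them
top = heapq.nlargest(2, predicted, key=predicted.get)
```

This gives `["B", "D"]`; the early-stopping and candidate-restriction tricks above reduce how many of the $p$ predictions get computed at all.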
## Summing up
- Recommendation systems work by first predicting what items users will like,
and then maximizing the predictions
- Basically all prediction methods assume $x_{ik}$ can be estimated from
$x_{jl}$ when $j$ and/or $l$ are similar to $i$ and/or $k$
+ More or less elaborate models
+ Different notions of similarity
- Everyone wants to restrict the computational burden that comes with large
$n$ and $p$
## References (in addition to the background reading on the course homepage)