# Recommender Systems I — The What and How

Lecture 12 (7 October 2019)

$\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!$

# Recommender systems

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

• Predict some outcome for user / item interactions
• Ratings (a la Netflix)
• Purchases
• Clicks
• “Engagement”
• Maximize the prediction
• Don’t bother telling people what they won’t like
• (Usually)
• Subtle issues with treatment effects which we’ll get to next time

# Very simple (dumb) baselines

• The best-seller / most-popular list
• Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
• We’ve been doing this for at least 100 years
• Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
• Co-purchases, association lists
• Not much user modeling
• Problems of really common items
• (For a while, Amazon recommended Harry Potter books to everyone after everything)
• Also problems for really rare items
• (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
• (You can imagine your own privacy-destroying nightmare here)
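A minimal sketch of these two baselines on hypothetical purchase data (the user and item names are made up for illustration). The best-seller list is just a popularity count; the co-purchase list counts how often pairs of items show up in the same basket, with no user modeling at all:

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: user -> set of items bought
baskets = {
    "u1": {"potter1", "potter2", "algebra"},
    "u2": {"potter1", "potter2"},
    "u3": {"potter1", "thriller"},
}

# Best-seller baseline: rank items by raw purchase count
popularity = Counter(item for items in baskets.values() for item in items)

# Co-purchase counts: how often two items appear in the same basket
co = Counter()
for items in baskets.values():
    for a, b in combinations(sorted(items), 2):
        co[(a, b)] += 1

print(popularity.most_common(2))   # "potter1" tops the list
print(co[("potter1", "potter2")])  # bought together twice
```

Note how the really common item (`potter1`) co-occurs with everything, which is exactly the Harry Potter problem above.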

# Two immediate baselines, which we’ve seen

• Nearest neighbors
• Content-based
• Item-based
• Factor models / matrix factorization

# Nearest neighbors: content-based

• Represent each item as a $$p$$-dimensional feature vector
• Take the items user $$i$$ has liked
• Treat the user as a query:
• Find the average item vector for user $$i$$
• What are the items closest to that average?
• Refinements:
• Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
• Use dis-likes to filter
• Do a more general regression of ratings on features
• Drawback: need features on the items which track what users care about

# Nearest neighbors: item-based

• Items are features
• For user $$i$$ and potential item $$k$$, in principle we use all other users $$j$$ and all other items $$l$$ to predict $$x_{ik}$$
• With a few million users and ten thousand features, we don’t want this to be $$O(np^2)$$
• All our tricks from last time for fast NN
• More hashing tricks to quickly find the users with more ratings in common with user $$i$$
• Only make predictions for items highly similar to items $$i$$ has already rated
• Items are similar when they get similar ratings from different users (i.e., users are features for items)
• Or even: only make predictions for items highly similar to items $$i$$ has already liked
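A minimal brute-force sketch of item-based prediction on a tiny hypothetical ratings matrix. Items are compared via cosine similarity over the users who rated both, and user $$i$$'s missing rating is a similarity-weighted average of their other ratings; all the speed tricks above are omitted:

```python
import numpy as np

# Hypothetical ratings matrix: users are rows, items are columns, np.nan = unrated
X = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 1.0,    np.nan],
    [1.0, np.nan, 5.0, 4.0],
])

def item_similarity(X, k, l):
    """Cosine similarity between items k and l over users who rated both."""
    both = ~np.isnan(X[:, k]) & ~np.isnan(X[:, l])
    a, b = X[both, k], X[both, l]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def predict(X, i, k):
    """Predict x_ik as a similarity-weighted average of user i's other ratings."""
    rated = [l for l in range(X.shape[1]) if l != k and not np.isnan(X[i, l])]
    w = np.array([item_similarity(X, k, l) for l in rated])
    r = np.array([X[i, l] for l in rated])
    return w @ r / w.sum()
```

In practice one would restrict `rated` to the few items most similar to $$k$$, as the bullets above suggest.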

# Factor models

• Again, items are features
• Minimize $\sum_{(i,k)\ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}$
• $$r$$ runs along the latent dimensions/factors
• $$f_{ir}$$ is how high user $$i$$ scores on factor $$r$$
• $$g_{rk}$$ is how much item $$k$$ loads on factor $$r$$
• Could tweak this to let each item have its own variance
• Matrix factorization because we’re saying $$\mathbf{x} \approx \mathbf{f} \mathbf{g}$$, where $$\mathbf{x}$$ is $$[n\times p]$$, $$\mathbf{f}$$ is $$[n\times q]$$ and $$\mathbf{g}$$ is $$[q \times p]$$
• Practical minimization: gradient descent, alternating between $$\mathbf{f}$$ and $$\mathbf{g}$$
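A minimal sketch of that alternating gradient descent on hypothetical data, with the step size and iteration count chosen ad hoc for this toy problem. Missing entries are simply zeroed out of the residual, so only observed $$(i,k)$$ pairs contribute to the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings matrix with missing entries (np.nan = unobserved)
X = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0,    4.0],
])
mask = ~np.isnan(X)
n, p = X.shape
q = 2  # number of latent factors

f = 0.1 * rng.standard_normal((n, q))  # user scores on factors
g = 0.1 * rng.standard_normal((q, p))  # item loadings on factors

eta = 0.01  # step size (ad hoc for this toy example)
for _ in range(2000):
    # Residuals over observed entries only; missing entries contribute 0
    resid = np.where(mask, X - f @ g, 0.0)
    f = f + eta * resid @ g.T   # update f holding g fixed...
    g = g + eta * f.T @ resid   # ...then g holding f fixed
```

After the loop, `f @ g` approximates the observed entries and fills in the missing ones.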

# Social recommendations

• Suppose we persuade the users to give us a social network
• $$a_{ij} = 1$$ if user $$i$$ follows user $$j$$ (and $$=0$$ otherwise)
• Presume that people are similar to those they follow
• So estimate: $\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij} (m-x_{jk})^2}}$

Exercise: What’s $$\widehat{x_{ik}}$$ in terms of the neighbors?

• Refinements:
• Some information from neighbors’ neighbors, etc.
• Some information from neighbors’ ratings of similar items
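A minimal sketch of the network-smoothed estimate on a hypothetical follow graph (note that it also gives away the exercise: the least-squares minimizer is an average over the followed users who rated the item):

```python
import numpy as np

# Hypothetical follow graph: a[i, j] = 1 if user i follows user j
a = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
])
# Ratings of one item k by each user (np.nan = unrated)
x_k = np.array([np.nan, 4.0, 2.0])

def social_predict(a, x_k, i):
    """Least-squares estimate of user i's rating from the users i follows."""
    neighbors = [j for j in range(len(x_k))
                 if a[i, j] == 1 and not np.isnan(x_k[j])]
    return np.mean([x_k[j] for j in neighbors])
```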

# Combining approaches

• Nothing says you have to pick just one method!
• Fit multiple models and predict a weighted average of the models
• E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
• Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
• E.g., use factor model as a base and then kNN on its residuals
• Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
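A minimal sketch of the base-plus-residuals idea on hypothetical data. Here the base model is each item's average rating, and a deliberately trivial second model (each user's average residual) stands in for the factor model or kNN one would actually fit to the residuals:

```python
import numpy as np

# Hypothetical observed ratings (np.nan = missing)
X = np.array([
    [5.0, 4.0, np.nan],
    [4.0, np.nan, 1.0],
    [1.0, 2.0, 5.0],
])
mask = ~np.isnan(X)

# Base model: each item's average rating
item_means = np.nanmean(X, axis=0)
base = np.tile(item_means, (X.shape[0], 1))

# Second model fit to the base model's residuals; here just each user's
# average residual, standing in for a factor model or kNN on residuals
resid = np.where(mask, X - base, np.nan)
user_offset = np.nanmean(resid, axis=1, keepdims=True)

# Final prediction: base model plus the residual model
pred = base + user_offset
```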

# Some obstacles to all approaches

• The “cold start” problem: what to do for new users/items?
• New users: global averages, or social averaging if that’s available
• Maybe present them with items with high information first?
• New items: content-based predictions, or hope that everyone isn’t relying on your system completely
• Missing values are informative
• Tastes change

# Missing values are informative

• Both factorization and NNs can handle missing values
• Factorization: We saw how to do this
• NNs: only use the variables with present values in the user we want to make predictions for
• BUT not rating something is informative
• You may not have heard of it…
• … or it may be the kind of thing you don’t like
• I rate mystery novels, not Christian parenting guides or how-to books on accounting software
• Often substantial improvements from explicitly modeling missingness (Marlin et al. 2007)

# Tastes change

• Down-weight old data
• Easy but abrupt: don’t care about ratings more than, say, 100 days old
• Or: only use the last 100 ratings
• Or: make weights on items a gradually-decaying function of age
• Could also try to explicitly model change in tastes, but that adds to the computational burden
• One simple approach for factor models: $$\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)$$
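A minimal sketch of that update rule; the value of $$\alpha$$ (closer to 1 = longer memory of old tastes) is an assumption, not something prescribed by the slides:

```python
import numpy as np

def update_factors(f_old, f_new, alpha=0.8):
    """Exponentially-weighted update of a user's factor vector:
    f(t+1) = alpha * f(t) + (1 - alpha) * (new estimate at t+1)."""
    return alpha * f_old + (1 - alpha) * f_new

f = np.array([1.0, 0.0])               # old factor estimate
f = update_factors(f, np.array([0.0, 1.0]))  # taste shifting toward factor 2
```

Iterating this geometrically down-weights old estimates, the smooth analogue of the hard age cutoffs above.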

# Maximization

• Once you have predicted ratings, pick the highest-predicted ratings
• Finding the maximum of $$p$$ items takes $$O(p)$$ time in the worst case, so it helps if you can cut this down
• Sorting
• Early stopping if it looks like the predicted rating will be low
• We’ve noted some tricks for only predicting ratings for items likely to be good
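A minimal sketch of picking the top few items without a full sort, using numpy's partial sort over hypothetical predicted ratings (a full sort is $$O(p \log p)$$; `argpartition` finds the top $$k$$ in $$O(p)$$):

```python
import numpy as np

# Hypothetical predicted ratings for p = 6 candidate items
preds = np.array([3.1, 4.8, 2.2, 4.9, 1.0, 4.5])

k = 3
top_k = np.argpartition(preds, -k)[-k:]       # top-k indices, unordered, O(p)
top_k = top_k[np.argsort(preds[top_k])[::-1]] # order best-first, O(k log k)
```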

# Summing up

• Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
• Basically all prediction methods assume $$x_{ik}$$ can be estimated from $$x_{jl}$$ when $$j$$ and/or $$l$$ are similar to $$i$$ and/or $$k$$
• More or less elaborate models
• Different notions of similarity
• Everyone wants to restrict the computational burden that comes with large $$n$$ and $$p$$

# References (in addition to the background reading on the course homepage)

Marlin, Benjamin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. 2007. “Collaborative Filtering and the Missing at Random Assumption.” In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI 2007). https://arxiv.org/abs/1206.5267.

Salganik, Matthew J., Peter S. Dodds, and Duncan J. Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” Science 311:854–56. http://www.princeton.edu/~mjs3/musiclab.shtml.

Salganik, Matthew J., and Duncan J. Watts. 2008. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” Social Psychological Quarterly 71:338–55. http://www.princeton.edu/~mjs3/salganik_watts08.pdf.