36-462/662, Data Mining, Fall 2019

Lecture 12 (7 October 2019)

\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \]

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

*Predict* some outcome for user / item interactions
- Ratings (a la Netflix)
- Purchases
- Clicks
- “Engagement”

*Maximize* the prediction
- Don’t bother telling people what they won’t like
- (Usually)

- Subtle issues with treatment effects which we’ll get to next time

- The best-seller / most-popular list
- Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
- We’ve been doing this for at least 100 years
- An interesting exercise: go to the historical best-seller lists (at, e.g., <https://lithub.com/here-are-the-biggest-fiction-bestsellers-of-the-last-100-years/>) and see how long it takes to find a book, or even an author, you recognize (whether or not you’ve read it)

- Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
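The “everyone is pretty much alike” baseline is just ranking items by their average rating. A minimal sketch, with made-up toy ratings:

```python
# Most-popular recommender: rank items by their average observed rating.
# Hypothetical toy data: ratings[user][item] = the rating that user gave.
from collections import defaultdict

ratings = {
    "alice": {"book1": 5, "book2": 3},
    "bob":   {"book1": 4, "book3": 2},
    "carol": {"book2": 5, "book3": 1},
}

totals = defaultdict(lambda: [0.0, 0])   # item -> [sum of ratings, count]
for user_ratings in ratings.values():
    for item, r in user_ratings.items():
        totals[item][0] += r
        totals[item][1] += 1

# Sort items by average rating, highest first.
best_sellers = sorted(totals, key=lambda it: totals[it][0] / totals[it][1],
                      reverse=True)
print(best_sellers)   # book1 (avg 4.5), then book2 (4.0), then book3 (1.5)
```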

- Co-purchases, association lists
- Not much user modeling
- Problems of really common items
- (For a while, Amazon recommended Harry Potter books to *everyone* after *everything*)
- Also problems for really rare items
- (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
- (You can imagine your own privacy-destroying nightmare here)
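Co-purchase lists amount to counting how often items share a basket. A toy sketch (all baskets invented), which also shows how a ubiquitous item swamps the raw counts:

```python
# Co-purchase counting: for each item, tally which other items appear in the
# same purchase baskets. Baskets are made up for illustration.
from collections import Counter, defaultdict
from itertools import combinations

baskets = [
    {"harry_potter", "math_book"},
    {"harry_potter", "cookbook"},
    {"harry_potter", "cookbook", "novel"},
]

co_counts = defaultdict(Counter)
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_counts[a][b] += 1
        co_counts[b][a] += 1

# A really common item dominates everyone's co-purchase list -- the
# "Harry Potter after everything" problem:
print(co_counts["cookbook"].most_common(1))   # [('harry_potter', 2)]
```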

- Nearest neighbors
- Content-based
- Item-based

- Factor models / matrix factorization

- Represent each item as a \(p\)-dimensional feature vector
- Take the items user \(i\) has liked
- Treat the user as a query:
- Find the average item vector for user \(i\)
- What are the items closest to that average?

- Refinements:
- Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
- Use dis-likes to filter
- Do a more general regression of ratings on features

- Drawback: need features on the items which track what users care about
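The average-item-vector query can be sketched in a few lines; the feature vectors here are invented for illustration:

```python
# Content-based sketch: treat the user as a query vector (the average of the
# feature vectors of the items they liked), then rank the remaining items by
# distance to that query.
import numpy as np

item_features = {                # p = 2 hypothetical features per item
    "a": np.array([1.0, 0.0]),
    "b": np.array([0.9, 0.1]),
    "c": np.array([0.0, 1.0]),
    "d": np.array([0.1, 0.9]),
}
liked = ["a", "b"]

# The user's "query": the average feature vector of their liked items.
profile = np.mean([item_features[i] for i in liked], axis=0)

# Rank unseen items by Euclidean distance to the profile.
candidates = [i for i in item_features if i not in liked]
candidates.sort(key=lambda i: np.linalg.norm(item_features[i] - profile))
print(candidates[0])   # "d": the closest remaining item to the profile
```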

- Items are features
- For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\)
- With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
- All our tricks from last time for fast NN
- More hashing tricks to quickly find the users with the most ratings in common with user \(i\)
- Only make predictions for items highly similar to items \(i\) has already rated
- Items are similar when they get similar ratings from different users (i.e., users are features for items)
- Or even: only make predictions for items highly similar to items \(i\) has already liked
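One way to sketch the user-based prediction of \(x_{ik}\): weight other users' ratings of item \(k\) by how similarly they rated the items they share with user \(i\). The particular similarity function and the toy ratings below are my own choices, not from the notes:

```python
# User-based NN prediction of x[i, k], computing similarity only on the
# items both users have actually rated. np.nan marks "not rated".
import numpy as np

X = np.array([              # rows: users, columns: items
    [5.0, 4.0, np.nan],     # user 0: we want to predict item 2
    [5.0, 4.0, 2.0],        # user 1: agrees with user 0 where they overlap
    [1.0, 2.0, 5.0],        # user 2: disagrees with user 0
])

def predict(X, i, k):
    num, den = 0.0, 0.0
    for j in range(X.shape[0]):
        if j == i or np.isnan(X[j, k]):
            continue
        shared = ~np.isnan(X[i]) & ~np.isnan(X[j])
        shared[k] = False                    # don't peek at the target item
        if not shared.any():
            continue
        # Similarity: inverse of mean squared disagreement on shared items.
        w = 1.0 / (1.0 + np.mean((X[i, shared] - X[j, shared]) ** 2))
        num += w * X[j, k]
        den += w
    return num / den

# Prediction leans toward the agreeing user's rating of 2, not the 5:
print(predict(X, 0, 2))   # 2.25
```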

- Again, items are features
- Minimize \[
\sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}
\]
- \(r\) runs along the latent dimensions/factors
- \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
- \(g_{rk}\) is how heavily item \(k\) loads on factor \(r\)
- Could tweak this to let each item have its own variance

**Matrix factorization** because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
- Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\)
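A toy sketch of that alternating gradient descent, with the sum of squares taken only over observed entries (the dimensions, learning rate, and step count here are all made up):

```python
# Alternating gradient descent for x ~ f g, fit only to observed entries.
# Shapes follow the text: x is [n x p], f is [n x q], g is [q x p].
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 15, 3
f_true = rng.normal(size=(n, q))
g_true = rng.normal(size=(q, p))
x = f_true @ g_true
mask = rng.random((n, p)) < 0.7          # ~70% of entries observed

f = rng.normal(scale=0.1, size=(n, q))   # small random initialization
g = rng.normal(scale=0.1, size=(q, p))
lr = 0.01
for step in range(2000):
    err = np.where(mask, x - f @ g, 0.0)   # residuals on observed cells only
    if step % 2 == 0:
        f += lr * err @ g.T                # gradient step in f, holding g fixed
    else:
        g += lr * f.T @ err                # gradient step in g, holding f fixed

rmse = np.sqrt(np.mean((x - f @ g)[mask] ** 2))
print(round(rmse, 3))
```

The reconstruction error on the observed cells shrinks toward zero as the two factors are updated in turn.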

- Nothing says you have to pick just one method!
- Fit multiple models and predict a weighted average of the models
- E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
- Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
- E.g., use factor model as a base and then kNN on its residuals
- Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
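The residual-stacking idea in miniature, with a global-average base model and a per-item correction fit to the base model's residuals (toy data, and a deliberately simple "second model"):

```python
# Stacking by residuals: fit a base model, fit a second model to the base
# model's residuals, and predict with base + residual-model.
import numpy as np

ratings = {   # (user, item) -> rating; toy data
    (0, "a"): 5, (0, "b"): 3,
    (1, "a"): 4, (1, "b"): 2,
    (2, "a"): 5,
}

base = np.mean(list(ratings.values()))       # base model: global average

# Second model: average residual per item.
residuals = {}
for (u, it), r in ratings.items():
    residuals.setdefault(it, []).append(r - base)
item_offset = {it: np.mean(res) for it, res in residuals.items()}

def predict(user, item):
    # Base prediction plus the residual-model's correction for this item.
    return base + item_offset.get(item, 0.0)

print(predict(2, "b"))   # global average 3.8, pulled down by item b's residuals
```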

- The “cold start” problem: what to do for new users/items?
- New users: global averages, or social averaging if that’s available
- Maybe present them with items with high information first?

- New items: content-based predictions, *or* hope that everyone isn’t relying completely on your system
- Missing values are informative
- Tastes change

- Both factorization and NNs can handle missing values
- Factorization: We saw how to do this
- NNs: only use the variables with present values in the user we want to make predictions for

- BUT not rating something is informative
- You may not have heard of it…
- … or it may be the kind of thing you don’t like
- I rate mystery novels, not Christian parenting guides or how-to books on accounting software

- Often substantial improvements from explicitly modeling missingness (Marlin et al. 2007)

- Down-weight old data
- Easy but abrupt: don’t care about ratings more than, say, 100 days old
- Or: only use the last 100 ratings
- Or: make weights on items a gradually-decaying function of age

- Could also try to explicitly model change in tastes, but that adds to the computational burden
- One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
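Both the decaying-weights scheme and the EMA update for factor vectors can be sketched directly (the decay rate of \(0.99\) per day, \(\alpha = 0.9\), and all data below are arbitrary choices for illustration):

```python
# Recency handling: (1) exponentially decaying weights on old ratings,
# (2) the update f_i(t+1) = alpha*f_i(t) + (1-alpha)*(new estimate).
import numpy as np

# (1) A rating that is `age` days old gets weight 0.99**age.
ages = np.array([0, 10, 100, 400])           # toy ages, in days
ratings = np.array([4.0, 5.0, 2.0, 1.0])
weights = 0.99 ** ages
weighted_avg = np.sum(weights * ratings) / np.sum(weights)
print(round(weighted_avg, 2))                # recent ratings dominate

# (2) EMA update of a user's factor vector.
alpha = 0.9
f_old = np.array([0.5, 0.1])                 # current factor estimate
f_new_estimate = np.array([0.7, 0.3])        # estimate from the latest data
f_updated = alpha * f_old + (1 - alpha) * f_new_estimate
print(f_updated)                             # stays close to the old vector
```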

- Once you have predicted ratings, pick the highest-predicted ratings
- Finding the maximum of \(p\) items takes \(O(p)\) time in the worst case, so it helps if you can cut this down
- Sorting
- Early stopping if it looks like the predicted rating will be low

- We’ve noted some tricks for only predicting ratings for items likely to be good

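One concrete shortcut: Python's `heapq.nlargest` finds the top \(k\) of \(p\) predictions in roughly \(O(p \log k)\) time, without sorting all \(p\) items (toy scores below):

```python
# Pick the k highest predicted ratings without a full O(p log p) sort.
import heapq

predicted = {"a": 3.2, "b": 4.7, "c": 1.1, "d": 4.9, "e": 2.8}  # toy scores
top2 = heapq.nlargest(2, predicted, key=predicted.get)
print(top2)   # ['d', 'b']
```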

- Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
- Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
- More or less elaborate models
- Different notions of similarity

- Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)

## References

Marlin, Benjamin, Richard S. Zemel, Sam Roweis, and Malcolm Slaney. 2007. “Collaborative Filtering and the Missing at Random Assumption.” In *Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence [UAI 2007]*. https://arxiv.org/abs/1206.5267.

Salganik, Matthew J., Peter S. Dodds, and Duncan J. Watts. 2006. “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market.” *Science* 311:854–56. http://www.princeton.edu/~mjs3/musiclab.shtml.

Salganik, Matthew J., and Duncan J. Watts. 2008. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” *Social Psychology Quarterly* 71:338–55. http://www.princeton.edu/~mjs3/salganik_watts08.pdf.

## Social recommendations

Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?