$\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!$


# 1 Recommender systems

## 1.1 The basic idea

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

• Predict some outcome for user / item interactions
• Ratings (a la Netflix)
• Purchases
• Clicks
• “Engagement”
• Maximize the prediction
• Don’t bother telling people what they won’t like
• (Usually)
• Subtle issues with prediction vs. action which we’ll get to next time

## 1.2 Very simple (dumb) baselines

• The best-seller / most-popular list
• Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
• We’ve been doing this for at least 100 years
• Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
• Co-purchases, association lists
• Not much user modeling
• Problems of really common items
• (For a while, Amazon recommended Harry Potter books to everyone after everything)
• Also problems for really rare items
• (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
• (You can imagine your own privacy-destroying nightmare here)

# 2 Common approaches: nearest neighbors, matrix factorization, social recommendation

• Nearest neighbors
• Content-based
• Item-based
• PCA-like dimension reduction, matrix factorization
• Social recommendation: what did your friends like?

## 2.1 Nearest neighbors

### 2.1.1 Content-based nearest neighbors

• Represent each item as a $$p$$-dimensional feature vector
• Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
• Take the items user $$i$$ has liked
• Treat the user as a vector:
• Find the average item vector for user $$i$$
• What are the items closest to that average?
• Refinements:
• Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
• Use dis-likes to filter
• Do a more general regression of ratings on features
• Drawback: need features on the items which track what users actually care about
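A minimal R sketch of the averaging approach above; the feature matrix `items` (one row per item) and the index vector `liked` are hypothetical stand-ins, not anything defined in these notes:

```r
# Content-based recommendation sketch: treat the user as the average of the
# feature vectors of the items they liked, then rank items by distance to it.
recommend.content <- function(items, liked, n.recs = 5) {
  profile <- colMeans(items[liked, , drop = FALSE])   # the user's average item
  dists <- sqrt(rowSums(sweep(items, 2, profile)^2))  # Euclidean distances to it
  dists[liked] <- Inf                                 # don't re-recommend likes
  order(dists)[1:n.recs]                              # indices of the closest items
}
```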

### 2.1.2 Item-based nearest neighbors

• Items are features
• For user $$i$$ and potential item $$k$$, in principle we use all other users $$j$$ and all other items $$l$$ to predict $$x_{ik}$$
• With a few million users and ten thousand features, we don’t want this to be $$O(np^2)$$
• Use all the tricks for finding nearest neighbors quickly
• Only make predictions for items highly similar to items $$i$$ has already rated
• Items are similar when they get similar ratings from different users (i.e., users are features for items)
• Or even: only make predictions for items highly similar to items $$i$$ has already liked
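As a concrete (if slow) illustration of “users are features for items”, here is a hedged R sketch; `x` is a ratings matrix with `NA` for unrated entries, and all names are hypothetical:

```r
# Item-based nearest neighbors sketch: two items are similar when the users
# who rated both gave them similar ratings.
item.sim <- function(x, k, l) {
  both <- which(!is.na(x[, k]) & !is.na(x[, l]))  # users who rated both items
  if (length(both) < 2) return(0)                 # too little overlap to judge
  cor(x[both, k], x[both, l])                     # correlation across those users
}

# Predict user i's rating of item k as a similarity-weighted average of the
# ratings i has already given.
predict.item.nn <- function(x, i, k) {
  rated <- which(!is.na(x[i, ]))
  sims <- sapply(rated, function(l) item.sim(x, k, l))
  sims[is.na(sims)] <- 0                          # e.g., constant-rating items
  weighted.mean(x[i, rated], w = pmax(sims, 0))   # keep only positive similarities
}
```

In practice one would precompute and cache the similarities for short-lists of highly similar items, rather than recompute them for every prediction.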

## 2.2 Dimension reduction

• Again, items are features
• Fix a number of (latent) factors $$q$$
• Minimize $\sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}$
• $$r$$ runs along the latent dimensions/factors
• $$f_{ir}$$ is how high user $$i$$ scores on factor $$r$$
• $$g_{rk}$$ is how much item $$k$$ loads on factor $$r$$
• Could tweak this to let each item have its own variance
• Matrix factorization because we’re saying $$\mathbf{x} \approx \mathbf{f} \mathbf{g}$$, where $$\mathbf{x}$$ is $$[n\times p]$$, $$\mathbf{f}$$ is $$[n\times q]$$ and $$\mathbf{g}$$ is $$[q \times p]$$
• Practical minimization: gradient descent, alternating between $$\mathbf{f}$$ and $$\mathbf{g}$$
• See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
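The notes suggest gradient descent alternating between $$\mathbf{f}$$ and $$\mathbf{g}$$; here is a minimal alternating least-squares variant of the same alternating idea in R, assuming (hypothetically) that every user and every item has at least $$q$$ ratings so each small regression is solvable:

```r
# Matrix factorization sketch: reduce squared error over the observed
# entries of x by alternating exact least-squares updates of f and g.
factorize <- function(x, q, n.iters = 50) {
  n <- nrow(x); p <- ncol(x)
  f <- matrix(rnorm(n * q), n, q)  # random initial user scores
  g <- matrix(rnorm(q * p), q, p)  # random initial item weights
  for (it in 1:n.iters) {
    for (i in 1:n) {               # g fixed: each user is a small regression
      obs <- which(!is.na(x[i, ]))
      f[i, ] <- qr.solve(t(g[, obs, drop = FALSE]), x[i, obs])
    }
    for (k in 1:p) {               # f fixed: each item is a small regression
      obs <- which(!is.na(x[, k]))
      g[, k] <- qr.solve(f[obs, , drop = FALSE], x[obs, k])
    }
  }
  list(f = f, g = g)               # predicted ratings: f %*% g
}
```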

## 2.3 Interpreting factor models

• The latent, inferred dimensions of the $$f_{ir}$$ and $$g_{rj}$$ values are the factors
• To be concrete, think of movies
• Each movie loads on to each factor
• E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
• Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask for
• Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
• Each user also has a score on each factor
• E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
• Ratings are inner products plus noise
• Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings

## 2.4 Social recommendations

• We persuade/trick the users to give us a social network
• $$a_{ij} = 1$$ if user $$i$$ follows user $$j$$ (and $$0$$ otherwise)
• Presume that people are similar to those they follow
• So estimate: $\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}}$

Exercise: What’s $$\widehat{x_{ik}}$$ in terms of the neighbors?

• Refinements:
• Some information from neighbors’ neighbors, etc.
• Some information from neighbors’ ratings of similar items
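A minimal R sketch of the basic network prediction (which also works out the exercise); the follower matrix `a` and ratings matrix `x` are hypothetical names:

```r
# Social recommendation sketch: predict user i's rating of item k from the
# ratings of the users i follows.
predict.social <- function(x, a, i, k) {
  friends <- which(a[i, ] == 1)            # the users i follows
  rated <- friends[!is.na(x[friends, k])]  # ...who have rated item k
  mean(x[rated, k])                        # minimizes the sum of squares above
}
```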

## 2.5 Combining approaches

• Nothing says you have to pick just one method!
• Fit multiple models and predict a weighted average of the models
• E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
• Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
• E.g., use factor model as a base and then kNN on its residuals
• Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
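A minimal sketch of that layered recipe in R, reusing the hypothetical `factorize()` sketch from above; `x` is the ratings matrix:

```r
# Residual stacking sketch: average ratings as the base model, then a factor
# model fit to whatever the averages get wrong.
base <- matrix(colMeans(x, na.rm = TRUE),       # each item's average rating...
               nrow(x), ncol(x), byrow = TRUE)  # ...replicated for every user
fit <- factorize(x - base, q = 10)              # factor model on the residuals
x.hat <- base + fit$f %*% fit$g                 # combined prediction
```

A kNN stage fit to the factor model’s residuals would be added the same way.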

## 2.6 Some obstacles to all approaches

• The “cold start” problem: what to do for new users/items?
• New users: global averages, or social averaging if that’s available
• Maybe present them with items with high information first?
• New items: content-based predictions, or hope that everyone isn’t relying on your system completely
• Missing values are informative
• Tastes change

### 2.6.1 Missing values are information

• Both factorization and NNs can handle missing values
• Factorization: We saw how to do this
• NNs: only use the variables whose values are present for the user we want to make predictions for
• BUT not rating something is informative
• You may not have heard of it…
• … or it may be the kind of thing you don’t like
• I rate mystery novels, not Christian parenting guides or how-to books on accounting software
• Often substantial improvements from explicitly modeling missingness (Marlin et al. 2007)

### 2.6.2 Tastes change

• Down-weight old data
• Easy but abrupt: don’t care about ratings more than, say, 100 days old
• Or: only use the last 100 ratings
• Or: make weights on items a gradually-decaying function of age
• Could also try to explicitly model change in tastes, but that adds to the computational burden
• One simple approach for factor models: $$\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)$$
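A one-line R sketch of that update rule; `f.old` and `f.new` are hypothetical $$[n \times q]$$ score matrices:

```r
# Exponential forgetting: blend the old factor scores with the estimate from
# the newest ratings; alpha near 1 means we believe tastes change slowly.
update.scores <- function(f.old, f.new, alpha = 0.9) {
  alpha * f.old + (1 - alpha) * f.new
}
```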

# 3 Maximization

• Once you have predicted ratings, pick the highest-predicted ratings
• Finding the maximum of $$p$$ items takes $$O(p)$$ time in the worst case, so it helps if you can cut this down
• Sorting
• Early stopping if it looks like the predicted rating will be low
• We’ve noted some tricks for only predicting ratings for items likely to be good
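For definiteness, a minimal R sketch of the maximization step itself; `pred` is a hypothetical vector of predicted ratings over the candidate items:

```r
# Recommend the n items with the highest predicted ratings. A full sort is
# O(p log p); for large p, partial selection or the filtering tricks above
# keep this cheap.
top.n <- function(pred, n = 10) {
  order(pred, decreasing = TRUE)[1:n]
}
```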

# 4 Summing up

• Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
• Basically all prediction methods assume $$x_{ik}$$ can be estimated from $$x_{jl}$$ when $$j$$ and/or $$l$$ are similar to $$i$$ and/or $$k$$
• More or less elaborate models
• Different notions of similarity
• Everyone wants to restrict the computational burden that comes with large $$n$$ and $$p$$

# 5 Backup: Factor models in data mining

## 5.1 Factor models take off from PCA

• Start with $$n$$ items in a data base ($$n$$ big)
• Represent items as $$p$$-dimensional vectors of features ($$p$$ too big for comfort), data is now $$\X$$, dimension $$[n \times p]$$
• Principal components analysis:
• Find the best $$q$$-dimensional linear approximation to the data
• Equivalent to finding $$q$$ directions of maximum variance through the data
• Equivalent to finding top $$q$$ eigenvalues and eigenvectors of $$\frac{1}{n}\X^T \X =$$ sample variance matrix of the data
• New features = PC scores = projections on to the eigenvectors
• Variances along those directions = eigenvalues
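A minimal R sketch of PCA as just described; the data matrix `x` ($$[n \times p]$$) and the number of components `q` are hypothetical:

```r
# PCA via eigendecomposition of the sample variance matrix.
x.c <- scale(x, center = TRUE, scale = FALSE)  # center each feature
v <- crossprod(x.c) / nrow(x.c)                # (1/n) X^T X
eig <- eigen(v, symmetric = TRUE)              # eigenvalues come sorted
w.q <- eig$vectors[, 1:q, drop = FALSE]        # top q directions of variance
s.q <- x.c %*% w.q                             # PC scores = projections
```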

## 5.2 PCA is not a model

• PCA says nothing about what the data should look like
• PCA makes no predictions about new data (or old data!)
• PCA just finds a linear approximation to these data
• What would be a PCA-like model?

## 5.3 This is where factor analysis comes in

Remember PCA: $\S = \X \w$ and $\X = \S \w^T$

(because $$\w^T = \w^{-1}$$)

If we use only $$q$$ PCs, then $\S_q = \X \w_q$ but $\X \neq \S_q \w_q^T$

• Usual approach in statistics when the equations don’t hold: the error is random noise

## 5.4 The factor model

• $$\vec{X}$$ is $$p$$-dimensional, manifest, unhidden or observable

• $$\vec{F}$$ is $$q$$-dimensional, $$q < p$$ but latent or hidden or unobserved

• The model: $\begin{eqnarray*} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ (\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise}) \end{eqnarray*}$

• $$\FactorLoadings =$$ a $$[p \times q]$$ matrix of factor loadings
• Analogous to $$\w_q$$ in PCA but without the orthonormal restrictions (some people also write $$\w$$ for the loadings)
• Analogous to $$\beta$$ in a linear regression
• Assumption: $$\vec{\epsilon}$$ is uncorrelated with $$\vec{F}$$ and has $$\Expect{\vec{\epsilon}} = 0$$
• $$p$$-dimensional vector (unlike the scalar noise in linear regression)
• Assumption: $$\Var{\vec{\epsilon}} \equiv \Uniquenesses$$ is diagonal (i.e., no correlation across dimensions of the noise)
• Sometimes called the uniquenesses or the unique variance components
• Analogous to $$\sigma^2$$ in a linear regression
• Some people write it $$\mathbf{\Sigma}$$, others use that for $$\Var{\vec{X}}$$
• Means: all correlation between observables comes from the factors
• Not really an assumption: $$\Var{\vec{F}} = \mathbf{I}$$
• Not an assumption because we could always de-correlate, as in homework 2
• Assumption: $$\vec{\epsilon}$$ is uncorrelated across units
• As we assume in linear regression…

### 5.4.1 Summary of the factor model assumptions

$\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\ \Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\ \Expect{\vec{\epsilon}} & = & \vec{0}\\ \Var{\vec{F}} & = & \mathbf{I} \end{eqnarray}$

### 5.4.2 Some consequences of the assumptions

$\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}} \end{eqnarray}$
• Typically: center all variables so we can take $$\Expect{\vec{F}} = 0$$
$\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\ & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses \end{eqnarray}$
• $$\FactorLoadings$$ is $$p\times q$$ so this is low-rank-plus-diagonal
• or low-rank-plus-noise
• Contrast with PCA: that approximates the variance matrix as purely low-rank
$\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\ \Cov{X_i, X_j} & = & \text{what?} \end{eqnarray}$

## 5.5 Geometry

• As $$\vec{F}$$ varies over $$q$$ dimensions, $$\w \vec{F}$$ sweeps out a $$q$$-dimensional subspace in $$p$$-dimensional space

• Then $$\vec{\epsilon}$$ perturbs out of this subspace

• If $$\Var{\vec{\epsilon}} = \mathbf{0}$$ then we’d be exactly in the $$q$$-dimensional space, and we’d expect correspondence between factors and principal components
• (Modulo the rotation problem, to be discussed)
• If the noise isn’t zero, factors $$\neq$$ PCs
• In extremes: the largest direction of variation could come from a big entry in $$\Uniquenesses$$, not from the linear structure at all

## 5.6 How do we estimate a factor model?

$\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}$

• We can’t regress $$\vec{X}$$ on $$\vec{F}$$ because we never see $$\vec{F}$$

### 5.6.1 Suppose we knew $$\Uniquenesses$$

• we’d say $\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T \end{eqnarray}$
• LHS is $$\Var{\FactorLoadings\vec{F}}$$ so we know it’s symmetric and non-negative-definite
• $$\therefore$$ We can eigendecompose LHS as $\begin{eqnarray} \Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\ & = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T \end{eqnarray}$
• $$\mathbf{\lambda} =$$ diagonal matrix of eigenvalues, only $$q$$ of which are non-zero
• Set $$\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}$$ and everything’s consistent

### 5.6.2 Suppose we knew $$\FactorLoadings$$

then we’d say $\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses \end{eqnarray}$

### 5.6.3 “One person’s vicious circle is another’s iterative approximation”:

• Start with a guess about $$\Uniquenesses$$
• Suitable guess: regress each observable on the others, residual variance is $$\Uniquenesses_{ii}$$
• Until the estimates converge:
• Use $$\Uniquenesses$$ to find $$\FactorLoadings$$ (by eigen-magic)
• Use $$\FactorLoadings$$ to find $$\Uniquenesses$$ (by subtraction)
• Once we have the loadings (and uniquenesses), we can estimate the scores
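A minimal R sketch of this iteration, assuming `v` is the (invertible) sample variance matrix of the centered observables; all names are hypothetical:

```r
# Iterative estimation of the loadings (gamma) and uniquenesses (psi).
estimate.factors <- function(v, q, n.iters = 100) {
  # Initial guess: residual variance from regressing each observable on the
  # others, which equals 1 over the diagonal of the precision matrix.
  psi <- 1 / diag(solve(v))
  for (it in 1:n.iters) {
    # Eigen-magic: loadings from the top q eigenvectors of v - psi
    eig <- eigen(v - diag(psi), symmetric = TRUE)
    gamma <- eig$vectors[, 1:q, drop = FALSE] %*%
      diag(sqrt(pmax(eig$values[1:q], 0)), q)
    # Subtraction: uniquenesses are the variance the loadings don't explain
    psi <- diag(v - gamma %*% t(gamma))
  }
  list(loadings = gamma, uniquenesses = psi)
}
```

(A real implementation would test for convergence rather than run a fixed number of iterations.)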

## 5.7 Estimating factor scores

• PC scores were just projection
• Estimating factor scores isn’t so easy!

• Factor model: $\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}$
• It’d be convenient to estimate factor scores as $\FactorLoadings^{-1} \vec{X}$ but $$\FactorLoadings^{-1}$$ doesn’t exist!

• Typical approach: optimal linear estimator
• We know (from 401) that the optimal linear estimator of any $$Y$$ from any $$\vec{Z}$$ is $\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}$
• (ignoring the intercept because everything’s centered)
• i.e., column vector of optimal coefficients is $$\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}$$
• Here $\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings$ and $\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses$ so the optimal linear factor score estimates are $\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}$
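In R, continuing the hypothetical sketch above (`x` the centered data, `gamma` the loadings, `psi` the uniquenesses):

```r
# Optimal linear factor scores: Gamma^T (Gamma Gamma^T + Psi)^{-1} X,
# applied to each row of the data matrix.
estimate.scores <- function(x, gamma, psi) {
  x %*% solve(gamma %*% t(gamma) + diag(psi)) %*% gamma  # [n x q] scores
}
```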

## 5.8 Example: Back to the dresses from HW 7

• Fit a one-factor model:
```
##       Length Class  Mode
## Gamma 14400  -none- numeric
## Z       205  -none- numeric
## Sigma 14400  -none- numeric
```
• Positive and negative images along that factor (images not reproduced here)