\[
\DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia!
\]
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
Recommender systems
The basic idea
“You may also like”, “Customers also bought”, feeds in social media, …
Generically, two stages:
- Predict some outcome for user / item interactions
- Ratings (a la Netflix)
- Purchases
- Clicks
- “Engagement”
- Maximize the prediction
- Don’t bother telling people what they won’t like
- (Usually)
- Subtle issues with prediction vs. action which we’ll get to next time
Very simple (dumb) baselines
- The best-seller / most-popular list
- Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
- We’ve been doing this for at least 100 years
- Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
- Co-purchases, association lists
- Not much user modeling
- Problems of really common items
- (For a while, Amazon recommended Harry Potter books to everyone after everything)
- Also problems for really rare items
- (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
- (You can imagine your own privacy-destroying nightmare here)
Common approaches: nearest neighbors, matrix factorization, social recommendation
- Nearest neighbors
- PCA-like dimension reduction, matrix factorization
- Social recommendation: what did your friends like?
Nearest neighbors
Content-based nearest neighbors
- Represent each item as a \(p\)-dimensional feature vector
- Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
- Take the items user \(i\) has liked
- Treat the user as a vector:
- Find the average item vector for user \(i\)
- What are the items closest to that average? (see the sketch below)
- Refinements:
- Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
- Use dis-likes to filter
- Do a more general regression of ratings on features
- Drawback: need features on the items which track what users actually care about
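A minimal sketch of the recipe above, assuming a numeric matrix `item.features` (one row per item) and a vector `liked` of row indices of the items user \(i\) has liked; the function and argument names are made up for illustration:

```r
# Content-based recommendation: average the feature vectors of the liked items
# to get a "profile" for the user, then rank items by cosine similarity to it
recommend.by.content <- function(item.features, liked, n.recs = 10) {
  profile <- colMeans(item.features[liked, , drop = FALSE])   # the user's average item
  sims <- as.vector(item.features %*% profile) /
    (sqrt(rowSums(item.features^2)) * sqrt(sum(profile^2)))   # cosine similarities
  sims[liked] <- -Inf                                         # don't re-recommend liked items
  order(sims, decreasing = TRUE)[1:n.recs]                    # indices of the closest items
}
```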
Item-based nearest neighbors
- Items are features
- For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\)
- With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
- Use all the tricks for finding nearest neighbors quickly
- Only make predictions for items highly similar to items \(i\) has already rated
- Items are similar when they get similar ratings from different users (i.e., users are features for items)
- Or even: only make predictions for items highly similar to items \(i\) has already liked
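As a rough sketch (not anyone’s production system): with a partially-observed ratings matrix `x` (users in rows, items in columns, `NA` for missing), item-to-item similarity can be measured by correlating columns over the users who rated both items:

```r
# Item-based kNN prediction: items are similar when different users rate them
# similarly, so predict x[i,k] from user i's ratings of the items most like k
predict.item.knn <- function(x, i, k, n.neighbors = 20) {
  sims <- as.vector(cor(x, x[, k], use = "pairwise.complete.obs"))  # similarity of each item to k
  sims[k] <- NA                                                     # exclude item k itself
  rated <- which(!is.na(x[i, ]) & !is.na(sims))                     # items i has rated, with usable similarity
  nbrs <- head(rated[order(sims[rated], decreasing = TRUE)], n.neighbors)
  weighted.mean(x[i, nbrs], w = abs(sims[nbrs]))                    # similarity-weighted average rating
}
```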
Dimension reduction
- Again, items are features
- Fix a number of (latent) factors \(q\)
- Minimize \[
\sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2}
\]
- \(r\) runs along the latent dimensions/factors
- \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
- \(g_{rk}\) is how much item \(k\) loads on factor \(r\)
- Could tweak this to let each item have its own variance
- Matrix factorization because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
- Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\) (see the sketch below)
- See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
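A bare-bones sketch of the alternating gradient descent, with a fixed step size and none of the tricks (regularization, clever initialization, stopping rules) a real system would use; all names are illustrative:

```r
# Matrix factorization by alternating gradient steps, fitting only the
# observed entries of the ratings matrix x (NA = unobserved)
factorize.ratings <- function(x, q, step = 0.001, n.iter = 200) {
  n <- nrow(x); p <- ncol(x)
  f <- matrix(rnorm(n * q, sd = 0.1), n, q)   # user scores on the q factors
  g <- matrix(rnorm(q * p, sd = 0.1), q, p)   # item loadings on the q factors
  obs <- !is.na(x)
  for (it in 1:n.iter) {
    resid <- x - f %*% g
    resid[!obs] <- 0                          # unobserved entries contribute nothing
    f <- f + step * resid %*% t(g)            # gradient step in f, holding g fixed
    resid <- x - f %*% g
    resid[!obs] <- 0
    g <- g + step * t(f) %*% resid            # gradient step in g, holding f fixed
  }
  list(f = f, g = g)
}
```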
Interpreting factor models
- The latent, inferred dimensions that the \(f_{ir}\) and \(g_{rk}\) values run along are the factors
- To be concrete, think of movies
- Each movie loads on to each factor
- E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
- Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask for
- Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
- Each user also has a score on each factor
- E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
- Ratings are inner products plus noise
- Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
Social recommendations
- We persuade/trick the users to give us a social network
- \(a_{ij} =\) user \(i\) follows user \(j\)
- Presume that people are similar to those they follow
- So estimate: \[
\widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}}
\]
Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?
- Refinements:
- Some information from neighbors’ neighbors, etc.
- Some information from neighbors’ ratings of similar items
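A sketch of the social prediction, with follow matrix `a` (so `a[i, j]` is \(a_{ij}\), and `a[i, i]` assumed 0) and ratings matrix `x` as before; it finds the minimizer numerically, so it doesn’t give away the closed-form answer to the exercise:

```r
# Social recommendation: predict x[i,k] by minimizing the a_ij-weighted squared
# error against the ratings of the users i follows (the closed form is the exercise)
predict.social <- function(a, x, i, k) {
  followed <- which(a[i, ] == 1 & !is.na(x[, k]))    # followed users who rated item k
  objective <- function(m) sum((m - x[followed, k])^2)
  optimize(objective, interval = range(x[, k], na.rm = TRUE))$minimum
}
```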
Combining approaches
- Nothing says you have to pick just one method!
- Fit multiple models and predict a weighted average of the models
- E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
- Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
- E.g., use factor model as a base and then kNN on its residuals
- Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
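A sketch of the weighted-average combination; `pred.nn`, `pred.factor`, and `pred.social` in the usage comment are hypothetical prediction matrices from already-fitted models, all the same shape as the ratings matrix:

```r
# Blend predictions from several models with fixed weights that sum to 1
blend.predictions <- function(pred.list, weights) {
  stopifnot(length(pred.list) == length(weights),
            isTRUE(all.equal(sum(weights), 1)))
  combined <- 0
  for (m in seq_along(pred.list)) {
    combined <- combined + weights[m] * pred.list[[m]]   # weighted sum, model by model
  }
  combined
}
# e.g., the 50/25/25 mix from above:
# blend.predictions(list(pred.nn, pred.factor, pred.social), c(0.5, 0.25, 0.25))
```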
Some obstacles to all approaches
- The “cold start” problem: what to do for new users/items?
- New users: global averages, or social averaging if that’s available
- Maybe present them with items with high information first?
- New items: content-based predictions, or hope that everyone isn’t relying on your system completely
- Missing values are informative
- Tastes change
Tastes change
- Down-weight old data
- Easy but abrupt: don’t care about ratings more than, say, 100 days old
- Or: only use the last 100 ratings
- Or: make weights on items a gradually-decaying function of age
- Could also try to explicitly model change in tastes, but that adds to the computational burden
- One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
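Two tiny sketches of these ideas, one for age-decaying weights on ratings and one for the \(\alpha\)-blend of factor scores; the half-life value is just an example:

```r
# Gradually down-weight old ratings: a rating loses half its weight
# every `halflife` days
rating.weights <- function(age.in.days, halflife = 100) {
  0.5^(age.in.days / halflife)
}
# Track drifting tastes: blend the old factor-score vector with the new
# estimate, so old behavior decays geometrically
update.user.scores <- function(old.scores, new.estimate, alpha = 0.9) {
  alpha * old.scores + (1 - alpha) * new.estimate
}
```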
Maximization
- Once you have predicted ratings, pick the highest-predicted ratings
- Finding the maximum of \(p\) items takes \(O(p)\) time in the worst case, so it helps if you can cut this down
- Sorting
- Early stopping if it looks like the predicted rating will be low
- We’ve noted some tricks for only predicting ratings for items likely to be good
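One way to avoid a full sort, sketched with R’s partial sort; `predicted` is a vector of predicted ratings for one user:

```r
# Return the n.recs items with the highest predicted ratings: a partial sort
# finds the cutoff value in (average) O(p) time, then only the handful of
# items at or above the cutoff get fully sorted
top.recommendations <- function(predicted, n.recs = 10) {
  cutoff <- -sort(-predicted, partial = n.recs)[n.recs]    # n.recs-th largest prediction
  candidates <- which(predicted >= cutoff)                 # small set (bigger only with ties)
  candidates[order(predicted[candidates], decreasing = TRUE)][1:n.recs]
}
```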
Summing up
- Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
- Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
- More or less elaborate models
- Different notions of similarity
- Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)
Backup: Factor models in data mining
Factor models take off from PCA
- Start with \(n\) items in a data base (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding the top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the (centered) data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues
PCA is not a model
- PCA says nothing about what the data should look like
- PCA makes no predictions about new data (or old data!)
- PCA just finds a linear approximation to these data
- What would be a PCA-like model?
This is where factor analysis comes in
Remember PCA: \[
\S = \X \w
\] and \[
\X = \S \w^T
\]
(because \(\w^T = \w^{-1}\))
If we use only \(q\) PCs, then \[
\S_q = \X \w_q
\] but \[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when the equations don’t hold: the error is random noise
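A quick numerical check of these relations, using the built-in `USArrests` data; this is just an illustration of the algebra, not part of any recommender:

```r
# With all p components, X = S w^T exactly; with only q < p, it doesn't hold
x <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)  # centered data
pca <- prcomp(x)
w <- pca$rotation                        # columns = eigenvectors / PC directions
s <- x %*% w                             # scores: S = X w (same as pca$x)
max(abs(x - s %*% t(w)))                 # effectively 0: exact reconstruction
q <- 2
max(abs(x - s[, 1:q] %*% t(w[, 1:q])))   # not 0: the rank-q approximation has errors
```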
The factor model
\(\vec{X}\) is \(p\)-dimensional, manifest, unhidden or observable
\(\vec{F}\) is \(q\)-dimensional, \(q < p\) but latent or hidden or unobserved
The model: \[\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}\]
- \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
- Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
- Analogous to \(\beta\) in a linear regression
- Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
- \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
- Sometimes called the uniquenesses or the unique variance components
- Analogous to \(\sigma^2\) in a linear regression
- Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
- Means: all correlation between observables comes from the factors
- Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
- Not an assumption because we could always de-correlate, as in homework 2
- Assumption: \(\vec{\epsilon}\) is uncorrelated across units
- As we assume in linear regression…
Summary of the factor model assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}\]
Some consequences of the assumptions
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}\]
- Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}\]
- \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
- or low-rank-plus-noise
- Contrast with PCA: that approximates the variance matrix as purely low-rank
\[\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}\]
Geometry
As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space
Then \(\vec{\epsilon}\) perturbs out of this subspace
- If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
- (Modulo the rotation problem, to be discussed)
- If the noise isn’t zero, factors \(\neq\) PCs
- In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all

How do we estimate a factor model?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)
Suppose we knew \(\Uniquenesses\)
- we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent
Suppose we knew \(\FactorLoadings\)
then we’d say \[\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}\]
“One person’s vicious circle is another’s iterative approximation”:
- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)
- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores
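A small-scale sketch of this iteration, assuming a fully-observed, centered data matrix `x` with many more rows than columns (so the initial regressions make sense); real implementations, like R’s `factanal()`, fit the model by maximum likelihood instead:

```r
# Iterative estimation: guess the uniquenesses, get loadings by eigendecomposing
# Var(X) - Psi, update the uniquenesses by subtraction, repeat
estimate.factor.model <- function(x, q, n.iter = 100) {
  v <- var(x)                                   # sample variance matrix
  p <- ncol(x)
  psi <- sapply(1:p, function(j) {              # initial guess: residual variance of
    summary(lm(x[, j] ~ x[, -j]))$sigma^2       # each variable regressed on the others
  })
  for (it in 1:n.iter) {
    eig <- eigen(v - diag(psi), symmetric = TRUE)
    lambda <- pmax(eig$values[1:q], 0)          # top q eigenvalues (clipped at 0)
    loadings <- eig$vectors[, 1:q, drop = FALSE] %*% diag(sqrt(lambda), q)
    psi <- pmax(diag(v - loadings %*% t(loadings)), 1e-6)  # update by subtraction
  }
  list(loadings = loadings, uniquenesses = psi)
}
```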
Estimating factor scores
- PC scores were just projection
Estimating factor scores isn’t so easy!
- Factor model: \[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
It’d be convenient to estimate factor scores as \[
\FactorLoadings^{-1} \vec{X}
\] but \(\FactorLoadings^{-1}\) doesn’t exist!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
Here \[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\] and \[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\] so the optimal linear factor score estimates are \[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
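In code, using estimated loadings and uniquenesses (for instance from the sketch in the previous section), with the rows of `x` already centered:

```r
# Optimal linear factor scores: Gamma^T (Gamma Gamma^T + Psi)^{-1} x,
# applied to every row of the (centered) data matrix at once
estimate.scores <- function(x, loadings, psi) {
  # loadings: p x q matrix; psi: length-p vector of uniquenesses
  weights <- t(loadings) %*% solve(loadings %*% t(loadings) + diag(psi))  # q x p
  x %*% t(weights)                                                        # n x q matrix of scores
}
```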
Example: Back to the dresses from HW 7
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
- Positive and negative images along that factor:
