\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \]

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \newcommand{\FactorLoadings}{\mathbf{\Gamma}} \newcommand{\Uniquenesses}{\mathbf{\psi}} \]

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

- *Predict* some outcome for user / item interactions
  - Ratings (a la Netflix)
  - Purchases
  - Clicks
  - “Engagement”
- *Maximize* the prediction
  - Don’t bother telling people what they won’t like
  - (Usually)

- Subtle issues with prediction vs. action which we’ll get to next time

- The best-seller / most-popular list
- Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
- We’ve been doing this for at least 100 years
- An interesting exercise: go to the historical best-seller lists (at, e.g., <https://lithub.com/here-are-the-biggest-fiction-bestsellers-of-the-last-100-years/>) and see how long it takes to find a book or even an author whose name you recognize (whether or not you’ve read it)

- Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)

- Co-purchases, association lists
- Not much user modeling
- Problems of really common items
- (For a while, Amazon recommended Harry Potter books to *everyone* after *everything*)
- Also problems for really rare items
- (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
- (You can imagine your own privacy-destroying nightmare here)

- Once you have predicted ratings, pick the highest-predicted ratings
- Finding the maximum of \(p\) items takes \(O(p)\) time in the worst case, so it helps if you can cut this down
- Sorting
- Early stopping if it looks like the predicted rating will be low

- We’ve noted some tricks for only predicting ratings for items likely to be good
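The maximization step above can be sketched as a heap-based top-\(k\) scan, which avoids fully sorting all \(p\) predictions (a minimal sketch; the item IDs and ratings here are made up):

```python
import heapq
import random

random.seed(12)
# Hypothetical predicted ratings for p items: item id -> predicted rating
predicted = {item: random.random() for item in range(10000)}

# A heap-based scan finds the k best in O(p log k) time, in one pass,
# instead of fully sorting all p predictions in O(p log p)
k = 5
top_k = heapq.nlargest(k, predicted, key=predicted.get)

# Same answer as sorting everything, cheaper when k << p
by_sort = sorted(predicted, key=predicted.get, reverse=True)[:k]
assert top_k == by_sort
```

Early stopping (abandoning an item once its predicted rating is clearly below the current \(k\)-th best) fits naturally into the same one-pass scan.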

- Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
- Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
- More or less elaborate models
- Different notions of similarity

- Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)

- Start with \(n\) items in a database (\(n\) big)
- Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
- Principal components analysis:
- Find the best \(q\)-dimensional linear approximation to the data
- Equivalent to finding \(q\) directions of maximum variance through the data
- Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
- New features = PC scores = projections on to the eigenvectors
- Variances along those directions = eigenvalues

- PCA says *nothing* about what the data *should* look like
- PCA makes *no* predictions about new data (or old data!)
- PCA *just* finds a linear approximation to *these* data
- What would be a PCA-like model?

Remember PCA: \[ \S = \X \w \] and \[ \X = \S \w^T \]

(because \(\w^T = \w^{-1}\))

If we use only \(q\) PCs, then \[ \S_q = \X \w_q \] but \[ \X \neq \S_q \w_q^T \]
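A small numeric check of these identities, under the usual conventions (the columns of \(\w\) are the orthonormal eigenvectors of the sample variance matrix; the data here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 200, 5, 2

# Centered toy data; the per-coordinate scales are illustrative
X = rng.standard_normal((n, p)) * np.array([3.0, 2.0, 0.5, 0.3, 0.1])
X = X - X.mean(axis=0)

# w = eigenvectors of the sample variance matrix (1/n) X^T X,
# reordered so the largest eigenvalues (variances) come first
evals, w = np.linalg.eigh(X.T @ X / n)
order = np.argsort(evals)[::-1]
evals, w = evals[order], w[:, order]

S = X @ w                        # scores: S = X w
assert np.allclose(X, S @ w.T)   # exact with all p PCs, since w^T = w^{-1}

w_q = w[:, :q]                   # keep only the top q PCs
S_q = X @ w_q
assert not np.allclose(S_q @ w_q.T, X)   # rank-q reconstruction != X
```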

- Usual approach in statistics when the equations don’t hold: the error is random noise

\(\vec{X}\) is \(p\)-dimensional, **manifest**, **unhidden** or **observable**

\(\vec{F}\) is \(q\)-dimensional, \(q < p\), but **latent** or **hidden** or **unobserved**

The model: \[\begin{eqnarray*} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ (\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise}) \end{eqnarray*}\]

- \(\FactorLoadings =\) a \([p \times q]\) matrix of **factor loadings**
  - Analogous to \(\w_q\) in PCA, but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
  - Analogous to \(\beta\) in a linear regression
- **Assumption**: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
  - \(p\)-dimensional vector (unlike the scalar noise in linear regression)
- **Assumption**: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
  - Sometimes called the **uniquenesses** or the **unique variance components**
  - Analogous to \(\sigma^2\) in a linear regression
  - Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
  - Means: all correlation between observables comes from the factors
- **Not really** an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
  - Not an assumption, because we could always de-correlate, as in homework 2
- **Assumption**: \(\vec{\epsilon}\) is uncorrelated across units
  - As we assume in linear regression…
- Typically: center all variables, so we can take \(\Expect{\vec{F}} = 0\)

- \(\FactorLoadings\) is \(p \times q\), so this is **low-rank-plus-diagonal**
  - or **low-rank-plus-noise**
- Contrast with PCA: that approximates the variance matrix as *purely* low-rank

As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space

Then \(\vec{\epsilon}\) perturbs out of this subspace

- If \(\Var{\vec{\epsilon}} = \mathbf{0}\), then we’d be *exactly* in the \(q\)-dimensional subspace, and we’d expect correspondence between factors and principal components
  - (Modulo the rotation problem, to be discussed)

- If the noise *isn’t* zero, factors \(\neq\) PCs
  - In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all
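A tiny worked example of that extreme case, with made-up numbers: one factor loads equally on the first three observables, while the fourth observable has a huge uniqueness.

```python
import numpy as np

# One factor loading equally on the first three observables...
gamma = np.array([[0.5], [0.5], [0.5], [0.0]])
# ...but a huge uniqueness (idiosyncratic variance) on the fourth
Psi = np.diag([0.1, 0.1, 0.1, 25.0])
V = gamma @ gamma.T + Psi

evals, evecs = np.linalg.eigh(V)
top_pc = evecs[:, np.argmax(evals)]

# The leading eigenvalue (25) belongs to the noisy 4th coordinate,
# so the top PC is (0, 0, 0, +/-1): nothing like the factor direction
assert np.isclose(abs(top_pc[3]), 1.0)
```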

\[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]

- We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)

- If we knew \(\Uniquenesses\), we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T \end{eqnarray}\]
- LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
- \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}\]
- \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero

- Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent

If we knew \(\FactorLoadings\) instead, then we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses \end{eqnarray}\]

- Start with a guess about \(\Uniquenesses\)
- Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)

- Until the estimates converge:
- Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
- Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
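A minimal sketch of this alternation (the function name, starting guess, and convergence details are my additions, not from the notes):

```python
import numpy as np

def factor_loadings(V, q, tol=1e-8, max_iter=1000):
    """Alternate between the two half-steps above:
    given Psi, eigendecompose V - Psi to get Gamma = v lambda^{1/2};
    given Gamma, recover Psi = diag(V - Gamma Gamma^T)."""
    # Starting guess: residual variance of each observable regressed
    # on all the others, which equals 1 / (V^{-1})_{ii}
    psi = 1.0 / np.diag(np.linalg.inv(V))
    for _ in range(max_iter):
        evals, evecs = np.linalg.eigh(V - np.diag(psi))
        top = np.argsort(evals)[::-1][:q]
        # Gamma = v lambda^{1/2}, using the top q eigenpairs
        gamma = evecs[:, top] * np.sqrt(np.clip(evals[top], 0.0, None))
        psi_new = np.diag(V) - np.sum(gamma**2, axis=1)
        if np.max(np.abs(psi_new - psi)) < tol:
            psi = psi_new
            break
        psi = psi_new
    return gamma, psi

# Check on a variance matrix that is exactly one-factor-plus-diagonal
true_gamma = np.array([[0.9], [0.8], [0.7], [0.6]])
true_psi = np.array([0.19, 0.36, 0.51, 0.64])
V = true_gamma @ true_gamma.T + np.diag(true_psi)
gamma, psi = factor_loadings(V, q=1)
assert np.allclose(gamma @ gamma.T + np.diag(psi), V, atol=1e-3)
```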

- Once we have the loadings (and uniquenesses), we can estimate the scores

- PC scores were just projections
- Estimating factor scores isn’t so easy!

- Factor model: \[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]
- It’d be convenient to estimate factor scores as \(\FactorLoadings^{-1} \vec{X}\), but \(\FactorLoadings^{-1}\) doesn’t exist (\(\FactorLoadings\) isn’t even square)!

- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
- (ignoring the intercept because everything’s centered)
- i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)

Here \[ \Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings \] and \[ \Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses \] so the optimal linear factor score estimates are \[ \FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X} \]
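A sketch of computing these scores, assuming we already have (estimated) loadings and uniquenesses; the simulated data and all names here are illustrative:

```python
import numpy as np

def factor_scores(X, gamma, psi):
    """Optimal linear estimator Gamma^T (Gamma Gamma^T + Psi)^{-1} x,
    applied to each (centered) row x of X."""
    V = gamma @ gamma.T + np.diag(psi)
    B = np.linalg.solve(V, gamma)   # (p x q) matrix V^{-1} Gamma
    return X @ B                    # row i becomes Gamma^T V^{-1} x_i

# Illustrative one-factor data simulated from the model itself
rng = np.random.default_rng(1)
gamma = np.array([[0.9], [0.8], [0.7]])
psi = np.array([0.19, 0.36, 0.51])
F = rng.standard_normal((500, 1))
X = F @ gamma.T + rng.standard_normal((500, 3)) * np.sqrt(psi)

F_hat = factor_scores(X, gamma, psi)
# The estimated scores track the true (here, known) factor closely
assert np.corrcoef(F.ravel(), F_hat.ravel())[0, 1] > 0.8
```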

- Fit a one-factor model:

```
## Length Class Mode
## Gamma 14400 -none- numeric
## Z 205 -none- numeric
## Sigma 14400 -none- numeric
```

- Positive and negative images along that factor:

## 2.4 Social recommendations

Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?