\[ \DeclareMathOperator*{\argmin}{argmin} % thanks, wikipedia! \]

\[ \newcommand{\X}{\mathbf{x}} \newcommand{\w}{\mathbf{w}} \newcommand{\V}{\mathbf{v}} \newcommand{\S}{\mathbf{s}} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \DeclareMathOperator{\tr}{tr} \newcommand{\FactorLoadings}{\mathbf{\Gamma}} \newcommand{\Uniquenesses}{\mathbf{\psi}} \]

1 Recommender systems

1.1 The basic idea

“You may also like”, “Customers also bought”, feeds in social media, …

Generically, two stages:

  • Predict some outcome for user / item interactions
    • Ratings (a la Netflix)
    • Purchases
    • Clicks
    • “Engagement”
  • Maximize the prediction
    • Don’t bother telling people what they won’t like
    • (Usually)
  • Subtle issues with prediction vs. action which we’ll get to next time

1.2 Very simple (dumb) baselines

  • The best-seller / most-popular list
    • Prediction is implicit: everyone’s pretty much like everyone else, so use average ratings
    • We’ve been doing this for at least 100 years
    • Good experimental evidence that it really does alter what (some) people do (Salganik, Dodds, and Watts 2006; Salganik and Watts 2008)
  • Co-purchases, association lists
    • Not much user modeling
    • Problems with really common items
      • (For a while, Amazon recommended Harry Potter books to everyone after everything)
    • Also problems for really rare items
      • (For a while, I was almost certainly the only person to have bought a certain math book on Amazon)
      • (You can imagine your own privacy-destroying nightmare here)

2 Common approaches: nearest neighbors, matrix factorization, social recommendation

  • Nearest neighbors
    • Content-based
    • Item-based
  • PCA-like dimension reduction, matrix factorization
  • Social recommendation: what did your friends like?

2.1 Nearest neighbors

2.1.1 Content-based nearest neighbors

  • Represent each item as a \(p\)-dimensional feature vector
    • Appropriate features will be different for music, video, garden tools, text (even different kinds of text)…
  • Take the items user \(i\) has liked
  • Treat the user as a vector:
    • Find the average item vector for user \(i\)
    • What are the items closest to that average?
  • Refinements:
    • Find nearest neighbors for each liked item, prioritize anything that’s a match to multiple items
    • Use dis-likes to filter
    • Do a more general regression of ratings on features
  • Drawback: need features on the items which track what users actually care about
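The steps above can be sketched in a few lines of Python; the feature vectors, item names, and “liked” list here are all made up for illustration:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# p = 3 hypothetical features per item (say: action, romance, humor)
items = {
    "A": [0.9, 0.1, 0.2],
    "B": [0.8, 0.0, 0.3],
    "C": [0.1, 0.9, 0.8],
    "D": [0.2, 0.8, 0.9],
}
liked = ["A", "B"]   # the items user i has liked

# treat the user as the average of their liked items' vectors
user = [sum(items[k][d] for k in liked) / len(liked) for d in range(3)]

# rank the remaining items by similarity to the user vector
candidates = [k for k in items if k not in liked]
ranked = sorted(candidates, key=lambda k: cosine(user, items[k]), reverse=True)
```

Cosine similarity is just one reasonable choice of closeness here; Euclidean distance to the average vector would implement the same idea.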

2.1.2 Item-based nearest neighbors

  • Items are features
  • For user \(i\) and potential item \(k\), in principle we use all other users \(j\) and all other items \(l\) to predict \(x_{ik}\)
  • With a few million users and ten thousand features, we don’t want this to be \(O(np^2)\)
    • Use all the tricks for finding nearest neighbors quickly
    • Only make predictions for items highly similar to items \(i\) has already rated
      • Items are similar when they get similar ratings from different users (i.e., users are features for items)
      • Or even: only make predictions for items highly similar to items \(i\) has already liked
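A minimal sketch of item-based prediction on a toy dense ratings matrix (all numbers hypothetical; real systems store sparse matrices and use fast neighbor search). Raw cosine similarity is used for simplicity; centering each item’s ratings first (“adjusted cosine”) is a common refinement:

```python
import math

# item -> its vector of ratings from users (u1, u2, u3): users are features
item_ratings = {
    "A": [5, 4, 1],
    "B": [4, 5, 2],
    "C": [1, 2, 5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# similarity of the candidate item C to the items user u1 has rated
sims = {l: cosine(item_ratings["C"], item_ratings[l]) for l in ["A", "B"]}
most_similar = max(sims, key=sims.get)

# predict u1's rating of C as a similarity-weighted average of their ratings
user_ratings = {"A": 5, "B": 4}   # u1's known ratings
pred = sum(sims[l] * r for l, r in user_ratings.items()) / sum(sims.values())
```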

2.2 Dimension reduction

  • Again, items are features
  • Fix a number of (latent) factors \(q\)
  • Minimize \[ \sum_{(i,k) ~ \mathrm{observed}}{\left(x_{ik} - \sum_{r=1}^{q}{f_{ir} g_{rk}}\right)^2} \]
    • \(r\) runs along the latent dimensions/factors
    • \(f_{ir}\) is how high user \(i\) scores on factor \(r\)
    • \(g_{rk}\) is how much item \(k\) loads on factor \(r\)
    • Could tweak this to let each item have its own variance
  • Matrix factorization because we’re saying \(\mathbf{x} \approx \mathbf{f} \mathbf{g}\), where \(\mathbf{x}\) is \([n\times p]\), \(\mathbf{f}\) is \([n\times q]\) and \(\mathbf{g}\) is \([q \times p]\)
  • Practical minimization: gradient descent, alternating between \(\mathbf{f}\) and \(\mathbf{g}\)
  • See backup for a lot more on factor modeling in general, and some other uses of it in data-mining in particular
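The minimization can be sketched as follows, on simulated ratings, alternating gradient steps in \(\mathbf{f}\) and \(\mathbf{g}\) over the observed entries only (the sizes, seed, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 10, 2

# simulate a rank-q "true" ratings matrix, then hide 30% of its entries
f_true = rng.normal(size=(n, q))
g_true = rng.normal(size=(q, p))
x = f_true @ g_true
observed = rng.random((n, p)) < 0.7   # boolean mask of observed (i, k) cells

f = rng.normal(scale=0.1, size=(n, q))   # small random starting guesses
g = rng.normal(scale=0.1, size=(q, p))
lr = 0.01                                # step size, chosen by hand
for _ in range(2000):
    resid = np.where(observed, x - f @ g, 0.0)  # errors on observed cells only
    f = f + lr * resid @ g.T                    # gradient step in f, g held fixed
    g = g + lr * f.T @ resid                    # gradient step in g, f held fixed

mse = np.mean((x - f @ g)[observed] ** 2)
```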

2.3 Interpreting factor models

  • The latent, inferred dimensions of the \(f_{ir}\) and \(g_{rj}\) values are the factors
  • To be concrete, think of movies
  • Each movie loads on to each factor
    • E.g., one might load highly on “stuff blowing up”, “in space”, “dark and brooding”, “family secrets”, but not at all or negatively on “romantic comedy”, “tearjerker”, “bodily humor”
    • Discovery: We don’t need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask for
    • Interpretation: The factors it finds might not have humanly-comprehensible interpretations like “stuff blowing up” or “family secrets”
  • Each user also has a score on each factor
    • E.g., I might have high scores for “stuff blowing up”, “in space” and “romantic comedy”, but negative scores for “tearjerker”, “bodily humor” and “family secrets”
  • Ratings are inner products plus noise
    • Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings

2.4 Social recommendations

  • We persuade/trick the users to give us a social network
    • \(a_{ij} = 1\) if user \(i\) follows user \(j\), and \(0\) otherwise
  • Presume that people are similar to those they follow
  • So estimate: \[ \widehat{x_{ik}} = \argmin_{m}{\sum_{j \neq i}{a_{ij}(m-x_{jk})^2}} \]

Exercise: What’s \(\widehat{x_{ik}}\) in terms of the neighbors?

  • Refinements:
    • Some information from neighbors’ neighbors, etc.
    • Some information from neighbors’ ratings of similar items
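One way to check your answer to the exercise numerically: scan candidate values of \(m\) on a grid and keep the minimizer (toy, hypothetical data):

```python
a = [1, 0, 1, 1]            # a_ij: whether user i follows each user j
x = [4.0, 1.0, 5.0, 3.0]    # x_jk: those users' ratings of item k

def objective(m):
    # the weighted sum of squares being minimized over m
    return sum(a_j * (m - x_j) ** 2 for a_j, x_j in zip(a, x))

grid = [i / 100 for i in range(0, 501)]   # candidate values of m in [0, 5]
x_hat = min(grid, key=objective)
```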

2.5 Combining approaches

  • Nothing says you have to pick just one method!
  • Fit multiple models and predict with a weighted average of their predictions
  • E.g., predictions might be 50% NN, 25% factorization, 25% network smoothing
  • Or: use one model as a base, then fit a second model to its residuals and add the residual-model’s predictions to the base model’s
    • E.g., use factor model as a base and then kNN on its residuals
    • Or: use average ratings as a base, then factor model on residuals from that, then kNN on residuals from the factor model
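A minimal sketch of the base-plus-residuals idea, with the global average as the base and a per-user mean of residuals standing in for the second-stage model (factor model, kNN, etc.); the ratings are hypothetical:

```python
# observed ratings: (user, item) -> rating (toy numbers)
ratings = {("u1", "A"): 5.0, ("u1", "B"): 4.0,
           ("u2", "A"): 3.0, ("u2", "B"): 2.0}

base = sum(ratings.values()) / len(ratings)        # stage 1: global average
resid = {k: r - base for k, r in ratings.items()}  # what the base model missed

# stage 2 (a stand-in for a factor model or kNN): per-user mean residual
users = {u for u, _ in ratings}
user_adj = {}
for u in users:
    rs = [r for (uu, _), r in resid.items() if uu == u]
    user_adj[u] = sum(rs) / len(rs)

def predict(user, item):
    # base model's prediction plus the residual model's correction
    return base + user_adj.get(user, 0.0)
```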

2.6 Some obstacles to all approaches

  • The “cold start” problem: what to do for new users/items?
    • New users: global averages, or social averaging if that’s available
      • Maybe present them with items with high information first?
    • New items: content-based predictions, or hope that everyone isn’t relying on your system completely
  • Missing values are informative
  • Tastes change

2.6.1 Missing values are informative

  • Both factorization and NNs can handle missing values
    • Factorization: We saw how to do this
    • NNs: only use the variables whose values are present for the user we want to make predictions for
  • BUT not rating something is informative
    • You may not have heard of it…
    • … or it may be the kind of thing you don’t like
      • I rate mystery novels, not Christian parenting guides or how-to books on accounting software
  • Often substantial improvements from explicitly modeling missingness (Marlin et al. 2007)

2.6.2 Tastes change

  • Down-weight old data
    • Easy but abrupt: don’t care about ratings more than, say, 100 days old
    • Or: only use the last 100 ratings
    • Or: make weights on items a gradually-decaying function of age
  • Could also try to explicitly model change in tastes, but that adds to the computational burden
    • One simple approach for factor models: \(\vec{f}_i(t+1) = \alpha \vec{f}_i(t) + (1-\alpha) (\mathrm{new\ estimate\ at\ time}\ t+1)\)
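The update above, as code (the value of \(\alpha\) and the scores are hypothetical):

```python
alpha = 0.8   # closer to 1 means tastes are assumed more stable

def update_scores(old, new, alpha=alpha):
    # blend last period's factor scores with the newest estimate
    return [alpha * o + (1 - alpha) * n for o, n in zip(old, new)]

f_old = [1.0, -0.5]       # user's factor scores at time t
f_new_est = [0.0, 0.5]    # fresh estimate at time t + 1
f_updated = update_scores(f_old, f_new_est)
```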

3 Maximization

  • Once you have predicted ratings, pick the highest-predicted ratings
    • Finding the maximum of \(p\) items takes \(O(p)\) time, so it helps if you can cut this down
      • Sorting
      • Early stopping if it looks like the predicted rating will be low
    • We’ve noted some tricks for only predicting ratings for items likely to be good
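For picking the top few predictions, Python’s standard library has a partial-selection routine: `heapq.nlargest` runs in \(O(p \log k)\) for the top \(k\), rather than the \(O(p \log p)\) of a full sort (toy scores):

```python
import heapq

# predicted ratings for the items a user hasn't seen (hypothetical)
predicted = {"A": 3.1, "B": 4.7, "C": 2.2, "D": 4.9, "E": 3.8}

# top k = 2 by predicted rating, without sorting all p predictions
top2 = heapq.nlargest(2, predicted, key=predicted.get)
```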

4 Summing up

  • Recommendation systems work by first predicting what items users will like, and then maximizing the predictions
  • Basically all prediction methods assume \(x_{ik}\) can be estimated from \(x_{jl}\) when \(j\) and/or \(l\) are similar to \(i\) and/or \(k\)
    • More or less elaborate models
    • Different notions of similarity
  • Everyone wants to restrict the computational burden that comes with large \(n\) and \(p\)

5 Backup: Factor models in data mining

5.1 Factor models take off from PCA

  • Start with \(n\) items in a database (\(n\) big)
  • Represent items as \(p\)-dimensional vectors of features (\(p\) too big for comfort), data is now \(\X\), dimension \([n \times p]\)
  • Principal components analysis:
    • Find the best \(q\)-dimensional linear approximation to the data
    • Equivalent to finding \(q\) directions of maximum variance through the data
    • Equivalent to finding top \(q\) eigenvalues and eigenvectors of \(\frac{1}{n}\X^T \X =\) sample variance matrix of the data
    • New features = PC scores = projections on to the eigenvectors
    • Variances along those directions = eigenvalues

5.2 PCA is not a model

  • PCA says nothing about what the data should look like
  • PCA makes no predictions about new data (or old data!)
  • PCA just finds a linear approximation to these data
  • What would be a PCA-like model?

5.3 This is where factor analysis comes in

Remember PCA: \[ \S = \X \w \] and \[ \X = \S \w^T \]

(because \(\w^T = \w^{-1}\))

If we use only \(q\) PCs, then \[ \S_q = \X \w_q \] but \[ \X \neq \S_q \w_q^T \]

  • Usual approach in statistics when the equations don’t hold: the error is random noise
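A numerical check of the claims above, on simulated (and centered) data: with all \(p\) PCs the reconstruction is exact, but with only \(q < p\) of them there is leftover error:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 4))
x = x - x.mean(axis=0)             # center the data first

cov = (x.T @ x) / x.shape[0]       # sample variance matrix
eigvals, w = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
w = w[:, ::-1]                     # reorder so the top PC comes first

s = x @ w                          # scores: S = X w
full = s @ w.T                     # X = S w^T, because w^T = w^{-1}

w_q = w[:, :2]                     # keep only q = 2 principal components
approx = (x @ w_q) @ w_q.T         # S_q w_q^T is only an approximation

exact_err = np.max(np.abs(x - full))
approx_err = np.max(np.abs(x - approx))
```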

5.4 The factor model

  • \(\vec{X}\) is \(p\)-dimensional, manifest, unhidden or observable

  • \(\vec{F}\) is \(q\)-dimensional, \(q < p\) but latent or hidden or unobserved

  • The model: \[\begin{eqnarray*} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ (\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise}) \end{eqnarray*}\]

  • \(\FactorLoadings =\) a \([p \times q]\) matrix of factor loadings
    • Analogous to \(\w_q\) in PCA but without the orthonormal restrictions (some people also write \(\w\) for the loadings)
    • Analogous to \(\beta\) in a linear regression
  • Assumption: \(\vec{\epsilon}\) is uncorrelated with \(\vec{F}\) and has \(\Expect{\vec{\epsilon}} = 0\)
    • \(p\)-dimensional vector (unlike the scalar noise in linear regression)
  • Assumption: \(\Var{\vec{\epsilon}} \equiv \Uniquenesses\) is diagonal (i.e., no correlation across dimensions of the noise)
    • Sometimes called the uniquenesses or the unique variance components
    • Analogous to \(\sigma^2\) in a linear regression
    • Some people write it \(\mathbf{\Sigma}\), others use that for \(\Var{\vec{X}}\)
    • Means: all correlation between observables comes from the factors
  • Not really an assumption: \(\Var{\vec{F}} = \mathbf{I}\)
    • Not an assumption because we could always de-correlate, as in homework 2
  • Assumption: \(\vec{\epsilon}\) is uncorrelated across units
    • As we assume in linear regression…

5.4.1 Summary of the factor model assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\ \Expect{\vec{\epsilon}} & = & \vec{0}\\ \Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\ \Var{\vec{F}} & = & \mathbf{I} \end{eqnarray}\]

5.4.2 Some consequences of the assumptions

\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}} \end{eqnarray}\]
  • Typically: center all variables so we can take \(\Expect{\vec{F}} = 0\)
\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\ & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses \end{eqnarray}\]
  • \(\FactorLoadings\) is \(p\times q\) so this is low-rank-plus-diagonal
    • or low-rank-plus-noise
    • Contrast with PCA: that approximates the variance matrix as purely low-rank
\[\begin{eqnarray} \vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\ \Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\ \Cov{X_i, X_j} & = & \text{what?} \end{eqnarray}\]
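A simulation check of the variance identity \(\Var{\vec{X}} = \FactorLoadings \FactorLoadings^T + \Uniquenesses\), with randomly chosen (hypothetical) loadings and uniquenesses:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 4, 2, 200_000

gamma = rng.normal(size=(p, q))               # hypothetical factor loadings
psi = np.diag(rng.uniform(0.5, 1.0, size=p))  # diagonal uniquenesses

f = rng.normal(size=(n, q))                   # factor scores, Var(F) = I
eps = rng.normal(size=(n, p)) @ np.sqrt(psi)  # noise with variance Psi
x = f @ gamma.T + eps                         # X = Gamma F + epsilon, row by row

sample_cov = (x.T @ x) / n                    # everything has mean zero
max_gap = np.max(np.abs(sample_cov - (gamma @ gamma.T + psi)))
```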

5.5 Geometry

  • As \(\vec{F}\) varies over \(q\) dimensions, \(\FactorLoadings \vec{F}\) sweeps out a \(q\)-dimensional subspace in \(p\)-dimensional space

  • Then \(\vec{\epsilon}\) perturbs out of this subspace

  • If \(\Var{\vec{\epsilon}} = \mathbf{0}\) then we’d be exactly in the \(q\)-dimensional space, and we’d expect correspondence between factors and principal components
    • (Modulo the rotation problem, to be discussed)
  • If the noise isn’t zero, factors \(\neq\) PCs
    • In extremes: the largest direction of variation could come from a big entry in \(\Uniquenesses\), not from the linear structure at all

5.6 How do we estimate a factor model?

\[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]

  • We can’t regress \(\vec{X}\) on \(\vec{F}\) because we never see \(\vec{F}\)

5.6.1 Suppose we knew \(\Uniquenesses\)

  • we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T \end{eqnarray}\]
  • LHS is \(\Var{\FactorLoadings\vec{F}}\) so we know it’s symmetric and non-negative-definite
  • \(\therefore\) We can eigendecompose LHS as \[\begin{eqnarray} \Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\ & = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T \end{eqnarray}\]
    • \(\mathbf{\lambda} =\) diagonal matrix of eigenvalues, only \(q\) of which are non-zero
  • Set \(\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}\) and everything’s consistent

5.6.2 Suppose we knew \(\FactorLoadings\)

then we’d say \[\begin{eqnarray} \Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\ \Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses \end{eqnarray}\]

5.6.3 “One person’s vicious circle is another’s iterative approximation”:

  • Start with a guess about \(\Uniquenesses\)
    • Suitable guess: regress each observable on the others, residual variance is \(\Uniquenesses_{ii}\)
  • Until the estimates converge:
    • Use \(\Uniquenesses\) to find \(\FactorLoadings\) (by eigen-magic)
    • Use \(\FactorLoadings\) to find \(\Uniquenesses\) (by subtraction)
  • Once we have the loadings (and uniquenesses), we can estimate the scores
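A sketch of this loop, run on a population covariance built from a known \(\FactorLoadings\) and \(\Uniquenesses\), so we can watch it recover the uniquenesses (sizes, seed, and iteration count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
gamma_true = rng.normal(size=(p, q))
psi_true = np.diag(rng.uniform(0.5, 1.0, size=p))
var_x = gamma_true @ gamma_true.T + psi_true   # population Var(X)

psi = np.diag(np.diag(var_x) * 0.5)            # crude initial guess
for _ in range(1000):
    # eigen-magic: loadings from the top q eigenpairs of Var(X) - Psi
    vals, vecs = np.linalg.eigh(var_x - psi)   # ascending eigenvalues
    top = vals[-q:].clip(min=0)                # guard against stray negatives
    gamma = vecs[:, -q:] * np.sqrt(top)
    # subtraction: uniquenesses from the diagonal of Var(X) - Gamma Gamma^T
    psi = np.diag(np.diag(var_x - gamma @ gamma.T))

gap = np.max(np.abs(np.diag(psi) - np.diag(psi_true)))
```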

5.7 Estimating factor scores

  • PC scores were just projection
  • Estimating factor scores isn’t so easy!

  • Factor model: \[ \vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon} \]
  • It’d be convenient to estimate factor scores as \[ \FactorLoadings^{-1} \vec{X} \] but \(\FactorLoadings^{-1}\) doesn’t exist!

  • Typical approach: optimal linear estimator
  • We know (from 401) that the optimal linear estimator of any \(Y\) from any \(\vec{Z}\) is \[ \Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z} \]
    • (ignoring the intercept because everything’s centered)
    • i.e., column vector of optimal coefficients is \(\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}\)
  • Here \[ \Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{F} = \FactorLoadings \] and \[ \Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses \] so the optimal linear factor score estimates are \[ \FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X} \]
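As code, with a small hypothetical \(\FactorLoadings\) and \(\Uniquenesses\) and one made-up observation:

```python
import numpy as np

gamma = np.array([[1.0, 0.0],
                  [0.8, 0.3],
                  [0.0, 1.0]])            # p = 3 observables, q = 2 factors
psi = np.diag([0.5, 0.5, 0.5])            # uniquenesses

var_x = gamma @ gamma.T + psi             # implied Var(X)
weights = gamma.T @ np.linalg.inv(var_x)  # [q x p] matrix of coefficients

x_centered = np.array([1.2, 0.9, -0.4])   # one (centered) observation
f_hat = weights @ x_centered              # estimated factor scores
```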

5.8 Example: Back to the dresses from HW 7

  • Fit a one-factor model:
##       Length Class  Mode   
## Gamma 14400  -none- numeric
## Z       205  -none- numeric
## Sigma 14400  -none- numeric
  • Positive and negative images along that factor: