---
title: Factor Analysis
author: 36-462/662, Fall 2019
date: 16 September 2019
output: slidy_presentation
bibliography: locusts.bib
---
```{r, include=FALSE}
# General set-up options
library(knitr)
opts_chunk$set(size="small",background="white", highlight=FALSE, cache=TRUE,
autodep=TRUE, tidy=TRUE, warning=FALSE, message=FALSE,
tidy.opts=list(comment=FALSE))
```
\[
\newcommand{\X}{\mathbf{x}}
\newcommand{\w}{\mathbf{w}}
\newcommand{\V}{\mathbf{v}}
\newcommand{\S}{\mathbf{s}}
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]}
\newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]}
\DeclareMathOperator{\tr}{tr}
\DeclareMathOperator*{\argmin}{argmin}
\newcommand{\FactorLoadings}{\mathbf{\Gamma}}
\newcommand{\Uniquenesses}{\mathbf{\psi}}
\]
## Recap
- Start with $n$ items in a database ($n$ big)
- Represent items as $p$-dimensional vectors of features ($p$ too big for comfort), data is now $\X$, dimension $[n \times p]$
- Principal components analysis:
+ Find the best $q$-dimensional linear approximation to the data
+ Equivalent to finding $q$ directions of maximum variance through the data
+ Equivalent to finding top $q$ eigenvalues and eigenvectors of $\frac{1}{n}\X^T \X =$ sample variance matrix of the data
+ New features = PC scores = projections on to the eigenvectors
+ Variances along those directions = eigenvalues
## PCA is _not a model_
- PCA says _nothing_ about what the data _should_ look like
- PCA makes _no_ predictions about new data (or old data!)
- PCA _just_ finds a linear approximation to _these_ data
- What would be a PCA-like model?
## This is where factor analysis comes in
Remember PCA:
\[
\S = \X \w
\]
and
\[
\X = \S \w^T
\]
(because $\w$ is orthogonal, so $\w^T = \w^{-1}$)
If we use only $q$ PCs, then
\[
\S_q = \X \w_q
\]
but
\[
\X \neq \S_q \w_q^T
\]
- Usual approach in statistics when an equation doesn't hold exactly: treat the discrepancy as random noise (a quick numerical check of these identities follows below)
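A quick numerical check of these identities on made-up data (the names below are illustrative and not reused elsewhere in these slides):
```{r}
# Check the PCA identities numerically on simulated (illustrative) data
set.seed(42)
pca.X <- scale(matrix(rnorm(100*4), ncol=4), center=TRUE, scale=FALSE)  # centered data matrix
pca.w <- prcomp(pca.X)$rotation            # eigenvectors of the sample covariance
pca.S <- pca.X %*% pca.w                   # scores = projections onto the eigenvectors
max(abs(pca.X - pca.S %*% t(pca.w)))       # ~ 0: reconstruction from all PCs is exact
pca.q <- 2
max(abs(pca.X - pca.S[, 1:pca.q] %*% t(pca.w[, 1:pca.q])))  # > 0: with only q PCs it isn't
```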
## The factor model
$\vec{X}$ is $p$-dimensional, **manifest**, **unhidden** or **observable**
$\vec{F}$ is $q$-dimensional, $q < p$ but **latent** or **hidden** or **unobserved**
The model:
\begin{eqnarray*}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
(\text{observables}) & = & (\text{factor loadings}) (\text{factor scores}) + (\text{noise})
\end{eqnarray*}
## The factor model
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- $\FactorLoadings =$ a $[p \times q]$ matrix of **factor loadings**
+ Analogous to $\w_q$ in PCA but without the orthonormal restrictions (some people also write $\w$ for the loadings)
+ Analogous to $\beta$ in a linear regression
- **Assumption**: $\vec{\epsilon}$ is uncorrelated with $\vec{F}$ and has $\Expect{\vec{\epsilon}} = 0$
+ $p$-dimensional vector (unlike the scalar noise in linear regression)
- **Assumption**: $\Var{\vec{\epsilon}} \equiv \Uniquenesses$ is diagonal (i.e., no correlation across dimensions of the noise)
+ Sometimes called the **uniquenesses** or the **unique variance components**
+ Analogous to $\sigma^2$ in a linear regression
+ Some people write it $\mathbf{\Sigma}$, others use that for $\Var{\vec{X}}$
+ Means: all correlation between observables comes from the factors
- **Not really** an assumption: $\Var{\vec{F}} = \mathbf{I}$
+ Not an assumption because we could always de-correlate, as in homework 2
- **Assumption**: $\vec{\epsilon}$ is uncorrelated across units
+ As we assume in linear regression...
## The factor model: summary
\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Cov{\vec{F}, \vec{\epsilon}} & = & \mathbf{0}\\
\Var{\vec{\epsilon}} & \equiv & \Uniquenesses, ~ \text{diagonal}\\
\Expect{\vec{\epsilon}} & = & \vec{0}\\
\Var{\vec{F}} & = & \mathbf{I}
\end{eqnarray}
## Some consequences of the assumptions
\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Expect{\vec{X}} & = & \FactorLoadings \Expect{\vec{F}}
\end{eqnarray}
- Typically: center all variables so we can take $\Expect{\vec{F}} = 0$
## Some consequences of the assumptions
\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \Var{\vec{F}} \FactorLoadings^T + \Var{\vec{\epsilon}}\\
& = & \FactorLoadings \FactorLoadings^T + \Uniquenesses
\end{eqnarray}
- $\FactorLoadings$ is $p\times q$, so this is **low-rank-plus-diagonal** (a simulation check is sketched below)
+ or **low-rank-plus-noise**
+ Contrast with PCA: that approximates the variance matrix as _purely_ low-rank
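A small simulation check of the low-rank-plus-diagonal structure (all names and parameter values below are invented for illustration):
```{r}
# Simulate from a one-factor model and check Var(X) ~ Gamma Gamma^T + Psi (illustrative values)
set.seed(7)
toy.n <- 1e4
toy.Gamma <- matrix(c(0.9, 0.5, -0.3), ncol=1)   # loadings: p = 3, q = 1
toy.Psi <- diag(c(0.2, 0.4, 0.1))                # uniquenesses
toy.F <- matrix(rnorm(toy.n), ncol=1)            # factor scores, variance 1
toy.eps <- matrix(rnorm(toy.n*3), ncol=3) %*% sqrt(toy.Psi)  # noise with Var = Psi
toy.X <- toy.F %*% t(toy.Gamma) + toy.eps
round(cov(toy.X) - (toy.Gamma %*% t(toy.Gamma) + toy.Psi), 2)  # should be close to 0
```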
## Some consequences of the assumptions
\begin{eqnarray}
\vec{X} & = & \FactorLoadings \vec{F} + \vec{\epsilon}\\
\Var{\vec{X}} & = & \FactorLoadings \FactorLoadings^T + \Uniquenesses\\
\Cov{X_i, X_j} & = & \text{what?}
\end{eqnarray}
## Geometry
- As $\vec{F}$ varies over $q$ dimensions, $\FactorLoadings \vec{F}$ sweeps out a $q$-dimensional subspace in $p$-dimensional space
- Then $\vec{\epsilon}$ perturbs out of this subspace
- If $\Var{\vec{\epsilon}} = \mathbf{0}$ then we'd be _exactly_ in the $q$-dimensional space, and we'd expect correspondence between factors and principal components
+ (Modulo the rotation problem, to be discussed)
- If the noise _isn't_ zero, factors $\neq$ PCs
+ In extremes: the largest direction of variation could come from a big entry in $\Uniquenesses$, not from the linear structure at all
## Geometry
```{r, echo=FALSE}
n <- 20; library(scatterplot3d)
f <- matrix(sort(rnorm(n)),ncol=1); Gamma <- matrix(c(0.5,0.2,-0.1),nrow=1)
fGamma <- f %*% Gamma; x <- fGamma + matrix(rnorm(n*3,sd=c(.15,.05,.09)),ncol=3,byrow=TRUE)
s3d <- scatterplot3d(x,xlab=expression(x^1),ylab=expression(x^2),
zlab=expression(x^3),pch=16)
s3d$points3d(matrix(seq(from=min(f)-1,to=max(f)+1,length.out=2),ncol=1)%*%Gamma,
col="red",type="l")
s3d$points3d(fGamma,col="red",pch=16)
for (i in 1:nrow(x)) {
s3d$points3d(x=c(x[i,1],fGamma[i,1]),y=c(x[i,2],fGamma[i,2]),z=c(x[i,3],fGamma[i,3]),
col="grey",type="l") }
```
## How do we estimate?
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
Can't regress $\vec{X}$ on $\vec{F}$ because we never see $\vec{F}$
## Suppose we _knew_ $\Uniquenesses$
- We'd say
\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \Uniquenesses & = & \FactorLoadings\FactorLoadings^T
\end{eqnarray}
- LHS is $\Var{\FactorLoadings\vec{F}}$ so we know it's symmetric and non-negative-definite
- $\therefore$ We can eigendecompose LHS as
\begin{eqnarray}
\Var{\vec{X}} - \Uniquenesses & = &\mathbf{v} \mathbf{\lambda} \mathbf{v}^T\\
& = & (\mathbf{v} \mathbf{\lambda}^{1/2}) (\mathbf{v} \mathbf{\lambda}^{1/2})^T
\end{eqnarray}
+ $\mathbf{\lambda} =$ diagonal matrix of eigenvalues, only $q$ of which are non-zero
- Set $\FactorLoadings = \mathbf{v} \mathbf{\lambda}^{1/2}$, keeping only the $q$ columns with non-zero eigenvalues, and everything's consistent (sketched in code below)
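A self-contained sketch of this eigen-decomposition step (the loadings and uniquenesses below are invented; in practice the left-hand side would be the sample variance minus our current guess at $\Uniquenesses$):
```{r}
# Given Psi, recover Gamma (up to sign/rotation) by eigendecomposing Var(X) - Psi
toy2.Gamma <- matrix(c(0.9, 0.5, -0.3), ncol=1)       # "true" loadings, q = 1
toy2.Psi <- diag(c(0.2, 0.4, 0.1))                    # "true" uniquenesses
toy2.VarX <- toy2.Gamma %*% t(toy2.Gamma) + toy2.Psi  # implied Var(X)
toy2.eig <- eigen(toy2.VarX - toy2.Psi)               # eigendecompose the low-rank part
round(toy2.eig$values, 10)                            # only q = 1 non-zero eigenvalue
toy2.Gamma.hat <- toy2.eig$vectors[, 1, drop=FALSE] * sqrt(toy2.eig$values[1])
cbind(toy2.Gamma, toy2.Gamma.hat)                     # agree, up to an overall sign
```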
## Suppose we knew $\FactorLoadings$
then we'd say
\begin{eqnarray}
\Var{\vec{X}} & = & \FactorLoadings\FactorLoadings^T + \Uniquenesses\\
\Var{\vec{X}} - \FactorLoadings\FactorLoadings^T & = & \Uniquenesses
\end{eqnarray}
## "One person's vicious circle is another's iterative approximation":
- Start with a guess about $\Uniquenesses$
+ Suitable guess: regress each observable on the others, residual variance is $\Uniquenesses_{ii}$
- Until the estimates converge:
+ Use $\Uniquenesses$ to find $\FactorLoadings$ (by eigen-magic)
+ Use $\FactorLoadings$ to find $\Uniquenesses$ (by subtraction)
- Once we have the loadings (and uniquenesses), we can estimate the scores (a bare-bones sketch of the iteration follows)
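Here is a bare-bones sketch of that iteration in R, a teaching sketch rather than the algorithm any particular package uses (the initialization below is cruder than the regression-based guess suggested above, and there is no convergence check):
```{r}
# Bare-bones "principal factor"-style iteration (teaching sketch only: fixed number of
# iterations, no convergence check, Heywood cases not handled)
toy.factanal <- function(x, q, iters=100) {
  V <- cov(x)
  psi <- diag(V)/2   # crude initial guess; the regression-based guess above would be better
  for (i in 1:iters) {
    eig <- eigen(V - diag(psi), symmetric=TRUE)
    lambda <- pmax(eig$values[1:q], 0)   # guard against tiny negative eigenvalues
    Gamma <- eig$vectors[, 1:q, drop=FALSE] %*% diag(sqrt(lambda), nrow=q)  # loadings by eigen-magic
    psi <- diag(V - Gamma %*% t(Gamma))  # uniquenesses by subtraction
  }
  list(loadings=Gamma, uniquenesses=psi)
}
# E.g., toy.factanal(toy.X, q=1) should roughly recover toy.Gamma and diag(toy.Psi)
```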
##
```{r}
small.height <- 80
small.width <- 60
source("../../hw/03/eigendresses.R")
dress.images <- image.directory.df(path="../../hw/03/amazon-dress-jpgs/", pattern="*.jpg",
width=small.width,
height=small.height)
```
```{r, tidy.opts=list(comment=TRUE)}
library("cate") # For high-dimensional (p > n) factor models
# function is a little fussy and needs its input to be a matrix rather
# than a data frame
# Also, it has a bunch of estimation methods, but the default one doesn't
# work nicely when some observables have zero variance (here, white pixels at
# the edges of every image --- use something a little more robust)
dresses.fa.1 <- factor.analysis(as.matrix(dress.images), r=1, method="pc")
summary(dresses.fa.1) # Factor loadings, factor scores, uniquenesses
```
##
```{r}
par(mfrow=c(1,2))
plot(vector.to.image(dresses.fa.1$Gamma, height=small.height, width=small.width))
plot(vector.to.image(-dresses.fa.1$Gamma, height=small.height, width=small.width))
par(mfrow=c(1,1))
```
##
```{r}
dresses.fa.5 <- factor.analysis(as.matrix(dress.images), r=5, method="pc")
summary(dresses.fa.5) # Factor loadings, factor scores, uniquenesses
```
##
```{r, echo=FALSE}
par(mfcol=c(2,5))
# Show each factor's loadings as an image, in both signs (the sign of a factor is
# arbitrary, just as for principal components)
for (j in 1:5) {
  plot(vector.to.image(dresses.fa.5$Gamma[,j], height=small.height, width=small.width))
  plot(vector.to.image(-dresses.fa.5$Gamma[,j], height=small.height, width=small.width))
}
par(mfrow=c(1,1))
```
Possible interpretations of the five factors (left to right, each shown in both signs): dress vs. model, width of dress, pose, pose, pose (?)
## Recover image no. 1 from the factor scores
```{r}
par(mfrow=c(1,2))
plot(vector.to.image(dress.images[1,], height=small.height, width=small.width))
plot(vector.to.image(dresses.fa.5$Gamma %*% dresses.fa.5$Z[1,],
height=small.height,
width=small.width))
par(mfrow=c(1,1))
```
## Checking assumptions
- Can't check assumptions about $\vec{F}$ or $\vec{\epsilon}$ directly
- Can check whether $\Var{\vec{X}}$ is low-rank-plus-noise
+ Need to know how far we should expect $\Var{\vec{X}}$ to be from low-rank-plus-noise
    + Can simulate (a parametric-bootstrap-style sketch follows below)
+ Exact theory if you assume everything's Gaussian
- _Other models_ can also give low-rank-plus-noise covariance
+ See readings from ADA
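To make "can simulate" concrete, here is a parametric-bootstrap-style sketch, reusing the toy simulation and the bare-bones fitter from earlier slides (purely illustrative):
```{r}
# Parametric-bootstrap-style check of the low-rank-plus-noise assumption,
# reusing toy.X and toy.factanal from earlier sketches (illustrative only)
toy.discrepancy <- function(x, q=1) {
  fit <- toy.factanal(x, q=q)
  V.fit <- fit$loadings %*% t(fit$loadings) + diag(fit$uniquenesses)
  sum((cov(x) - V.fit)^2)   # distance of the sample covariance from low-rank-plus-diagonal
}
toy.fit <- toy.factanal(toy.X, q=1)
toy.V.fit <- toy.fit$loadings %*% t(toy.fit$loadings) + diag(toy.fit$uniquenesses)
# Simulate Gaussian data from the fitted covariance, re-fit, recompute the discrepancy
toy.sim.disc <- replicate(200, toy.discrepancy(
  MASS::mvrnorm(n=nrow(toy.X), mu=rep(0, ncol(toy.X)), Sigma=toy.V.fit)))
mean(toy.sim.disc >= toy.discrepancy(toy.X))   # crude simulation-based p-value
```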
## Caution: the rotation problem
- Remember the factor model:
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
with $\Var{\vec{F}} = \mathbf{I}$, $\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}$, $\Var{\vec{\epsilon}} = \Uniquenesses$
- Now consider $\vec{G} = \mathbf{r} \vec{F}$ for any orthogonal matrix $\mathbf{r}$
\begin{eqnarray}
\vec{G} & = & \mathbf{r} \vec{F}\\
\Var{\vec{G}} &= & \mathbf{r}\Var{\vec{F}}\mathbf{r}^T\\
& = & \mathbf{r}\mathbf{I}\mathbf{r}^T = \mathbf{I}\\
\Cov{\vec{G}, \vec{\epsilon}} & = & \mathbf{r}\Cov{\vec{F}, \vec{\epsilon}} = \mathbf{0}\\
\vec{F} & = & \mathbf{r}^{-1} \vec{G} = \mathbf{r}^{T} \vec{G}\\
\vec{X} & = & \FactorLoadings \mathbf{r}^T \vec{G} + \vec{\epsilon}\\
& = & \FactorLoadings^{\prime} \vec{G} + \vec{\epsilon} \quad (\text{where } \FactorLoadings^{\prime} \equiv \FactorLoadings \mathbf{r}^T)
\end{eqnarray}
- Once we've found one factor solution, we can rotate to another, and _nothing observable changes_ (numerical check below)
- In other words: we're free to use any coordinate system we like for the latent variables
- Really a problem if we want to interpret the factors
+ Different rotations make _exactly_ the same predictions about the data
+ If we prefer one over another, it _cannot_ be because one of them fits the data better or has more empirical support (at least not _this_ data)
+ On the other hand, if our initial estimate of $\FactorLoadings$ is hard to interpret, we can always try rotating to make it easier to tell stories about
- Rotation is no problem at all if we just want to predict
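A quick numerical check: for any orthogonal $\mathbf{r}$, the rotated loadings $\FactorLoadings^{\prime} = \FactorLoadings \mathbf{r}^T$ imply exactly the same covariance for $\vec{X}$ (the loadings and rotation below are invented for illustration):
```{r}
# Rotating the factors leaves the implied covariance Gamma Gamma^T + Psi unchanged
rot.Gamma <- matrix(c(0.9, 0.5, -0.3,
                      0.1, -0.4, 0.6), nrow=3, ncol=2)  # illustrative loadings, q = 2
theta <- pi/5
rot.r <- matrix(c(cos(theta), sin(theta),
                  -sin(theta), cos(theta)), nrow=2)     # an orthogonal (rotation) matrix
rot.Gamma.prime <- rot.Gamma %*% t(rot.r)               # rotated loadings
max(abs(rot.Gamma %*% t(rot.Gamma) - rot.Gamma.prime %*% t(rot.Gamma.prime)))  # ~ 0
```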
## Applications
- Factor analysis begins with @Spearman-general-intelligence (IQ testing)
- @Thurstone-vectors-of-mind made it a general tool for psychometrics
- "Five factor model" of personality (openness, conscientiousness, extraversion, agreeableness, neuroticism): Basically, FA on personality quizzes
+ A complicated story of dictionaries, lunatic asylums, and linear algebra [@Paul-cult-of-personality]
+ Fails goodness-of-fit tests but that doesn't stop the psychologists [@Borsboom-attack]
- More recent applications:
+ Netflix again
+ Cambridge Analytica
- Not covered here: spatial and especially spatio-temporal data
## Netflix
- Each movie _loads_ on to each factor
+ E.g., one might load highly on "stuff blowing up", "in space", "dark and brooding" but not at all or negatively on "romantic comedy", "tearjerker", "bodily humor"
+ Discovery: We don't need to tell the system these are the dimensions movies vary along; it will find as many factors as we ask
+ Interpretation: The factors it finds might not have humanly-comprehensible interpretations like "stuff blowing up" or "family secrets"
    + Rotation: any rotation of the factors predicts the ratings equally well, so the dimensions the system finds are not unique
- Each user _also_ has a score on each factor
+ E.g., I might have high scores for "stuff blowing up", "in space" and "romantic comedy", but negative scores for "tearjerker", "bodily humor" and "family secrets"
- Ratings are inner products of loadings and scores, plus noise (a toy sketch follows below)
+ Observable-specific noise helps capture ratings of movies which are extra variable, even given their loadings
- Further reading: see readings for last lecture
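A toy version of "ratings are inner products plus noise" (all numbers below are invented, and real recommender systems are far more elaborate):
```{r}
# Toy ratings model: each movie's loadings dotted with a user's factor scores, plus noise
movie.loadings <- matrix(c( 2.0, -1.5,    # movie 1: high on "stuff blowing up", low on "romantic comedy"
                           -0.5,  1.8),   # movie 2: the reverse
                         nrow=2, byrow=TRUE)
user.scores <- c(1.2, -0.3)               # one user's (latent) factor scores
movie.loadings %*% user.scores + rnorm(2, sd=0.1)  # that user's predicted-plus-noise ratings
```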
## Cambridge Analytica
- UK political-operations firm
- Starting point: Data sets of Facebook likes _plus_ a five-factor personality test
+ Ran regressions to link likes to personality factors
- Then snarfed a lot of data from _other_ people about their Facebook likes
- Then extrapolated to personality scores
- Then sold this as the basis for targeting political ads in 2016 both in the UK and the US
+ Five-factor personality scores _do_ correlate to political preferences (see, e.g., [here](https://www.vox.com/policy-and-politics/2019/9/16/20856316/poll-yougov-painting-ideology-trump)), but so do education and IQ, which are all correlated with each other
+ Cambridge Analytica claimed to be able to target the inner psyches of voters and tap their hidden fears and desires
+ Not clear how well it worked or even how much of what they actually did used the estimated personality scores
## Cambridge Analytica
- At a technical level, Cambridge Analytica made (or claimed to make) a _lot_ of extrapolations
+ From Facebook likes among initial app users to latent factor scores
+ From Facebook likes among _friends_ of app users to latent factor scores
+ From factor scores to ad effectiveness
+ How did modeling error and noise propagate along this chain?
- Again, not clear that this worked any better than traditional demographics or even that the psychological parts were used in practice
- As with lots of data firms, a big contrast in rhetoric:
+ To customers, claims of doing magic
    + To regulators / governments, claims of being just a polling / advising agency
+ The recent (Netflix!) documentary takes their earlier PR at face value...
- Clearly they were shady, but they don't seem to have been very effective
+ "They meant ill, but they were incompetent" is not _altogether_ comforting
- Further readings: linked to from the course homepage
## Summary
- Factor analysis models the data as linear functions of a few latent variables plus noise
- Extends principal components to an actual statistical model
- Not the only way to get linear-plus-noise structure in the data
- Next time: _Fast_ linear summaries, start of nonlinear structure
## Backup: Estimating factor scores
- PC scores were just projections of the data onto the eigenvectors
- Estimating factor scores isn't so easy!
- Factor model:
\[
\vec{X} = \FactorLoadings \vec{F} + \vec{\epsilon}
\]
- It'd be convenient to estimate factor scores as
\[
\FactorLoadings^{-1} \vec{X}
\]
but $\FactorLoadings^{-1}$ doesn't exist!
- Typical approach: optimal linear estimator
- We know (from 401) that the optimal linear estimator of any $Y$ from any $\vec{Z}$ is
\[
\Cov{Y, \vec{Z}} \Var{\vec{Z}}^{-1} \vec{Z}
\]
+ (ignoring the intercept because everything's centered)
+ i.e., column vector of optimal coefficients is $\Var{\vec{Z}}^{-1} \Cov{\vec{Z}, Y}$
- Here
\[
\Cov{\vec{X}, \vec{F}} = \FactorLoadings\Var{\vec{F}} = \FactorLoadings
\]
and
\[
\Var{\vec{X}} = \FactorLoadings\FactorLoadings^T + \Uniquenesses
\]
so the optimal linear factor score estimates are
\[
\FactorLoadings^T (\FactorLoadings\FactorLoadings^T + \Uniquenesses)^{-1} \vec{X}
\]
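A sketch of computing these optimal linear score estimates, reusing the toy one-factor simulation from earlier in the slides (illustrative only):
```{r}
# Optimal linear factor-score estimates: Gamma^T (Gamma Gamma^T + Psi)^{-1} x for each row x
# (reusing toy.Gamma, toy.Psi, toy.X and the true scores toy.F from the earlier sketch)
toy.VarX <- toy.Gamma %*% t(toy.Gamma) + toy.Psi
toy.F.hat <- toy.X %*% solve(toy.VarX) %*% toy.Gamma   # one estimated score per row of toy.X
cor(toy.F.hat, toy.F)                                  # high, but not 1: scores are estimated with error
```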
## Backup: Factor models and high-dimensional variance estimation
- With $p$ observable features, a variance matrix has $p(p+1)/2$ entries (by symmetry)
- Ordinarily, to estimate $k$ parameters requires $n \geq k$ data points, so we'd need at least $p(p+1)/2$ data points to get a variance matrix
+ So it looks like we need $n=O(p^2)$ data points to estimate variance matrices
+ Trouble if $p=10^6$ or even $10^4$
- A $q$-factor model only has $pq+p=p(q+1)$ parameters
+ So we can get away with only $O(p)$ data points
+ What's going on in the data example above, where $p= `r ncol(dress.images)`$ but $n = `r nrow(dress.images)`$?
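Plugging the dress-image dimensions into those two parameter counts:
```{r}
# Parameter counts: unrestricted covariance matrix vs. a 5-factor model, for the dress images
p.dress <- ncol(dress.images)
c(unrestricted.covariance = p.dress*(p.dress+1)/2, five.factor.model = p.dress*(5+1))
```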
## References