36-462/662, Data Mining, Fall 2019

Lecture 17 (23 October 2019)

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \]

- We’ve seen the basic ideas of information theory:
- How much information/uncertainty is in one random variable (\(H[X]\))
- How much information/uncertainty is in two RVs (\(H[X,Y]\))
- How much information/uncertainty is left after conditioning (\(H[Y|X]\))
- How much is uncertainty reduced by conditioning (\(H[Y] - H[Y|X]\))
- How much information does one variable give about another (\(I[X;Y]=H[Y]-H[Y|X]\))
- And conditional versions of all these
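These quantities are easy to check numerically for a small discrete example. A quick sketch, with a made-up joint pmf (all numbers here are illustrative):

```python
import numpy as np

def H(p):
    """Entropy in bits of a probability array (zeros contribute 0)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint distribution p(x, y): rows index x, columns index y
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_X, H_Y, H_XY = H(px), H(py), H(pxy)
H_Y_given_X = H_XY - H_X           # chain rule: H[X,Y] = H[X] + H[Y|X]
I_XY = H_Y - H_Y_given_X           # information X gives about Y

assert np.isclose(I_XY, H_X + H_Y - H_XY)   # symmetric form
assert I_XY >= 0                            # information is never negative
```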

- We’ve seen how to use this for feature selection
- Rank features \(X_1, \ldots X_p\) by \(H[Y|X_i]\) or \(I[X_i;Y]\)
- Pick collections of features by evaluating \(I[Y; X_1, \ldots X_q]\)
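As a sketch of the ranking idea: plug-in estimates of \(I[X_i;Y]\) from simulated binary features recover the intended ordering (the features, flip rates, and sample size below are all invented):

```python
import numpy as np
from collections import Counter

def mi_bits(x, y):
    """Plug-in estimate of I[X;Y] in bits from paired discrete samples."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * np.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=5000)
x1 = y ^ (rng.random(5000) < 0.1)    # disagrees with y 10% of the time
x2 = y ^ (rng.random(5000) < 0.4)    # disagrees with y 40% of the time
x3 = rng.integers(0, 2, size=5000)   # independent of y

scores = [mi_bits(x, y) for x in (x1, x2, x3)]
assert scores[0] > scores[1] > scores[2]   # ranking tracks relevance
```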

- Now: What would success look like?
- Good features
- Good *synthetic* features

- We observe \(X\) (possibly multivariate) and want to predict \(Y\)
- Conditioning on \(X\) gives us some amount of information about \(Y\) \[ I[X;Y] = H[Y] - H[Y|X] \]
- Our prediction is a function of \(X\), say \(\hat{Y}(X)\)
- Conditioning on \(\hat{Y}\) never gives us *more* information: \[ I[\hat{Y};Y] = H[Y] - H[Y|\hat{Y}] \leq I[X;Y] \]
- Because: (i) Conditioning never increases entropy, \(H[Y|X, f(X)] \leq H[Y|f(X)]\)
- And (ii) conditioning on \(X\) and \(f(X)\) is the same as conditioning on \(X\), so \(H[Y|X, f(X)] = H[Y|X]\)
- So \(H[Y|f(X)] \geq H[Y|f(X), X] = H[Y|X]\)
- This is (one form of) the **data-processing inequality**

- \(I[X;Y]\) is the **predictive information** \(X\) has about \(Y\), and limits how good *any* prediction using \(X\) can be
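The data-processing inequality can be illustrated numerically. Here a hypothetical joint pmf is compressed by merging two values of \(X\) with different conditionals, and the information strictly shrinks:

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(pxy):
    """Mutual information in bits from a joint pmf table."""
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

# Hypothetical joint pmf over X in {0,1,2} and Y in {0,1}
pxy = np.array([[0.20, 0.05],
                [0.05, 0.20],
                [0.25, 0.25]])

# A lossy transformation T = f(X): merge x = 0 with x = 1
f = [0, 0, 1]
pty = np.zeros((2, 2))
for x, t in enumerate(f):
    pty[t] += pxy[x]

# Data-processing inequality: I[T;Y] <= I[X;Y]
# Here the merge is strictly lossy, since x=0 and x=1 have opposite p(y|x)
assert I(pty) <= I(pxy) + 1e-12
```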

- For discrete \(Y\), say \(M\) (for “mistake”) \(=1\) if \(Y\neq \hat{Y}\) and \(=0\) if \(Y=\hat{Y}\)
- Then one can show (see backup) that \[
H[Y|X] \leq H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}
\]
- Perfect classification (\(\Pr(M=1) = 0\)) implies \(H[Y|X] = 0\)
- RHS is an increasing function of \(\Pr(M=1)\) so higher \(H[Y|X] \Rightarrow\) higher inaccuracy, no matter what method we use

- Upshot: \(H[Y|X]\) puts a lower bound on classification *error* (equivalently, an upper bound on accuracy)
- Even if we just care about accuracy, we should want to maximize information!
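A small check of the bound, with made-up numbers, using the Bayes-optimal classifier \(\hat{Y}(x) = \arg\max_y p(y|x)\):

```python
import numpy as np

def H(p):
    """Entropy in bits; zero-probability cells contribute nothing."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint pmf: rows are x, columns are y, so |Y| = 3
pxy = np.array([[0.20, 0.05, 0.05],
                [0.05, 0.25, 0.05],
                [0.02, 0.03, 0.30]])
px = pxy.sum(axis=1)

H_Y_given_X = H(pxy) - H(px)        # chain rule

# Bayes-optimal classifier: predict the most probable y for each x
yhat = pxy.argmax(axis=1)
p_err = 1.0 - sum(pxy[x, yhat[x]] for x in range(len(px)))   # Pr(M = 1)
H_M = H([p_err, 1.0 - p_err])

# Fano: H[Y|X] <= H[M] + Pr(M=1) * log2(|Y| - 1)
fano_rhs = H_M + p_err * np.log2(pxy.shape[1] - 1)
assert H_Y_given_X <= fano_rhs + 1e-12
```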

- Suppose we look not at \(X\) but \(T=\tau(X)\)
- Could be just picking out some dimensions of a multivariate \(X\)
- Could be applying transformations
- Could be creating new features out of old ones (as in PCA)

- Data-processing inequality says: \[ I[T;Y] \leq I[X;Y] \]
- When will \(I[T;Y] = I[X;Y]\)?

- Re-write mutual informations: \[ I[X;Y] = \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}} \] and \[ I[T;Y] = \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}} \]
- Suppose that \(\tau(x) = \tau(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})\)
- Then for any \(x\) with \(\tau(x) = t\), \[
p(y|t) = p(y|x)
\]
- In more symbols, if \(x\in\tau^{-1}(t)\) then \(p(y|t) = p(y|x)\)
- Why is this true?

- Use this to group terms in the \(x\) sum for \(I[X;Y]\): \[\begin{eqnarray} I[X;Y] & = & \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}}\\ & = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x)\sum_{y}{p(y|x)\log{\frac{p(y|x)}{p(y)}}}}}\\ & = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}}\\ & = & \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}\\ & = & I[T;Y] \end{eqnarray}\]
- We say that \(T=\tau(X)\) is **sufficient** for predicting \(Y\) (from \(X\))
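A numerical check that merging \(x\)'s with identical \(p(y|x)\) preserves the mutual information (the pmf here is invented for illustration):

```python
import numpy as np

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(pxy):
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

# Invented example where x = 0 and x = 1 share the same p(y|x)
p_y_given_x = np.array([[0.9, 0.1],   # x = 0
                        [0.9, 0.1],   # x = 1: same conditional as x = 0
                        [0.2, 0.8]])  # x = 2
px = np.array([0.3, 0.3, 0.4])
pxy = px[:, None] * p_y_given_x

tau = [0, 0, 1]                       # merge the equivalent x values
pty = np.zeros((2, 2))
for x, t in enumerate(tau):
    pty[t] += pxy[x]

# No predictive information is lost: T = tau(X) is sufficient
assert np.isclose(I(pty), I(pxy))
```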

- Suppose that \[ \rho(x) = \rho(x^{\prime}) ~ \Leftrightarrow ~ p(y|x) = p(y|x^{\prime}) \]
- Then \(R=\rho(X)\) will be sufficient
- And for any other sufficient \(T\), \(R=g(T)\) for some function \(g\)
- \(H[R] \leq H[T]\) for any other sufficient \(T\)
- \(R\) is **minimal sufficient**
- The way \(\rho\) divides up \(X\) is a **statistical relevance basis** (for predicting \(Y\)) (Salmon 1971, 1984; Shalizi and Crutchfield 2002)
- “Distinctions that make a difference”

- We can’t compress below \(R\) without losing some predictive information
- What if we’re willing to give up some predictive information?
- Pick \(\beta > 0\) and do \[ \max_{\tau}{I[Y;\tau(X)] - (1/\beta) H[\tau(X)]} \] or \[ \max_{\tau}{I[Y;\tau(X)] - (1/\beta) I[\tau(X);X]} \]
- In words: Our benefit is the predictive information, our cost is the memory of the features
- \(\beta\) is the price at which we trade predictive information against memory
- Lagrange multipliers: equivalent to maximizing \(I[Y;\tau(X)]\) with a constraint on \(I[\tau(X);X]\)

- \(T=\tau(X)\) is called the **bottleneck variable**
- Searching for this optimal \(\tau\) (given \(\beta\)) is the **information-bottleneck method** (Tishby, Pereira, and Bialek 1999)
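For a tiny discrete example, one can brute-force the bottleneck objective over all deterministic maps. Here `price` plays the role of the trade-off parameter (the exchange rate between bits of memory and bits of prediction); the pmf and the price are made up:

```python
import numpy as np
from itertools import product

def H(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def I(pxy):
    return H(pxy.sum(axis=1)) + H(pxy.sum(axis=0)) - H(pxy)

# Hypothetical joint pmf over X in {0,1,2,3} and Y in {0,1}
pxy = np.array([[0.20, 0.05],
                [0.18, 0.07],
                [0.05, 0.20],
                [0.05, 0.20]])

price = 0.1   # made-up price of memory, in bits of prediction per bit kept

def objective(tau):
    """Predictive information of T = tau(X) minus the price of its memory."""
    pty = np.zeros((2, 2))
    for x, t in enumerate(tau):
        pty[t] += pxy[x]
    return I(pty) - price * H(pty.sum(axis=1))

# Brute force over all deterministic maps tau: {0,1,2,3} -> {0,1}
best = max(product(range(2), repeat=4), key=objective)
# The winning tau groups x's with similar p(y|x) together
assert best[0] == best[1] and best[2] == best[3] and best[0] != best[2]
```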

- Sometimes, we can *explicitly* work out the bottleneck or even the sufficient statistic
- Sometimes, it gives us a benchmark to evaluate against
- What’s \(I[Y;T]\) for our favorite transformation \(T\) of the features?
- What’s \(H[T]\) as compared to \(H[X]\)? How much information loss are we tolerating to get that compression?

- Sometimes it just inspires how we select features, or synthesize new features

- Start with \(p\)-dimensional feature vector \(X\)
- Consider functions \(\tau\) which map \(X\) down to \(q\)-dimensional vectors \(T\)
- Maybe constrained, e.g., only functions linear in \(X\)

- Still want to maximize \(I[Y;T]\), or maybe a bottlenecked version
- If calculating \(I[Y;T]\) is too hard, or not quite right for the job, look at prediction error

- Start with \(p\)-dimensional feature vector \(X\)
- Consider functions \(\tau\) which map \(X\) down to \(q\)-dimensional vectors \(T\)
- Think about how we’d reconstruct \(X\) from \(T\), \(\hat{X}(T)\)
- Maximize \(I[X;\hat{X}(T)]\) with constraint on \(I[T;X]\)
- Again, can swap in minimizing prediction error if you want

- This is basically what we did in PCA!
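A minimal sketch of that PCA reading: compress centered data down to \(q=1\) dimension via the SVD, reconstruct, and see how much variance survives (the data-generating numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 200 points in 3-D lying near a single direction
z = rng.normal(size=(200, 1))
X = z @ np.array([[2.0, 1.0, -1.0]]) + 0.1 * rng.normal(size=(200, 3))
Xc = X - X.mean(axis=0)

# PCA via SVD; keep q = 1 component as the compressed feature T
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = Xc @ Vt[0]                  # 1-D compressed representation of X
Xhat = np.outer(T, Vt[0])       # reconstruction of (centered) X from T

# The 3 -> 1 compression keeps nearly all of the variance
var_kept = s[0]**2 / np.sum(s**2)
assert var_kept > 0.95
```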

`word2vec`

- Start with words being binary features, \(p\approx\) number of entries in the dictionary
- i.e., *which* word do we see at this position?

- Try to predict *this* word from neighboring words
- A huge problem!

- Map each word into a vector of dimension \(q \ll p\), maybe say \(q=700\)
- Adjust the mapping of words to vectors to maximize predictive information
- Experimentally, works better than PCA on the bag-of-word vectors
- But a *lot* more costly computationally (and in actual $$$)

- Start with high-dimensional feature vectors \(X\)
- Now map them to a *discrete* set of categories, say \(k\) of them
- \(I[\tau(X);X] \leq \log{k}\) (why?)
- Try to maximize \(I[X;\hat{X}(T)]\)
- Or, if that’s too hard, some measure of the error of recovering \(X\) from \(T\)

- This is **clustering**, and we’ll look at it for the next few lessons
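As a preview, a minimal Lloyd's-algorithm sketch (a stand-in for the clustering methods to come; data and parameters are made up). The cluster label is a \(k\)-valued feature of \(X\), so it can carry at most \(\log_2 k\) bits:

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up 2-D data: two well-separated blobs
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(4.0, 0.5, size=(100, 2))])

# Minimal Lloyd's algorithm; T = cluster label is a k-valued feature
k = 2
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(20):
    # Assign each point to its nearest center, then recompute the means
    T = np.argmin(((X[:, None, :] - centers[None])**2).sum(-1), axis=1)
    centers = np.array([X[T == t].mean(axis=0) if np.any(T == t) else centers[t]
                        for t in range(k)])

# A k-valued label can carry at most log2(k) bits about X
p_t = np.bincount(T, minlength=k) / len(T)
ent = -np.sum(p_t[p_t > 0] * np.log2(p_t[p_t > 0]))
assert ent <= np.log2(k) + 1e-12
```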

- \(I[X;Y]\) tells us about how well *any* method using \(X\) can predict \(Y\)
- Sufficient statistics maximally compress \(X\) without losing predictive information
- The information bottleneck method tells us about how to trade off compression against predictive information
- Dimension reduction \(\approx\) looking for a continuous bottleneck
- Clustering \(\approx\) looking for a discrete bottleneck

Replace sums with integrals as needed.

\(X\) and \(Y\) both continuous: \[\begin{eqnarray} I[X;Y] & = & \int{p(x,y) \left(\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)}\right) dx dy}\\ & = & \int{p(x) \left(\int{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)} dy}\right) dx} \end{eqnarray}\] with \(p\) being the pdf everywhere

\(X\) continuous, \(Y\) discrete: \[ I[X;Y] = \int{p(x) \left(\sum_{y}{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)}}\right) dx} \] with \(p(x)\) being the pdf but \(p(y|x)\), \(p(y)\) being the (conditional) pmf

- In any case:
- \(I[X;Y] = I[Y;X]\)
- \(I[X;Y] \geq 0\)
- \(I[X;Y] = 0\) iff \(X\) and \(Y\) are independent
- \(I[f(X);Y] \leq I[X;Y]\), with equality iff \(f(x) = f(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})\) (i.e., sufficiency)

The bound holds because, if \(M=1\), we know \(Y\neq \hat{Y}\), and there are at most \(|\mathcal{Y}|-1\) values \(Y\) could have

Finally, remember that \(H[Y|\hat{Y}(X)] \geq H[Y|X]\)

This result is called **Fano’s inequality**

- Originally about recovering the true message (\(Y\)) from a noisy signal (\(X\))
- For \(Y\) uniformly distributed on \(\mathcal{Y}\), Fano’s inequality implies \[
\Pr(M=1) \geq 1 - \frac{I[X;Y] + \log{2}}{\log{|\mathcal{Y}|}}
\]
- (Can you show this?)
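- One route (a sketch, using pieces already on this slide): for uniform \(Y\), \(H[Y|X] = \log{|\mathcal{Y}|} - I[X;Y]\); since \(M\) is binary, \(H[M] \leq \log{2}\); and \(\log{(|\mathcal{Y}|-1)} \leq \log{|\mathcal{Y}|}\). Plugging these into Fano’s inequality, \[ \log{|\mathcal{Y}|} - I[X;Y] \leq \log{2} + \Pr(M=1)\log{|\mathcal{Y}|} \] and re-arranging gives the bound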

- Fano’s inequality turns out to have many, many uses in prediction and estimation (Scarlett and Cevher, n.d.)
- Prediction: Obvious
- Estimation: Think of the parameter as the message (\(Y\)) we’re trying to recover from the noisy signal of the data set (\(X\))

- Initial bottleneck problem: \[
\max_{\tau}{I[Y;\tau(X)] - (1/\beta) H[\tau(X)]}
\]
- Willing to give up \(1\) bit of predictive information if it saves at least \(\beta\) bits of memory
- Lagrange multiplier form: \[ \max_{\tau}{I[Y;\tau(X)]} ~ \mathrm{subject\ to}~ H[\tau(X)] \leq c \]

- Equivalently: compressing by 1 bit is only worthwhile if it reduces predictive information by \(\leq 1/\beta\) bits \[ \min_{\tau}{H[\tau(X)] - \beta I[Y;\tau(X)]} \]
- As \(\beta \rightarrow\infty\), we become less and less willing to give up any predictive information
- Lagrange-multiplier form: \[ \min_{\tau}{H[\tau(X)]} ~ \mathrm{subject\ to} ~ I[Y;\tau(X)] \geq c^{\prime} \]

- The limit \(\beta=\infty\) is the minimal sufficient statistic / statistical relevance basis: \[ \min_{\tau}{H[\tau(X)]} ~\mathrm{subject\ to} ~ I[\tau(X);Y]=I[X;Y] \]

`word2vec`

in a little more detail:

- Each word \(w\) corresponds to a vector \(v_w\)
- Each “context” = window of \(k\) words around the focal word corresponds to a vector \(v_c\)
- Try to maximize \[
\max_{v_c, v_w}{\sum_{\mathrm{word}~ w ~\mathrm{appears\ in\ context} ~c}{\log{\frac{e^{v_c \cdot v_w}}{\sum_{c^{\prime}}{e^{v_{c^{\prime}} \cdot v_w}}}}}}
\]
- Words which appear in similar contexts should get similar vectors
- The actual maximization is too hard, so the `word2vec` software does something easier but related (Goldberg and Levy 2014) …
- … which turns out to be (implicitly) factorizing a matrix of pointwise mutual informations between words and contexts (Levy and Goldberg 2014)
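A literal (and wildly unscalable) transcription of the softmax objective above, with a tiny made-up vocabulary, context set, and corpus; the real software avoids exactly this normalizing sum:

```python
import numpy as np

rng = np.random.default_rng(4)
V, C, q = 6, 4, 3    # made-up vocabulary size, context count, embedding dim
v_w = rng.normal(size=(V, q))    # word vectors
v_c = rng.normal(size=(C, q))    # context vectors
corpus = [(0, 1), (2, 3), (0, 1), (4, 0)]   # invented (word, context) pairs

def log_likelihood(v_w, v_c, corpus):
    """Sum over observed (w, c) pairs of
    log[ exp(v_c . v_w) / sum_{c'} exp(v_c' . v_w) ]."""
    scores = v_w @ v_c.T          # (V, C) table of all dot products
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return sum(log_probs[w, c] for w, c in corpus)

# Each term is the log of a softmax probability, hence negative
assert log_likelihood(v_w, v_c, corpus) < 0
```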

Goldberg, Yoav, and Omer Levy. 2014. “`word2vec` Explained: Deriving Mikolov et al.’s Negative-Sampling Word Embedding Method.” Electronic preprint, arxiv:1402.3722. https://arxiv.org/abs/1402.3722.

Levy, Omer, and Yoav Goldberg. 2014. “Neural Word Embedding as Implicit Matrix Factorization.” In *Advances in Neural Information Processing Systems 27 [NIPS 2014]*, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2177–85. Curran Associates. http://papers.nips.cc/paper/5477-neural-word-embedding-as.

Salmon, Wesley C. 1971. *Statistical Explanation and Statistical Relevance*. Pittsburgh: University of Pittsburgh Press.

———. 1984. *Scientific Explanation and the Causal Structure of the World*. Princeton: Princeton University Press.

Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In *Information-Theoretic Methods in Data Science*, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. “Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” *Advances in Complex Systems* 5:91–95. http://arxiv.org/abs/nlin.AO/0006025.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In *Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing*, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.