# Information Theory III — Information for Prediction

Lecture 17 (23 October 2019)

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}$

# Information for Prediction

• We’ve seen the basic ideas of information theory:
• How much information/uncertainty is in one random variable ($$H[X]$$)
• How much information/uncertainty is in two RVs ($$H[X,Y]$$)
• How much information/uncertainty is left after conditioning ($$H[Y|X]$$)
• How much is uncertainty reduced by conditioning ($$H[Y] - H[Y|X]$$)
• How much information does one variable give about another ($$I[X;Y]=H[Y]-H[Y|X]$$)
• And conditional versions of all these
• We’ve seen how to use this for feature selection
• Rank features $$X_1, \ldots X_p$$ by $$H[Y|X_i]$$ or $$I[X_i;Y]$$
• Picking collections of features by evaluating $$I[Y; X_1, \ldots X_q]$$
• Now: What would success look like?
• Good features
• Good synthetic features

# Predictive Information

• We observe $$X$$ (possibly multivariate) and want to predict $$Y$$
• Conditioning on $$X$$ gives us some amount of information about $$Y$$ $I[X;Y] = H[Y] - H[Y|X]$
• Our prediction is a function of $$X$$, say $$\hat{Y}(X)$$
• Conditioning on $$\hat{Y}$$ never gives us more information: $I[\hat{Y};Y] = H[Y] - H[Y|\hat{Y}] \leq I[X;Y]$
• Because: (i) Conditioning never increases entropy, $$H[Y|X, f(X)] \leq H[Y|f(X)]$$
• But (ii) Conditioning on $$X$$ and $$f(X)$$ is the same as conditioning on $$X$$, so $$H[Y|X, f(X)] = H[Y|X]$$
• So $$H[Y|f(X)] \geq H[Y|f(X), X] = H[Y|X]$$
• This is (one form of) the data-processing inequality
• $$I[X;Y]$$ is the predictive information $$X$$ has about $$Y$$ and limits how good any prediction using $$X$$ can be
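The data-processing inequality can be checked numerically on a small discrete example (all the numbers below are made up for illustration, and the function name is mine):

```python
import numpy as np

def mutual_info(pxy):
    """I[X;Y] in bits from a joint pmf matrix (rows = x, columns = y)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask] / (px @ py)[mask])).sum())

# A made-up joint distribution over 4 x-values and 2 y-values
pxy = np.array([[0.25, 0.05],
                [0.05, 0.25],
                [0.10, 0.05],
                [0.05, 0.20]])

# Coarsen X with a function f: merge x-values {0,1} and {2,3}
p_fxy = np.vstack([pxy[:2].sum(axis=0), pxy[2:].sum(axis=0)])

# Data-processing inequality: I[f(X);Y] <= I[X;Y]
assert mutual_info(p_fxy) <= mutual_info(pxy) + 1e-12
```

Here the merge throws information away, so the inequality is strict; merging only x-values with identical $$p(y|x)$$ would make it an equality (sufficiency, below).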

# Predictive Information vs. Accuracy

• For discrete $$Y$$, say $$M$$ (for “mistake”) $$=1$$ if $$Y\neq \hat{Y}$$ and $$=0$$ if $$Y=\hat{Y}$$
• Then one can show (see backup) that $H[Y|X] \leq H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}$
• Perfect classification ($$\Pr(M=1) = 0$$) implies $$H[Y|X] = 0$$
• RHS is an increasing function of $$\Pr(M=1)$$ (up to $$\Pr(M=1) = (|\mathcal{Y}|-1)/|\mathcal{Y}|$$), so higher $$H[Y|X] \Rightarrow$$ higher inaccuracy, no matter what method we use

# Predictive Information vs. Accuracy

• Upshot: $$H[Y|X]$$ gives a lower bound on the error rate of any classifier (equivalently, an upper bound on its accuracy)
• Even if we just care about accuracy, we should want to maximize information!
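The bound can be turned around numerically: given $$H[Y|X]$$ (in bits) and $$|\mathcal{Y}|$$, find the smallest error rate consistent with it. A sketch by grid search (the function names are mine):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_error_lower_bound(H_cond, n_classes):
    """Smallest error rate p satisfying H[Y|X] <= h2(p) + p*log2(n-1);
    any classifier must be wrong at least this often."""
    grid = np.linspace(0.0, 1.0 - 1.0 / n_classes, 10001)
    for p in grid:
        if h2(p) + p * np.log2(n_classes - 1) >= H_cond:
            return float(p)
    return float(grid[-1])
```

For instance, `fano_error_lower_bound(0.0, 4)` is 0 (perfect classification is not ruled out), and the bound grows as $$H[Y|X]$$ grows.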

# Reducing the Features

• Suppose we look not at $$X$$ but $$T=\tau(X)$$
• Could be just picking out some dimensions of a multivariate $$X$$
• Could be applying transformations
• Could be creating new features out of old ones (as in PCA)
• Data-processing inequality says: $I[T;Y] \leq I[X;Y]$
• When will $$I[T;Y] = I[X;Y]$$?

# Sufficiency

• Re-write mutual informations: $I[X;Y] = \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}}$ and $I[T;Y] = \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}$
• Suppose that $$\tau(x) = \tau(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})$$
• Then for any $$x$$ with $$\tau(x) = t$$, $p(y|t) = p(y|x)$
• In more symbols, if $$x\in\tau^{-1}(t)$$ then $$p(y|t) = p(y|x)$$
• Why is this true?
• Use this to group terms in the $$x$$ sum for $$I[X;Y]$$: $\begin{eqnarray} I[X;Y] & = & \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}}\\ & = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x)\sum_{y}{p(y|x)\log{\frac{p(y|x)}{p(y)}}}}}\\ & = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}}\\ & = & \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}\\ & = & I[T;Y] \end{eqnarray}$
• We say that $$T=\tau(X)$$ is sufficient for predicting $$Y$$ (from $$X$$)
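The grouping argument above can be verified numerically: merging exactly the x-values that share $$p(y|x)$$ loses no predictive information (made-up numbers, function name mine):

```python
import numpy as np

def mutual_info(pab):
    """Mutual information in bits from a joint pmf matrix."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])).sum())

# Made-up example: x-values 0,1 share p(y|x), as do x-values 2,3
px = np.array([0.1, 0.2, 0.3, 0.4])
py_given_x = np.array([[0.9, 0.1],
                       [0.9, 0.1],
                       [0.2, 0.8],
                       [0.2, 0.8]])
pxy = px[:, None] * py_given_x

# tau merges exactly the x-values with identical p(y|x), so T is sufficient
pty = np.vstack([pxy[:2].sum(axis=0), pxy[2:].sum(axis=0)])

assert np.isclose(mutual_info(pty), mutual_info(pxy))  # I[T;Y] = I[X;Y]
```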

# Sufficiency cont’d.

• Suppose that $\rho(x) = \rho(x^{\prime}) ~ \Leftrightarrow ~ p(y|x) = p(y|x^{\prime})$
• Then $$R=\rho(X)$$ will be sufficient
• And for any other sufficient $$T$$, $$R=g(T)$$ for some function $$g$$
• $$H[R] \leq H[T]$$ for any other sufficient $$T$$
• $$R$$ is minimal sufficient
• The way $$\rho$$ divides up $$X$$ is a statistical relevance basis (for predicting $$Y$$) (Salmon 1971, 1984; Shalizi and Crutchfield 2002)
• “Distinctions that make a difference”
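For a finite problem where we know the conditional distributions, $$\rho$$ can be built directly by labeling each $$x$$ with its $$p(y|x)$$ (a toy sketch with made-up distributions; the variable names are mine):

```python
import numpy as np

# Made-up conditional distributions p(y|x) for 5 values of x
py_given_x = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.9, 0.1],
                       [0.5, 0.5],
                       [0.2, 0.8]])

# rho labels each x by its conditional distribution: x and x' get the
# same label iff p(y|x) = p(y|x') (rounding guards against float noise)
keys = [tuple(np.round(row, 10)) for row in py_given_x]
labels = {k: i for i, k in enumerate(dict.fromkeys(keys))}
rho = [labels[k] for k in keys]

# Five x-values collapse to three relevance classes
assert rho == [0, 1, 0, 2, 1]
```

These label classes are exactly the "distinctions that make a difference": any finer partition wastes memory, any coarser one loses predictive information.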

# From Sufficiency to the Information Bottleneck

• We can’t compress below $$R$$ without losing some predictive information
• What if we’re willing to give up some predictive information?
• Pick $$\beta > 0$$ and do $\max_{\tau}{I[Y;\tau(X)] - \beta H[\tau(X)]}$ or $\max_{\tau}{I[Y;\tau(X)] - \beta I[\tau(X);X]}$
• In words: Our benefit is the predictive information, our cost is the memory of the features
• $$\beta$$ is the price at which we trade predictive information against memory
• Lagrange multipliers: equivalent to maximizing $$I[Y;\tau(X)]$$ with a constraint on $$I[\tau(X);X]$$
• $$T=\tau(X)$$ is called the bottleneck variable
• Searching for this optimal $$\tau$$ (given $$\beta$$) is the information-bottleneck method (Tishby, Pereira, and Bialek 1999)
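For a tiny discrete problem, the bottleneck search can be done by brute force over deterministic maps $$\tau$$ (a toy sketch with made-up numbers, using the $$I[Y;T] - \beta I[T;X]$$ form; the real method optimizes over stochastic maps):

```python
import itertools
import numpy as np

def mutual_info(pab):
    """Mutual information in bits from a joint pmf matrix."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])).sum())

# Made-up toy problem: 4 x-values, 2 y-values
px = np.array([0.1, 0.2, 0.3, 0.4])
py_given_x = np.array([[0.9, 0.1],
                       [0.8, 0.2],
                       [0.2, 0.8],
                       [0.1, 0.9]])
pxy = px[:, None] * py_given_x

beta = 0.3
best_score, best_assign = -np.inf, None
# Exhaustive search over deterministic maps tau: {0,1,2,3} -> {0,1}
for assign in itertools.product(range(2), repeat=4):
    ptx = np.zeros((2, 4))  # joint p(t, x)
    pty = np.zeros((2, 2))  # joint p(t, y)
    for x, t in enumerate(assign):
        ptx[t, x] = px[x]
        pty[t] += pxy[x]
    score = mutual_info(pty) - beta * mutual_info(ptx)
    if score > best_score:
        best_score, best_assign = score, assign
```

The winning $$\tau$$ groups the x-values with similar $$p(y|x)$$, as the sufficiency discussion predicts.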

# Why This Matters

• Sometimes, we can explicitly work out the bottleneck or even the sufficient statistic
• Sometimes, it gives us a benchmark to evaluate against
• What’s $$I[Y;T]$$ for our favorite transformation $$T$$ of the features?
• What’s $$H[T]$$ as compared to $$H[X]$$? How much information loss are we tolerating to get that compression?
• Sometimes it just inspires how we select features, or synthesize new features

# Dimension Reduction with a Target Variable

• Start with $$p$$-dimensional feature vector $$X$$
• Consider functions $$\tau$$ which map $$X$$ down to $$q$$-dimensional vectors $$T$$
• Maybe constrained, e.g., only functions linear in $$X$$
• Still want to maximize $$I[Y;T]$$, or maybe a bottlenecked version
• If calculating $$I[Y;T]$$ is too hard, or not quite right for the job, look at prediction error

# Dimension Reduction without a Target Variable

• Start with $$p$$-dimensional feature vector $$X$$
• Consider functions $$\tau$$ which map $$X$$ down to $$q$$-dimensional vectors $$T$$
• Think about how we’d reconstruct $$X$$ from $$T$$, $$\hat{X}(T)$$
• Maximize $$I[X;\hat{X}(T)]$$ with constraint on $$I[T;X]$$
• Again, can swap in minimizing prediction error if you want
• This is basically what we did in PCA!

# word2vec

• Start with words being binary features, $$p\approx$$ number of entries in the dictionary
• i.e., which word do we see at this position?
• Try to predict this word from neighboring words
• A huge problem!
• Map each word into a vector of dimension $$q \ll p$$, maybe say $$q=700$$
• Adjust the mapping of words to vectors to maximize predictive information
• Experimentally, works better than PCA on the bag-of-word vectors
• But a lot more costly computationally (and in actual \$)

# Clustering

• Start with high-dimensional feature vectors $$X$$
• Now map them to a discrete set of categories, say $$k$$ of them
• $$I[\tau(X);X] \leq \log{k}$$ (why?)
• Try to maximize $$I[X;\hat{X}(T)]$$
• Or, if that’s too hard, some measure of the error of recovering $$X$$ from $$T$$
• This is clustering, and we’ll look at it for the next few lessons
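The "why?" above: $$T$$ takes at most $$k$$ values, so $$I[\tau(X);X] \leq H[\tau(X)] \leq \log{k}$$. A quick numerical check (toy sizes, random made-up distribution and clustering):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 12, 3                        # 12 x-values, 3 clusters (toy sizes)
px = rng.dirichlet(np.ones(n))      # an arbitrary pmf over X
assign = rng.integers(k, size=n)    # a deterministic clustering tau: x -> t

# Joint p(t, x) when T = tau(X)
ptx = np.zeros((k, n))
ptx[assign, np.arange(n)] = px

def mutual_info(pab):
    """Mutual information in bits from a joint pmf matrix."""
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    mask = pab > 0
    return float((pab[mask] * np.log2(pab[mask] / (pa @ pb)[mask])).sum())

# With only k categories, I[T;X] = H[T] <= log2(k)
assert mutual_info(ptx) <= np.log2(k) + 1e-9
```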

# Summing up

• $$I[X;Y]$$ tells us about how well any method using $$X$$ can predict $$Y$$
• Sufficient statistics maximally compress $$X$$ without losing predictive information
• The information bottleneck method tells us about how to trade off compression against predictive information
• Dimension reduction $$\approx$$ looking for a continuous bottleneck
• Clustering $$\approx$$ looking for a discrete bottleneck

# Backup: Information for Continuous Variables

Replace sums with integrals as needed.

• $$X$$ and $$Y$$ both continuous: $\begin{eqnarray} I[X;Y] & = & \int{p(x,y) \left(\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)}\right) dx dy}\\ & = & \int{p(x) \left(\int{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)} dy}\right) dx} \end{eqnarray}$ with $$p$$ being the pdf everywhere

• $$X$$ continuous, $$Y$$ discrete: $I[X;Y] = \int{p(x) \left(\sum_{y}{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)}}\right) dx}$ with $$p(x)$$ being the pdf but $$p(y|x)$$, $$p(y)$$ being the (conditional) pmf

• In any case:
• $$I[X;Y] = I[Y;X]$$
• $$I[X;Y] \geq 0$$
• $$I[X;Y] = 0$$ iff $$X$$ and $$Y$$ are independent
• $$I[f(X);Y] \leq I[X;Y]$$, with equality iff $$f(x) = f(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})$$ (i.e., sufficiency)

# Backup: Conditional entropy and classification accuracy

$\begin{eqnarray} H[M|Y,\hat{Y}] & = & 0\\ H[Y|\hat{Y}] & = & H[Y|\hat{Y}] + H[M|Y,\hat{Y}]\\ & = & H[Y,M|\hat{Y}]\\ & = & H[M|\hat{Y}] + H[Y|M,\hat{Y}]\\ & = & H[M|\hat{Y}] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\ & \leq & H[M] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\ & = & H[M] + \Pr(M=0)\cdot 0 + \Pr(M=1) H[Y|M=1,\hat{Y}]\\ & = & H[M] + \Pr(M=1) H[Y|M=1,\hat{Y}]\\ & \leq & H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)} \end{eqnarray}$

because if $$M=1$$, we know $$Y\neq \hat{Y}$$ and there are at most $$|\mathcal{Y}|-1$$ values $$Y$$ could have

Finally, remember that $$H[Y|\hat{Y}(X)] \geq H[Y|X]$$

This result is called Fano’s inequality

• Originally about recovering the true message ($$Y$$) from a noisy signal ($$X$$)
• For $$Y$$ uniformly distributed on $$\mathcal{Y}$$, Fano’s inequality implies $\Pr(M=1) \geq 1 - \frac{I[X;Y] + \log{2}}{\log{|\mathcal{Y}|}}$
• (Can you show this?)
• Fano’s inequality turns out to have many, many uses in prediction and estimation (Scarlett and Cevher, n.d.)
• Prediction: Obvious
• Estimation: Think of the parameter as the message ($$Y$$) we’re trying to recover from the noisy signal of the data set ($$X$$)
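Returning to the uniform-$$Y$$ claim above, a sketch of the derivation (using $$H[M] \leq \log{2}$$ and $$\log{(|\mathcal{Y}|-1)} \leq \log{|\mathcal{Y}|}$$):

$\begin{eqnarray} \log{|\mathcal{Y}|} - I[X;Y] & = & H[Y] - I[X;Y] = H[Y|X]\\ & \leq & H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}\\ & \leq & \log{2} + \Pr(M=1)\log{|\mathcal{Y}|} \end{eqnarray}$

and re-arranging gives $$\Pr(M=1) \geq 1 - \frac{I[X;Y] + \log{2}}{\log{|\mathcal{Y}|}}$$.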

# Backup: Sufficiency and the Bottleneck, Take 2 (Shalizi and Crutchfield 2002)

• Initial bottleneck problem: $\max_{\tau}{I[Y;\tau(X)] - \beta H[\tau(X)]}$
• Willing to give up $$1$$ bit of predictive information if it saves at least $$1/\beta$$ bits of memory
• Constrained form (with $$\beta$$ as the Lagrange multiplier): $\max_{\tau}{I[Y;\tau(X)]} ~ \mathrm{subject\ to}~ H[\tau(X)] \leq c$
• Equivalently: compressing by 1 bit is only worthwhile if it costs at most $$\beta$$ bits of predictive information $\min_{\tau}{H[\tau(X)] - (1/\beta) I[Y;\tau(X)]}$
• As $$\beta \rightarrow 0$$, we become less and less willing to give up any predictive information
• Constrained form: $\min_{\tau}{H[\tau(X)]} ~ \mathrm{subject\ to} ~ I[Y;\tau(X)] \geq c^{\prime}$
• The limit $$\beta \rightarrow 0$$ gives the minimal sufficient statistic / statistical relevance basis: $\min_{\tau}{H[\tau(X)]} ~\mathrm{subject\ to} ~ I[\tau(X);Y]=I[X;Y]$

# Backup: word2vec in a little more detail

• Each word $$w$$ corresponds to a vector $$v_w$$
• Each “context” = window of $$k$$ words around the focal word corresponds to a vector $$v_c$$
• Try to maximize $\max_{v_c, v_w}{\sum_{\mathrm{word}~ w ~\mathrm{appears\ in\ context} ~c}{\log{\frac{e^{v_c \cdot v_w}}{\sum_{c^{\prime}}{e^{v_{c^{\prime}} \cdot v_w}}}}}}$
• Words which appear in similar contexts should get similar vectors
• The actual maximization is too hard, so the word2vec software does something easier but related (Goldberg and Levy 2014)
• … which turns out to be (implicitly) factorizing a matrix of pointwise mutual informations between words and contexts (Levy and Goldberg 2014)
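To make the objective concrete, here is a toy evaluation of it (the sizes, random vectors, and pair list are all made up; real vocabularies and dimensions are vastly larger, and the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
q, n_words, n_ctx = 3, 5, 4             # toy sizes; real q is in the hundreds
v_w = rng.normal(size=(n_words, q))     # one vector per word
v_c = rng.normal(size=(n_ctx, q))       # one vector per context

def objective(pairs):
    """Sum over observed (word, context) pairs of the log softmax
    probability of the context given the word, as on the slide."""
    scores = v_c @ v_w.T                        # scores[c, w] = v_c . v_w
    log_z = np.log(np.exp(scores).sum(axis=0))  # log sum_{c'} exp(v_c' . v_w)
    return float(sum(scores[c, w] - log_z[w] for w, c in pairs))

pairs = [(0, 1), (2, 3), (0, 2)]   # hypothetical (word, context) co-occurrences
val = objective(pairs)             # each term is a log-probability, so val < 0
```

Gradient ascent on this objective pulls $$v_w$$ toward the contexts it appears in, which is why words sharing contexts end up with similar vectors.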

# References

Goldberg, Yoav, and Omer Levy. 2014. “word2vec Explained: Deriving Mikolov et Al.’s Negative-Sampling Word Embedding Method.” Electronic preprint, arxiv:1402.3722. https://arxiv.org/abs/1402.3722.

Levy, Omer, and Yoav Goldberg. 2014. “Neural Word Embedding as Implicit Matrix Factorization.” In Advances in Neural Information Processing Systems 27 [NIPS 2014], edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2177–85. Curran Associates. http://papers.nips.cc/paper/5477-neural-word-embedding-as.

Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.

———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Scarlett, Jonathan, and Volkan Cevher. n.d. “An Introductory Guide to Fano’s Inequality with Applications in Statistical Estimation.” In Information-Theoretic Methods in Data Science, edited by Yonina Eldar and Miguel Rodrigues. Cambridge, England: Cambridge University Press. https://arxiv.org/abs/1901.00555.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. “Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction.” Advances in Complex Systems 5:91–95. http://arxiv.org/abs/nlin.AO/0006025.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.