---
title: Information Theory III --- Information for Prediction
author: 36-462/662, Data Mining, Fall 2019
date: Lecture 17 (23 October 2019)
bibliography: locusts.bib
output: slidy_presentation
---

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}$

## Information for Prediction

- We've seen the basic ideas of information theory:
    + How much information/uncertainty is in one random variable ($H[X]$)
    + How much information/uncertainty is in two RVs ($H[X,Y]$)
    + How much information/uncertainty is left after conditioning ($H[Y|X]$)
    + How much is uncertainty reduced by conditioning ($H[Y] - H[Y|X]$)
    + How much information does one variable give about another ($I[X;Y]=H[Y]-H[Y|X]$)
    + And conditional versions of all these
- We've seen how to use this for feature selection
    + Rank features $X_1, \ldots X_p$ by $H[Y|X_i]$ or $I[X_i;Y]$
    + Pick collections of features by evaluating $I[Y; X_1, \ldots X_q]$
- Now: What would success look like?
    + Good features
    + Good _synthetic_ features

## Predictive Information

- We observe $X$ (possibly multivariate) and want to predict $Y$
- Conditioning on $X$ gives us some amount of information about $Y$
$I[X;Y] = H[Y] - H[Y|X]$
- Our prediction is a function of $X$, say $\hat{Y}(X)$
- Conditioning on $\hat{Y}$ never gives us _more_ information:
$I[\hat{Y};Y] = H[Y] - H[Y|\hat{Y}] \leq I[X;Y]$
    + Because: (i) Conditioning never increases entropy, so $H[Y|X, f(X)] \leq H[Y|f(X)]$
    + But (ii) conditioning on $X$ and $f(X)$ is the same as conditioning on $X$ alone, so $H[Y|X, f(X)] = H[Y|X]$
    + So $H[Y|f(X)] \geq H[Y|f(X), X] = H[Y|X]$
    + This is (one form of) the **data-processing inequality**
- $I[X;Y]$ is the **predictive information** $X$ has about $Y$, and limits how good _any_ prediction using $X$ can be

## Predictive Information vs. Accuracy

- For discrete $Y$, say $M$ (for "mistake") $=1$ if $Y\neq \hat{Y}$ and $=0$ if $Y=\hat{Y}$
- Then one can show (see backup) that
$H[Y|X] \leq H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}$
    + Perfect classification ($\Pr(M=1) = 0$) requires $H[Y|X] = 0$
    + The RHS is an increasing function of $\Pr(M=1)$, so higher $H[Y|X]$ $\Rightarrow$ higher error rate, no matter what method we use

```{r, echo=FALSE}
curve(-x*log(x, base=2)-(1-x)*log(1-x, base=2), from=0, to=0.5,
      xlab="Desired classification error rate",
      ylab="Maximum allowable H[Y|X]", ylim=c(0, log(10, base=2)))
curve(-x*log(x, base=2)-(1-x)*log(1-x, base=2)+x, add=TRUE, lty="dashed")
curve(-x*log(x, base=2)-(1-x)*log(1-x, base=2)+x*log(9, base=2), add=TRUE,
      lty="dotted")
legend("topleft", legend=c(expression(group("|",Y,"|")==2),
                           expression(group("|",Y,"|")==3),
                           expression(group("|",Y,"|")==10)),
       lty=c("solid","dashed","dotted"), cex=0.75)
```

## Predictive Information vs. Accuracy

- Upshot: $H[Y|X]$ puts a lower bound on the classification error rate, no matter what method we use
- Even if we just care about accuracy, we should want to maximize information!

## Reducing the Features

- Suppose we look not at $X$ but at $T=\tau(X)$
    + Could be just picking out some dimensions of a multivariate $X$
    + Could be applying transformations
    + Could be creating new features out of old ones (as in PCA)
- Data-processing inequality says: $I[T;Y] \leq I[X;Y]$
- When will $I[T;Y] = I[X;Y]$?

## Sufficiency

- Re-write the mutual informations:
$I[X;Y] = \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}}$
and
$I[T;Y] = \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}$
- Suppose that $\tau(x) = \tau(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})$
- Then for any $x$ with $\tau(x) = t$, $p(y|t) = p(y|x)$
    + In more symbols, if $x\in\tau^{-1}(t)$ then $p(y|t) = p(y|x)$
    + Why is this true?
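As a numerical sanity check (a toy R example with made-up probabilities, not from the lecture): when $p(y|x)$ is the same for all $x$ in a group, collapsing the group loses no mutual information.

```{r}
# Entropy in bits of a pmf, dropping zero-probability cells
entropy <- function(p) { p <- p[p > 0]; -sum(p * log(p, base=2)) }
# I[A;B] from a joint pmf (A on rows, B on columns), via I = H[A]+H[B]-H[A,B]
mutual.info <- function(joint) {
  entropy(rowSums(joint)) + entropy(colSums(joint)) - entropy(as.vector(joint))
}
p.x <- c(0.1, 0.4, 0.2, 0.3)
# p(y|x) is identical for x=1,2 and for x=3,4, so tau(x) = ceiling(x/2) qualifies
p.y.given.x <- rbind(c(0.9, 0.1), c(0.9, 0.1), c(0.3, 0.7), c(0.3, 0.7))
joint.xy <- p.x * p.y.given.x   # joint pmf: each row of p(y|x) scaled by p(x)
joint.ty <- rbind(colSums(joint.xy[1:2, ]), colSums(joint.xy[3:4, ]))
all.equal(mutual.info(joint.xy), mutual.info(joint.ty))  # TRUE: tau is sufficient
```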
- Use this to group terms in the $x$ sum for $I[X;Y]$:
\begin{eqnarray}
I[X;Y] & = & \sum_{x}{p(x) \sum_{y}{p(y|x) \log{\frac{p(y|x)}{p(y)}}}}\\
& = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x)\sum_{y}{p(y|x)\log{\frac{p(y|x)}{p(y)}}}}}\\
& = & \sum_{t}{\sum_{x \in \tau^{-1}(t)}{p(x) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}}\\
& = & \sum_{t}{p(t) \sum_{y}{p(y|t)\log{\frac{p(y|t)}{p(y)}}}}\\
& = & I[T;Y]
\end{eqnarray}
- We say that $T=\tau(X)$ is **sufficient** for predicting $Y$ (from $X$)

## Sufficiency cont'd.

- Suppose that $\rho(x) = \rho(x^{\prime}) ~ \Leftrightarrow ~ p(y|x) = p(y|x^{\prime})$
- Then $R=\rho(X)$ will be sufficient
- And for any other sufficient $T$, $R=g(T)$ for some function $g$
- $H[R] \leq H[T]$ for any other sufficient $T$
- $R$ is **minimal sufficient**
- The way $\rho$ divides up $X$ is a **statistical relevance basis** (for predicting $Y$) [@Salmon-1971; @Salmon-1984; @Bottleneck-note]
    + "Distinctions that make a difference"

![](prediction-process-histories-partitioned-into-causal-states.png){width=50%}

## From Sufficiency to the Information Bottleneck

- We can't compress below $R$ without losing some predictive information
- What if we're willing to give up some predictive information?
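When the conditional distributions are known exactly, the relevance basis $\rho$ can be read off a table of $p(y|x)$ by grouping the $x$'s with identical rows; a minimal R sketch, with hypothetical probabilities:

```{r}
# Hypothetical p(y|x): x = 1 and x = 3 predict Y identically
p.y.given.x <- rbind(c(0.8, 0.2), c(0.5, 0.5), c(0.8, 0.2), c(0.1, 0.9))
# Label each x by which distinct conditional distribution it carries
row.as.key <- apply(p.y.given.x, 1, paste, collapse=",")
rho <- match(row.as.key, unique(row.as.key))
rho  # 1 2 1 3: x = 1 and x = 3 fall in the same relevance class
```

With estimated rather than exact conditional distributions, this exact matching would have to be replaced by some approximate grouping.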
- Pick $\beta > 0$ and do
$\max_{\tau}{I[Y;\tau(X)] - \beta H[\tau(X)]}$
or
$\max_{\tau}{I[Y;\tau(X)] - \beta I[\tau(X);X]}$
- In words: Our benefit is the predictive information, our cost is the memory of the features
    + $\beta$ is the price at which we trade predictive information against memory
    + Lagrange multipliers: equivalent to maximizing $I[Y;\tau(X)]$ with a constraint on $I[\tau(X);X]$
- $T=\tau(X)$ is called the **bottleneck variable**
- Searching for this optimal $\tau$ (given $\beta$) is the **information-bottleneck method** [@Tishby-Pereira-Bialek-bottleneck]

## Why This Matters

- Sometimes, we can _explicitly_ work out the bottleneck, or even the sufficient statistic
- Sometimes, it gives us a benchmark to evaluate against
    + What's $I[Y;T]$ for our favorite transformation $T$ of the features?
    + What's $H[T]$ as compared to $H[X]$? How much information loss are we tolerating to get that compression?
- Sometimes it just inspires how we select features, or synthesize new features

## Dimension Reduction with a Target Variable

- Start with a $p$-dimensional feature vector $X$
- Consider functions $\tau$ which map $X$ down to $q$-dimensional vectors $T$
    + Maybe constrained, e.g., only functions linear in $X$
- Still want to maximize $I[Y;T]$, or maybe a bottlenecked version
- If calculating $I[Y;T]$ is too hard, or not quite right for the job, look at prediction error

## Dimension Reduction without a Target Variable

- Start with a $p$-dimensional feature vector $X$
- Consider functions $\tau$ which map $X$ down to $q$-dimensional vectors $T$
- Think about how we'd reconstruct $X$ from $T$, $\hat{X}(T)$
- Maximize $I[X;\hat{X}(T)]$ with a constraint on $I[T;X]$
    + Again, can swap in minimizing prediction error if you want
- This is basically what we did in PCA!

## word2vec

- Start with words being binary features, $p\approx$ number of entries in the dictionary
    + i.e., _which_ word do we see at this position?
- Try to predict _this_ word from neighboring words
    + A huge problem!
- Map each word into a vector of dimension $q \ll p$, maybe say $q=700$
- Adjust the mapping of words to vectors to maximize predictive information
- Experimentally, works better than PCA on the bag-of-words vectors
    + But a _lot_ more costly computationally (and in actual $)

## Clustering

- Start with high-dimensional feature vectors $X$
- Now map them to a _discrete_ set of categories, say $k$ of them
- $I[\tau(X);X] \leq \log{k}$ (why?)
- Try to maximize $I[X;\hat{X}(T)]$
    + Or, if that's too hard, some measure of the error of recovering $X$ from $T$
- This is **clustering**, and we'll look at it for the next few lessons

## Summing up

- $I[X;Y]$ tells us about how well _any_ method using $X$ can predict $Y$
- Sufficient statistics maximally compress $X$ without losing predictive information
- The information bottleneck method tells us how to trade off compression against predictive information
- Dimension reduction $\approx$ looking for a continuous bottleneck
- Clustering $\approx$ looking for a discrete bottleneck

## Backup: Information for Continuous Variables

Replace sums with integrals as needed.
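As an illustration of the mixed case below (continuous $X$, discrete $Y$), here is a toy R calculation, with an assumed model not taken from the slides: $Y$ is a fair coin, $X|Y=0 \sim N(0,1)$, and $X|Y=1 \sim N(2,1)$; the sum over $y$ is formed inside the integrand and then integrated numerically over $x$.

```{r}
# Toy model (assumed for illustration): Y ~ Bernoulli(1/2), X|Y Gaussian
f0 <- function(x) dnorm(x, mean=0, sd=1)
f1 <- function(x) dnorm(x, mean=2, sd=1)
fx <- function(x) 0.5*f0(x) + 0.5*f1(x)  # marginal pdf of X
# sum over y of p(y) p(x|y) log2(p(x|y)/p(x)), equivalent to the display below
integrand <- function(x) {
  0.5*f0(x)*log(f0(x)/fx(x), base=2) + 0.5*f1(x)*log(f1(x)/fx(x), base=2)
}
integrate(integrand, lower=-10, upper=12)$value  # I[X;Y] in bits, < H[Y] = 1
```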
- $X$ and $Y$ both continuous:
\begin{eqnarray}
I[X;Y] & = & \int{p(x,y) \left(\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)}\right) dx dy}\\
& = & \int{p(x) \left(\int{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)} dy}\right) dx}
\end{eqnarray}
with $p$ being the pdf everywhere
- $X$ continuous, $Y$ discrete:
$I[X;Y] = \int{p(x) \left(\sum_{y}{p(y|x) \log{\left(\frac{p(y|x)}{p(y)}\right)}}\right) dx}$
with $p(x)$ being the pdf but $p(y|x)$, $p(y)$ being the (conditional) pmfs
- In any case:
    + $I[X;Y] = I[Y;X]$
    + $I[X;Y] \geq 0$
    + $I[X;Y] = 0$ iff $X$ and $Y$ are independent
    + $I[f(X);Y] \leq I[X;Y]$, with equality iff $f(x) = f(x^{\prime}) \Rightarrow p(y|x) = p(y|x^{\prime})$ (i.e., sufficiency)

## Backup: Conditional entropy and classification accuracy

\begin{eqnarray}
H[M|Y,\hat{Y}] & = & 0\\
H[Y|\hat{Y}] & = & H[Y|\hat{Y}] + H[M|Y,\hat{Y}]\\
& = & H[Y,M|\hat{Y}]\\
& = & H[M|\hat{Y}] + H[Y|M,\hat{Y}]\\
& = & H[M|\hat{Y}] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\
& \leq & H[M] + \Pr(M=0)H[Y|M=0,\hat{Y}] + \Pr(M=1)H[Y|M=1,\hat{Y}]\\
& = & H[M] + \Pr(M=0)\cdot 0 + \Pr(M=1) H[Y|M=1,\hat{Y}]\\
& = & H[M] + \Pr(M=1) H[Y|M=1,\hat{Y}]\\
& \leq & H[M] + \Pr(M=1)\log{(|\mathcal{Y}|-1)}
\end{eqnarray}
because if $M=1$, we know $Y\neq \hat{Y}$, and there are at most $|\mathcal{Y}|-1$ values $Y$ could have

Finally, remember that $H[Y|\hat{Y}(X)] \geq H[Y|X]$

This result is called **Fano's inequality**

- Originally about recovering the true message ($Y$) from a noisy signal ($X$)
- For $Y$ uniformly distributed on $\mathcal{Y}$, Fano's inequality implies
$\Pr(M=1) \geq 1 - \frac{I[X;Y] + \log{2}}{\log{|\mathcal{Y}|}}$
    + (Can you show this?)
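The displayed bound is easy to evaluate; a small, illustrative R helper (not from the slides), working in natural logs to match the $\log{2}$ in the formula:

```{r}
# Fano lower bound on the error rate for Y uniform on n.classes values
fano.error.bound <- function(info, n.classes) {
  max(0, 1 - (info + log(2))/log(n.classes))
}
# E.g., with 1 nat of predictive information about a uniform 10-class Y,
# no classifier can have an error rate below this
fano.error.bound(info=1, n.classes=10)
```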
- Fano's inequality turns out to have many, many uses in prediction and estimation [@Scarlett-Cevher-on-Fano]
    + Prediction: Obvious
    + Estimation: Think of the parameter as the message ($Y$) we're trying to recover from the noisy signal of the data set ($X$)

## Backup: Sufficiency and the Bottleneck, Take 2 [@Bottleneck-note]

- Initial bottleneck problem:
$\max_{\tau}{I[Y;\tau(X)] - \beta H[\tau(X)]}$
    + Willing to give up $1$ bit of predictive information if it saves at least $1/\beta$ bits of memory
    + Lagrange-multiplier form:
$\max_{\tau}{I[Y;\tau(X)]} ~ \mathrm{subject\ to}~ H[\tau(X)] \leq c$
- Equivalently: compressing by 1 bit is only worthwhile if it reduces predictive information by $\leq \beta$ bits
$\min_{\tau}{H[\tau(X)] - (1/\beta) I[Y;\tau(X)]}$
- As $\beta \rightarrow 0$, we become less and less willing to give up any predictive information
    + Lagrange-multiplier form:
$\min_{\tau}{H[\tau(X)]} ~ \mathrm{subject\ to} ~ I[Y;\tau(X)] \geq c^{\prime}$
- The limit $\beta=0$ is the minimal sufficient statistic / statistical relevance basis:
$\min_{\tau}{H[\tau(X)]} ~\mathrm{subject\ to} ~ I[\tau(X);Y]=I[X;Y]$

## Backup: word2vec in a little more detail

- Each word $w$ corresponds to a vector $v_w$
- Each "context" = window of $k$ words around the focal word corresponds to a vector $v_c$
- Try to maximize
$\max_{v_c, v_w}{\sum_{\mathrm{word}~ w ~\mathrm{appears\ in\ context} ~c}{\log{\frac{e^{v_c \cdot v_w}}{\sum_{c^{\prime}}{e^{v_{c^{\prime}} \cdot v_w}}}}}}$
    + Words which appear in similar contexts should get similar vectors
    + The actual maximization is too hard, so the word2vec software does something easier but related [@Goldberg-Levy-word2vec-explained] ...
    + ... which turns out to be (implicitly) factorizing a matrix of pointwise mutual informations between words and contexts [@Levy-Goldberg-neural-word-embedding]

## References