Cosma Shalizi

22 September 2018

\[ \newcommand{\indep}{\perp} \]

- Causal discovery is in trouble, because it’s only pointwise-consistent
- Order selection for Markov models is a solved problem
- This *should* induce acute cognitive dissonance

- I read Spirtes, Glymour, and Scheines (1993) and it changed my life

- Start with undirected edges between everything
- Test whether \(V_i \indep V_j | V_k\)
- If yes, remove the \(V_i-V_j\) edge
- If no, move on to other and/or bigger \(V_k\)

- Orient edges by hunting for signs of colliders
- Needs (or seems to need):
- Faithfulness
- Consistent independence tests

- My background was in Markovian representations of stochastic processes
- What’s the smallest, simplest hidden Markov process that will generate the original process?

- I used ideas from SGS to write an algorithm for learning those representations (Shalizi and Klinkner 2004)
- Partially anticipated by Foulkes (1959)

- There are lots of conditional independence tests…
- Pointwise consistent test sequence \(\phi_n\) of whether \(P \in \mathcal{P}\) or \(P \in \mathcal{Q}\): the error probabilities go to zero \[\begin{eqnarray} \forall P \in \mathcal{P}, \lim_{n\rightarrow\infty}{P(\phi_n \neq \mathcal{P})} & = & 0 ~\text{(false positives, "size")}\\ \forall Q \in \mathcal{Q}, \lim_{n\rightarrow\infty}{Q(\phi_n \neq \mathcal{Q})} & = & 0 ~\text{(false negatives, 1-"power")} \end{eqnarray}\]
- Uniformly consistent test: \[\begin{eqnarray} \lim_{n\rightarrow\infty}{\sup_{P\in\mathcal{P}}{P(\phi_n \neq \mathcal{P})}} & = & 0\\ \lim_{n\rightarrow\infty}{\sup_{Q\in\mathcal{Q}}{Q(\phi_n \neq \mathcal{Q})}} & = & 0 \end{eqnarray}\]

- Robins et al. (2003):
- Generally no *uniformly* consistent tests for causal effect \(=0\) \((\mathcal{P})\) vs. effect \(\neq 0\) \((\mathcal{Q})\)
- Though there are pointwise-consistent tests


- KL divergence between \(P\) and \(Q\): \[
D(P\|Q) \equiv \int{\log{\left(\frac{dP}{dQ}(x)\right)} dP(x)} = \int{\log{\left(\frac{p(x)}{q(x)}\right)} p(x) dx}
\]
- \(D(P\|Q) \geq 0\), \(D(P\|Q) = 0 \Leftrightarrow P=Q\)
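As a quick numerical illustration (a minimal sketch; the function is mine, not from the talk), the divergence between two Bernoulli distributions:

```python
from math import log

def kl_bernoulli(p, q):
    """D(P||Q) in nats between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

d_pq = kl_bernoulli(0.5, 0.25)  # ~0.144 nats
d_qp = kl_bernoulli(0.25, 0.5)  # ~0.131 nats: note the asymmetry
```

Both properties in the bullet above hold: the divergence is non-negative, and vanishes only when the two distributions coincide.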

- For testing point null \(P\) vs. point alternative \(Q\), at fixed false-positive probability, \(P(\phi_n = Q) = \alpha\), \[
-\lim{\frac{1}{n}\log{Q(\phi_n = P)}} \leq D(P\|Q)
\]
- Similarly, at fixed power, \(Q(\phi_n = Q)\), \(D(Q\|P)\) controls the false-positive probability
- Minimal divergences over \(\mathcal{P}\), \(\mathcal{Q}\) control error probabilities for composite hypotheses

- Mutual information between \(X\) and \(Y\) (Shannon 1948): \[
I[X;Y] \equiv D(P_{XY}\|P_X \otimes P_Y) = \int{\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)} p(x,y) dx dy}
\]
- \(X \indep Y\) iff \(I[X;Y] = 0\)
- Similarly \(X \indep Y | Z\) iff \(I[X;Y|Z]=0\)
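The defining property, \(I[X;Y]=0\) exactly under independence, is easy to check numerically from a joint pmf; a minimal sketch (the function name is mine):

```python
from math import log

def mutual_information(joint):
    """I[X;Y] in nats from a joint pmf given as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent pair: I = 0
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Perfectly dependent pair: I = log 2
dep = {(0, 0): 0.5, (1, 1): 0.5}
```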

- Can also use the “lautum information”, \(L[X;Y] \equiv D(P_X \otimes P_Y \| P_{XY})\) (Palomar and Verdú 2008)
- When testing null of independence vs. alt. of dependence
- \(I[X;Y]\) controls exponential rate at which false positives \(\rightarrow 0\)
- \(L[X;Y]\) controls exponential rate at which power \(\rightarrow 1\)

- We can have arbitrarily small, but non-zero, information \(\Rightarrow\) arbitrarily slow growth in power (or shrinking of size)

- The classic Markov chain: \[ X_{t+1:\infty} \indep X_{-\infty:t-1} | X_t \]
- Higher-order Markov chain, of order \(r\): \[ X_{t+1:\infty} \indep X_{-\infty:t-r} | X_{t-r+1:t} \]
- Variable-length Markov chains / “sofic” processes / “context” trees: for each \(X_{-\infty:t}\), there is a \(K\) such that \[ X_{t+1:\infty} \indep X_{-\infty:t-K} | X_{t-K+1:t} \]
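For concreteness, a short simulator for an order-2 binary chain (the transition probabilities below are hypothetical, chosen so the lag-2 symbol genuinely matters):

```python
import random

def simulate_order2(n, kernel, seed=0):
    """Simulate a binary chain of order 2:
    P(X_{t+1} = 1 | X_{t-1}, X_t) = kernel[(X_{t-1}, X_t)]."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1), rng.randint(0, 1)]
    for _ in range(n - 2):
        p1 = kernel[(x[-2], x[-1])]
        x.append(1 if rng.random() < p1 else 0)
    return x

# Hypothetical kernel: the distribution of the next symbol depends on both lags
kernel = {(0, 0): 0.9, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.1}
x = simulate_order2(10000, kernel)
```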

- Markov models are all about conditional independence relations
- In **dependent** data, not multivariate IID samples

- Always about the current state screening off the past from the future
- When this is causal is another story…

- Can we *learn* the order of a Markov chain?
- This means learning independence relations from the data
- If this is possible, what’s the trick?

- \(\mathcal{M}_q =\) all chains of order \(q\)
- \(\mathcal{M}_{q} \subset \mathcal{M}_{q+1}\)
- Order of \(P\) = smallest \(r\) s.t. \(P \in \mathcal{M}_r\)
- Order selection: given \(X_{1:n}\) generated from unknown \(P\), find \(\hat{r}_n \equiv \hat{r}(X_{1:n}) \rightarrow r\) (in prob. or a.s.)
- Consistent estimation of the order \(\Rightarrow\) consistent test of a given order
- \(\Rightarrow\) consistent test of corresponding conditional independence

- KL divergence *rate* \(d(P\|Q)\): \[ d(P\|Q) \equiv \lim{\frac{1}{n}D(P(X_{1:n}) \| Q(X_{1:n}))} \]
- For stochastic processes, \(d\) controls the exponential rates at which error probabilities \(\rightarrow 0\)

- For each \(r\), MLE of \(X_{1:n}\) in \(\mathcal{M}_r\) is well-defined
- \(\Lambda_r(X_{1:n}) \equiv\) log-likelihood, at the MLE, of data
- Theorem (Billingsley 1961): for well-behaved chains, if \(q < r\), \[ \frac{1}{n}\left(\Lambda_q - \Lambda_{r}\right) \rightarrow -\inf_{Q \in \mathcal{M}_q}{d(P\|Q)} ~ \text{(P-a.s.)} \] while if \(q > r\), \[ 2[\Lambda_{q} - \Lambda_{r}] \rightsquigarrow \chi^2_{\mathrm{dim}(\mathcal{M}_q)-\mathrm{dim}(\mathcal{M}_{r})} \]
- This doesn’t quite do it, though — multiple testing is a real problem!
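The maximized log-likelihoods \(\Lambda_q\) are simple functions of transition counts; a sketch using the conditional likelihood (conditioning on the first \(q\) symbols), with the function name mine:

```python
from collections import Counter
from math import log

def max_log_likelihood(x, q):
    """Log-likelihood of x at the MLE within the order-q chains
    (conditional likelihood, conditioning on the first q symbols)."""
    ctx = Counter()    # counts of length-q contexts
    trans = Counter()  # counts of (context, next symbol) pairs
    for t in range(q, len(x)):
        c = tuple(x[t - q:t])
        ctx[c] += 1
        trans[(c, x[t])] += 1
    # MLE transition probabilities are the empirical conditional frequencies
    return sum(n * log(n / ctx[c]) for (c, _), n in trans.items())
```

On a deterministic period-2 sequence, for instance, \(\Lambda_1 = 0\) (the order-1 chain predicts perfectly) while \(\Lambda_0 < 0\).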

- The probability of *under*-estimating the order can go to 0 exponentially fast: \(\exists \hat{r}_n\) such that \[ -\lim{\frac{1}{n}\log{P(\hat{r}_n < r)}} = \inf_{Q\in \mathcal{M}_q, q<r}{d(Q\|P)} > 0 \]
- For any consistent \(\hat{r}_n\), the best rate of *over*-estimation is sub-exponential: \[ -\lim{\frac{1}{n}\log{P(\hat{r}_n > r)}} = 0 \]
- (Cappé, Moulines, and Rydén 2005, sec. 15.7, based on earlier work of Finesso)

- \(\mathcal{M}_{q} \subset \mathcal{M}_{q+1}\) — and not dense in \(\mathcal{M}_{q+1}\)
- \(\therefore\) We can’t come arbitrarily close to an order-2 chain by using order-1 chains
- But higher-order chains can come *arbitrarily* close to a given lower-order chain
- Maybe \(X_{t-1000}\) has a minute, but non-zero, influence on \(X_{t+1}\) (but only if \(X_t = 42\))
- \(r=1001\) but it looks really close to order \(1\)

- BIC is consistent (Csiszár and Shields 2000) for Markov chain order
- So are many other penalized-maximum-likelihood tools
- AIC is **inconsistent** (as usual)
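A sketch of BIC order selection along these lines, counting \(|A|^q(|A|-1)\) free transition probabilities as the dimension of \(\mathcal{M}_q\) (function name mine):

```python
from collections import Counter
from math import log

def bic_order(x, alphabet_size, max_order):
    """Return the order q minimizing BIC(q) = -2 Lambda_q + dim(M_q) log n."""
    n = len(x)

    def loglik(q):
        # Conditional log-likelihood at the MLE (empirical transition frequencies)
        ctx, trans = Counter(), Counter()
        for t in range(q, n):
            c = tuple(x[t - q:t])
            ctx[c] += 1
            trans[(c, x[t])] += 1
        return sum(m * log(m / ctx[c]) for (c, _), m in trans.items())

    scores = {q: -2 * loglik(q) + (alphabet_size ** q) * (alphabet_size - 1) * log(n)
              for q in range(max_order + 1)}
    return min(scores, key=scores.get)

# A deterministic period-2 sequence is a first-order chain; BIC recovers order 1
r_hat = bic_order([0, 1] * 2500, 2, 3)  # -> 1
```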

- Many other consistent estimators, based on information theory (Peres and Shields 2005)
- The common element is universal rates of convergence for empirical distributions of Markov chains

- (Based on Peres and Shields (2005))

- Conditional entropy: \[\begin{eqnarray} H[X_{k+1}|X_{1:k}] & \equiv & -\sum_{x_{1:k+1}}{P(x_{1:k+1})\log{P(x_{k+1}|x_{1:k})}}\\ H[X_{k+2}|X_{1:k+1}] & \leq & H[X_{k+1}|X_{1:k}] ~ \text{(assuming stationarity)} \end{eqnarray}\]
- For a Markov chain of order \(r\), and any \(k\geq r\), \(H[X_{k+1}|X_{1:k}] = H[X_{r+1}|X_{1:r}] \equiv h\), the **entropy rate**
- Another way of seeing the screening-off of the future from the past

- \(\hat{P}_n \equiv\) empirical probability based on \(X_{1:n}\)
- Empirical **entropy rate** of order \(k\): \[ \hat{h}_n(k) \equiv -\sum_{x_{1:k+1}}{\hat{P}_n(x_{1:k+1})\log{\hat{P}_n(x_{k+1}|x_{1:k})}} \]
- If \(k \geq r\), \(\hat{h}_n(k) \rightarrow h\)
- If \(k < r\), \(\hat{h}_n(k) \rightarrow H[X_{k+1}|X_{1:k}] > h\)
- Both convergences happen \(P\)-a.s.
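A direct implementation of \(\hat{h}_n(k)\) from block counts (a minimal sketch; the function name is mine):

```python
from collections import Counter
from math import log

def empirical_entropy_rate(x, k):
    """hat-h_n(k): empirical conditional entropy (in nats) of the next
    symbol given the previous k, from the sequence x."""
    n = len(x) - k
    # Counts of length-(k+1) blocks and of their length-k prefixes
    blocks = Counter(tuple(x[t:t + k + 1]) for t in range(n))
    prefixes = Counter(tuple(x[t:t + k]) for t in range(n))
    return -sum((m / n) * log(m / prefixes[b[:k]]) for b, m in blocks.items())
```

For a deterministic period-2 sequence, \(\hat{h}_n(0) = \log 2\) while \(\hat{h}_n(1) = 0\), matching the two limits above with \(r = 1\) and \(h = 0\).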

- Repetition length \(\ell(n)\): the length of the longest initial segment that repeats \[\begin{eqnarray}
\ell(n) \equiv \max{\left\{k ~: ~ \exists t \geq 1: x_{1:k} = x_{t+1:t+k}\right\}}
\end{eqnarray}\]
- Useful because of the **recurrence time entropy estimator**: \(P\)-a.s., \[ \frac{\log{n}}{\ell(n)} \rightarrow h \]
- Regardless of the order of \(P\)
- and even if non-Markov (at any order)
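A brute-force sketch of \(\ell(n)\) and the recurrence-time estimate (function name mine; quadratic time, fine for illustration):

```python
from math import log

def repetition_length(x):
    """ell(n): length of the longest initial segment x[0:k] that reappears
    starting at some later position t >= 1 (brute force, O(n^2))."""
    n = len(x)
    best = 0
    for t in range(1, n):
        k = 0
        while t + k < n and x[k] == x[t + k]:
            k += 1
        best = max(best, k)
    return best

# For a period-2 sequence ell(n) = n - 2, so log(n)/ell(n) is close to the
# true entropy rate h = 0 of this deterministic chain
x = [0, 1] * 500
h_est = log(len(x)) / repetition_length(x)
```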

- Define the order estimator: \[ \hat{r}_n = \min{\left\{k ~: ~\hat{h}_n(k) \leq \frac{\log{n}}{\ell(n)} + 2 (\log{n})^{-1/4}\right\}} \]
- It’s consistent: \(P\)-a.s., \[ \hat{r}_n \rightarrow r \]

- \(\frac{\log{n}}{\ell(n)} \rightarrow h\)
- If \(k < r\), \(\hat{h}_n(k) \rightarrow H[X_{k+1}|X_{1:k}] > h\), so eventually \(\hat{r}_n \geq r\)
- \((\log{n})^{-1/4}\) bounds the long-run *under*-estimation of both estimators of \(h\), so eventually \(\hat{r}_n \leq r\)
- Could use any other estimator with known, \(o(1)\) bound on under-estimation
- Obviously the work of the proof is in the bound!
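The pieces above can be assembled into a sketch of the estimator (all function names mine). One caveat: the \(2(\log{n})^{-1/4}\) slack shrinks very slowly, so at modest \(n\) the estimator leans toward small orders:

```python
from collections import Counter
from math import log

def hat_h(x, k):
    """Empirical conditional entropy of the next symbol given the last k."""
    n = len(x) - k
    blocks = Counter(tuple(x[t:t + k + 1]) for t in range(n))
    prefixes = Counter(tuple(x[t:t + k]) for t in range(n))
    return -sum((m / n) * log(m / prefixes[b[:k]]) for b, m in blocks.items())

def ell(x):
    """Longest initial segment of x that recurs starting at some t >= 1."""
    best = 0
    for t in range(1, len(x)):
        k = 0
        while t + k < len(x) and x[k] == x[t + k]:
            k += 1
        best = max(best, k)
    return best

def peres_shields_order(x, max_k=10):
    """hat-r_n: smallest k whose empirical entropy rate is within the slack
    2 (log n)^{-1/4} of the recurrence-time estimate log(n)/ell(n)."""
    n = len(x)
    target = log(n) / ell(x) + 2 * log(n) ** (-0.25)
    for k in range(max_k + 1):
        if hat_h(x, k) <= target:
            return k
    return max_k

# On a deterministic 4-cycle (a first-order chain over {0,1,2,3}):
r_hat = peres_shields_order([0, 1, 2, 3] * 250)  # -> 1
```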

- In a variable-length chain, the length of the relevant history \(K\) can vary
- e.g., \(X_t = 1\) fixes distribution of \(X_{t+1}\), but if \(X_t=0\), need to consider \(X_{t-1}\)
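The example in this bullet can be made concrete with a toy simulator (the specific probabilities are hypothetical):

```python
import random

def simulate_vlmc(n, seed=0):
    """A toy variable-length chain: if X_t = 1 the next-symbol distribution
    is fixed (K = 1); if X_t = 0 we must also look at X_{t-1} (K = 2)."""
    rng = random.Random(seed)
    x = [0, 1]
    for _ in range(n - 2):
        if x[-1] == 1:
            p1 = 0.7                         # context "1": K = 1 suffices
        else:
            p1 = 0.2 if x[-2] == 1 else 0.9  # context "10" vs. "00": K = 2
        x.append(1 if rng.random() < p1 else 0)
    return x

x = simulate_vlmc(1000)
```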

- Suppose \(K \leq r\) no matter what
- \(\Rightarrow\) the variable-length chain is a special case of the order-\(r\) chain, with extra independence relations
- Violates “faithfulness”?

- Arrange possible sequences \(X_{1:r}\) into a tree
- Chop off the sub-tree when conditional distribution of \(X_{t+1}\) stops changing
- Leaves are now “contexts” (J. Rissanen)

- Subtrees are strictly nested in super-trees
- Exponential rates for under-estimation, sub-exponential rates for over-estimation (Galves and Leonardi 2008)

- BIC is nonetheless consistent for learning the tree (Csiszár and Talata 2006)
- So is the Peres-Shields estimator

- In Markov processes, we can learn quite intricate and subtle conditional independence relations
- Proofs can be intricate, but their core is elementary large-deviations properties

- These rates of convergence are not necessarily uniform over all possible Markov processes
- Nobody is very bothered by the lack of strict uniformity

Billingsley, Patrick. 1961. *Statistical Inference for Markov Processes*. Chicago: University of Chicago Press.

Cappé, Olivier, Eric Moulines, and Tobias Rydén. 2005. *Inference in Hidden Markov Models*. New York: Springer.

Csiszár, Imre, and Paul C. Shields. 2000. “The Consistency of the BIC Markov Order Estimator.” *Annals of Statistics* 28:1601–19. http://projecteuclid.org/euclid.aos/1015957472.

Csiszár, Imre, and Zsolt Talata. 2006. “Context Tree Estimation for Not Necessarily Finite Memory Processes, via BIC and MDL.” *IEEE Transactions on Information Theory* 52:1007–16. https://doi.org/10.1109/TIT.2005.864431.

Foulkes, J. D. 1959. “A Class of Machines Which Determine the Statistical Structure of a Sequence of Characters.” In *Wescon Convention Record*, 4:66–73. Institute of Radio Engineers.

Galves, Antonio, and Florencia Leonardi. 2008. “Exponential Inequalities for Empirical Unbounded Context Trees.” In *In and Out of Equilibrium 2*, edited by V. Sidoravicius and M. E. Vares, 257–69. Basel: Birkhäuser. https://doi.org/10.1007/978-3-7643-8786-0_12.

Palomar, Daniel P., and Sergio Verdú. 2008. “Lautum Information.” *IEEE Transactions on Information Theory* 54:964–75. http://www.princeton.edu/~verdu/lautum.info.pdf.

Peres, Yuval, and Paul Shields. 2005. “Two New Markov Order Estimators.” arxiv:math.ST/0506080. http://arxiv.org/abs/math.ST/0506080.

Robins, James M., Richard Scheines, Peter Spirtes, and Larry Wasserman. 2003. “Uniform Consistency in Causal Inference.” *Biometrika* 90:491–515. http://www.stat.cmu.edu/tr/tr725/tr725.html.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In *Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004)*, edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press. http://arxiv.org/abs/cs.LG/0406011.

Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” *Bell System Technical Journal* 27:379–423.

Spirtes, Peter, Clark Glymour, and Richard Scheines. 1993. *Causation, Prediction, and Search*. 1st ed. Berlin: Springer-Verlag.