# Lessons (?) for causal discovery from Markov models

22 September 2018

$\newcommand{\indep}{\perp}$

# Two pieces of conventional wisdom

• Causal discovery is in trouble, because it’s only pointwise-consistent
• Order selection for Markov models is a solved problem
• This should induce acute cognitive dissonance

# Causal discovery in multivariate data

• I read Spirtes, Glymour, and Scheines (1993) and it changed my life

# The classic view (Spirtes, Glymour, and Scheines 1993)

• Test whether $$V_i \indep V_j | V_k$$
• If yes, remove the $$V_i-V_j$$ edge
• If no, move on to other and/or bigger $$V_k$$
• Orient edges by hunting for signs of colliders
• Needs (or seems to need):
  • Faithfulness
  • Consistent independence tests
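The loop above can be sketched in code. Here is a minimal, illustrative Python version of the skeleton-recovery phase only, with the conditional-independence test supplied as a callback (the names `indep_test` and `max_cond` are mine, not SGS's); collider-hunting and edge orientation are omitted:

```python
from itertools import combinations

def pc_skeleton(variables, indep_test, max_cond=2):
    """Skeleton phase of an SGS/PC-style search (illustrative sketch).

    indep_test(i, j, S) should return True iff V_i _||_ V_j | V_S,
    under whatever conditional-independence test you trust.
    """
    # Start from the complete undirected graph
    adj = {v: set(variables) - {v} for v in variables}
    for size in range(max_cond + 1):              # other and/or bigger V_k
        for i, j in combinations(variables, 2):
            if j not in adj[i]:
                continue
            others = (adj[i] | adj[j]) - {i, j}
            for S in combinations(sorted(others), size):
                if indep_test(i, j, set(S)):      # V_i _||_ V_j | V_S?
                    adj[i].discard(j)             # if yes, remove the edge
                    adj[j].discard(i)
                    break
    return adj

# Toy oracle for the chain graph A - B - C: the only independence is A _||_ C | {B}
def chain_oracle(i, j, S):
    return {i, j} == {"A", "C"} and "B" in S

print(pc_skeleton(["A", "B", "C"], chain_oracle))  # A-B and B-C survive; A-C is removed
```

On real data, `indep_test` would wrap an actual finite-sample test, and its errors are exactly where the consistency worries of the later slides bite.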

# This seems fine

• My background was in Markovian representations of stochastic processes
• What’s the smallest, simplest hidden Markov process that will generate the original process?
• I used ideas from SGS to write an algorithm for learning those representations (Shalizi and Klinkner 2004)
• Partially anticipated by Foulkes (1959)

# Consistent conditional independence tests

• There are lots of conditional independence tests…
• Pointwise consistent test sequence $$\phi_n$$ of whether $$P \in \mathcal{P}$$ or $$P \in \mathcal{Q}$$: the error probabilities go to zero $\begin{eqnarray} \forall P \in \mathcal{P}, \lim_{n\rightarrow\infty}{P(\phi_n \neq \mathcal{P})} & = & 0 ~\text{(false positives, "size")}\\ \forall Q \in \mathcal{Q}, \lim_{n\rightarrow\infty}{Q(\phi_n \neq \mathcal{Q})} & = & 0 ~\text{(false negatives, 1-"power")} \end{eqnarray}$
• Uniformly consistent test: $\begin{eqnarray} \lim_{n\rightarrow\infty}{\sup_{P\in\mathcal{P}}{P(\phi_n \neq \mathcal{P})}} & = & 0\\ \lim_{n\rightarrow\infty}{\sup_{Q\in\mathcal{Q}}{Q(\phi_n \neq \mathcal{Q})}} & = & 0 \end{eqnarray}$

# So what’s the problem?

• Robins et al. (2003):
  • Generally no uniformly consistent tests for causal effect $$=0$$ $$(\mathcal{P})$$ vs. effect $$\neq 0$$ $$(\mathcal{Q})$$
  • Though there are pointwise-consistent tests

# Brief excursion into information theory

• KL divergence between $$P$$ and $$Q$$: $D(P\|Q) \equiv \int{\log{\left(\frac{dP}{dQ}(x)\right)} dP(x)} = \int{\log{\left(\frac{p(x)}{q(x)}\right)} p(x) dx}$
• $$D(P\|Q) \geq 0$$, $$D(P\|Q) = 0 \Leftrightarrow P=Q$$
• For testing point null $$P$$ vs. point alternative $$Q$$, at fixed false-positive probability, $$P(\phi_n = Q) = \alpha$$, $-\lim{\frac{1}{n}\log{Q(\phi_n = P)}} \leq D(P\|Q)$
• Similarly, at fixed power, $$Q(\phi_n = Q)$$, $$D(Q\|P)$$ controls the false-positive probability
• Minimal divergences over $$\mathcal{P}$$, $$\mathcal{Q}$$ control error probabilities for composite hypotheses
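A quick numerical sketch of these divergences (my toy example, two Bernoulli laws). Note the asymmetry: $$D(P\|Q) \neq D(Q\|P)$$, which is why the two error exponents are controlled by different divergences:

```python
import math

def kl(p, q):
    """D(P || Q) in nats, for two pmfs on the same finite set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.5]   # point null: Bernoulli(0.5)
Q = [0.4, 0.6]   # point alternative: Bernoulli(0.6)

print(kl(P, Q))  # ~ 0.0204 nats: the exponent bounding false negatives
print(kl(Q, P))  # ~ 0.0201 nats: the divergence controlling false positives
print(kl(P, P))  # 0.0: zero divergence, no way to tell the hypotheses apart
```

Both divergences here are small, so the error probabilities shrink slowly, even though they do shrink.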

# The point of the information theory

• Mutual information between $$X$$ and $$Y$$ (Shannon 1948): $I[X;Y] \equiv D(P_{XY}\|P_X \otimes P_Y) = \int{\log{\left(\frac{p(x,y)}{p(x)p(y)}\right)} p(x,y) dx dy}$
• $$X \indep Y$$ iff $$I[X;Y] = 0$$
• Similarly $$X \indep Y | Z$$ iff $$I[X;Y|Z]=0$$
• Can also use the “lautum information”, $$L[X;Y] \equiv D(P_X \otimes P_Y \| P_{XY})$$ (Palomar and Verdú 2008)
• When testing the null of independence vs. the alternative of dependence:
  • $$I[X;Y]$$ controls exponential rate at which false positives $$\rightarrow 0$$
  • $$L[X;Y]$$ controls exponential rate at which power $$\rightarrow 1$$
• We can have arbitrarily small, but non-zero, information $$\Rightarrow$$ arbitrarily slow growth in power (or shrinking of size)
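The last point is easy to see numerically. Here is a plug-in computation of $$I[X;Y]$$ (my code) on a joint pmf that is a small perturbation `eps` away from exact independence:

```python
import math

def mutual_information(joint):
    """I[X;Y] in nats, from a joint pmf given as a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def perturbed(eps):
    # uniform product distribution on {0,1}^2, nudged off independence by eps
    return {(0, 0): 0.25 + eps, (0, 1): 0.25 - eps,
            (1, 0): 0.25 - eps, (1, 1): 0.25 + eps}

for eps in [0.1, 0.01, 0.001, 0.0]:
    print(eps, mutual_information(perturbed(eps)))
# I[X;Y] > 0 whenever eps > 0, but it shrinks to 0 with eps --
# so a test of independence can be made arbitrarily slow to reject
```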

# Let’s talk about Markov models

• The classic Markov chain: $X_{t+1:\infty} \indep X_{-\infty:t-1} | X_t$
• Higher-order Markov chain, of order $$r$$: $X_{t+1:\infty} \indep X_{-\infty:t-r} | X_{t-r+1:t}$
• Variable-length Markov chains / “sofic” processes / “context” trees: for each $$X_{-\infty:t}$$, there is a $$K$$ such that $X_{t+1:\infty} \indep X_{-\infty:t-K} | X_{t-K+1:t}$

# More conditional independence

• Markov models are all about conditional independence relations
• In dependent data, not multivariate IID samples
• Always about the current state screening off the past from the future
• When this is causal is another story…
• Can we learn the order of a Markov chain?
• This means learning independence relations from the data
• If this is possible, what’s the trick?

# Focus on learning the Markov order

• $$\mathcal{M}_q =$$ all chains of order $$q$$
• $$\mathcal{M}_{q} \subset \mathcal{M}_{q+1}$$
• Order of $$P$$ = smallest $$r$$ s.t. $$P \in \mathcal{M}_r$$
• Order selection: given $$X_{1:n}$$ generated from unknown $$P$$, find $$\hat{r}_n \equiv \hat{r}(X_{1:n}) \rightarrow r$$ (in prob. or a.s.)
• Consistent estimation of the order $$\Rightarrow$$ consistent test of a given order
• $$\Rightarrow$$ consistent test of corresponding conditional independence

• The KL divergence rate $$d(P\|Q)$$ is $d(P\|Q) \equiv \lim{\frac{1}{n}D(P(X_{1:n}) \| Q(X_{1:n}))}$
• For stochastic processes, $$d$$ controls the exponential rates at which error probabilities $$\rightarrow 0$$

# Likelihood ratio to the rescue?

• For each $$r$$, MLE of $$X_{1:n}$$ in $$\mathcal{M}_r$$ is well-defined
• $$\Lambda_r(X_{1:n}) \equiv$$ log-likelihood, at the MLE, of data
• Theorem (Billingsley 1961): for well-behaved chains, if $$q < r$$, $\frac{1}{n}\left(\Lambda_q - \Lambda_{r}\right) \rightarrow -\inf_{Q \in \mathcal{M}_q}{d(P\|Q)} ~ \text{(P-a.s.)}$ while if $$q > r$$, $2[\Lambda_{q} - \Lambda_{r}] \rightsquigarrow \chi^2_{\mathrm{dim}(\mathcal{M}_q)-\mathrm{dim}(\mathcal{M}_{r})}$
• This doesn’t quite do it, though — multiple testing is a real problem!

# A general result

• The probability of under-estimating the order can go to 0 exponentially fast: $$\exists \hat{r}_n$$ such that $-\lim{\frac{1}{n}\log{P(\hat{r}_n < r)}} = \inf_{Q\in \mathcal{M}_q, q<r}{d(Q\|P)} > 0$
• For any consistent $$\hat{r}_n$$, the best rate of over-estimation is sub-exponential: $-\lim{\frac{1}{n}\log{P(\hat{r}_n > r)}} = 0$
• (Cappé, Moulines, and Rydén 2005, sec. 15.7, based on earlier work of Finesso)

# What’s the intuition?

• $$\mathcal{M}_{q} \subset \mathcal{M}_{q+1}$$ — and not dense in $$\mathcal{M}_{q+1}$$
• $$\therefore$$ We can’t come arbitrarily close to an order-2 chain by using order-1 chains
• But higher-order chains can come arbitrarily close to a given lower-order chain
• Maybe $$X_{t-1000}$$ has a minute, but non-zero, influence on $$X_{t+1}$$ (but only if $$X_t = 42$$)
• $$r=1001$$ but it looks really close to order $$1$$

# Nonetheless, there is consistency

• BIC is consistent (Csiszár and Shields 2000) for Markov chain order
• So are many other penalized-maximum-likelihood tools
• AIC is inconsistent (as usual)
• Many other consistent estimators, based on information theory (Peres and Shields 2005)
• The common element is universal rates of convergence for empirical distributions of Markov chains
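As a concrete (toy) version of penalized maximum likelihood: fit every order up to a cap by plug-in transition counts, penalize the maximized log-likelihood by $$\frac{1}{2}\mathrm{dim}(\mathcal{M}_q)\log{n}$$ à la BIC, and take the argmin. This is a sketch of the idea, not Csiszár and Shields's exact construction; `bic_order` and its conventions are mine.

```python
import math, random
from collections import Counter

def bic_order(x, max_order, alphabet_size=2):
    """Pick the Markov order minimizing BIC = -2 * loglik + dim * log(n).

    dim(M_q) = |A|^q * (|A| - 1) free transition parameters.
    """
    n = len(x)
    best_bic, best_q = None, None
    for q in range(max_order + 1):
        trans, ctx = Counter(), Counter()
        for t in range(q, n):
            w = tuple(x[t - q:t])            # length-q context
            trans[(w, x[t])] += 1
            ctx[w] += 1
        # maximized log-likelihood under the plug-in transition probabilities
        loglik = sum(c * math.log(c / ctx[w]) for (w, a), c in trans.items())
        dim = (alphabet_size ** q) * (alphabet_size - 1)
        bic = -2 * loglik + dim * math.log(n)
        if best_bic is None or bic < best_bic:
            best_bic, best_q = bic, q
    return best_q

# simulate an order-1 binary chain: P(1 | 1) = 0.9, P(1 | 0) = 0.2
random.seed(0)
x = [0]
for _ in range(20000):
    p1 = 0.9 if x[-1] == 1 else 0.2
    x.append(1 if random.random() < p1 else 0)

print(bic_order(x, max_order=4))  # selects order 1 on this sample
```

Swapping the $$\log{n}$$ penalty for AIC's constant 2 per parameter is exactly what breaks consistency: the over-estimation probability then no longer goes to zero.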

# A closer look at one of those approaches

• (Based on Peres and Shields (2005))
• Conditional entropy: $\begin{eqnarray} H[X_{k+1}|X_{1:k}] & \equiv & -\sum_{x_{1:k+1}}{P(x_{1:k+1})\log{P(x_{k+1}|x_{1:k})}}\\ H[X_{k+2}|X_{1:k+1}] & \leq & H[X_{k+1}|X_{1:k}] ~ \text{(assuming stationarity)} \end{eqnarray}$
• For a Markov chain of order $$r$$, and any $$k\geq r$$, $$H[X_{k+1}|X_{1:k}] = H[X_{r+1}|X_{1:r}] \equiv h$$, the entropy rate
• Another way of seeing the screening-off of the future from the past

# (continued)

• $$\hat{P}_n \equiv$$ empirical probability based on $$X_{1:n}$$
• Empirical entropy rate of order $$k$$: $\hat{h}_n(k) \equiv -\sum_{x_{1:k+1}}{\hat{P}_n(x_{1:k+1})\log{\hat{P}_n(x_{k+1}|x_{1:k})}}$
• If $$k \geq r$$, $$\hat{h}_n(k) \rightarrow h$$
• If $$k < r$$, $$\hat{h}_n(k) \rightarrow H[X_{k+1}|X_{1:k}] > h$$
• Both convergences happen $$P$$-a.s.
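A plug-in implementation of $$\hat{h}_n(k)$$ (my code, straight from the definition), checked on a simulated order-1 binary chain: the estimate drops from $$k=0$$ to $$k=1$$ and then flattens out near the true entropy rate:

```python
import math, random
from collections import Counter

def empirical_entropy_rate(x, k):
    """hat-h_n(k): plug-in conditional entropy of X_{k+1} given X_{1:k}."""
    blocks, ctx = Counter(), Counter()
    for t in range(k, len(x)):
        blocks[tuple(x[t - k:t + 1])] += 1   # (k+1)-blocks
        ctx[tuple(x[t - k:t])] += 1          # their length-k contexts
    n = sum(blocks.values())
    return -sum((c / n) * math.log(c / ctx[b[:-1]])
                for b, c in blocks.items())

# order-1 binary chain: P(1 | 1) = 0.9, P(1 | 0) = 0.2, true h ~ 0.384 nats
random.seed(1)
x = [0]
for _ in range(50000):
    p1 = 0.9 if x[-1] == 1 else 0.2
    x.append(1 if random.random() < p1 else 0)

for k in range(4):
    print(k, empirical_entropy_rate(x, k))
# hat-h_n(0) is the marginal entropy (~0.64 nats); for k >= 1 it sits near h
```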

# (continued)

• Repetition length $$\ell(n)$$: the length of the longest initial segment that repeats $\begin{eqnarray} \ell(n) \equiv \max{\left\{k ~: ~ \exists t \geq 1: x_{1:k} = x_{t+1:t+k}\right\}} \end{eqnarray}$ Useful because of the recurrence-time entropy estimator: $$P$$-a.s., $\frac{\log{n}}{\ell(n)} \rightarrow h$
• Regardless of the order of $$P$$
• and even if non-Markov (at any order)

# (continued)

• Define the order estimator: $\hat{r}_n = \min{\left\{k ~: ~\hat{h}_n(k) \leq \frac{\log{n}}{\ell(n)} + 2 (\log{n})^{-1/4}\right\}}$
• It’s consistent: $$P$$-a.s., $\hat{r}_n \rightarrow r$
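Putting the pieces together in code (my sketch, following the definitions above). One caveat for a live demo: the slack term $$2(\log{n})^{-1/4}$$ is enormous at realistic $$n$$ (about 1.1 nats at $$n = 20000$$), so a binary chain can never clear it; the example below uses a 10-symbol alphabet, where the entropy gap between orders dwarfs the slack.

```python
import math, random
from collections import Counter

def empirical_entropy_rate(x, k):
    """hat-h_n(k): plug-in conditional entropy of X_{k+1} given X_{1:k}."""
    blocks, ctx = Counter(), Counter()
    for t in range(k, len(x)):
        blocks[tuple(x[t - k:t + 1])] += 1
        ctx[tuple(x[t - k:t])] += 1
    n = sum(blocks.values())
    return -sum((c / n) * math.log(c / ctx[b[:-1]])
                for b, c in blocks.items())

def repetition_length(x):
    """ell(n): largest k such that x_{1:k} recurs somewhere later on."""
    s = ''.join(map(str, x))
    k = 0
    while k + 1 < len(s) and s.find(s[:k + 1], 1) != -1:
        k += 1
    return k

def peres_shields_order(x, max_k=5):
    """Smallest k whose empirical entropy rate falls below the threshold."""
    n = len(x)
    target = math.log(n) / repetition_length(x) + 2 * math.log(n) ** -0.25
    for k in range(max_k + 1):
        if empirical_entropy_rate(x, k) <= target:
            return k
    return max_k

# order-1 chain on {0, ..., 9}: usually step to x+1 mod 10, sometimes jump
random.seed(2)
x = [0]
for _ in range(20000):
    if random.random() < 0.9:
        x.append((x[-1] + 1) % 10)
    else:
        x.append(random.randrange(10))

print(peres_shields_order(x))  # recovers order 1 here
```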

# Why is the Peres-Shields estimator consistent?

• $$\frac{\log{n}}{\ell(n)} \rightarrow h$$
• If $$k < r$$, $$\hat{h}_n(k) \rightarrow H[X_{k+1}|X_{1:k}] > h$$, so eventually $$\hat{r}_n \geq r$$
• $$(\log{n})^{-1/4}$$ bounds the long-run under-estimation of both estimators of $$h$$, so eventually $$\hat{r}_n \leq r$$
• Could use any other estimator with known, $$o(1)$$ bound on under-estimation
• Obviously the work of the proof is in the bound!

# A little more about variable-length chains

• In a variable-length chain, the length of the relevant history $$K$$ can vary
• e.g., $$X_t = 1$$ fixes distribution of $$X_{t+1}$$, but if $$X_t=0$$, need to consider $$X_{t-1}$$
• Suppose $$K \leq r$$ no matter what
• $$\Rightarrow$$ the variable-length chain is a special case of the order-$$r$$ chain, with extra independence relations
• Violates “faithfulness”?
• Arrange possible sequences $$X_{1:r}$$ into a tree
• Chop off the sub-tree when conditional distribution of $$X_{t+1}$$ stops changing
• Leaves are now “contexts” (J. Rissanen)
• Subtrees are strictly nested in super-trees
• Exponential rates for under-estimation, sub-exponential rates for over-estimation (Galves and Leonardi 2008)
• BIC is nonetheless consistent for learning the tree (Csiszár and Talata 2006)
• So is the Peres-Shields estimator
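A sketch of the chop-off-the-sub-tree idea in code (mine, and deliberately naive: contexts are pruned by comparing child and parent conditional distributions within a tolerance `tol`, rather than by BIC as in Csiszár and Talata). The simulated chain is the example from above: $$X_t = 1$$ fixes the next distribution, while $$X_t = 0$$ makes it depend on $$X_{t-1}$$.

```python
import random
from collections import defaultdict

def context_tree(x, max_depth=4, tol=0.05):
    """Estimate a context tree: grow contexts into the past, and prune
    where the next-symbol distribution stops changing."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[w][a]: a follows w
    alphabet = sorted(set(x))
    for t in range(max_depth, len(x)):
        for d in range(max_depth + 1):
            w = tuple(x[t - d:t])                    # last d symbols before x[t]
            counts[w][x[t]] += 1

    def dist(w):
        total = sum(counts[w].values())
        return {a: counts[w][a] / total for a in counts[w]}

    def close(p, q):
        return all(abs(p.get(a, 0.0) - q.get(a, 0.0)) <= tol
                   for a in set(p) | set(q))

    leaves = []
    def grow(w):
        children = [(a,) + w for a in alphabet if counts[(a,) + w]]
        if len(w) == max_depth or all(close(dist(c), dist(w)) for c in children):
            leaves.append(w)             # extending w changes nothing: prune
        else:
            for c in children:
                grow(c)
    grow(())
    return leaves

# variable-length chain: the context is (1) after a 1, but (X_{t-1}, 0) after a 0
random.seed(3)
x = [0, 1]
for _ in range(50000):
    if x[-1] == 1:
        p1 = 0.2
    elif x[-2] == 1:
        p1 = 0.9
    else:
        p1 = 0.5
    x.append(1 if random.random() < p1 else 0)

print(sorted(context_tree(x)))  # the contexts (0, 0), (1, 0), and (1,)
```

The `tol` threshold here plays roughly the role that the $$(\log{n})^{-1/4}$$ slack played for the order estimator: it must shrink slowly enough to absorb estimation noise.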

# Summing up

• In Markov processes, we can learn quite intricate and subtle conditional independence relations
• Proofs can be intricate, but their core is elementary large-deviations properties
• These rates of convergence are not necessarily uniform over all possible Markov processes
• Nobody is very bothered by the lack of strict uniformity


# References

Billingsley, Patrick. 1961. Statistical Inference for Markov Processes. Chicago: University of Chicago Press.

Cappé, Olivier, Eric Moulines, and Tobias Rydén. 2005. Inference in Hidden Markov Models. New York: Springer.

Csiszár, Imre, and Paul C. Shields. 2000. “The Consistency of the BIC Markov Order Estimator.” Annals of Statistics 28:1601–19. http://projecteuclid.org/euclid.aos/1015957472.

Csiszár, Imre, and Zsolt Talata. 2006. “Context Tree Estimation for Not Necessarily Finite Memory Processes, via BIC and MDL.” IEEE Transactions on Information Theory 52:1007–16. https://doi.org/10.1109/TIT.2005.864431.

Foulkes, J. D. 1959. “A Class of Machines Which Determine the Statistical Structure of a Sequence of Characters.” In Wescon Convention Record, 4:66–73. Institute of Radio Engineers.

Galves, Antonio, and Florencia Leonardi. 2008. “Exponential Inequalities for Empirical Unbounded Context Trees.” In In and Out of Equilibrium 2, edited by V. Sidoravicius and M. E. Vares, 257–69. Basel: Birkhäuser. https://doi.org/10.1007/978-3-7643-8786-0_12.

Palomar, Daniel P., and Sergio Verdú. 2008. “Lautum Information.” IEEE Transactions on Information Theory 54:964–75. http://www.princeton.edu/~verdu/lautum.info.pdf.

Peres, Yuval, and Paul Shields. 2005. “Two New Markov Order Estimators.” arxiv:math.ST/0506080. http://arxiv.org/abs/math.ST/0506080.

Robins, James M., Richard Scheines, Peter Spirtes, and Larry Wasserman. 2003. “Uniform Consistency in Causal Inference.” Biometrika 90:491–515. http://www.stat.cmu.edu/tr/tr725/tr725.html.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press. http://arxiv.org/abs/cs.LG/0406011.

Shannon, Claude E. 1948. “A Mathematical Theory of Communication.” Bell System Technical Journal 27:379–423.

Spirtes, Peter, Clark Glymour, and Richard Scheines. 1993. Causation, Prediction, and Search. 1st ed. Berlin: Springer-Verlag.