36-467/36-667

20 November 2018

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp} \]

In classical physics or dynamical systems, “state” is the present variable which fixes all future observables

- Back away from determinism: state determines the *distribution* of future observables
- Would like a *small* state
- Would like the state to be *well-behaved*, e.g. Markov
- Try to construct states by constructing predictions

Upper-case letters are random variables, lower-case their realizations

Stochastic process \(\ldots, X_{-1}, X_0, X_1, X_2, \ldots\)

\(X_{s}^{t} = (X_s, X_{s+1}, \ldots, X_{t-1}, X_t)\)

- Past up to and including \(t\) is \(X^t_{-\infty}\), future is \(X_{t+1}^{\infty}\)
- Discrete time isn’t required but cuts down on measure theory

- Look at \(X^t_{-\infty}\), make a guess about \(X_{t+1}^{\infty}\)
- Most general guess is a probability distribution
- Only ever attend to selected aspects of \(X^t_{-\infty}\)
- mean, variance, phase of 1st three Fourier modes, ….

- \(\therefore\) guess is a *function* or *statistic* of \(X^t_{-\infty}\)
- A good statistic should give us lots of information about \(X_{t+1}^{\infty}\)

- The **entropy** of a random variable \(X\): \[ H[X] \equiv -\sum_{x}{\Prob{X=x}\log_2{\Prob{X=x}}} \]
- Some mathematical properties:
- \(H[X] \geq 0\)
- \(H[X] = 0\) if and only if \(\Prob{X=x_0} = 1\) for some \(x_0\)
- If \(X\) has \(k\) possible values, \(H[X] \leq \log_2{k}\)
- \(H[X] = \log_2{k}\) if and only if \(\Prob{X=x} = 1/k\) for all \(x\)
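These properties are easy to check numerically; a minimal sketch (the distributions below are arbitrary examples):

```python
import math

def entropy(pmf):
    """Shannon entropy in bits; pmf maps values to probabilities."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

# A point mass has zero entropy ...
assert entropy({"a": 1.0}) == 0.0

# ... a uniform distribution on k values attains the log2(k) upper bound ...
k = 8
uniform = {i: 1.0 / k for i in range(k)}
assert abs(entropy(uniform) - math.log2(k)) < 1e-12

# ... and a non-uniform distribution on k values falls strictly below it.
skewed = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
assert 0 < entropy(skewed) < math.log2(4)
```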

- **Data processing inequality**: \(H[X] \geq H[\sigma(X)]\)
- \(H[X] = H[\sigma(X)]\) iff \(\sigma\) is 1-1

- \(H[X] \approx\) number of bits needed to encode the value of \(X\)
- \(H[X]\) is the uncertainty in the distribution of \(X\)

- Conditional entropy: \[\begin{eqnarray}
H[Y|X=x] & \equiv & -\sum_{y}{\Prob{Y=y|X=x}\log_2{\Prob{Y=y|X=x}}}\\
H[Y|X] & \equiv & \Expect{H[Y|X=X]}
\end{eqnarray}\]
- Condition on multiple variables in the natural way

- Joint entropy: \[ H[X,Y] = H[X] + H[Y|X] = H[Y] + H[X|Y] \]
- Same interpretations as entropy
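Both decompositions can be verified directly from the definitions; the joint distribution here is an arbitrary example:

```python
import math
from collections import defaultdict

def H(pmf):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    m = defaultdict(float)
    for pair, p in joint.items():
        m[pair[axis]] += p
    return dict(m)

def cond_entropy(joint, given):
    """H[other | given], straight from the definition: sum_x P(x) H[Y|X=x]."""
    total = 0.0
    for x, px in marginal(joint, given).items():
        slice_ = {pair: p / px for pair, p in joint.items() if pair[given] == x}
        total += px * H(slice_)
    return total

# An arbitrary joint distribution for (X, Y).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Both orderings of the chain rule recover the joint entropy.
assert abs(H(marginal(joint, 0)) + cond_entropy(joint, 0) - H(joint)) < 1e-12
assert abs(H(marginal(joint, 1)) + cond_entropy(joint, 1) - H(joint)) < 1e-12
```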

- Limiting entropy per time-step: \[ h \equiv \lim_{n\rightarrow\infty}{\frac{1}{n}H[X_1^n]} \]
- Equivalently, limiting conditional entropy: \[ h = \lim_{n\rightarrow\infty}{H[X_{n+1}|X_1^n]} \]
- Both limits exist and are equal for all stationary processes
- Interpretations:
- How many bits do we need to code the next observation, given the complete history? (“source coding”)
- How much uncertainty is there about what happens next, given the complete history?
- \(h=0\) for deterministic processes
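For a small Markov chain both limits can be checked exactly (the transition matrix below is an arbitrary example); the block entropies \(H[X_1^n]\) are enumerated by brute force:

```python
import math
from itertools import product

# Arbitrary two-state Markov chain: P[i][j] = P(X_{t+1}=j | X_t=i).
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
# Its stationary distribution (solves pi = pi P).
pi = {0: 5 / 6, 1: 1 / 6}

def Hrow(row):
    return -sum(p * math.log2(p) for p in row.values() if p > 0)

# For a Markov chain the conditional-entropy limit is exact at every n:
# H[X_{n+1} | X_1^n] = H[X_{n+1} | X_n] = sum_i pi_i H(P[i]).
h = sum(pi[i] * Hrow(P[i]) for i in pi)

def block_entropy(n):
    """H[X_1^n], enumerating every length-n sequence of the stationary chain."""
    total = 0.0
    for seq in product(P, repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * math.log2(p)
    return total

# The per-symbol block entropies decrease monotonically toward h.
rates = [block_entropy(n) / n for n in (1, 4, 8, 12)]
assert rates == sorted(rates, reverse=True) and rates[-1] > h
```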

- The information \(X\) gives about \(Y\) is the reduction in uncertainty: \[ H[Y] - H[Y|X] \]
- Some algebra shows that \[ H[Y] - H[Y|X] = H[X] - H[X|Y] \]
- Define the **mutual information** \[ I[X;Y] \equiv H[Y] - H[Y|X] \]
- Some properties:
- \(I[X;Y] \geq 0\)
- \(I[X;Y] = 0\) if and only if \(X\) and \(Y\) are independent

- Data processing: \(I[X;Y] \geq I[\sigma(X);Y]\)
- Can be preserved even if \(\sigma\) isn’t 1-1
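A numerical sketch of both facts, with \(\sigma(x) = x \bmod 2\) as the many-to-one statistic (the distributions are arbitrary illustrations):

```python
import math
from collections import defaultdict

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal(joint, axis):
    m = defaultdict(float)
    for pair, p in joint.items():
        m[pair[axis]] += p
    return dict(m)

def mi(joint):
    """I[X;Y] = H[X] + H[Y] - H[X,Y]."""
    return H(marginal(joint, 0)) + H(marginal(joint, 1)) - H(joint)

# X uniform on {0,1,2,3}; Y is X's parity, so Y depends on X only through x mod 2.
joint = {(x, x % 2): 0.25 for x in range(4)}
assert abs(mi(joint) - 1.0) < 1e-12              # I[X;Y] = 1 bit

# sigma(x) = x mod 2 is not 1-1, yet I[sigma(X);Y] = I[X;Y]: nothing is lost.
sig = defaultdict(float)
for (x, y), p in joint.items():
    sig[(x % 2, y)] += p
assert abs(mi(dict(sig)) - 1.0) < 1e-12

# Independent X and Y carry zero mutual information.
indep = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
assert abs(mi(indep)) < 1e-12
```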

- We want to make predictions
- Use \(I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]\) to measure how much information the statistic \(\sigma\) gives us about the future
- Can also look at \(I[X^{t+m}_{t+1};\sigma(X_{-\infty}^t)]\) for all \(m\)

- How big can we make that?

- For any statistic \(\sigma\), \[ I[X^{\infty}_{t+1};X_{-\infty}^t] \geq I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)] \]
- \(\sigma\) is **predictively sufficient** iff \[ I[X^{\infty}_{t+1};X_{-\infty}^t] = I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)] \]
- Sufficient statistics retain all predictive information in the data

(Crutchfield and Young 1989)

- Histories \(a\) and \(b\) are equivalent iff \[ \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = a} = \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = b} \]
- \([a] \equiv\) all histories equivalent to \(a\)
- The statistic of interest, the **predictive state**, is \[ \epsilon(x^t_{-\infty}) = [x^t_{-\infty}] \]
- Set \(s_t = \epsilon(x^t_{-\infty})\)
- A state is an equivalence class of histories *and* a distribution over future events
- IID = 1 state, periodic = \(p\) states
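The equivalence-classing can be sketched numerically for simple symbolic processes: group all length-\(L\) histories by their empirical next-symbol distribution. A toy illustration only (the real construction conditions on entire semi-infinite pasts and futures):

```python
from collections import defaultdict

def predictive_partition(series, L):
    """Group length-L histories by their empirical next-symbol distribution:
    a toy version of the equivalence-classing that defines predictive states."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(L, len(series)):
        counts[series[t - L:t]][series[t]] += 1
    states = defaultdict(list)
    for hist, nxt in counts.items():
        total = sum(nxt.values())
        dist = tuple(sorted((a, round(n / total, 6)) for a, n in nxt.items()))
        states[dist].append(hist)
    return list(states.values())

# A period-3 process has 3 predictive states; a constant (trivially IID)
# process has just 1, however long the histories we examine.
assert len(predictive_partition("ABC" * 200, L=2)) == 3
assert len(predictive_partition("A" * 600, L=2)) == 1
```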

set of histories, color-coded by conditional distribution of futures

Partitioning histories into predictive states

(Shalizi and Crutchfield 2001)

- \[ I[X^{\infty}_{t+1};X^t_{-\infty}] = I[X^{\infty}_{t+1};\epsilon(X^t_{-\infty})] \]
- because \[\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|S_t = \epsilon(x^t_{-\infty})} & = & \int_{y \in [x^t_{-\infty}]}{\Prob{X^{\infty}_{t+1}|X^t_{-\infty}=y} \Prob{X^t_{-\infty}=y|S_t = \epsilon(x^t_{-\infty})} dy}\\ & = & \Prob{X^{\infty}_{t+1}|X^t_{-\infty}=x^t_{-\infty}} \end{eqnarray*}\]

A non-sufficient partition of histories

Effect of insufficiency on predictive distributions

- Future observations are independent of the past given the causal state: \[ X^{\infty}_{t+1} \indep X^{t}_{-\infty} | S_{t} \]
- by sufficiency: \[\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}, S_{t} = \epsilon(x^t_{-\infty})} & = & \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}}\\ & = & \Prob{X^{\infty}_{t+1}|S_{t} = \epsilon(x^t_{-\infty})} \end{eqnarray*}\]

- Recursive transitions for states: \[
\epsilon(x^{t+1}_{-\infty}) = T(\epsilon(x^t_{-\infty}), x_{t+1})
\]
- Automata theory: “deterministic transitions” (even though there are probabilities)

- In continuous time: \[ \epsilon(x^{t+h}_{-\infty}) = T(\epsilon(x^t_{-\infty}),x^{t+h}_{t}) \]
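A sketch of such a recursive update \(T\), for a toy two-state process in which runs of Bs between As must have even length (a process of this kind reappears below). Reading one new symbol updates the state without re-reading the history, and agrees with the state recomputed from scratch:

```python
def T(state, a):
    """Recursive state update: an A always resets; a B toggles the parity."""
    if a == "A":
        return "free"
    return "forced" if state == "free" else "free"

def state_from_history(hist):
    """Direct computation: the state is fixed by the parity of the trailing B-run."""
    run = len(hist) - len(hist.rstrip("B"))
    return "forced" if run % 2 == 1 else "free"

# Feed symbols one at a time; the recursion tracks the from-scratch computation.
hist, s = "A", "free"
for a in "BBABBBBAAB":
    hist += a
    s = T(s, a)
    assert s == state_from_history(hist)
```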

- Claim: If \(u \sim v\), then \(ua \sim va\) for any next observation \(a\)
- Fix any set of future events \(F\) \[\begin{eqnarray*}
\Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = v}\\
\Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = u} & = & \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = v}
\end{eqnarray*}\] \[\begin{eqnarray*}
\Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua}\Prob{X_{t+1}= a|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\Prob{X_{t+1}= a|X^t_{-\infty} = v}
\end{eqnarray*}\] \[\begin{eqnarray*}
\Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\\
ua & \sim & va
\end{eqnarray*}\]
- (same for continuous values or time but need more measure theory)

- \[ S_{t+1}^{\infty} \indep S^{t-1}_{-\infty}|S_t \]
- because \[ S_{t+1}^{\infty} = T(S_t,X_{t+1}^{\infty}) \] and \[ X_{t+1}^{\infty} \indep \left\{ X^{t-1}_{-\infty}, S^{t-1}_{-\infty}\right\} | S_t \]

- \(\epsilon\) is **minimal sufficient** \(=\) can be computed from any other sufficient statistic
- \(\therefore\) for any sufficient \(\eta\), there exists a function \(g\) such that \[ \epsilon(X^{t}_{-\infty}) = g(\eta(X^t_{-\infty})) \]
- \(\therefore\) if \(\eta\) is sufficient \[ I[\epsilon(X^{t}_{-\infty}); X^{t}_{-\infty}] \leq I[\eta(X^{t}_{-\infty}); X^{t}_{-\infty}] \]

Sufficient, but not minimal, partition of histories

Coarser than the predictive states, but not sufficient

- There is really no other minimal sufficient statistic
- If \(\eta\) is minimal, there is an \(h\) such that \[ \eta = h(\epsilon) ~\mathrm{w.p.1} \]
- but \(\epsilon = g(\eta)\) (w.p.1)
- so \[\begin{eqnarray*} g(h(\epsilon)) & = & \epsilon\\ h(g(\eta)) & = & \eta \end{eqnarray*}\]
- \(\therefore\) \(\epsilon\) and \(\eta\) partition histories in the same way (w.p.1)

- If \(R_t = \eta(X^{t}_{-\infty})\) is also sufficient, then \[ H[R_{t+1}|R_t] \geq H[S_{t+1}|S_t] \]
- \(\therefore\) the predictive states are the closest we get to a deterministic model, without losing power

- \[\begin{eqnarray*} h & \equiv & \lim_{n\rightarrow\infty}{H[X_{n+1}|X^n_1]}\\ & = & \lim_{n\rightarrow\infty}{H[X_{n+1}|S_n]} ~\text{(sufficiency)}\\ & = & H[X_2|S_1] ~\text{(stationarity)} \end{eqnarray*}\]
- so the predictive states let us calculate the entropy rate
- and do source coding
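Once the states are in hand, \(h = H[X_2|S_1]\) is a finite computation: weight each state's next-symbol entropy by its stationary probability. A sketch with a hypothetical two-state machine (the stationary probabilities and emission distributions below are assumptions chosen for illustration):

```python
import math

H = lambda d: -sum(p * math.log2(p) for p in d.values() if p > 0)

# Hypothetical two-state predictive machine: state "free" emits A or B evenly,
# state "forced" always emits B; stationary occupation (2/3, 1/3).
pi = {"free": 2 / 3, "forced": 1 / 3}
emit = {"free": {"A": 0.5, "B": 0.5}, "forced": {"B": 1.0}}

# h = H[X_2|S_1] = sum_s pi(s) * H(next-symbol distribution at s)
h = sum(pi[s] * H(emit[s]) for s in pi)
assert abs(h - 2 / 3) < 1e-12   # two-thirds of a bit per symbol
```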

- The observed process \((X_t)\) is non-Markovian and ugly
- But it is generated from a homogeneous Markov process \((S_t)\)
- After minimization, this representation is (essentially) unique
- Smaller Markovian representations can exist, but then we must track distributions over their states
- and those distributions correspond to predictive states

- Common-or-garden HMM: \[ S_{t+1} \indep X_{t+1}|S_t \]
- But here \[ S_{t+1} = T(S_t, X_{t+1}) \]
- This is a **chain with complete connections** (Onicescu and Mihoc 1935; Iosifescu and Grigorescu 1990)

HMM

CCC

- Blocks of As of any length, separated by even-length blocks of Bs
- Histories ending \(AB\), \(ABB\), \(ABBB\), \(ABBBB\), … all have different predictive implications
- \(\Rightarrow\) Not Markov at any finite order
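This process can be generated from a two-state machine: one "free" state and one state that owes a second B. The emission probability 0.5 below is an arbitrary choice, since the process's support only fixes which blocks are allowed:

```python
import random

def even_process(n, p=0.5, seed=0):
    """Generate n symbols: blocks of As separated by even-length blocks of Bs."""
    rng = random.Random(seed)
    out, owe_b = [], False
    for _ in range(n):
        if owe_b:                     # mid-pair: the second B is forced
            out.append("B")
            owe_b = False
        elif rng.random() < p:
            out.append("A")
        else:
            out.append("B")           # first B of a pair: owe another
            owe_b = True
    return "".join(out)

x = even_process(5000)
# Every complete run of Bs between As has even length
# (strip the ends to ignore possibly-truncated boundary runs).
runs = [len(r) for r in x.strip("B").split("A") if r]
assert runs and all(r % 2 == 0 for r in runs)
```

Because the next-symbol distribution after a B depends on the parity of the whole trailing B-run, no finite-order Markov model in the observations reproduces this, yet the two-state machine does.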

- Statistical relevance basis (Salmon 1971, 1984)
- Measure-theoretic prediction process (Knight 1975)
- Forecasting/true measure complexity (Grassberger 1986)
- Causal states, \(\epsilon\) machine (Crutchfield and Young 1989)
- Not generally causal
- Though maybe sometimes? (Shalizi and Moore 2003)

- Observable operator model (Jaeger 2000)
- Predictive state representations (Littman, Sutton, and Singh 2002)
- Sufficient posterior representation (Langford, Salakhutdinov, and Zhang 2009)

Knight (1975) gave most general constructions

- Non-stationary \(X\)
- \(t\) continuous (but discrete works as special case)
- \(X_t\) with values in continuous spaces
- Lusin space = image of a complete separable metrizable space under a measurable bijection

- \(S_t\) is a Markov process with recursive updating

- Everything so far has been math/probability
- The Oracle tells us the infinite-dimensional distribution of \(X\)

- Can we do some statistics and find the states?
- Two senses of “find”: learn in a fixed model vs. discover the right model

**Problem**: Given states and transitions (\(\epsilon, T\)), realization \(x_1^n\), estimate \(\Prob{X_{t+1}=x|S_t=s}\)

- Just estimation for stochastic processes
- Easier than ordinary HMMs because \(S_t\) is a function of trajectory
- Exponential families in the all-discrete case, very tractable

**Problem**: Given \(x_1^n\), estimate \(\epsilon, T, \Prob{X_{t+1}=x|S_t=s}\)

- Much harder!
- Why should this be possible at all?
- Can’t cover every possible approach (and will neglect both Langford, Salakhutdinov, and Zhang (2009) and Pfau, Bartlett, and Wood (2010))

- Key observation: Recursion + one-step-ahead predictive sufficiency \(\Rightarrow\) general predictive sufficiency
- Get next-step distribution right by independence testing
- Then make states recursive

Assumes discrete observations, discrete time, finite causal states

Paper: Shalizi and Klinkner (2004); code: https://github.com/stites/CSSR

- Start with all histories in the same state
- Given current partition of histories into states, test whether going one step further back into the past changes the next-step conditional distribution
- Use a hypothesis test to control false positive rate

- If yes, split that cell of the partition, but see if it matches an existing distribution
- Must allow this merging or else lose minimality

- If no match, add new cell to the partition
- Stop when no more divisions can be made or a maximum history length \(\Lambda\) is reached
- For consistency, \(\Lambda < \frac{\log{n}}{h_1 + \iota}\) for some \(\iota\)
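One ingredient, sketched: the splitting test compares the empirical next-symbol distribution after a history suffix with that after a one-step-longer suffix. The toy version below runs on the even process described earlier (As separated by even-length blocks of Bs) and uses a hand-rolled Pearson chi-square statistic with the 1% critical value of \(\chi^2_1\) (6.63); CSSR proper also merges matching cells and determinizes:

```python
import random
from collections import defaultdict

# Generate the even process: As separated by even-length blocks of Bs.
rng = random.Random(1)
sym, owe_b = [], False
for _ in range(20000):
    if owe_b:
        sym.append("B"); owe_b = False
    elif rng.random() < 0.5:
        sym.append("A")
    else:
        sym.append("B"); owe_b = True
x = "".join(sym)

def counts_after(suffix):
    """Empirical next-symbol counts following a given history suffix."""
    c = defaultdict(int)
    for t in range(len(suffix), len(x)):
        if x.startswith(suffix, t - len(suffix)):
            c[x[t]] += 1
    return c

def chi2(c1, c2):
    """Pearson chi-square statistic for 'same next-symbol distribution?'."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    stat = 0.0
    for a in set(c1) | set(c2):
        pooled = (c1.get(a, 0) + c2.get(a, 0)) / (n1 + n2)
        for c, n in ((c1, n1), (c2, n2)):
            stat += (c.get(a, 0) - n * pooled) ** 2 / (n * pooled)
    return stat

# After "AB" the next symbol is a forced B; after "ABB" it is a fair choice:
# extending the history changes the prediction, so CSSR would split.
assert chi2(counts_after("AB"), counts_after("ABB")) > 6.63
# "ABB" and "ABBBB" both end a complete pair and predict identically; under
# the null the statistic is ~chi^2 with 1 df, so we assert only a loose bound.
assert chi2(counts_after("ABB"), counts_after("ABBBB")) < 50
```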

- Need to determinize a probabilistic automaton
- Several ways of doing this, but a solved technical problem in automata theory
- Textbooks: Lewis and Papadimitriou (1998); Hopcroft and Ullman (1979)
- Brilliant but bizarre math approach: Conway (1971)

- Trickiest coding in the algorithm and can influence the finite-sample behavior

- \(\mathcal{S} =\) true predictive state structure
- \(\widehat{\mathcal{S}}_n\) = structure reconstructed from \(n\) data points
- Assume: finite # of states, every state has a finite history, using long enough histories, technicalities: \[ \Prob{\widehat{\mathcal{S}}_n \neq \mathcal{S}} \rightarrow 0 \]
- \(\mathcal{D} =\) true distribution, \(\widehat{\mathcal{D}}_n\) = inferred \[
\Expect{{\|\widehat{\mathcal{D}}_n - \mathcal{D}\|}_{1}} = O(n^{-1/2})
\]
- Same order of convergence as for IID samples

- Empirical conditional distributions for histories converge
- ergodic theorem for Markov chains

- Histories in the same state become harder to accidentally separate
- Histories in different states become harder to confuse
- Each state’s predictive distribution converges \(O(n^{-1/2})\)
- CLT for Markov chains

reconstruction with \(\Lambda = 3\), \(n=1000\), \(\alpha = 0.005\)

N.B., CSSR did not know that there were 2 states, or how they were connected — it discovered this

- Neural spike train analysis (Haslinger, Klinkner, and Shalizi 2010), fMRI analysis (Merriam, Genovese and Shalizi in prep.)
- Geomagnetic fluctuations (Clarke, Freeman, and Watkins 2003)
- Natural language processing (M. Padró and Padró 2005b, 2005a, 2005c, 2007a, 2007b)
- Anomaly detection (David S. Friedlander, Phoha, and Brooks 2003; Davis S. Friedlander et al. 2003; Ray 2004)
- Information sharing in networks (Klinkner, Shalizi, and Camperi 2006; Shalizi, Camperi, and Klinkner 2007)
- Social media propagation (Cointet, Faure, and Roth 2007)

- Your stochastic process has a unique, minimal Markovian representation
- This representation has nice predictive properties
- Can reconstruct from sample data in some cases
- and a lot more could be done in this line

- Optimal strategy, under any loss function, only needs a sufficient statistic (Blackwell & Girshick)
- Strategies using insufficient statistics can generally be improved (Blackwell & Rao)
- Excuse for not worrying about particular loss functions

(Tishby, Pereira, and Bialek 1999)

- For inputs \(X\) and outputs \(Y\), fix \(\beta > 0\) and find \(\eta(X)\), the **bottleneck variable**, minimizing \[ I[\eta(X);X] - \beta I[\eta(X);Y] \]
- give up 1 bit of predictive information only in exchange for \(\beta\) bits of memory
- Predictive sufficiency comes as \(\beta \rightarrow \infty\), unwilling to lose *any* predictive power
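A brute-force sketch of the trade-off, using Tishby et al.'s Lagrangian \(I[\eta(X);X] - \beta I[\eta(X);Y]\) minimized over all deterministic partitions \(\eta\) of a four-valued \(X\) whose parity determines \(Y\) (the joint distribution is an arbitrary illustration):

```python
import math
from collections import defaultdict
from itertools import product

def H(pmf):
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def mi(joint):
    mx, my = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        mx[a] += p; my[b] += p
    return H(dict(mx)) + H(dict(my)) - H(joint)

# X uniform on {0,1,2,3}; Y = X mod 2, so parity is the sufficient statistic.
joint_xy = {(x, x % 2): 0.25 for x in range(4)}

def partitions(k):
    """All set partitions of range(k), as canonical label tuples."""
    seen = set()
    for labels in product(range(k), repeat=k):
        remap, canon = {}, []
        for l in labels:
            canon.append(remap.setdefault(l, len(remap)))
        seen.add(tuple(canon))
    return seen

def lagrangian(labels, beta):
    """I[eta(X);X] - beta * I[eta(X);Y] for a deterministic partition."""
    jx, jy = defaultdict(float), defaultdict(float)
    for (x, y), p in joint_xy.items():
        jx[(labels[x], x)] += p
        jy[(labels[x], y)] += p
    return mi(dict(jx)) - beta * mi(dict(jy))

# Large beta: unwilling to lose prediction -> the sufficient parity partition.
assert min(partitions(4), key=lambda l: lagrangian(l, 4.0)) == (0, 1, 0, 1)
# Small beta: memory cost dominates -> the trivial one-cell partition.
assert min(partitions(4), key=lambda l: lagrangian(l, 0.5)) == (0, 0, 0, 0)
```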

(Littman, Sutton, and Singh 2002; Shalizi 2001)

- System output \((X_t)\), input \((Y_t)\)
- Histories \(x^t_{-\infty}, y^t_{-\infty}\) have distributions of output \(x_{t+1}\) for each further input \(y_{t+1}\)
- Equivalence-class histories by these distributions and enforce recursive updating
- Internal states of the system, not trying to predict future inputs

(Shalizi 2003; Shalizi, Klinkner, and Haslinger 2004; Shalizi et al. 2006; Jänicke et al. 2007; Goerg 2013, 2014; Goerg and Shalizi 2012, 2013; Montañez and Shalizi 2017)

- Dynamic random field \(X(\vec{r},t)\)
- Assume a finite maximum speed \(c\) at which information can propagate

- Past cone: points in space-time which could matter to \(X(\vec{r},t)\)
- Future cone: points in space-time for which \(X(\vec{r},t)\) could matter

- Equivalence-class past cone configurations by conditional distributions over future cones
- \(S(\vec{r},t)\) is a Markov field
- Minimal sufficiency, recursive updating, etc., all go through

- **Statistical forecasting complexity** (Grassberger 1986; Crutchfield and Young 1989): \[ C \equiv I[\epsilon(X^t_{-\infty});X^t_{-\infty}] \]
- \(=\) amount of information about the past needed for optimal prediction
- \(=H[\epsilon(X^t_{-\infty})]\) for predictive causal states
- \(=\) expected algorithmic sophistication (Gács, Tromp, and Vitanyi 2001)
- \(=\log\)(period) for periodic processes
- \(=\log\)(geometric mean recurrence time) for stationary processes
- Property of the *process*, not our models
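For a periodic process the claim \(C = \log(\text{period})\) can be checked empirically: partition histories into predictive states, then take the entropy of the state occupation frequencies. A toy sketch (the history length \(L=3\) is an arbitrary choice, and the partitioning is by empirical next-symbol distribution only):

```python
import math
from collections import defaultdict

def forecasting_complexity(series, L):
    """Entropy of the occupation frequencies of empirical predictive states."""
    ctx = defaultdict(lambda: defaultdict(int))
    for t in range(L, len(series)):
        ctx[series[t - L:t]][series[t]] += 1
    occup = defaultdict(int)
    for hist, nxt in ctx.items():
        total = sum(nxt.values())
        dist = tuple(sorted((a, round(n / total, 6)) for a, n in nxt.items()))
        occup[dist] += total          # pool occupation counts by state
    N = sum(occup.values())
    return -sum((n / N) * math.log2(n / N) for n in occup.values())

# A period-4 process: 4 equally-visited phases, so C is (nearly) log2(4) = 2.
assert abs(forecasting_complexity("ABCD" * 500, L=3) - 2.0) < 0.01
# A period-2 process: C is (nearly) log2(2) = 1.
assert abs(forecasting_complexity("AB" * 500, L=3) - 1.0) < 0.01
```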

Clarke, Richard W., Mervyn P. Freeman, and Nicholas W. Watkins. 2003. “Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism.” *Physical Review E* 67:0126203. http://arxiv.org/abs/cond-mat/0110228.

Cointet, Jean-Philippe, Emmanuel Faure, and Camille Roth. 2007. “Intertemporal Topic Correlations in Online Media.” In *Proceedings of the International Conference on Weblogs and Social Media [Icwsm]*. Boulder, CO, USA. http://camille.roth.free.fr/travaux/cointetfaureroth-icwsm-cr4p.pdf.

Conway, J. H. 1971. *Regular Algebra and Finite Machines*. London: Chapman and Hall.

Crutchfield, James P., and Karl Young. 1989. “Inferring Statistical Complexity.” *Physical Review Letters* 63:105–8. http://www.santafe.edu/~cmg/compmech/pubs/ISCTitlePage.htm.

Friedlander, David S., Shashi Phoha, and Richard Brooks. 2003. “Determination of Vehicle Behavior Based on Distributed Sensor Network Data.” In *Advanced Signal Processing Algorithms, Architectures, and Implementations XIII*, edited by Franklin T. Luk. Vol. 5205. Proceedings of the Spie. Bellingham, WA: SPIE.

Friedlander, Davis S., Isanu Chattopadhayay, Asok Ray, Shashi Phoha, and Noah Jacobson. 2003. “Anomaly Prediction in Mechanical System Using Symbolic Dynamics.” In *Proceedings of the 2003 American Control Conference, Denver, Co, 4–6 June 2003*.

Gács, Péter, John T. Tromp, and Paul M. B. Vitanyi. 2001. “Algorithmic Statistics.” *IEEE Transactions on Information Theory* 47:2443–63. http://arxiv.org/abs/math.PR/0006233.

Goerg, Georg M. 2013. *LICORS: Light Cone Reconstruction of States — Predictive State Estimation from Spatio-Temporal Data*. http://CRAN.R-project.org/package=LICORS.

———. 2014. *LSC: Local Statistical Complexity — Automatic Pattern Discovery in Spatio-Temporal Data*. http://CRAN.R-project.org/package=LSC.

Goerg, Georg M., and Cosma Rohilla Shalizi. 2012. “LICORS: Light Cone Reconstruction of States for Non-Parametric Forecasting of Spatio-Temporal Systems.” Statistics Department, CMU. http://arxiv.org/abs/1206.2398.

———. 2013. “Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction.” In *Sixteenth International Conference on Artificial Intelligence and Statistics*, edited by Carlos M. Carvalho and Pradeep Ravikumar, 289–97. http://arxiv.org/abs/1211.3760.

Grassberger, Peter. 1986. “Toward a Quantitative Theory of Self-Generated Complexity.” *International Journal of Theoretical Physics* 25:907–38.

Haslinger, Robert, Kristina Lisa Klinkner, and Cosma Rohilla Shalizi. 2010. “The Computational Structure of Spike Trains.” *Neural Computation* 22:121–57. https://doi.org/10.1162/neco.2009.12-07-678.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. *Introduction to Automata Theory, Languages, and Computation*. Reading: Addison-Wesley.

Iosifescu, Marius, and Serban Grigorescu. 1990. *Dependence with Complete Connections and Its Applications*. Cambridge, England: Cambridge University Press.

Jaeger, Herbert. 2000. “Observable Operator Models for Discrete Stochastic Time Series.” *Neural Computation* 12:1371–98. https://doi.org/10.1162/089976600300015411.

Jänicke, Heike, Alexander Wiebel, Gerik Scheuermann, and Wolfgang Kollmann. 2007. “Multifield Visualization Using Local Statistical Complexity.” *IEEE Transactions on Visualization and Computer Graphics* 13:1384–91. https://doi.org/10.1109/TVCG.2007.70615.

Klinkner, Kristina Lisa, Cosma Rohilla Shalizi, and Marcelo F. Camperi. 2006. “Measuring Shared Information and Coordinated Activity in Neuronal Networks.” In *Advances in Neural Information Processing Systems 18 (Nips 2005)*, edited by Yair Weiss, Bernhard Schölkopf, and John C. Platt, 667–74. Cambridge, Massachusetts: MIT Press. http://arxiv.org/abs/q-bio.NC/0506009.

Knight, Frank B. 1975. “A Predictive View of Continuous Time Processes.” *Annals of Probability* 3:573–96. http://projecteuclid.org/euclid.aop/1176996302.

Langford, John, Ruslan Salakhutdinov, and Tong Zhang. 2009. “Learning Nonlinear Dynamic Models.” In *Proceedings of the 26th Annual International Conference on Machine Learning [Icml 2009]*, edited by Andrea Danyluk, Léon Bottou, and Michael Littman, 593–600. New York: Association for Computing Machinery. http://arxiv.org/abs/0905.3369.

Lewis, Harry R., and Christos H. Papadimitriou. 1998. *Elements of the Theory of Computation*. Second. Upper Saddle River, New Jersey: Prentice-Hall.

Littman, Michael L., Richard S. Sutton, and Satinder Singh. 2002. “Predictive Representations of State.” In *Advances in Neural Information Processing Systems 14 (Nips 2001)*, edited by Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, 1555–61. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/1983-predictive-representations-of-state.

Montañez, George D., and Cosma Rohilla Shalizi. 2017. “The LICORS Cabinet: Nonparametric Algorithms for Spatio-Temporal Prediction.” In *International Joint Conference on Neural Networks 2017 [Ijcnn 2017]*, 2811–9. https://doi.org/10.1109/IJCNN.2017.7966203.

Onicescu, Octav, and Gheorghe Mihoc. 1935. “Sur Les Chaînes de Variables Statistiques.” *Comptes Rendus de L’Académie Des Sciences de Paris* 200:511–12.

Padró, Muntsa, and Lluís Padró. 2005a. “A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm.” *Procesamiento Del Lenguaje Natural* 35:319–26. http://www.lsi.upc.edu/~nlp/papers/2005/sepln05-pp.pdf.

———. 2005b. “Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition.” In *Proceedings of 5th International Workshop on Finite-State Methods and Natural Language Processing (Fsmnlp’05)*. http://www.lsi.upc.edu/~nlp/papers/2005/fsmnlp05-pp.pdf.

———. 2005c. “Approaching Sequential NLP Tasks with an Automata Acquisition Algorithm.” In *Proceedings of International Conference on Recent Advances in Nlp (Ranlp’05)*. http://www.lsi.upc.edu/~nlp/papers/2005/ranlp05-pp.pdf.

———. 2007a. “ME-CSSR: An Extension of CSSR Using Maximum Entropy Models.” In *Proceedings of Finite State Methods for Natural Language Processing (Fsmnlp) 2007*. http://www.lsi.upc.edu/%7Enlp/papers/2007/fsmnlp07-pp.pdf.

———. 2007b. “Studying CSSR Algorithm Applicability on Nlp Tasks.” *Procesamiento Del Lenguaje Natural* 39:89–96. http://www.lsi.upc.edu/%7Enlp/papers/2007/sepln07-pp.pdf.

Pfau, David, Nicholas Bartlett, and Frank Wood. 2010. “Probabilistic Deterministic Infinite Automata.” In *Advances in Neural Information Processing Systems 23 [Nips 2010]*, edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 1930–8. Cambridge, Massachusetts: MIT Press. http://books.nips.cc/papers/files/nips23/NIPS2010_1179.pdf.

Ray, Asok. 2004. “Symbolic Dynamic Analysis of Complex Systems for Anomaly Detection.” *Signal Processing* 84:1115–30.

Salmon, Wesley C. 1971. *Statistical Explanation and Statistical Relevance*. Pittsburgh: University of Pittsburgh Press.

———. 1984. *Scientific Explanation and the Causal Structure of the World*. Princeton: Princeton University Press.

Shalizi, Cosma Rohilla. 2001. “Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata.” PhD thesis, University of Wisconsin-Madison. http://bactra.org/thesis/.

———. 2003. “Optimal Nonlinear Prediction of Random Fields on Networks.” *Discrete Mathematics and Theoretical Computer Science* AB(DMCS):11–30. http://arxiv.org/abs/math.PR/0305160.

Shalizi, Cosma Rohilla, Marcelo F. Camperi, and Kristina Lisa Klinkner. 2007. “Discovering Functional Communities in Dynamical Networks.” In *Statistical Network Analysis: Models, Issues, and New Directions: ICML 2006 Workshop on Statistical Network Analysis, Pittsburgh, Pa, Usa, June 2006: Revised Selected Papers*, edited by Edo Airoldi, David M. Blei, Stephen E. Fienberg, Anna Goldenberg, Eric P. Xing, and Alice X. Zheng, 4503:140–57. Lecture Notes in Computer Science. New York: Springer-Verlag. http://arxiv.org/abs/q-bio.NC/0609008.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2001. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” *Journal of Statistical Physics* 104:817–79. http://arxiv.org/abs/cond-mat/9907176.

Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina Lisa Klinkner, and Cristopher Moore. 2006. “Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” *Physical Review E* 73:036104. http://arxiv.org/abs/nlin.CG/0508001.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In *Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (Uai 2004)*, edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press. http://arxiv.org/abs/cs.LG/0406011.

Shalizi, Cosma Rohilla, Kristina Lisa Klinkner, and Robert Haslinger. 2004. “Quantifying Self-Organization with Optimal Predictors.” *Physical Review Letters* 93:118701. https://doi.org/10.1103/PhysRevLett.93.118701.

Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. “What Is a Macrostate? From Subjective Measurements to Objective Dynamics.” arxiv:cond-mat/0303625. http://arxiv.org/abs/cond-mat/0303625.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In *Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing*, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.