Information Theory and Optimal Prediction

36-467/36-667

20 November 2018

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp} \]

“State”

In classical physics or dynamical systems, “state” is the present variable which fixes all future observables
Back away from determinism: state determines distribution of future observables
Would like a small state
Would like the state to be well-behaved, e.g. Markov
Try to construct states by constructing predictions

Notation etc.

Upper-case letters are random variables, lower-case their realizations
Stochastic process \(\ldots, X_{-1}, X_0, X_1, X_2, \ldots\)
\(X_{s}^{t} = (X_s, X_{s+1}, \ldots X_{t-1}, X_t)\)
Past up to and including \(t\) is \(X^t_{-\infty}\), future is \(X_{t+1}^{\infty}\)
- Discrete time isn’t required but cuts down on measure theory

Making a Prediction

Look at \(X^t_{-\infty}\), make a guess about \(X_{t+1}^{\infty}\)
Most general guess is a probability distribution
Only ever attend to selected aspects of \(X^t_{-\infty}\)
- mean, variance, phase of 1st three Fourier modes, ….
\(\therefore\) guess is a function or statistic of \(X^t_{-\infty}\)
A good statistic should give us lots of information about \(X_{t+1}^{\infty}\)

Entropy

The entropy of a random variable \(X\): \[ H[X] \equiv -\sum_{x}{\Prob{X=x}\log_2{\Prob{X=x}}} \]
Some mathematical properties:
- \(H[X] \geq 0\)
- \(H[X] = 0\) if and only if \(\Prob{X=x_0} = 1\) for some \(x_0\)
- If \(X\) has \(k\) possible values, \(H[X] \leq \log_2{k}\)
- \(H[X] = \log_2{k}\) if and only if \(\Prob{X=x} = 1/k\) for all \(x\)
Data processing inequality: \(H[X] \geq H[\sigma(X)]\)
- \(H[X] = H[\sigma(X)]\) iff \(\sigma\) is 1-1
\(H[X] \approx\) number of bits needed to encode the value of \(X\)
\(H[X]\) is the uncertainty in the distribution of \(X\)
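The properties above can be checked numerically. The following is a minimal sketch, not part of the original slides; the helper names `entropy` and `pushforward` and the example distributions are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a distribution given as an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

def pushforward(dist, sigma):
    """Distribution of sigma(X) when X has distribution `dist` (dict: value -> probability)."""
    out = Counter()
    for x, p in dist.items():
        out[sigma(x)] += p
    return dict(out)

# Hypothetical examples, chosen only to exercise the listed properties
uniform = {x: 0.25 for x in range(4)}            # k = 4 equally likely values
point_mass = {0: 1.0}                            # a deterministic variable
parity = pushforward(uniform, lambda x: x % 2)   # a sigma that is not 1-1

h_uniform = entropy(uniform.values())    # attains the upper bound log2(k)
h_point = entropy(point_mass.values())   # zero uncertainty
h_parity = entropy(parity.values())      # H[sigma(X)], no larger than H[X]
```

Here `h_parity` is strictly smaller than `h_uniform` because taking the parity merges distinct values of \(X\).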

Joint and Conditional Entropy

Conditional entropy: \[\begin{eqnarray} H[Y|X=x] & \equiv & -\sum_{y}{\Prob{Y=y|X=x}\log_2{\Prob{Y=y|X=x}}}\\ H[Y|X] & \equiv & \Expect{H[Y|X=X]} \end{eqnarray}\]
- Condition on multiple variables in the natural way
Joint entropy: \[ H[X,Y] = H[X] + H[Y|X] = H[Y] + H[X|Y] \]
Same interpretations as entropy
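As a quick check of the chain rule, here is a small sketch that computes both conditional entropies from a joint table. The joint distribution and the function names are made up for illustration.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(X=x, Y=y), keyed by (x, y)
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def marginal(joint, axis):
    """Marginal distribution of coordinate `axis` (0 for X, 1 for Y)."""
    out = {}
    for pair, p in joint.items():
        out[pair[axis]] = out.get(pair[axis], 0.0) + p
    return out

def conditional_entropy(joint, given_axis):
    """Entropy of the other coordinate given coordinate `given_axis`:
    the average over g of the entropy of the conditional distribution given g."""
    total = 0.0
    for g, pg in marginal(joint, given_axis).items():
        cond = [p / pg for pair, p in joint.items() if pair[given_axis] == g]
        total += pg * entropy(cond)
    return total

h_xy = entropy(joint.values())
h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
h_y_given_x = conditional_entropy(joint, 0)
h_x_given_y = conditional_entropy(joint, 1)
```

Both decompositions, `h_x + h_y_given_x` and `h_y + h_x_given_y`, should reproduce `h_xy` up to floating-point error.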

Entropy Rate

Limiting entropy per time-step: \[ h \equiv \lim_{n\rightarrow\infty}{\frac{1}{n}H[X_1^n]} \]
Equivalently, limiting conditional entropy: \[ h = \lim_{n\rightarrow\infty}{H[X_{n+1}|X_1^n]} \]
Both limits exist and are equal for all stationary processes
Interpretations:
- How many bits do we need to code the next observation, given the complete history? (“source coding”)
- How much uncertainty is there about what happens next, given the complete history?
- \(h=0\) for deterministic processes
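To see the two limits agree, the sketch below evaluates both for a stationary two-state Markov chain by brute-force enumeration. The transition matrix is a hypothetical choice, and a Markov chain is used only because its entropy rate has a simple closed form to compare against.

```python
from itertools import product
from math import log2

P = [[0.9, 0.1], [0.5, 0.5]]   # hypothetical transition matrix
pi = [5 / 6, 1 / 6]            # its stationary distribution (pi P = pi)

def block_entropy(n):
    """H[X_1^n] in bits, by enumerating all 2^n sequences."""
    total = 0.0
    for seq in product((0, 1), repeat=n):
        p = pi[seq[0]]
        for a, b in zip(seq, seq[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * log2(p)
    return total

def row_entropy(row):
    """Entropy in bits of one row of the transition matrix."""
    return -sum(p * log2(p) for p in row if p > 0)

# Closed form for a stationary Markov chain: average next-step entropy
h_exact = sum(pi[s] * row_entropy(P[s]) for s in (0, 1))

# (1/n) H[X_1^n] for n = 1..12, and H[X_{n+1} | X_1^n] = H[X_1^{n+1}] - H[X_1^n]
per_step = [block_entropy(n) / n for n in range(1, 13)]
conditional = [block_entropy(n + 1) - block_entropy(n) for n in range(1, 12)]
```

For a first-order chain the conditional form hits `h_exact` immediately, while the per-step average approaches it only at rate \(1/n\).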

Information

The information \(X\) gives about \(Y\) is the reduction in uncertainty: \[ H[Y] - H[Y|X] \]
Some algebra shows that \[ H[Y] - H[Y|X] = H[X] - H[X|Y] \]
Define the mutual information \[ I[X;Y] \equiv H[Y] - H[Y|X] \]
Some properties:
- \(I[X;Y] \geq 0\)
- \(I[X;Y] = 0\) if and only if \(X\) and \(Y\) are independent
Data processing: \(I[X;Y] \geq I[\sigma(X);Y]\)
- Can be preserved even if \(\sigma\) isn’t 1-1
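A small numerical illustration of the last two points, under a made-up joint distribution in which \(Y\) depends on \(X\) only through `x // 2`. The function names are illustrative.

```python
from math import log2

def mutual_information(joint):
    """I[X;Y] in bits from a dict {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def coarsen(joint, sigma):
    """Joint distribution of (sigma(X), Y)."""
    out = {}
    for (x, y), p in joint.items():
        key = (sigma(x), y)
        out[key] = out.get(key, 0.0) + p
    return out

# Hypothetical example: X uniform on {0,1,2,3}; P(Y=1|X=x) depends only on x // 2
joint = {}
for x in range(4):
    p_y1 = 0.9 if x // 2 == 1 else 0.2
    joint[(x, 1)] = 0.25 * p_y1
    joint[(x, 0)] = 0.25 * (1 - p_y1)

i_full = mutual_information(joint)
i_half = mutual_information(coarsen(joint, lambda x: x // 2))    # not 1-1, keeps everything
i_parity = mutual_information(coarsen(joint, lambda x: x % 2))   # not 1-1, keeps nothing
```

`x // 2` is a sufficient statistic here, so `i_half` matches `i_full`, whereas the parity of \(X\) is independent of \(Y\) and `i_parity` is zero.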

Predictive Information

We want to make predictions
Use \(I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]\) to measure how much information the statistic \(\sigma\) gives us about the future
- Can also look at \(I[X^{t+m}_{t+1};\sigma(X_{-\infty}^t)]\) for all \(m\)
How big can we make that?

Predictive Sufficiency

For any statistic \(\sigma\), \[ I[X^{\infty}_{t+1};X_{-\infty}^t] \geq I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)] \]
\(\sigma\) is predictively sufficient iff \[ I[X^{\infty}_{t+1};X_{-\infty}^t] = I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)] \]
Sufficient statistics retain all predictive information in the data

Predictive states

(Crutchfield and Young 1989)

Histories \(a\) and \(b\) are equivalent iff \[ \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = a} = \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = b} \]
\([a] \equiv\) all histories equivalent to \(a\)
The statistic of interest, the predictive state, is \[ \epsilon(x^t_{-\infty}) = [x^t_{-\infty}] \]
Set \(s_t = \epsilon(x^t_{-\infty})\)
A state is an equivalence class of histories and a distribution over future events
IID = 1 state, periodic = \(p\) states

set of histories, color-coded by conditional distribution of futures

Partitioning histories into predictive states
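A toy sketch of the equivalence-class construction. As a simplifying assumption, each history is summarized by its next-symbol distribution rather than its full distribution over futures (enough for these two examples), and only a few short histories are listed by hand.

```python
from collections import defaultdict

def equivalence_classes(cond_dist):
    """Group histories whose conditional distributions are identical.
    cond_dist maps history -> dict of next-symbol probabilities."""
    classes = defaultdict(list)
    for history, dist in cond_dist.items():
        key = tuple(sorted(dist.items()))   # hashable form of the distribution
        classes[key].append(history)
    return sorted(sorted(c) for c in classes.values())

# IID fair coin: every history predicts the same thing
iid = {h: {"H": 0.5, "T": 0.5} for h in ("HH", "HT", "TH", "TT")}

# Period-3 process ...ABCABC...: the last symbol fixes the phase
periodic = {"CA": {"B": 1.0}, "AB": {"C": 1.0}, "BC": {"A": 1.0}}

iid_states = equivalence_classes(iid)
periodic_states = equivalence_classes(periodic)
```

The IID histories collapse into one class and the period-3 histories into three, matching "IID = 1 state, periodic = \(p\) states".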

Sufficiency

(Shalizi and Crutchfield 2001)

\[ I[X^{\infty}_{t+1};X^t_{-\infty}] = I[X^{\infty}_{t+1};\epsilon(X^t_{-\infty})] \]
because \[\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|S_t = \epsilon(x^t_{-\infty})} & = & \int_{y \in [x^t_{-\infty}]}{\Prob{X^{\infty}_{t+1}|X^t_{-\infty}=y} \Prob{X^t_{-\infty}=y|S_t = \epsilon(x^t_{-\infty})} dy}\\ & = & \Prob{X^{\infty}_{t+1}|X^t_{-\infty}=x^t_{-\infty}} \end{eqnarray*}\]

A non-sufficient partition of histories

Effect of insufficiency on predictive distributions

Markov Properties

Future observations are independent of the past given the predictive state: \[ X^{\infty}_{t+1} \indep X^{t}_{-\infty} | S_{t} \]
by sufficiency: \[\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}, S_{t} = \epsilon(x^t_{-\infty})} & = & \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}}\\ & = & \Prob{X^{\infty}_{t+1}|S_{t} = \epsilon(x^t_{-\infty})} \end{eqnarray*}\]

Recursive Updating/Deterministic Transitions

Recursive transitions for states: \[ \epsilon(x^{t+1}_{-\infty}) = T(\epsilon(x^t_{-\infty}), x_{t+1}) \]
- Automata theory: “deterministic transitions” (even though there are probabilities)
In continuous time: \[ \epsilon(x^{t+h}_{-\infty}) = T(\epsilon(x^t_{-\infty}),x^{t+h}_{t}) \]

Recursive Updating from Sufficiency

Claim: If \(u \sim v\), then \(ua \sim va\) for any next observation \(a\)
Fix any set of future events \(F\) \[\begin{eqnarray*} \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = v}\\ \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = u} & = & \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = v} \end{eqnarray*}\] \[\begin{eqnarray*} \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua}\Prob{X_{t+1}= a|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\Prob{X_{t+1}= a|X^t_{-\infty} = v} \end{eqnarray*}\] \[\begin{eqnarray*} \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\\ ua & \sim & va \end{eqnarray*}\]
- (same for continuous values or time but need more measure theory)

Predictive States are Markovian

\[ S_{t+1}^{\infty} \indep S^{t-1}_{-\infty}|S_t \]
because \[ S_{t+1}^{\infty} = T(S_t,X_{t+1}^{\infty}) \] and \[ X_{t+1}^{\infty} \indep \left\{ X^{t}_{-\infty}, S^{t-1}_{-\infty}\right\} | S_t \]

Minimality

\(\epsilon\) is minimal sufficient \(=\) can be computed from any other sufficient statistic
\(\therefore\) for any sufficient \(\eta\), there exists a function \(g\) such that \[ \epsilon(X^{t}_{-\infty}) = g(\eta(X^t_{-\infty})) \]
\(\therefore\) if \(\eta\) is sufficient \[ I[\epsilon(X^{t}_{-\infty}); X^{t}_{-\infty}] \leq I[\eta(X^{t}_{-\infty}); X^{t}_{-\infty}] \]
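A short worked step for why the inequality follows, assuming discrete states so that all entropies are finite. Both \(\epsilon(X^t_{-\infty})\) and \(\eta(X^t_{-\infty})\) are functions of the history, so their conditional entropies given the history are zero: \[\begin{eqnarray*} I[\epsilon(X^{t}_{-\infty}); X^{t}_{-\infty}] & = & H[\epsilon(X^{t}_{-\infty})] - H[\epsilon(X^{t}_{-\infty})|X^{t}_{-\infty}] = H[\epsilon(X^{t}_{-\infty})]\\ & = & H[g(\eta(X^{t}_{-\infty}))]\\ & \leq & H[\eta(X^{t}_{-\infty})] = I[\eta(X^{t}_{-\infty}); X^{t}_{-\infty}] \end{eqnarray*}\] The inequality is the data processing inequality for entropy, \(H[X] \geq H[\sigma(X)]\), applied with \(\sigma = g\).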

Sufficient, but not minimal, partition of histories

Coarser than the predictive states, but not sufficient

Uniqueness

There is really no other minimal sufficient statistic
If \(\eta\) is minimal, there is an \(h\) such that \[ \eta = h(\epsilon) ~\mathrm{w.p.1} \]
but \(\epsilon = g(\eta)\) (w.p.1)
so \[\begin{eqnarray*} g(h(\epsilon)) & = & \epsilon\\ h(g(\eta)) & = & \eta \end{eqnarray*}\]
\(\therefore\) \(\epsilon\) and \(\eta\) partition histories in the same way (w.p.1)

Minimal stochasticity

If \(R_t = \eta(X^{t}_{-\infty})\) is also sufficient, then \[ H[R_{t+1}|R_t] \geq H[S_{t+1}|S_t] \]
\(\therefore\) the predictive states are the closest we get to a deterministic model, without losing power

Entropy Rate

\[\begin{eqnarray*} h & \equiv & \lim_{n\rightarrow\infty}{H[X_{n+1}|X^n_1]}\\ & = & \lim_{n\rightarrow\infty}{H[X_{n+1}|S_n]} ~\text{(sufficiency)}\\ & = & H[X_2|S_1] ~\text{(stationarity)} \end{eqnarray*}\]
so the predictive states let us calculate the entropy rate
- and do source coding
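A minimal sketch of the calculation \(h = H[X_2|S_1]\) for a hypothetical two-state machine. The emission probabilities, state labels, and the power-iteration helper are assumptions made for illustration.

```python
from math import log2

# Hypothetical machine: EMIT[s][x] = P(X_{t+1}=x | S_t=s), T[s][x] = next state
EMIT = {"1": {"A": 0.5, "B": 0.5}, "2": {"B": 1.0}}
T = {"1": {"A": "1", "B": "2"}, "2": {"B": "1"}}

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

def stationary(emit, trans, iters=200):
    """Stationary distribution over states, by repeatedly pushing a
    distribution through the (emission, transition) dynamics."""
    pi = {s: 1.0 / len(emit) for s in emit}
    for _ in range(iters):
        new = {s: 0.0 for s in emit}
        for s, p in pi.items():
            for x, q in emit[s].items():
                new[trans[s][x]] += p * q
        pi = new
    return pi

pi = stationary(EMIT, T)
# Entropy rate: average, over states, of the next-symbol entropy in that state
h = sum(pi[s] * entropy(EMIT[s].values()) for s in EMIT)
```

State "1" is occupied two-thirds of the time and carries one bit of uncertainty, while state "2" carries none, so `h` comes out to 2/3 of a bit per symbol under these assumed probabilities.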

Minimal Markovian Representation

The observed process \((X_t)\) is non-Markovian and ugly
But it is generated from a homogeneous Markov process \((S_t)\)
After minimization, this representation is (essentially) unique
- Smaller Markovian representations can exist, but then we must always track distributions over their states
- and those distributions correspond to predictive states

What Sort of Markov Model?

Common-or-garden HMM: \[ S_{t+1} \indep X_{t+1}|S_t \]
But here \[ S_{t+1} = T(S_t, X_{t+1}) \]
This is a chain with complete connections (Onicescu and Mihoc 1935; Iosifescu and Grigorescu 1990)

Example of a CCC: Even Process

Blocks of As of any length, separated by even-length blocks of Bs
- Histories ending \(AB\) have different implications than histories ending \(ABB\) than \(ABBB\) than \(ABBBB\)
- \(\Rightarrow\) Not Markov at any finite order
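A simulation sketch of the Even Process as a two-state machine with recursive updating. The fair-coin emission probabilities, the state labels "1" and "2", and the known starting state are assumptions; the slides specify only the even-length constraint on blocks of Bs.

```python
import random

# T[state][symbol] -> next state; "1" = even number of Bs so far, "2" = odd
T = {"1": {"A": "1", "B": "2"}, "2": {"B": "1"}}
EMIT = {"1": {"A": 0.5, "B": 0.5}, "2": {"B": 1.0}}   # assumed probabilities

def simulate(n, seed=0):
    """Generate n symbols plus the state reached after each one."""
    rng = random.Random(seed)
    state, xs, states = "1", [], []
    for _ in range(n):
        symbols, weights = zip(*EMIT[state].items())
        x = rng.choices(symbols, weights)[0]
        xs.append(x)
        state = T[state][x]
        states.append(state)
    return "".join(xs), states

def filter_states(xs, start="1"):
    """Recover the state sequence from observations alone,
    applying S_{t+1} = T(S_t, X_{t+1}) one symbol at a time."""
    state, out = start, []
    for x in xs:
        state = T[state][x]
        out.append(state)
    return out

xs, hidden = simulate(2000)
recovered = filter_states(xs)
```

Because the next state is a function of the current state and the emitted symbol, `recovered` equals `hidden` exactly. In an ordinary HMM the state could only be inferred up to a posterior distribution.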

Inventions

Statistical relevance basis (Salmon 1971, 1984)
Measure-theoretic prediction process (Knight 1975)
Forecasting/true measure complexity (Grassberger 1986)
Causal states, \(\epsilon\) machine (Crutchfield and Young 1989)
- Not generally causal
- Though maybe sometimes? (Shalizi and Moore 2003)
Observable operator model (Jaeger 2000)
Predictive state representations (Littman, Sutton, and Singh 2002)
Sufficient posterior representation (Langford, Salakhutdinov, and Zhang 2009)

How Broad Are These Results?

Knight (1975) gave most general constructions

Non-stationary \(X\)
\(t\) continuous (but discrete works as special case)
\(X_t\) with values in continuous spaces
- Lusin space = image of a complete separable metrizable space under a measurable bijection
\(S_t\) is a Markov process with recursive updating

Connecting to Data

Everything so far has been math/probability
- The Oracle tells us the infinite-dimensional distribution of \(X\)
Can we do some statistics and find the states?
Two senses of “find”: learn in a fixed model vs. discover the right model

Learning

Problem: Given states and transitions (\(\epsilon, T\)), realization \(x_1^n\), estimate \(\Prob{X_{t+1}=x|S_t=s}\)

Just estimation for stochastic processes
- Easier than ordinary HMMs because \(S_t\) is a function of trajectory
- Exponential families in the all-discrete case, very tractable
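In the all-discrete case, learning reduces to counting. The sketch below assumes the transition function `T` and the starting state are known; the short symbol string is invented for illustration.

```python
from collections import Counter, defaultdict

def estimate_emissions(xs, T, start):
    """Maximum-likelihood estimate of P(X_{t+1}=x | S_t=s) by counting,
    given the transition function T and a known starting state."""
    counts = defaultdict(Counter)
    state = start
    for x in xs:
        counts[state][x] += 1   # symbol x was emitted from `state`
        state = T[state][x]     # recursive update of the state
    return {s: {x: c / sum(cnt.values()) for x, c in cnt.items()}
            for s, cnt in counts.items()}

# Hypothetical two-state structure and a short made-up realization
T = {"1": {"A": "1", "B": "2"}, "2": {"B": "1"}}
xs = "AABBABBBBAABB"
est = estimate_emissions(xs, T, start="1")
```

No EM or forward-backward pass is needed, since the state at each time is determined by the trajectory.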

Discovery

Problem: Given \(x_1^n\), estimate \(\epsilon, T, \Prob{X_{t+1}=x|S_t=s}\)

Much harder!
- Why should this be possible at all?
- Can’t cover every possible approach (and will neglect both Langford, Salakhutdinov, and Zhang (2009) and Pfau, Bartlett, and Wood (2010))

CSSR: Causal State Splitting Reconstruction

Key observation: Recursion + one-step-ahead predictive sufficiency \(\Rightarrow\) general predictive sufficiency
- Get next-step distribution right by independence testing
- Then make states recursive
Assumes discrete observations, discrete time, finite causal states
Paper: Shalizi and Klinkner (2004); code: https://github.com/stites/CSSR

One-Step Ahead Prediction

Start with all histories in the same state
Given current partition of histories into states, test whether going one step further back into the past changes the next-step conditional distribution
- Use a hypothesis test to control false positive rate
If yes, split that cell of the partition, but see if it matches an existing distribution
- Must allow this merging or else lose minimality
If no match, add new cell to the partition
Stop when no more divisions can be made or a maximum history length \(\Lambda\) is reached
- For consistency, \(\Lambda < \frac{\log{n}}{h_1 + \iota}\) for some \(\iota\)
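The splitting test can be sketched as follows. This is an illustration only, not the CSSR implementation: a Pearson chi-squared homogeneity statistic stands in for "a hypothesis test", the data come from an assumed fair-coin Even Process, and the critical value is hard-coded for a binary alphabet.

```python
import random
from collections import Counter

def simulate_even(n, seed=1):
    """Assumed Even Process: in state 1 emit A or B with probability 1/2 each;
    after an odd number of Bs (state 2) a B is forced."""
    rng = random.Random(seed)
    state, out = 1, []
    for _ in range(n):
        x = "B" if state == 2 else rng.choice("AB")
        out.append(x)
        state = 2 if (state == 1 and x == "B") else 1
    return "".join(out)

def next_symbol_counts(xs, suffix):
    """Counts of the symbol following each occurrence of `suffix` in xs."""
    k = len(suffix)
    return Counter(xs[i + k] for i in range(len(xs) - k) if xs[i:i + k] == suffix)

def chi2_homogeneity(c1, c2):
    """Pearson chi-squared statistic for 'c1 and c2 share one distribution'."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    if n1 == 0 or n2 == 0:
        return 0.0
    stat = 0.0
    for sym in set(c1) | set(c2):
        pooled = (c1[sym] + c2[sym]) / (n1 + n2)
        for c, n in ((c1, n1), (c2, n2)):
            expected = n * pooled
            stat += (c[sym] - expected) ** 2 / expected
    return stat

xs = simulate_even(5000)
CRITICAL = 7.879   # 0.995 quantile of chi-squared with 1 degree of freedom

# Does going one step further back from "B" change the next-step distribution?
stat_b = chi2_homogeneity(next_symbol_counts(xs, "AB"), next_symbol_counts(xs, "BB"))
# Same question for histories ending in "A"
stat_a = chi2_homogeneity(next_symbol_counts(xs, "AA"), next_symbol_counts(xs, "BA"))
split_b = stat_b > CRITICAL
```

Histories ending in "AB" are always followed by B, while those ending in "BB" are not, so the test splits the cell for "B". Histories ending in "AA" and "BA" share a next-step distribution, so `stat_a` should exceed the critical value only at about the false positive rate.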

Ensuring Recursive Transitions

Need to determinize a probabilistic automaton
Several ways of doing this, but a solved technical problem in automata theory
- Textbooks: Lewis and Papadimitriou (1998); Hopcroft and Ullman (1979)
- Brilliant but bizarre math approach: Conway (1971)
Trickiest coding in the algorithm and can influence the finite-sample behavior

Convergence

\(\mathcal{S} =\) true predictive state structure
\(\widehat{\mathcal{S}}_n\) = structure reconstructed from \(n\) data points
Assume: finite # of states, every state has a finite history, using long enough histories, technicalities: \[ \Prob{\widehat{\mathcal{S}}_n \neq \mathcal{S}} \rightarrow 0 \]
\(\mathcal{D} =\) true distribution, \(\widehat{\mathcal{D}}_n\) = inferred \[ \Expect{{\|\widehat{\mathcal{D}}_n - \mathcal{D}\|}_{1}} = O(n^{-1/2}) \]
- Same order of convergence as for IID samples

Hand-waving

Empirical conditional distributions for histories converge
- ergodic theorem for Markov chains
Histories in the same state become harder to accidentally separate
Histories in different states become harder to confuse
Each state’s predictive distribution converges \(O(n^{-1/2})\)
- CLT for Markov chains

Example: The Even Process

reconstruction with \(\Lambda = 3\), \(n=1000\), \(\alpha = 0.005\)

N.B., CSSR did not know that there were 2 states, or how they were connected — it discovered this

Some Uses

Neural spike train analysis (Haslinger, Klinkner, and Shalizi 2010), fMRI analysis (Merriam, Genovese and Shalizi in prep.)
Geomagnetic fluctuations (Clarke, Freeman, and Watkins 2003)
Natural language processing (M. Padró and Padró 2005b, 2005a, 2005c, 2007a, 2007b)
Anomaly detection (David S. Friedlander, Phoha, and Brooks 2003; David S. Friedlander et al. 2003; Ray 2004)
Information sharing in networks (Klinkner, Shalizi, and Camperi 2006; Shalizi, Camperi, and Klinkner 2007)
Social media propagation (Cointet, Faure, and Roth 2007)

Summary

Your stochastic process has a unique, minimal Markovian representation
This representation has nice predictive properties
Can reconstruct from sample data in some cases
- and a lot more could be done in this line

Backup: Why Care About Sufficiency?

Optimal strategy, under any loss function, only needs a sufficient statistic (Blackwell & Girshick)
Strategies using insufficient statistics can generally be improved (Blackwell & Rao)
Excuse for not worrying about particular loss functions

Backup: A Cousin: The Information Bottleneck

(Tishby, Pereira, and Bialek 1999)

For inputs \(X\) and outputs \(Y\), fix \(\beta > 0\), find \(\eta(X)\), the bottleneck variable, maximizing \[ \beta I[\eta(X);Y] - I[\eta(X);X] \]
give up 1 bit of predictive information for \(\beta\) bits of memory
Predictive sufficiency comes as \(\beta \rightarrow \infty\), unwilling to lose any predictive power

Backup: Extension to Input-Output Systems

(Littman, Sutton, and Singh 2002; Shalizi 2001)

System output \((X_t)\), input \((Y_t)\)
Histories \(x^t_{-\infty}, y^t_{-\infty}\) have distributions of output \(x_{t+1}\) for each further input \(y_{t+1}\)
Equivalence class these distributions and enforce recursive updating
Internal states of the system, not trying to predict future inputs

Backup: Extension to Spatiotemporal Systems

(Shalizi 2003; Shalizi, Klinkner, and Haslinger 2004; Shalizi et al. 2006; Jänicke et al. 2007; Goerg 2013, 2014; Goerg and Shalizi 2012, 2013; Montañez and Shalizi 2017)

Dynamic random field \(X(\vec{r},t)\)
- Assume a finite maximum speed \(c\) at which information can propagate
Past cone: points in space-time which could matter to \(X(\vec{r},t)\)
Future cone: points in space-time for which \(X(\vec{r},t)\) could matter

Equivalence-class past cone configurations by conditional distributions over future cones
\(S(\vec{r},t)\) is a Markov field
Minimal sufficiency, recursive updating, etc., all go through

Backup: Statistical Forecasting Complexity

Statistical forecasting complexity is (Grassberger 1986; Crutchfield and Young 1989) \[ C \equiv I[\epsilon(X^t_{-\infty});X^t_{-\infty}] \]
\(=\) amount of information about the past needed for optimal prediction
\(=H[\epsilon(X^t_{-\infty})]\) for predictive causal states
\(=\) expected algorithmic sophistication (Gács, Tromp, and Vitanyi 2001)
\(=\log\)(period) for periodic processes
\(=\log\)(geometric mean(recurrence time)) for stationary processes
Property of the process, not our models
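Two quick evaluations of \(C = H[\epsilon(X^t_{-\infty})]\), as a sketch. The period and the two-state stationary distribution (2/3, 1/3) are assumed example values; the latter is what a fair-coin Even Process would give.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Periodic process: its `period` predictive states are equally likely under stationarity
period = 5
c_periodic = entropy([1 / period] * period)   # should equal log2(period)

# Hypothetical two-state process with stationary state probabilities 2/3 and 1/3
c_two_state = entropy([2 / 3, 1 / 3])
```

The second value is a little under one bit: less memory is needed than for a fair coin flip, because the two states are not equally likely.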

References

Clarke, Richard W., Mervyn P. Freeman, and Nicholas W. Watkins. 2003. “Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism.” Physical Review E 67:0126203. http://arxiv.org/abs/cond-mat/0110228.

Cointet, Jean-Philippe, Emmanuel Faure, and Camille Roth. 2007. “Intertemporal Topic Correlations in Online Media.” In Proceedings of the International Conference on Weblogs and Social Media [ICWSM]. Boulder, CO, USA. http://camille.roth.free.fr/travaux/cointetfaureroth-icwsm-cr4p.pdf.

Conway, J. H. 1971. Regular Algebra and Finite Machines. London: Chapman and Hall.

Crutchfield, James P., and Karl Young. 1989. “Inferring Statistical Complexity.” Physical Review Letters 63:105–8. http://www.santafe.edu/~cmg/compmech/pubs/ISCTitlePage.htm.

Friedlander, David S., Shashi Phoha, and Richard Brooks. 2003. “Determination of Vehicle Behavior Based on Distributed Sensor Network Data.” In Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, edited by Franklin T. Luk. Vol. 5205. Proceedings of the SPIE. Bellingham, WA: SPIE.

Friedlander, David S., Isanu Chattopadhayay, Asok Ray, Shashi Phoha, and Noah Jacobson. 2003. “Anomaly Prediction in Mechanical System Using Symbolic Dynamics.” In Proceedings of the 2003 American Control Conference, Denver, CO, 4–6 June 2003.

Gács, Péter, John T. Tromp, and Paul M. B. Vitanyi. 2001. “Algorithmic Statistics.” IEEE Transactions on Information Theory 47:2443–63. http://arxiv.org/abs/math.PR/0006233.

Goerg, Georg M. 2013. LICORS: Light Cone Reconstruction of States — Predictive State Estimation from Spatio-Temporal Data. http://CRAN.R-project.org/package=LICORS.

———. 2014. LSC: Local Statistical Complexity — Automatic Pattern Discovery in Spatio-Temporal Data. http://CRAN.R-project.org/package=LSC.

Goerg, Georg M., and Cosma Rohilla Shalizi. 2012. “LICORS: Light Cone Reconstruction of States for Non-Parametric Forecasting of Spatio-Temporal Systems.” Statistics Department, CMU. http://arxiv.org/abs/1206.2398.

———. 2013. “Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction.” In Sixteenth International Conference on Artificial Intelligence and Statistics, edited by Carlos M. Carvalho and Pradeep Ravikumar, 289–97. http://arxiv.org/abs/1211.3760.

Grassberger, Peter. 1986. “Toward a Quantitative Theory of Self-Generated Complexity.” International Journal of Theoretical Physics 25:907–38.

Haslinger, Robert, Kristina Lisa Klinkner, and Cosma Rohilla Shalizi. 2010. “The Computational Structure of Spike Trains.” Neural Computation 22:121–57. https://doi.org/10.1162/neco.2009.12-07-678.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading: Addison-Wesley.

Iosifescu, Marius, and Serban Grigorescu. 1990. Dependence with Complete Connections and Its Applications. Cambridge, England: Cambridge University Press.

Jaeger, Herbert. 2000. “Observable Operator Models for Discrete Stochastic Time Series.” Neural Computation 12:1371–98. https://doi.org/10.1162/089976600300015411.

Jänicke, Heike, Alexander Wiebel, Gerik Scheuermann, and Wolfgang Kollmann. 2007. “Multifield Visualization Using Local Statistical Complexity.” IEEE Transactions on Visualization and Computer Graphics 13:1384–91. https://doi.org/10.1109/TVCG.2007.70615.

Klinkner, Kristina Lisa, Cosma Rohilla Shalizi, and Marcelo F. Camperi. 2006. “Measuring Shared Information and Coordinated Activity in Neuronal Networks.” In Advances in Neural Information Processing Systems 18 (Nips 2005), edited by Yair Weiss, Bernhard Schölkopf, and John C. Platt, 667–74. Cambridge, Massachusetts: MIT Press. http://arxiv.org/abs/q-bio.NC/0506009.

Knight, Frank B. 1975. “A Predictive View of Continuous Time Processes.” Annals of Probability 3:573–96. http://projecteuclid.org/euclid.aop/1176996302.

Langford, John, Ruslan Salakhutdinov, and Tong Zhang. 2009. “Learning Nonlinear Dynamic Models.” In Proceedings of the 26th Annual International Conference on Machine Learning [Icml 2009], edited by Andrea Danyluk, Léon Bottou, and Michael Littman, 593–600. New York: Association for Computing Machinery. http://arxiv.org/abs/0905.3369.

Lewis, Harry R., and Christos H. Papadimitriou. 1998. Elements of the Theory of Computation. Second. Upper Saddle River, New Jersey: Prentice-Hall.

Littman, Michael L., Richard S. Sutton, and Satinder Singh. 2002. “Predictive Representations of State.” In Advances in Neural Information Processing Systems 14 (Nips 2001), edited by Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, 1555–61. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/1983-predictive-representations-of-state.

Montañez, George D., and Cosma Rohilla Shalizi. 2017. “The LICORS Cabinet: Nonparametric Algorithms for Spatio-Temporal Prediction.” In International Joint Conference on Neural Networks 2017 [Ijcnn 2017], 2811–9. https://doi.org/10.1109/IJCNN.2017.7966203.

Onicescu, Octav, and Gheorghe Mihoc. 1935. “Sur Les Chaînes de Variables Statistiques.” Comptes Rendus de L’Académie Des Sciences de Paris 200:511–12.

Padró, Muntsa, and Lluís Padró. 2005a. “A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm.” Procesamiento Del Lenguaje Natural 35:319–26. http://www.lsi.upc.edu/~nlp/papers/2005/sepln05-pp.pdf.

———. 2005b. “Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition.” In Proceedings of 5th International Workshop on Finite-State Methods and Natural Language Processing (Fsmnlp’05). http://www.lsi.upc.edu/~nlp/papers/2005/fsmnlp05-pp.pdf.

———. 2005c. “Approaching Sequential NLP Tasks with an Automata Acquisition Algorithm.” In Proceedings of International Conference on Recent Advances in Nlp (Ranlp’05). http://www.lsi.upc.edu/~nlp/papers/2005/ranlp05-pp.pdf.

———. 2007a. “ME-CSSR: An Extension of CSSR Using Maximum Entropy Models.” In Proceedings of Finite State Methods for Natural Language Processing (Fsmnlp) 2007. http://www.lsi.upc.edu/%7Enlp/papers/2007/fsmnlp07-pp.pdf.

———. 2007b. “Studying CSSR Algorithm Applicability on Nlp Tasks.” Procesamiento Del Lenguaje Natural 39:89–96. http://www.lsi.upc.edu/%7Enlp/papers/2007/sepln07-pp.pdf.

Pfau, David, Nicholas Bartlett, and Frank Wood. 2010. “Probabilistic Deterministic Infinite Automata.” In Advances in Neural Information Processing Systems 23 [Nips 2010], edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 1930–8. Cambridge, Massachusetts: MIT Press. http://books.nips.cc/papers/files/nips23/NIPS2010_1179.pdf.

Ray, Asok. 2004. “Symbolic Dynamic Analysis of Complex Systems for Anomaly Detection.” Signal Processing 84:1115–30.

Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.

———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Shalizi, Cosma Rohilla. 2001. “Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata.” PhD thesis, University of Wisconsin-Madison. http://bactra.org/thesis/.

———. 2003. “Optimal Nonlinear Prediction of Random Fields on Networks.” Discrete Mathematics and Theoretical Computer Science AB(DMCS):11–30. http://arxiv.org/abs/math.PR/0305160.

Shalizi, Cosma Rohilla, Marcelo F. Camperi, and Kristina Lisa Klinkner. 2007. “Discovering Functional Communities in Dynamical Networks.” In Statistical Network Analysis: Models, Issues, and New Directions: ICML 2006 Workshop on Statistical Network Analysis, Pittsburgh, PA, USA, June 2006: Revised Selected Papers, edited by Edo Airoldi, David M. Blei, Stephen E. Fienberg, Anna Goldenberg, Eric P. Xing, and Alice X. Zheng, 4503:140–57. Lecture Notes in Computer Science. New York: Springer-Verlag. http://arxiv.org/abs/q-bio.NC/0609008.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2001. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics 104:817–79. http://arxiv.org/abs/cond-mat/9907176.

Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina Lisa Klinkner, and Cristopher Moore. 2006. “Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” Physical Review E 73:036104. http://arxiv.org/abs/nlin.CG/0508001.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (Uai 2004), edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press. http://arxiv.org/abs/cs.LG/0406011.

Shalizi, Cosma Rohilla, Kristina Lisa Klinkner, and Robert Haslinger. 2004. “Quantifying Self-Organization with Optimal Predictors.” Physical Review Letters 93:118701. https://doi.org/10.1103/PhysRevLett.93.118701.

Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. “What Is a Macrostate? From Subjective Measurements to Objective Dynamics.” arxiv:cond-mat/0303625. http://arxiv.org/abs/cond-mat/0303625.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.