# Information Theory and Optimal Prediction

20 November 2018

$\newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp}$

# “State”

• In classical physics or dynamical systems, “state” is the present variable which fixes all future observables

• Back away from determinism: state determines distribution of future observables

• Would like a small state

• Would like the state to be well-behaved, e.g. Markov

• Try to construct states by constructing predictions

# Notation etc.

• Upper-case letters are random variables, lower-case their realizations

• Stochastic process $$\ldots, X_{-1}, X_0, X_1, X_2, \ldots$$

• $$X_{s}^{t} = (X_s, X_{s+1}, \ldots, X_{t-1}, X_t)$$

• Past up to and including $$t$$ is $$X^t_{-\infty}$$, future is $$X_{t+1}^{\infty}$$
• Discrete time isn’t required but cuts down on measure theory

# Making a Prediction

• Look at $$X^t_{-\infty}$$, make a guess about $$X_{t+1}^{\infty}$$
• Most general guess is a probability distribution
• Only ever attend to selected aspects of $$X^t_{-\infty}$$
• mean, variance, phases of the first three Fourier modes, …
• $$\therefore$$ guess is a function or statistic of $$X^t_{-\infty}$$
• A good statistic should give us lots of information about $$X_{t+1}^{\infty}$$

# Entropy

• The entropy of a random variable $$X$$: $H[X] \equiv -\sum_{x}{\Prob{X=x}\log_2{\Prob{X=x}}}$
• Some mathematical properties:
• $$H[X] \geq 0$$
• $$H[X] = 0$$ if and only if $$\Prob{X=x_0} = 1$$ for some $$x_0$$
• If $$X$$ has $$k$$ possible values, $$H[X] \leq \log_2{k}$$
• $$H[X] = \log_2{k}$$ if and only if $$\Prob{X=x} = 1/k$$ for all $$x$$
• Data processing inequality: $$H[X] \geq H[\sigma(X)]$$
• $$H[X] = H[\sigma(X)]$$ iff $$\sigma$$ is 1-1
• $$H[X] \approx$$ number of bits needed to encode the value of $$X$$
• $$H[X]$$ is the uncertainty in the distribution of $$X$$
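These properties are easy to spot-check numerically; a minimal sketch (the helper `entropy` is mine, not from the slides):

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(pi * log2(pi) for pi in p if pi > 0)

# A point mass has zero entropy; the uniform distribution on k
# outcomes attains the maximum log2(k).
assert entropy([1.0, 0.0, 0.0]) == 0.0
assert abs(entropy([0.25] * 4) - 2.0) < 1e-12
# Any non-uniform distribution on 4 outcomes falls strictly below log2(4) = 2.
assert entropy([0.5, 0.3, 0.1, 0.1]) < 2.0
```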

# Joint and Conditional Entropy

• Conditional entropy: $\begin{eqnarray} H[Y|X=x] & \equiv & -\sum_{y}{\Prob{Y=y|X=x}\log_2{\Prob{Y=y|X=x}}}\\ H[Y|X] & \equiv & \sum_{x}{\Prob{X=x} H[Y|X=x]} \end{eqnarray}$
• Condition on multiple variables in the natural way
• Joint entropy: $H[X,Y] = H[X] + H[Y|X] = H[Y] + H[X|Y]$
• Same interpretations as entropy
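The chain rule can be verified directly from a joint probability table; the table below is a made-up example, not anything from the slides:

```python
from math import log2

def H(p):
    return -sum(v * log2(v) for v in p if v > 0)

# A made-up joint distribution over (x, y); any table summing to 1 works.
joint = {("a", 0): 0.3, ("a", 1): 0.1, ("b", 0): 0.2, ("b", 1): 0.4}

# Marginal distribution of X.
px = {}
for (x, _), pr in joint.items():
    px[x] = px.get(x, 0.0) + pr

# H[Y|X] = sum_x P(X=x) H[Y|X=x], then the chain rule H[X,Y] = H[X] + H[Y|X].
HYgX = sum(px[x] * H([joint[(x, y)] / px[x] for y in (0, 1)]) for x in px)
HXY = H(list(joint.values()))
assert abs(HXY - (H(list(px.values())) + HYgX)) < 1e-9
```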

# Entropy Rate

• Limiting entropy per time-step: $h \equiv \lim_{n\rightarrow\infty}{\frac{1}{n}H[X_1^n]}$
• Equivalently, limiting conditional entropy: $h = \lim_{n\rightarrow\infty}{H[X_{n+1}|X_1^n]}$
• Both limits exist and are equal for all stationary processes
• Interpretations:
• How many bits do we need to code the next observation, given the complete history? (“source coding”)
• How much uncertainty is there about what happens next, given the complete history?
• $$h=0$$ for deterministic processes
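For a concrete (made-up) two-state Markov chain the conditional form of the limit is computable in closed form, since conditioning on the whole past reduces to conditioning on the last step:

```python
from math import log2

def H(p):
    return -sum(v * log2(v) for v in p if v > 0)

# A made-up stationary two-state Markov chain; rows of P are P(next | current).
P = [[0.9, 0.1], [0.5, 0.5]]
# Its stationary distribution (solving pi P = pi for a 2-state chain).
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1.0 - pi0]

# h = sum_s pi(s) H[X_{t+1} | X_t = s] for a stationary Markov chain.
h = sum(pi[s] * H(P[s]) for s in range(2))
assert 0 < h <= 1.0    # bounded by log2(2) = 1 bit per step
assert h < H(pi)       # conditioning on the present reduces uncertainty here
```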

# Information

• The information $$X$$ gives about $$Y$$ is the reduction in uncertainty: $H[Y] - H[Y|X]$
• Some algebra shows that $H[Y] - H[Y|X] = H[X] - H[X|Y]$
• Define the mutual information $I[X;Y] \equiv H[Y] - H[Y|X]$
• Some properties:
• $$I[X;Y] \geq 0$$
• $$I[X;Y] = 0$$ if and only if $$X$$ and $$Y$$ are independent
• Data processing: $$I[X;Y] \geq I[\sigma(X);Y]$$
• Can be preserved even if $$\sigma$$ isn’t 1-1
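These properties can also be checked from a joint table; both distributions below are invented for illustration:

```python
from math import log2

def mi(joint):
    """I[X;Y] in bits from a dict {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables carry no information about each other.
indep = {(x, y): 0.25 for x in "ab" for y in "cd"}
assert abs(mi(indep)) < 1e-12

# A dependent (toy) joint distribution has I[X;Y] > 0.
dep = {("a", "c"): 0.4, ("a", "d"): 0.1, ("b", "c"): 0.1, ("b", "d"): 0.4}
assert mi(dep) > 0
```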

# Predictive Information

• We want to make predictions
• Use $$I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]$$ to measure how much information the statistic $$\sigma$$ gives us about the future
• Can also look at $$I[X^{t+m}_{t+1};\sigma(X_{-\infty}^t)]$$ for all $$m$$
• How big can we make that?

# Predictive Sufficiency

• For any statistic $$\sigma$$, $I[X^{\infty}_{t+1};X_{-\infty}^t] \geq I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]$
• $$\sigma$$ is predictively sufficient iff $I[X^{\infty}_{t+1};X_{-\infty}^t] = I[X^{\infty}_{t+1};\sigma(X_{-\infty}^t)]$
• Sufficient statistics retain all predictive information in the data

# Predictive states

(Crutchfield and Young 1989)

• Histories $$a$$ and $$b$$ are equivalent iff $\Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = a} = \Prob{X_{t+1}^{\infty}|X_{-\infty}^{t} = b}$
• $$[a] \equiv$$ all histories equivalent to $$a$$
• The statistic of interest, the predictive state, is $\epsilon(x^t_{-\infty}) = [x^t_{-\infty}]$
• Set $$s_t = \epsilon(x^t_{-\infty})$$
• A state is an equivalence class of histories and a distribution over future events
• IID = 1 state, period-$$p$$ = $$p$$ states

set of histories, color-coded by conditional distribution of futures

Partitioning histories into predictive states

# Sufficiency

(Shalizi and Crutchfield 2001)

• $I[X^{\infty}_{t+1};X^t_{-\infty}] = I[X^{\infty}_{t+1};\epsilon(X^t_{-\infty})]$
• because $\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|S_t = \epsilon(x^t_{-\infty})} & = & \int_{y \in [x^t_{-\infty}]}{\Prob{X^{\infty}_{t+1}|X^t_{-\infty}=y} \Prob{X^t_{-\infty}=y|S_t = \epsilon(x^t_{-\infty})} dy}\\ & = & \Prob{X^{\infty}_{t+1}|X^t_{-\infty}=x^t_{-\infty}} \end{eqnarray*}$

A non-sufficient partition of histories

Effect of insufficiency on predictive distributions

# Markov Properties

• Future observations are independent of the past given the causal state: $X^{\infty}_{t+1} \indep X^{t}_{-\infty} | S_{t}$
• by sufficiency: $\begin{eqnarray*} \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}, S_{t} = \epsilon(x^t_{-\infty})} & = & \Prob{X^{\infty}_{t+1}|X^{t}_{-\infty}=x^t_{-\infty}}\\ & = & \Prob{X^{\infty}_{t+1}|S_{t} = \epsilon(x^t_{-\infty})} \end{eqnarray*}$

# Recursive Updating/Deterministic Transitions

• Recursive transitions for states: $\epsilon(x^{t+1}_{-\infty}) = T(\epsilon(x^t_{-\infty}), x_{t+1})$
• Automata theory: “deterministic transitions” (even though there are probabilities)
• In continuous time: $\epsilon(x^{t+h}_{-\infty}) = T(\epsilon(x^t_{-\infty}),x^{t+h}_{t})$
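A toy instance of such a $$T$$: for a hypothetical period-3 process repeating ABC, the predictive state is the phase, and the recursive update just advances it:

```python
# For the period-3 process ABCABC..., the predictive state is the phase,
# and the recursive update T advances it deterministically.
def T(state, x):
    # The next predictive state depends only on the current state and
    # the newest observation, never on the rest of the history.
    assert x == "ABC"[state], "this symbol has probability zero here"
    return (state + 1) % 3

s = 0
for x in "ABCABCABC":
    s = T(s, x)
assert s == 0  # three full periods return us to phase 0
```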

# Recursive Updating from Sufficiency

• Claim: If $$u \sim v$$, then $$ua \sim va$$ for any next observation $$a$$
• Fix any set of future events $$F$$ $\begin{eqnarray*} \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+1} \in aF|X^t_{-\infty} = v}\\ \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = u} & = & \Prob{X_{t+1}= a, X^{\infty}_{t+2} \in F|X^t_{-\infty} = v} \end{eqnarray*}$ $\begin{eqnarray*} \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua}\Prob{X_{t+1}= a|X^t_{-\infty} = u} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\Prob{X_{t+1}= a|X^t_{-\infty} = v} \end{eqnarray*}$ $\begin{eqnarray*} \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = ua} & = & \Prob{X^{\infty}_{t+2} \in F|X^{t+1}_{-\infty} = va}\\ ua & \sim & va \end{eqnarray*}$
• (the last step divides by $\Prob{X_{t+1}=a|X^t_{-\infty} = u} = \Prob{X_{t+1}=a|X^t_{-\infty} = v}$, equal because $u \sim v$)
• (same for continuous values or time but need more measure theory)

# Predictive States are Markovian

• $S_{t+1}^{\infty} \indep S^{t-1}_{-\infty}|S_t$
• because iterating the recursion gives $S_{t+1}^{\infty} = T(S_t,X_{t+1}^{\infty})$, and $X_{t+1}^{\infty} \indep \left\{ X^{t-1}_{-\infty}, S^{t-1}_{-\infty}\right\} | S_t$

# Minimality

• $$\epsilon$$ is minimal sufficient: it can be computed from any other sufficient statistic
• $$\therefore$$ for any sufficient $$\eta$$, exists a function $$g$$ such that $\epsilon(X^{t}_{-\infty}) = g(\eta(X^t_{-\infty}))$
• $$\therefore$$ if $$\eta$$ is sufficient $I[\epsilon(X^{t}_{-\infty}); X^{t}_{-\infty}] \leq I[\eta(X^{t}_{-\infty}); X^{t}_{-\infty}]$

Sufficient, but not minimal, partition of histories

Coarser than the predictive states, but not sufficient

# Uniqueness

• There is really no other minimal sufficient statistic
• If $$\eta$$ is minimal, there is an $$h$$ such that $\eta = h(\epsilon) ~\mathrm{w.p.1}$
• but $$\epsilon = g(\eta)$$ (w.p.1)
• so $\begin{eqnarray*} g(h(\epsilon)) & = & \epsilon\\ h(g(\eta)) & = & \eta \end{eqnarray*}$
• $$\therefore$$ $$\epsilon$$ and $$\eta$$ partition histories in the same way (w.p.1)

# Minimal stochasticity

• If $$R_t = \eta(X^{t}_{-\infty})$$ is also sufficient, then $H[R_{t+1}|R_t] \geq H[S_{t+1}|S_t]$
• $$\therefore$$ the predictive states are the closest we get to a deterministic model, without losing power

# Entropy Rate

• $\begin{eqnarray*} h & \equiv & \lim_{n\rightarrow\infty}{H[X_{n+1}|X^n_1]}\\ & = & \lim_{n\rightarrow\infty}{H[X_{n+1}|S_n]} ~\text{(sufficiency)}\\ & = & H[X_2|S_1] ~\text{(stationarity)} \end{eqnarray*}$
• so the predictive states let us calculate the entropy rate
• and do source coding

# Minimal Markovian Representation

• The observed process $$(X_t)$$ is non-Markovian and ugly
• But it is generated from a homogeneous Markov process $$(S_t)$$
• After minimization, this representation is (essentially) unique
• Smaller Markovian representations can exist, but then we always have distributions over their states
• and those distributions correspond to predictive states

# What Sort of Markov Model?

• Common-or-garden HMM: $S_{t+1} \indep X_{t+1}|S_t$
• But here $S_{t+1} = T(S_t, X_{t+1})$
• This is a chain with complete connections (Onicescu and Mihoc 1935; Iosifescu and Grigorescu 1990)

HMM

CCC

# Example of a CCC: Even Process

• Blocks of As of any length, separated by even-length blocks of Bs
• Histories ending $$AB$$, $$ABB$$, $$ABBB$$, $$ABBBB$$, … all have different implications for the future
• $$\Rightarrow$$ Not Markov at any finite order
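The two-state presentation makes this easy to simulate; a quick check of the claims above (the fair-coin emission probability is my choice, not forced by the process):

```python
import random

# Simulate the even process: state 0 after an even number of trailing Bs
# (free choice, here fair), state 1 after an odd number (forced to emit B).
rng = random.Random(7)
chars, s = [], 0
for _ in range(10000):
    if s == 1:
        x, s = "B", 0
    else:
        x = "A" if rng.random() < 0.5 else "B"
        s = 1 if x == "B" else 0
    chars.append(x)
seq = "".join(chars)

# Interior B-blocks all have even length, as the definition requires.
assert all(len(r) % 2 == 0 for r in seq.strip("B").split("A") if r)
# A history ending "AB" forces the next symbol to be B ...
assert all(seq[i + 2] == "B" for i in range(len(seq) - 2)
           if seq[i:i + 2] == "AB")
# ... while after "ABB" both symbols occur: the extra B matters.
follows = {seq[i + 3] for i in range(len(seq) - 3) if seq[i:i + 3] == "ABB"}
assert follows == {"A", "B"}
```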

# Inventions

• Statistical relevance basis (Salmon 1971, 1984)
• Measure-theoretic prediction process (Knight 1975)
• Forecasting/true measure complexity (Grassberger 1986)
• Causal states, $$\epsilon$$ machine (Crutchfield and Young 1989)
• Not generally causal
• Though maybe sometimes? (Shalizi and Moore 2003)
• Observable operator model (Jaeger 2000)
• Predictive state representations (Littman, Sutton, and Singh 2002)
• Sufficient posterior representation (Langford, Salakhutdinov, and Zhang 2009)

# How Broad Are These Results?

Knight (1975) gave the most general constructions

• Non-stationary $$X$$
• $$t$$ continuous (but discrete works as special case)
• $$X_t$$ with values in continuous spaces
• Lusin space = image of a complete separable metrizable space under a measurable bijection
• $$S_t$$ is a Markov process with recursive updating

# Connecting to Data

• Everything so far has been math/probability
• The Oracle tells us the infinite-dimensional distribution of $$X$$
• Can we do some statistics and find the states?
• Two senses of “find”: learn in a fixed model vs. discover the right model

# Learning

Problem: Given states and transitions ($$\epsilon, T$$), realization $$x_1^n$$, estimate $$\Prob{X_{t+1}=x|S_t=s}$$

• Just estimation for stochastic processes
• Easier than ordinary HMMs because $$S_t$$ is a function of the trajectory
• Exponential families in the all-discrete case, very tractable

# Discovery

Problem: Given $$x_1^n$$, estimate $$\epsilon, T, \Prob{X_{t+1}=x|S_t=s}$$

• Much harder!
• Why should this be possible at all?
• Can’t cover every possible approach (and will neglect both Langford, Salakhutdinov, and Zhang (2009) and Pfau, Bartlett, and Wood (2010))

# CSSR: Causal State Splitting Reconstruction

• Key observation: Recursion + one-step-ahead predictive sufficiency $$\Rightarrow$$ general predictive sufficiency
• Get next-step distribution right by independence testing
• Then make states recursive
• Assumes discrete observations, discrete time, finite causal states

• Paper: Shalizi and Klinkner (2004); code: https://github.com/stites/CSSR

• Given current partition of histories into states, test whether going one step further back into the past changes the next-step conditional distribution
• Use a hypothesis test to control false positive rate
• If yes, split that cell of the partition, but see if it matches an existing distribution
• Must allow this merging or else lose minimality
• If no match, add new cell to the partition
• Stop when no more divisions can be made or a maximum history length $$\Lambda$$ is reached
• For consistency, $$\Lambda < \frac{\log{n}}{h_1 + \iota}$$ for some $$\iota > 0$$
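The split-or-match step can be sketched on simulated even-process data; everything below is a deliberate simplification of my own: empirical next-symbol distributions stand in for the tested quantities, a fixed tolerance stands in for the hypothesis test, and the determinization step is omitted entirely:

```python
import random
from collections import Counter

# Simulate the even process (state 1 = odd number of trailing Bs).
rng = random.Random(1)
chars, s = [], 0
for _ in range(50000):
    if s == 1:
        x, s = "B", 0
    else:
        x = "A" if rng.random() < 0.5 else "B"
        s = 1 if x == "B" else 0
    chars.append(x)
seq = "".join(chars)

def next_dist(suffix):
    # empirical distribution of the symbol following each occurrence of suffix
    c = Counter(seq[i + len(suffix)]
                for i in range(len(seq) - len(suffix))
                if seq[i:i + len(suffix)] == suffix)
    n = sum(c.values())
    return {x: k / n for x, k in c.items()}

def close(d1, d2, tol=0.05):
    # crude stand-in for the hypothesis test
    return all(abs(d1.get(x, 0) - d2.get(x, 0)) < tol for x in "AB")

states = [next_dist("A")]         # the suffix "A" seeds the first state
for suffix in ("AB", "ABB"):      # extend one step further into the past
    d = next_dist(suffix)
    if not any(close(d, st) for st in states):
        states.append(d)          # split: a genuinely new distribution
assert len(states) == 2           # "AB" split off; "ABB" matched "A"'s state
```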

# Ensuring Recursive Transitions

• Need to determinize a probabilistic automaton
• Several ways of doing this, but a solved technical problem in automata theory
• Textbooks: Lewis and Papadimitriou (1998); Hopcroft and Ullman (1979)
• Brilliant but bizarre math approach: Conway (1971)
• Trickiest coding in the algorithm and can influence the finite-sample behavior

# Convergence

• $$\mathcal{S} =$$ true predictive state structure
• $$\widehat{\mathcal{S}}_n$$ = structure reconstructed from $$n$$ data points
• Assume: finitely many states, every state contains a finite history, long enough histories used, plus technicalities; then $\Prob{\widehat{\mathcal{S}}_n \neq \mathcal{S}} \rightarrow 0$
• $$\mathcal{D} =$$ true distribution, $$\widehat{\mathcal{D}}_n$$ = inferred; then $\Expect{{\|\widehat{\mathcal{D}}_n - \mathcal{D}\|}_{1}} = O(n^{-1/2})$
• Same order of convergence as for IID samples

# Hand-waving

• Empirical conditional distributions for histories converge
• ergodic theorem for Markov chains
• Histories in the same state become harder to accidentally separate
• Histories in different states become harder to confuse
• Each state’s predictive distribution converges $$O(n^{-1/2})$$
• CLT for Markov chains

# Example: The Even Process

reconstruction with $$\Lambda = 3$$, $$n=1000$$, $$\alpha = 0.005$$

N.B., CSSR did not know that there were 2 states, or how they were connected — it discovered this

# Some Uses

• Neural spike train analysis (Haslinger, Klinkner, and Shalizi 2010), fMRI analysis (Merriam, Genovese and Shalizi in prep.)
• Geomagnetic fluctuations (Clarke, Freeman, and Watkins 2003)
• Natural language processing (M. Padró and Padró 2005b, 2005a, 2005c, 2007a, 2007b)
• Anomaly detection (David S. Friedlander, Phoha, and Brooks 2003; David S. Friedlander et al. 2003; Ray 2004)
• Information sharing in networks (Klinkner, Shalizi, and Camperi 2006; Shalizi, Camperi, and Klinkner 2007)
• Social media propagation (Cointet, Faure, and Roth 2007)

# Summary

• Your stochastic process has a unique, minimal Markovian representation
• This representation has nice predictive properties
• Can reconstruct from sample data in some cases
• and a lot more could be done in this line

# Backup: Why Care About Sufficiency?

• The optimal strategy, under any loss function, needs only a sufficient statistic (Blackwell & Girshick)
• Strategies using insufficient statistics can generally be improved (Rao-Blackwell)
• Excuse for not worrying about particular loss functions

# Backup: A Cousin: The Information Bottleneck

(Tishby, Pereira, and Bialek 1999)

• For inputs $$X$$ and outputs $$Y$$, fix $$\beta > 0$$, find $$\eta(X)$$, the bottleneck variable, minimizing $I[\eta(X);X] - \beta I[\eta(X);Y]$
• willing to pay up to $$\beta$$ bits of memory for each bit of predictive information
• Predictive sufficiency comes as $$\beta \rightarrow \infty$$, unwilling to lose any predictive power
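The trade-off can be seen by brute force on a tiny joint distribution, restricting $$\eta$$ to deterministic clusterings; all the numbers here are my own illustrations, and the objective is written in "memory minus $$\beta \times$$ prediction" form, minimized, so that large $$\beta$$ refuses to lose predictive power:

```python
from math import log2

# X uniform on {0,1,2}; x=0 and x=1 are predictively identical, so the
# sufficient hard clustering merges them and keeps x=2 separate.
pX = [1/3, 1/3, 1/3]
pY1 = [0.9, 0.9, 0.1]   # P(Y=1 | X=x)

def mi(joint):
    # mutual information (bits) from a joint table given as nested lists
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(p * log2(p / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, p in enumerate(row) if p > 0)

def objective(labels, beta):
    k = max(labels) + 1
    jXE = [[pX[x] if labels[x] == e else 0.0 for e in range(k)] for x in range(3)]
    jEY = [[0.0, 0.0] for _ in range(k)]
    for x in range(3):
        jEY[labels[x]][0] += pX[x] * (1 - pY1[x])
        jEY[labels[x]][1] += pX[x] * pY1[x]
    return mi(jXE) - beta * mi(jEY)   # I[eta;X] - beta I[eta;Y]

# All hard clusterings of a 3-point set, in canonical labeling.
partitions = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 1, 2)]

def best(beta):
    return min(partitions, key=lambda L: objective(L, beta))

assert best(0.01) == (0, 0, 0)  # memory dear: collapse everything
assert best(100) == (0, 0, 1)   # prediction dear: the sufficient partition
```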

# Backup: Extension to Input-Output Systems

(Littman, Sutton, and Singh 2002; Shalizi 2001)

• System output $$(X_t)$$, input $$(Y_t)$$
• Histories $$x^t_{-\infty}, y^t_{-\infty}$$ have distributions of output $$x_{t+1}$$ for each further input $$y_{t+1}$$
• Form equivalence classes of these distributions and enforce recursive updating
• Internal states of the system, not trying to predict future inputs

# Backup: Extension to Spatiotemporal Systems

(Shalizi 2003; Shalizi, Klinkner, and Haslinger 2004; Shalizi et al. 2006; Jänicke et al. 2007; Goerg 2013, 2014; Goerg and Shalizi 2012, 2013; Montañez and Shalizi 2017)

• Dynamic random field $$X(\vec{r},t)$$
• Assume a finite maximum speed $$c$$ at which information can propagate
• Past cone: points in space-time which could matter to $$X(\vec{r},t)$$
• Future cone: points in space-time for which $$X(\vec{r},t)$$ could matter

• Equivalence-class past cone configurations by conditional distributions over future cones
• $$S(\vec{r},t)$$ is a Markov field
• Minimal sufficiency, recursive updating, etc., all go through

# Backup: Statistical Forecasting Complexity

• Statistical forecasting complexity is (Grassberger 1986; Crutchfield and Young 1989) $C \equiv I[\epsilon(X^t_{-\infty});X^t_{-\infty}]$
• $$=$$ amount of information about the past needed for optimal prediction
• $$=H[\epsilon(X^t_{-\infty})]$$ for predictive causal states
• $$=$$ expected algorithmic sophistication (Gács, Tromp, and Vitanyi 2001)
• $$=\log$$(period) for periodic processes
• $$=\log$$(geometric mean(recurrence time)) for stationary processes
• Property of the process, not our models
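For a periodic process the last two equalities can be checked directly, using Kac's lemma that a stationary state's mean recurrence time is $$1/\pi_s$$:

```python
from math import log2

# Sanity check for a period-p process: the p predictive states (phases)
# are visited uniformly, and each recurs every p steps (Kac: recurrence
# time = 1/pi_s), so H[state], log2(period), and the log geometric-mean
# recurrence time all agree.
p = 5
pi = [1 / p] * p
C = -sum(q * log2(q) for q in pi)              # H of the state distribution
log_gm = sum(q * log2(1 / q) for q in pi)      # pi-weighted log recurrence time
assert abs(C - log2(p)) < 1e-12
assert abs(log_gm - log2(p)) < 1e-12
```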

# References

Clarke, Richard W., Mervyn P. Freeman, and Nicholas W. Watkins. 2003. “Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism.” Physical Review E 67:0126203. http://arxiv.org/abs/cond-mat/0110228.

Cointet, Jean-Philippe, Emmanuel Faure, and Camille Roth. 2007. “Intertemporal Topic Correlations in Online Media.” In Proceedings of the International Conference on Weblogs and Social Media (ICWSM). Boulder, CO, USA. http://camille.roth.free.fr/travaux/cointetfaureroth-icwsm-cr4p.pdf.

Conway, J. H. 1971. Regular Algebra and Finite Machines. London: Chapman and Hall.

Crutchfield, James P., and Karl Young. 1989. “Inferring Statistical Complexity.” Physical Review Letters 63:105–8. http://www.santafe.edu/~cmg/compmech/pubs/ISCTitlePage.htm.

Friedlander, David S., Shashi Phoha, and Richard Brooks. 2003. “Determination of Vehicle Behavior Based on Distributed Sensor Network Data.” In Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, edited by Franklin T. Luk. Vol. 5205. Proceedings of the SPIE. Bellingham, WA: SPIE.

Friedlander, David S., Ishanu Chattopadhyay, Asok Ray, Shashi Phoha, and Noah Jacobson. 2003. “Anomaly Prediction in Mechanical System Using Symbolic Dynamics.” In Proceedings of the 2003 American Control Conference, Denver, CO, 4–6 June 2003.

Gács, Péter, John T. Tromp, and Paul M. B. Vitanyi. 2001. “Algorithmic Statistics.” IEEE Transactions on Information Theory 47:2443–63. http://arxiv.org/abs/math.PR/0006233.

Goerg, Georg M. 2013. LICORS: Light Cone Reconstruction of States — Predictive State Estimation from Spatio-Temporal Data. http://CRAN.R-project.org/package=LICORS.

———. 2014. LSC: Local Statistical Complexity — Automatic Pattern Discovery in Spatio-Temporal Data. http://CRAN.R-project.org/package=LSC.

Goerg, Georg M., and Cosma Rohilla Shalizi. 2012. “LICORS: Light Cone Reconstruction of States for Non-Parametric Forecasting of Spatio-Temporal Systems.” Statistics Department, CMU. http://arxiv.org/abs/1206.2398.

———. 2013. “Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction.” In Sixteenth International Conference on Artificial Intelligence and Statistics, edited by Carlos M. Carvalho and Pradeep Ravikumar, 289–97. http://arxiv.org/abs/1211.3760.

Grassberger, Peter. 1986. “Toward a Quantitative Theory of Self-Generated Complexity.” International Journal of Theoretical Physics 25:907–38.

Haslinger, Robert, Kristina Lisa Klinkner, and Cosma Rohilla Shalizi. 2010. “The Computational Structure of Spike Trains.” Neural Computation 22:121–57. https://doi.org/10.1162/neco.2009.12-07-678.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading: Addison-Wesley.

Iosifescu, Marius, and Serban Grigorescu. 1990. Dependence with Complete Connections and Its Applications. Cambridge, England: Cambridge University Press.

Jaeger, Herbert. 2000. “Observable Operator Models for Discrete Stochastic Time Series.” Neural Computation 12:1371–98. https://doi.org/10.1162/089976600300015411.

Jänicke, Heike, Alexander Wiebel, Gerik Scheuermann, and Wolfgang Kollmann. 2007. “Multifield Visualization Using Local Statistical Complexity.” IEEE Transactions on Visualization and Computer Graphics 13:1384–91. https://doi.org/10.1109/TVCG.2007.70615.

Klinkner, Kristina Lisa, Cosma Rohilla Shalizi, and Marcelo F. Camperi. 2006. “Measuring Shared Information and Coordinated Activity in Neuronal Networks.” In Advances in Neural Information Processing Systems 18 (NIPS 2005), edited by Yair Weiss, Bernhard Schölkopf, and John C. Platt, 667–74. Cambridge, Massachusetts: MIT Press. http://arxiv.org/abs/q-bio.NC/0506009.

Knight, Frank B. 1975. “A Predictive View of Continuous Time Processes.” Annals of Probability 3:573–96. http://projecteuclid.org/euclid.aop/1176996302.

Langford, John, Ruslan Salakhutdinov, and Tong Zhang. 2009. “Learning Nonlinear Dynamic Models.” In Proceedings of the 26th Annual International Conference on Machine Learning (ICML 2009), edited by Andrea Danyluk, Léon Bottou, and Michael Littman, 593–600. New York: Association for Computing Machinery. http://arxiv.org/abs/0905.3369.

Lewis, Harry R., and Christos H. Papadimitriou. 1998. Elements of the Theory of Computation. Second. Upper Saddle River, New Jersey: Prentice-Hall.

Littman, Michael L., Richard S. Sutton, and Satinder Singh. 2002. “Predictive Representations of State.” In Advances in Neural Information Processing Systems 14 (NIPS 2001), edited by Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, 1555–61. Cambridge, Massachusetts: MIT Press. http://papers.nips.cc/paper/1983-predictive-representations-of-state.

Montañez, George D., and Cosma Rohilla Shalizi. 2017. “The LICORS Cabinet: Nonparametric Algorithms for Spatio-Temporal Prediction.” In International Joint Conference on Neural Networks 2017 (IJCNN 2017), 2811–9. https://doi.org/10.1109/IJCNN.2017.7966203.

Onicescu, Octav, and Gheorghe Mihoc. 1935. “Sur Les Chaînes de Variables Statistiques.” Comptes Rendus de L’Académie Des Sciences de Paris 200:511–12.

Padró, Muntsa, and Lluís Padró. 2005a. “A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm.” Procesamiento Del Lenguaje Natural 35:319–26. http://www.lsi.upc.edu/~nlp/papers/2005/sepln05-pp.pdf.

———. 2005b. “Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition.” In Proceedings of the 5th International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP ’05). http://www.lsi.upc.edu/~nlp/papers/2005/fsmnlp05-pp.pdf.

———. 2005c. “Approaching Sequential NLP Tasks with an Automata Acquisition Algorithm.” In Proceedings of the International Conference on Recent Advances in NLP (RANLP ’05). http://www.lsi.upc.edu/~nlp/papers/2005/ranlp05-pp.pdf.

———. 2007a. “ME-CSSR: An Extension of CSSR Using Maximum Entropy Models.” In Proceedings of Finite State Methods for Natural Language Processing (FSMNLP) 2007. http://www.lsi.upc.edu/%7Enlp/papers/2007/fsmnlp07-pp.pdf.

———. 2007b. “Studying CSSR Algorithm Applicability on NLP Tasks.” Procesamiento Del Lenguaje Natural 39:89–96. http://www.lsi.upc.edu/%7Enlp/papers/2007/sepln07-pp.pdf.

Pfau, David, Nicholas Bartlett, and Frank Wood. 2010. “Probabilistic Deterministic Infinite Automata.” In Advances in Neural Information Processing Systems 23 (NIPS 2010), edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 1930–8. Cambridge, Massachusetts: MIT Press. http://books.nips.cc/papers/files/nips23/NIPS2010_1179.pdf.

Ray, Asok. 2004. “Symbolic Dynamic Analysis of Complex Systems for Anomaly Detection.” Signal Processing 84:1115–30.

Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.

———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Shalizi, Cosma Rohilla. 2001. “Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata.” PhD thesis, University of Wisconsin-Madison. http://bactra.org/thesis/.

———. 2003. “Optimal Nonlinear Prediction of Random Fields on Networks.” Discrete Mathematics and Theoretical Computer Science AB(DMCS):11–30. http://arxiv.org/abs/math.PR/0305160.

Shalizi, Cosma Rohilla, Marcelo F. Camperi, and Kristina Lisa Klinkner. 2007. “Discovering Functional Communities in Dynamical Networks.” In Statistical Network Analysis: Models, Issues, and New Directions: ICML 2006 Workshop on Statistical Network Analysis, Pittsburgh, PA, USA, June 2006: Revised Selected Papers, edited by Edo Airoldi, David M. Blei, Stephen E. Fienberg, Anna Goldenberg, Eric P. Xing, and Alice X. Zheng, 4503:140–57. Lecture Notes in Computer Science. New York: Springer-Verlag. http://arxiv.org/abs/q-bio.NC/0609008.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2001. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics 104:817–79. http://arxiv.org/abs/cond-mat/9907176.

Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina Lisa Klinkner, and Cristopher Moore. 2006. “Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” Physical Review E 73:036104. http://arxiv.org/abs/nlin.CG/0508001.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press. http://arxiv.org/abs/cs.LG/0406011.

Shalizi, Cosma Rohilla, Kristina Lisa Klinkner, and Robert Haslinger. 2004. “Quantifying Self-Organization with Optimal Predictors.” Physical Review Letters 93:118701. https://doi.org/10.1103/PhysRevLett.93.118701.

Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. “What Is a Macrostate? From Subjective Measurements to Objective Dynamics.” arxiv:cond-mat/0303625. http://arxiv.org/abs/cond-mat/0303625.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press. http://arxiv.org/abs/physics/0004057.