Information Theory and Optimal Prediction


20 November 2018

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\indep}{\perp} \]


Notation etc.

Making a Prediction


Joint and Conditional Entropy

Entropy Rate
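
For reference, a sketch of the standard definition for a stationary process, writing \(H[\cdot]\) for Shannon entropy (notation assumed here, not taken from the slides):

\[ h \equiv \lim_{n\to\infty} \frac{1}{n} H[X_1^n] = \lim_{n\to\infty} H[X_n \mid X_1^{n-1}] \]

Both limits exist and agree when the process is stationary, so \(h\) can be read as the uncertainty about the next symbol that remains however much history has been observed.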


Predictive Information

Predictive Sufficiency
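
One common way to state the idea, assuming the notation \(X_{-\infty}^{t}\) for the past, \(X_{t+1}^{\infty}\) for the future, and \(I[\cdot;\cdot]\) for mutual information (none of which is fixed by the slides as extracted): a statistic \(\epsilon\) of the past is predictively sufficient when

\[ I[X_{t+1}^{\infty}; \epsilon(X_{-\infty}^{t})] = I[X_{t+1}^{\infty}; X_{-\infty}^{t}] \]

When this mutual information is finite, the condition is equivalent to the future being independent of the past given the statistic, \(X_{t+1}^{\infty} \indep X_{-\infty}^{t} \mid \epsilon(X_{-\infty}^{t})\).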

Predictive States

(Crutchfield and Young 1989)

Set of histories, color-coded by conditional distribution of futures

Partitioning histories into predictive states


(Shalizi and Crutchfield 2001)

A non-sufficient partition of histories

Effect of insufficiency on predictive distributions

Markov Properties

Recursive Updating/Deterministic Transitions

Recursive Updating from Sufficiency

Predictive States are Markovian


Sufficient, but not minimal, partition of histories

Coarser than the predictive states, but not sufficient


Minimal Stochasticity

Entropy Rate

Minimal Markovian Representation

What Sort of Markov Model?



Example of a CCC: Even Process


How Broad Are These Results?

Knight (1975) gave the most general constructions

Connecting to Data


Problem: given the states and transitions (\(\epsilon, T\)) and a realization \(x_1^n\), estimate \(\Prob{X_{t+1}=x \mid S_t=s}\)
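
A minimal sketch of one way to attack this problem, assuming the transitions are available as a lookup table: run the state update along the realization and take relative frequencies of the next symbol in each state. The names `estimate_emissions`, `transition`, and `sync` are hypothetical, and the two-state table used below is the usual description of the Even Process, which is an assumption here rather than something spelled out on the slide.

```python
from collections import Counter, defaultdict

def estimate_emissions(xs, transition, sync):
    """Estimate P(X_{t+1}=x | S_t=s) by relative frequencies.

    transition: dict (state, symbol) -> next state, i.e. the recursive update T
    sync: dict symbol -> state that the symbol identifies unambiguously
          (a hypothetical device for finding a starting state)
    """
    counts = defaultdict(Counter)
    state = None
    for x in xs:
        if state is not None:
            counts[state][x] += 1                  # x was emitted from `state`
            state = transition.get((state, x))     # None if the move is not in T
        if state is None:
            state = sync.get(x)                    # (re)synchronize if possible
    return {
        s: {x: c / sum(cnt.values()) for x, c in cnt.items()}
        for s, cnt in counts.items()
    }

# Assumed Even Process structure: A emits 0 (stay) or 1 (go to B); B emits 1 (go to A).
T = {("A", 0): "A", ("A", 1): "B", ("B", 1): "A"}
SYNC = {0: "A"}  # seeing a 0 means the process is now in state A

xs = [0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0]
est = estimate_emissions(xs, T, SYNC)
print(est)
```

Because the transitions are deterministic given the state and the symbol, each observation is assigned to exactly one state, so the estimate reduces to counting.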


Problem: given only \(x_1^n\), estimate \(\epsilon\), \(T\), and \(\Prob{X_{t+1}=x \mid S_t=s}\)

CSSR: Causal State Splitting Reconstruction

One-Step Ahead Prediction

Ensuring Recursive Transitions



Example: The Even Process

Reconstruction with maximum history length \(\Lambda = 3\), sample size \(n = 1000\), and significance level \(\alpha = 0.005\)

N.B.: CSSR was not told that there were two states, or how they were connected; it discovered both from the data
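
To get a feel for the raw material such a reconstruction works from, the sketch below simulates the Even Process and tabulates the empirical next-symbol frequency after every observed history of length up to 3. This is not CSSR: CSSR groups histories using hypothesis tests at level \(\alpha\), whereas this only prints frequencies. The process definition (state A emits 0 or 1 with probability 1/2 each, state B always emits 1), the seed, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def simulate_even_process(n, seed=0):
    """Simulate n symbols; 1s between consecutive 0s come in even-length blocks."""
    rng = random.Random(seed)
    in_b = False  # True when an odd number of 1s has been emitted since the last 0
    out = []
    for _ in range(n):
        x = 1 if in_b else rng.randint(0, 1)
        in_b = (x == 1) and not in_b
        out.append(x)
    return out

def next_symbol_frequencies(xs, max_len):
    """Empirical P(X_{t+1}=1 | last k symbols) for each observed history, 1 <= k <= max_len."""
    counts = defaultdict(Counter)
    for t in range(1, len(xs)):
        for k in range(1, min(max_len, t) + 1):
            history = "".join(map(str, xs[t - k:t]))
            counts[history][xs[t]] += 1
    return {h: c[1] / (c[0] + c[1]) for h, c in counts.items()}

xs = simulate_even_process(1000)
freqs = next_symbol_frequencies(xs, max_len=3)
for h in sorted(freqs, key=lambda h: (len(h), h)):
    print(h, round(freqs[h], 3))
```

In this process the history `01` is always followed by a 1, and the history `010` never occurs, so those entries are exact rather than estimates. The remaining frequencies vary with the sample.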

Some Uses


Backup: Why Care About Sufficiency?

Backup: A Cousin: The Information Bottleneck

(Tishby, Pereira, and Bialek 1999)

Backup: Extension to Input-Output Systems

(Littman, Sutton, and Singh 2002; Shalizi 2001)

Backup: Extension to Spatiotemporal Systems

(Shalizi 2003; Shalizi, Klinkner, and Haslinger 2004; Shalizi et al. 2006; Jänicke et al. 2007; Goerg 2013, 2014; Goerg and Shalizi 2012, 2013; Montañez and Shalizi 2017)

Backup: Statistical Forecasting Complexity


Clarke, Richard W., Mervyn P. Freeman, and Nicholas W. Watkins. 2003. “Application of Computational Mechanics to the Analysis of Natural Data: An Example in Geomagnetism.” Physical Review E 67:016203.

Cointet, Jean-Philippe, Emmanuel Faure, and Camille Roth. 2007. “Intertemporal Topic Correlations in Online Media.” In Proceedings of the International Conference on Weblogs and Social Media [ICWSM]. Boulder, CO, USA.

Conway, J. H. 1971. Regular Algebra and Finite Machines. London: Chapman and Hall.

Crutchfield, James P., and Karl Young. 1989. “Inferring Statistical Complexity.” Physical Review Letters 63:105–8.

Friedlander, David S., Shashi Phoha, and Richard Brooks. 2003. “Determination of Vehicle Behavior Based on Distributed Sensor Network Data.” In Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, edited by Franklin T. Luk. Vol. 5205. Proceedings of the SPIE. Bellingham, WA: SPIE.

Friedlander, David S., Ishanu Chattopadhyay, Asok Ray, Shashi Phoha, and Noah Jacobson. 2003. “Anomaly Prediction in Mechanical System Using Symbolic Dynamics.” In Proceedings of the 2003 American Control Conference, Denver, CO, 4–6 June 2003.

Gács, Péter, John T. Tromp, and Paul M. B. Vitanyi. 2001. “Algorithmic Statistics.” IEEE Transactions on Information Theory 47:2443–63.

Goerg, Georg M. 2013. LICORS: Light Cone Reconstruction of States — Predictive State Estimation from Spatio-Temporal Data.

———. 2014. LSC: Local Statistical Complexity — Automatic Pattern Discovery in Spatio-Temporal Data.

Goerg, Georg M., and Cosma Rohilla Shalizi. 2012. “LICORS: Light Cone Reconstruction of States for Non-Parametric Forecasting of Spatio-Temporal Systems.” Statistics Department, CMU.

———. 2013. “Mixed LICORS: A Nonparametric Algorithm for Predictive State Reconstruction.” In Sixteenth International Conference on Artificial Intelligence and Statistics, edited by Carlos M. Carvalho and Pradeep Ravikumar, 289–97.

Grassberger, Peter. 1986. “Toward a Quantitative Theory of Self-Generated Complexity.” International Journal of Theoretical Physics 25:907–38.

Haslinger, Robert, Kristina Lisa Klinkner, and Cosma Rohilla Shalizi. 2010. “The Computational Structure of Spike Trains.” Neural Computation 22:121–57.

Hopcroft, John E., and Jeffrey D. Ullman. 1979. Introduction to Automata Theory, Languages, and Computation. Reading: Addison-Wesley.

Iosifescu, Marius, and Serban Grigorescu. 1990. Dependence with Complete Connections and Its Applications. Cambridge, England: Cambridge University Press.

Jaeger, Herbert. 2000. “Observable Operator Models for Discrete Stochastic Time Series.” Neural Computation 12:1371–98.

Jänicke, Heike, Alexander Wiebel, Gerik Scheuermann, and Wolfgang Kollmann. 2007. “Multifield Visualization Using Local Statistical Complexity.” IEEE Transactions on Visualization and Computer Graphics 13:1384–91.

Klinkner, Kristina Lisa, Cosma Rohilla Shalizi, and Marcelo F. Camperi. 2006. “Measuring Shared Information and Coordinated Activity in Neuronal Networks.” In Advances in Neural Information Processing Systems 18 (NIPS 2005), edited by Yair Weiss, Bernhard Schölkopf, and John C. Platt, 667–74. Cambridge, Massachusetts: MIT Press.

Knight, Frank B. 1975. “A Predictive View of Continuous Time Processes.” Annals of Probability 3:573–96.

Langford, John, Ruslan Salakhutdinov, and Tong Zhang. 2009. “Learning Nonlinear Dynamic Models.” In Proceedings of the 26th Annual International Conference on Machine Learning [ICML 2009], edited by Andrea Danyluk, Léon Bottou, and Michael Littman, 593–600. New York: Association for Computing Machinery.

Lewis, Harry R., and Christos H. Papadimitriou. 1998. Elements of the Theory of Computation. Second. Upper Saddle River, New Jersey: Prentice-Hall.

Littman, Michael L., Richard S. Sutton, and Satinder Singh. 2002. “Predictive Representations of State.” In Advances in Neural Information Processing Systems 14 (NIPS 2001), edited by Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, 1555–61. Cambridge, Massachusetts: MIT Press.

Montañez, George D., and Cosma Rohilla Shalizi. 2017. “The LICORS Cabinet: Nonparametric Algorithms for Spatio-Temporal Prediction.” In International Joint Conference on Neural Networks 2017 [IJCNN 2017], 2811–9.

Onicescu, Octav, and Gheorghe Mihoc. 1935. “Sur Les Chaînes de Variables Statistiques.” Comptes Rendus de L’Académie Des Sciences de Paris 200:511–12.

Padró, Muntsa, and Lluís Padró. 2005a. “A Named Entity Recognition System Based on a Finite Automata Acquisition Algorithm.” Procesamiento Del Lenguaje Natural 35:319–26.

———. 2005b. “Applying a Finite Automata Acquisition Algorithm to Named Entity Recognition.” In Proceedings of 5th International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP’05).

———. 2005c. “Approaching Sequential NLP Tasks with an Automata Acquisition Algorithm.” In Proceedings of International Conference on Recent Advances in NLP (RANLP’05).

———. 2007a. “ME-CSSR: An Extension of CSSR Using Maximum Entropy Models.” In Proceedings of Finite State Methods for Natural Language Processing (FSMNLP) 2007.

———. 2007b. “Studying CSSR Algorithm Applicability on NLP Tasks.” Procesamiento Del Lenguaje Natural 39:89–96.

Pfau, David, Nicholas Bartlett, and Frank Wood. 2010. “Probabilistic Deterministic Infinite Automata.” In Advances in Neural Information Processing Systems 23 [NIPS 2010], edited by John Lafferty, C. K. I. Williams, John Shawe-Taylor, Richard S. Zemel, and A. Culotta, 1930–8. Cambridge, Massachusetts: MIT Press.

Ray, Asok. 2004. “Symbolic Dynamic Analysis of Complex Systems for Anomaly Detection.” Signal Processing 84:1115–30.

Salmon, Wesley C. 1971. Statistical Explanation and Statistical Relevance. Pittsburgh: University of Pittsburgh Press.

———. 1984. Scientific Explanation and the Causal Structure of the World. Princeton: Princeton University Press.

Shalizi, Cosma Rohilla. 2001. “Causal Architecture, Complexity and Self-Organization in Time Series and Cellular Automata.” PhD thesis, University of Wisconsin-Madison.

———. 2003. “Optimal Nonlinear Prediction of Random Fields on Networks.” Discrete Mathematics and Theoretical Computer Science AB(DMCS):11–30.

Shalizi, Cosma Rohilla, Marcelo F. Camperi, and Kristina Lisa Klinkner. 2007. “Discovering Functional Communities in Dynamical Networks.” In Statistical Network Analysis: Models, Issues, and New Directions: ICML 2006 Workshop on Statistical Network Analysis, Pittsburgh, PA, USA, June 2006: Revised Selected Papers, edited by Edo Airoldi, David M. Blei, Stephen E. Fienberg, Anna Goldenberg, Eric P. Xing, and Alice X. Zheng, 4503:140–57. Lecture Notes in Computer Science. New York: Springer-Verlag.

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2001. “Computational Mechanics: Pattern and Prediction, Structure and Simplicity.” Journal of Statistical Physics 104:817–79.

Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina Lisa Klinkner, and Cristopher Moore. 2006. “Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems.” Physical Review E 73:036104.

Shalizi, Cosma Rohilla, and Kristina Lisa Klinkner. 2004. “Blind Construction of Optimal Nonlinear Recursive Predictors for Discrete Sequences.” In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 2004), edited by Max Chickering and Joseph Y. Halpern, 504–11. Arlington, Virginia: AUAI Press.

Shalizi, Cosma Rohilla, Kristina Lisa Klinkner, and Robert Haslinger. 2004. “Quantifying Self-Organization with Optimal Predictors.” Physical Review Letters 93:118701.

Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. “What Is a Macrostate? From Subjective Measurements to Objective Dynamics.” arXiv:cond-mat/0303625.

Tishby, Naftali, Fernando C. Pereira, and William Bialek. 1999. “The Information Bottleneck Method.” In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, edited by B. Hajek and R. S. Sreenivas, 368–77. Urbana, Illinois: University of Illinois Press.