\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\SampleVar}[1]{\widehat{\mathrm{Var}}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\det}{det} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \newcommand{\Indicator}[1]{\mathbb{I}\left( #1 \right)} \newcommand{\se}[1]{\mathrm{se}\left[ #1 \right]} \newcommand{\CrossEntropy}{\ell} \newcommand{\xmin}{x_{\mathrm{min}}} \]

Agenda

Spatial narratives
Reminders: inference with IID data
Generalization to dependent, heterogeneous data

Spatial narratives

Drs. Jessica Benner and Emma Slayton, CMU Library

Spatial Services at University Libraries
Spatial Storytelling series
- Kickoff event 5–8 pm on November 1 in the University Center (with food!)
- Competition and Expo 3–5 pm on April 18 (with cash prizes!)
Story Maps from the Library of Congress
- Camera and Locomotive: Tracks Across the Continent
- Behind Barbed Wire: Japanese-American Internment Camp Newspapers

In our previous episodes

Many plausible ideas about how to estimate with dependent data
Some models which generate dependent data
How do we know that ideas work (in those models or elsewhere)?
Strategy: go back to inference with independent data, and see what we can abstract

Until further notice…

Assume \(X_1, X_2, \ldots X_n\) are independent and identically distributed (IID)
- Each \(X_i\) might have multiple dimensions, e.g., \(X_i = (Y_i, Z_i)\)
We don’t know the common distribution of the \(X\)’s
That distribution might be parametric with unknown parameter(s) \(\theta\), and pdf \(f(x;\theta)\)
- E.g., Gaussian, \(f(x;\theta) = \frac{1}{\sqrt{2\pi \theta_2}} e^{-(x-\theta_1)^2/2\theta_2}\)
- E.g., Pareto/power-law, \(f(x;\theta) = \frac{\theta - 1}{\xmin} {\left(\frac{x}{\xmin}\right)}^{-\theta}\) for \(x \geq \xmin\)
- The Pareto is a common model for “heavy tailed” data (Clauset, Shalizi, and Newman 2009)
Or it might be nonparametric, meaning we don’t assume any parametric form

What we want to infer

Parameters, in a parametric model
Moments
Quantiles
More complex things — optimal slope of \(Y_i\) on \(Z_i\)
In general, some function \(\psi(p)\) of the true pdf \(p\)

Some standard estimates work because of the law of large numbers

Obvious example: Use \(\overline{X}\) as an estimate of \(\Expect{X}\)
Less obvious: \(\frac{n}{n+\nu}\overline{X}\) also works for any fixed \(\nu > 0\)

What do we mean by “works”?

Consistency: \(\hat{\psi}_n \rightarrow \psi\) as \(n\rightarrow\infty\)
Also nice to know:
- Bias: \(\Expect{\hat{\psi}_n} - \psi\) (should \(\rightarrow 0\))
- Variance: \(\Var{\hat{\psi}_n}\) (also should \(\rightarrow 0\))
- Sampling distribution of \(\hat{\psi}_n\) (to get confidence sets)

The basic ingredient for consistency is the law of large numbers

Sample mean \(\overline{X}_n = n^{-1}\sum_{i=1}^{n}{X_i}\)
Assume the \(X_i\) all have mean \(\Expect{X}\), variance \(\Var{X}\), and are uncorrelated
Exercise: Prove that \[ \Expect{(\overline{X}_n-\Expect{X})^2} \rightarrow 0 ~\text{as} ~ n\rightarrow\infty \]

Hint: \(\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}\), for any variable \(Z\)

Solution to the exercise

\[\begin{eqnarray} \Expect{\overline{X}_n} & = & \Expect{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{n\Expect{X_1}}{n} = \Expect{X}\\ \Var{\overline{X}_n} & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{\sum_{i=1}^{n}{\Var{X_i}}}{n^2}\\ & = & \frac{n \Var{X}}{n^2} = \frac{\Var{X}}{n} \rightarrow 0\\ \therefore \Expect{(\overline{X}_n-\Expect{X})^2} & = & \left(\Expect{\overline{X}_n}-\Expect{X}\right)^2 + \Var{\overline{X}_n}\\ & = & 0 + n^{-1}\Var{X} \rightarrow 0 \end{eqnarray}\]

Law of large numbers in general

If the \(X_i\) are IID, then so are \(h(X_i)\), for any function \(h\), so \[ \overline{h(X)_n} \rightarrow \Expect{h(X)} \]

This is part of why maximum likelihood works

The normalized log-likelihood is a (random) function of \(\theta\): \[ L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}} \]
Apply the LLN: for each \(\theta\), \[ L_n(\theta) \rightarrow \Expect{\log{f(X;\theta)}} \equiv \CrossEntropy(\theta) \]
Fact: for any pdfs \(f\), \(g\), \[ \int{f(x) \log{f(x)} dx} \geq \int{f(x) \log{g(x)} dx} \]
- a.k.a. “Gibbs inequality”
- (similarly for pmfs)
So \(\CrossEntropy(\theta)\) is maximized at the true \(\theta\), \(\theta_0\): \[ \int{f(x;\theta_0) \log{f(x;\theta_0)} dx} \geq \int{f(x;\theta_0) \log{f(x;\theta)} dx} \]

Maximum likelihood can work even without moments

For the Pareto distribution, \(\Expect{X^k} = \infty\) if \(\theta \leq k\)
- So \(\Var{X} = \infty\) if \(\theta \leq 2\)
But \(\log{f}\) is well-behaved: \[ \log{f(x;\theta)} = \log{(\theta -1)} - (\theta-1)\log{\xmin} - \theta \log{x} \]
- Exercise (off-line): Find \(\Expect{\log{X}}\) and \(\Var{\log{X}}\) in terms of \(\theta\) and \(\xmin\)
What does the normalized log-likelihood function look like?

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^3\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution (\(n=10^5\), \(\theta=1.5\), \(\xmin=1\))

Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for the Pareto distribution, showing convergence as \(n\rightarrow\infty\) along a single IID sequence

The more general pattern

Assume (1): \(\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}\) for some random functions \(M_n\)
- If you maximize instead, minimize the negative
Assume (2): \(M_n(\psi) \rightarrow m(\psi)\) as \(n\rightarrow \infty\)
Assume (3): \(m(\psi)\) has a unique minimum at the true \(\psi_0\)
Then, in general, \(\hat{\psi}_n \rightarrow \psi_0\)

(some disclaimers apply)

This applies pretty broadly

Estimating the expectation with the sample mean: \[\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(X_i - \psi)^2}\\ m(\psi) & = & \Expect{(X-\psi)^2} \end{eqnarray}\]
Estimating a simple linear regression: \[\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \psi_1 - \psi_2 Z_i)^2}\\ m(\psi) & = & \Expect{(Y-\psi_1 - \psi_2 Z)^2} \end{eqnarray}\]
Nonlinear least squares, with regression function \(\TrueRegFunc(z;\psi)\) \[\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \TrueRegFunc(Z_i;\psi))^2}\\ m(\psi) & = & \Expect{(Y-\TrueRegFunc(Z;\psi))^2} \end{eqnarray}\]
Estimating the \(\alpha\) quantile (Koenker and Hallock 2001): \[\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{X_i(\alpha - \mathbf{I}(X_i < 0))}\\ m(\psi) & = & \Expect{X(\alpha - \mathbf{I}(X < 0))} \end{eqnarray}\]
- Extends naturally to estimating conditional quantiles

What about estimation error?

If we’re using a mean, we can use the standard error of the mean: \[ \Var{\overline{X}_n} = \frac{1}{n}\Var{X} \Rightarrow \se{\overline{X}_n} = \sqrt{\frac{\Var{X}}{n}} \]
If \(\hat{\psi}_n = h(A_n,B_n)\) and we know \(\Var{A}_n\), \(\Var{B}_n\), use propagation of error (a.k.a. the delta method):

\[\begin{eqnarray} h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\ \hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\ \Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n} + 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n} \end{eqnarray}\]

What if it’s some weird optimization problem?

Estimating by optimizing

Assume:
- \(\hat{\psi}_n\) minimizes some \(M_n(\psi)\)
- \(\psi_0\) minimizes \(m(\psi)\) (uniquely)
- \(M_n(\psi) \rightarrow m(\psi)\) (everywhere)
- \(\psi\) is one-dimensional (for now)
\[\begin{eqnarray} 0 & = & \frac{dM_n}{d\psi}(\hat{\psi}_n) ~ \text{(optimum)}\\ 0 & \approx & \frac{dM_n}{d\psi}(\psi_0) + (\hat{\psi}_n - \psi_0)\frac{d^2 M_n}{d\psi^2}(\psi_0) ~ \text{(Taylor expansion)}\\ ( \hat{\psi}_n - \psi_0) \frac{d^2 M_n}{d\psi^2}(\psi_0) & \approx & -\frac{dM_n}{d\psi}(\psi_0)\\ \hat{\psi}_n - \psi_0 & \approx & -\frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}\\ \hat{\psi}_n & \approx & \psi_0 - \frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)} \end{eqnarray}\]

Estimating by optimizing

Still assuming \(\psi\) is one-dimensional, write \(M_n^{\prime}\) and \(M_n^{\prime\prime}\) for the derivatives \[ \hat{\psi}_n \approx \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}} \] As \(n\rightarrow \infty\), \[\begin{eqnarray} M_n^{\prime} & \rightarrow & m^{\prime}(\psi_0) = 0 ~ \text{(optimum)}\\ M_n^{\prime^\prime} & \rightarrow & m^{\prime\prime}(\psi_0) > 0 ~ \text{(optimum)} \end{eqnarray}\] so \[ \hat{\psi}_n \rightarrow \psi_0 \]

Estimating by optimizing

(still in 1D)

What about variance? \[\begin{eqnarray} \hat{\psi}_n & \approx & \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}}\\ \Var{\hat{\psi}_n} & \approx & \Var{\frac{M_n^{\prime}}{M_n^{\prime\prime}}}\\ & = & \frac{\Var{M_n^{\prime}}}{(m^{\prime\prime})^2}\\ \se{\hat{\psi}_n} & \approx & \frac{\sqrt{\Var{M^{\prime}_n}}}{m^{\prime\prime}} \end{eqnarray}\]
Variance of \(\hat{\psi}_n\) goes down as the curvature goes up
Variance of \(\hat{\psi}_n\) goes up with the variance in \(M_n\)
- If \(M_n\) is an average, with \(\Var{M_n} = O(1/n)\), so is \(M^{\prime}_n\), and \(\Var{M_n^{\prime}} = O(1/n)\)
- Then \(\se{\hat{\psi}_n} = O(1/\sqrt{n})\)
What about the distribution?
- If \(M_n \rightsquigarrow \mathcal{N}\), because of the CLT, then
- \(M_n^{\prime} \rightsquigarrow \mathcal{N}\) (usually)

Estimating by optimizing

More general case: \(\psi\) is a vector \[\begin{eqnarray} \hat{\psi}_n & = & \argmin_{\psi}{M_n(\psi)}\\ 0 & = & \nabla M_n(\hat{\psi}_n)\\ 0 & \approx & \nabla M_n(\psi_0) + (\hat{\psi}_n-\psi_0) \nabla \nabla M_n(\psi_0)\\ \nabla \nabla M_n(\psi_0) & \equiv & \mathbf{H}_n(\psi_0) \rightarrow \mathbf{h}\\ \hat{\psi}_n & \approx & \psi_0 - \mathbf{h}^{-1} \nabla M_n(\psi_0)\\ \Var{\hat{\psi}_n } & \approx & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1} \end{eqnarray}\]
- \(\mathbf{h} = \nabla \nabla m(\psi_0)\) Hessian of \(m\) at \(\psi_0\)
Again, if \(\nabla M_n(\psi_0)\) is an average, CLT suggests \(\rightsquigarrow \mathcal{N}\)

In practice…

Approximate \(\mathbf{h}\) by \(\nabla \nabla M_n(\hat{\psi}_n)\), say \(\mathbf{H}_n\)
Approximate \(\Var{\nabla M_n(\psi_0)}\) by sample covariance of the derivative of \(M_n\), say \(\mathbf{J}_n\)
Sandwich covariance matrix is \(\mathbf{H}_n^{-1}\mathbf{J}_n \mathbf{H}_n^{-1}\)

Special application to maximum likelihood

\(\nabla \nabla \CrossEntropy(\theta) \equiv \mathbf{i}(\theta)\), the Fisher information
Fisher identity: \(\Var{\nabla L_n(\theta)} = \frac{1}{n} \mathbf{i}(\theta)\), if the model is right
So \(\Var{\hat{\theta}_n} = \frac{1}{n} \mathbf{i}^{-1}(\theta_0)\), if the model is right
Exercise (offline):
- Find the Fisher information for the Pareto

Generalizing away from IID data

Estimate by optimizing some objective function \(M_n\) \[ \hat{\psi}_n = \argmin_{\psi}{M_n(\psi)} \]
Hope the objective function tends to a limit function \(m\) \[ M_n(\psi) \rightarrow m(\psi) \]
Hope that the limiting function has its minimum in the right place \[ \argmin_{\psi}{m(\psi)} = \psi_0 \]
Then the exact same arguments apply: \[\begin{eqnarray} \hat{\psi}_n & \rightarrow & \psi_0\\ \Var{\hat{\psi}_n} & \rightarrow & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1} \end{eqnarray}\]
We need to make sure our objective functions converge
We’re going to need a law of large numbers for dependent data

Summary

We usually estimate by minimizing an objective function
If the objective function has a well-behaved limit, the estimate will converge
The variance of the estimate depends on:
- The curvature of the objective function (\(\uparrow\) curvature \(\downarrow\) variance)
- The variance of the derivative of the objective function
None of these ingredients actually requires independence
Once we have convergence of our objective function, everything will fall in to place

Backup: Modes of convergence

Consistency is usually defined as “convergence in probability”: For any \(\epsilon > 0\), \[ \Prob{|\hat{\psi}_n - \psi| \geq \epsilon} \rightarrow 0 \]
The main slides use \(L_2\) convergence, \[ \Expect{(\hat{\psi}_n - \psi)^2} \rightarrow 0 \]
- \(L_2\) convergence implies convergence in probability by Chebyshev’s inequality but not vice versa
There is also “strong convergence” or “almost sure convergence”: \[ \Prob{\hat{\psi}_n \rightarrow \psi} = 1 \]
- A.S.-convergence implies convergence in probability but not vice versa

Backup: Chebyshev’s inequality

For any random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2} \]

Proof: Apply Markov’s inequality to \((Z-\Expect{Z})^2\), which is \(\geq 0\) and has expectation \(\Var{Z}\).

Backup: Markov’s inequality

For any non-negative random variable \(Z\), and any \(\epsilon > 0\),

\[ \Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon} \]

Proof: \[\begin{eqnarray} Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\ \Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\ & \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\ & \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\ & = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon} \end{eqnarray}\]

Backup: Disclaimers to the “More general pattern”

Need to assume (2’) that \(M_n(\psi) \rightarrow m(\psi)\) uniformly over \(\psi\)
- Or at least (2’’) uniformly over a region where \(\hat{\psi}_n\) will concentrate
Need to assume (3’) that \(m(\psi)\) has a unique minimum at \(\psi_0\), and that this minimum is “well-separated”
- Roughly: \(m(\psi) - m(\psi_0)\) can be very small only if \(\psi\) is close to \(\psi_0\)

Backup: The Gibbs inequality

\[\begin{eqnarray} \int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\ & = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\ & = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\ & \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = \log{1} = 0 \end{eqnarray}\]

where the last line uses Jensen’s inequality

Backup: Jensen’s inequality

A function \(h\) is convex when, for any \(w \in [0,1]\), \[ w h(x_1) + (1-w) h(x_2) \geq h(wx_1 + (1-w) x_2) \]
- i.e., for any two points on the curve of \(h(x)\), the curve lies below the straight line connecting the points
The function \(-\log{x}\) is convex (draw it!)
Jensen’s inequality: for any convex function \(h\), \(\Expect{h(X)} \geq h(\Expect{X})\), under any distribution of \(X\)
- basically follows from the definition of convexity
- Average of points on the curve vs. the curve at the average value
Extension: for any function \(q\) (convex or not), \(\Expect{h(q(X))} \geq h(\Expect{q(X)})\)

References

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.

Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15:143–56. https://doi.org/10.1257/jep.15.4.143.

Inference I — Inference with Independent Data

Agenda

Spatial narratives

In our previous episodes

Until further notice…

What we want to infer

Some standard estimates work because of the law of large numbers

What do we mean by “works”?

The basic ingredient for consistency is the law of large numbers

Solution to the exercise

Law of large numbers in general

This is part of why maximum likelihood works

Maximum likelihood can work even without moments

Convergence of the log-likelihood function (an example)

Convergence of the log-likelihood function (an example)

Convergence of the log-likelihood function (an example)

Convergence of the log-likelihood function (an example)

The more general pattern

This applies pretty broadly

What about estimation error?

Estimating by optimizing

Estimating by optimizing

Estimating by optimizing

Estimating by optimizing

In practice…

Special application to maximum likelihood

Generalizing away from IID data

Summary

Backup: Modes of convergence

Backup: Chebyshev’s inequality

Backup: Markov’s inequality

Backup: Disclaimers to the “More general pattern”

Backup: The Gibbs inequality

Backup: Jensen’s inequality

References