# Inference I — Inference with Independent Data

23 October 2018


# Agenda

• Spatial narratives
• Reminders: inference with IID data
• Generalization to dependent, heterogeneous data

# Spatial narratives

Drs. Jessica Benner and Emma Slayton, CMU Library

# In our previous episodes

• Many plausible ideas about how to estimate with dependent data
• Some models which generate dependent data
• How do we know that ideas work (in those models or elsewhere)?
• Strategy: go back to inference with independent data, and see what we can abstract

# Until further notice…

• Assume $$X_1, X_2, \ldots X_n$$ are independent and identically distributed (IID)
• Each $$X_i$$ might have multiple dimensions, e.g., $$X_i = (Y_i, Z_i)$$
• We don’t know the common distribution of the $$X$$’s
• That distribution might be parametric with unknown parameter(s) $$\theta$$, and pdf \$f(x;)
• E.g., Gaussian, $$f(x;\theta) = \frac{1}{\sqrt{2\pi \theta_2}} e^{-(x-\theta_1)^2/2\theta_2}$$
• E.g., Pareto/power-law, $$f(x;\theta) = \frac{\theta - 1}{\xmin} {\left(\frac{x}{\xmin}\right)}^{-\theta}$$ for $$x \geq \xmin$$
• The Pareto is a common model for “heavy tailed” data (Clauset, Shalizi, and Newman 2009)
• Or it might be nonparametric, meaning we don’t assume any parametric form

# What we want to infer

• Parameters, in a parametric model
• Moments
• Quantiles
• More complex things — optimal slope of $$Y_i$$ on $$Z_i$$
• In general, some function $$\psi(p)$$ of the true pdf $$p$$

# Some standard estimates work because of the law of large numbers

• Obvious example: Use $$\overline{X}$$ as an estimate of $$\Expect{X}$$
• Less obvious: $$\frac{n}{n+\nu}\overline{X}$$ also works for any fixed $$\nu > 0$$

# What do we mean by “works”?

• Consistency: $$\hat{\psi}_n \rightarrow \psi$$ as $$n\rightarrow\infty$$
• Also nice to know:
• Bias: $$\Expect{\hat{\psi}_n} - \psi$$ (should $$\rightarrow 0$$)
• Variance: $$\Var{\hat{\psi}_n}$$ (also should $$\rightarrow 0$$)
• Sampling distribution of $$\hat{\psi}_n$$ (to get confidence sets)

# The basic ingredient for consistency is the law of large numbers

• Sample mean $$\overline{X}_n = n^{-1}\sum_{i=1}^{n}{X_i}$$
• Assume the $$X_i$$ all have mean $$\Expect{X}$$, variance $$\Var{X}$$, and are uncorrelated
• Exercise: Prove that $\Expect{(\overline{X}_n-\Expect{X})^2} \rightarrow 0 ~\text{as} ~ n\rightarrow\infty$

Hint: $$\Expect{Z^2} = (\Expect{Z})^2 + \Var{Z}$$, for any variable $$Z$$

# Solution to the exercise

$\begin{eqnarray} \Expect{\overline{X}_n} & = & \Expect{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{n\Expect{X_1}}{n} = \Expect{X}\\ \Var{\overline{X}_n} & = & \Var{\frac{1}{n}\sum_{i=1}^{n}{X_i}}\\ & = & \frac{\sum_{i=1}^{n}{\Var{X_i}}}{n^2}\\ & = & \frac{n \Var{X}}{n^2} = \frac{\Var{X}}{n} \rightarrow 0\\ \therefore \Expect{(\overline{X}_n-\Expect{X})^2} & = & \left(\Expect{\overline{X}_n}-\Expect{X}\right)^2 + \Var{\overline{X}_n}\\ & = & 0 + n^{-1}\Var{X} \rightarrow 0 \end{eqnarray}$

# Law of large numbers in general

• If the $$X_i$$ are IID, then so are $$h(X_i)$$, for any function $$h$$, so $\overline{h(X)_n} \rightarrow \Expect{h(X)}$

# This is part of why maximum likelihood works

• The normalized log-likelihood is a (random) function of $$\theta$$: $L_n(\theta) = \frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}$
• Apply the LLN: for each $$\theta$$, $L_n(\theta) \rightarrow \Expect{\log{f(X;\theta)}} \equiv \CrossEntropy(\theta)$
• Fact: for any pdfs $$f$$, $$g$$, $\int{f(x) \log{f(x)} dx} \geq \int{f(x) \log{g(x)} dx}$
• a.k.a. “Gibbs inequality”
• (similarly for pmfs)
• So $$\CrossEntropy(\theta)$$ is maximized at the true $$\theta$$, $$\theta_0$$: $\int{f(x;\theta_0) \log{f(x;\theta_0)} dx} \geq \int{f(x;\theta_0) \log{f(x;\theta)} dx}$

# Maximum likelihood can work even without moments

• For the Pareto distribution, $$\Expect{X^k} = \infty$$ if $$\theta \leq k$$
• So $$\Var{X} = \infty$$ if $$\theta \leq 2$$
• But $$\log{f}$$ is well-behaved: $\log{f(x;\theta)} = \log{(\theta -1)} - (\theta-1)\log{\xmin} - \theta \log{x}$
• Exercise (off-line): Find $$\Expect{\log{X}}$$ and $$\Var{\log{X}}$$ in terms of $$\theta$$ and $$\xmin$$
• What does the normalized log-likelihood function look like?

# Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution ($$n=10$$, $$\theta=1.5$$, $$\xmin=1$$)

# Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution ($$n=10^3$$, $$\theta=1.5$$, $$\xmin=1$$)

# Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for 10 different IID samples from the Pareto distribution ($$n=10^5$$, $$\theta=1.5$$, $$\xmin=1$$)

# Convergence of the log-likelihood function (an example)

Normalized log-likelihoods for the Pareto distribution, showing convergence as $$n\rightarrow\infty$$ along a single IID sequence

# The more general pattern

• Assume (1): $$\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}$$ for some random functions $$M_n$$
• If you maximize instead, minimize the negative
• Assume (2): $$M_n(\psi) \rightarrow m(\psi)$$ as $$n\rightarrow \infty$$
• Assume (3): $$m(\psi)$$ has a unique minimum at the true $$\psi_0$$
• Then, in general, $$\hat{\psi}_n \rightarrow \psi_0$$

(some disclaimers apply)

• Estimating the expectation with the sample mean: $\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(X_i - \psi)^2}\\ m(\psi) & = & \Expect{(X-\psi)^2} \end{eqnarray}$
• Estimating a simple linear regression: $\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \psi_1 - \psi_2 Z_i)^2}\\ m(\psi) & = & \Expect{(Y-\psi_1 - \psi_2 Z)^2} \end{eqnarray}$
• Nonlinear least squares, with regression function $$\TrueRegFunc(z;\psi)$$ $\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{(Y_i - \TrueRegFunc(Z_i;\psi))^2}\\ m(\psi) & = & \Expect{(Y-\TrueRegFunc(Z;\psi))^2} \end{eqnarray}$
• Estimating the $$\alpha$$ quantile (Koenker and Hallock 2001): $\begin{eqnarray} M_n(\psi) & = & \frac{1}{n}\sum_{i=1}^{n}{X_i(\alpha - \mathbf{I}(X_i < 0))}\\ m(\psi) & = & \Expect{X(\alpha - \mathbf{I}(X < 0))} \end{eqnarray}$
• Extends naturally to estimating conditional quantiles

• If we’re using a mean, we can use the standard error of the mean: $\Var{\overline{X}_n} = \frac{1}{n}\Var{X} \Rightarrow \se{\overline{X}_n} = \sqrt{\frac{\Var{X}}{n}}$
• If $$\hat{\psi}_n = h(A_n,B_n)$$ and we know $$\Var{A}_n$$, $$\Var{B}_n$$, use propagation of error (a.k.a. the delta method):
$\begin{eqnarray} h(\Expect{A_n}, \Expect{B_n}) & \approx & h(A_n, B_n) + (\Expect{A_n} - A_n)\frac{\partial h}{\partial a} + (\Expect{B_n}-B_n)\frac{\partial h}{\partial b} ~ \text{(Taylor series)}\\ \hat{\psi}_n = h(A_n, B_n) & \approx & h(\Expect{A_n}, \Expect{B_n}) + (A_n - \Expect{A_n})\frac{\partial h}{\partial a} + (B_n - \Expect{B_n})\frac{\partial h}{\partial b}\\ \Var{\hat{\psi}_n} & \approx & {\left(\frac{\partial h}{\partial a}\right)}^2\Var{A_n} +{\left(\frac{\partial h}{\partial b}\right)}^2\Var{B_n} + 2\left(\frac{\partial h}{\partial a}\frac{\partial h}{\partial b}\right)\Cov{A_n, B_n} \end{eqnarray}$
• What if it’s some weird optimization problem?

# Estimating by optimizing

• Assume:
• $$\hat{\psi}_n$$ minimizes some $$M_n(\psi)$$
• $$\psi_0$$ minimizes $$m(\psi)$$ (uniquely)
• $$M_n(\psi) \rightarrow m(\psi)$$ (everywhere)
• $$\psi$$ is one-dimensional (for now)
• $\begin{eqnarray} 0 & = & \frac{dM_n}{d\psi}(\hat{\psi}_n) ~ \text{(optimum)}\\ 0 & \approx & \frac{dM_n}{d\psi}(\psi_0) + (\hat{\psi}_n - \psi_0)\frac{d^2 M_n}{d\psi^2}(\psi_0) ~ \text{(Taylor expansion)}\\ ( \hat{\psi}_n - \psi_0) \frac{d^2 M_n}{d\psi^2}(\psi_0) & \approx & -\frac{dM_n}{d\psi}(\psi_0)\\ \hat{\psi}_n - \psi_0 & \approx & -\frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)}\\ \hat{\psi}_n & \approx & \psi_0 - \frac{\frac{dM_n}{d\psi}(\psi_0)}{\frac{d^2 M_n}{d\psi^2}(\psi_0)} \end{eqnarray}$

# Estimating by optimizing

• Still assuming $$\psi$$ is one-dimensional, write $$M_n^{\prime}$$ and $$M_n^{\prime\prime}$$ for the derivatives $\hat{\psi}_n \approx \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}}$ As $$n\rightarrow \infty$$, $\begin{eqnarray} M_n^{\prime} & \rightarrow & m^{\prime}(\psi_0) = 0 ~ \text{(optimum)}\\ M_n^{\prime^\prime} & \rightarrow & m^{\prime\prime}(\psi_0) > 0 ~ \text{(optimum)} \end{eqnarray}$ so $\hat{\psi}_n \rightarrow \psi_0$

# Estimating by optimizing

(still in 1D)

• What about variance? $\begin{eqnarray} \hat{\psi}_n & \approx & \psi_0 - \frac{M_n^{\prime}}{M_n^{\prime\prime}}\\ \Var{\hat{\psi}_n} & \approx & \Var{\frac{M_n^{\prime}}{M_n^{\prime\prime}}}\\ & = & \frac{\Var{M_n^{\prime}}}{(m^{\prime\prime})^2}\\ \se{\hat{\psi}_n} & \approx & \frac{\sqrt{\Var{M^{\prime}_n}}}{m^{\prime\prime}} \end{eqnarray}$
• Variance of $$\hat{\psi}_n$$ goes down as the curvature goes up
• Variance of $$\hat{\psi}_n$$ goes up with the variance in $$M_n$$
• If $$M_n$$ is an average, with $$\Var{M_n} = O(1/n)$$, so is $$M^{\prime}_n$$, and $$\Var{M_n^{\prime}} = O(1/n)$$
• Then $$\se{\hat{\psi}_n} = O(1/\sqrt{n})$$
• If $$M_n \rightsquigarrow \mathcal{N}$$, because of the CLT, then
• $$M_n^{\prime} \rightsquigarrow \mathcal{N}$$ (usually)

# Estimating by optimizing

• More general case: $$\psi$$ is a vector $\begin{eqnarray} \hat{\psi}_n & = & \argmin_{\psi}{M_n(\psi)}\\ 0 & = & \nabla M_n(\hat{\psi}_n)\\ 0 & \approx & \nabla M_n(\psi_0) + (\hat{\psi}_n-\psi_0) \nabla \nabla M_n(\psi_0)\\ \nabla \nabla M_n(\psi_0) & \equiv & \mathbf{H}_n(\psi_0) \rightarrow \mathbf{h}\\ \hat{\psi}_n & \approx & \psi_0 - \mathbf{h}^{-1} \nabla M_n(\psi_0)\\ \Var{\hat{\psi}_n } & \approx & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1} \end{eqnarray}$
• $$\mathbf{h} = \nabla \nabla m(\psi_0)$$ Hessian of $$m$$ at $$\psi_0$$
• Again, if $$\nabla M_n(\psi_0)$$ is an average, CLT suggests $$\rightsquigarrow \mathcal{N}$$

# In practice…

• Approximate $$\mathbf{h}$$ by $$\nabla \nabla M_n(\hat{\psi}_n)$$, say $$\mathbf{H}_n$$
• Approximate $$\Var{\nabla M_n(\psi_0)}$$ by sample covariance of the derivative of $$M_n$$, say $$\mathbf{J}_n$$
• Sandwich covariance matrix is $$\mathbf{H}_n^{-1}\mathbf{J}_n \mathbf{H}_n^{-1}$$

# Special application to maximum likelihood

• $$\nabla \nabla \CrossEntropy(\theta) \equiv \mathbf{i}(\theta)$$, the Fisher information
• Fisher identity: $$\Var{\nabla L_n(\theta)} = \frac{1}{n} \mathbf{i}(\theta)$$, if the model is right
• So $$\Var{\hat{\theta}_n} = \frac{1}{n} \mathbf{i}^{-1}(\theta_0)$$, if the model is right

• Exercise (offline):
• Find the Fisher information for the Pareto

# Generalizing away from IID data

• Estimate by optimizing some objective function $$M_n$$ $\hat{\psi}_n = \argmin_{\psi}{M_n(\psi)}$
• Hope the objective function tends to a limit function $$m$$ $M_n(\psi) \rightarrow m(\psi)$
• Hope that the limiting function has its minimum in the right place $\argmin_{\psi}{m(\psi)} = \psi_0$
• Then the exact same arguments apply: $\begin{eqnarray} \hat{\psi}_n & \rightarrow & \psi_0\\ \Var{\hat{\psi}_n} & \rightarrow & \mathbf{h}^{-1} \Var{\nabla M_n(\psi_0)} \mathbf{h}^{-1} \end{eqnarray}$
• We need to make sure our objective functions converge
• We’re going to need a law of large numbers for dependent data

# Summary

• We usually estimate by minimizing an objective function
• If the objective function has a well-behaved limit, the estimate will converge
• The variance of the estimate depends on:
• The curvature of the objective function ($$\uparrow$$ curvature $$\downarrow$$ variance)
• The variance of the derivative of the objective function
• None of these ingredients actually requires independence
• Once we have convergence of our objective function, everything will fall in to place

# Backup: Modes of convergence

• Consistency is usually defined as “convergence in probability”: For any $$\epsilon > 0$$, $\Prob{|\hat{\psi}_n - \psi| \geq \epsilon} \rightarrow 0$
• The main slides use $$L_2$$ convergence, $\Expect{(\hat{\psi}_n - \psi)^2} \rightarrow 0$
• $$L_2$$ convergence implies convergence in probability by Chebyshev’s inequality but not vice versa
• There is also “strong convergence” or “almost sure convergence”: $\Prob{\hat{\psi}_n \rightarrow \psi} = 1$
• A.S.-convergence implies convergence in probability but not vice versa

# Backup: Chebyshev’s inequality

For any random variable $$Z$$, and any $$\epsilon > 0$$,

$\Prob{|Z-\Expect{Z}| > \epsilon} \leq \frac{\Var{Z}}{\epsilon^2}$

Proof: Apply Markov’s inequality to $$(Z-\Expect{Z})^2$$, which is $$\geq 0$$ and has expectation $$\Var{Z}$$.

# Backup: Markov’s inequality

For any non-negative random variable $$Z$$, and any $$\epsilon > 0$$,

$\Prob{Z \geq \epsilon} \leq \frac{\Expect{Z}}{\epsilon}$

Proof: $\begin{eqnarray} Z & = & Z\Indicator{Z \geq \epsilon} + Z\Indicator{Z < \epsilon}\\ \Expect{Z} & = & \Expect{Z \Indicator{Z \geq \epsilon}} + \Expect{Z \Indicator{Z < \epsilon}}\\ & \geq & \Expect{Z \Indicator{Z \geq \epsilon}}\\ & \geq & \Expect{\epsilon \Indicator{Z \geq \epsilon}}\\ & = & \epsilon\Expect{\Indicator{Z \geq \epsilon}} = \epsilon \Prob{Z \geq \epsilon} \end{eqnarray}$

# Backup: Disclaimers to the “More general pattern”

• Need to assume (2’) that $$M_n(\psi) \rightarrow m(\psi)$$ uniformly over $$\psi$$
• Or at least (2’’) uniformly over a region where $$\hat{\psi}_n$$ will concentrate
• Need to assume (3’) that $$m(\psi)$$ has a unique minimum at $$\psi_0$$, and that this minimum is “well-separated”
• Roughly: $$m(\psi) - m(\psi_0)$$ can be very small only if $$\psi$$ is close to $$\psi_0$$

# Backup: The Gibbs inequality

$\begin{eqnarray} \int{f(x) \log{f(x)} dx} - \int{f(x) \log{g(x)} dx} & = & \int{f(x) (\log{f(x)} - \log{g(x)}) dx}\\ & = & \int{f(x) \log{\frac{f(x)}{g(x)}} dx}\\ & = & -\int{f(x) \log{\frac{g(x)}{f(x)}} dx}\\ & \geq & -\log{\int{f(x) \frac{g(x)}{f(x)} dx}} = \log{1} = 0 \end{eqnarray}$

where the last line uses Jensen’s inequality

# Backup: Jensen’s inequality

• A function $$h$$ is convex when, for any $$w \in [0,1]$$, $w h(x_1) + (1-w) h(x_2) \geq h(wx_1 + (1-w) x_2)$
• i.e., for any two points on the curve of $$h(x)$$, the curve lies below the straight line connecting the points
• The function $$-\log{x}$$ is convex (draw it!)
• Jensen’s inequality: for any convex function $$h$$, $$\Expect{h(X)} \geq h(\Expect{X})$$, under any distribution of $$X$$
• basically follows from the definition of convexity
• Average of points on the curve vs. the curve at the average value
• Extension: for any function $$q$$ (convex or not), $$\Expect{h(q(X))} \geq h(\Expect{q(X)})$$

# References

Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51:661–703. http://arxiv.org/abs/0706.1062.

Koenker, Roger, and Kevin F. Hallock. 2001. “Quantile Regression.” Journal of Economic Perspectives 15:143–56. https://doi.org/10.1257/jep.15.4.143.