---
title: Density estimation
author:
date: 21 March 2019
output: slidy_presentation
---
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(size="small",background="white", highlight=FALSE, cache=TRUE,
autodep=TRUE, tidy=TRUE,
warning=FALSE, message=FALSE)
options(np.messages=FALSE)
```
\[
\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]}
\newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)}
\]
## Data: body weights of cats
```{r}
library(MASS)
data(cats)
head(cats$Bwt)
summary(cats$Bwt)
```
## The empirical cumulative distribution function
- ECDF:
\[
\hat{F}_n(a) = \frac{1}{n}\sum_{i=1}^{n}{\Indicator{x_i \leq a}}
\]
+ Makes a jump at each distinct value seen in the data
+ Constant in between the jumps
## Computationally: `ecdf`
```{r}
plot(ecdf(cats$Bwt), xlab="Body weight (kg)", ylab="Cumulative probability")
```
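- Quick check of the definition (a sketch; the 3 kg cut-off is arbitrary): the ECDF at $a$ is just the fraction of cats weighing at most $a$
```{r}
# Fraction of observations at or below a = 3 kg ...
mean(cats$Bwt <= 3)
# ... agrees with the ECDF, which ecdf() returns as a function we can evaluate
ecdf(cats$Bwt)(3)
```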
## The ECDF estimates the CDF
- Suppose $x_1, x_2, \ldots x_n$ are IID with CDF $F$. Then for _any_ $a$,
\[
\hat{F}_n(a) \rightarrow F(a)
\]
by the law of large numbers
+ Also: see backup slides
## The CDF is not enough
```{r, echo=FALSE}
plot(ecdf(cats$Bwt), xlab="Body weight (kg)", ylab="Cumulative probability")
```
- What's the probability of a cat weighing between $3.21$ and $3.22$ kg?
+ ECDF says: exactly 0
- Q from discussion: What about running a smoother (e.g., a spline) through the ECDF?
- A: You could do that, but you'd need to make sure the curve was always monotone, between 0 and 1, etc.
+ Once we estimate the density, we can recover the CDF by integrating
## The CDF is not enough
- More generally, pdf is derivative of CDF, $f(x) = dF/dx$
- Derivative of an _empirical_ CDF is either 0 or $\infty$
+ Where is it $\infty$?
- How do we estimate a pdf?
## What about a histogram?
```{r}
hist(cats$Bwt, xlab="Body weight (kg)", ylab="Number of cats", breaks=20)
```
- Gives us an idea of where we have more or fewer data points...
## Fitting a parametric model
- Pick a parametric pdf that looks sensible
- For example, the gamma distribution with shape parameter $a$ and scale $s$ is
\[
f(x) = \frac{1}{s^a \Gamma(a)} x^{a-1} e^{-x/s}
\]
+ Exponential is the special case $a=1$
+ $a > 1$ is a hump with an exponential right tail
```{r, echo=FALSE}
curve(dgamma(x, shape=1, scale=1), lty="solid", from=0, to=4,
ylab="gamma pdf")
curve(dgamma(x, shape=2, scale=1), lty="dashed", add=TRUE)
curve(dgamma(x, shape=2, scale=0.5), lty="dotted", add=TRUE)
legend("topright", legend=c(expression(list(a==1, s==1)),
expression(list(a==2, s==1)),
expression(list(a==2, s==0.5))),
lty=c("solid", "dashed", "dotted"))
```
## Fitting a parametric model (cont'd)
- Now maximize the likelihood
- For some (but not all) standard distributions, the function `fitdistr` in the package MASS
has pre-programmed maximum likelihood estimates
+ You saw this in homework 3 for the $t$ distribution
+ As it happens, `fitdistr` knows about gamma distributions too
- It's also easy to maximize the log-likelihood yourself
```{r}
# Define the (negative, normalized) log-likelihood function
gamma.loglike <- function(params, data=cats$Bwt) {
  -mean(log(dgamma(x=data, shape=params[1], scale=params[2])))
}
# Ask R to optimize it, with a starting guess about the parameters
cats.gamma <- optim(fn=gamma.loglike, par=c(1,1))
# What is the best (negative, normalized) log-likelihood?
cats.gamma$value
# What parameters did optim find for us?
cats.gamma$par
```
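- Cross-check (a sketch): `fitdistr` parameterizes the gamma by shape and **rate** $=1/s$, so compare $1/\widehat{\text{rate}}$ with the scale `optim` found
```{r}
cats.gamma.fd <- fitdistr(cats$Bwt, densfun="gamma")
cats.gamma.fd$estimate
# Convert rate to scale for comparison with cats.gamma$par[2]
1/cats.gamma.fd$estimate["rate"]
```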
## Compare the fitted density to the histogram
```{r}
plot(hist(cats$Bwt,plot=FALSE, breaks=20), freq=FALSE,
xlab="Body weight (kg)")
curve(dgamma(x, shape=cats.gamma$par[1], scale=cats.gamma$par[2]), add=TRUE)
```
## The histogram is already a density estimate
- Estimated density is _constant_ on bins
- $N_i =$ number of samples in bin $i$
- $i(x)=$ which bin $x$ falls into
- $h =$ width of the bins
\[
\hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n}
\]
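- A minimal sketch of the formula, with a (hypothetical) bin width of $h=0.25$ kg; it should match what `hist` reports on the density scale
```{r}
h <- 0.25
breaks <- seq(2, 4, by=h)                                  # fixed bins covering the cat weights
N <- table(cut(cats$Bwt, breaks, include.lowest=TRUE))     # N_i = number of cats in bin i
f.hist <- as.vector(N)/(length(cats$Bwt)*h)                # N_i/(n h)
all.equal(f.hist, hist(cats$Bwt, breaks=breaks, plot=FALSE)$density)
```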
## Why this works
Probability of falling into a region of length $h$ around $x$ will be
\[
\int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h
\]
for very small $h$. So
\[
N_i \sim \mathrm{Binom}(n, f(x) h)
\]
and
\[
\hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h)
\]
which $\rightarrow f(x)$ as $n \rightarrow \infty$
## How bad is a histogram?
- Error from variance: $\propto 1/nh$
- Error from bias: $\propto h^2 (f^{\prime}(x))^2$
- Total error for estimating the density: $O(h^2) + O(1/nh)$
- Best $h = O(n^{-1/3})$
- Best total error: $O(n^{-2/3})$
    + vs. $O(n^{-1})$ if we've got a correct parametric model
- Q from discussion: What about variable-width bins?
- A: That can help, but it's even better to get away from fixed bins altogether
## Kernels
Idea: "Smooth out" each observation into a little distribution
Kernel functions _are_ pdfs
Put a copy of the kernel at each observation; add them up
\[
\hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)}
\]
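- A sketch of the formula at a single (arbitrary) point $x_0$, with a Gaussian kernel and an arbitrary $h$; `density()`, which evaluates on a grid, should agree approximately
```{r}
h <- 0.1; x0 <- 3
mean(dnorm((x0 - cats$Bwt)/h))/h              # (1/n) sum (1/h) K((x0 - x_i)/h)
approx(density(cats$Bwt, bw=h), xout=x0)$y    # interpolate density()'s grid at x0
```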
## How this works for cats
Say $h=0.01$
```{r, results="hide", echo=FALSE}
# Artificially add a very small amount of noise to each value, just for
# plotting purposes
jittered.cats <- jitter(cats$Bwt)
plot(0, xlim=range(cats$Bwt), ylim=c(0,4), type="n",
xlab="Body weight (kg)", ylab="Density") # Establish empty plot
# Tick mark on the axis for each data point
rug(jittered.cats, side=1, col="grey")
# Put a little, scaled-down copy of the Gaussian pdf around each data point
sapply(jittered.cats, function(xi) {curve(dnorm(x, mean=xi, sd=0.01)/nrow(cats), n=1000, add=TRUE, col="grey")})
# Show what we get when we sum up all the little grey curves
lines(density(cats$Bwt, bw=0.01))
```
## How do we pick the bandwidth?
- Cross-validation
- Cross-validate _what_?
- **Integrated squared error**: $\int{(f(x) - \hat{f}(x))^2 dx}$
- Or (negative, normalized) **log-likelihood** $\int{-\log{(\hat{f}(x))} f(x) dx}$
## Evaluating the log-likelihood
- The integral is an expected value:
\[
\int{-\log{(\hat{f}(x))} f(x) dx} = \Expect{-\log{\hat{f}(X)}}
\]
- With $n$ IID samples $x_i$,
\[
\Expect{-\log{\hat{f}(X)}} \approx \frac{1}{n}\sum_{i=1}^{n}{-\log{\hat{f}(x_i)}}
\]
by LLN
+ A similar but more complicated trick for ISE; see backup slides
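- A rough sketch of the recipe (assumptions: Gaussian kernel, leave-one-out CV, an arbitrary grid of bandwidths)
```{r}
# Leave-one-out estimate of the expected negative log-likelihood at bandwidth h
loo.nll <- function(h, x=cats$Bwt) {
  -mean(sapply(seq_along(x), function(i) {
    log(mean(dnorm((x[i] - x[-i])/h))/h)   # \hat{f}_{-i}(x_i): estimate without point i
  }))
}
bandwidths <- seq(0.05, 0.5, by=0.05)
bandwidths[which.min(sapply(bandwidths, loo.nll))]
```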
## Computationally: `npudens`
```{r}
library(np)
bwt.np <- npudens(cats$Bwt)
plot(bwt.np, xlab="Body weight (kg)")
```
## Joint PDFs
- What about multiple variables? We want $f(x_1, x_2, \ldots x_p) = f(\vec{x})$
```{r}
plot(Hwt ~ Bwt, data=cats, xlab="Body weight (kg)", ylab="Heart weight (g)")
```
- Parametric models work just the same way as before
- Kernel density estimates work almost the same way
## Joint KDEs
- Use a product of kernels, one for each variable
\[
\hat{f}_{kern}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}
\]
- Need as many bandwidths as there are variables
- Pick bandwidths for best prediction of _both_ variables
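- Sketch of the product-kernel formula at one (arbitrary) point, with hypothetical, not cross-validated, bandwidths
```{r}
h1 <- 0.1; h2 <- 0.5
# Estimated joint density at (3 kg, 10 g)
mean(dnorm((3 - cats$Bwt)/h1) * dnorm((10 - cats$Hwt)/h2))/(h1*h2)
```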
## Computationally: `npudens` again
```{r}
cats.np <- npudens(~Bwt+Hwt, data=cats)
plot(cats.np, view="fixed", theta=10, phi=45)
```
**EXERCISE**: Where do the ridges come from?
- Q1: What happened to the bandwidth on body weight (`Bwt`)?
+ It got a lot smaller
- Q2: Why did that bandwidth change?
+ There's a strong relationship between `Bwt` and `Hwt`; if we want to get _both_ right, we can't average over as wide a range of `Bwt`
## Conditional PDFs
- Conditional pdf:
\[
f(x_2|x_1) = \frac{f(x_1, x_2)}{f(x_1)}
\]
so find the joint and marginal pdfs and divide
- Conditional KDE:
\[
\hat{f}_{kern}(x_2|x_1) = \frac{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1 h_2} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)K\left(\frac{x_{2}-x_{i2}}{h_2}\right)}}{\frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h_1} K\left(\frac{x_{1}-x_{i1}}{h_1}\right)}}
\]
- Pick bandwidths for best prediction of $x_2$
+ i.e., look at the conditional log-likelihood of $x_2$ given $x_1$
## Computationally: `npcdens`
```{r}
cats.hwt.on.bwt <- npcdens(Hwt ~ Bwt, data=cats)
plot(cats.hwt.on.bwt, view="fixed", theta=10, phi=65)
```
## How well does kernel density estimation work?
- The same sort of analysis we did for kernel regression will work here too
+ More book-keeping --- see the text
- Integrated squared error of KDE is $O(n^{-4/5})$ in 1D
+ Worse than a well-specified parametric model, $O(n^{-1})$
+ Better than a mis-specified parametric model, $O(1)$
+ Better than a histogram, $O(n^{-2/3})$
- In $p$ dimensions, ISE of joint KDE is $O(n^{-4/(4+p)})$
+ Nothing which is _universally_ consistent does any better
+ Curse of dimensionality again
## Summing up
- It's easy to learn the CDF
+ use the empirical CDF
- It's easy to estimate parametric distributions
+ maximize the log-likelihood
- Histograms are a first step to a _non-parametric_ density estimate
+ Kind of ugly and statistically inefficient, though
- Kernel density estimation lets us learn marginal pdfs, joint pdfs, conditional pdfs...
+ The curse of dimensionality is always with us though
READING: Chapter 14
## Backup: More about the ECDF
- **Glivenko-Cantelli** theorem: If $X$'s are IID with CDF $F$, then
\[
\max_{a}{|\hat{F}_n(a) - F(a)|} \rightarrow 0
\]
+ This is a stronger result than just saying "for _each_ $a$, $|\hat{F}_n(a)-F(a)| \rightarrow 0$"
- **Kolmogorov-Smirnov** (KS) test: To test whether $X$'s are IID with CDF $F_0$, use $d_{KS} = \max_{a}{|\hat{F}_n(a)-F_0(a)|}$
+ If $X \sim F_0$, then $d_{KS} \rightarrow 0$ as $n\rightarrow \infty$
+ If $X \not\sim F_0$, then $d_{KS} \rightarrow$ some positive number
+ The sampling distribution of $d_{KS}$ under the null turns out to be the _same_ for all $F_0$ (at least for big $n$)
+ Needs modification if you estimate the $F_0$ from data as well
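- Illustration (a sketch): comparing the cat weights to the gamma fit from earlier; the caveat above applies, since shape and scale were estimated from the same data
```{r}
ks.test(cats$Bwt, "pgamma", shape=cats.gamma$par[1], scale=cats.gamma$par[2])
```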
## Backup: Parametric density estimates
- We have a parametric model for the density $f(x;\theta)$ with a _finite_-dimensional vector of parameters $\theta$
- We maximize the log-likelihood to get $\hat{\theta}$
- Equivalently we minimize the negative normalized log-likelihood
\[
L(\theta) = -\frac{1}{n}\sum_{i=1}^{n}{\log{f(X_i;\theta)}}
\]
- This is a sample mean so LLN says $L(\theta) \rightarrow \Expect{L(\theta)} \equiv \lambda(\theta)$
- You can show that _if_ the data came from the model with some true parameter $\theta_0$, then $\lambda(\theta_0)$ is the unique global minimum
## Backup: More about parametric density estimates
- Quite generally, the MLE $\hat{\theta} \rightarrow \theta_0$
- Almost as generally, for large $n$, $\hat{\theta} \sim \mathcal{N}(\theta_0, \mathbf{J}^{-1}(\theta_0)/n)$
+ Hence $\hat{\theta} = \theta_0 + O(1/\sqrt{n})$
+ Here $\mathbf{J}(\theta) =$ matrix of 2nd partial derivatives of $\lambda$ = **Fisher information**
+ Can plug in $\mathbf{J}(\hat{\theta})$ to get approximate confidence intervals, etc.
+ Doesn't matter if the _data_ is Gaussian --- $L(\theta)$ is a sample average so by the CLT _it's_ (approximately) Gaussian, and that propagates
- Things are more complicated if the model is mis-specified
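- Sketch: approximate standard errors for the gamma fit from earlier, using the Hessian that `optim` can return for the (negative, normalized) log-likelihood
```{r}
cats.gamma.h <- optim(fn=gamma.loglike, par=c(1,1), hessian=TRUE)
# Hessian of the per-observation loss estimates J(theta-hat);
# covariance of the MLE is approximately J^{-1}/n
sqrt(diag(solve(cats.gamma.h$hessian))/nrow(cats))
```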
## Backup: Evaluating ISE
\begin{eqnarray}
\int{(f(x) - \hat{f}(x))^2 dx} & = & \int{f^2(x) - 2f(x)\hat{f}(x) + \hat{f}^2(x) dx}\\
& = & \int{f^2(x) dx} - 2\int{\hat{f}(x) f(x) dx} + \int{\hat{f}^2(x) dx}\\
& = & (A) - (B) + (C)\\
& = & (\text{same for all estimates}) - (\text{depends on truth and estimate}) + (\text{depends only on estimate})
\end{eqnarray}
- If we're comparing different estimates, term (A) doesn't matter, so we can ignore it
- Term (C) depends only on our estimate, and we can get it by (say) numerical integration
- That leaves term (B), and with $n$ independent samples $x_i$,
\[
\int{\hat{f}(x) f(x) dx} \approx \frac{1}{n}\sum_{i=1}^{n}{\hat{f}(x_i)}
\]
by LLN
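- Sketch of the recipe for one (arbitrary) bandwidth, with a Gaussian kernel; a fully cross-validated version would hold out each $x_i$ when evaluating term (B)
```{r}
ise.crit <- function(h, x=cats$Bwt) {
  fhat <- function(u) { sapply(u, function(u0) { mean(dnorm((u0 - x)/h))/h }) }
  C <- integrate(function(u) { fhat(u)^2 },
                 lower=min(x)-1, upper=max(x)+1)$value  # term (C), numerically
  B <- 2*mean(fhat(x))  # term (B), via the (in-sample) sample-average approximation
  C - B                 # the part of the ISE that depends on the estimate
}
ise.crit(0.1)
```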