Density estimation

21 March 2019

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Indicator}[1]{\mathbb{1}\left( #1 \right)} \]

Data: body weights of cats

head(cats$Bwt)
## [1] 2.0 2.0 2.0 2.1 2.1 2.1
summary(cats$Bwt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.300   2.700   2.724   3.025   3.900

The empirical cumulative distribution function

Computationally: ecdf

library(MASS)  # the cats data set lives in MASS
plot(ecdf(cats$Bwt), xlab = "Body weight (kg)", ylab = "Cumulative probability")

The ECDF estimates the CDF
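The ECDF is just the fraction of the sample at or below each point, \(\hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^{n}{\Indicator{x_i \leq x}}\). A minimal hand-rolled version (on a small made-up sample, not the cats data) shows there is nothing hidden inside `ecdf`:

```r
# Hand-rolled ECDF: fraction of observations at or below t
x <- c(2.0, 2.1, 2.3, 2.7, 3.0, 3.9)  # made-up weights, for illustration
Fhat <- function(t) mean(x <= t)
Fhat(2.5)        # 3 of 6 observations are <= 2.5, so 0.5
ecdf(x)(2.5)     # R's built-in agrees
```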

The CDF is not enough


What about a histogram?

hist(cats$Bwt, xlab = "Body weight (kg)", ylab = "Number of cats", breaks = 20)

Fitting a parametric model

Fitting a parametric model (cont’d)

# Define the (negative, normalized) log-likelihood function
gamma.loglike <- function(params, data = cats$Bwt) {
    -mean(log(dgamma(x = data, shape = params[1], scale = params[2])))
}
# Ask R to optimize it, with a starting guess about the parameters
cats.gamma <- optim(fn = gamma.loglike, par = c(1, 1))
# What is the best (negative, normalized) log-likelihood?
cats.gamma$value
## [1] 0.6677077
# What parameters did optim find for us?
cats.gamma$par
## [1] 32.65187332  0.08341489

Compare the fitted density to the histogram

plot(hist(cats$Bwt, plot = FALSE, breaks = 20), freq = FALSE, xlab = "Body weight (kg)")
curve(dgamma(x, shape = cats.gamma$par[1], scale = cats.gamma$par[2]), add = TRUE)

The histogram is already a density estimate

\[ \hat{f}_{hist}(x) \equiv \frac{1}{h} \frac{N_{i(x)}}{n} \]

where \(i(x)\) is the bin containing \(x\), \(N_{i(x)}\) is the number of observations in that bin, and \(h\) is the bin width

Why this works

Probability of falling into a region of length \(h\) around \(x\) will be \[ \int_{x-h/2}^{x+h/2}{f(u) du} \approx f(x) h \] for very small \(h\). So \[ N_i \sim \mathrm{Binom}(n, f(x) h) \] and \[ \hat{f}_{hist}(x) = \frac{N_i}{nh} \sim \frac{1}{nh} \mathrm{Binom}(n, f(x) h) \] which \(\rightarrow f(x)\) as \(n \rightarrow \infty\)
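We can check the \(N_i/nh\) formula directly against `hist`: with `freq = FALSE`, the bar heights are exactly the bin counts divided by \(nh\). A sketch on a small made-up sample (not the cats data):

```r
# Histogram density estimate: bin count / (n * h)
x <- c(2.0, 2.1, 2.3, 2.7, 3.0, 3.9)               # made-up sample
h <- 0.5                                           # bin width
breaks <- seq(2.0, 4.0, by = h)
N <- table(cut(x, breaks, include.lowest = TRUE))  # counts per bin
fhat <- as.numeric(N) / (length(x) * h)
# hist() reports the same numbers as its density:
hist(x, breaks = breaks, plot = FALSE)$density
```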

How bad is a histogram?

Error from variance: \(\propto 1/nh\)

Error from bias: \(\propto h^2 (f^{\prime}(x))^2\)

Total error for estimating the density, \(O(h^2) + O(1/nh)\)

Best \(h = O(n^{-1/3})\)

Best total error, \(O(n^{-2/3})\)

vs. \(O(n^{-1})\) if we’ve got a correct parametric model
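To see where the \(O(n^{-1/3})\) rate comes from, write the total error as \(C_1 h^2 + C_2 / (nh)\) (the constants absorb the \(f^{\prime}\) terms) and minimize over \(h\):

\[ \frac{d}{dh}\left( C_1 h^2 + \frac{C_2}{nh} \right) = 2 C_1 h - \frac{C_2}{n h^2} = 0 \quad \Rightarrow \quad h_{opt} = \left( \frac{C_2}{2 C_1 n} \right)^{1/3} = O(n^{-1/3}) \]

Substituting \(h_{opt}\) back in, both error terms are \(O(n^{-2/3})\)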

Q from discussion: What about variable-width bins?

A: That can help, but it’s even better to get away from fixed bins altogether


Idea: “Smooth out” each observation into a little distribution

Kernel functions are pdfs

Put a copy of the kernel at each observation; add them up

\[ \hat{f}_{kern}(x) = \frac{1}{n} \sum_{i=1}^{n}{\frac{1}{h} K\left(\frac{x-x_i}{h}\right)} \]
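The formula is short enough to implement directly. A sketch with a Gaussian kernel on a made-up sample (the bandwidth \(h\) here is arbitrary, not chosen by any principled rule):

```r
# Kernel density estimate at a point t:
# (1/n) * sum_i (1/h) * K((t - x_i)/h), with K the standard Gaussian pdf
x <- c(2.0, 2.1, 2.3, 2.7, 3.0, 3.9)   # made-up sample
h <- 0.2                                # arbitrary bandwidth
fhat.kern <- function(t) mean(dnorm((t - x) / h)) / h
fhat.kern(2.5)
# The built-in density() does the same computation on a grid:
plot(density(x, bw = h, kernel = "gaussian"))
```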

How this works for cats

Say \(h=0.01\)

How do we pick the bandwidth?

Evaluating the log-likelihood
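Why the data must be held out: the in-sample log-likelihood of a kernel density estimate only grows as \(h \rightarrow 0\), since each observation sits under its own spike. A standard fix is to score each \(x_i\) against the density estimated from the other \(n-1\) points. A leave-one-out sketch (made-up sample; `loo.loglike` is an illustrative helper, not a library function):

```r
# Leave-one-out log-likelihood as a function of the bandwidth h
x <- c(2.0, 2.1, 2.3, 2.7, 3.0, 3.9)   # made-up sample
loo.loglike <- function(h) {
  mean(sapply(seq_along(x), function(i) {
    # density at x[i], estimated from everything except x[i]
    log(mean(dnorm((x[i] - x[-i]) / h)) / h)
  }))
}
hs <- seq(0.05, 1, by = 0.05)
hs[which.max(sapply(hs, loo.loglike))]  # bandwidth with the best held-out fit
```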

Computationally: npudens

library(np)
cats.np <- npudens(cats$Bwt)   # bandwidth chosen by cross-validation
plot(cats.np, xlab = "Body weight (kg)")