\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\TrueRegFunc}{\mu} \newcommand{\EstRegFunc}{\widehat{\TrueRegFunc}} \newcommand{\TrueNoise}{\epsilon} \newcommand{\EstNoise}{\widehat{\TrueNoise}} \DeclareMathOperator{\tr}{tr} \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator{\dof}{DoF} \DeclareMathOperator{\det}{det} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

Our data will generally be a realization of a **random process** or **stochastic process**, i.e., a collection of random variables, with one random variable for each point in time and space. The random variable we observe at spatial coordinate \(r\) and time \(t\) will be, generically, \(X(r,t)\). If we only care about spatial data, we’ll have \(X(r)\), and purely temporal data (like the Kyoto data) will be \(X(t)\). In general, the random variables at different points will all be statistically dependent on each other, meaning that \(\Prob{X(r,t) = x_1 ~\mathrm{and}~ X(q,s) = x_2} \neq \Prob{X(r,t) = x_1} \Prob{X(q,s) = x_2}\). Mathematically, it’s sometimes convenient to think of the whole random process as a *random function*, with the domain of the function being coordinates (\(r, t\)) and the range being possible values of \(X(r,t)\). The “dependence” means that when we randomly pick the function, we have to pick the *whole* function; we don’t get to make an independent random choice of the value of \(X(r,t)\) for each separate \(r,t\).

Because the \(X(r,t)\) are random variables, it makes sense to think about their central tendency, and the expected value or mean is usually the most convenient measure of central tendency: \[
\TrueRegFunc(r,t) = \Expect{X(r,t)}
\] This is what we mean by the **trend** of the process. Notice that this is a *deterministic* function, though usually an unknown one we’d like to learn.

Mathematically, we can always decompose^{1} the random variables into the deterministic trend plus random or stochastic **fluctuations**, which always have expectation zero: \[\begin{eqnarray}
X(r,t) & = & \TrueRegFunc(r,t) + \TrueNoise(r,t)\\
\Expect{\TrueNoise(r,t)} & = & 0
\end{eqnarray}\]

We can’t assert anything more about \(\TrueNoise\) without further assumptions. In particular, we can’t make the usual assumptions that are made in regression courses, that \(\TrueNoise\) is uncorrelated, that it has constant variance, that it has the same distribution at each point, that it’s Gaussian, etc. But we can, without any loss of generality, say \(\Expect{\TrueNoise(r,t)} = 0\) everywhere.

We would like to learn the trend function \(\TrueRegFunc\).

If we have *many* independent realizations of the process, we could just average them. Say we can re-run some experiment many times, getting independent realizations \(X^{(1)}, X^{(2)}, \ldots X^{(m)}\). Then \[
\frac{1}{m}\sum_{i=1}^{m}{X^{(i)}(r,t)} \rightarrow \TrueRegFunc(r,t)
\] by the ordinary law of large numbers. As my mention of “experiment” suggests, this idea *can* be applied in the experimental sciences. In neuroscience, for instance, we might repeatedly measure what happens to a particular neuron (or group of neurons, or brain region) after we apply some stimulus to an experimental animal^{2}. This is, however, not available in non-experimental contexts.

More typically, we have only *one* realization of the \(X\) process — only one history of cherry-blossoms in Kyoto, only one map of shale deposits under Pennsylvania, etc. We will only be able to learn about \(\TrueRegFunc\) under assumptions, usually assumptions about \(\TrueRegFunc\) itself^{3}.

In some situations, we might have a well-confirmed scientific theory which tells us *exactly* what the trend should be, say that \(\TrueRegFunc(r,t) = f(r,t)\) for some completely-specified function \(f\). There are very few situations where this applies — *perhaps* in parts of astronomy and physics.

More common, even in physics and astronomy, is the situation where theory pins down a specific functional *form* for the trend, but there are some unknown coefficients in the formula. That is, the theory says that \(\TrueRegFunc(r,t)\) has to be of the form \(f(r,t;\theta)\), where \(f\) is some mathematical expression where everything is known except for one or more parameters \(\theta\). One would then need to estimate \(\theta\) from the data \(X\) to obtain an estimate of the trend. (I’ll say a little bit about *how* to do this estimate below, and we’ll come back to it later in the course.) The dependence of the observations on each other makes the statistical inference more complicated, which we’ll come back to later in the course, but in principle this is just about estimating parameters.

In order for fitting a theoretical model to the data to be a good idea, you need to have reasons to think your theory *does* give you a good idea of the trend. There are domains where this is true — physics, chemistry, some parts of biology (especially the more “population”-y fields: genetics, evolution, ecology, epidemiology) — where we have very well-supported theories which are detailed enough to pin down specific functional forms. (Some economists think this is true in economics as well, but I think they’re fooling themselves.) Beyond that, though, scientific theories are usually much looser – they might tell us that one variable tends to go up with another, but just don’t have the detail to give a functional form. This might not be a permanent condition — it used to be true even of physics! — but for now, in lots of scientific domains, let alone in business or policy, we just don’t have credible theories which tell us about trends.

I said above that if we’ve got a theory which says that the trend function \(\TrueRegFunc(r,t)\) should have a specific form, \(f(r,t;\theta)\), we can estimate the unknown parameters \(\theta\) by fitting the model to the data. This is usually called “curve fitting”, though when we’re dealing with spatial data it’d more properly be “surface fitting”, etc. The oldest^{4} and in many ways most natural way to do this is ordinary least squares (OLS). You’re already familiar with OLS from linear regression, but here’s how it goes for general curve-fitting.

We have \(n\) data points, with coordinates \((r_1, t_1), (r_2, t_2), \ldots (r_n, t_n)\), and at the \(i^{\mathrm{th}}\) data point we observed \(X(r_i, t_i)\). If we make a particular guess about the parameters, say \(\theta\), we predict \(f(r_i, t_i; \theta)\) for what we’d observe at that data point. This will give us a certain mean squared error (MSE) over the data points: \[
MSE(\theta) = \frac{1}{n}\sum_{i=1}^{n}{(X(r_i, t_i) - f(r_i, t_i; \theta))^2}
\] The least-squares estimate (LSE) is the value of the parameters which minimizes the MSE: \[
\hat{\theta} = \argmin_{\theta}{\frac{1}{n}\sum_{i=1}^{n}{(X(r_i, t_i) - f(r_i, t_i; \theta))^2}}
\] Taking the derivative and setting it equal to zero gives \[
\sum_{i=1}^{n}{(X(r_i, t_i) - f(r_i, t_i; \theta)) \nabla_{\theta} f(r_i, t_i; \theta)} = 0
\] which is called the **estimating equation** (or **normal equation**) for the problem. If \(f(r, t; \theta)\) is linear in \(\theta\) this is a linear system of equations which we can solve the same way you did in your linear models course. If \(f(r, t; \theta)\) is non-linear in \(\theta\), we need to use numerical methods for solving non-linear equations (like Newton’s method; see the optional appendix below).
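To make the estimating equation concrete, here is a sketch for a model that is linear in \(\theta\), say \(f(t;\theta) = \theta_1 + \theta_2 t\) (a hypothetical model chosen for illustration). Then \(\nabla_{\theta} f = (1, t)\), and the estimating equation reduces to a \(2 \times 2\) linear system (the normal equations), solvable in closed form:

```python
import random

random.seed(0)

# Illustrative linear-in-theta model f(t; theta) = a + b*t, so the gradient
# is (1, t) and the estimating equation is a 2x2 linear system.
a_true, b_true = 3.0, -0.5
times = [0.1 * k for k in range(200)]
X = [a_true + b_true * t + random.gauss(0, 0.3) for t in times]

n = len(times)
# Normal equations: sum_i (X_i - a - b t_i) * (1, t_i) = 0, i.e.
#   a*n      + b*sum(t)   = sum(X)
#   a*sum(t) + b*sum(t^2) = sum(t*X)
St = sum(times)
Stt = sum(t * t for t in times)
Sx = sum(X)
Stx = sum(t * x for t, x in zip(times, X))

# Solve the 2x2 system by Cramer's rule.
det = n * Stt - St * St
a_hat = (Sx * Stt - St * Stx) / det
b_hat = (n * Stx - St * Sx) / det
print(a_hat, b_hat)  # close to the true (3.0, -0.5)
```

For \(f\) non-linear in \(\theta\) there is no such closed form, and an iterative solver (e.g., Newton’s method) would replace the last few lines.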

Now, *why* should we do this? Well, suppose the model is right, in the sense that \(\TrueRegFunc(r,t) = f(r,t;\theta_0)\) for some particular \(\theta_0\). What will happen to the *expected* MSE? Use linearity of expectation: \[\begin{eqnarray}
\Expect{MSE(\theta_0)} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(X(r_i, t_i) - f(r_i, t_i; \theta_0))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(\TrueRegFunc(r_i, t_i) + \TrueNoise(r_i, t_i) - f(r_i, t_i; \theta_0))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(f(r_i, t_i;\theta_0) + \TrueNoise(r_i, t_i) - f(r_i, t_i; \theta_0))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(\TrueNoise(r_i, t_i))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Var{\TrueNoise(r_i, t_i)}}
\end{eqnarray}\] since \(\Expect{\TrueNoise(r,t)} = 0\) everywhere. On the other hand, look at what happens if we use a *different* parameter value \(\theta \neq \theta_0\); things start off almost the same but then take a turn: \[\begin{eqnarray}
\Expect{MSE(\theta)} & = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(X(r_i, t_i) - f(r_i, t_i; \theta))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(\TrueRegFunc(r_i, t_i) + \TrueNoise(r_i, t_i) - f(r_i, t_i; \theta))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(f(r_i, t_i;\theta_0) + \TrueNoise(r_i, t_i) - f(r_i, t_i; \theta))^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{\left((f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta)) + \TrueNoise(r_i, t_i)\right)^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{\Expect{(f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta))^2 + 2(f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta))\TrueNoise(r_i, t_i) + \TrueNoise(r_i, t_i)^2}}\\
& = & \frac{1}{n}\sum_{i=1}^{n}{(f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta))^2 + \Var{\TrueNoise(r_i, t_i)}}
\end{eqnarray}\] since \((f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta))\) is non-random and \(\Expect{\TrueNoise(r_i, t_i)} = 0\). So \[
\Expect{MSE(\theta)} - \Expect{MSE(\theta_0)} = \frac{1}{n}\sum_{i=1}^{n}{(f(r_i, t_i;\theta_0) - f(r_i, t_i; \theta))^2} \geq 0
\] In short: no parameter value has a smaller *expected* MSE than the true one, and the inequality is strict unless \(f(\cdot;\theta)\) agrees with \(f(\cdot;\theta_0)\) at every data point. If we think that the MSE from larger and larger data sets should converge on the expected MSE, then the least-squares estimates should converge on the true parameter values. When and why the large-sample MSE should converge is something we’ll come back to later in the course.
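This identity can be checked by Monte Carlo. The sketch below uses a made-up model \(f(t;\theta) = \theta t\) and Gaussian noise (both illustrative choices): averaging over many simulated noise draws, the gap between \(MSE(\theta)\) and \(MSE(\theta_0)\) should match the average squared difference of the two regression curves.

```python
import random

random.seed(1)

# Toy check of the expected-MSE identity, with illustrative model f(t; theta) = theta * t.
theta0 = 2.0        # "true" parameter
theta_alt = 2.5     # some other parameter value
times = [0.1 * k for k in range(1, 21)]
sd = 1.0
reps = 20000

def mse(theta, X):
    # Mean squared error of the fit with parameter theta on data X.
    return sum((x - theta * t) ** 2 for x, t in zip(X, times)) / len(times)

tot0 = tot_alt = 0.0
for _ in range(reps):
    # Fresh realization X(t) = theta0 * t + epsilon(t) each repetition.
    X = [theta0 * t + random.gauss(0, sd) for t in times]
    tot0 += mse(theta0, X)
    tot_alt += mse(theta_alt, X)

# Monte Carlo estimate of E[MSE(theta_alt)] - E[MSE(theta0)] ...
gap = tot_alt / reps - tot0 / reps
# ... versus the theoretical gap: average squared curve difference.
predicted = sum(((theta0 - theta_alt) * t) ** 2 for t in times) / len(times)
print(gap, predicted)  # the Monte Carlo gap should approximate the predicted gap
```

The wrong parameter value pays a positive penalty in expected MSE, and the size of that penalty is exactly the average squared discrepancy between the two candidate trend curves.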

Here’s a nice example of curve-fitting to get a trend, from Bulliet (1979). Using historical sources^{5}, Bulliet was able to work out when various people in medieval Iran converted to Islam, and so what fraction of all the conversions happened between particular years:

By taking cumulative sums, this also told Bulliet what fraction of conversions had happened up to any given year: