Since my Ph.D. training was split between the Statistics Department and the Heinz School of Public Policy, I can attest that statisticians and economists sometimes talk past each other. This blog post explains the econometric terms exogenous and endogenous in a framework that is perhaps more natural for a statistician. The post ends with a realization that I should pay more attention to exogeneity when including covariates to increase the power of an experiment.

# A statistician's model

Consider the following regression model:
\begin{equation}\begin{aligned}
\Y & = \X\boldbeta + \e &
\e & \sim N(\Zero,\R) \label{eq:regmodel}
\end{aligned}\end{equation}
where $\X$ is a **known** design matrix composed of **fixed numbers**, $\boldbeta$ is a vector of unknown regression coefficients, and $\e$ is a vector of residual errors with covariance matrix $\R$.

There is a vast literature devoted to studying this model. For example, the optimal method of estimation for this model varies depending on the specification of $\R$. If $\R = \sigma^2 \I$ where $\I$ is the identity matrix, then Ordinary Least Squares (OLS) is optimal for estimation and hypothesis testing. If $\R = \sigma^2 \V$ where $\V$ is positive definite and known, then Generalized Least Squares (GLS) is optimal. If $\R$ is positive definite but unknown, then REstricted Maximum Likelihood (REML) has desirable properties.
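As a rough numerical sketch of the OLS and GLS cases above (the variable names and the particular heteroskedastic setup are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Design matrix of fixed numbers and true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])

# Heteroskedastic errors: R = diag(v) with v known
v = np.linspace(0.5, 4.0, n)
y = X @ beta + rng.normal(scale=np.sqrt(v))

# OLS: (X'X)^{-1} X'y -- optimal when R = sigma^2 * I
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# GLS: (X'V^{-1}X)^{-1} X'V^{-1}y -- optimal when V is known
W = np.diag(1.0 / v)
beta_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Both estimators are consistent here; GLS simply weights observations by their known precision.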

Note that since $\X$ is assumed to be composed of fixed numbers, it is not random. By definition, its covariance with any random variable is zero.

# An econometrician's model

On the advice of Brian Kovak and Seth Richards-Shubik, I use the econometrics textbook used by CMU (and many other places):

Wooldridge, J. M. (2011). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA, 2nd edition.

as representative of the views of econometricians.

Although Wooldridge uses the same notation as the model of Equation \ref{eq:regmodel}, his model is quite different: the columns of the matrix $\X$ are assumed to be draws from some random variable. His $\X$ matrix is not fixed; it is random.

From the perspective of a statistician, Wooldridge is modeling $Y$ as a **transformation of random variables**:
\begin{equation*}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + U
\end{equation*}
such that $X_1, X_2, \dots, X_K$ are all observable **random variables**,
$U$ is an unobservable random variable, and
$\beta_0, \beta_1, \dots, \beta_K$ are constants that are fixed and
unknown. We observe a random draw from this structural model
\begin{equation*}
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + u
\end{equation*}
such that $y$ is a draw from $Y$, $x_1$ is a draw from $X_1$, etc.

Wooldridge outlines a series of assumptions that allow an econometrician to ignore the stochastic nature of the matrix $\X$ and use the regression model of Equation \ref{eq:regmodel} instead of the true model. Wooldridge gives the key assumption a technical name: **exogeneity**. A variable is exogenous if it is uncorrelated with the error term $U$. If all variables in the matrix $\X$ are exogenous, then Wooldridge argues the regression model of Equation \ref{eq:regmodel} is optimal for point estimation and hypothesis testing -- even though the model is wrong!
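A quick simulation of this claim (a sketch with made-up numbers): the covariate below is random, but it is exogenous -- drawn independently of the error -- and plain OLS still recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
beta = np.array([1.0, 2.0])

# x is random, but exogenous: drawn independently of the error u
x = rng.normal(size=n)
u = rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
y = X @ beta + u

# OLS treats the random X as if it were fixed -- and still works
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```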

# A statistician's example: Stochastic $\X$ and exogeneity

I, at least, had great trouble following Wooldridge's argument, so I created the following example to better understand his claims. I hope that it is useful to others.

Let $\X$ be partitioned into two parts: an $n \times 1$ column of ones and $p-1$ columns of multivariate random normal vectors $\X_1, \dots, \X_{p-1}$:
\begin{align*} \X & = \begin{bmatrix} \One_n & \X_1 & \X_2 & \dots & \X_{p-1} \end{bmatrix} \\ & = \begin{bmatrix} \One_n & \widetilde{\X}_{n\times (p-1)} \end{bmatrix} \end{align*}
where the subscripts denote the dimensions of the matrices. The multivariate random normal vectors can be correlated with each other. To express these correlations compactly, the $\Vecm{\cdot}$ operator (see Henderson and Searle (1979) or Wikipedia for properties) is used to stack the columns of the matrix on top of one another
\begin{align*} \Vecm{\X} & = \begin{bmatrix} \One_n \\ \X_1 \\ \X_2 \\ \vdots \\ \X_{p-1} \end{bmatrix} \end{align*}
so that the mean and variance can be expressed as
\begin{align*} \E{\Vecm{\X}} & = \begin{bmatrix} \One_n \\ \M_1 \\ \M_2 \\ \vdots \\ \M_{p-1} \end{bmatrix} \\ \Var{\Vecm{\X}} & = \begin{bmatrix} \Zero_{n\times n} & \Zero \\ \Zero & \boldSigma_{n(p-1) \times n(p-1) } \end{bmatrix} \end{align*}
where the matrix of means $\M$ and the covariance matrix $\boldSigma$ are assumed to be fixed but unknown. Note that although the intercept is uncorrelated with the random portions of $\X$, the random portions of $\X$ are allowed to covary arbitrarily.
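The vec operator is nothing exotic: it simply stacks the columns of a matrix. In NumPy this is a column-major flatten (a small sketch, not from the post):

```python
import numpy as np

X = np.array([[1, 4],
              [2, 5],
              [3, 6]])

# vec(X): stack the columns of X on top of one another.
# order="F" (Fortran / column-major) reads down each column in turn.
vec_X = X.flatten(order="F").reshape(-1, 1)
# vec_X is [[1], [2], [3], [4], [5], [6]]
```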

Let the unobservable error $\e$ be a mean-zero multivariate normal, which can be correlated with the random portions of $\X$. We parametrize the joint distribution as
\begin{align} \begin{pmatrix} \Vecm{\X} \\ \e \end{pmatrix} & = \begin{pmatrix} \One_n \\ \Vecm{\widetilde{\X}} \\ \e \end{pmatrix} \sim N \left( \A, \B \right) \label{eq:jointdist} \\ \A & = \begin{bmatrix} \One_n \\ \Vecm{\M} \\ \Zero_{n \times 1} \end{bmatrix} \nonumber \\ \B & = \begin{bmatrix} \Zero_{n\times n} & \Zero & \Zero \\ \Zero & \boldSigma_{n(p-1) \times n(p-1) } & \Q \\ \Zero & \Q^\top & \R_{n \times n} \end{bmatrix} \nonumber \end{align}
Finally, let $\Y$ be a linear transformation of the random variables $\X$ and $\e$ such that
\begin{align} \Y & = \X \boldbeta + \e \label{eq:randommodel} \end{align}
where $\boldbeta$ is a vector of regression coefficients.

With a little tedium and algebra, it can be shown that the distribution of $\Y$ from Equation \ref{eq:randommodel} is
\begin{align} \Y & \sim N \left( \begin{bmatrix} \One_{n} & \M \end{bmatrix} \begin{bmatrix} \beta_0 \\ \widetilde{\boldbeta} \end{bmatrix} \quad , \quad \B_\boldSigma + \B_\Q + \B_\R \right) \label{eq:disty} \\ \B_\boldSigma & = \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \boldSigma \left( \widetilde{\boldbeta} \otimes \I_n \right) \nonumber \\ \B_\Q & = \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \Q + \left\{ \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \Q \right\}^\top \nonumber \\ \B_\R & = \R \nonumber \end{align}
where $\widetilde{\boldbeta}$ corresponds to the regression coefficients for the random part of $\X$ and $\otimes$ is the direct product (see Henderson and Searle (1979) or Wikipedia for properties).
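The algebra above rests on the vec identity $\widetilde{\X}\widetilde{\boldbeta} = (\widetilde{\boldbeta}^\top \otimes \I_n)\,\Vecm{\widetilde{\X}}$, which is easy to check numerically (a sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 4, 3                     # n observations, k random columns

Xt = rng.normal(size=(n, k))    # the random part of the design matrix
b = rng.normal(size=(k, 1))     # coefficients on the random columns

vec_Xt = Xt.flatten(order="F").reshape(-1, 1)   # stack columns

# X~ b  ==  (b' (x) I_n) vec(X~): the identity behind B_Sigma
lhs = Xt @ b
rhs = np.kron(b.T, np.eye(n)) @ vec_Xt
```

Since $\widetilde{\X}\widetilde{\boldbeta}$ is a linear map of $\Vecm{\widetilde{\X}}$, its covariance is the sandwich $(\widetilde{\boldbeta}^\top \otimes \I_n)\,\boldSigma\,(\widetilde{\boldbeta} \otimes \I_n)$, which is exactly $\B_\boldSigma$.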

The marginal distribution of $\Y$ given in Equation \ref{eq:disty} is complex, but it has the components that one might expect. The mean, for example, is a function of the means of the columns of $\X$ and the vector of regression coefficients $\boldbeta$. The terms of the variance-covariance matrix correspond to the random parts of the model: $\B_\boldSigma$ represents the contribution of the nonzero covariance of the columns of $\X$, $\B_\Q$ represents the contribution of the non-zero covariance between $\X$ and $\e$, and $\B_\R$ represents the contribution of the non-zero covariance of the random error $\e$.

The marginal distribution of $\Y$, however, is not useful for inference on $\boldbeta$ because it depends only on the unknown quantities $\boldbeta$, $\M$, $\boldSigma$, $\Q$, and $\R$. The data, $\X$, do not appear in the distribution at all!

One strategy to perform inference on $\boldbeta$ is to condition on the observed values of $\X$ and then restrict attention to cases where the resulting conditional distribution is tractable. We proceed as follows. Note that the conditional distribution of $\Y$ conditional on an observed draw of $\X$, $\given{\Y}{\X = \X_0}$,

has only one random component, the conditional distribution of the unobservable error. The rules for conditional normal distributions (via Wikipedia) and the joint distribution in Equation \ref{eq:jointdist} imply that $\given{\e}{\X = \X_0}$ is normally distributed with mean and variance
\begin{align*} \Egiven{\e}{ \Vecm{ \widetilde{\X} } = \Vecm{\widetilde{\X}_0} } & & = & & \Zero_{n \times 1} + \Q^\top \boldSigma^{-1} \left( \Vecm{ \widetilde{\X}_0} - \Vecm{\M} \right) \\ \Vargiven{\e}{ \Vecm{ \widetilde{\X} } = \Vecm{\widetilde{\X}_0} } & & = & & \R - \Q^\top \boldSigma^{-1} \Q \quad . \end{align*}
The mean and variance of $\given{\Y}{\X = \X_0}$ are then
\begin{align} \Egiven{\Y}{\X = \X_0} & & = & & \X_0 \boldbeta + \Q^\top \boldSigma^{-1} \left( \Vecm{ \widetilde{\X}_0} - \Vecm{\M} \right) \label{eq:condymean} \\ \Vargiven{\Y}{\X = \X_0} & & = & & \R - \Q^\top \boldSigma^{-1} \Q \nonumber \end{align}
where the unknown quantities are $\boldbeta$, $\M$, $\boldSigma$, $\Q$, and $\R$. In the special case where $\Q = \Zero$, i.e. all of the columns of $\X$ are uncorrelated with the error $\e$, the conditional distribution depends only on $\boldbeta$ and $\R$:
\begin{align} \given{\Y}{\X=\X_0, \Q=\Zero} & \sim N\left(\X_0 \boldbeta \quad , \quad \R \right) \label{eq:keyresult} \quad , \end{align}
which is the regression model of Equation \ref{eq:regmodel}.
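A scalar Monte Carlo check of the conditional-normal rules used here (a sketch; the scalar names mirror the post's $\Q$, $\boldSigma$, $\R$, and the particular numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

# Scalar version of the joint model: (x, e) bivariate normal with
# E[x] = m, Var(x) = Sigma, Cov(x, e) = Q, Var(e) = R, E[e] = 0
m, Sigma, Q, R = 2.0, 1.5, 0.8, 1.0

cov = np.array([[Sigma, Q],
                [Q,     R]])
draws = rng.multivariate_normal([m, 0.0], cov, size=N)
x, e = draws[:, 0], draws[:, 1]

# Conditional-normal rules:
#   E[e | x]   = Q * Sigma^{-1} * (x - m)
#   Var(e | x) = R - Q * Sigma^{-1} * Q
slope_theory = Q / Sigma
resid_var_theory = R - Q**2 / Sigma

# Empirical counterparts: regress e on (x - m)
slope_mc = np.cov(x, e)[0, 1] / np.var(x)
resid_var_mc = np.var(e - slope_mc * (x - m))
```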

That Equation \ref{eq:keyresult} is equivalent to Equation \ref{eq:regmodel} is somewhat shocking. We began with $\Y$ as a linear transformation of the random variables contained in $\X$ and ended up with a conditional model that is equivalent to a regression where $\X$ is composed of fixed numbers. The key condition, that $\Q = \Zero$, i.e. in econometric parlance that the columns of $\X$ are exogenous, allows us to use all of our regression theory. It is quite wonderful.

# A lesson for at least one statistician

Equations \ref{eq:condymean} and \ref{eq:keyresult} contain a lesson for statisticians: it can be shown that a regression estimate of a treatment effect from a designed experiment will be inconsistent if even one endogenous variable is included in the design matrix. This is not something that I have considered in the past when including covariates in the model to increase my power to detect a treatment effect. Clearly, it is something that I should consider in the future!
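A minimal simulation of that danger (a sketch; the 0.5 covariance between covariate and error is made up): when the included covariate is correlated with the error, the OLS slope converges to the wrong value.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
beta = np.array([1.0, 2.0])

# Endogenous covariate: x is correlated with the error u
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
xu = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x, u = xu[:, 0], xu[:, 1]

X = np.column_stack([np.ones(n), x])
y = X @ beta + u

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
# The slope converges to beta_1 + Cov(x,u)/Var(x) = 2.5, not 2.0
```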

**Greg Gandenberger:** Can you give an example of a model that contains an endogenous variable that might cause trouble if it is assumed to be exogenous?

**nmv (post author):** Consider an educational experiment conducted in a multicultural setting such that (i) the poor in the community are disproportionately immigrants with Limited English Proficiency (LEP), and (ii) the test used to measure the output of the experiment is far less accurate at measuring the true knowledge of a student when the student has LEP. This suggests that economic variables, such as Free/Reduced Price Lunch status, may be endogenous, i.e. poverty is correlated with LEP, which is in turn correlated with the error term.

**Greg Gandenberger:** That's helpful. Thanks!