# Stochastic regressors and exogeneity for statisticians

Since my Ph.D. training was split between the Statistics Department and the Heinz School of Public Policy, I can attest that statisticians and economists sometimes talk past each other. This blog post explains the econometric terms exogenous and endogenous in a framework that is perhaps more natural for a statistician. The post ends with a realization that I should pay more attention to exogeneity when including covariates to increase the power of an experiment.

$%some boilerplate stuff \newcommand{\Hm}{\mathbf{H}} \newcommand{\A}{\mathbf{A}} \newcommand{\LL}{\mathbf{L}} \newcommand{\BGamma}{\mathbf{\Gamma}} \newcommand{\I}{\mathbf{I}} \newcommand{\C}{\mathbf{C}} \newcommand{\Q}{\mathbf{Q}} \newcommand{\J}{\mathbf{J}} \newcommand{\K}{\mathbf{K}} \newcommand{\X}{\mathbf{X}} \newcommand{\Y}{\mathbf{Y}} \newcommand{\V}{\mathbf{V}} \newcommand{\M}{\mathbf{M}} \newcommand{\Z}{\mathbf{Z}} \newcommand{\U}{\mathbf{U}} \newcommand{\B}{\mathbf{B}} \newcommand{\ur}{\mathbf{u}} \newcommand{\e}{\mathbf{e}} \newcommand{\D}{\mathbf{D}} \newcommand{\R}{\mathbf{R}} \newcommand{\Sm}{\mathbf{S}} \newcommand{\T}{\mathbf{T}} \newcommand{\PP}{\mathbf{P}} \newcommand{\One}{\mathbf{1}} %\newcommand{\Zero}{{\mathbf{0}}} \newcommand{\Zero}{\pmb{0}} %\newcommand{\Zero}{0} % Palintio zero is too small %\newcommand{\ell}{\mathcal{l}} \newcommand{\boldbeta}{\boldsymbol\beta} \newcommand{\boldepsilon}{\boldsymbol\epsilon} \newcommand{\boldSigma}{\boldsymbol\Sigma} \newcommand{\boldchi}{\boldsymbol\chi} \newcommand{\boldmu}{\boldsymbol\mu} \newcommand{\bLambda}{\boldsymbol\Lambda} \newcommand{\Xminus}{(\X^\top\X)^{-1}\X^\top} \newcommand{\Xominus}{(\X_0^\top\X_0)^{-1}\X_0^\top} \newcommand{\XminusT}{\X(\X^\top\X)^{-1}} \newcommand{\XominusT}{\X_0(\X_0^\top\X_0)^{-1}} \newcommand{\XminusGLS}{(\X^\top\V_0^{-1}\X)^{-1}\X^\top\V_0^{-1}} \newcommand{\XminusGLST}{\V_0^{-1}\X(\X^\top\V_0^{-1}\X)^{-1}} % % \newcommand{\lt}{\ell^\top} \newcommand{\bhat}[1]{\widehat{\boldbeta}_{#1}} \newcommand{\varhat}[1]{\widehat{\sigma^2}_{#1}} \newcommand{\sehat}[1]{\widehat{\text{SE}}\left[#1\right]} %% %% \DeclareMathOperator{\Vecmop}{Vec} \newcommand{\Vecm}[1]{\Vecmop \left[#1\right]} \newcommand{\given}[2]{ #1 \left|\!\left\{#2\right\}\right.} \newcommand{\E}[1]{\mathbb{E}\left[#1\right]} \newcommand{\Egiven}[2]{\mathbb{E}\left[ #1 \left| #2\right.\right]} \newcommand{\Var}[1]{\text{Var}\left[#1\right]} \newcommand{\Cov}[1]{\text{Cov}\left[#1\right]} 
\newcommand{\Vargiven}[2]{\text{Var}\left[ \left. #1 \right| #2\right]}$

# A statistician's model

Consider the following regression model: \begin{aligned} \Y & = \X\boldbeta + \e & \e & \sim N(\Zero,\R) \label{eq:regmodel} \end{aligned} where $\X$ is a known design matrix composed of fixed numbers, $\boldbeta$ is a vector of unknown regression coefficients, and $\e$ is a vector of the residual errors with covariance matrix $\R$.

There is a vast literature devoted to studying this model. For example, the optimal method of estimation for this model varies depending on the specification of $\R$. If $\R = \sigma^2 \I$ where $\I$ is the identity matrix, then Ordinary Least Squares (OLS) is optimal for estimation and hypothesis testing. If $\R = \sigma^2 \V$ where $\V$ is positive definite and known then Generalized Least Squares (GLS) is optimal. If $\R$ is positive definite but unknown, then REstricted Maximum Likelihood (REML) has desirable properties.
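As a rough illustration of the OLS and GLS estimators mentioned above, here is a minimal NumPy sketch. The design matrix, the AR(1)-style choice of $\V$, and the helper names `ols` and `gls` are assumptions made for this example only; they are not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200
# Fixed (non-random) design matrix: intercept plus two fixed columns.
t = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), t, t ** 2])
beta = np.array([1.0, 2.0, -3.0])

# Known positive definite V; an AR(1)-style correlation is assumed here.
idx = np.arange(n)
V = 0.6 ** np.abs(idx[:, None] - idx[None, :])
sigma2 = 0.25

# Draw e ~ N(0, sigma2 * V) using the Cholesky factor of the covariance.
chol = np.linalg.cholesky(sigma2 * V)
y = X @ beta + chol @ rng.standard_normal(n)


def ols(X, y):
    """Ordinary least squares: (X'X)^{-1} X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def gls(X, y, V):
    """Generalized least squares: (X'V^{-1}X)^{-1} X'V^{-1}y, V symmetric."""
    Vinv_X = np.linalg.solve(V, X)
    return np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ y)


b_ols = ols(X, y)
b_gls = gls(X, y, V)
```

When $\V = \I$ the two estimators coincide, and GLS with a general $\V$ is the same as OLS applied to data pre-multiplied by the inverse Cholesky factor of $\V$.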

Note that since $\X$ is assumed to be composed of fixed numbers, it is not random. By definition, its covariance with any random variable is zero.

# An econometrician's model

On the advice of Brian Kovak and Seth Richards-Shubik, I use the econometrics textbook used by CMU (and many other places):

Wooldridge, J. M. (2011). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA, 2nd edition.

as representative of the views of econometricians.

Although Wooldridge uses the same notation as the model of Equation \ref{eq:regmodel}, his model is quite different: the columns of the $\X$ matrix are assumed to be draws from random variables. His $\X$ matrix is not fixed; it is random.

From the perspective of a statistician, Wooldridge is modeling $Y$ as a transformation of random variables: \begin{equation*} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + U \end{equation*} such that $Y, X_1, X_2, \dots X_K$ are all observable random variables, $U$ is an unobservable random variable, and $\beta_0, \beta_1, \dots \beta_K$ are constants that are fixed and unknown. We observe a random draw from this structural model \begin{equation*} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + u \end{equation*} such that $y$ is a draw from $Y$, $x_1$ is a draw from $X_1$, etc.

Wooldridge outlines a series of assumptions that allow an econometrician to ignore the stochastic nature of the $\X$ matrix and use the regression model of Equation \ref{eq:regmodel} instead of the true model. Wooldridge gives the key assumption a technical name: exogeneity. A variable is exogenous if it is uncorrelated with the error term $\e$. If all variables in the $\X$ matrix are exogenous, then Wooldridge argues that the regression model of Equation \ref{eq:regmodel} is optimal for point estimation and hypothesis testing -- even though the model is wrong!

# A statistician's example: Stochastic $\X$ and exogeneity

I, at least, had great trouble following Wooldridge's argument, so I created the following example to better understand his claims. I hope that it is useful to others.

Let $\X$ be partitioned into two parts: a column of ones and $p-1$ columns of multivariate random normal vectors $\X_i$: \begin{aligned} \X & = \begin{bmatrix} \One_n & \X_1 & \X_2 & \dots & \X_{p-1} \end{bmatrix} \\ & = \begin{bmatrix} \One_n & \widetilde{\X}_{n\times (p-1)} \end{bmatrix} \end{aligned} where the subscripts denote the dimensions of the matrices. The multivariate random normal vectors can be correlated with each other. To express these correlations compactly, the $\Vecm{\cdot}$ operator (see Henderson and Searle (1979) or Wikipedia for properties) is used to stack the columns of the $\X$ matrix on top of one another \begin{align*} \Vecm{\X} & = \begin{bmatrix} \One_n \\ \X_1 \\ \X_2 \\ \vdots \\ \X_{p-1} \end{bmatrix} \end{align*} so that the mean and variance can be expressed as \begin{align*} \E{\Vecm{\X}} & = \begin{bmatrix} \One_n \\ \M_1 \\ \M_2 \\ \vdots \\ \M_{p-1} \end{bmatrix} \\ \Var{\Vecm{\X}} & = \begin{bmatrix} \Zero_{n\times n} & \Zero \\ \Zero & \boldSigma_{n(p-1) \times n(p-1) } \end{bmatrix} \quad \end{align*} where the matrix of means $\M$ and covariance matrix $\boldSigma$ are assumed to be fixed but unknown. Note that although the intercept $\One_n$ is uncorrelated with the random portions of $\X$, the random portions of $\X$ are allowed to covary arbitrarily.
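For readers who prefer to see the stacking concretely, the following NumPy sketch implements the Vec operator as a column-major reshape. The helper name `vec`, the dimensions, and the special case at the end (rows of $\widetilde{\X}$ drawn independently with a common column covariance) are assumptions for illustration; the text above allows a fully general $\boldSigma$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 4, 3
X_tilde = rng.standard_normal((n, p - 1))   # random part of X
X = np.column_stack([np.ones(n), X_tilde])  # intercept column first


def vec(A):
    """Stack the columns of A on top of one another (column-major order)."""
    return A.reshape(-1, order="F")


vec_X = vec(X)  # length n * p: the ones, then column 1, then column 2, ...

# Assumed special case: if the n rows of X_tilde are independent draws with
# a common (p-1) x (p-1) column covariance, then Var[Vec(X_tilde)] is a
# Kronecker product with the identity.
Sigma_cols = np.array([[1.0, 0.3],
                       [0.3, 2.0]])
Sigma_iid = np.kron(Sigma_cols, np.eye(n))  # n(p-1) x n(p-1)
```

The first `n` entries of `vec_X` are the intercept column, which is why the corresponding block of the variance matrix is zero.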

Let the unobservable error $\e$ be a mean zero multivariate normal, which can be correlated with the random portions of $\X$. We parametrize the joint distribution as \begin{align} \begin{pmatrix} \Vecm{\X} \\ \e \end{pmatrix} & = \begin{pmatrix} \One_n \\ \Vecm{\widetilde{\X}} \\ \e \end{pmatrix} \sim N \left( \A, \B \right) \label{eq:jointdist} \\ \A & = \begin{bmatrix} \One_n \\ \Vecm{\M} \\ \Zero_{n \times 1} \end{bmatrix} \nonumber \\ \B & = \begin{bmatrix} \Zero_{n\times n} & \Zero & \Zero \\ \Zero & \boldSigma_{n(p-1) \times n(p-1) } & \Q \\ \Zero & \Q^\top & \R_{n \times n} \end{bmatrix} \nonumber \quad \end{align} Finally, let $\Y$ be a linear transformation of random variables $\X$ and $\e$ such that \begin{align} \Y & = \X \boldbeta + \e \label{eq:randommodel} \end{align} where $\boldbeta$ is a $p \times 1$ vector of regression coefficients.

With a little tedium and algebra, it can be shown that the distribution of $\Y$ from Equation \ref{eq:randommodel} is \begin{align} \Y & \sim N \left( \begin{bmatrix} \One_{n} & \M \end{bmatrix} \begin{bmatrix} \beta_0 \\ \widetilde{\boldbeta} \end{bmatrix} \quad , \quad \B_\boldSigma + \B_\Q + \B_\R \right) \label{eq:disty} \\ \B_\boldSigma & = \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \boldSigma \left( \widetilde{\boldbeta} \otimes \I_n \right) \nonumber \\ \B_\Q & = \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \Q + \left\{ \left( \widetilde{\boldbeta}^\top \otimes \I_n \right) \Q \right\}^\top \nonumber \\ \B_\R & = \R \nonumber \end{align} where $\widetilde{\boldbeta}$ corresponds to the regression coefficients for the random part of $\X$ and $\otimes$ is the direct product (see Henderson and Searle (1979) or Wikipedia for properties).
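The algebra behind the variance in Equation \ref{eq:disty} can be checked numerically. The sketch below builds an arbitrary valid joint covariance for $\Vecm{\widetilde{\X}}$ and $\e$, writes $\Y$ as a linear map of that stacked vector, and compares the resulting covariance with $\B_\boldSigma + \B_\Q + \B_\R$. All numerical values and variable names are assumptions chosen for the check. The intercept term is constant, so it contributes nothing to the variance and is omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 5, 3
m = n * (p - 1)  # length of Vec(X_tilde)

# Arbitrary positive definite joint covariance of (Vec(X_tilde), e).
G = rng.standard_normal((m + n, m + n))
joint = G @ G.T + np.eye(m + n)
Sigma, Q, R = joint[:m, :m], joint[:m, m:], joint[m:, m:]

beta_tilde = np.array([2.0, -1.0])

# K = (beta_tilde' kron I_n) maps Vec(X_tilde) to X_tilde @ beta_tilde.
K = np.kron(beta_tilde[None, :], np.eye(n))  # n x m

# Check the Kronecker identity on one arbitrary matrix.
X_tilde = rng.standard_normal((n, p - 1))
lhs = K @ X_tilde.reshape(-1, order="F")
rhs = X_tilde @ beta_tilde

# Random part of Y is [K, I_n] applied to (Vec(X_tilde), e).
T = np.hstack([K, np.eye(n)])
var_direct = T @ joint @ T.T

# The three pieces from the text.
B_Sigma = K @ Sigma @ K.T
B_Q = K @ Q + (K @ Q).T
B_R = R
var_formula = B_Sigma + B_Q + B_R
```

In this sketch `var_direct` and `var_formula` agree to floating-point precision, which is the content of the variance expression above.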

The marginal distribution of $\Y$ given in Equation \ref{eq:disty} is complex, but it has the components that one might expect. The mean, for example, is a function of the means of the columns of $\X$ and the vector of regression coefficients $\boldbeta$. The terms of the variance-covariance matrix correspond to the random parts of the model: $\B_\boldSigma$ represents the contribution of the nonzero covariance of the columns of $\X$, $\B_\Q$ represents the contribution of the nonzero covariance between $\X$ and $\e$, and $\B_\R$ represents the contribution of the nonzero covariance of the random error $\e$.

The marginal distribution of $\Y$, however, is not useful for inference on $\boldbeta$ because the marginal distribution of $\Y$ depends only on the unknown quantities $\M$, $\boldbeta$, $\boldSigma$, $\Q$ and $\R$. The data, $\X$, do not appear in the distribution at all!

One strategy to perform inference on $\boldbeta$ is to condition $\Y$ on the observed values of $\X$ and then restrict attention to cases where the resulting conditional distribution is tractable. We proceed as follows. Note that the distribution of $\Y$ conditional on an observed draw $\X_0$ of $\X$, \begin{equation*} \given{\Y}{\X = \X_0} = \X_0 \boldbeta + \given{\e}{\X = \X_0} \quad , \end{equation*} has only one random component, the conditional distribution of the unobservable error. The rules for conditional normal distributions (via Wikipedia) and the joint distribution in Equation \ref{eq:jointdist} imply that $\given{\e}{\X=\X_0}$ is normally distributed with mean and variance \begin{align*} \Egiven{\e}{ \Vecm{ \widetilde{\X} } = \Vecm{\widetilde{\X}_0} } & = \Zero_{n \times 1} + \Q^\top \boldSigma^{-1} \left( \Vecm{ \widetilde{\X}_0} - \Vecm{\M} \right) \\ \Vargiven{\e}{ \Vecm{ \widetilde{\X} } = \Vecm{\widetilde{\X}_0} } & = \R - \Q^\top \boldSigma^{-1} \Q \quad . \end{align*} The mean and variance of $\given{\Y}{\X = \X_0}$ are then \begin{align} \Egiven{\Y}{\X = \X_0} & = \X_0 \boldbeta + \Q^\top \boldSigma^{-1} \left( \Vecm{ \widetilde{\X}_0} - \Vecm{\M} \right) \label{eq:condymean} \\ \Vargiven{\Y}{\X = \X_0} & = \R - \Q^\top \boldSigma^{-1} \Q \nonumber \end{align} where the unknown quantities are $\boldbeta$, $\M$, $\Q$, $\boldSigma$ and $\R$. In the special case where $\Q = \Zero$, i.e. all of the columns of $\X$ are uncorrelated with the error $\e$, the conditional distribution depends only on $\boldbeta$ and $\R$ \begin{align} \given{\Y}{\X=\X_0, \Q=\Zero} & \sim N\left(\X_0 \boldbeta \quad , \quad \R \right) \label{eq:keyresult} \quad , \end{align} which is the regression model of Equation \ref{eq:regmodel}.
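A small simulation can make the role of $\Q$ in the conditional mean of Equation \ref{eq:condymean} concrete. The sketch below assumes the simplest special case: one random regressor, with each pair $(x_i, e_i)$ drawn independently from a bivariate normal with covariance `q`. Under that assumption the extra term in the conditional mean reduces to $(q / \sigma_x^2)(x_i - \mu)$, so the OLS slope should settle near $\beta_1 + q/\sigma_x^2$ rather than $\beta_1$. The function name and all parameter values are hypothetical.

```python
import numpy as np


def simulate_ols_slope(q, n=200_000, seed=0):
    """OLS slope when Cov(x_i, e_i) = q (assumed i.i.d. bivariate normal)."""
    rng = np.random.default_rng(seed)
    beta0, beta1 = 1.0, 2.0
    mu, var_x, var_e = 3.0, 1.5, 1.0

    # Joint covariance of (x_i, e_i); q = 0 is the exogenous case.
    cov = np.array([[var_x, q],
                    [q, var_e]])
    draws = rng.multivariate_normal([mu, 0.0], cov, size=n)
    x, e = draws[:, 0], draws[:, 1]

    y = beta0 + beta1 * x + e
    X = np.column_stack([np.ones(n), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[1]


slope_exog = simulate_ols_slope(q=0.0)   # should be close to beta1 = 2.0
slope_endog = simulate_ols_slope(q=0.6)  # should be close to 2.0 + 0.6 / 1.5
```

With these assumed values the endogenous slope lands near 2.4 instead of 2.0, and increasing `n` does not shrink the gap, which is what inconsistency means in practice.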

That Equation \ref{eq:keyresult} is equivalent to Equation \ref{eq:regmodel} is somewhat shocking. We began with $\Y$ as a linear transformation of random variables contained in $\X$ and ended up with a conditional model that is equivalent to a regression where $\X$ is composed of fixed numbers. The key condition, that $\Q = \Zero$, i.e. in econometric parlance the columns of $\X$ are exogenous, allows us to use all of our regression theory. It is quite wonderful.

# A lesson for at least one statistician

Equations \ref{eq:condymean} and \ref{eq:keyresult} contain a lesson for statisticians: it can be shown that a regression estimate of a treatment effect from a designed experiment can be inconsistent if even one endogenous variable is included in the design matrix. This is not something that I have considered in the past when including covariates in the model to increase my power to detect a treatment effect. Clearly, this is something that I should consider in the future!

## 3 thoughts on “Stochastic regressors and exogeneity for statisticians”

1. nmv Post author

Consider an educational experiment conducted in a multicultural setting such that (i) the poor in the community are disproportionately immigrants with Limited English Proficiency (LEP), and (ii) the test used to measure the output of the experiment is far less accurate at measuring the true knowledge of a student when the student has LEP. This suggests that economic variables, such as Free/Reduced Price lunch status, may be endogenous, i.e. poverty is correlated with LEP, which is in turn correlated with the error term.