\documentclass{article}
\input{../common}
\newcommand{\RidgeEstimator}{\tilde{\mathbf{\beta}}_{\lambda}}
\begin{document}
\title{Lecture 19: Interactions}
\author{36-401, Fall 2015, Section B}
\date{3 November 2015}
\maketitle
\tableofcontents

<< echo=FALSE >>=
library(knitr)
opts_chunk$set(size='small', background='white', cache=TRUE, autodep=TRUE)
options(show.signif.stars=FALSE)
@

\newpage

When we say that there are no interactions between $X_i$ and $X_j$, we mean that $\frac{\partial \Expect{Y|X=x}}{\partial x_i}$ is not a function of $x_j$. Said another way, there are no interactions if and only if
\[
\Expect{Y|X=x} = \alpha + \sum_{i=1}^{p}{f_i(x_i)}
\]
so that each coordinate of $X$ makes its own separate, additive contribution to $Y$. The standard multiple linear regression model of course includes no interactions between any of the predictor variables.

General considerations of probability theory, mathematical modeling, statistical theory, etc., give us no reason whatsoever to anticipate that interactions are rare, or that when they exist they are small. You might be so lucky as to not have any to deal with, but you should not {\em presume} you will be lucky.

\paragraph{Diagnosing the presence of interactions}
See Lecture 15 for some ideas about how to do this. One trick not mentioned there is to plot the residuals from an interaction-free model against the product of two predictors, e.g., against $X_1 X_2$. This, however, presumes a particular form for the interaction, gone over in the next section.

\section{The Conventional Form of Interactions in Linear Models}

The usual way of including interactions in a linear model is to add a product term, as, e.g.,
\begin{equation}
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon
\label{eqn:product-interaction}
\end{equation}
Once we add such a term, we estimate $\beta_3$ in exactly the same way we'd estimate any other coefficient.
\paragraph{Interpretation}
In the model of Eq.\ \ref{eqn:product-interaction}, it is no longer correct to interpret $\beta_1$ as $\Expect{Y|X_1=x_1+1, X_2=x_2} - \Expect{Y|X_1=x_1, X_2=x_2}$. That difference is, rather, $\beta_1 + \beta_3 x_2$. Similarly, $\beta_2$ is no longer the expected difference in $Y$ between two otherwise-identical cases where $X_2$ differs by 1. The fact that we can't give one answer to ``how much does the response change when we change this variable?'', that the correct answer to that question always involves the other variable, is what interaction {\em means}.

What we can say is that $\beta_1$ is the slope with regard to $X_1$ when $X_2=0$, and likewise $\beta_2$ is how much we expect $Y$ to change for a one-unit change in $X_2$ when $X_1=0$. $\beta_3$ is the rate at which the slope on $X_1$ changes as $X_2$ changes, and likewise the rate at which the slope on $X_2$ changes with $X_1$ (see Exercise \ref{exercise:equal-rates-of-change-in-slopes} for why it's both).

\paragraph{Diagnostics and inference}
Diagnostics for a product term go just as they would for any other term: the residuals should have the same distribution no matter what the value of $X_i X_j$ happens to be, and all the usual plots can be made using $X_i X_j$ as the predictor variable. Inference, too, works exactly the same way.

\paragraph{Terminology}
The coefficients which go with the linear terms, $\beta_1$ and $\beta_2$ above, are often called the ``main effects'', while $\beta_3$ would be an ``interaction effect''. I think this terminology is misleading in at least two ways. First, by talking about ``effects'' at all, it carries causal implications which are not usually warranted by a regression. Second, it implies that the linear terms, being ``main'', are bigger or more important than the interactions, and again there's usually no reason to think that. Why we don't use names like ``linear coefficients'' and ``product coefficients'', I couldn't say.
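In symbols, the interpretations above amount to differentiating the regression function of Eq.\ \ref{eqn:product-interaction}:
\[
\frac{\partial \Expect{Y|X=x}}{\partial x_1} = \beta_1 + \beta_3 x_2, \qquad
\frac{\partial \Expect{Y|X=x}}{\partial x_2} = \beta_2 + \beta_3 x_1, \qquad
\frac{\partial^2 \Expect{Y|X=x}}{\partial x_1 \partial x_2} = \beta_3
\]
Each slope is a function of the other variable, and the mixed second partial derivative is $\beta_3$ no matter which order we differentiate in.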
\paragraph{Products without linear terms considered dubious}
It is very rare to find models where there is a product term $X_i X_j$ without both the linear terms $X_i$ and $X_j$. If, say, the $X_i$ term were missing, it would mean that $Y$ was completely insensitive to $X_i$ when $X_j=0$, but only then. This is weird, and indeed flies in the face of one of the best justifications for using product interactions (\S \ref{sec:y-u-no-taylor-expand}). There's no intrinsic reason it couldn't happen, but you should expect models like that to receive additional scrutiny.

\subsection{Why Product Interactions?}
\label{sec:y-u-no-taylor-expand}

Most texts on linear regression do not even attempt to justify using interaction terms that look like $X_1 X_2$, as opposed to $\frac{X_1 X_2}{1+|X_1 X_2|}$, or $X_1 H(X_2 - c)$\marginpar{$H$ is the Heaviside step function, ${H(x) = \left\{ \begin{array}{cc} 1 & x \geq 0 \\ 0 & x < 0\end{array}\right.}$.}, etc., etc. Here is the best justification I can find.

Suppose that the real regression function $\Expect{Y|X=x} = \mu(x)$ is a smooth function of all the coordinates of $x$. Because it is smooth, we should be able to do a Taylor expansion around any particular point, say $x^*$:
\[
\mu(x) \approx \mu(x^*) + \sum_{i=1}^{p}{(x_i-x_i^*)\left.\frac{\partial \mu}{\partial x_i}\right|_{x=x^*}} + \frac{1}{2}\sum_{i=1}^{p}{\sum_{j=1}^{p}{(x_i-x_i^*)(x_j-x_j^*)\left.\frac{\partial^2 \mu}{\partial x_i \partial x_j}\right|_{x=x^*}}}
\]
The first term, $\mu(x^*)$, is a constant. The next sum will give us linear terms in all the $x_i$ (plus more constants). The double sum after that will give us terms for each product $x_i x_j$, plus all the squares $x_i^2$, plus more constants. Thus, if the true regression function is smooth, and we only see a small range of values for each predictor variable, using product terms is reasonable --- provided we also include quadratic terms for each variable. (See Lecture 16 on polynomial regression for how to do that.)
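To make the expansion concrete, take $p=2$ and expand around $x^* = (0,0)$. Collecting the constants and derivatives into coefficients gives
\[
\mu(x) \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \beta_4 x_1^2 + \beta_5 x_2^2
\]
where, e.g., $\beta_0 = \mu(x^*)$, $\beta_1 = \left.\frac{\partial \mu}{\partial x_1}\right|_{x=x^*}$, and $\beta_3 = \left.\frac{\partial^2 \mu}{\partial x_1 \partial x_2}\right|_{x=x^*}$ (the two cross terms in the double sum cancel the factor of $1/2$). Dropping the quadratic terms leaves exactly the systematic part of the product-interaction model of Eq.\ \ref{eqn:product-interaction}.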
\paragraph{Non-product interactions}
If you have a particular sort of non-product interaction term in mind, say $\frac{X_1 X_2}{1+|X_1 X_2|}$, there is no particular difficulty in estimating it: just form a new column of predictors with the appropriate values, and estimate a coefficient on it like any other. Interpretation may, however, become even trickier, and there is also the issue of deciding what sort of interaction to use. In 402, we will see ways of discovering reasonable interaction terms automatically, by two-dimensional smoothing.

\section{Interaction of Categorical and Numerical Variables}

If we multiply the indicator variable for a binary category, say $X_B$, with an ordinary numerical variable, say $X_1$, we get a different slope on $X_1$ for each category:
\begin{equation}
\label{eqn:varying-slope}
Y = \beta_0 + \beta_1 X_1 + \beta_{1B} X_B X_1 + \epsilon
\end{equation}
When $X_B=0$, the slope on $X_1$ is $\beta_1$, but when $X_B=1$, the slope on $X_1$ is $\beta_1 + \beta_{1B}$; the coefficient for the interaction is the {\em difference} in slopes between the two categories. This is just like the way the coefficients on categorical variables back in Lecture 16 (``adjusted effects'') were differences between the intercepts for the categories.

In fact, look closely at Eq.\ \ref{eqn:varying-slope}. It says that the categories share a common {\em intercept}, but their regression lines are not parallel (unless $\beta_{1B} = 0$). We could expand the model by letting each category have its own slope and its own intercept:
\[
Y = \beta_0 + \beta_B X_B + \beta_1 X_1 + \beta_{1B} X_B X_1 + \epsilon
\]
This model, where ``everything is interacted with the category'', is {\em very} close to just running two separate regressions, one per category. It does, however, insist on having a single noise variance $\sigma^2$, which separate regressions would not impose.
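Writing out the two regression lines implied by the full model makes the roles of the coefficients explicit:
\[
\Expect{Y|X_1=x_1, X_B=0} = \beta_0 + \beta_1 x_1, \qquad
\Expect{Y|X_1=x_1, X_B=1} = (\beta_0 + \beta_B) + (\beta_1 + \beta_{1B}) x_1
\]
so $\beta_B$ is the difference in intercepts between the two categories, and $\beta_{1B}$ is the difference in slopes.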
It also lets you form confidence intervals for $\beta_B$ and $\beta_{1B}$; if one or the other of these is tightly concentrated around 0, you might consider dropping that term and re-estimating\footnote{You {\em could} get the same effect with two separate regressions, by getting a confidence interval for the difference between the two estimates of the slope or the two estimates of the intercept, and the answer would come to the same as what you'd get from the joint regression with full interactions.}. Also, if there were additional predictors in the model which were not interacted with the category, e.g.,
\[
Y = \beta_0 + \beta_B X_B + \beta_1 X_1 + \beta_{1B} X_B X_1 + \beta_2 X_2 + \epsilon
\]
then this would definitely not be the same as running two separate regressions.

As with linear terms for categorical variables (``adjusted effects''), everything works much the same for variables with more than two levels: we add one indicator variable for each level of the category except one (the reference or baseline level), we interact the indicators with the other predictor or predictors of interest, and the coefficients on the interactions are differences in slopes.

\subsection{Interactions of Categorical Variables with Each Other}

Nothing stops the variable you interact a categorical variable with from being another categorical variable. When that happens, you get terms which only apply to individuals which belong to {\em both} categories, e.g., to plumbers in Ohio.

\paragraph{Categorical interactions vs.\ group or conditional means}
Suppose we have two binary categorical variables, with corresponding indicator variables $X_B$ and $X_C$.
If we fit a model of the form
\[
Y = \beta_0 + \beta_1 X_B + \beta_2 X_C + \beta_3 X_B X_C + \epsilon
\]
then we can make the following identifications:
\begin{eqnarray}
\label{eqn:betas-from-group-means-first}
\Expect{Y|X_B=0, X_C=0} & = & \beta_0\\
\Expect{Y|X_B=1, X_C=0} & = & \beta_0 + \beta_1\\
\Expect{Y|X_B=0, X_C=1} & = & \beta_0 + \beta_2\\
\label{eqn:betas-from-group-means-last}
\Expect{Y|X_B=1, X_C=1} & = & \beta_0 + \beta_1 + \beta_2 + \beta_3
\end{eqnarray}
Conversely, these give us four equations in four unknowns, so if we know the group or conditional means on the left-hand sides, we could solve these equations for the $\beta$s (Exercise \ref{exercise:betas-from-group-means}).

Notice that if our only predictor variables were these two categorical variables, we'd have one parameter for each distinct value of $X$ --- the model is {\bf saturated} --- and we'd have very little ability to tell that the model was wrong, regardless of how big $n$ might be. One way we might check it would be to look at the distribution of residuals for each distinct group --- by assumption they should all be the same. Of course, if we have additional predictor variables, we can check the residuals against them.

\section{Higher-Order Interactions}

Nothing stops us from considering interactions among three or more variables, rather than just two. Again, the conventional form for this is a product, $X_i X_j X_k$. Again, the best justification for this I've ever seen is a higher-order Taylor expansion, which suggests using terms like $X_i^2 X_j$ and $X_i^3$ as well. Again, there is nothing special about diagnostics or inference for higher-order interaction terms. Trying to describe their interpretation in words gets extra tricky, however.

\section{Product Interactions in R}

The \texttt{lm} function is set up to comprehend multiplicative or product interactions in model formulas.
Pure product interactions are denoted by \texttt{:}, so the formula
<<eval=FALSE>>=
lm(y ~ x1:x2)
@
corresponds to the model $Y = \beta_0 + \beta X_1 X_2 + \epsilon$. (Intercepts are included by default in R.) Since it is relatively rare to include just a product term without linear terms, it's more common to use the symbol \verb_*_, which expands out to both sets of terms. That is,
<<eval=FALSE>>=
lm(y ~ x1*x2)
@
is equivalent to
<<eval=FALSE>>=
lm(y ~ x1+x2+x1:x2)
@
and both estimate the model $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon$. This special sense of \verb_*_ in formulas over-rides its ordinary sense of multiplication; if you wanted to specify a regression on, say, $1000 X_2$, you'd have to write \verb_I(1000*x2)_ rather than \verb_1000*x2_. Also notice that R thinks, not unreasonably, that \verb_x1:x1_ is just the same as \verb_x1_; if you want higher powers of a variable, use \verb_I(x1^2)_ or \texttt{poly(x1,2)}.

The \verb_:_ operator will apply to combinations of variables. Thus
<<eval=FALSE>>=
(x1+x2):(x3+x4)
@
is equivalent to
<<eval=FALSE>>=
x1:x3 + x1:x4 + x2:x3 + x2:x4
@
Similarly for \verb_*_. This
<<eval=FALSE>>=
(x1+x2)*(x3+x4)
@
expands out to this:
<<eval=FALSE>>=
x1 + x2 + x3 + x4 + x1:x3 + x1:x4 + x2:x3 + x2:x4
@
The reason you can't just write \verb_x1^2_ in your model formula is that the power operator {\em also} has a special meaning in formulas, of repeatedly \verb_*_-ing its argument with itself. That is, this
<<eval=FALSE>>=
(x1+x2+x3)^2
@
is equivalent to
<<eval=FALSE>>=
(x1+x2+x3)*(x1+x2+x3)
@
which in turn is equivalent to
<<eval=FALSE>>=
x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
@
(Remember that \verb_x1:x1_ is just \verb_x1_.)

I find these operators in formulas most useful when I want to interact lots of variables with a category:
<<eval=FALSE>>=
lm(y ~ (x1+x2+x3+x5)*xcat + x4)
@
is a lot more compact than writing everything out, as
<<eval=FALSE>>=
lm(y ~ xcat + x1 + x2 + x3 + x5 + x1:xcat + x2:xcat + x3:xcat +
     x5:xcat + x4)
@
and it's also something I'm a lot less likely to get wrong.
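If you are ever unsure how R will expand a formula, you can check by inspecting the design matrix that \texttt{model.matrix} builds from it. Here is a small sketch on made-up columns (the names \texttt{x1}, \texttt{x2}, \texttt{x3} are just for illustration, not from our data):
<<>>=
# Toy data frame, used only to see how a formula expands
df <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=rnorm(10))
# Column names should be: the intercept, x1, x2, x3, x1:x3, x2:x3
colnames(model.matrix(~ (x1+x2)*x3, data=df))
@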
Even writing out the whole formula term by term would be a lot less work, and lead to many fewer errors, than creating all the interacted columns by hand.

\paragraph{\texttt{poly} and interactions}
If you want to use \texttt{poly} to do polynomial regression, as in Lecture 16, {\em and} you want interactions, you can do it:
<<eval=FALSE>>=
lm(y ~ poly(x1, x2, degree=2))
@
This creates linear terms for both variables (which it gives names ending in \texttt{1.0} and \texttt{0.1}), quadratic terms for both variables (names ending in \texttt{2.0} and \texttt{0.2}), and their product term (whose name ends in \texttt{1.1}). We have to explicitly name the \texttt{degree} argument; otherwise, \texttt{poly} doesn't know when we've stopped giving it columns we want to interact. If we set \texttt{degree} higher than 2, we'll get interactions between powers of the variables, and if we gave it $k > 2$ variables, we'd get all possible $2, 3, \ldots, k$-way interactions.

\subsection{Economic Mobility vs.\ Commuting, Again}
\label{sec:mobility-vs-commuting}

Let's continue with the data from the first DAP.
<<>>=
mobility <- read.csv("http://www.stat.cmu.edu/~cshalizi/mreg/15/dap/1/mobility.csv")
@
As in Lecture 16, on categorical variables, we'll introduce a new binary category, indicating whether each state was or was not a part of the Confederacy in the Civil War. (See that lecture for detailed comments.)
<<>>=
# The states of the Confederacy
Confederacy <- c("AR", "AL", "FL", "GA", "LA", "MS", "NC", "SC", "TN",
                 "TX", "VA")
mobility$Dixie <- mobility$State %in% Confederacy
@
In that lecture, we allowed this new indicator variable to change the intercept; you will recall that that term was negative and highly significant.
Here, we'll let being in the South affect the slope on \texttt{Commute} as well; that is, we introduce an interaction between \texttt{Commute} and \texttt{Dixie}:
<<>>=
mob.dixie <- lm(Mobility ~ Commute*Dixie, data=mobility)
signif(coefficients(summary(mob.dixie)), 3)
@
(See also Exercise \ref{exercise:star-vs-colon}.)

The coefficient for the interaction is negative, suggesting that increasing the fraction of workers with short commutes predicts a smaller difference in rates of mobility in the South than it does in the rest of the country. This coefficient is not significantly different from zero, but, more importantly, we can be confident it is small, compared to the baseline value of the slope on \texttt{Commute}:
<<>>=
signif(confint(mob.dixie), 3)
@
Thus, even if the South does have a different slope than the rest of the country, it is not a very different slope.

The difference in the intercept, however, is more substantial. It, too, is not significant at the 5\% level, but that is because (as we see from the confidence interval) it might be quite large and negative ($\Sexpr{signif(100*coefficients(mob.dixie)["DixieTRUE"],1)}$ percentage points, when the mean is about \Sexpr{signif(100*mean(mobility[,"Mobility"]),2)}\% and the largest value is \Sexpr{signif(100*max(mobility[,"Mobility"]),2)}\%), or perhaps just barely positive --- it's not so precisely measured, but it's either lowering the expected rate of mobility or adding to it trivially.

Of course, we should really do all our diagnostics here before paying much attention to these inferential statistics, but I offer this by way of illustration of the functions. As a further illustration, see Exercise \ref{exercise:mapping}.

\section{Exercises}

To think through or to practice on, not to hand in.

\begin{enumerate}
\item \label{exercise:equal-rates-of-change-in-slopes} Consider an apparently different, and perhaps more-interpretable, model than Eq.
\ref{eqn:product-interaction}, namely
\[
Y = \alpha_0 + (\alpha_1 + \alpha_2 X_2) X_1 + (\alpha_3 + \alpha_4 X_1) X_2 + \epsilon
\]
Show that this can always be re-written in the same form as Eq.\ \ref{eqn:product-interaction}, and express the latter's $\beta_0, \beta_1, \beta_2, \beta_3$ in terms of the $\alpha$s of this model. Can models of the form of Eq.\ \ref{eqn:product-interaction} always be re-written in this form? If so, express the $\alpha$ parameters in terms of the $\beta$s; if not, give a counter-example.
\item \label{exercise:betas-from-group-means} Solve Eqs.\ \ref{eqn:betas-from-group-means-first}--\ref{eqn:betas-from-group-means-last} for the $\beta$s.
\item \label{exercise:star-vs-colon} Check that we get the same set of terms, with the same coefficients, as in \S \ref{sec:mobility-vs-commuting}, if we fit our model with
<<eval=FALSE>>=
lm(Mobility ~ Commute+Dixie+Commute:Dixie, data=mobility)
@
Why does this happen?
\item \label{exercise:mapping} Using the mobility data, regress \texttt{Mobility} on
\begin{enumerate}
\item Latitude and longitude (only);
\item Latitude, longitude, and their product (only);
\item Latitude, longitude, their product, and their squares (only).
\end{enumerate}
For each model, make maps\footnote{See the hint on the DAP 1 assignment for help with making such maps.} of the fitted values and the residuals. Describe the resulting geographic patterns, and compare them (qualitatively) to a map of the actual values of \texttt{Mobility}. Can you explain why the maps of fitted values look like they do, based on the terms included in the model?
\end{enumerate}

\end{document}