DATA PROBLEMS: Multicollinearity

Multicollinearity is one of the most widely taught of the pathological diseases of
econometrics. It is also one of the most frequently misunderstood. I believe this is
because on the surface it is conceptually a very simple idea. When we look a bit further,
symptoms that we ascribe to multicollinearity may be the result of something else, and
the usual diagnostics may mislead us. When you are done reading this section of the
notes, go to "Explorations in Multicollinearity."

We begin with the usual model y = xb + u. y and u are
nx1, x is nxk, and b is kx1. The error term is well behaved.
There is no correlation between the independent variables and the error term. If

l_{1}x_{1i} + l_{2}x_{2i} + ... + l_{k}x_{ki} = v_{i}

where v_{i} is a "stochastic" term, with finite mean and variance, and not all the
l_{j} are zero, then we have a problem of multicollinearity.
That is, there is some linear dependence, albeit not exact, between the columns of the
design matrix. This definition is already problematic. The classical assumption of
regression analysis is that the columns of the design matrix are linearly independent:
they span a vector space of dimension k, and the matrix x'x has k non-zero roots. The model
and the experiment are designed so that the independent variables have separate and
independent effects on the dependent variable. With economic data the experiment usually
cannot be reproduced or redesigned, in spite of the common assumption that x
is fixed in repeated samples. The data is not always well behaved.

What if the linear dependence, while not exact, is close? What if one of the roots is very
small and another very large? What if one of the explanatory variables doesn't have much
variation in it? In the sense of linear algebra the matrix x will still span a vector
space of dimension k. That is, we can estimate the parameter vector b,
but not very precisely.

Our problem is a sample phenomenon. The sample may not be rich enough to allow for all the
effects we believe to exist.
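The sample nature of the problem can be sketched with a small simulation. This is an illustrative example, not the data from these notes: the same two-coefficient model is estimated once with a nearly dependent design and once with a well behaved one, and only the standard errors change.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 1.0
b = np.array([1.0, 2.0])  # true coefficients, chosen arbitrarily

x1 = rng.normal(size=n)
# Near-collinear design: x2 is x1 plus a small stochastic term,
# so the columns of x are almost linearly dependent.
x2_bad = x1 + 0.05 * rng.normal(size=n)
# A well behaved design for comparison: x2 drawn independently of x1.
x2_good = rng.normal(size=n)

def ols_se(x1, x2):
    """OLS standard errors for y = b1*x1 + b2*x2 + u."""
    x = np.column_stack([x1, x2])
    y = x @ b + sigma * rng.normal(size=n)
    xtx_inv = np.linalg.inv(x.T @ x)
    bhat = xtx_inv @ x.T @ y
    resid = y - x @ bhat
    s2 = resid @ resid / (n - 2)          # estimate of the error variance
    return np.sqrt(s2 * np.diag(xtx_inv))  # standard errors

print("near-collinear SEs:", ols_se(x1, x2_bad))
print("well-behaved   SEs:", ols_se(x1, x2_good))
```

Both designs span a space of dimension two, so b is estimable in each case; only the precision differs.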

**Classic Multicollinearity and Imprecision of OLS**

Consider the two variable model y_{i} = b_{1}x_{1i} + b_{2}x_{2i} + u_{i}, in which
x_{2i} = lx_{1i} + v_{i} and the v_{i} are small. Then we can show

Var(b_{1}) = s^{2} / ( (1 - r_{12}^{2}) Σ_{i} x_{1i}^{2} )

where r_{12} is the sample correlation between x_{1} and x_{2}. As r_{12} approaches
one, the variance grows without bound.

Clearly, the classic case of multicollinearity affects the precision of our estimator.
Of course, in the event that x_{1} and x_{2} have an exact linear
relationship the situation is even worse. The best that we could hope to do would be to
estimate the single regression coefficient in y_{i} = ax_{1i} + u_{i}, where
a = b_{1} + lb_{2}.

**Example**
The model is

y_{t} = b_{1} + b_{2}x_{2t} + b_{3}x_{3t} + b_{4}x_{4t} + u_{t}

The results, with standard errors in parentheses, are

y_{t} = -5.92 + 2.1 x_{2t} + .133 x_{3t} + .55 x_{4t} + e_{t}
        (1.27)  (.2)         (.006)        (.11)

All t statistics are significant.

Suppose we include an additional variable: consumption.

y_{t} = -8.79 + 2.1 x_{2t} - .021 x_{3t} + .559 x_{4t} + .235 x_{5t} + e_{t}
        (1.38)  (.2)         (.051)        (.087)        (.077)

The inclusion of x_{5} has sopped up all the variation in y that had been
previously explained by x_{3}. The consumption variable is almost
indistinguishable from GDP. The correlation between them is .99.
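The collapse of the coefficient on x_{3} can be reproduced with simulated data. The series below are artificial stand-ins, not the GDP and consumption series used above: x_{5} is built as x_{3} plus a tiny disturbance, so their correlation is near .99.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)              # plays the role of GDP in the notes
x5 = x3 + 0.01 * rng.normal(size=n)  # "consumption": almost identical to x3
y = 1.0 + 2.0 * x2 + 0.5 * x3 + rng.normal(size=n)

def fit(*cols):
    """OLS with an intercept; returns coefficients and standard errors."""
    x = np.column_stack([np.ones(n), *cols])
    bhat, *_ = np.linalg.lstsq(x, y, rcond=None)
    resid = y - x @ bhat
    s2 = resid @ resid / (n - x.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(x.T @ x)))
    return bhat, se

print("corr(x3, x5) =", np.corrcoef(x3, x5)[0, 1])
b_short, se_short = fit(x2, x3)        # index 2 is the x3 coefficient
b_long, se_long = fit(x2, x3, x5)
print("without x5: b3 = %.3f (se %.3f)" % (b_short[2], se_short[2]))
print("with    x5: b3 = %.3f (se %.3f)" % (b_long[2], se_long[2]))
```

Adding the near-duplicate regressor inflates the standard error on x_{3} by roughly two orders of magnitude.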

**Multicollinearity, Characteristic Vectors and Roots**

Suppose y = xb + u. We can find the characteristic vectors
of x'x. Call them A, so A'x'xA = L, where L is a diagonal matrix of the characteristic roots of x'x. Since A is
a collection of orthogonal vectors we can write (x'x)^{-1} = AL^{-1}A'.

Now write A = [ a_{1} | a_{2} | ... | a_{k} ], where a_{j} is the characteristic vector in the j^{th} column.

Looking at a particular quadratic form in one of these vectors, we have
a_{1}'(x'x)^{-1}a_{1} = 1/l_{1}, where
a_{1} is the characteristic vector in the first column of A. So
a_{1} may be thought of as picking off the variation along the first of the new axes.
For a particular coefficient in the regression model

Var(b_{j}) = s^{2}( a_{j1}^{2}/l_{1} + a_{j2}^{2}/l_{2} + ... + a_{jk}^{2}/l_{k} )

If one of the a_{jl}^{2} is quite large relative to its characteristic
root then we have a problem. Or, if one of the l_{l} is
quite small then we have a problem.
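This decomposition can be verified numerically. A minimal sketch, using an arbitrary ill conditioned design of my own construction: the diagonal of (x'x)^{-1} matches the sum of squared characteristic-vector elements over their roots.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(30, 3))
x[:, 2] = x[:, 1] + 0.1 * rng.normal(size=30)  # make the design ill conditioned

xtx = x.T @ x
roots, A = np.linalg.eigh(xtx)  # characteristic roots l_l and vectors (columns of A)

# Var(b_j)/s^2 two ways: directly as the diagonal of (x'x)^-1, and as
# sum over l of a_{jl}^2 / l_l, from (x'x)^-1 = A L^-1 A'.
direct = np.diag(np.linalg.inv(xtx))
via_roots = (A**2 / roots).sum(axis=1)
print(direct)
print(via_roots)  # the two vectors agree
```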

Using characteristic roots and vectors there is some simple geometry to the problem.

Assume that the number of observations is n=4 and the number of unknown location
parameters is k=2. The scatter of observations on the independent variables is shown in
figure 1. Each * corresponds to the plot of an observation on x_{1} and x_{2}.

**Figure 1**

where e_{1} and e_{2} are the basis vectors for the space spanned by
the two 4x1 vectors of observations on the independent variables. The characteristic
vectors of x'x form a new basis which has the following relation to the original (see
figure 2).

**Figure 2**

For this two variable model we have

A = [ a_{1} | a_{2} ] = | a_{11}  a_{12} |
                        | a_{21}  a_{22} |

where d is the angle between a_{1} and the first original axis and y is the angle
between a_{1} and the second original axis. Recall that the cosine of an angle in a
right triangle is adjacent over hypotenuse. So cos(d) = a_{11} is large
relative to a_{12}, and cos(y) = a_{21} is small
relative to a_{22}.

Now A'x'xA = L and x'xa_{i} = l_{i}a_{i},
or a_{i}'x'xa_{i} = l_{i}. Making a
simple transformation, z_{i} = xa_{i}, we can rewrite this as z_{i}'z_{i} = l_{i}. So l_{i} can be
thought of as the variation in the data along the i^{th} axis, since z_{i}'z_{i}
is the explained sum of squares of the projection of the n observations onto a_{i}.
In our diagram the data along a_{1} is quite spread out, but is quite compressed
along a_{2}, so l_{1} >> l_{2}. Therefore, numerator and denominator in
a_{11}^{2}/l_{1} are of the same order of magnitude. However, a_{22}^{2}
is large and l_{2} is small, so the large size of a_{22}^{2}/l_{2} implies that b_{2}
is measured with less precision.

**Example:**

We can give some numerical flesh to the exposition. In our example we are given the
following design matrix and observations on the dependent variable:

Working in the real world of empirical analysis this would be all you would know about the
data generating process. Since this is an experiment designed to show you the effects of
multicollinearity, the following information is also provided

This should enable you to compute the four realizations of the disturbance vector. Can you
do it?

Given the data, we can calculate the least squares coefficient estimates. Next, calculate an estimate of the error variance.

Now we have enough information to compute the variance-covariance matrix for the
coefficient vector and the observed t-statistics.

The critical t for a two-tailed test at the 10% level with two degrees of freedom is 2.92.
No coefficient is statistically different from zero, in spite of what we know to be the
truth.

Recall that the characteristic roots of a matrix of full rank are all nonzero. The roots
for the matrix x'x are 45.642 and 1.358. While neither is zero, the larger is 33.6 times
larger than the smaller. On the basis of our earlier observations about the variance of
the estimator, this is potentially a problem. The simple correlation between the two
independent variables is .942. In figure 3 the characteristic vectors corresponding
to the two roots and the four observations on the independent variables are plotted.
The vectors are the short,
perpendicular lines at about 45^{o} to the usual axes. These vectors form the
basis for the same vector space. Notice, however, that in the a_{1} dimension the
data on the independent variables is not very spread out. In the a_{2} dimension
it is quite variable. The figure also shows the regression line of x_{1} on x_{2};
the dashed line.

**Figure 3**

As shown above, we can calculate the coefficient variances from the characteristic
roots and vectors.

We can also use the matrix of characteristic vectors, A = [ a_{1} | a_{2} ],
to transform the original data onto the new basis of the vector space. There are four
data points, with two coordinates each. This projection is given by z = xA.
If we square each term then add across the rows we get an explained sum of squares for
each of the two dimensions

Notice that these sums of squares are the characteristic roots. If the data are spread out
along one of the characteristic vectors then we get a large root. If the data does not
show a lot of variation along one of the vectors then we get a small root.
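The same arithmetic can be sketched with artificial numbers (the design matrix of the example is not reproduced here, so the scatter below is my own): squaring the projected coordinates and summing down each column recovers the characteristic roots.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 2))
x[:, 1] = x[:, 0] + 0.2 * rng.normal(size=4)  # elongated scatter, as in the figures

roots, A = np.linalg.eigh(x.T @ x)  # characteristic roots and vectors of x'x
z = x @ A                           # project the 4 observations onto the new basis
print((z**2).sum(axis=0))           # explained sum of squares in each dimension ...
print(roots)                        # ... reproduces the characteristic roots
```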

Returning to the coefficient variances in terms of roots and vectors, we see that a large
estimated variance of the coefficient estimator can result from an unfortunately large
error variance. There is nothing one can do about this. There could also be a lack of
variation in the data along one of the characteristic vectors. This is often characterized
by data having an elliptical shape in the plane of the independent variables, rather than
being scattered in a spherical or cube shape. The large estimate of the coefficient
variance might also be due to unfortunate values for the elements of the characteristic
vectors. The sizes of these elements are related to the rotation of the axes of the
characteristic vectors relative to the original axes. The rotation is large when the data
has a lot of variation in one variable, but not the other, or when the independent
variables are highly correlated.

Let us repeat the example with a 'good' design matrix. This design matrix is good in the
sense that the simple correlation between the independent variables is zero. The observed
data is now

The true coefficient vector is unchanged. There has been a new draw from the N(0,2)
distribution for the error vector. Applying least squares we obtain the following results

Neither coefficient is different from zero. Since the correlation between RHS variables is
zero, the conventional researcher might conclude that multicollinearity is not the source
of his/her bad results. On the other hand, while neither of the two variables is
significant, they do explain 29% of the variation in the dependent variable. This is
usually taken as evidence that there is collinearity.

The eigenvalues for the matrix x'x are 6.25 and 10.75. On the basis of the condition
number there is no reason to suspect that collinearity is the problem. To sort out all of
this, Figure 3 is repeated for the new design matrix. In figure 4 we again show the
regression of x_{1} on x_{2}, the characteristic vectors, and the data.

**Figure 4**

It would seem that the bad results are a result of the unfortunately large estimate of
the error variance, which inflates the estimates of the variances of the coefficients.

Now repeat the exercise again with another 'bad' design matrix and a new set of observed values of the dependent variable.

The data are plotted in figure 5 as the small boxes. The figure suggests that there is little or no relation between the two independent variables; the regression line is the dashed line. The simple correlation is .70, rather high. The characteristic vectors of x'x lie very nearly along the original basis vectors.

The model results are

One coefficient is now different from zero. Why isn't the other also different from
zero? The characteristic roots are 49.156 and 4.856. The largest is about ten times the
size of the other. This is not a small condition number, although the graph suggests that
collinearity is not a problem. The problem here is that the data is rather spread out
along one axis, but not the other. This ill conditioned design matrix produces a set of
results which one might ascribe to multicollinearity.

Finally, we turn to the received doctrine about multicollinearity.

**Consequences**

1. Even in the presence of multicollinearity, OLS is BLUE and consistent.

2. Standard errors of the estimates tend to be large.

3. Large standard errors mean large confidence intervals.

4. Large standard errors mean small observed test statistics. The researcher will accept too many null hypotheses. The probability of a type II error is large.

5. Estimates of standard errors and parameters tend to be sensitive to changes in the data and the specification of the model.

**Detection**

1. The classic symptom is a high R^{2} accompanied by individually insignificant t statistics.

2. High simple correlation coefficients are sufficient but not necessary for multicollinearity.

3. Farrar-Glauber suggest regressing groups of variables on a culprit. If there is collinearity then the resulting F statistic will be large.

4. One can compute the condition number. That is, the ratio of the largest to the smallest root of the matrix x'x. This may not always be useful as the standard errors of the estimates depend on the ratios of elements of the characteristic vectors to the roots.

5. Leamer suggests using the magnification factor

1/(1 - R_{k.}^{2})

where R_{k.}^{2} is the coefficient of determination from the
regression of one of the explanatory variables on all of the others.
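Both the condition number and the magnification factor are easy to compute. A sketch with an artificial design in which x_{1} and x_{2} are nearly dependent and x_{3} is unrelated:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.2 * rng.normal(size=n)   # nearly dependent on x1
x3 = rng.normal(size=n)              # unrelated to the others
x = np.column_stack([x1, x2, x3])

# Condition number: ratio of the largest to the smallest root of x'x.
roots = np.linalg.eigvalsh(x.T @ x)
print("condition number:", roots.max() / roots.min())

# Leamer's magnification factor 1/(1 - R_k^2), where R_k^2 comes from
# regressing column k on all the other columns (no intercept here,
# so an uncentered R^2 is used).
mfs = []
for k in range(x.shape[1]):
    others = np.delete(x, k, axis=1)
    coef, *_ = np.linalg.lstsq(others, x[:, k], rcond=None)
    resid = x[:, k] - others @ coef
    r2 = 1 - resid @ resid / (x[:, k] @ x[:, k])
    mfs.append(1 / (1 - r2))
print("magnification factors:", mfs)
```

The factors for x_{1} and x_{2} are large, while that for x_{3} is near one.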

**Remediation**

1. Use prior information or restrictions on the coefficients. One clever way to do
this was developed by Theil and Goldberger. See, for example, Theil, Principles of
Econometrics, Wiley, 1971, pp. 347-352.

2. Use additional data sources. This does not mean more of the same. It means pooling cross section and time series.

3. Transform the data. For example, inversion or differencing.

4. Use a principal components estimator. This involves using a weighted average of the regressors, rather than all of the regressors. The classic in this application is George Pidot, "A Principal Components Analysis of the Determinants of Local Government Fiscal Patterns", Review of Economics and Statistics, Vol. 51, 1969, pp. 176-188.

5. Another alternative regression technique is ridge regression. This involves putting extra weight on the main diagonal of x'x so that it produces more precise estimates. This is a biased estimator.

6. Some writers encourage dropping troublesome RHS variables. This invites the problem of specification error.
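The ridge idea in item 5 can be sketched in a few lines (a generic implementation, not any particular author's estimator): with k = 0 it reproduces OLS, and k > 0 shrinks the estimates at the cost of bias.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # badly collinear pair
x = np.column_stack([x1, x2])
y = x @ np.array([1.0, 2.0]) + rng.normal(size=n)

def ridge(x, y, k):
    """Ridge estimator: add extra weight k to the diagonal of x'x before solving."""
    p = x.shape[1]
    return np.linalg.solve(x.T @ x + k * np.eye(p), x.T @ y)

print("OLS  (k=0):", ridge(x, y, 0.0))
print("ridge k=1 :", ridge(x, y, 1.0))
```

The k = 0 case coincides with least squares, while larger k trades bias for a shorter, more stable coefficient vector.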

Now that you've finished reading the notes, go to "Explorations
in Multicollinearity."