Answers:

If the correlation matrix is close to singular (i.e., your variables are highly correlated), you can try a ridge regression approach, which will give you robust estimates.

The only question is how to choose the regularization parameter. I would suggest trying a range of values, but this is not a trivial problem.

Hope this helps!
You can use the lm.ridge routine in the MASS package. You pass it a range of λ values, e.g., with a call like foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1)). You get back the generalized cross-validation statistics in foo$GCV, which you can plot against λ via plot(foo$GCV ~ foo$lambda): choose the λ at the minimum.
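For illustration, here is a minimal, self-contained sketch of that workflow on simulated data (the variables x1, x2, and y and the λ grid are assumptions for the example, not taken from the question):

    library(MASS)

    # Simulated data with two highly correlated predictors (for illustration only)
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- 0.95 * x1 + 0.05 * rnorm(n)
    y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

    # Fit ridge regression over a grid of lambda values
    foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))

    # Generalized cross-validation statistic plotted against lambda
    plot(foo$GCV ~ foo$lambda)

    # Pick the lambda minimizing GCV and refit at that value
    best.lambda <- foo$lambda[which.min(foo$GCV)]
    coef(lm.ridge(y ~ x1 + x2, lambda = best.lambda))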
Now, here is one ad hoc method I have used before. I'm not sure whether this procedure has a name, but it makes intuitive sense.

Suppose your goal is to fit the model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$$

where the two predictors, $X_i$ and $Z_i$, are highly correlated. As you've pointed out, using them both in the same model can do strange things to the coefficient estimates and p-values. An alternative is to fit the model

$$Z_i = \alpha_0 + \alpha_1 X_i + e_i$$

Then the residual $e_i$ will be uncorrelated with $X_i$ and can, in some sense, be thought of as the part of $Z_i$ that is not subsumed by its linear relationship with $X_i$. You can then proceed to fit the model

$$Y_i = \theta_0 + \theta_1 X_i + \theta_2 e_i + \nu_i$$

which will capture all of the effects of the first model (and will, in fact, have exactly the same $R^2$ as the first model), but the predictors are no longer collinear.
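A minimal numerical sketch of this trick in R, with simulated data (the names x, z, and y are placeholders, not from the original question):

    # Sketch of the residualization approach above, on simulated data;
    # the names x, z, and y are illustrative only.
    set.seed(2)
    n <- 200
    x <- rnorm(n)
    z <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(n)   # z is highly correlated with x
    y <- 1 + 1.5 * x - 2 * z + rnorm(n)

    fit.collinear  <- lm(y ~ x + z)      # collinear predictors: unstable estimates
    e              <- resid(lm(z ~ x))   # part of z not explained linearly by x
    fit.orthogonal <- lm(y ~ x + e)      # same fit, but predictors are uncorrelated

    summary(fit.collinear)$r.squared     # identical R^2 in both models ...
    summary(fit.orthogonal)$r.squared    # ... since {x, e} spans the same space as {x, z}
    cor(x, e)                            # essentially zero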
Edit: The OP asked for an explanation of why the residuals do not, by definition, have zero sample correlation with the predictor when the intercept is omitted, as they do when the intercept is included. This is too long to post in a comment, so I've edited it in here. The derivation is not particularly enlightening (unfortunately I could not come up with a reasonable intuitive argument), but it does show what the OP requested:
When the intercept is omitted in simple linear regression, $\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$, so $e_i = y_i - x_i \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. The sample correlation between $x_i$ and $e_i$ is proportional to $\overline{xe} - \bar{x}\,\bar{e}$, where $\overline{\,\cdot\,}$ denotes the sample average of the quantity under the bar. I'll show that this is not necessarily equal to zero.

First,

$$\overline{xe} = \frac{1}{n}\sum_i x_i\left(y_i - x_i\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right) = \frac{1}{n}\left(\sum_i x_i y_i - \sum_i x_i y_i\right) = 0$$

but

$$\bar{x}\,\bar{e} = \bar{x}\cdot\frac{1}{n}\sum_i\left(y_i - x_i\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right) = \bar{x}\left(\bar{y} - \bar{x}\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right)$$

So for $e_i$ and $x_i$ to have a sample correlation of exactly 0, we need $\bar{x}\,\bar{e}$ to be 0; that is, we need $\bar{y} = \bar{x}\cdot\frac{\sum_i x_i y_i}{\sum_i x_i^2}$,

which does not hold in general for two arbitrary sets of data.
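As a quick sanity check of this point, here is a small simulation in R (the data are simulated purely for illustration):

    # With an intercept, residuals are uncorrelated with x by construction;
    # without an intercept, they generally are not. Simulated data for illustration.
    set.seed(3)
    x <- rnorm(50, mean = 5)
    y <- 2 + 3 * x + rnorm(50)

    cor(x, resid(lm(y ~ x)))       # numerically zero (intercept included)
    cor(x, resid(lm(y ~ x - 1)))   # generally nonzero (intercept omitted)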
I like both of the answers given thus far. Let me add a few things.
Another option is that you can also combine the variables. This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable. This would be a good approach when you believe they are two different measures of the same underlying construct. In that case, you have two measurements that are contaminated with error. The most likely true value for the variable you really care about is in between them, thus averaging them gives a more accurate estimate. You standardize them first to put them on the same scale, so that nominal issues don't contaminate the result (e.g., you wouldn't want to average several temperature measurements if some are Fahrenheit and some are Celsius). Of course, if they are already on the same scale (e.g., several highly-correlated public opinion polls), you can skip that step. If you think one of your variables might be more accurate than the other, you could do a weighted average (perhaps using the reciprocals of the measurement errors).
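A minimal sketch of that composite-variable approach in R, where x1, x2, and y are placeholder names for two noisy measures of the same construct and the outcome (all data simulated for illustration):

    # Sketch of the standardize-and-average composite described above.
    # x1 and x2 stand in for two error-contaminated measures of the same construct.
    set.seed(4)
    n     <- 150
    truth <- rnorm(n)                               # the latent quantity both variables measure
    x1    <- truth + rnorm(n, sd = 0.3)
    x2    <- 10 + 2 * (truth + rnorm(n, sd = 0.3))  # same construct, different scale
    y     <- 1 + truth + rnorm(n)

    # Standardize (z-scores), then average into a single composite predictor
    composite <- (as.numeric(scale(x1)) + as.numeric(scale(x2))) / 2
    summary(lm(y ~ composite))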
If your variables are just different measures of the same construct, and are sufficiently highly correlated, you really could just throw one out without losing much information. As an example, I was actually in a situation once where I wanted to use a covariate to absorb some of the error variance and boost power, but where I didn't care about that covariate--it wasn't germane substantively. I had several options available and they were all correlated with each other. I basically picked one at random and moved on, and it worked fine. I suspect I would have lost power, burning two extra degrees of freedom, if I had included the others as well or used some other strategy. Of course, I could have combined them, but why bother? However, this depends critically on the fact that your variables are correlated because they are two different versions of the same thing; if there's a different reason they are correlated, this could be totally inappropriate.
As that implies, I suggest you think about what lies behind your correlated variables. That is, you need a theory of why they're so highly correlated to do the best job of picking which strategy to use. In addition to different measures of the same latent variable, some other possibilities are a causal chain (i.e., one variable causes the other, which in turn causes the response) and more complicated situations in which your variables are the result of multiple causal forces, some of which are the same for both. Perhaps the most extreme case is that of a suppressor variable, which @whuber describes in his comment below. @Macro's suggestion, for instance, assumes that you are primarily interested in $X$ and wonder about the additional contribution of $Z$ after having accounted for $X$'s contribution. Thus, thinking about why your variables are correlated and what you want to know will help you decide which of your variables should be treated as $X$ and which as $Z$. The key is to use theoretical insight to inform your choice.
I agree that ridge regression is arguably better, because it allows you to use the variables you had originally intended and is likely to yield betas that are very close to their true values (although they will be biased--see here or here for more information). Nonetheless, I think it also has two potential downsides: it is more complicated (requiring more statistical sophistication), and the resulting model is more difficult to interpret, in my opinion.
I gather that perhaps the ultimate approach would be to fit a structural equation model. That's because it would allow you to formulate the exact set of relationships you believe to be operative, including latent variables. However, I don't know SEM well enough to say anything about it here, other than to mention the possibility. (I also suspect it would be overkill in the situation you describe with just two covariates.)