Answers:

If the correlation matrix is close to singular (i.e., your variables are highly correlated), you can try a ridge regression approach, which will give you robust estimates.

The only question is how to choose the regularization parameter. I would suggest trying a range of values, but this is not a trivial problem.

Hope this helps!
You can use the lm.ridge routine in the MASS package. You pass it a range of λ values, e.g., with a call like foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1)). You get back the generalized cross-validation statistics in foo$GCV, which you can plot against λ via plot(foo$GCV ~ foo$lambda): choose the λ at the minimum.
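For illustration, here is a minimal, self-contained sketch of that workflow on simulated data (the variables x1, x2, and y and the λ grid are assumptions for the example, not taken from the question):

    library(MASS)

    # Simulated data with two highly correlated predictors (for illustration only)
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- 0.95 * x1 + 0.05 * rnorm(n)
    y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

    # Fit ridge regression over a grid of lambda values
    foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))

    # Generalized cross-validation statistic plotted against lambda
    plot(foo$GCV ~ foo$lambda)

    # Pick the lambda minimizing GCV and refit at that value
    best.lambda <- foo$lambda[which.min(foo$GCV)]
    coef(lm.ridge(y ~ x1 + x2, lambda = best.lambda))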
Now, here is one ad hoc method I have used before. I'm not sure whether this procedure has a name, but it makes intuitive sense.

Suppose your goal is to fit the model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$$

where the two predictors, $X_i$ and $Z_i$, are highly correlated. As you've pointed out, using them both in the same model can do strange things to the coefficient estimates and p-values. An alternative is to fit the model

$$Z_i = \alpha_0 + \alpha_1 X_i + e_i$$

Then the residual $e_i$ will be uncorrelated with $X_i$ and can, in some sense, be thought of as the part of $Z_i$ that is not subsumed by its linear relationship with $X_i$. You can then proceed to fit the model

$$Y_i = \theta_0 + \theta_1 X_i + \theta_2 e_i + \nu_i$$

which will capture all of the effects of the first model (and will, in fact, have exactly the same $R^2$ as the first model), but the predictors are no longer collinear.
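A minimal numerical sketch of this trick in R, with simulated data (the names x, z, and y are placeholders, not from the original question):

    # Sketch of the residualization approach above, on simulated data;
    # the names x, z, and y are illustrative only.
    set.seed(2)
    n <- 200
    x <- rnorm(n)
    z <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(n)   # z is highly correlated with x
    y <- 1 + 1.5 * x - 2 * z + rnorm(n)

    fit.collinear  <- lm(y ~ x + z)      # collinear predictors: unstable estimates
    e              <- resid(lm(z ~ x))   # part of z not explained linearly by x
    fit.orthogonal <- lm(y ~ x + e)      # same fit, but predictors are uncorrelated

    summary(fit.collinear)$r.squared     # identical R^2 in both models ...
    summary(fit.orthogonal)$r.squared    # ... since {x, e} spans the same space as {x, z}
    cor(x, e)                            # essentially zero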
Edit: The OP asked for an explanation of why the residuals do not, by definition, have zero sample correlation with the predictor when the intercept is omitted, as they do when the intercept is included. This is too long to post in a comment, so I've edited it in here. The derivation is not particularly enlightening (unfortunately I could not come up with a reasonable intuitive argument), but it does show what the OP requested:
When the intercept is omitted in simple linear regression, $\hat{\beta} = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$, so $e_i = y_i - x_i \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. The sample correlation between $x_i$ and $e_i$ is proportional to $\overline{xe} - \bar{x}\,\bar{e}$, where $\overline{\,\cdot\,}$ denotes the sample average of the quantity under the bar. I'll show that this is not necessarily equal to zero.

First,

$$\overline{xe} = \frac{1}{n}\sum_i x_i\left(y_i - x_i\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right) = \frac{1}{n}\left(\sum_i x_i y_i - \sum_i x_i y_i\right) = 0$$

but

$$\bar{x}\,\bar{e} = \bar{x}\cdot\frac{1}{n}\sum_i\left(y_i - x_i\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right) = \bar{x}\left(\bar{y} - \bar{x}\,\frac{\sum_j x_j y_j}{\sum_j x_j^2}\right)$$

So for $e_i$ and $x_i$ to have a sample correlation of exactly 0, we need $\bar{x}\,\bar{e}$ to be 0; that is, we need $\bar{y} = \bar{x}\cdot\frac{\sum_i x_i y_i}{\sum_i x_i^2}$,

which does not hold in general for two arbitrary sets of data.
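As a quick sanity check of this point, here is a small simulation in R (the data are simulated purely for illustration):

    # With an intercept, residuals are uncorrelated with x by construction;
    # without an intercept, they generally are not. Simulated data for illustration.
    set.seed(3)
    x <- rnorm(50, mean = 5)
    y <- 2 + 3 * x + rnorm(50)

    cor(x, resid(lm(y ~ x)))       # numerically zero (intercept included)
    cor(x, resid(lm(y ~ x - 1)))   # generally nonzero (intercept omitted)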
I like both of the answers given thus far. Let me add a few things.
Another option is that you can also combine the variables. This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable. This would be a good approach when you believe they are two different measures of the same underlying construct. In that case, you have two measurements that are contaminated with error. The most likely true value for the variable you really care about is in between them, thus averaging them gives a more accurate estimate. You standardize them first to put them on the same scale, so that nominal issues don't contaminate the result (e.g., you wouldn't want to average several temperature measurements if some are Fahrenheit and some are Celsius). Of course, if they are already on the same scale (e.g., several highly-correlated public opinion polls), you can skip that step. If you think one of your variables might be more accurate than the other, you could do a weighted average (perhaps using the reciprocals of the measurement errors).
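A minimal sketch of that composite-variable approach in R, where x1, x2, and y are placeholder names for two noisy measures of the same construct and the outcome (all data simulated for illustration):

    # Sketch of the standardize-and-average composite described above.
    # x1 and x2 stand in for two error-contaminated measures of the same construct.
    set.seed(4)
    n     <- 150
    truth <- rnorm(n)                               # the latent quantity both variables measure
    x1    <- truth + rnorm(n, sd = 0.3)
    x2    <- 10 + 2 * (truth + rnorm(n, sd = 0.3))  # same construct, different scale
    y     <- 1 + truth + rnorm(n)

    # Standardize (z-scores), then average into a single composite predictor
    composite <- (as.numeric(scale(x1)) + as.numeric(scale(x2))) / 2
    summary(lm(y ~ composite))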
If your variables are just different measures of the same construct, and are sufficiently highly correlated, you really could just throw one out without losing much information. As an example, I was actually in a situation once where I wanted to use a covariate to absorb some of the error variance and boost power, but where I didn't care about that covariate--it wasn't germane substantively. I had several options available and they were all correlated with each other. I basically picked one at random and moved on, and it worked fine. I suspect I would have lost power, burning two extra degrees of freedom, if I had included the others as well or used some other strategy. Of course, I could have combined them, but why bother? However, this depends critically on the fact that your variables are correlated because they are two different versions of the same thing; if there's a different reason they are correlated, this could be totally inappropriate.
As that implies, I suggest you think about what lies behind your correlated variables. That is, you need a theory of why they're so highly correlated to do the best job of picking which strategy to use. In addition to different measures of the same latent variable, some other possibilities are a causal chain (i.e., one variable causes the other, which in turn causes the response) and more complicated situations in which your variables are the result of multiple causal forces, some of which are the same for both. Perhaps the most extreme case is that of a suppressor variable, which @whuber describes in his comment below. @Macro's suggestion, for instance, assumes that you are primarily interested in $X$ and wonder about the additional contribution of $Z$ after having accounted for $X$'s contribution. Thus, thinking about why your variables are correlated and what you want to know will help you decide which of your variables should be treated as $X$ and which as $Z$. The key is to use theoretical insight to inform your choice.
I agree that ridge regression is arguably better, because it allows you to use the variables you had originally intended and is likely to yield betas that are very close to their true values (although they will be biased--see here or here for more information). Nonetheless, I think it also has two potential downsides: it is more complicated (requiring more statistical sophistication), and the resulting model is more difficult to interpret, in my opinion.
I gather that perhaps the ultimate approach would be to fit a structural equation model. That's because it would allow you to formulate the exact set of relationships you believe to be operative, including latent variables. However, I don't know SEM well enough to say anything about it here, other than to mention the possibility. (I also suspect it would be overkill in the situation you describe with just two covariates.)