Interpretation of betas when there are multiple categorical variables

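I understand the concept that $\hat\beta_0$ is the mean when the categorical variable equals 0 (i.e., is the reference group), which gives the end interpretation that the regression coefficient is the difference in means between the two categories. Even with more than 2 categories, I would assume each $\hat\beta$ describes the difference between that category's mean and the reference.

But what happens when more variables are brought into a multivariable model? What does the intercept mean now, if it no longer makes sense for it to be the mean of the reference of two categorical variables? An example would be if gender (M (ref) / F) and race (white (ref) / black) were both in the model. Is $\hat\beta_0$ the mean for white males only? How should the other possibilities be interpreted?

As a separate note: do contrast statements serve as a way to investigate effect modification? Or do they just show the effect ($\hat\beta$) at different levels?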

— レニー

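As a point of terminology, "multivariate" means multiple response variables, not multiple predictor variables (see here). Also, I don't follow your last question.

— gung - Reinstate Monica

Thank you for that clarification. Getting the language right is important to me! I guess I just don't understand why contrast statements are used at all, since you can always set the reference level to whichever level you want to compare against?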

— レニー14年

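I suppose you could keep refitting the model with different reference levels. I'm not sure that is more convenient. With contrasts, you can also specify and test a set of orthogonal contrasts, or theoretically suggested contrasts (A vs. a combination of B & C).

— gung - Reinstate Monica

Answers:

You are right about the interpretation of the betas when there is a single categorical variable with $k$ levels. If there are multiple categorical variables (and no interaction term), the intercept ($\hat\beta_0$) is the mean of the group that constitutes the reference level for both (all) categorical variables. Using your example scenario, and considering the case with no interaction, the betas are: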

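$\hat\beta_0$: the mean of white males
$\hat\beta_{\rm Female}$: the difference between the mean of females and the mean of males
$\hat\beta_{\rm Black}$: the difference between the mean of blacks and the mean of whites

We can also think of this in terms of how to calculate the various group means: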

$\begin{aligned} \bar x_{\rm White\ Males} &= \hat\beta_0 \\ \bar x_{\rm White\ Females} &= \hat\beta_0 + \hat\beta_{\rm Female} \\ \bar x_{\rm Black\ Males} &= \hat\beta_0 + \hat\beta_{\rm Black} \\ \bar x_{\rm Black\ Females} &= \hat\beta_0 + \hat\beta_{\rm Female} + \hat\beta_{\rm Black} \end{aligned}$

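If there is an interaction term, it is added at the end of the equation for black females. (The interpretation of such an interaction term is quite convoluted, but I walk through it here: Interpretation of interaction term.)

Update: To clarify my points, let's consider a canned example, coded in R.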
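As a rough illustration of where the interaction term enters, here is a minimal sketch on a hypothetical four-row data set (one observation per cell; the `y` values and the names `d_int` and `fit_int` are assumptions made up for this example). With `Sex * Race`, the model is saturated, so the four coefficients reproduce the four cell values exactly, and the interaction coefficient only appears in the sum for black females.

```r
# Hypothetical toy data: one observation per Sex x Race cell (illustrative values)
d_int <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 2), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black"), each = 2),  levels = c("White", "Black")),
  y    = c(1, 3, 5, 9)
)

# Sex * Race expands to Sex + Race + Sex:Race
fit_int <- lm(y ~ Sex * Race, data = d_int)
b <- coef(fit_int)
print(round(b, 6))

# Only the black-female cell picks up the interaction coefficient
black_female <- b[["(Intercept)"]] + b[["SexFemale"]] +
  b[["RaceBlack"]] + b[["SexFemale:RaceBlack"]]
print(black_female)
```

Here `SexFemale` is the female-minus-male difference among whites only, and `SexFemale:RaceBlack` is how much that difference changes among blacks.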

d = data.frame(Sex  =factor(rep(c("Male","Female"),times=2), levels=c("Male","Female")),
               Race =factor(rep(c("White","Black"),each=2),  levels=c("White","Black")),
               y    =c(1, 3, 5, 7))
d
#      Sex  Race y
# 1   Male White 1
# 2 Female White 3
# 3   Male Black 5
# 4 Female Black 7

The means of y for these categorical variables are:

aggregate(y~Sex,  d, mean)
#      Sex y
# 1   Male 3
# 2 Female 5
## i.e., the difference is 2
aggregate(y~Race, d, mean)
#    Race y
# 1 White 2
# 2 Black 6
## i.e., the difference is 4

We can compare the differences between these means to the coefficients from a fitted model:

summary(lm(y~Sex+Race, d))
# ...
# Coefficients:
#             Estimate Std. Error  t value Pr(>|t|)    
# (Intercept)        1   3.85e-16 2.60e+15  2.4e-16 ***
# SexFemale          2   4.44e-16 4.50e+15  < 2e-16 ***
# RaceBlack          4   4.44e-16 9.01e+15  < 2e-16 ***
# ...
# Warning message:
#   In summary.lm(lm(y ~ Sex + Race, d)) :
#   essentially perfect fit: summary may be unreliable

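The thing to recognize about this situation is that, without an interaction term, we are assuming parallel lines. Thus, the Estimate for the (Intercept) is the mean of white males. The Estimate for SexFemale is the difference between the mean of females and the mean of males. The Estimate for RaceBlack is the difference between the mean of blacks and the mean of whites. Again, because a model without an interaction term assumes that the effects are strictly additive (the lines are strictly parallel), the mean of black females is the mean of white males, plus the difference between the mean of females and the mean of males, plus the difference between the mean of blacks and the mean of whites.

— gung - Reinstate Monica

Thank you! Very clear and helpful. At the end you mention interaction terms. If an interaction term is included, how do the betas change (meaning the new betas from the interaction-term model)? I know the p-value for the interaction term is important, but does the interaction-term beta have a meaningful interpretation? Thanks again for your help!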
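The additivity described above can be checked directly. The following sketch rebuilds the same four-row data set, adds up the coefficients for each cell by hand, and compares the sums with the observed values (the name `cell_means` is just a label chosen for this illustration).

```r
# Same canned data as above
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 2), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black"), each = 2),  levels = c("White", "Black")),
  y    = c(1, 3, 5, 7)
)

fit <- lm(y ~ Sex + Race, data = d)
b <- coef(fit)

# Build each cell mean from the additive pieces, in the row order of d
cell_means <- c(
  WhiteMale   = b[["(Intercept)"]],
  WhiteFemale = b[["(Intercept)"]] + b[["SexFemale"]],
  BlackMale   = b[["(Intercept)"]] + b[["RaceBlack"]],
  BlackFemale = b[["(Intercept)"]] + b[["SexFemale"]] + b[["RaceBlack"]]
)

# Observed values next to the hand-built sums
print(cbind(d, from_coefs = unname(cell_means)))
```

The sums match the observed values exactly here only because this toy data set happens to be perfectly additive; with real data they would match the model's fitted values, not necessarily the observed cell means.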

— レニー14年

$\hat\beta_{\rm Female}$

$\bar x_{\rm White\ Male}$

$\bar x_{\rm White\ Female}$

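That makes sense. Thank you! And do the main effects change from the model without the interaction term because the interaction term refines them? Meaning that if there were no interaction, the main effect terms would theoretically be the same?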

— レニー14年

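If the interaction effect were exactly 0 (to infinite decimal places), not just in the population but also in your sample, the main effect betas would be the same in the model with or without the interaction term.

— gung - Reinstate Monica

@hans0l0, this would be better as a new question than as information buried in comments; you can link to this one for context. Briefly, it is the mean of the reference levels when all continuous variables are = 0.

— gung - Reinstate Monica

For a single categorical variable, $\hat{\beta}_0$ is the mean of the reference level, and each other $\hat\beta$ is the difference between the mean of that level of the category and the mean of the reference.

If we extend your example a little to include a third level in the race category (say Asian) and choose White as the reference, then you would have: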

$\hat{\beta}_0 = \bar{x}_{White}$
$\hat{\beta}_{Black} = \bar{x}_{Black} - \bar{x}_{White}$
$\hat{\beta}_{Asian} = \bar{x}_{Asian} - \bar{x}_{White}$

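In this case every $\hat{\beta}$ has a clear interpretation, and the mean of any level can be recovered from them, for example:

$\bar{x}_{Asian} = \hat{\beta}_{Asian} + \hat{\beta}_0$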
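A quick sketch of the single-variable case, assuming the six `y` values used in the numerical example further down and ignoring Sex for now (the names `d3`, `fit_race`, and `group_means` are illustrative):

```r
# Race only, White as the reference level
d3 <- data.frame(
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit_race    <- lm(y ~ Race, data = d3)
group_means <- tapply(d3$y, d3$Race, mean)

print(coef(fit_race))   # intercept, then differences from the White mean
print(group_means)      # raw group means for comparison

# Recover the Asian mean from the coefficients
asian_mean <- coef(fit_race)[["(Intercept)"]] + coef(fit_race)[["RaceAsian"]]
print(asian_mean)
```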

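Unfortunately, in the case of multiple categorical variables, the correct interpretation of the intercept is no longer as clear (see the note at the end). When there are n categories, each with multiple levels and one reference level (e.g. White and Male in your example), the general form of the intercept is: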

$\hat{\beta}_0 = \sum_{i=1}^{n} \bar{x}_{reference,i} - (n-1)\,\bar{x},$

where $\bar{x}_{reference,i}$ is the mean of the reference level of the $i$-th categorical variable and $\bar{x}$ is the mean of the whole data set.

The other $\hat\beta$ are the same as with a single category: they are the difference between the mean of that level of the category and the mean of the reference level of the same category.

If we go back to your example, we would get:

$\hat{\beta}_0 = \bar{x}_{White} + \bar{x}_{Male} - \bar{x}$
$\hat{\beta}_{Black} = \bar{x}_{Black} - \bar{x}_{White}$
$\hat{\beta}_{Asian} = \bar{x}_{Asian} - \bar{x}_{White}$
$\hat{\beta}_{Female} = \bar{x}_{Female} - \bar{x}_{Male}$

You will notice that the means of the cross categories (e.g. White males) are not present in any of the $\hat\beta$. As a matter of fact, you cannot calculate these means precisely from the results of this type of regression.

The reason for this is that the number of estimated coefficients (i.e. the $\hat\beta$) is smaller than the number of cross categories (as long as you have more than 1 category), so a perfect fit is not always possible. If we go back to your example, the number of coefficients is 4 (i.e. $\hat{\beta}_0, ~\hat{\beta}_{Black}, ~\hat{\beta}_{Asian}$ and $\hat{\beta}_{Female}$) while the number of cross categories is 6.

Numerical Example

Let me borrow from @Gung for a canned numerical example:

d = data.frame(Sex=factor(rep(c("Male","Female"),times=3), levels=c("Male","Female")),
    Race =factor(rep(c("White","Black","Asian"),each=2),levels=c("White","Black","Asian")),
    y    =c(0, 3, 7, 8, 9, 10))
d

#      Sex  Race  y
# 1   Male White  0
# 2 Female White  3
# 3   Male Black  7
# 4 Female Black  8
# 5   Male Asian  9
# 6 Female Asian 10

In this case, the various averages that will go in the calculation of the $\hat\beta$ are:

aggregate(y~1,  d, mean)

#          y
# 1 6.166667

aggregate(y~Sex,  d, mean)

#      Sex        y
# 1   Male 5.333333
# 2 Female 7.000000

aggregate(y~Race, d, mean)

#    Race   y
# 1 White 1.5
# 2 Black 7.5
# 3 Asian 9.5

We can compare these numbers with the results of the regression:

summary(lm(y~Sex+Race, d))

# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.6667     0.6667   1.000   0.4226
# SexFemale     1.6667     0.6667   2.500   0.1296
# RaceBlack     6.0000     0.8165   7.348   0.0180
# RaceAsian     8.0000     0.8165   9.798   0.0103

As you can see, the various $\hat\beta$ estimated from the regression all line up with the formulas given above. For example, $\hat\beta_0$ is given by:

$\hat{\beta}_0 = \bar{x}_{White} + \bar{x}_{Male} - \bar{x}$

which gives:

1.5 + 5.333333 - 6.166667
# 0.66666
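The same check can be done for all four coefficients at once. The sketch below rebuilds the data, computes each $\hat\beta$ from the marginal means using the formulas above, and compares them with `coef()`. One assumption worth stating loudly: these marginal-mean formulas match `lm()` here because the design is balanced (one observation in every Sex by Race cell); with unequal cell counts they would generally not agree. The name `by_hand` is illustrative.

```r
# Canned 2 x 3 example from above
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

xbar   <- mean(d$y)
m_sex  <- tapply(d$y, d$Sex,  mean)
m_race <- tapply(d$y, d$Race, mean)

# Coefficients built from marginal means (valid for this balanced design)
by_hand <- c(
  "(Intercept)" = m_race[["White"]] + m_sex[["Male"]] - xbar,
  SexFemale     = m_sex[["Female"]] - m_sex[["Male"]],
  RaceBlack     = m_race[["Black"]] - m_race[["White"]],
  RaceAsian     = m_race[["Asian"]] - m_race[["White"]]
)

fit <- lm(y ~ Sex + Race, data = d)
print(cbind(by_hand, lm = coef(fit)))

# Cross-category values are not reproduced exactly: compare y with fitted
print(cbind(d, fitted = fitted(fit)))
```

The second table shows the point made earlier: the fitted value for white males is the intercept, which differs from the observed white-male value of 0.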

Note on the choice of contrast

A final note on this topic: all the results discussed above relate to categorical regressions using treatment contrasts (the default type of contrast in R). There are other types of contrast that could be used (notably Helmert and sum), and they would change the interpretation of the various $\hat\beta$. However, they would not change the final predictions from the regression (e.g. the prediction for White males is always the same no matter which type of contrast you use).
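To see that the predictions do not depend on the coding, one can refit the same model under the three contrast types and compare the fitted values. This is a minimal sketch using the canned data from the numerical example; the object names `fit_trt`, `fit_sum`, and `fit_hel` are illustrative.

```r
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Same model, three different codings of the factors
fit_trt <- lm(y ~ Sex + Race, data = d)  # treatment contrasts (R default)
fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
fit_hel <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.helmert", Race = "contr.helmert"))

# The coefficients differ between codings, the fitted values do not
print(cbind(treatment = fitted(fit_trt),
            sum       = fitted(fit_sum),
            helmert   = fitted(fit_hel)))
```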

My personal favourite is contrast sum as I feel that the interpretation of the $\hat\beta^{contr.sum}$ generalises better when there are multiple categories. For this type of contrast, there is no reference level, or rather the reference is the mean of the whole sample, and you have the following $\hat\beta^{contr.sum}$ :

$\hat\beta_0^{contr.sum}=\bar{x}$
$\hat\beta_i^{contr.sum}=\bar{x}_i-\bar{x}$

If we go back to the previous example, you would have:

$\hat{\beta}_0^{contr.sum} = \bar{x}$
$\hat{\beta}_{White}^{contr.sum} = \bar{x}_{White} - \bar{x}$
$\hat{\beta}_{Black}^{contr.sum} = \bar{x}_{Black} - \bar{x}$
$\hat{\beta}_{Asian}^{contr.sum} = \bar{x}_{Asian} - \bar{x}$
$\hat{\beta}_{Male}^{contr.sum} = \bar{x}_{Male} - \bar{x}$
$\hat{\beta}_{Female}^{contr.sum} = \bar{x}_{Female} - \bar{x}$

You will notice that because White and Male are no longer reference levels, their $\hat\beta^{contr.sum}$ are no longer 0. The fact that these are 0 is specific to contrast treatment.
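A sketch of the sum-contrast fit on the same canned data follows. Two caveats, both assumptions about this illustration and not claims from the answer: R only reports $k-1$ coefficients per factor under `contr.sum` (named `Sex1`, `Race1`, `Race2` here, for Male, White, and Black), with the last level's effect equal to minus the sum of the others; and the match with "level mean minus grand mean" relies on the design being balanced.

```r
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3), levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
b <- coef(fit_sum)
print(b)

xbar   <- mean(d$y)
m_sex  <- tapply(d$y, d$Sex,  mean)
m_race <- tapply(d$y, d$Race, mean)

# Level mean minus grand mean, for the levels R reports directly
print(c(Male  = m_sex[["Male"]]   - xbar,
        White = m_race[["White"]] - xbar,
        Black = m_race[["Black"]] - xbar))

# Effects of the omitted last levels are minus the sum of the reported ones
female_effect <- -b[["Sex1"]]
asian_effect  <- -(b[["Race1"]] + b[["Race2"]])
print(c(Female = female_effect, Asian = asian_effect))
```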

— G.L.