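With a single categorical variable, the intercept $\hat{\beta}_0$ is the mean of the reference level, and each of the other $\hat{\beta}$ is the difference between the mean of that level of the category and the mean of the reference level.
Extending your example a little to include a third level in the race category (e.g. Asian) and choosing White as the reference, you would have: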
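- $\hat{\beta}_0 = \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$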
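The mean of each level can therefore be recovered from the $\hat{\beta}$, for example:
- $\bar{x}_{\text{Asian}} = \hat{\beta}_{\text{Asian}} + \hat{\beta}_0$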
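As a quick check of the single-variable case, here is a minimal R sketch. The data frame `d1` and its values are hypothetical (they reuse the six `y` values of the numerical example further down, with the Sex column dropped), and `White` is listed first so that it is the reference level.

```r
# Hypothetical toy data: one categorical predictor with three levels.
# "White" is the first level, so treatment contrasts use it as reference.
d1 <- data.frame(
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit1 <- lm(y ~ Race, data = d1)

# Mean of y within each level of Race, as a plain named vector
level_means <- sapply(split(d1$y, d1$Race), mean)

coef(fit1)    # intercept, then the Black and Asian offsets
level_means   # compare: intercept = White mean, offsets = differences

# Recover the Asian mean from the coefficients
asian_mean <- unname(coef(fit1)["(Intercept)"] + coef(fit1)["RaceAsian"])
asian_mean
```

With only one categorical predictor the model has as many coefficients as levels, so this recovery is exact.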
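Unfortunately, with multiple categorical variables, the correct interpretation of the intercept is no longer as clear (see the note at the end). When there are $n$ categorical variables, each with multiple levels and one reference level (e.g. White and Male respectively in your example), the general form of the intercept, for balanced data where every combination of levels has the same number of observations, is: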
$$\hat{\beta}_0 = \sum_{i=1}^{n} \bar{x}_{\text{reference},i} - (n-1)\,\bar{x},$$
where $\bar{x}_{\text{reference},i}$ is the mean of the reference level of the $i$-th categorical variable and $\bar{x}$ is the mean of the whole data set.
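The other $\hat{\beta}$ are the same as with a single categorical variable: each is the difference between the mean of that level of the category and the mean of the reference level of the same category.
Going back to your example, you would have: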
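- $\hat{\beta}_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$
- $\hat{\beta}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}_{\text{White}}$
- $\hat{\beta}_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}_{\text{Male}}$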
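One way to see where these expressions come from is sketched below. It assumes a balanced design (the same number of observations in every Race and Sex combination) and an additive model without interaction; the symbol $\hat{y}_{r,s}$ for the fitted value of race level $r$ and sex level $s$ is introduced here for illustration only. Under those assumptions the least-squares fitted value of a cell is the sum of its two marginal means minus the overall mean, and the intercept is simply the fitted value of the reference cell.

```latex
\begin{align*}
\hat{y}_{r,s} &= \bar{x}_{r} + \bar{x}_{s} - \bar{x} \\
\hat{\beta}_0 &= \hat{y}_{\text{White},\text{Male}}
               = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x} \\
\hat{\beta}_{\text{Black}} &= \hat{y}_{\text{Black},\text{Male}} - \hat{y}_{\text{White},\text{Male}}
               = \bar{x}_{\text{Black}} - \bar{x}_{\text{White}} \\
\hat{\beta}_{\text{Female}} &= \hat{y}_{\text{White},\text{Female}} - \hat{y}_{\text{White},\text{Male}}
               = \bar{x}_{\text{Female}} - \bar{x}_{\text{Male}}
\end{align*}
```

With unbalanced data the coefficients are no longer simple differences of marginal means, so these identities should be read as holding for the balanced case only.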
You will notice that the means of the cross categories (e.g. White males) are not present in any of the $\hat{\beta}$. As a matter of fact, you cannot calculate these means precisely from the results of this type of regression.
The reason is that the number of estimated coefficients (i.e. the $\hat{\beta}$) is smaller than the number of cross categories (as long as you have more than one categorical variable), so a perfect fit is not always possible. Going back to your example, the number of coefficients is 4 (i.e. $\hat{\beta}_0$, $\hat{\beta}_{\text{Black}}$, $\hat{\beta}_{\text{Asian}}$ and $\hat{\beta}_{\text{Female}}$) while the number of cross categories is 6.
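The counting argument can be illustrated with a short R sketch. It uses the six observations of the numerical example in this answer, redefined here so that the block runs on its own; the names `fit` and `n_cells` are only illustrative.

```r
# Same six observations as in the numerical example, one per Sex x Race cell
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit <- lm(y ~ Sex + Race, data = d)

length(coef(fit))                          # 4 coefficients
n_cells <- nrow(unique(d[c("Sex", "Race")]))
n_cells                                    # 6 cross categories

# Observed value of each cell next to the value the additive model fits.
# The two columns differ, e.g. for White males (first row).
cbind(d, fitted = fitted(fit))
```

Since each cell holds a single observation here, the observed `y` is the cell mean, and the mismatch with the `fitted` column shows that four coefficients cannot reproduce six cell means in general.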
Numerical Example
Let me borrow from @Gung for a canned numerical example:
d = data.frame(Sex=factor(rep(c("Male","Female"),times=3), levels=c("Male","Female")),
Race =factor(rep(c("White","Black","Asian"),each=2),levels=c("White","Black","Asian")),
y =c(0, 3, 7, 8, 9, 10))
d
# Sex Race y
# 1 Male White 0
# 2 Female White 3
# 3 Male Black 7
# 4 Female Black 8
# 5 Male Asian 9
# 6 Female Asian 10
In this case, the various averages that go into the calculation of the $\hat{\beta}$ are:
aggregate(y~1, d, mean)
# y
# 1 6.166667
aggregate(y~Sex, d, mean)
# Sex y
# 1 Male 5.333333
# 2 Female 7.000000
aggregate(y~Race, d, mean)
# Race y
# 1 White 1.5
# 2 Black 7.5
# 3 Asian 9.5
We can compare these numbers with the results of the regression:
summary(lm(y~Sex+Race, d))
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.6667 0.6667 1.000 0.4226
# SexFemale 1.6667 0.6667 2.500 0.1296
# RaceBlack 6.0000 0.8165 7.348 0.0180
# RaceAsian 8.0000 0.8165 9.798 0.0103
As you can see, the various $\hat{\beta}$ estimated from the regression all line up with the formulas given above. For example, $\hat{\beta}_0$ is given by:
$$\hat{\beta}_0 = \bar{x}_{\text{White}} + \bar{x}_{\text{Male}} - \bar{x}$$
Which gives:
1.5 + 5.333333 - 6.166667
# 0.66666
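The same check can be done for all four coefficients at once. The sketch below rebuilds the data so that it is self-contained; `by_hand` is an illustrative name for the vector of values computed from the marginal means with the formulas above.

```r
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit <- lm(y ~ Sex + Race, data = d)

# Marginal means as plain named vectors
sex_means  <- sapply(split(d$y, d$Sex), mean)
race_means <- sapply(split(d$y, d$Race), mean)

# Coefficients computed by hand from the formulas (treatment contrasts)
by_hand <- c(
  "(Intercept)" = race_means[["White"]] + sex_means[["Male"]] - mean(d$y),
  SexFemale     = sex_means[["Female"]] - sex_means[["Male"]],
  RaceBlack     = race_means[["Black"]] - race_means[["White"]],
  RaceAsian     = race_means[["Asian"]] - race_means[["White"]]
)

# Compare with what lm() estimates
rbind(lm = coef(fit), by_hand = by_hand)
all.equal(coef(fit), by_hand)
```

The agreement relies on the data being balanced, as it is here with one observation per Sex and Race combination.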
Note on the choice of contrast
A final note on this topic: all the results discussed above relate to categorical regressions using treatment contrasts (the default type of contrast in R). Other types of contrast could be used (notably Helmert and sum), and they would change the interpretation of the various $\hat{\beta}$. However, they would not change the final predictions from the regression (e.g. the prediction for White males is always the same no matter which type of contrast you use).
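To illustrate that the predictions do not depend on the contrast, here is a small sketch that fits the same model three times on the example data. The object names (`fit_treat`, `fit_sum`, `fit_helm`) are illustrative; the `contrasts` argument of `lm()` is used to switch coding for each factor.

```r
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

# Default: treatment contrasts
fit_treat <- lm(y ~ Sex + Race, data = d)

# Same model with sum and Helmert contrasts
fit_sum  <- lm(y ~ Sex + Race, data = d,
               contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
fit_helm <- lm(y ~ Sex + Race, data = d,
               contrasts = list(Sex = "contr.helmert", Race = "contr.helmert"))

# The coefficients differ between the three fits ...
rbind(treatment = coef(fit_treat), sum = coef(fit_sum), helmert = coef(fit_helm))

# ... but the fitted values are the same
all.equal(fitted(fit_treat), fitted(fit_sum))
all.equal(fitted(fit_treat), fitted(fit_helm))
```

The three codings are different parameterisations of the same additive model, which is why only the coefficients change.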
My personal favourite is the sum contrast, as I feel that the interpretation of the $\hat{\beta}^{\text{contr.sum}}$ generalises better when there are multiple categorical variables. For this type of contrast there is no reference level, or rather the reference is the mean of the whole sample, and you have the following $\hat{\beta}^{\text{contr.sum}}$:
- $\hat{\beta}^{\text{contr.sum}}_{0} = \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{i} = \bar{x}_{i} - \bar{x}$
If we go back to the previous example, you would have:
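- $\hat{\beta}^{\text{contr.sum}}_{0} = \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{White}} = \bar{x}_{\text{White}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Black}} = \bar{x}_{\text{Black}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Asian}} = \bar{x}_{\text{Asian}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Male}} = \bar{x}_{\text{Male}} - \bar{x}$
- $\hat{\beta}^{\text{contr.sum}}_{\text{Female}} = \bar{x}_{\text{Female}} - \bar{x}$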
You will notice that because White and Male are no longer reference levels, their $\hat{\beta}^{\text{contr.sum}}$ are no longer 0. The fact that these are 0 is specific to treatment contrasts.
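A sketch of the sum-contrast fit on the example data is given below. One practical point, stated here as my understanding of how R codes `contr.sum`: the output lists one coefficient fewer than the number of levels of each factor (named `Sex1`, `Race1`, `Race2`, following the level order), and the coefficient of the last level is minus the sum of the others. The names `fit_sum`, `beta_female` and `beta_asian` are illustrative.

```r
d <- data.frame(
  Sex  = factor(rep(c("Male", "Female"), times = 3),
                levels = c("Male", "Female")),
  Race = factor(rep(c("White", "Black", "Asian"), each = 2),
                levels = c("White", "Black", "Asian")),
  y    = c(0, 3, 7, 8, 9, 10)
)

fit_sum <- lm(y ~ Sex + Race, data = d,
              contrasts = list(Sex = "contr.sum", Race = "contr.sum"))
b <- coef(fit_sum)
b   # (Intercept), Sex1 (Male), Race1 (White), Race2 (Black)

# Coefficients of the last level of each factor, implied by the
# sum-to-zero constraint
beta_female <- -unname(b["Sex1"])
beta_asian  <- -unname(b["Race1"] + b["Race2"])

# Compare with "level mean minus overall mean"
sex_means  <- sapply(split(d$y, d$Sex), mean)
race_means <- sapply(split(d$y, d$Race), mean)
sex_means  - mean(d$y)
race_means - mean(d$y)
c(Female = beta_female, Asian = beta_asian)
```

As with the treatment-contrast formulas, the match with the marginal means relies on the data being balanced.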