Intuitive explanation of the $(X^TX)^{-1}$ term in the variance of the least squares estimator

18

If $X$ is full rank, the inverse of $X^TX$ exists and we get the least squares estimate

$\hat\beta = (X^TX)^{-1}X^TY$

and

$\operatorname{Var}(\hat\beta) = \sigma^2(X^TX)^{-1}$

How can we intuitively explain $(X^TX)^{-1}$ in the variance formula? The technique of the derivation is clear to me.

regression variance least-squares

— Daniel Yefimov

3

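The formula you have stated for the variance-covariance matrix of $\hat{\beta}$ (assuming $\hat{\beta}$ is estimated by OLS) is correct only if the conditions of the Gauss-Markov theorem are satisfied and, in particular, only if the variance-covariance matrix of the error terms is given by $\sigma^2 I_n$, where $I_n$ is the $n\times n$ identity matrix and $n$ is the number of rows of $X$ (and $Y$). The formula is not correct for the more general case of non-spherical errors.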

— Mico

13

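Consider a simple regression without a constant term, where the single regressor is centered on its sample mean. Then $X'X$ is ($n$ times) its sample variance, and $(X'X)^{-1}$ is its reciprocal. So the higher the variance (variability) of the regressor, the lower the variance of the coefficient estimator: the more variability we have in the explanatory variable, the more accurately we can estimate the unknown coefficient.

Why? Because the more a regressor varies, the more information it contains. When there are many regressors, this generalizes to the inverse of their variance-covariance matrix, which also takes into account the co-variability of the regressors. In the extreme case where $X'X$ is diagonal, the precision of each estimated coefficient depends only on the variance/variability of the associated regressor (given the variance of the error term).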
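As a rough numerical illustration of the point above, the sketch below builds a single centered regressor at two different spreads and evaluates $\sigma^2 (X'X)^{-1}$ for each. The sample size, error variance, and scale factors are arbitrary choices made for this example, not values from the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
sigma2 = 1.0  # assumed error variance (illustrative)

# One set of base draws, centered on its sample mean.
z = rng.normal(size=n)
z -= z.mean()

def slope_variance(scale):
    """sigma^2 (X'X)^{-1} for a single centered regressor x = scale * z."""
    x = scale * z
    xtx = x @ x  # X'X is a scalar when there is one regressor
    # For a centered regressor, X'X is n times the (1/n) sample variance.
    assert np.isclose(xtx, n * x.var())
    return sigma2 / xtx

var_narrow = slope_variance(1.0)  # regressor with small spread
var_wide = slope_variance(3.0)    # same regressor stretched by a factor of 3

print(var_narrow, var_wide)
```

Tripling the spread of the regressor multiplies its variance by 9, so the variance of the coefficient estimator falls by the same factor.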

— Alecos Papadopoulos

Could you relate this argument to the fact that the inverse of the variance-covariance matrix yields the partial correlations?

— Heisenberg

5

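A simple way of viewing $\sigma^2 \left(\mathbf{X}^{T} \mathbf{X} \right)^{-1}$ is as the matrix (multivariate) analogue of $\frac{\sigma^2}{\sum_{i=1}^n \left(X_i-\bar{X}\right)^2}$, which is the variance of the slope coefficient in simple OLS regression. One can even get $\frac{\sigma^2}{\sum_{i=1}^n X_i^2}$ for that variance by omitting the intercept from the model, i.e. by performing regression through the origin.

From either of these formulas it may be seen that larger variability of the predictor variable will in general lead to more precise estimation of its coefficient. This is an idea often exploited in the design of experiments, where by choosing values for the (non-random) predictors, one tries to make the determinant of $\left(\mathbf{X}^{T} \mathbf{X} \right)$ as large as possible, the determinant being a measure of variability.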
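The correspondence between the matrix formula and the two scalar formulas can be checked numerically. The following is a minimal sketch with made-up data (the sample size, $\sigma^2$, and the range of `x` are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50, 2.0                 # illustrative values
x = rng.uniform(0.0, 10.0, size=n)

# Model with intercept: the slope variance is the [1, 1] entry of sigma^2 (X'X)^{-1}.
X_int = np.column_stack([np.ones(n), x])
cov_int = sigma2 * np.linalg.inv(X_int.T @ X_int)
slope_var_matrix = cov_int[1, 1]
slope_var_scalar = sigma2 / np.sum((x - x.mean()) ** 2)

# Regression through the origin: X has a single column and no intercept.
X_orig = x.reshape(-1, 1)
cov_orig = sigma2 * np.linalg.inv(X_orig.T @ X_orig)
origin_var_matrix = cov_orig[0, 0]
origin_var_scalar = sigma2 / np.sum(x ** 2)

print(slope_var_matrix, slope_var_scalar)
print(origin_var_matrix, origin_var_scalar)
```

In both cases the relevant entry of the matrix expression agrees with the scalar formula up to floating-point error.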

— JohnK

2

Does the linear transformation of a Gaussian random variable help? Use the rule that if $x \sim \mathcal{N}(\mu,\Sigma)$, then $Ax + b \sim \mathcal{N}(A\mu + b, A\Sigma A^T)$.

Assume that $Y = X\beta + \epsilon$ is the underlying model and $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$.

$\therefore Y \sim \mathcal{N}(X\beta,\sigma^2 I_n)\\ X^TY \sim \mathcal{N}(X^TX\beta, \sigma^2 X^TX)\\ (X^TX)^{-1}X^TY \sim \mathcal{N}[\beta,\sigma^2 (X^TX)^{-1}]$

So $(X^TX)^{-1}X^T$ is just a complicated scaling matrix that transforms the distribution of $Y$.
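The transformation rule can be applied directly in code. This is only a sketch under assumed values: the design matrix `X`, the coefficient vector `beta`, and `sigma2` are randomly generated or hand-picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 30, 3, 0.5            # illustrative sizes and error variance
X = rng.normal(size=(n, p))          # hypothetical full-rank design matrix
beta = np.array([1.0, -2.0, 0.5])    # hypothetical true coefficients

# The matrix that maps Y to beta_hat.
A = np.linalg.inv(X.T @ X) @ X.T

# If Y ~ N(X beta, sigma^2 I_n), then A Y ~ N(A X beta, A (sigma^2 I_n) A^T).
mean_beta_hat = A @ (X @ beta)
cov_beta_hat = A @ (sigma2 * np.eye(n)) @ A.T

print(mean_beta_hat)
print(cov_beta_hat)
```

The mean collapses to `beta` and the covariance to $\sigma^2 (X^TX)^{-1}$, which is the formula from the question.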

I hope that was helpful.

— kedarps

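Normality of the error terms is not required for the derivation of the OLS estimator and its variance. All that is required is $E(\varepsilon)=0$ and $E(\varepsilon\varepsilon^T)=\sigma^2 I_n$. (Of course, normality is required to show that OLS achieves the Cramer-Rao lower bound, but that is not what the OP's post is about, is it?)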
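One way to see this is a small Monte Carlo experiment with deliberately non-normal errors. The sketch below uses uniform errors with mean zero and variance one; the design, the true coefficients, and the number of replications are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, reps = 40, 20000
x = np.linspace(-1.0, 1.0, n)
X = np.column_stack([np.ones(n), x])   # fixed design with an intercept
beta = np.array([2.0, -1.0])           # illustrative true coefficients

a = np.sqrt(3.0)                       # Uniform(-a, a) has variance a^2 / 3 = 1
sigma2 = a ** 2 / 3.0
eps = rng.uniform(-a, a, size=(reps, n))  # mean-zero, uncorrelated, non-normal

Y = X @ beta + eps                     # one simulated sample per row
A = np.linalg.inv(X.T @ X) @ X.T       # OLS maps each Y to beta_hat
beta_hats = Y @ A.T                    # shape (reps, 2)

empirical_cov = np.cov(beta_hats, rowvar=False)
theoretical_cov = sigma2 * np.linalg.inv(X.T @ X)

print(empirical_cov)
print(theoretical_cov)
```

Up to simulation noise, the empirical covariance of the estimates matches $\sigma^2 (X^TX)^{-1}$ even though the errors are uniform rather than Gaussian.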

— Mico

2

I'll take a different approach towards developing the intuition that underlies the formula $\text{Var}\,\hat{\beta}=\sigma^2 (X'X)^{-1}$. When developing intuition for the multiple regression model, it is helpful to consider the bivariate linear regression model, viz.,

$y_i=\alpha+\beta x_i + \varepsilon_i, \quad i=1,\ldots,n.$

$\alpha+\beta x_i$ is frequently called the deterministic contribution to $y_i$, and $\varepsilon_i$ is called the stochastic contribution. Expressed in terms of deviations from the sample means $(\bar{x},\bar{y})$, this model may also be written as

$(y_i-\bar{y}) = \beta(x_i-\bar{x})+(\varepsilon_i-\bar{\varepsilon}), \quad i=1,\ldots,n.$

To help develop the intuition, we will assume that the simplest Gauss-Markov assumptions are satisfied: $x_i$ nonstochastic, $\sum_{i=1}^n(x_i-\bar{x})^2>0$ for all $n$, and $\varepsilon_i \sim \text{iid}(0,\sigma^2)$ for all $i=1,\ldots,n$. As you already know very well, these conditions guarantee that

$\text{Var}\,\hat{\beta}=\tfrac{1}{n}\sigma^2(\text{Var}\,x)^{-1}\text{,}$

where $\text{Var}\,x$ is the sample variance of $x$. In words, this formula makes three claims: "The variance of $\hat{\beta}$ is inversely proportional to the sample size $n$, it is directly proportional to the variance of $\varepsilon$, and it is inversely proportional to the variance of $x$."

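Why should doubling the sample size, ceteris paribus, cause the variance of $\hat{\beta}$ to be cut in half? This result is intimately linked to the iid assumption applied to $\varepsilon$: since the individual errors are assumed to be iid, each observation should be treated ex ante as being equally informative. Doubling the number of observations therefore doubles the amount of information about the parameters that describe the (assumed linear) relationship between $x$ and $y$. Having twice as much information cuts the uncertainty about the parameters in half. Similarly, it should be straightforward to develop one's intuition as to why doubling $\sigma^2$ also doubles the variance of $\hat{\beta}$.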
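The first two claims can be verified with a few lines of arithmetic. In this sketch the regressor values and $\sigma^2$ are arbitrary, and "doubling the sample size" is modelled by observing the same `x` values twice, which is one simple way to hold the sample variance of $x$ fixed.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2 = 1.5                 # illustrative error variance
x = rng.normal(size=25)      # illustrative regressor values

def slope_variance(x, sigma2):
    """(1/n) * sigma^2 / Var x, with Var x the 1/n sample variance."""
    n = x.size
    return sigma2 / (n * x.var())

base = slope_variance(x, sigma2)
# Same x values observed twice: n doubles, the sample variance of x is unchanged.
doubled_n = slope_variance(np.tile(x, 2), sigma2)
# Same design, twice the error variance.
doubled_noise = slope_variance(x, 2 * sigma2)

print(base, doubled_n, doubled_noise)
```

Doubling $n$ halves the variance of the slope estimator, and doubling $\sigma^2$ doubles it.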

Let's turn, then, to your main question, which is about developing intuition for the claim that the variance of $\hat{\beta}$ is inversely proportional to the variance of $x$. To formalize notions, let us consider two separate bivariate linear regression models, called Model $(1)$ and Model $(2)$ from now on. We will assume that both models satisfy the assumptions of the simplest form of the Gauss-Markov theorem and that the models share the exact same values of $\alpha$, $\beta$, $n$, and $\sigma^2$. Under these assumptions, it is easy to show that $\text{E}\,\hat{\beta}{}^{(1)}=\text{E}\,\hat{\beta}{}^{(2)}=\beta$; in words, both estimators are unbiased. Crucially, we will also assume that whereas $\bar{x}^{(1)}=\bar{x}^{(2)}=\bar{x}$, $\text{Var}\,x^{(1)}\ne \text{Var}\,x^{(2)}$. Without loss of generality, let us assume that $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$. Which estimator of $\beta$ will have the smaller variance? Put differently, will $\hat{\beta}{}^{(1)}$ or $\hat{\beta}{}^{(2)}$ be closer, on average, to $\beta$? From the earlier discussion, we have $\text{Var}\,\hat{\beta}{}^{(k)} =\tfrac{1}{n}\sigma^2/\text{Var}\,x{}^{(k)}$ for $k=1,2$. Because $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ by assumption, it follows that $\text{Var}\,\hat{\beta}{}^{(1)} <\text{Var}\,\hat{\beta}{}^{(2)}$. What, then, is the intuition behind this result?

Because by assumption $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$, on average each $x_i^{(1)}$ will be farther away from $\bar{x}$ than is the case, on average, for $x_i^{(2)}$. Let us denote the expected average absolute difference between $x_i$ and $\bar{x}$ by $d_x$. The assumption that $\text{Var}\,x^{(1)}>\text{Var}\,x^{(2)}$ implies that $d_x^{(1)} >d_x^{(2)}$. The bivariate linear regression model, expressed in deviations from means, states that $d_y = \beta d_x^{(1)}$ for Model $(1)$ and $d_y = \beta d_x^{(2)}$ for Model $(2)$. If $\beta\ne0$, this means that the deterministic component of Model $(1)$, $\beta d_x^{(1)}$, has a greater influence on $d_y$ than does the deterministic component of Model $(2)$, $\beta d_x^{(2)}$. Recall that both models are assumed to satisfy the Gauss-Markov assumptions, that the error variances are the same in both models, and that $\beta^{(1)}=\beta^{(2)}=\beta$. Since Model $(1)$ imparts more information about the contribution of the deterministic component of $y$ than does Model $(2)$, it follows that the precision with which the deterministic contribution can be estimated is greater for Model $(1)$ than is the case for Model $(2)$. The counterpart of greater precision is a lower variance of the point estimate of $\beta$.
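The comparison of Model $(1)$ and Model $(2)$ can also be simulated. The following is a sketch, not part of the original argument: the values of $\alpha$, $\beta$, $\sigma$, $n$, the number of replications, and the factor of 2 between the two regressors are all assumptions made for illustration, and Gaussian errors are used only for convenience.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps = 30, 20000
alpha, beta, sigma = 1.0, 0.5, 1.0   # illustrative values shared by both models

z = rng.normal(size=n)
z -= z.mean()
x1 = 2.0 * z   # Model (1): larger Var x, same mean
x2 = 1.0 * z   # Model (2): smaller Var x, same mean

def simulate_slopes(x):
    """OLS slope estimates over many simulated samples with a fixed design x."""
    eps = rng.normal(scale=sigma, size=(reps, n))
    y = alpha + beta * x + eps                   # one sample per row
    dx = x - x.mean()
    dy = y - y.mean(axis=1, keepdims=True)       # deviations from sample means
    return (dy @ dx) / (dx @ dx)

slopes1 = simulate_slopes(x1)
slopes2 = simulate_slopes(x2)
emp_var1, emp_var2 = slopes1.var(), slopes2.var()

# (1/n) * sigma^2 / Var x for each model, with Var x the 1/n sample variance.
theory1 = sigma ** 2 / (n * x1.var())
theory2 = sigma ** 2 / (n * x2.var())

print(emp_var1, theory1)
print(emp_var2, theory2)
```

Both sets of estimates center on $\beta$, but the estimates from Model $(1)$ are visibly more tightly clustered, in line with the formula.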

It is reasonably straightforward to generalize the intuition obtained from studying the simple regression model to the general multiple linear regression model. The main complication is that instead of comparing scalar variances, it is necessary to compare the "size" of variance-covariance matrices. Having a good working knowledge of determinants, traces and eigenvalues of real symmetric matrices comes in very handy at this point :-)

— Mico

1

Say we have $n$ observations (or sample size) and $p$ parameters.

The covariance matrix $\operatorname{Var}(\hat{\beta})$ of the estimated parameters $\hat{\beta}_1,\hat{\beta}_2$ etc. is a representation of the accuracy of the estimated parameters.

If in an ideal world the data could be perfectly described by the model, then the noise will be $\sigma^2= 0$ . Now, the diagonal entries of $\operatorname{Var}(\hat{\beta})$ correspond to $\operatorname{Var}(\hat{\beta_1}),\operatorname{Var}(\hat{\beta_2})$ etc. The derived formula for the variance agrees with the intuition that if the noise is lower, the estimates will be more accurate.

In addition, as the number of measurements gets larger, the variance of the estimated parameters will decrease. So, overall the absolute value of the entries of $X^TX$ will be higher, as the number of columns of $X^T$ is $n$ and the number of rows of $X$ is $n$ , and each entry of $X^TX$ is a sum of $n$ product pairs. The absolute value of the entries of the inverse $(X^TX)^{-1}$ will be lower.

Hence, even if there is a lot of noise, we can still reach good estimates $\hat{\beta_i}$ of the parameters if we increase the sample size $n$ .

I hope this helps.

Reference: Section 7.3 on Least squares: Cosentino, Carlo, and Declan Bates. Feedback control in systems biology. CRC Press, 2011.

— Dilly Minch

1

This builds on @Alecos Papadopoulos' answer.

Recall that the result of a least-squares regression doesn't depend on the units of measurement of your variables. Suppose your X-variable is a length measurement, given in inches. Then rescaling X, say by multiplying by 2.54 to change the unit to centimeters, doesn't materially affect things. If you refit the model, the new regression estimate will be the old estimate divided by 2.54.

For a centered regressor, the $X'X$ matrix is $n$ times the sample variance of X, and hence reflects the scale of measurement of X. If you change the scale, you have to reflect this in your estimate of $\beta$, and this is done by multiplying by the inverse of $X'X$.
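The inches-to-centimeters example can be reproduced with a short sketch. The data below are simulated and the helper `ols` is a hypothetical function written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x_in = rng.uniform(10.0, 50.0, size=n)          # lengths in inches (made up)
y = 3.0 + 0.8 * x_in + rng.normal(size=n)       # illustrative response

def ols(x, y):
    """Return the OLS coefficients and (X'X)^{-1} for a model with intercept."""
    X = np.column_stack([np.ones(x.size), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    return xtx_inv @ X.T @ y, xtx_inv

beta_in, inv_in = ols(x_in, y)
beta_cm, inv_cm = ols(2.54 * x_in, y)           # same lengths in centimeters

print(beta_in[1], beta_cm[1])
print(inv_in[1, 1], inv_cm[1, 1])
```

The slope is divided by 2.54 while the intercept is unchanged, and the slope entry of $(X'X)^{-1}$ is divided by $2.54^2$, so the standard error of the slope rescales by the same factor as the slope itself.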

— Hong Ooi