Linear regression prediction interval

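If the best linear approximation (using least squares) of my data points is the line $y=mx+b$, how can I calculate the approximation error? If I compute the standard deviation of the differences between observations and predictions, $e_i=real(x_i)-(mx_i+b)$, can I then say that a real (but not observed) value $y_r=real(x_0)$ belongs to the interval $[y_p-\sigma, y_p+\sigma]$ (where $y_p=mx_0+b$) with probability of about 68%, assuming a normal distribution?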

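To clarify:

I made observations of a function $f(x)$ by evaluating it at some points $x_i$. I fit these observations to a line $l(x)=mx+b$. For an $x_0$ that I did not observe, I would like to know how large $f(x_0)-l(x_0)$ can be. Using the method above, is it correct to say that $f(x_0) \in [l(x_0)-\sigma, l(x_0)+\sigma]$ with prob. ~68%?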

— bmx

I think you are asking about prediction intervals. Note, however, that you use "$x_i$" rather than "$y_i$". Is this a typo? We do not predict $x$s.

— gung - Reinstate Monica

@gung: I use $x$ to denote, e.g., time, and $y$ is the value of some variable at that time, so $y=f(x)$ means that I made an observation $y$ at time $x$. I want to know how far the predictions of the fit function can be from the actual values of $y$. Does that make sense? The function $real(x_i)$ returns the "correct" value of $y$ at $x_i$, and my data points consist of $\{(x_i, real(x_i))\}$.

— bmx

That's perfectly reasonable. The part I'm focusing on is, e.g., "$e_i=real(x_i)-(mx_i+b)$". Usually we think of the errors / residuals in a regression model as "$e_i=y_i-(mx_i+b)$". The SD of the residuals does play a role in calculating prediction intervals. It's the "$x_i$" that looks odd to me; I'm wondering whether it's a typo, or whether you are asking about something I don't recognize.

— gung - Reinstate Monica

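I think I see. I missed your edit. That suggests that the system is perfectly deterministic and that, if you had access to the real underlying function, you could always predict $y_i$ perfectly, without error. That's not the way we typically think about regression models.

— gung - Reinstate Monica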

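bmx, it looks to me like you have a clear idea of your question and a good awareness of some of the issues. You might be interested in reviewing three closely related threads. stats.stackexchange.com/questions/17773 describes prediction intervals in non-technical terms; stats.stackexchange.com/questions/26702 gives a more mathematical description; and in stats.stackexchange.com/questions/9131, Rob Hyndman provides the formula you seek. If these don't fully answer your question, at least they may give you a standard notation and vocabulary for clarifying it.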

— whuber

@whuber has pointed you to three good answers, but perhaps I can still write something of value. Your explicit question, as I understand it, is:

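Given my fitted model, $\hat y_i=\hat mx_i + \hat b$ (notice I added 'hats'), and assuming my residuals are normally distributed, $\mathcal N(0, \hat\sigma^2_e)$, can I predict that an as yet unobserved response, $y_{new}$, with a known predictor value, $x_{new}$, will fall within the interval $(\hat y -\sigma_e, \hat y +\sigma_e)$, with probability 68%?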

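Intuitively, the answer seems like it should be 'yes', but the true answer is maybe. It would be 'yes' if the parameters (i.e., $m, b,$ and $\sigma$) were known without error. Since you estimated these parameters, we need to take their uncertainty into account.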
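The "maybe" can be checked by simulation. The following is a minimal sketch, not part of the original answer: it assumes a true line with slope 2, intercept 1 and noise SD 1 (all made-up values), fits a least-squares line many times, and counts how often a new observation lands within $\hat y \pm s_\text{error}$. The helper names (`fit_line`, `residual_sd`, `naive_coverage`) are hypothetical, and the residual SD is computed with $N-2$ degrees of freedom.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_bar - m * x_bar


def residual_sd(xs, ys, m, b):
    """Residual standard deviation with N - 2 degrees of freedom."""
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def naive_coverage(xs, x_new, trials=20000, seed=0):
    """Fraction of simulated new observations inside y_hat +/- s_error."""
    rng = random.Random(seed)
    true_m, true_b, true_sigma = 2.0, 1.0, 1.0  # assumed, for illustration only
    hits = 0
    for _ in range(trials):
        ys = [true_m * x + true_b + rng.gauss(0, true_sigma) for x in xs]
        m, b = fit_line(xs, ys)
        s = residual_sd(xs, ys, m, b)
        y_new = true_m * x_new + true_b + rng.gauss(0, true_sigma)
        if abs(y_new - (m * x_new + b)) <= s:
            hits += 1
    return hits / trials


# Few points, predicting outside the observed range.
small = naive_coverage([0, 1, 2, 3, 4], x_new=6.0)
# Many points, predicting near the middle of the observed range.
large = naive_coverage(list(range(100)), x_new=50.0, trials=4000)
print(f"naive coverage, N=5, extrapolating: {small:.3f}")
print(f"naive coverage, N=100, near the mean: {large:.3f}")
```

Under these assumptions the naive interval covers far less than 68% of new observations in the small-sample case and comes close to 68% only when there is a lot of data and the prediction is made near the centre of the observed $x$ values.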

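Let's first think about the standard deviation of your residuals. Because it is estimated from your data, the estimate carries some error. As a result, the distribution you should use to form your prediction interval is $t_\text{df error}$, not the normal. However, since the $t$ distribution converges rapidly to the normal as the degrees of freedom grow, this is less likely to be a problem in practice.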
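To see how quickly the $t$ critical values approach the normal ones, here is a small sketch that uses only the Python standard library. The functions `t_pdf`, `t_cdf` and `t_quantile` are hypothetical helpers written for this illustration (numerical integration plus bisection); a statistics library would normally be used instead.

```python
import math
from statistics import NormalDist


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|], using symmetry about zero."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area


def t_quantile(p, df):
    """Upper quantile by bisection; only valid for 0.5 < p < 1."""
    if not 0.5 < p < 1:
        raise ValueError("this sketch only handles 0.5 < p < 1")
    lo, hi = 0.0, 100.0  # assumes the quantile lies below 100
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)
for df in (3, 10, 30, 100):
    print(f"df={df:>3}: t = {t_quantile(0.975, df):.3f}   (normal z = {z:.3f})")
```

With only a handful of error degrees of freedom the $t$ multiplier is noticeably larger than $z$; by a few dozen degrees of freedom the two are close.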

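So we can simply use $\hat y_\text{new}\pm t_{(1-\alpha/2,\ \text{df error})}s$ instead of $\hat y_\text{new}\pm z_{(1-\alpha/2)}s$ and be done, right? Unfortunately, no. The bigger problem is that the estimate of the conditional mean of the response at that location is itself uncertain, because $\hat m$ and $\hat b$ are only estimates. Thus, the standard deviation of your predictions needs to incorporate more than just $s_\text{error}$. Because variances add, the estimated variance of the predictions is:

$$s^2_\text{predictions(new)}=s^2_\text{error}+\text{Var}(\hat mx_\text{new}+\hat b)$$

Notice that the "$x$" is subscripted to denote the specific value for the new observation, and that the "$s^2$" is subscripted correspondingly. That is, your prediction interval depends on the location of the new observation along the $x$ axis. The standard deviation of your predictions can be more conveniently estimated with the following formula: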

$$s_\text{predictions(new)}=\sqrt{s^2_\text{error}\left(1+\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right)}$$

As an interesting side note, we can infer a few facts about prediction intervals from this equation. First, prediction intervals will be narrower the more data we had when we built the prediction model (this is because there is less uncertainty in $\hat m$ and $\hat b$). Second, predictions will be most precise if they are made at the mean of the $x$ values you used to develop your model, as the numerator of the third term will be $0$. The reason is that, under normal circumstances, there is no uncertainty about the estimated slope at the mean of $x$, only some uncertainty about the true vertical position of the regression line. Thus, some lessons to be learned for building prediction models are: that more data is helpful, not with finding 'significance', but with improving the precision of future predictions; and that you should center your data collection efforts on the interval where you will need to be making predictions in the future (to minimize that numerator), but spread the observations as widely from that center as you can (to maximize that denominator).
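For readers who want to see where the convenient formula comes from, here is a short sketch of the derivation. It assumes the usual simple-regression setting (fixed $x_i$, independent errors with common variance $\sigma^2$) and writes $S_{xx}=\sum(x_i-\bar x)^2$, a shorthand not used in the text above. Since $\hat b=\bar y-\hat m\bar x$, the fitted value can be written as $\hat mx_\text{new}+\hat b=\bar y+\hat m(x_\text{new}-\bar x)$, and $\bar y$ and $\hat m$ are uncorrelated, so:

$$\begin{aligned}
\text{Var}(\hat mx_\text{new}+\hat b) &= \text{Var}(\bar y)+(x_\text{new}-\bar x)^2\,\text{Var}(\hat m) \\
&= \frac{\sigma^2}{N}+\frac{(x_\text{new}-\bar x)^2\,\sigma^2}{S_{xx}}
\end{aligned}$$

Replacing $\sigma^2$ with its estimate $s^2_\text{error}$ and adding the $s^2_\text{error}$ contributed by the new observation's own error gives

$$s^2_\text{predictions(new)}=s^2_\text{error}\left(1+\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{S_{xx}}\right),$$

whose square root is the formula for $s_\text{predictions(new)}$.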

Having calculated the correct value in this manner, we can then use it with the appropriate $t$ distribution as noted above.
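Putting the pieces together, a prediction interval could be computed along the following lines. This is a minimal sketch under stated assumptions: the data are made up for illustration, `prediction_interval` is a hypothetical helper, and the critical value is supplied by the caller (here 2.306, the tabulated 97.5th percentile of the $t$ distribution with $N-2=8$ degrees of freedom, giving a 95% interval).

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, upper, s_pred) for a new observation at x_new.

    t_crit is the t critical value for df = N - 2, supplied by the caller
    (for example, read from a t table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx                      # estimated slope
    b = y_bar - m * x_bar              # estimated intercept
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s_error_sq = sse / (n - 2)         # residual variance, df = N - 2
    s_pred = math.sqrt(s_error_sq * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx))
    y_hat = m * x_new + b
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred, s_pred


# Made-up illustrative data (not from the original question).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

for x_new in (5.5, 9.0, 15.0):
    lower, upper, s_pred = prediction_interval(xs, ys, x_new, t_crit=2.306)
    print(f"x_new={x_new:>4}: s_pred={s_pred:.4f}, 95% PI=({lower:.3f}, {upper:.3f})")
```

The interval is narrowest at $x_\text{new}=\bar x=5.5$ and widens as $x_\text{new}$ moves away from the mean, as the formula implies.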

— gung - Reinstate Monica