From a Bayesian probability perspective, why doesn't a 95% confidence interval contain the true parameter with 95% probability?



From the Wikipedia page on confidence intervals:

...if confidence intervals are constructed across many separate data analyses of repeated (and possibly different) experiments, the proportion of such intervals that contain the true value of the parameter will match the confidence level...

And from the same page:

A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained.

If I understand it correctly, this last statement is made with the frequentist interpretation of probability in mind. However, from a Bayesian probability perspective, why doesn't a 95% confidence interval contain the true parameter with 95% probability? And if it doesn't, what is wrong with the following reasoning?

If I have a process that I know produces a correct answer 95% of the time, then the probability of the next answer being correct is 0.95 (given that I don't have any extra information regarding the process). Similarly, if someone shows me a confidence interval that was created by a process that contains the true parameter 95% of the time, should I not be right in saying that it contains the true parameter with 0.95 probability, given what I know?

This question is similar to, but not the same as, "Why does a 95% CI not imply a 95% chance of containing the mean?" The answers to that question focus on why a 95% CI does not imply a 95% chance of containing the mean from a frequentist perspective. My question is the same, but from a Bayesian probability perspective.


One way to think of this is that the 95% CI is a "long-run average". There are many ways you could split up the "short-run" cases so that each gets fairly arbitrary coverage, yet on average the overall coverage is still 95%. Another, more abstract way: generate x_i ~ Bernoulli(p_i) for i = 1, 2, ... such that the p_i average to 0.95. There are an infinite number of ways you could do this. Here x_i indicates whether or not the CI made with the i-th data set contained the parameter, and p_i is its coverage probability in that case.
probabilityislogic
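The averaging in the comment above can be sketched numerically. The particular split of coverage probabilities below (half the cases at 0.99, half at 0.91) is an arbitrary illustrative choice; any mix averaging to 0.95 behaves the same way in the long run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary per-case coverage probabilities p_i that average to 0.95:
# half the intervals cover with probability 0.99, half with 0.91.
p = np.array([0.99, 0.91] * 5000)

# x_i ~ Bernoulli(p_i): whether the i-th interval contained the parameter.
x = rng.random(p.size) < p

print(round(float(p.mean()), 3), round(float(x.mean()), 3))  # both ≈ 0.95
```

Individual cases can have coverage far from 0.95, but the long-run fraction of intervals containing the parameter still comes out at 95%.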

Answers:



Update: With the benefit of a few years' hindsight, I have penned a more concise treatment of essentially the same material in answer to a similar question.


How to Construct a Confidence Region

Let's begin with a general method of constructing confidence regions. It can be applied to a single parameter, to yield a confidence interval or set of intervals; and it can be applied to two or more parameters, to yield higher-dimensional confidence regions.

We assert that the observed statistic D originates from a distribution with parameter θ, namely the sampling distribution s(d|θ) over possible statistics d, and we seek a confidence region for θ in the set Θ of its possible values. Define a Highest Density Region (HDR): the h-HDR of a PDF is the smallest subset of its domain that supports probability h. Denote the h-HDR of s(d|ψ) as H_ψ, for any ψ ∈ Θ. Then the h confidence region for θ, given data D, is the set C_D = {ϕ : D ∈ H_ϕ}. A typical value of h is 0.95.
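As an illustration (not part of the original construction), here is a numerical sketch of this recipe for a hypothetical N(θ, 1) sampling distribution. By symmetry, each 0.95-HDR is the interval ψ ± 1.96, and C_D collects every ψ whose HDR contains the observed D:

```python
import numpy as np

# Hypothetical setup: s(d|psi) = N(psi, 1), observed statistic D.
# The 0.95-HDR of a unit normal centred at psi is psi ± 1.96.
D, h_halfwidth = 2.3, 1.96
psi_grid = np.linspace(-10, 10, 100_001)

# C_D = {psi : D ∈ H_psi}, evaluated over the grid.
in_CD = np.abs(D - psi_grid) <= h_halfwidth
C_D = psi_grid[in_CD]

print(round(float(C_D.min()), 2), round(float(C_D.max()), 2))  # ≈ 0.34 and 4.26
```

For this symmetric case the construction recovers the familiar interval D ± 1.96.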

A Frequentist Interpretation

From the preceding definition of a confidence region follows the equivalence

D ∈ H_ψ ⟷ ψ ∈ C_D

since C_D = {ϕ : D ∈ H_ϕ}. Now imagine a large set of (imaginary) observations {D_i}, taken in similar circumstances to D; i.e. they are samples from s(d|θ). Since H_θ supports probability mass h of the PDF s(d|θ), we have P(D_i ∈ H_θ) = h for all i. Therefore, the fraction of {D_i} for which D_i ∈ H_θ is h; and so, using the equivalence above, the fraction of {D_i} for which θ ∈ C_{D_i} is also h.

That, then, is what the frequentist claim for the h confidence region for θ amounts to:

Take a large number of imaginary observations {D_i} from the sampling distribution s(d|θ) that gave rise to the observed statistic D. Then θ lies within a fraction h of the analogous but imaginary confidence regions {C_{D_i}}.

The confidence region C_D therefore makes no claim about the probability that θ lies somewhere. The reason is simply that there is nothing in the formulation that allows us to speak of a probability distribution over θ. The interpretation is elaborate superstructure which does not improve the base. The base is only s(d|θ) and D, in which θ does not appear as a distributed quantity, and there is no information we can use to address that. There are basically two ways to get a distribution over θ:

  1. Assign a distribution directly from the information at hand: p(θ|I).
  2. Relate θ to another distributed quantity x: p(θ|I) = ∫ p(θx|I) dx = ∫ p(θ|xI) p(x|I) dx.

In both cases, θ must appear somewhere on the left. Frequentists cannot use either method, because they both require a heretical prior.

A Bayesian View

The most a Bayesian can make of the h confidence region C_D, given without qualification, is simply the direct interpretation: that it is the set of ϕ for which D falls in the h-HDR H_ϕ of the sampling distribution s(d|ϕ). This does not necessarily tell us much about θ, and here's why.

The probability that θ ∈ C_D, given D and the background information I, is:

P(θ ∈ C_D | DI) = ∫_{C_D} p(θ|DI) dθ = ∫_{C_D} p(D|θI) p(θ|I) / p(D|I) dθ

Notice that unlike the frequentist interpretation, we have immediately demanded a distribution over θ. The background information I tells us, as before, that the sampling distribution is s(d|θ):

P(θ ∈ C_D | DI) = ∫_{C_D} s(D|θ) p(θ|I) dθ / p(D|I)

i.e.  P(θ ∈ C_D | DI) = ∫_{C_D} s(D|θ) p(θ|I) dθ / ∫ s(D|θ) p(θ|I) dθ
Now this expression does not in general evaluate to h, which is to say, the h confidence region C_D does not always contain θ with probability h. In fact it can be starkly different from h. There are, however, many common situations in which it does evaluate to h, which is why confidence regions are often consistent with our probabilistic intuitions.

Suppose, for example, that the prior joint PDF of d and θ is symmetric: p_{d,θ}(d, θ|I) = p_{d,θ}(θ, d|I). (Clearly this involves an assumption that the PDF ranges over the same domain in d and θ.) Then, if the prior is p(θ|I) = f(θ), we have s(D|θ) f(θ) = s(θ|D) f(D). Hence

P(θ ∈ C_D | DI) = ∫_{C_D} s(θ|D) dθ / ∫ s(θ|D) dθ

i.e.  P(θ ∈ C_D | DI) = ∫_{C_D} s(θ|D) dθ
From the definition of an HDR we know that for any ψ ∈ Θ

∫_{H_ψ} s(d|ψ) dd = h    and therefore that    ∫_{H_D} s(d|D) dd = h    or equivalently    ∫_{H_D} s(θ|D) dθ = h

Therefore, given that s(d|θ) f(θ) = s(θ|d) f(d), C_D = H_D implies P(θ ∈ C_D | DI) = h. The antecedent satisfies

C_D = H_D ⟷ ∀ψ [ψ ∈ C_D ⟷ ψ ∈ H_D]

Applying the equivalence near the top:

C_D = H_D ⟷ ∀ψ [D ∈ H_ψ ⟷ ψ ∈ H_D]

Thus, the confidence region C_D contains θ with probability h if, for all possible values ψ of θ, the h-HDR of s(d|ψ) contains D if and only if the h-HDR of s(d|D) contains ψ.

Now the symmetric relation D ∈ H_ψ ⟷ ψ ∈ H_D is satisfied for all ψ when s(ψ+δ|ψ) = s(D−δ|D) for all δ that span the support of s(d|D) and s(d|ψ). We can therefore form the following argument:

  1. s(d|θ) f(θ) = s(θ|d) f(d)    (premise)
  2. ∀ψ ∀δ [s(ψ+δ|ψ) = s(D−δ|D)]    (premise)
  3. ∀ψ ∀δ [s(ψ+δ|ψ) = s(D−δ|D)] ⟹ ∀ψ [D ∈ H_ψ ⟷ ψ ∈ H_D]
  4. ∀ψ [D ∈ H_ψ ⟷ ψ ∈ H_D]
  5. ∀ψ [D ∈ H_ψ ⟷ ψ ∈ H_D] ⟹ C_D = H_D
  6. C_D = H_D
  7. [s(d|θ) f(θ) = s(θ|d) f(d)] ∧ [C_D = H_D] ⟹ P(θ ∈ C_D | DI) = h
  8. P(θ ∈ C_D | DI) = h

Let's apply the argument to a confidence interval on the mean of a 1-D normal distribution N(μ, σ), given a sample mean x̄ from n measurements. We have θ = μ and d = x̄, so that the sampling distribution is

s(d|θ) = (√n / (σ√(2π))) e^(−(n/(2σ²))(d−θ)²)

Suppose also that we know nothing about θ before taking the data (except that it's a location parameter) and therefore assign a uniform prior: f(θ) = k. Clearly we now have s(d|θ) f(θ) = s(θ|d) f(d), so the first premise is satisfied. Let s(d|θ) = g((d−θ)²). (i.e. It can be written in that form.) Then

s(ψ+δ|ψ) = g((ψ+δ−ψ)²) = g(δ²)    and    s(D−δ|D) = g((D−δ−D)²) = g(δ²)

so that    ∀ψ ∀δ [s(ψ+δ|ψ) = s(D−δ|D)]
whereupon the second premise is satisfied. Both premises being true, the eight-point argument leads us to conclude that the probability that θ lies in the confidence interval CD is h!
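A quick simulation is consistent with this conclusion. The prior range below is an arbitrary stand-in for the (improper) uniform prior: drawing θ from it, then data given θ, the 95% CI covers θ 95% of the time in the Bayesian (joint) sense:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, trials = 1.0, 10, 100_000  # assumed known sigma and sample size

# theta from a wide uniform prior (stand-in for the flat prior), then
# the sample mean xbar ~ N(theta, sigma^2 / n).
theta = rng.uniform(-50, 50, trials)
xbar = rng.normal(theta, sigma / np.sqrt(n))

# Central 95% confidence interval for the mean: xbar ± 1.96·sigma/√n.
half = 1.96 * sigma / np.sqrt(n)
covered = np.abs(xbar - theta) <= half

print(round(float(covered.mean()), 3))  # ≈ 0.95
```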

We therefore have an amusing irony:

  1. The frequentist who assigns the h confidence interval cannot say that P(θ ∈ C_D) = h, no matter how innocently uniform θ looks before incorporating the data.
  2. The Bayesian who would not assign an h confidence interval in that way knows anyhow that P(θ ∈ C_D | DI) = h.

Final Remarks

We have identified conditions (i.e. the two premises) under which the h confidence region does indeed yield probability h that θ ∈ C_D. A frequentist will baulk at the first premise, because it involves a prior on θ, and this sort of deal-breaker is inescapable on the route to a probability. But for a Bayesian, it is acceptable---nay, essential. These conditions are sufficient but not necessary, so there are many other circumstances under which the Bayesian P(θ ∈ C_D | DI) equals h. Equally though, there are many circumstances in which P(θ ∈ C_D | DI) ≠ h, especially when the prior information is significant.

We have applied a Bayesian analysis just as a consistent Bayesian would, given the information at hand, including the statistics D. But a Bayesian, if he possibly can, will apply his methods to the raw measurements instead---to the {x_i}, rather than x̄. Oftentimes, collapsing the raw data into summary statistics D destroys information in the data; and then the summary statistics are incapable of speaking as eloquently as the original data about the parameters θ.


Would it be correct to say that a Bayesian is committed to taking all the available information into account, while the interpretation given in the question ignored D in some sense?
qbolec

Is this a good mental picture to illustrate the situation: imagine a grayscale image, where the intensity of pixel (x, y) is the joint probability of the real parameter being y and the observed statistic being x. In each row y, we mark the pixels which hold 95% of the mass of the row. For each observed statistic x, we define CI(x) to be the set of rows which have marked pixels in column x. Now, if we choose (x, y) randomly, then CI(x) will contain y iff (x, y) was marked, and the mass of marked pixels is 95% for each y. So frequentists say that keeping y fixed, the chance is 95%; the OP says that not fixing y also gives 95%; and Bayesians fix y and don't know.
qbolec

@qbolec It is correct to say that in the Bayesian method one cannot arbitrarily ignore some information while taking account of the rest. Frequentists say that for all y the expectation of [y ∈ CI(x)] (as a Boolean integer) under the sampling distribution prob(x|y, I) is 0.95. The frequentist 0.95 is not a probability but an expectation.
CarbonFlambe--Reinstate Monica


from a Bayesian probability perspective, why doesn't a 95% confidence interval contain the true parameter with 95% probability?

Two answers to this, the first being less helpful than the second:

  1. There are no confidence intervals in Bayesian statistics, so the question doesn't pertain.

  2. In Bayesian statistics, there are however credible intervals, which play a similar role to confidence intervals. If you view priors and posteriors in Bayesian statistics as quantifying the reasonable belief that a parameter takes on certain values, then the answer to your question is yes, a 95% credible interval represents an interval within which a parameter is believed to lie with 95% probability.

If I have a process that I know produces a correct answer 95% of the time then the probability of the next answer being correct is 0.95 (given that I don't have any extra information regarding the process).

yes, the process guesses a right answer with 95% probability

Similarly if someone shows me a confidence interval that is created by a process that will contain the true parameter 95% of the time, should I not be right in saying that it contains the true parameter with 0.95 probability, given what I know?

Just the same as your process, the confidence interval guesses the correct answer with 95% probability. We're back in the world of classical statistics here: before you gather the data you can say there's a 95% probability of randomly gathered data determining the bounds of the confidence interval such that the mean is within the bounds.

With your process, after you've gotten your answer, you can't say based on whatever your guess was, that the true answer is the same as your guess with 95% probability. The guess is either right or wrong.

And just the same as your process, in the confidence interval case, after you've gotten the data and have an actual lower and upper bound, the mean is either within those bounds or it isn't, i.e. the chance of the mean being within those particular bounds is either 1 or 0. (Having skimmed the question you refer to it seems this is covered in much more detail there.)
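The pre-data versus post-data distinction above can be sketched in a short simulation (the mean, sigma, and sample size are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n, trials = 10.0, 2.0, 25, 50_000  # hypothetical true mean etc.

# Before the data: over many repetitions, the random bounds cover mu ~95% of the time.
samples = rng.normal(mu, sigma, (trials, n))
xbar = samples.mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)
coverage = np.mean((xbar - half <= mu) & (mu <= xbar + half))
print(round(float(coverage), 3))  # ≈ 0.95

# After the data: one particular interval either covers mu or it doesn't.
lo, hi = xbar[0] - half, xbar[0] + half
print(lo <= mu <= hi)  # simply True or False, never "95%"
```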

How to interpret a confidence interval given to you if you subscribe to a Bayesian view of probability.

There are a couple of ways of looking at this

  1. Technically, the confidence interval hasn't been produced using a prior and Bayes theorem, so if you had a prior belief about the parameter concerned, there would be no way you could interpret the confidence interval in the Bayesian framework.

  2. Another widely used and respected interpretation of confidence intervals is that they provide a "plausible range" of values for the parameter (see, e.g., here). This de-emphasises the "repeated experiments" interpretation.

Moreover, under certain circumstances, notably when the prior is uninformative (doesn't tell you anything, e.g. flat), confidence intervals can produce exactly the same interval as a credible interval. In these circumstances, as a Bayesianist you could argue that had you taken the Bayesian route you would have gotten exactly the same results and you could interpret the confidence interval in the same way as a credible interval.
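A small sketch of this coincidence, using hypothetical summary statistics: with known sigma and a flat prior, the posterior for the mean is N(x̄, σ²/n), so posterior draws land inside the frequentist CI about 95% of the time:

```python
import numpy as np

rng = np.random.default_rng(2)
xbar, sigma, n = 3.2, 1.0, 25          # hypothetical summary statistics
half = 1.96 * sigma / np.sqrt(n)
lo, hi = xbar - half, xbar + half      # frequentist 95% CI

# Under a flat prior the posterior for the mean is N(xbar, sigma^2/n).
draws = rng.normal(xbar, sigma / np.sqrt(n), 1_000_000)
inside = np.mean((draws >= lo) & (draws <= hi))

print(round(float(inside), 3))  # ≈ 0.95: the CI doubles as a credible interval
```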


But surely confidence intervals exist even if I subscribe to a Bayesian view of probability; they just won't disappear, right? :) The situation I was asking about was how to interpret a confidence interval given to you if you subscribe to a Bayesian view of probability.
Rasmus Bååth

The problem is that confidence intervals aren't produced using a Bayesian methodology. You don't start with a prior. I'll edit the post to add something which might help.
TooTone


I'll give you an extreme example where they are different.

Suppose I create my 95% confidence interval for a parameter θ as follows. Start by sampling the data. Then generate a random number between 0 and 1. Call this number u. If u is less than 0.95 then return the interval (−∞, ∞). Otherwise return the "null" interval.

Now over continued repetitions, 95% of the CIs will be "all numbers" and hence contain the true value. The other 5% contain no values and hence have zero coverage. Overall, this is a useless but technically correct 95% CI.

The Bayesian credible interval will be either 100% or 0%. Not 95%.


So is it correct to say that before seeing a confidence interval there is a 95% probability that it will contain the true parameter, but for any given confidence interval the probability that it covers the true parameter depends on the data (and our prior)? To be honest, what I'm really struggling with is how useless confidence intervals sounds (credible intervals I like on the other hand) and the fact that I never the less will have to teach them to our students next week... :/
Rasmus Bååth

This question has some more examples, plus a very good paper comparing the two approaches
probabilityislogic


"from a Bayesian probability perspective, why doesn't a 95% confidence interval contain the true parameter with 95% probability? "

In Bayesian statistics the parameter is not one unknown value; it is treated as a random variable with a distribution. There is no interval containing "the true value"; from a Bayesian point of view that does not even make sense. Since the parameter is a random variable, you can perfectly well state the probability of it lying between x_inf and x_max if you know its distribution. It's just a different mindset about the parameters: Bayesians usually use the median or the mean of the parameter's distribution as an "estimate". There is no confidence interval in Bayesian statistics; the similar object is called a credible interval.

Now, from a frequentist point of view, the parameter is a "fixed value", not a random variable. Can you really obtain a probability interval (a 95% one) for it? Remember that it is a fixed value, not a random variable with a known distribution. That is why you pasted the text: "A confidence interval does not predict that the true value of the parameter has a particular probability of being in the confidence interval given the data actually obtained."

The idea of repeating the experiment over and over is not Bayesian reasoning; it is frequentist. Imagine a real-life experiment that you can only perform once in your lifetime: can you, or should you, build that confidence interval (from the classical point of view)?

But in real life the results can come out pretty close (Bayesian vs frequentist), which is perhaps why it can be confusing.

Licensed under cc by-sa 3.0 with attribution required.