What fraction of replication experiments will have an effect size within the 95% confidence interval of the first experiment?



Let us stick to an ideal situation: random sampling, Gaussian populations, equal variances, no P-hacking, and so on.

Step 1. Run an experiment comparing two sample means, and compute a 95% confidence interval for the difference between the two population means.

Step 2. Run many more experiments (thousands). The difference between means will vary from experiment to experiment because of random sampling.

Question: What fraction of the differences between means from the collection of Step 2 experiments will lie within the confidence interval from Step 1?

That can't be answered as posed. It all depends on what happened in Step 1. If the Step 1 experiment was highly atypical, the answer could be very low.

So imagine both steps being repeated many times (with Step 2 repeated many more times). It should then be possible to come up with an expectation for what fraction of the repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment.

It seems that the answer to this question needs to be understood in order to evaluate the reproducibility of studies.
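As a minimal R sketch of the two steps (the group sizes, true difference, and variance below are hypothetical choices, not from the question):

```r
set.seed(1)
n <- 10        # per-group sample size (hypothetical)
delta <- 1     # true difference between population means (hypothetical)
n.rep <- 1e4   # number of Step 2 replications

# Step 1: one experiment comparing two sample means, with a 95% CI
# for the difference between the population means.
g1 <- rnorm(n, 0, 1)
g2 <- rnorm(n, delta, 1)
ci <- t.test(g2, g1)$conf.int

# Step 2: many replicate experiments; record each observed difference.
diffs <- replicate(n.rep, mean(rnorm(n, delta, 1)) - mean(rnorm(n, 0, 1)))

# Fraction of replicate differences captured by the Step 1 interval.
mean(ci[1] <= diffs & diffs <= ci[2])
```

Rerunning without the fixed seed gives a different answer each run, which is exactly the dependence on Step 1 noted above.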


For each original (Step 1) experiment, define x_i as the fraction of subsequent (Step 2) results that fall within the confidence interval of the original result. You want to compute the empirical distribution of x_i?
Matthew Gunn

Yes, you understand exactly what I am asking.
Harvey Motulsky

@MatthewGunn asked whether you want the empirical distribution of the "capture rate" for future observations. Your post asked about "...what fraction of repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment," which is an expected value (a mean), not a distribution.

Whuber's analysis is excellent, but in case you need a citation, there is a paper that discusses precisely this question in great detail: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it the capture percentage of a confidence interval.
amoeba says Reinstate Monica

Answers:



Analysis

Because this is a conceptual question, to keep things simple let us consider the situation in which a 1 − α confidence interval for the mean μ,

[x̄(1) + Z_{α/2} s(1)/√n,  x̄(1) + Z_{1−α/2} s(1)/√n],

is constructed from a first random sample x(1) of size n, and a second random sample x(2) of size m is obtained, all from the same Normal(μ, σ²) distribution. (If you like, you may replace the Z values by values from the Student t distribution with n − 1 degrees of freedom; the following analysis does not change.)

The chance that the mean of the second sample lies within the CI determined by the first sample is

Pr(x̄(1) + Z_{α/2} s(1)/√n ≤ x̄(2) ≤ x̄(1) + Z_{1−α/2} s(1)/√n) = Pr(Z_{α/2} s(1)/√n ≤ x̄(2) − x̄(1) ≤ Z_{1−α/2} s(1)/√n).

Because the mean of the first sample, x̄(1), is independent of the standard deviation of the first sample, s(1) (this requires Normality), and the second sample is independent of the first, the difference of sample means U = x̄(2) − x̄(1) is independent of s(1). Moreover, for this symmetric interval, Z_{α/2} = −Z_{1−α/2}. Therefore, writing S for the random variable s(1) and squaring both inequalities, the probability in question equals

Pr(U² ≤ (Z_{1−α/2}/√n)² S²) = Pr(U²/S² ≤ (Z_{1−α/2}/√n)²).

Elementary laws of expectation imply that U has a mean of 0 and a variance of

Var(U) = Var(x̄(2) − x̄(1)) = σ²(1/m + 1/n).

Because U is a linear combination of Normal variables, it too has a Normal distribution. Therefore U² is σ²(1/n + 1/m) times a χ²(1) variable. We already knew that S² is σ²/(n−1) times a χ²(n−1) variable. Consequently, U²/S² is 1/n + 1/m times a variable with an F(1, n−1) distribution. The required probability is given by the F distribution as

(1)   F_{1,n−1}(Z²_{1−α/2} / (1 + n/m)).
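Formula (1) is easy to evaluate in R. The helper below is a sketch (the function name is my own); it uses the Student t multiplier for the interval, which, as noted above, may be substituted for Z without changing the analysis:

```r
# Chance (1) that the second sample mean falls inside the first CI,
# using the Student t multiplier for the interval.
capture.prob <- function(n, m, alpha) {
  Z <- qt(1 - alpha/2, n - 1)
  pf(Z^2 / (1 + n/m), 1, n - 1)
}

capture.prob(50, 50, 0.05)  # about 0.838
capture.prob(5, 5, 0.05)    # noticeably higher for small samples
```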

Discussion

An interesting case is when the second sample is the same size as the first, so that n/m=1 and only n and α determine the probability. Here are the values of (1) plotted against α for n=2,5,20,50.

Figure

As n increases, the graphs approach a limiting value at each α. The conventional test size α = 0.05 is marked by a vertical gray line. For large values of n = m, the limiting chance at α = 0.05 is approximately 85%.

By understanding this limit, we will peer past the details of small sample sizes and better understand the crux of the matter. As n=m grows large, the F distribution approaches a χ2(1) distribution. In terms of the standard Normal distribution Φ, the probability (1) then approximates

Φ(Z1α/22)Φ(Zα/22)=12Φ(Zα/22).

For instance, with α=0.05, Zα/2/21.96/1.411.386 and Φ(1.386)0.083. Consequently the limiting value attained by the curves at α=0.05 as n increases will be 12(0.083)=10.166=0.834. You can see it has almost been reached for n=50 (where the chance is 0.8383.)
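The limiting value can be checked with a one-line R calculation (a quick sketch of the computation just described):

```r
# Limiting capture probability as n = m grows, via the Normal approximation.
alpha <- 0.05
z <- qnorm(1 - alpha/2)              # about 1.96
limit <- 1 - 2 * pnorm(-z / sqrt(2))
limit                                # about 0.834
```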

For small α, the relationship between α and the complementary probability--the risk that the CI does not cover the second mean--is almost perfectly a power law. Another way to express this is that the log complementary probability is almost a linear function of logα. The limiting relationship is approximately

log(2Φ(Zα/22))1.79712+0.557203log(20α)+0.00657704(log(20α))2+

In other words, for large n=m and α anywhere near the traditional value of 0.05, (1) will be close to

10.166(20α)0.557.
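As a rough numerical check of this power-law approximation (the particular α values below are my own choices):

```r
# Compare the exact limiting probability with the power-law approximation.
alpha  <- c(0.05, 0.01)
exact  <- 1 - 2 * pnorm(qnorm(alpha/2) / sqrt(2))
approx <- 1 - 0.166 * (20 * alpha)^0.557
rbind(exact = exact, approx = approx)   # agree to about two decimal places
```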

(This reminds me very much of the analysis of overlapping confidence intervals I posted at /stats//a/18259/919. Indeed, the magic power there, 1.91, is very nearly the reciprocal of the magic power here, 0.557. At this point you should be able to re-interpret that analysis in terms of reproducibility of experiments.)


Experimental results

These results are confirmed with a straightforward simulation. The following R code returns the frequency of coverage, the chance as computed with (1), and a Z-score to assess how much they differ. The Z-scores are typically less than 2 in size, regardless of n, m, μ, σ, α (or even whether a Z or t CI is computed), indicating the correctness of formula (1).

n <- 3      # First sample size
m <- 2      # Second sample size
sigma <- 2 
mu <- -4
alpha <- 0.05
n.sim <- 1e4
#
# Compute the multiplier.
#
Z <- qnorm(alpha/2)
#Z <- qt(alpha/2, df=n-1) # Use this for a Student t C.I. instead.
#
# Draw the first sample and compute the CI as [l.1, u.1].
#
x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
x.1.bar <- colMeans(x.1)
s.1 <- apply(x.1, 2, sd)
l.1 <- x.1.bar + Z * s.1 / sqrt(n)
u.1 <- x.1.bar - Z * s.1 / sqrt(n)
#
# Draw the second sample and compute the mean as x.2.
#
x.2 <- colMeans(matrix(rnorm(m*n.sim, mu, sigma), nrow=m))
#
# Compare the second sample means to the CIs.
#
covers <- l.1 <= x.2 & x.2 <= u.1
#
# Compute the theoretical chance and compare it to the simulated frequency.
#
f <- pf(Z^2 / ((n * (1/n + 1/m))), 1, n-1)
m.covers <- mean(covers)
(c(Simulated=m.covers, Theoretical=f, Z=(m.covers - f)/sd(covers) * sqrt(length(covers))))

You say that using t instead of z makes no big difference. I believe you, but I haven't checked it yet. With small sample sizes, the two critical values can be quite different, and the t distribution is the correct way to compute the CI. Why do you prefer using z?
Harvey Motulsky

Regarding Z versus t: it is interesting that the curves in the figure start high and descend to their limit. In particular, the chance of reproducing a significant result is then much higher for small samples than for large! Note that there's nothing to check, because you are free to interpret Zα as a percentage point of the appropriate Student t distribution (or of any other distribution you might care to name). Nothing changes in the analysis. If you do want to see the particular effects, uncomment the qt line in the code.
whuber

+1. This is a great analysis (and your answer has way too few upvotes for what it is). I just came across a paper that discusses this very question in great detail and I thought you might be interested: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it capture percentage of a confidence interval.
amoeba says Reinstate Monica

@Amoeba Thank you for the reference. I especially appreciate one general conclusion therein: "Replication is central to the scientific method, and researchers should not turn a blind eye to it just because it makes salient the inherent uncertainty of a single study."
whuber

Update: Thanks to the ongoing discussion in the sister thread, I now believe my reasoning in the above comment was not correct. 95% CIs have 83% "replication-capture", but this is a statement about repeated sampling and cannot be interpreted as giving a probability conditioned on one particular confidence interval, at least not without further assumptions. (Perhaps both this and previous comments should better be deleted in order not to confuse further readers.)
amoeba says Reinstate Monica


[Edited to fix the bug WHuber pointed out.]

I altered @Whuber's R code to use the t distribution, and plot coverage as a function of sample size. The results are below. At high sample size, the results match WHuber's of course.

[Figure: coverage plotted as a function of sample size, for alpha = 0.01 and 0.05]

And here is the adapted R code, run twice with alpha set to either 0.01 or 0.05.

sigma <- 2 
mu <- -4
alpha <- 0.01
n.sim <- 1e5
#
# Compute the multiplier.

for (n in c(3,5,7,10,15,20,30,50,100,250,500,1000)) {
  T <- qt(alpha/2, df=n-1)
  #
  # Draw the first sample and compute the CI as [l.1, u.1].
  #
  x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
  x.1.bar <- colMeans(x.1)
  s.1 <- apply(x.1, 2, sd)
  l.1 <- x.1.bar + T * s.1 / sqrt(n)
  u.1 <- x.1.bar - T * s.1 / sqrt(n)
  #
  # Draw the second sample and compute the mean as x.2.
  #
  x.2 <- colMeans(matrix(rnorm(n*n.sim, mu, sigma), nrow=n))
  #
  # Compare the second sample means to the CIs and report coverage.
  #
  covers <- l.1 <= x.2 & x.2 <= u.1
  coverage <- mean(covers)
  print(coverage)
}

And here is the GraphPad Prism file that made the graph.


I believe your plots do not use the t distribution, due to a bug: you set the value of T outside the loop! If you would like to see the correct curves, just plot them directly using the theoretical result in my answer, as given at the end of my R code (rather than relying on the simulated results): curve(pf(qt(.975, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), 2, 1000, log="x", ylim=c(.8,1), col="Blue"); curve(pf(qt(.995, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), add=TRUE, col="Red")
whuber

@whuber. Yikes! Of course you are right. Embarrassing. I've fixed it. As you pointed out the coverage is higher with tiny sample sizes. (I fixed the simulations, and didn't try your theoretical function.)
Harvey Motulsky

I am glad you fixed it, because it is very interesting how high the coverage is for small sample sizes. We could also invert your question and use the formula to determine what value of Zα/2 to use if we wished to assure (before doing any experiments), with probability p=0.95 (say), that the mean of the second experiment would lie within the two-sided 1−α confidence interval determined from the first. Doing so, as a routine practice, could be one intriguing way of addressing some criticism of NHST.
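As a sketch of that inversion (assuming n = m; the helper name is my own): setting the capture probability in formula (1) equal to p and solving gives Z = √(2·qf(p, 1, n−1)).

```r
# Multiplier Z guaranteeing, with probability p, that the second mean
# falls inside the first interval (n = m assumed), together with the
# two-sided confidence level that multiplier corresponds to.
capture.multiplier <- function(n, p = 0.95) {
  Z <- sqrt(2 * qf(p, 1, n - 1))
  c(Z = Z, implied.CI.level = 1 - 2 * pnorm(-Z))
}

capture.multiplier(50)   # Z near 2.84, i.e. roughly a 99.5% CI
```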
whuber

@whuber I think the next step is to look at the distribution of coverage. So far, we have the average coverage (average of many first experiments, with average of many second experiments each). But depending on what the first experiment is, in some cases the average coverage will be poor. It would be interesting to see the distribution. I'm trying to learn R well enough to find out.
Harvey Motulsky

Regarding the distributions, see the paper I linked to in the comments above.
amoeba says Reinstate Monica
Licensed under cc by-sa 3.0 with attribution required.