What fraction of replication experiments will have an effect size within the 95% confidence interval of the first experiment?



Let us stick to an ideal situation: random sampling, Gaussian populations, equal variances, no P-hacking, and so on.

Step 1. Run an experiment comparing two sample means, and compute a 95% confidence interval for the difference between the two population means.

Step 2. Run many more experiments (thousands). The difference between means will vary from experiment to experiment because of random sampling.

Question: What fraction of the differences between means from the collection of Step 2 experiments will lie within the confidence interval from Step 1?

That can't be answered as posed. It all depends on what happened in Step 1. If the Step 1 experiment was highly atypical, the answer could be very low.

So imagine both steps being repeated many times (with Step 2 repeated many more times). It should then be possible to come up with an expectation for what fraction of the repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment.

It seems that the answer to this question needs to be understood in order to evaluate the reproducibility of studies.
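As a minimal R sketch of the two steps (the group sizes, true difference, and variance below are hypothetical choices, not from the question):

```r
set.seed(1)
n <- 10        # per-group sample size (hypothetical)
delta <- 1     # true difference between population means (hypothetical)
n.rep <- 1e4   # number of Step 2 replications

# Step 1: one experiment comparing two sample means, with a 95% CI
# for the difference between the population means.
g1 <- rnorm(n, 0, 1)
g2 <- rnorm(n, delta, 1)
ci <- t.test(g2, g1)$conf.int

# Step 2: many replicate experiments; record each observed difference.
diffs <- replicate(n.rep, mean(rnorm(n, delta, 1)) - mean(rnorm(n, 0, 1)))

# Fraction of replicate differences captured by the Step 1 interval.
mean(ci[1] <= diffs & diffs <= ci[2])
```

Rerunning without the fixed seed gives a different answer each run, which is exactly the dependence on Step 1 noted above.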


For each original (Step 1) experiment, define x_i as the fraction of subsequent (Step 2) results that fall within the confidence interval of the original result. You want to compute the empirical distribution of x_i?
Matthew Gunn

Yes, you understand exactly what I am asking.
Harvey Motulsky

@MatthewGunn asked whether you want the empirical distribution of the "capture rate" for future observations. Your post asked about "...what fraction of repeat experiments, on average, have an effect size within the 95% confidence interval of the first experiment," which is an expected value (a mean), not a distribution.

Whuber's analysis is excellent, but in case you need a citation, there is a paper that discusses precisely this question in great detail: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it the capture percentage of a confidence interval.
amoeba says Reinstate Monica

Answers:



Analysis

Because this is a conceptual question, to keep things simple let us consider the situation in which a 1 − α confidence interval for the mean μ,

[x̄(1) + Z_{α/2} s(1)/√n,  x̄(1) + Z_{1−α/2} s(1)/√n],

is constructed from a first random sample x(1) of size n, and a second random sample x(2) of size m is obtained, all from the same Normal(μ, σ²) distribution. (If you like, you may replace the Z values by values from the Student t distribution with n − 1 degrees of freedom; the following analysis does not change.)

The chance that the mean of the second sample lies within the CI determined by the first sample is

Pr(x̄(1) + Z_{α/2} s(1)/√n ≤ x̄(2) ≤ x̄(1) + Z_{1−α/2} s(1)/√n) = Pr(Z_{α/2} s(1)/√n ≤ x̄(2) − x̄(1) ≤ Z_{1−α/2} s(1)/√n).

Because the mean of the first sample, x̄(1), is independent of the standard deviation of the first sample, s(1) (this requires Normality), and the second sample is independent of the first, the difference of sample means U = x̄(2) − x̄(1) is independent of s(1). Moreover, for this symmetric interval, Z_{α/2} = −Z_{1−α/2}. Therefore, writing S for the random variable s(1) and squaring both inequalities, the probability in question equals

Pr(U² ≤ (Z_{1−α/2}/√n)² S²) = Pr(U²/S² ≤ (Z_{1−α/2}/√n)²).

Elementary laws of expectation imply that U has a mean of 0 and a variance of

Var(U) = Var(x̄(2) − x̄(1)) = σ²(1/m + 1/n).

Because U is a linear combination of Normal variables, it too has a Normal distribution. Therefore U² is σ²(1/n + 1/m) times a χ²(1) variable. We already knew that S² is σ²/(n−1) times a χ²(n−1) variable. Consequently, U²/S² is 1/n + 1/m times a variable with an F(1, n−1) distribution. The required probability is given by the F distribution as

(1)   F_{1,n−1}(Z²_{1−α/2} / (1 + n/m)).
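Formula (1) is easy to evaluate in R. The helper below is a sketch (the function name is my own); it uses the Student t multiplier for the interval, which, as noted above, may be substituted for Z without changing the analysis:

```r
# Chance (1) that the second sample mean falls inside the first CI,
# using the Student t multiplier for the interval.
capture.prob <- function(n, m, alpha) {
  Z <- qt(1 - alpha/2, n - 1)
  pf(Z^2 / (1 + n/m), 1, n - 1)
}

capture.prob(50, 50, 0.05)  # about 0.838
capture.prob(5, 5, 0.05)    # noticeably higher for small samples
```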

Discussion

An interesting case is when the second sample is the same size as the first, so that n/m=1 and only n and α determine the probability. Here are the values of (1) plotted against α for n=2,5,20,50.

Figure

As n increases, the graphs approach a limiting value at each α. The conventional test size α = 0.05 is marked by a vertical gray line. For large values of n = m, the limiting chance at α = 0.05 is approximately 85%.

By understanding this limit, we will peer past the details of small sample sizes and better understand the crux of the matter. As n=m grows large, the F distribution approaches a χ2(1) distribution. In terms of the standard Normal distribution Φ, the probability (1) then approximates

Φ(Z1α/22)Φ(Zα/22)=12Φ(Zα/22).

For instance, with α=0.05, Zα/2/21.96/1.411.386 and Φ(1.386)0.083. Consequently the limiting value attained by the curves at α=0.05 as n increases will be 12(0.083)=10.166=0.834. You can see it has almost been reached for n=50 (where the chance is 0.8383.)
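The limiting value can be checked with a one-line R calculation (a quick sketch of the computation just described):

```r
# Limiting capture probability as n = m grows, via the Normal approximation.
alpha <- 0.05
z <- qnorm(1 - alpha/2)              # about 1.96
limit <- 1 - 2 * pnorm(-z / sqrt(2))
limit                                # about 0.834
```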

For small α, the relationship between α and the complementary probability--the risk that the CI does not cover the second mean--is almost perfectly a power law. Another way to express this is that the log complementary probability is almost a linear function of logα. The limiting relationship is approximately

log(2Φ(Zα/22))1.79712+0.557203log(20α)+0.00657704(log(20α))2+

In other words, for large n=m and α anywhere near the traditional value of 0.05, (1) will be close to

10.166(20α)0.557.
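As a rough numerical check of this power-law approximation (the particular α values below are my own choices):

```r
# Compare the exact limiting probability with the power-law approximation.
alpha  <- c(0.05, 0.01)
exact  <- 1 - 2 * pnorm(qnorm(alpha/2) / sqrt(2))
approx <- 1 - 0.166 * (20 * alpha)^0.557
rbind(exact = exact, approx = approx)   # agree to about two decimal places
```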

(This reminds me very much of the analysis of overlapping confidence intervals I posted at /stats//a/18259/919. Indeed, the magic power there, 1.91, is very nearly the reciprocal of the magic power here, 0.557. At this point you should be able to re-interpret that analysis in terms of reproducibility of experiments.)


Experimental results

These results are confirmed with a straightforward simulation. The following R code returns the frequency of coverage, the chance as computed with (1), and a Z-score to assess how much they differ. The Z-scores are typically less than 2 in size, regardless of n, m, μ, σ, α (or even whether a Z or t CI is computed), indicating the correctness of formula (1).

n <- 3      # First sample size
m <- 2      # Second sample size
sigma <- 2 
mu <- -4
alpha <- 0.05
n.sim <- 1e4
#
# Compute the multiplier.
#
Z <- qnorm(alpha/2)
#Z <- qt(alpha/2, df=n-1) # Use this for a Student t C.I. instead.
#
# Draw the first sample and compute the CI as [l.1, u.1].
#
x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
x.1.bar <- colMeans(x.1)
s.1 <- apply(x.1, 2, sd)
l.1 <- x.1.bar + Z * s.1 / sqrt(n)
u.1 <- x.1.bar - Z * s.1 / sqrt(n)
#
# Draw the second sample and compute the mean as x.2.
#
x.2 <- colMeans(matrix(rnorm(m*n.sim, mu, sigma), nrow=m))
#
# Compare the second sample means to the CIs.
#
covers <- l.1 <= x.2 & x.2 <= u.1
#
# Compute the theoretical chance and compare it to the simulated frequency.
#
f <- pf(Z^2 / ((n * (1/n + 1/m))), 1, n-1)
m.covers <- mean(covers)
(c(Simulated=m.covers, Theoretical=f, Z=(m.covers - f)/sd(covers) * sqrt(length(covers))))

You say that using t instead of z makes no big difference. I believe you, but I haven't checked it yet. With small sample sizes, the two critical values can be quite different, and the t distribution is the correct way to compute the CI. Why do you prefer using z?
Harvey Motulsky

Regarding Z versus t: it is interesting that the curves in the figure start high and descend to their limit. In particular, the chance of reproducing a significant result is then much higher for small samples than for large! Note that there's nothing to check, because you are free to interpret Zα as a percentage point of the appropriate Student t distribution (or of any other distribution you might care to name). Nothing changes in the analysis. If you do want to see the particular effects, uncomment the qt line in the code.
whuber

+1. This is a great analysis (and your answer has way too few upvotes for what it is). I just came across a paper that discusses this very question in great detail and I thought you might be interested: Cumming & Maillardet, 2006, Confidence Intervals and Replication: Where Will the Next Mean Fall?. They call it capture percentage of a confidence interval.
amoeba says Reinstate Monica

@Amoeba Thank you for the reference. I especially appreciate one general conclusion therein: "Replication is central to the scientific method, and researchers should not turn a blind eye to it just because it makes salient the inherent uncertainty of a single study."
whuber

Update: Thanks to the ongoing discussion in the sister thread, I now believe my reasoning in the above comment was not correct. 95% CIs have 83% "replication-capture", but this is a statement about repeated sampling and cannot be interpreted as giving a probability conditioned on one particular confidence interval, at least not without further assumptions. (Perhaps both this and previous comments should better be deleted in order not to confuse further readers.)
amoeba says Reinstate Monica


[Edited to fix the bug WHuber pointed out.]

I altered @Whuber's R code to use the t distribution, and plot coverage as a function of sample size. The results are below. At high sample size, the results match WHuber's of course.

[Figure: coverage plotted as a function of sample size, for alpha = 0.01 and 0.05]

And here is the adapted R code, run twice with alpha set to either 0.01 or 0.05.

sigma <- 2 
mu <- -4
alpha <- 0.01
n.sim <- 1e5
#
# Compute the multiplier.

for (n in c(3,5,7,10,15,20,30,50,100,250,500,1000)) {
  T <- qt(alpha/2, df=n-1)
  #
  # Draw the first sample and compute the CI as [l.1, u.1].
  #
  x.1 <- matrix(rnorm(n*n.sim, mu, sigma), nrow=n)
  x.1.bar <- colMeans(x.1)
  s.1 <- apply(x.1, 2, sd)
  l.1 <- x.1.bar + T * s.1 / sqrt(n)
  u.1 <- x.1.bar - T * s.1 / sqrt(n)
  #
  # Draw the second sample and compute the mean as x.2.
  #
  x.2 <- colMeans(matrix(rnorm(n*n.sim, mu, sigma), nrow=n))
  #
  # Compare the second sample means to the CIs and report coverage.
  #
  covers <- l.1 <= x.2 & x.2 <= u.1
  coverage <- mean(covers)
  print(coverage)
}

And here is the GraphPad Prism file that made the graph.


I believe your plots do not use the t distribution, due to a bug: you set the value of T outside the loop! If you would like to see the correct curves, just plot them directly using the theoretical result in my answer, as given at the end of my R code (rather than relying on the simulated results): curve(pf(qt(.975, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), 2, 1000, log="x", ylim=c(.8,1), col="Blue"); curve(pf(qt(.995, x-1)^2 / ((x * (1/x + 1/x))), 1, x-1), add=TRUE, col="Red")
whuber

@whuber. Yikes! Of course you are right. Embarrassing. I've fixed it. As you pointed out the coverage is higher with tiny sample sizes. (I fixed the simulations, and didn't try your theoretical function.)
Harvey Motulsky

I am glad you fixed it, because it is very interesting how high the coverage is for small sample sizes. We could also invert your question and use the formula to determine what value of Zα/2 to use if we wished to assure (before doing any experiments), with probability p=0.95 (say), that the mean of the second experiment would lie within the two-sided 1−α confidence interval determined from the first. Doing so, as a routine practice, could be one intriguing way of addressing some criticism of NHST.
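As a sketch of that inversion (assuming n = m; the helper name is my own): setting the capture probability in formula (1) equal to p and solving gives Z = √(2·qf(p, 1, n−1)).

```r
# Multiplier Z guaranteeing, with probability p, that the second mean
# falls inside the first interval (n = m assumed), together with the
# two-sided confidence level that multiplier corresponds to.
capture.multiplier <- function(n, p = 0.95) {
  Z <- sqrt(2 * qf(p, 1, n - 1))
  c(Z = Z, implied.CI.level = 1 - 2 * pnorm(-Z))
}

capture.multiplier(50)   # Z near 2.84, i.e. roughly a 99.5% CI
```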
whuber

@whuber I think the next step is to look at the distribution of coverage. So far, we have the average coverage (average of many first experiments, with average of many second experiments each). But depending on what the first experiment is, in some cases the average coverage will be poor. It would be interesting to see the distribution. I'm trying to learn R well enough to find out.
Harvey Motulsky

Regarding the distributions, see the paper I linked to in the comments above.
amoeba says Reinstate Monica
Licensed under cc by-sa 3.0 with attribution required.