How do I calculate the pooled variance of two or more groups, given known group variances, means, and sample sizes?

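Say there are $m+n$ elements split into two groups (of sizes $m$ and $n$). The variance of the first group is $\sigma_m^2$ and the variance of the second group is $\sigma^2_n$. The elements themselves are assumed to be unknown, but I know the means $\mu_m$ and $\mu_n$.

Is there a way to calculate the combined variance $\sigma^2_{(m+n)}$?

The variance does not have to be unbiased, so the denominator is $(m+n)$ and not $(m+n-1)$.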

variance pooling

— user1809989
Source

When you say you know the means and variances of these groups, are they parameters or sample values? If they are sample means/variances, you should not use $\mu$ and $\sigma$...

— Jonathan Christensen

I just used the symbols as a representation. Otherwise it would be difficult to explain my problem.

— user1809989

For sample values, we typically use Latin letters (e.g. $m$ and $s$). Greek letters are usually reserved for parameters. Using the "correct" (expected) symbols will help you communicate more clearly.

— Jonathan Christensen

No worries, I will follow that from now on! Cheers

— user1809989

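@Jonathan Because this is not a question about samples or estimates, one can legitimately take the view that $\mu$ and $\sigma^2$ are the true mean and variance of the empirical distribution of a batch of data, thereby justifying the conventional use of Greek letters, rather than Latin letters, to refer to them.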

— whuber

Answers:

Use the definitions of the mean

$\mu_{1:n} = \frac{1}{n}\sum_{i=1}^n x_i$

and the sample variance

$\sigma_{1:n}^2 = \frac{1}{n}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2 = \frac{n-1}{n}\left(\frac{1}{n-1}\sum_{i=1}^n \left(x_i - \mu_{1:n}\right)^2\right)$

(the last term in parentheses is the unbiased variance estimator often computed by default in statistical software) to find the sum of squares of all the data $x_i$. Let's order the indexes $i$ so that $i=1,\ldots,n$ designates elements of the first group and $i=n+1,\ldots,n+m$ designates elements of the second group. Break that sum of squares by group and re-express the two pieces in terms of the variances and means of the subsets of the data:

$\begin{aligned} (m+n)(\sigma^2_{1:m+n} + \mu_{1:m+n}^2) &= \sum_{i=1}^{n+m} x_i^2 \\ &= \sum_{i=1}^n x_i^2 + \sum_{i=n+1}^{n+m} x_i^2 \\ &= n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2). \end{aligned}$

Algebraically solving this for $\sigma^2_{1:m+n}$ in terms of the other (known) quantities yields

$\sigma^2_{1:m+n} = \frac{n(\sigma^2_{1:n} + \mu_{1:n}^2) + m(\sigma^2_{1+n:m+n} + \mu_{1+n:m+n}^2)}{m+n} - \mu^2_{1:m+n}.$

Of course, using the same approach, $\mu_{1:m+n} = (n\mu_{1:n} + m\mu_{1+n:m+n})/(m+n)$ can be expressed in terms of the group means, too.

An anonymous contributor points out that when the sample means are equal (so that $\mu_{1:n}=\mu_{1+n:m+n}=\mu_{1:m+n}$), the solution for $\sigma^2_{1:m+n}$ is a weighted mean of the group sample variances, with weights proportional to the group sizes.
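
To spell out that special case (a short check added here, not part of the original answer): write $\mu$ for the common mean, which is an abbreviation introduced only for this step, and substitute it into the formula for the combined variance. The $\mu^2$ terms then cancel:

$\begin{aligned} \sigma^2_{1:m+n} &= \frac{n(\sigma^2_{1:n} + \mu^2) + m(\sigma^2_{1+n:m+n} + \mu^2)}{m+n} - \mu^2 \\ &= \frac{n\,\sigma^2_{1:n} + m\,\sigma^2_{1+n:m+n}}{m+n} + \mu^2 - \mu^2 \\ &= \frac{n\,\sigma^2_{1:n} + m\,\sigma^2_{1+n:m+n}}{m+n}. \end{aligned}$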

— whuber
Source
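
The formula in the answer above can be checked numerically. The following is a minimal R sketch, not part of the original answer: the helper names `combine_two` and `pop_var` and the example data are made up for illustration, and the variances are assumed to be the biased (divide-by-$n$) kind that the question asks about.

```r
# Hypothetical helper: combine two groups from their sizes, means,
# and biased (divide-by-n) variances.
combine_two <- function(n, mu1, var1, m, mu2, var2) {
  mu <- (n * mu1 + m * mu2) / (n + m)  # combined mean
  v  <- (n * (var1 + mu1^2) + m * (var2 + mu2^2)) / (n + m) - mu^2
  list(mean = mu, var = v)
}

# Biased variance of a raw vector, used here only to check the result.
pop_var <- function(x) mean((x - mean(x))^2)

x1 <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative first group (n = 8)
x2 <- c(1, 3, 5, 7, 11)          # illustrative second group (m = 5)

res <- combine_two(length(x1), mean(x1), pop_var(x1),
                   length(x2), mean(x2), pop_var(x2))

# Compare with the variance computed directly from the pooled raw data.
all.equal(res$var, pop_var(c(x1, x2)))
```

If only R's `var()` output is available for a group of size `k`, multiplying it by `(k - 1) / k` gives the biased variance this sketch expects.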

The "homework" tag doesn't mean the question is elementary or stupid: it's used for self-study questions that can even include research-level queries. It distinguishes routine, more or less context-free questions (of the sort that might ordinarily grace the math forum) from specific applied questions.

— whuber

I cannot understand your first passage: $n(\sigma^2+\mu^2) = \sum (x - \mu)^2 + n\mu^2 \stackrel{?}{=} \sum x^2$. In particular I get $\sum [(x-\mu)^2+\mu^2] = \sum [x^2-2x\mu]$, which requires $\mu = 0$. Am I missing something? Could you please explain this?

— DarioP

@Dario $\sum(x-\mu)^2+n\mu^2=\left(\sum x^2 - 2\mu\sum x + n \mu^2\right)+n\mu^2 = \sum x^2 - 2n\mu^2 + 2n\mu^2 = \sum x^2.$

— whuber

Oh yes, I made a stupid sign mistake in my derivation. Now it is clear, thanks!!

— DarioP

I guess this can be extended to an arbitrary number of samples as long as you have the mean and variance for each. Calculating pooled (biased) standard deviation in R is simply sqrt(weighted.mean(u^2 + rho^2, n) - weighted.mean(u, n)^2) where n, u and rho are equal-length vectors. E.g. n=c(10, 14, 9) for three samples.

— Jonas Lindeløv
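
A minimal R sketch of the extension to several groups described in the comment above. The vector names `n`, `u` and `rho` follow the comment; the helper names `pooled_var` and `pop_sd` and the example data are assumptions made for illustration. Note that `rho` must hold the biased (divide-by-$n$) standard deviations, not the output of `sd()`.

```r
# Hypothetical helper: pooled (biased) variance of k groups.
# n = group sizes, u = group means, rho = biased group SDs.
pooled_var <- function(n, u, rho) {
  weighted.mean(u^2 + rho^2, n) - weighted.mean(u, n)^2
}

# Biased standard deviation of a raw vector (divisor n, not n - 1).
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# Three illustrative groups of different sizes.
groups <- list(c(1, 2, 3, 4), c(10, 12, 14), c(5, 5, 6, 8, 9))
n   <- sapply(groups, length)
u   <- sapply(groups, mean)
rho <- sapply(groups, pop_sd)

pooled_sd <- sqrt(pooled_var(n, u, rho))

# Compare with the biased SD of all observations taken together.
all.equal(pooled_sd, pop_sd(unlist(groups)))
```

When only `sd()` values are available, `rho <- s * sqrt((n - 1) / n)` converts them, where `s` is a hypothetical vector of those values.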

I'm going to use standard notation for sample means and sample variances in this answer, rather than the notation used in the question. Using standard notation, another formula for the pooled sample variance of two groups can be found in O'Neill (2014) (Result 1):

$\begin{aligned} s_\text{pooled}^2 &= \frac{1}{n_1+n_2-1} \Bigg[ (n_1-1) s_1^2 + (n_2-1) s_2^2 + \frac{n_1 n_2}{n_1+n_2} (\bar{x}_1 - \bar{x}_2)^2 \Bigg]. \end{aligned}$

This formula works directly with the underlying sample means and sample variances of the two subgroups, and does not require intermediate calculation of the pooled sample mean. (Proof of result in linked paper.)

— Reinstate Monica
Source
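
The two-group formula quoted above can be checked against R's `var()`, which uses the same $n-1$ divisor. This is a minimal sketch under that assumption. The helper name `pooled_sample_var` and the data are illustrative and do not come from the cited paper.

```r
# Hypothetical helper implementing the two-group formula quoted above.
# s1sq and s2sq are unbiased (divide-by-(n - 1)) sample variances.
pooled_sample_var <- function(n1, xbar1, s1sq, n2, xbar2, s2sq) {
  ((n1 - 1) * s1sq + (n2 - 1) * s2sq +
     n1 * n2 / (n1 + n2) * (xbar1 - xbar2)^2) / (n1 + n2 - 1)
}

x1 <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative first group
x2 <- c(1, 3, 5, 7, 11)          # illustrative second group

v <- pooled_sample_var(length(x1), mean(x1), var(x1),
                       length(x2), mean(x2), var(x2))

# Compare with the sample variance of the combined data.
all.equal(v, var(c(x1, x2)))
```

The last term inside the brackets is the between-group contribution. It vanishes when the two group means coincide.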

Yes, given the mean, sample count, and variance or standard deviation of each of two or more groups of samples, you can exactly calculate the variance or standard deviation of the combined group.

This web page describes how to do it, and why it works; it also includes source code in Perl: http://www.burtonsys.com/climate/composite_standard_deviations.html

BTW, contrary to the answer given above,

$n(\sigma^2 + \mu^2) \ne \sum_{i=1}^n x_i^2$

See for yourself, e.g., in R:

> x = rnorm(10,5,2)
> x
 [1] 6.515139 8.273285 2.879483 3.624233 6.199610 3.683164 4.921028 8.084591
 [9] 2.974520 6.049962
> mean(x)
[1] 5.320502
> sd(x)
[1] 2.007519
> sum(x**2)
[1] 319.3486
> 10 * (mean(x)**2 + sd(x)**2)
[1] 323.3787

— Dave Burton
Source

it's because you forgot the n-1 factor, e.g. try with n*(mean(x)**2+sd(x)**2/(n)*(n-1))

— user603

user603, what on earth are you talking about?

— Dave Burton

Dave, mathematics is a more reliable teacher than software. In this case R computes the standard deviation from the unbiased variance estimate (divisor $n-1$) rather than the standard deviation of the set of numbers. For instance, sd(c(-1,1)) returns 1.414214 rather than 1. Your example needs to use sqrt(9/10)*sd(x) in place of sd(x). Interpreting "$\sigma$" as the SD of the data and "$\mu$" as the mean of the data, your BTW remark is wrong. A program demonstrating this is n <- 10; x <- rnorm(n,5,2); m <- mean(x); s <- sd(x) * sqrt((n-1)/n); m2 <- sum(x^2); c(lhs=n * (m^2 + s^2), rhs=m2)

— whuber