What does pooled variance "actually" mean?

15

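I am a beginner in statistics, so please bear with me.

My question is the following: what does pooled variance actually mean?

When I look for a formula for pooled variance on the internet, I find a lot of literature using the following formula (for example, here: http://math.tntech.edu/ISR/Mathematical_Statistics/Introduction_to_Statistical_Tests/thispage/newnode19.html):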

$$S_p^2 = \frac{S_1^2 (n_1-1) + S_2^2 (n_2-1)}{n_1 + n_2 - 2}$$

But what does it actually calculate? When I use this formula to calculate my pooled variance, it gives me the wrong answer.

For example, consider this "parent sample":

$$2,2,2,2,2,8,8,8,8,8$$

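The variance of this parent sample is $S^2_p=10$, and its mean is $\bar{x}_p=5$.

Now, suppose I split this parent sample into two sub-samples:

The first sub-sample is 2,2,2,2,2, with mean $\bar{x}_1=2$ and variance $S^2_1=0$.
The second sub-sample is 8,8,8,8,8, with mean $\bar{x}_2=8$ and variance $S^2_2=0$.

Now, using the above formula to calculate the pooled/parent variance of these two sub-samples produces zero, because $S_1=0$ and $S_2=0$. So what does this formula actually calculate?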
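
The arithmetic above can be checked with a short script. This is only a minimal sketch using Python's standard `statistics` module; the helper name `pooled_variance` is an illustrative choice and does not come from the linked page.

```python
from statistics import mean, variance

def pooled_variance(a, b):
    """Literature formula: df-weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]
parent = sub1 + sub2          # the "parent sample" from the text

parent_mean = mean(parent)    # mean of all ten values
parent_var = variance(parent) # sample variance (n - 1 denominator) of all ten values
pooled = pooled_variance(sub1, sub2)

print(parent_mean, parent_var, pooled)
```

The parent variance comes out as 10 and the pooled value as 0, matching the numbers quoted above.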

On the other hand, after some lengthy derivation, I found that the formula which produces the correct pooled/parent variance is:

$$S_p^2 = \frac{S_1^2 (n_1-1) + n_1 d_1^2 + S_2^2 (n_2-1) + n_2 d_2^2}{n_1 + n_2 - 1}$$

where $d_1=\bar{x}_1-\bar{x}_p$ and $d_2=\bar{x}_2-\bar{x}_p$.
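
As a sanity check, the formula with the $d_1$ and $d_2$ terms can be compared against the variance computed directly from the concatenated data. The sketch below assumes the parent mean $\bar{x}_p$ is the size-weighted average of the two sub-sample means; the function name `combined_variance` is hypothetical.

```python
from statistics import mean, variance

def combined_variance(a, b):
    """Variance of the concatenated sample, built from sub-sample summaries."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    parent_mean = (n1 * m1 + n2 * m2) / (n1 + n2)  # assumed definition of x-bar_p
    d1 = m1 - parent_mean
    d2 = m2 - parent_mean
    numerator = (variance(a) * (n1 - 1) + n1 * d1 ** 2
                 + variance(b) * (n2 - 1) + n2 * d2 ** 2)
    return numerator / (n1 + n2 - 1)

sub1 = [2, 2, 2, 2, 2]
sub2 = [8, 8, 8, 8, 8]

combined = combined_variance(sub1, sub2)
direct = variance(sub1 + sub2)   # variance of all ten values at once
print(combined, direct)
```

For these two sub-samples both values equal 10, so the formula reproduces the variance of the concatenated sample.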

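I found similar formulas, for example here: http://www.emathzone.com/tutorials/basic-statistics/combined-variance.html and also on Wikipedia, although I have to admit that they don't look exactly the same as mine.

So again, what does pooled variance actually mean? Shouldn't it mean the variance of the parent sample, computed from the two sub-samples? Or am I completely wrong here?

Thank you in advance.

EDIT 1: Someone says that my two sub-samples above are pathological since they have zero variance. Well, I can give you a different example. Consider this parent sample: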

$$1,2,3,4,5,46,47,48,49,50$$

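The variance of this parent sample is $S^2_p=564.7$, and its mean is $\bar{x}_p=25.5$.

Now, suppose I split this parent sample into two sub-samples:

The first sub-sample is 1,2,3,4,5, with mean $\bar{x}_1=3$ and variance $S^2_1=2.5$.
The second sub-sample is 46,47,48,49,50, with mean $\bar{x}_2=48$ and variance $S^2_2=2.5$.

Now, if you use the "literature's formula" to calculate the pooled variance, you will get 2.5, which is completely wrong, because the parent/pooled variance should be 564.7. If you use "my formula" instead, you will get the correct answer.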
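
For reference, both numbers can be read off the standard split of the total sum of squares into a within-group part and a between-group part. The worked figures below use only the values quoted above:

$$\sum_{j}\sum_{i}(x_{i,j}-\bar{x}_p)^2 = \underbrace{\sum_j (n_j-1)S_j^2}_{\text{within}} + \underbrace{\sum_j n_j(\bar{x}_j-\bar{x}_p)^2}_{\text{between}}$$

$$\text{within} = 4(2.5) + 4(2.5) = 20, \qquad \text{between} = 5(3-25.5)^2 + 5(48-25.5)^2 = 5062.5$$

$$\frac{\text{within}}{n_1+n_2-2} = \frac{20}{8} = 2.5, \qquad \frac{\text{within}+\text{between}}{n_1+n_2-1} = \frac{5082.5}{9} \approx 564.7$$

So the literature's formula uses only the within-group part, while the formula with the $d_1$ and $d_2$ terms also includes the between-group part.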

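Please understand that I am using extreme examples here to show people that the formula is indeed wrong. If I used "normal data" without much variation (no extreme cases), the results from the two formulas would be very similar, and people could dismiss the difference as rounding error rather than as a problem with the formula itself.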

variance mean pooling

— Hanciong

Some related links that may help: stats.stackexchange.com/q/214834/3277, stats.stackexchange.com/q/12330/3277, stats.stackexchange.com/q/43159/3277.

— ttnphns

13

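Put simply, the pooled variance is an (unbiased) estimate of the variance within each sample, under the assumption/constraint that those variances are equal.

This is explained, motivated, and analyzed in some detail in the Wikipedia entry for pooled variance.

It does not estimate the variance of a new "meta-sample" formed by concatenating the two individual samples, as you supposed. As you have already discovered, estimating that requires a completely different formula.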
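
A small simulation can make this distinction concrete. The sketch below is illustrative only: the means, the common standard deviation, the sample size, and the seed are all assumed values, and the data are drawn from normal distributions for convenience.

```python
import random
from statistics import mean, variance

def pooled_variance(a, b):
    """df-weighted average of the two sample variances."""
    n1, n2 = len(a), len(b)
    return ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)

rng = random.Random(0)      # fixed seed so the run is repeatable
sigma = 2.0                 # assumed common standard deviation (sigma^2 = 4)
mu1, mu2 = 0.0, 10.0        # assumed, deliberately different group means
n, reps = 20, 2000          # observations per group, number of repetitions

pooled_estimates, concat_estimates = [], []
for _ in range(reps):
    a = [rng.gauss(mu1, sigma) for _ in range(n)]
    b = [rng.gauss(mu2, sigma) for _ in range(n)]
    pooled_estimates.append(pooled_variance(a, b))
    concat_estimates.append(variance(a + b))   # variance of the "meta-sample"

avg_pooled = mean(pooled_estimates)
avg_concat = mean(concat_estimates)
print(f"average pooled variance:       {avg_pooled:.2f}")
print(f"average concatenated variance: {avg_concat:.2f}")
```

Under these assumptions the pooled estimates average out close to the within-group variance of 4, while the concatenated variance is much larger because it also absorbs the gap between the two means.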

— Jake Westfall

The assumption of "equality" (that is, that the same population produced those samples) is not necessary in general to define what "pooled" is. Pooled simply means averaged, omnibus (see my comment to Tim).

— ttnphns

@ttnphns I think the equality assumption is necessary for giving the pooled variance a conceptual meaning (which the OP asked for) that goes beyond just verbally describing the mathematical operation it performs on the sample variances. If the population variances are not assumed equal, then it's unclear what we could consider the pooled variance to be an estimate of. Of course, we could just think about it as being an amalgamation of the two variances and leave it at that, but that's hardly enlightening in the absence of any motivation for wanting to combine the variances in the first place.

— Jake Westfall

Jake, I'm not in disagreement with that, given the specific question of the OP, but I wanted to speak about definition of the word "pooled", that's why I said, "in general".

— ttnphns

@JakeWestfall Your answer is the best answer so far. Thank you. Although I am still not clear about one thing. According to Wikipedia, pooled variance is a method for estimating variance of several different populations when the mean of each population may be different, but one may assume that the variance of each population is the same.

— Hanciong

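@JakeWestfall: So if we calculate the pooled variance from two different populations with different means, what does it actually calculate? The first variance measures the variation with respect to the first mean, and the second variance is with respect to the second mean. I don't see what additional information can be gained from calculating it.

— Hanciong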

10

Pooled variance is used to combine variances from different samples by taking their weighted average, to get the "overall" variance. The problem with your example is that it is a pathological case, since each of the sub-samples has variance equal to zero. Such a pathological case has very little in common with the data we usually encounter, since there is always some variability, and if there is no variability, we don't care about such variables because they carry no information. Note that this is a very simple method, and there are more complicated ways of estimating variance in hierarchical data structures that are not prone to such problems.

As for your example in the edit, it shows that it is important to clearly state your assumptions before starting the analysis. Let's say that you have $n$ data points in $k$ groups, denoted $x_{1,1},x_{2,1},\dots,x_{n-1,k},x_{n,k}$, where the first index $i$ in $x_{i,j}$ stands for cases and the second index $j$ stands for groups. Several scenarios are possible: you can assume that all the points come from the same distribution (for simplicity, let's assume a normal distribution),

$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2) \tag{1}$$

you can assume that each of the sub-samples has its own mean

$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2) \tag{2}$$

or, its own variance

$$x_{i,j} \sim \mathcal{N}(\mu, \sigma^2_j) \tag{3}$$

or that each of them has its own distinct parameters

$$x_{i,j} \sim \mathcal{N}(\mu_j, \sigma^2_j) \tag{4}$$

Depending on your assumptions, a particular method may or may not be adequate for analyzing the data.

In the first case, you wouldn't be interested in estimating the within-group variances, since you would assume that they all are the same. Nonetheless, if you aggregated the global variance from the group variances, you would get the same result as by using pooled variance since the definition of variance is

$$\mathrm{Var}(X) = \frac{1}{n-1} \sum_i (x_i - \mu)^2$$

and in the pooled estimator you first multiply each group variance by $n_j-1$, then add the results together, and finally divide by $n_1 + n_2 - 2$.

In the second case, the means differ, but you have a common variance. This scenario is closest to your example in the edit. Here, the pooled variance would correctly estimate the global variance, while if you estimated the variance on the whole dataset you would obtain incorrect results, since you would not be accounting for the fact that the groups have different means.

In the third case it doesn't make sense to estimate the "global" variance, since you assume that each of the groups has its own variance. You may still be interested in obtaining an estimate for the whole population, but in that case both (a) calculating the individual variances per group and (b) calculating the global variance from the whole dataset can give you misleading results. If you are dealing with this kind of data, you should think about using a more complicated model that accounts for the hierarchical nature of the data.

The fourth case is the most extreme and quite similar to the previous one. In this scenario, if you wanted to estimate the global mean and variance, you would need a different model and different set of assumptions. In such case, you would assume that your data is of hierarchical structure, and besides the within-group means and variances, there is a higher-level common variance, for example assuming the following model

$$\begin{aligned} x_{i,j} &\sim \mathcal{N}(\mu_j, \sigma^2_j) \\ \mu_j &\sim \mathcal{N}(\mu_0, \sigma^2_0) \\ \sigma^2_j &\sim \mathcal{IG}(\alpha, \beta) \end{aligned} \tag{5}$$

where each sample has its own mean and variance $\mu_j,\sigma^2_j$, which are themselves draws from common distributions. In such a case, you would use a hierarchical model that takes into consideration both the lower-level and upper-level variability. To read more about this kind of model, you can check the Bayesian Data Analysis book by Gelman et al. and their eight schools example. This is, however, a much more complicated model than the simple pooled variance estimator.
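
To get a feel for what model (5) describes, one can simulate data from it. The sketch below only generates data and does not fit anything; the function name `simulate_hierarchical` and all parameter values are assumptions for illustration, and $\mathcal{IG}(\alpha, \beta)$ is taken to be the inverse-gamma distribution with shape $\alpha$ and scale $\beta$.

```python
import random

def simulate_hierarchical(k, n, mu0, sigma0, alpha, beta, seed=0):
    """Draw k groups of n observations from the hierarchical model (5)."""
    rng = random.Random(seed)
    groups = []
    for _ in range(k):
        # Group mean drawn from the upper-level normal distribution.
        mu_j = rng.gauss(mu0, sigma0)
        # If G ~ Gamma(shape=alpha, rate=beta) then 1/G ~ InverseGamma(alpha, beta).
        # random.gammavariate takes (shape, scale), so the scale is 1/beta.
        var_j = 1.0 / rng.gammavariate(alpha, 1.0 / beta)
        # Observations for this group use the group's own mean and variance.
        groups.append([rng.gauss(mu_j, var_j ** 0.5) for _ in range(n)])
    return groups

# Hypothetical settings: 8 groups of 10 observations each.
groups = simulate_hierarchical(k=8, n=10, mu0=0.0, sigma0=5.0, alpha=3.0, beta=2.0)
for j, g in enumerate(groups, start=1):
    print(j, [round(x, 2) for x in g[:3]])
```

Each simulated group has both its own center and its own spread, which is why neither a single pooled variance nor the variance of the whole dataset summarizes such data well.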

— Tim

I have updated my question with a different example. In this case, the answer from the "literature's formula" is still wrong. I understand that we are usually dealing with "normal data" where there is no extreme case like my example above. However, as mathematicians, shouldn't you care about which formula is actually correct, rather than which formula applies to "everyday/common problems"? If a formula is fundamentally wrong, it should be discarded, especially if there is another formula which holds in all cases, pathological or not.

— Hanciong

Btw you said there are more complicated ways of estimating variance. Could you show me these ways? Thank you

— Hanciong

2

Tim, pooled variance is not the total variance of the "combined sample". In statistics, "pooled" means weighted averaged (when we speak of averaged quantities such as variances, weights being the n's) or just summed (when we speak of sums such as scatters, sums-of-squares). Please, reconsider your terminology (choice of words) in the answer.

— ttnphns

1

Albeit off the current topic, here is an interesting question about "common" variance concept. stats.stackexchange.com/q/208175/3277

— ttnphns

1

Hanciong. I insist that "pooled" in general and even specifically "pooled variance" concept does not need, in general, any assumption such as: groups came from populations with equal variances. Pooling is simply blending (weighted averaging or summing). It is in ANOVA and similar circumstances that we do add that statistical assumption.

— ttnphns

1

The problem is that if you just concatenate the samples and estimate the variance of the result, you are assuming they come from the same distribution and therefore have the same mean. But in general we are interested in several samples with different means. Does this make sense?

— ZHU

0

The use-case of pooled variance is when you have two samples from distributions that:

may have different means, but
which you expect to have an equal true variance.

An example of this is a situation where you measure the length of Alice's nose $n$ times for one sample, and measure the length of Bob's nose $m$ times for the second. These are likely to produce a bunch of different measurements on the scale of millimeters, because of measurement error. But you expect the variance in measurement error to be the same no matter which nose you measure.

In this case, taking the pooled variance would give you a better estimate of the variance in measurement error than taking the variance of one sample alone.
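
The claim that pooling gives a better estimate can be illustrated with a simulation. Everything numeric below is hypothetical (the true nose lengths, the measurement-error standard deviation, the sample sizes, and the seed), and the errors are assumed to be normally distributed.

```python
import random
from statistics import mean, variance

rng = random.Random(1)             # fixed seed so the run is repeatable
sigma = 0.5                        # hypothetical measurement-error sd, in mm
alice_len, bob_len = 52.0, 47.0    # hypothetical true nose lengths, in mm
n, m, reps = 8, 12, 3000           # measurements of Alice, of Bob, repetitions

sq_err_single, sq_err_pooled = [], []
for _ in range(reps):
    alice = [rng.gauss(alice_len, sigma) for _ in range(n)]
    bob = [rng.gauss(bob_len, sigma) for _ in range(m)]
    s2_alice, s2_bob = variance(alice), variance(bob)
    # Pooled variance: df-weighted average of the two sample variances.
    s2_pooled = ((n - 1) * s2_alice + (m - 1) * s2_bob) / (n + m - 2)
    # Squared error of each estimate against the true error variance.
    sq_err_single.append((s2_alice - sigma ** 2) ** 2)
    sq_err_pooled.append((s2_pooled - sigma ** 2) ** 2)

mse_single = mean(sq_err_single)
mse_pooled = mean(sq_err_pooled)
print(f"MSE using Alice's sample only: {mse_single:.5f}")
print(f"MSE using the pooled variance: {mse_pooled:.5f}")
```

Under these assumptions the pooled estimate has the smaller mean squared error, because it draws on the measurements of both noses even though their means differ.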

— Misha

Thank you for your answer, but I still don't understand one thing. The first data set gives you the variance with respect to Alice's nose length, and the second gives you the variance with respect to Bob's nose length. If you calculate a pooled variance from those data, what does it actually mean? The first variance measures the variation with respect to Alice's nose and the second with respect to Bob's, so what additional information can we gain by calculating their pooled variance? They are completely different numbers.

— Hanciong

0

With the pooled variance we are not trying to estimate the variance of a bigger sample from smaller samples. Hence, the two examples you gave don't really address the question.

Pooled variance is needed to get a better estimate of the population variance from two samples that have been randomly drawn from that population and yield different variance estimates.

For example, suppose you are trying to gauge the variance in the smoking habits of males in London. You sample 300 males from London, twice. You end up with two variances (probably a bit different!). Since you did fair random sampling (to the best of your ability, as true random sampling is almost impossible), you have every right to say that both variances are valid point estimates of the population variance (London males in this case).

But how is that possible, two different point estimates? So we go ahead and find a common point estimate, which is the pooled variance. It is nothing but a weighted average of the two point estimates, where the weights are the degrees of freedom associated with each sample.
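
Written out, the weighted-average form looks like this (a sketch in the notation of the question, with $w_1$ and $w_2$ introduced here only for illustration):

$$S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{(n_1-1)+(n_2-1)} = w_1 S_1^2 + w_2 S_2^2, \qquad w_j = \frac{n_j-1}{n_1+n_2-2}$$

With two samples of 300 each, $w_1 = w_2 = 299/598 = 1/2$, so in this example the pooled variance is simply the ordinary average of the two sample variances.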

Hope this clarifies.

— Sameer Saurabh