Pearson correlation of data sets with possibly zero standard deviation?

12

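I am having a problem computing the Pearson correlation coefficient of data sets with possibly zero standard deviation (i.e. all data have the same value).

Suppose that I have the following two data sets: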

float x[] = {2, 2, 2, 3, 2};
float y[] = {2, 2, 2, 2, 2};

The correlation coefficient "r" would be computed using the following equation:

float r = covariance(x, y) / (std_dev(x) * std_dev(y));

However, because all data in data set "y" have the same value, the standard deviation std_dev(y) is zero and "r" is undefined.
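For the two arrays above, the breakdown can be checked by hand. The sketch below assumes the $1/N$ normalisation for the covariance and standard deviation (an assumption; the conclusion is the same with $1/(N-1)$):

$\overline{x}=\frac{11}{5}=2.2,\qquad \overline{y}=2,\qquad y_i-\overline{y}=0 \text{ for every } i$

$\mathrm{cov}(x,y)=\frac{1}{5}\sum_{i}(x_i-\overline{x})(y_i-\overline{y})=0,\qquad \mathrm{std\_dev}(x)=\sqrt{0.16}=0.4,\qquad \mathrm{std\_dev}(y)=0$

$r=\frac{0}{0.4\times 0}=\frac{0}{0}$

So both the numerator and the denominator vanish, and in IEEE 754 floating-point arithmetic the division `0.0f / 0.0f` evaluates to NaN rather than to any particular correlation.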

Is there any solution to this problem? Or should I use another method to measure the relationship between the data in this case?

correlation

— Andree

There is no "data relationship" in this example because y does not vary. Assigning any numerical value to r would be a mistake.

— whuber

1

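@whuber - It is true that $r$ is undefined, but that does not necessarily mean the "true" unknown correlation $\rho$ cannot be estimated. You just have to use something different to estimate it.

— probabilityislogic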

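@probability This presumes it is a problem of estimation and not merely one of characterization. But accepting that, what estimator would you propose in the example? There can be no universally correct answer, because it depends on how the estimator will be used (a loss function, in effect). In many applications, such as PCA, it seems likely that any procedure that imputes a value to $\rho$ could be worse than other procedures that recognize $\rho$ cannot be identified.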

— whuber

1

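@whuber - "Estimate" was a bad choice of words on my part (you may have noticed I am not the best wordsmith). What I meant was that although $\rho$ may not be uniquely identified, this does not mean the data are useless in telling us about $\rho$. My answer gives an (ugly) demonstration of this from an algebraic point of view.

— probabilityislogic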

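@Probability Your analysis seems contradictory: if y really is modeled with a normal distribution, then a sample of five 2's shows this model is inappropriate. Ultimately, you don't get something for nothing: your results depend strongly on the assumptions made about the priors. The original problems in identifying $\rho$ are still there, but they have been hidden by all these additional assumptions. That seems, IMHO, just to obscure the issues rather than clarify them.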

— whuber

9

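The "sampling theory" people will tell you that no such estimate exists. But you can get one; you just need to be reasonable about your prior information and do a lot more mathematical work.

If you specify a Bayesian method of estimation and the posterior is the same as the prior, then you can say the data say nothing about the parameter. Because things may get "singular" on us, we cannot use infinite parameter spaces. I am assuming that, because you use the Pearson correlation, you have a bivariate normal likelihood: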

$p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)=\left(\sigma_x\sigma_y\sqrt{2\pi(1-\rho^2)}\right)^{-N}\exp\left(-\frac{\sum_{i}Q_i}{2(1-\rho^2)}\right)$

where

$Q_i=\frac{(x_i-\mu_x)^2}{\sigma_x^2}+\frac{(y_i-\mu_y)^2}{\sigma_y^2}-2\rho\frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y}$

Now, to indicate that one data set may consist of a single repeated value, write $y_i=y$ for all $i$. We then get

$\sum_{i}Q_i=N\left[\frac{(y-\mu_y)^2}{\sigma_y^2}+\frac{s_x^2 + (\overline{x}-\mu_x)^2}{\sigma_x^2}-2\rho\frac{(\overline{x}-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]$

where

$s_x^2=\frac{1}{N}\sum_{i}(x_i-\overline{x})^2$

So your likelihood depends on four numbers: $s_x^2,y,\overline{x},N$. Because you want an estimate of $\rho$, you need to multiply by a prior and integrate out the nuisance parameters $\mu_x,\mu_y,\sigma_x,\sigma_y$. To prepare for the integration, we "complete the square":

$\frac{\sum_{i}Q_i}{1-\rho^2}=N\left[\frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2(1-\rho^{2})}+\frac{s_x^2}{\sigma_{x}^{2}(1-\rho^{2})} + \frac{(\overline{x}-\mu_x)^2}{\sigma_x^2}\right]$
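The completed square can be checked with a short piece of algebra. The shorthand $a=(y-\mu_y)/\sigma_y$ and $b=(\overline{x}-\mu_x)/\sigma_x$ is introduced here only for illustration and is not part of the original answer. With it, $\sum_i Q_i = N\left[a^2-2\rho ab+b^2+s_x^2/\sigma_x^2\right]$, and

$a^2-2\rho ab+b^2=(a-\rho b)^2+(1-\rho^2)\,b^2$

$\frac{\sum_{i}Q_i}{1-\rho^2}=N\left[\frac{(a-\rho b)^2}{1-\rho^2}+\frac{s_x^2}{\sigma_x^2(1-\rho^2)}+b^2\right]$

$(a-\rho b)^2=\frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2}$

Substituting the last line into the second recovers the three terms of the expression above, and shows that $\mu_y$ enters only through a single Gaussian-shaped term.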

Now we should err on the side of caution and ensure a properly normalised probability; that way we cannot get into trouble. One such option is to use a weakly informative prior, which just places a restriction on the range of each parameter. So we have $L_{\mu}<\mu_x,\mu_y<U_{\mu}$ with a flat prior for the means, and $L_{\sigma}<\sigma_x,\sigma_y<U_{\sigma}$ with a Jeffreys prior for the standard deviations. These limits are easy to set with a bit of "common sense" thinking about the problem. I will leave the prior for $\rho$ unspecified (a uniform one should work fine; if not, truncate the singularity at $\pm 1$), and so we get:

$p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)=\frac{p(\rho)}{A\sigma_x\sigma_y}$

where $A=2(U_{\mu}-L_{\mu})^{2}[\log(U_{\sigma})-\log(L_{\sigma})]^{2}$. This gives a posterior of:

$p(\rho|D)=\int p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)\,p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)\,d\mu_y\,d\mu_x\,d\sigma_x\,d\sigma_y$

$=\frac{p(\rho)}{A[2\pi(1-\rho^2)]^{\frac{N}{2}}}\int_{L_{\sigma}}^{U_{\sigma}}\int_{L_{\sigma}}^{U_{\sigma}}\left(\sigma_x\sigma_y\right)^{-N-1}\exp\left(-\frac{N s_x^2}{2\sigma_{x}^{2}(1-\rho^{2})}\right) \times$

$\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N(\overline{x}-\mu_x)^2}{2\sigma_x^2}\right)\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{2\sigma_y^2(1-\rho^{2})}\right)d\mu_y\,d\mu_x\,d\sigma_x\,d\sigma_y$

Now the first integration over $\mu_y$ can be done by making a change of variables $z=\sqrt{N}\frac{\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\sigma_y\sqrt{1-\rho^{2}}}\implies dz=\frac{\sqrt{N}}{\sigma_y\sqrt{1-\rho^{2}}}d\mu_y$ and the first integral over $\mu_y$ becomes:

$\frac{\sigma_y\sqrt{2\pi(1-\rho^{2})}}{\sqrt{N}}\left[\Phi\left( \frac{U_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)-\Phi\left( \frac{L_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)\right]$

As you can see from here, no analytic solution is possible. However, it is also worthwhile to note that the value $\rho$ has not dropped out of the equations. This means that the data and prior information still have something to say about the true correlation. If the data said nothing about the correlation, then we would simply be left with $p(\rho)$ as the only function of $\rho$ in these equations.

It also shows that passing to the limit of infinite bounds for $\mu_y$ "throws away" some of the information about $\rho$, which is contained in the complicated-looking normal CDF $\Phi(\cdot)$. If you have a lot of data, passing to the limit is fine and you don't lose much; but if you have very scarce information, as in your case, it is important to keep every scrap you have. It means ugly maths, but this example is not too hard to do numerically. So we can evaluate the integrated likelihood for $\rho$ at values of, say, $-0.99,-0.98,\dots,0.98,0.99$ fairly easily. Just replace the integrals by summations over small enough intervals, so that you have a triple summation.

— probabilityislogic

@probabilityislogic: Wow. Simply wow. After seeing some of your answers I really wonder: what should a doofus like me do to reach such a flexible Bayesian state of mind?

— steffen

1

@steffen - lol. It's not that difficult, you just need to practice. And always always always remember that the product and sum rules of probability are the only rules you will ever need. They will extract whatever information is there, whether you see it or not. So you apply the product and sum rules, then just do the maths. That is all I have done here.

— probabilityislogic

@steffen - and the other rule - more a mathematical one than stats one - don't pass to an infinite limit too early in your calculations, your results may become arbitrary, or little details may get thrown out. Measurement error models are a perfect example of this (as is this question).

— probabilityislogic

@probabilityislogic: Thank you, I'll keep this in mind... as soon as I am done working through my "Bayesian Analysis"-copy ;).

— steffen

@probabilityislogic: If you could humor a nonmathematical statistician/researcher...would it be possible to summarize or translate your answer to a group of dentists or high school principals or introductory statistics students?

— rolando2

6

I agree with sesqu that the correlation is undefined in this case. Depending on your type of application you could, for example, calculate the Gower similarity between both vectors, which is $gower(v1,v2)=\frac{\sum_{i=1}^{n}\delta(v1_i,v2_i)}{n}$, where $\delta$ is the Kronecker delta applied as a function of $v1_i,v2_i$.

So for instance if all values are equal, gower(.,.)=1. If on the other hand they differ in only one of $n$ dimensions, gower(.,.)=$(n-1)/n$, e.g. 0.9 for $n=10$. If they differ in every dimension, gower(.,.)=0, and so on.

Of course this is not a measure of correlation, but it allows you to calculate how close the vector with s>0 is to the one with s=0. Of course you can apply other metrics, too, if they serve your purpose better.
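A minimal C sketch of the formula above, assuming that "equal" means exact equality of the stored `float` values (for measured data a tolerance might be more appropriate); the function name `gower_similarity` is made up for this example:

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of positions at which v1 and v2 hold exactly equal values,
 * i.e. the Kronecker-delta sum divided by n. The result lies in [0, 1]. */
double gower_similarity(const float *v1, const float *v2, size_t n)
{
    assert(v1 && v2 && n > 0);
    size_t matches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v1[i] == v2[i]) {
            ++matches;
        }
    }
    return (double)matches / (double)n;
}

/* Usage sketch with the arrays from the question:
 *   float x[] = {2, 2, 2, 3, 2};
 *   float y[] = {2, 2, 2, 2, 2};
 *   double g = gower_similarity(x, y, 5);   // 4 of 5 positions match
 */
```

For the question's data four of the five positions agree, so the similarity is 4/5.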

— steffen

+1 That's a creative idea. It sounds like the "Gower Similarity" is a scaled Hamming distance.

— whuber

@whuber: Indeed it is !

— steffen

0

The correlation is undefined in that case. If you must define it, I would define it as 0, but consider a simple mean absolute difference instead.
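One possible reading of "mean absolute difference" is the average of $|x_i-y_i|$ over the paired values; that interpretation, and the function name `mean_abs_diff`, are assumptions made for this sketch:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Average of |x[i] - y[i]| over the n paired values.
 * Zero means the two series are identical element by element. */
double mean_abs_diff(const float *x, const float *y, size_t n)
{
    assert(x && y && n > 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += fabs((double)x[i] - (double)y[i]);
    }
    return total / (double)n;
}

/* Usage sketch with the arrays from the question:
 *   float x[] = {2, 2, 2, 3, 2};
 *   float y[] = {2, 2, 2, 2, 2};
 *   double d = mean_abs_diff(x, y, 5);   // only one pair differs, by 1
 */
```

Unlike r, this quantity is in the units of the data and stays defined when one series is constant, but it measures closeness of values rather than association.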

— sesqu

0

This question is coming from programmers, so I'd suggest plugging in zero. There's no evidence of a correlation, and the null hypothesis would be zero (no correlation). There might be other context knowledge that would provide a "typical" correlation in one context, but the code might be re-used in another context.

— zbicyclist

2

There's no evidence of lack of correlation either, so why not plug in 1? Or -1? Or anything in between? They all lead to re-usable code!

— whuber

@whuber - you plug in zero because the data is "less constrained" when it is independent - this is why maxent distributions are independent unless you explicitly specify correlations in the constraints. Independence can be viewed as a conservative assumption when you know of no such correlations - effectively you are averaging over all possible correlations.

— probabilityislogic

1

@prob I question why it makes sense as a generic procedure to average over all correlations. In effect this procedure substitutes the definite and possibly quite wrong answer "zero!" for the correct answer "the data don't tell us." That difference can be important for decision making.

— whuber

Just because the question might be from a programmer, does not mean you should convert an undefined value to zero. Zero means something specific in a correlation calculation. Throw an exception. Let the caller decide what should happen. Your function should calculate a correlation, not decide what to do if one cannot be computed.

— Jared Becksfort