Is the triangle inequality satisfied for these correlation-based distances?



For hierarchical clustering, I often see the following two "metrics" (strictly speaking, they are not exactly metrics) used to measure the distance between two random variables X and Y:

$$d_1(X,Y) = 1 - |\mathrm{Cor}(X,Y)|, \qquad d_2(X,Y) = 1 - (\mathrm{Cor}(X,Y))^2$$
Does either one satisfy the triangle inequality? If so, how should I prove it, other than just performing a brute-force computation? If they are not metrics, what is a simple counterexample?

Comment: you may be interested in reviewing this paper: arxiv.org/pdf/1208.3145.pdf
Chris

Answers:


Answer (score 5):

For $d_1$, the triangle inequality would give:

$$d_1(X,Z) \le d_1(X,Y) + d_1(Y,Z)$$
$$1 - |\mathrm{Cor}(X,Z)| \le 1 - |\mathrm{Cor}(X,Y)| + 1 - |\mathrm{Cor}(Y,Z)|$$
$$|\mathrm{Cor}(X,Y)| + |\mathrm{Cor}(Y,Z)| \le 1 + |\mathrm{Cor}(X,Z)|$$

This looks like a very easy inequality to defeat. We can make the right-hand side as small as possible (exactly one) by making $X$ and $Z$ independent. Can we then find a $Y$ for which the left-hand side exceeds one?

If $Y = X + Z$ and $X$ and $Z$ have equal variances, then $\mathrm{Cor}(X,Y) = \frac{\sqrt{2}}{2} \approx 0.707$, and similarly for $\mathrm{Cor}(Y,Z)$, so the left-hand side is well above one and the inequality is violated. Here is an example of this violation in R, where $X$ and $Z$ are components of a multivariate normal:

library(MASS)
set.seed(123)
d1 <- function(a,b) {1 - abs(cor(a,b))}

Sigma    <- matrix(c(1,0,0,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 1
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # nearly zero
Y <- X + Z

d1(X,Y) 
# 0.2928932
d1(Y,Z)
# 0.2928932
d1(X,Z)
# 1
d1(X,Z) <= d1(X,Y) + d1(Y,Z)
# FALSE

However, this construction does not work for $d_2$:

d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y) 
# 0.5
d2(Y,Z)
# 0.5
d2(X,Z)
# 1
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# TRUE

Rather than mounting a theoretical attack, at this stage I found it easier simply to fiddle with the covariance matrix Sigma in R until a nice counterexample popped out. Allowing $\mathrm{Var}(X) = 2$, $\mathrm{Var}(Z) = 1$ and $\mathrm{Cov}(X,Z) = 1$ gives:

$$\mathrm{Var}(Y) = \mathrm{Var}(X+Z) = \mathrm{Var}(X) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(X,Z) = 2 + 1 + 2 = 5$$

We can also work out the covariances:

$$\mathrm{Cov}(X,Y) = \mathrm{Cov}(X,\,X+Z) = \mathrm{Cov}(X,X) + \mathrm{Cov}(X,Z) = 2 + 1 = 3$$
$$\mathrm{Cov}(Y,Z) = \mathrm{Cov}(X+Z,\,Z) = \mathrm{Cov}(X,Z) + \mathrm{Cov}(Z,Z) = 1 + 1 = 2$$

The squared correlations are then:

$$\mathrm{Cor}(X,Z)^2 = \frac{\mathrm{Cov}(X,Z)^2}{\mathrm{Var}(X)\,\mathrm{Var}(Z)} = \frac{1^2}{2 \times 1} = 0.5$$
$$\mathrm{Cor}(X,Y)^2 = \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \frac{3^2}{2 \times 5} = 0.9$$
$$\mathrm{Cor}(Y,Z)^2 = \frac{\mathrm{Cov}(Y,Z)^2}{\mathrm{Var}(Y)\,\mathrm{Var}(Z)} = \frac{2^2}{5 \times 1} = 0.8$$

Then $d_2(X,Z) = 0.5$, while $d_2(X,Y) = 0.1$ and $d_2(Y,Z) = 0.2$, so the triangle inequality is violated by a substantial margin:

Sigma    <- matrix(c(2,1,1,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 2
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # 0.707
Y  <- X + Z
d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y) 
# 0.1
d2(Y,Z)
# 0.2
d2(X,Z)
# 0.5
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# FALSE

Answer (score 5):

Let us have three vectors (variables or individuals) $X$, $Y$, and $Z$, each standardized to z-scores (mean 0, variance 1).

According to the law of cosines, the squared Euclidean distance between two such vectors is $d_{XY}^2 = 2(n-1)(1 - \cos_{XY})$, where $\cos_{XY} = r_{XY}$ is the Pearson correlation. The constant multiplier $2(n-1)$ can be dropped from consideration.

So the first distance in the question,

$$d_1(X,Y) = 1 - |\mathrm{Cor}(X,Y)|,$$

behaves (up to that constant) as the squared Euclidean distance $d^2$ when the correlation is positive; using $|r|$ in place of $r$ simply folds negative correlations onto positive ones, so we may take all correlations to be positive.
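A quick numerical check of this relation (a minimal sketch; the data here are simulated purely for illustration):

set.seed(1)
n <- 100
x <- as.vector(scale(rnorm(n)))      # z-scored: mean 0, variance 1
y <- as.vector(scale(x + rnorm(n)))  # correlated with x, also z-scored
sum((x - y)^2)                       # squared Euclidean distance
2 * (n - 1) * (1 - cor(x, y))        # 2(n-1)(1 - r): matches the line above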

To see how this squared distance behaves, let us draw our three vectors:

[Figure: three unit-length vectors X, Y, Z in one plane; the angle between X and Y is α, between X and Z is β, and between Y and Z is α+β. Red squares are built on the chords d_XY and d_XZ, and a green square on the chord d_YZ.]

The vectors have unit length (because they are standardized). The cosines of the angles ($\alpha$, $\beta$, $\alpha+\beta$) are $r_{XY}$, $r_{XZ}$, $r_{YZ}$, respectively. These angles subtend the corresponding Euclidean distances $d_{XY}$, $d_{XZ}$, $d_{YZ}$. For simplicity, the three vectors all lie in the same plane (and so the angle between $Y$ and $Z$ is the sum of the other two, $\alpha+\beta$). This is the configuration in which the violation of the triangle inequality by the squared distances is most prominent.

For, as you can see with your own eyes, the green square's area exceeds the sum of the two red squares: $d_{YZ}^2 > d_{XY}^2 + d_{XZ}^2$.
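A minimal numerical check of the picture: the squared chord length between unit vectors at angle $\theta$ is $2(1-\cos\theta)$, and the 40 and 30 degree values below are the approximate angles read off the figure (see the computation further down):

a <- 40 * pi / 180             # alpha, as in the figure
b <- 30 * pi / 180             # beta, as in the figure
d2_XY <- 2 * (1 - cos(a))      # ~0.468
d2_XZ <- 2 * (1 - cos(b))      # ~0.268
d2_YZ <- 2 * (1 - cos(a + b))  # ~1.316
d2_YZ > d2_XY + d2_XZ          # TRUE: the squared distances break the inequality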

Therefore, regarding the distance

$$d_1(X,Y) = 1 - |\mathrm{Cor}(X,Y)|,$$

we can say it is not a metric: even when all the $r$s are positive, this distance is the squared Euclidean $d^2$, which is itself not a metric.

What about the second distance,

$$d_2(X,Y) = 1 - (\mathrm{Cor}(X,Y))^2\,?$$

Since the correlation $r$ between standardized vectors is a cosine, $1 - r^2$ is the $\sin^2$ of the corresponding angle. (Indeed, $1 - r^2$ is $SS_{\text{error}}/SS_{\text{total}}$ of a linear regression, a quantity that is the squared correlation of the dependent variable with something orthogonal to the predictor.) In that case, draw the sines for the vectors, and square them (because we are talking about the distance, which is $\sin^2$):

[Figure: the same three vectors, now with squares built on the sines of the angles: red squares of areas sin²_XY and sin²_XZ, and a green square of area sin²_YZ.]

Although it is not quite obvious visually, the green $\sin^2_{YZ}$ square is again larger than the sum of the red areas $\sin^2_{XY} + \sin^2_{XZ}$.
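(As an aside, the $SS_{\text{error}}/SS_{\text{total}}$ identity mentioned above is easy to verify numerically; a minimal sketch, with simulated data purely for illustration:)

set.seed(2)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)                     # illustrative simulated data
fit <- lm(y ~ x)
1 - cor(x, y)^2                               # d2(x, y)
sum(residuals(fit)^2) / sum((y - mean(y))^2)  # SS_error / SS_total: same value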

That the green area exceeds the sum of the red ones can be proved. On a plane, $\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$. Square both sides, since we are interested in $\sin^2$:

$$\sin^2(\alpha+\beta) = \sin^2\alpha\,(1-\sin^2\beta) + (1-\sin^2\alpha)\sin^2\beta + 2\sin\alpha\cos\beta\cos\alpha\sin\beta$$
$$= \sin^2\alpha + \sin^2\beta - 2\,[\sin^2\alpha\,\sin^2\beta] + 2\,[\sin\alpha\cos\alpha\,\sin\beta\cos\beta]$$

In the last expression, the two important terms are shown bracketed. If the second of the two is (or can be) larger than the first, then $\sin^2(\alpha+\beta) > \sin^2\alpha + \sin^2\beta$, and the $d_2$ distance violates the triangle inequality. And that is so in our picture, where $\alpha$ is about 40 degrees and $\beta$ is about 30 degrees (term 1 is 0.1033 and term 2 is 0.2132). So $d_2$ is not a metric.
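A minimal check of those two bracketed terms and the resulting violation:

a <- 40 * pi / 180                   # alpha, as in the picture
b <- 30 * pi / 180                   # beta, as in the picture
sin(a)^2 * sin(b)^2                  # term 1: ~0.1033
sin(a) * cos(a) * sin(b) * cos(b)    # term 2: ~0.2132
sin(a + b)^2 > sin(a)^2 + sin(b)^2   # TRUE: d2 violates the triangle inequality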

The square root of the $d_2$ distance, the sine dissimilarity measure, is a metric though (I believe). You can play with various $\alpha$ and $\beta$ angles on my circle to make sure. Whether the square root of $d_2$ will also turn out to be metric in a non-coplanar setting (i.e., three vectors not all in one plane), I cannot say at this time, although I tentatively suppose it will.
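One way to play with the angles is a small brute-force search over random coplanar configurations (a sketch, not a proof; for unit vectors at angles $t_i$ the correlation is $\cos(t_i - t_j)$, so the sine dissimilarity is $|\sin(t_i - t_j)|$):

# Search random coplanar triples for triangle-inequality violations
# by sqrt(d2) = sqrt(1 - r^2) = |sin(angle difference)|:
sine_dist <- function(s, t) sqrt(1 - cos(s - t)^2)
set.seed(3)
angles <- matrix(runif(3e4, 0, pi), ncol = 3)   # 10,000 random triples of angles
viol <- apply(angles, 1, function(t) {
  d <- c(sine_dist(t[1], t[2]), sine_dist(t[1], t[3]), sine_dist(t[2], t[3]))
  max(d) > sum(d) - max(d) + 1e-12   # longest side exceeds sum of the others?
})
sum(viol)   # 0: no violations found in this search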


Answer (score 3):

See also this preprint that I wrote: http://arxiv.org/abs/1208.3145. I still need to take time and properly submit it. The abstract:

We investigate two classes of transformations of cosine similarity and Pearson and Spearman correlations into metric distances, utilising the simple tool of metric-preserving functions. The first class puts anti-correlated objects maximally far apart. Previously known transforms fall within this class. The second class collates correlated and anti-correlated objects. An example of such a transformation that yields a metric distance is the sine function when applied to centered data.

The upshot for your question is that $d_1$ and $d_2$ are indeed not metrics, and that the square root of $d_2$ is in fact a proper metric.


Answer (score 2):

No.

Simplest counter-example:

For $X = (0,0)$, the distance is not defined at all, whatever your $Y$ is.

Any constant series has standard deviation $\sigma = 0$, and thus causes a division by zero in the definition of $\mathrm{Cor}$...

At most it is a metric on a subset of the data space, not including any constant series.
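In R, this shows up as an NA (with a warning that the standard deviation is zero); a minimal sketch:

x <- rep(0, 10)   # a constant series, sd = 0
y <- rnorm(10)
cor(x, y)         # NA, with a warning: the standard deviation is zero
d1 <- function(a, b) 1 - abs(cor(a, b))
d1(x, y)          # NA: the "distance" is undefined here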


Comment: Good point! I must mention this in the preprint mentioned elsewhere.
micans
Licensed under cc by-sa 3.0 with attribution required.