Answers:
The triangle inequality for $d_1$ would give:

$$d_1(X,Z) \le d_1(X,Y) + d_1(Y,Z)$$
This looks like a very easy inequality to defeat. We can make the left-hand side as large as possible (exactly one) by making $X$ and $Z$ independent. Can we then find a $Y$ for which the right-hand side sums to less than one?

If $Y = X + Z$ and $X$ and $Z$ have equal variances, then $\mathrm{Cor}(X,Y) = \frac{1}{\sqrt{2}} \approx 0.707$, and likewise $\mathrm{Cor}(Y,Z)$, so the right-hand side is $2 - \sqrt{2} \approx 0.586 < 1$ and the inequality is violated. Here is an example of this violation in R, where $X$ and $Z$ are components of a multivariate normal:
library(MASS)
set.seed(123)
d1 <- function(a,b) {1 - abs(cor(a,b))}
Sigma <- matrix(c(1,0,0,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 1
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # nearly zero
Y <- X + Z
d1(X,Y)
# 0.2928932
d1(Y,Z)
# 0.2928932
d1(X,Z)
# 1
d1(X,Z) <= d1(X,Y) + d1(Y,Z)
# FALSE
However, this construction does not work for $d_2$:
d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y)
# 0.5
d2(Y,Z)
# 0.5
d2(X,Z)
# 1
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# TRUE
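(To see why this construction lands exactly on the boundary rather than beyond it: with $X$ and $Z$ independent and of equal variance, $\mathrm{Cor}(X,Y)^2 = \mathrm{Cor}(Y,Z)^2 = \tfrac{1}{2}$, so

$$d_2(X,Y) + d_2(Y,Z) = \tfrac{1}{2} + \tfrac{1}{2} = 1 = d_2(X,Z),$$

and the triangle inequality holds with equality.)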
Rather than mount a theoretical attack, at this stage I found it easiest to fiddle around with the covariance matrix Sigma in R until a nice counterexample popped out. Allowing $\mathrm{Var}(X) = 2$, $\mathrm{Var}(Z) = 1$ and $\mathrm{Cov}(X,Z) = 1$ gives, with $Y = X + Z$:

$$\mathrm{Var}(Y) = \mathrm{Var}(X) + \mathrm{Var}(Z) + 2\,\mathrm{Cov}(X,Z) = 2 + 1 + 2 = 5$$

We can also examine the covariances:

$$\mathrm{Cov}(X,Y) = \mathrm{Cov}(X, X+Z) = \mathrm{Var}(X) + \mathrm{Cov}(X,Z) = 2 + 1 = 3$$
$$\mathrm{Cov}(Y,Z) = \mathrm{Cov}(X+Z, Z) = \mathrm{Cov}(X,Z) + \mathrm{Var}(Z) = 1 + 1 = 2$$

The squared correlations are then:

$$\mathrm{Cor}(X,Y)^2 = \frac{\mathrm{Cov}(X,Y)^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)} = \frac{9}{10} = 0.9 \qquad \mathrm{Cor}(Y,Z)^2 = \frac{4}{5} = 0.8 \qquad \mathrm{Cor}(X,Z)^2 = \frac{1}{2} = 0.5$$

Then $d_2(X,Y) = 0.1$ and $d_2(Y,Z) = 0.2$, while $d_2(X,Z) = 0.5$, so the triangle inequality is violated by a substantial margin:
Sigma <- matrix(c(2,1,1,1), nrow=2) # covariance matrix of X and Z
matrixXZ <- mvrnorm(n=1e3, mu=c(0,0), Sigma=Sigma, empirical=TRUE)
X <- matrixXZ[,1] # mean 0, variance 2
Z <- matrixXZ[,2] # mean 0, variance 1
cor(X,Z) # 0.707
Y <- X + Z
d2 <- function(a,b) {1 - cor(a,b)^2}
d2(X,Y)
# 0.1
d2(Y,Z)
# 0.2
d2(X,Z)
# 0.5
d2(X,Z) <= d2(X,Y) + d2(Y,Z)
# FALSE
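Incidentally, the "fiddling with Sigma" step can be automated. Below is a minimal sketch (my addition, not part of the run above; the helper r2 is made up for illustration) that samples random covariance structures for $(X, Z)$, sets $Y = X + Z$, and stops at the first triangle-inequality violation for $d_2$. Everything is computed from exact population quantities, so no simulation noise is involved.

set.seed(42)
r2 <- function(cab, va, vb) cab^2 / (va * vb)  # squared correlation from (co)variances
for (i in 1:10000) {
  vX  <- runif(1, 0.5, 3)
  vZ  <- runif(1, 0.5, 3)
  cXZ <- runif(1, -1, 1) * sqrt(vX * vZ)  # keeps the 2x2 covariance matrix valid
  vY  <- vX + vZ + 2 * cXZ                # Var(X + Z)
  if (vY < 1e-12) next                    # skip degenerate (perfectly anti-correlated) draws
  d2XY <- 1 - r2(vX + cXZ, vX, vY)        # Cov(X, X+Z) = Var(X) + Cov(X,Z)
  d2YZ <- 1 - r2(cXZ + vZ, vY, vZ)        # Cov(X+Z, Z) = Cov(X,Z) + Var(Z)
  d2XZ <- 1 - r2(cXZ, vX, vZ)
  if (d2XZ > d2XY + d2YZ + 1e-9) {
    cat(sprintf("violation at Var(X)=%.2f, Var(Z)=%.2f, Cov(X,Z)=%.2f; margin %.3f\n",
                vX, vZ, cXZ, d2XZ - (d2XY + d2YZ)))
    break
  }
}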
We have three vectors (variables or individuals) $X$, $Y$, and $Z$, and we have standardized each of them to z-scores (mean = 0, variance = 1). Since the cosine of the angle between two z-standardized vectors is their correlation, let's draw our three vectors.

The vectors are of unit length (because they are standardized). The cosines of the angles ($\alpha$ between $X$ and $Y$, $\beta$ between $Y$ and $Z$, $\alpha+\beta$ between $X$ and $Z$) are $r_{XY}$, $r_{YZ}$, $r_{XZ}$, respectively. These angles subtend the corresponding Euclidean distances between the vectors: $d_{XY}$, $d_{YZ}$, $d_{XZ}$. For simplicity, the three vectors are all on the same plane (so the angle between $X$ and $Z$ is the sum of the other two, $\alpha + \beta$). That is the position in which the violation of the triangle inequality by the squared distances is most prominent.
For, as you can see with your eyes, the green square's area exceeds the sum of the two red squares: $d_{XZ}^2 > d_{XY}^2 + d_{YZ}^2$.
Therefore, regarding the distance $d_1 = 1 - r$, we can say it is not a metric: even when all the $r$s are positive, this distance is proportional to the squared Euclidean distance, which itself is not a metric.
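To make the link explicit (assuming z-scores computed with denominator $n$, so that $\sum_i x_i^2 = n$ and $r_{XY} = \frac{1}{n}\sum_i x_i y_i$):

$$d_{XY}^2 = \sum_i (x_i - y_i)^2 = \sum_i x_i^2 + \sum_i y_i^2 - 2\sum_i x_i y_i = 2n(1 - r_{XY})$$

So $1 - r$ is the squared Euclidean distance up to the constant factor $2n$.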
What about the second distance?
Since the correlation of standardized vectors is the cosine of the angle between them, $1 - r^2$ is the squared sine of that angle. (Indeed, $1 - r^2$ is the SSerror/SStotal of a linear regression, a quantity which is the squared correlation of the dependent variable with something orthogonal to the predictor.) In that case, draw the sines of the angles between the vectors and square them (because we are talking about the distance, which is $\sin^2$):
Although it is not quite obvious visually, the green square $\sin^2(\alpha+\beta)$ is again larger than the sum of the red areas $\sin^2\alpha + \sin^2\beta$.
It can be proved. On a plane, $\sin(\alpha+\beta) = \sin\alpha\cos\beta + \cos\alpha\sin\beta$. Square both sides, since we are interested in $\sin^2(\alpha+\beta)$:

$$\sin^2(\alpha+\beta) = \sin^2\alpha\cos^2\beta + \cos^2\alpha\sin^2\beta + 2\sin\alpha\cos\alpha\sin\beta\cos\beta = \sin^2\alpha + \sin^2\beta - [2\sin^2\alpha\sin^2\beta] + [2\sin\alpha\cos\alpha\sin\beta\cos\beta]$$

In the last expression, two important terms are shown bracketed. If the second of the two is (or can be) larger than the first, then $\sin^2(\alpha+\beta) > \sin^2\alpha + \sin^2\beta$, and the "d2" distance violates the triangle inequality. And it is so in our picture, where $\alpha$ is about 40 degrees and $\beta$ is about 30 degrees (term 1 is $2 \times .1033 = .2066$ and term 2 is $.2132$). "D2" isn't a metric.
The square root of the "d2" distance - the sine dissimilarity measure - is metric, though (I believe). You can play with various $\alpha$ and $\beta$ angles on my circle to make sure. Whether that square root will also prove to be metric in a non-collinear setting (i.e. three vectors not on a plane), I can't say at this time, albeit I tentatively suppose it will.
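In lieu of the circle, a brute-force scan (my addition, covering angles up to 180°, i.e. correlations down to $-1$) suggests the sine never violates the triangle inequality on a plane:

grid <- expand.grid(a = seq(0, pi, length.out = 300),
                    b = seq(0, pi, length.out = 300))
violations <- with(grid, sin(a + b) > sin(a) + sin(b) + 1e-12)
any(violations)
# FALSE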
See also this preprint that I wrote: http://arxiv.org/abs/1208.3145 . I still need to take the time to submit it properly. The abstract:
We investigate two classes of transformations of cosine similarity and Pearson and Spearman correlations into metric distances, utilising the simple tool of metric-preserving functions. The first class puts anti-correlated objects maximally far apart. Previously known transforms fall within this class. The second class collates correlated and anti-correlated objects. An example of such a transformation that yields a metric distance is the sine function when applied to centered data.
The upshot for your question is that d1 and d2 are indeed not metrics, and that the square root of d2 is in fact a proper metric.
No.

Simplest counter-example: for a constant series, the distance is not defined at all, whatever your other series is.

Any constant series has standard deviation $0$, and thus causes a division by zero in the definition of the correlation $\rho$...

At most it is a metric on a subset of the data space, one not including any constant series.
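A two-line illustration in R (my addition):

X <- rep(1, 10)   # a constant series
Y <- rnorm(10)
sd(X)             # 0
cor(X, Y)         # NA, with a warning that the standard deviation is zero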