Is there an accepted definition for the median of a sample on the plane, or in higher-dimensional spaces?

33

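If so, what is it? If not, why not?

For a sample on a line, the median minimizes the total absolute deviation. It would seem natural to extend the definition to R2, etc., but I have never seen it. But then, I have been out in left field for a long time.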

multivariate-analysis spatial median

— phv3773
Source

stats.stackexchange.com/questions/89676/...

— kjetil b halvorsen

19

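I'm not sure there is one accepted definition of a multivariate median. The one I'm familiar with is Oja's median point, which minimizes the sum of the volumes of the simplices formed over subsets of the points. (See the link for a technical definition.)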
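To make the Oja criterion concrete in the plane, here is a minimal sketch, not taken from the linked reference. In 2D the simplices are the triangles formed by the candidate point and each pair of data points. The function names and the brute-force grid search over the bounding box are my own assumptions for illustration; real implementations use smarter searches.

```python
from itertools import combinations


def oja_objective(theta, points):
    """Sum of the areas of the triangles (theta, p, q) over all pairs p, q of data points."""
    tx, ty = theta
    total = 0.0
    for (px, py), (qx, qy) in combinations(points, 2):
        # Triangle area is half the absolute cross product of two edge vectors.
        total += 0.5 * abs((px - tx) * (qy - ty) - (qx - tx) * (py - ty))
    return total


def oja_median_grid(points, steps=41):
    """Approximate 2-D Oja median: best point of a regular grid over the bounding box.

    Assumption for illustration only: a plain grid search, accurate to one grid step.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    grid = [
        (x_lo + (x_hi - x_lo) * i / (steps - 1), y_lo + (y_hi - y_lo) * j / (steps - 1))
        for i in range(steps)
        for j in range(steps)
    ]
    return min(grid, key=lambda theta: oja_objective(theta, points))


# Four corners of a square plus its centre: by symmetry the centre should win.
sample = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0), (0.0, 0.0)]
oja_point = oja_median_grid(sample)
```

The cost grows quickly with the sample size because every pair of points contributes a term, so this is only meant for small toy samples.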

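Update: The site referenced for the Oja definition above also has a nice paper covering a number of definitions of the multivariate median:

Geometric Measures of Data Depth

— ars
Source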

1

Nice reference: thanks. It comprehensively covers everything mentioned here.

— whuber

The same website also has a simple overview in HTML: cgm.cs.mcgill.ca

— Aditya

15

As @Ars said, there is no accepted definition (and this is a good point). There are several general families of ways to generalize quantiles to $\mathbb{R}^d$. I think the most significant are:

Generalized quantile process. Let $P_n(A)$ be the empirical measure (= the proportion of observations in $A$). Then, with $\mathbb{A}$ a well-chosen subset of the Borel sets of $\mathbb{R}^d$ and $\lambda$ a real-valued measure, you can define the empirical quantile function:

$U_n(t)=\inf \{\lambda(A) : P_n(A)\geq t,\ A\in\mathbb{A}\}$

Suppose you can find one $A_{t}$ that gives you the minimum. Then the set (or an element of the set) $A_{1/2-\epsilon}\cap A_{1/2+\epsilon}$ gives you the median when $\epsilon$ is made small enough. The usual definition of the median is recovered when using $\mathbb{A}=\{\,]-\infty,x] : x\in\mathbb{R}\}$ and $\lambda(]-\infty,x])=x$. Ars's answer falls into that framework, I guess. Tukey's halfspace location may be obtained using $\mathbb{A}(a)=\{ H_{x} : x\in\mathbb{R}\}$ with $H_{x}=\{t\in \mathbb{R}^d :\; \langle a, t \rangle \leq x \}$ and $\lambda(H_{x})=x$ (with $x\in \mathbb{R}$, $a\in\mathbb{R}^d$).
Variational definition and M-estimation. The idea here is that the $\alpha$-quantile $Q_{\alpha}$ of a random variable $Y$ in $\mathbb{R}$ can be defined through a variational equality.
- The most common definition uses the quantile regression function $\rho_{\alpha}$ (also known as the pinball loss, guess why?): $Q_{\alpha}=\arg\inf_{x\in \mathbb{R}}\mathbb{E}[\rho_{\alpha}(Y-x)]$. The case $\alpha=1/2$ gives $\rho_{1/2}(y)\propto|y|$, and you can generalize that to higher dimensions using $l^1$ distances, as done in @Srikant's answer. This is the theoretical median, but you get the empirical median if you replace the expectation by the empirical expectation (the mean).
- Another definition is $Q_{\alpha}=\arg\sup_s (s\alpha-f(s))$ where $f(s)=\frac{1}{2}\mathbb{E} [|s-Y|-|Y|+s]$ for $s\in \mathbb{R}$. The paper proposing it gives a lot of deep reasons for that (see the paper ;)). Generalizing this to higher dimensions requires working with a vectorial $\alpha$ and replacing $s\alpha$ by $\langle s,\alpha\rangle$, but you can take $\alpha=(1/2,\dots,1/2)$.

Partial ordering. You can generalize the definition of quantiles to $\mathbb{R}^d$ as soon as you can create a partial order (with equivalence classes).
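For the pinball-loss definition in the list above, a small univariate sketch may help. It assumes the usual normalization $\rho_{\alpha}(u)=u(\alpha-\mathbf{1}_{u<0})$ and replaces the expectation by the sample average; the function names are hypothetical. Since the average loss is piecewise linear and convex in $x$, a minimizer can always be found among the sample values, so the search is restricted to those.

```python
def pinball_loss(u, alpha):
    """rho_alpha(u) = alpha * u for u >= 0 and (alpha - 1) * u for u < 0."""
    return alpha * u if u >= 0 else (alpha - 1) * u


def empirical_quantile(sample, alpha):
    """Sample value that minimises the average pinball loss at level alpha."""
    return min(sample, key=lambda x: sum(pinball_loss(y - x, alpha) for y in sample))


data = [2.0, 9.0, 4.0, 7.0, 1.0]
q_half = empirical_quantile(data, 0.5)   # the ordinary sample median
q_high = empirical_quantile(data, 0.9)   # an upper quantile
```

With an even sample size and $\alpha=1/2$ every point between the two middle values minimizes the loss, so this sketch simply returns one of them.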

Obviously there are bridges between the different formulations. They are not all obvious...

— robin girard
Source

Nice answer, Robin!

— ars

12

There are distinct ways to generalize the concept of median to higher dimensions. One not yet mentioned, but which was proposed long ago, is to construct a convex hull, peel it away, and iterate for as long as you can: what's left in the last hull is a set of points that are all candidates to be "medians."
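The peeling idea can be sketched in a few lines for 2D data. This is only an illustration under my own assumptions: the hull is computed with the standard monotone-chain method, hull vertices are the points removed at each step, and the names `convex_hull` and `peel_to_core` are made up for the example.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the convex hull (monotone chain), collinear boundary points excluded."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peel_to_core(points):
    """Strip hull vertices repeatedly; return the points of the innermost layer."""
    remaining = list(set(points))
    while True:
        hull = set(convex_hull(remaining))
        inner = [p for p in remaining if p not in hull]
        if not inner:
            return remaining  # nothing left inside: these are the candidate medians
        remaining = inner


# Two nested squares around a single central point.
cloud = [(-2, -2), (-2, 2), (2, -2), (2, 2), (-1, -1), (-1, 1), (1, -1), (1, 1), (0, 0)]
core = peel_to_core(cloud)
```

As the answer says, the last layer is in general a set of candidate points rather than a single one.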

"Head-banging" is another more recent attempt (c. 1980) to construct a robust center to a 2D point cloud. (The link is to documentation and software available at the US National Cancer Institute.)

The principal reason why there are multiple distinct generalizations and no one obvious solution is that R1 can be ordered but R2, R3, ... cannot be.

— whuber
Source

Any measure that coincides with the usual median when restricted to R1 is a candidate generalization. There must be a lot of them.

— phv3773

phv:> one can ask for 'the' generalization to preserve (in higher dimensions) some of the interesting properties of the median. This severely limits the number of candidates (see the comments after Srikant's answer below).

— user603

@Whuber:> the notion of ordering can be generalized to R^n for unimodal distributions (see my answer below).

— user603

@kwak: could you elaborate a little? The usual mathematical definition of an ordering of a space is independent of any kind of probability distribution, so you must implicitly have some additional assumptions in mind.

— whuber

1

@Whuber:> You state: "R1 can be ordered but R2, R3, ... cannot be". R2, R3, ... can be ordered in many ways by mapping from Rn to R. One such way is the Tukey depth. It has many important properties (robustness to some extent, nonparametric, invariance, ...) but these only hold for the case of unimodal distributions. Let me know if you want more details.

— user603

7

The geometric median is the point with the smallest average Euclidean distance from the samples.
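There is no closed form for this point in general, but it can be approximated iteratively. Below is a minimal sketch using Weiszfeld's iteration, which repeatedly re-weights the sample points by the inverse of their distance to the current estimate. The starting point (the centroid), the fixed iteration count and the `eps` safeguard against division by zero are assumptions made for the example.

```python
import math


def geometric_median(points, iterations=200, eps=1e-12):
    """Approximate the geometric median with Weiszfeld's fixed-point iteration."""
    dim = len(points[0])
    # Start from the centroid (the ordinary mean of the sample).
    current = tuple(sum(p[k] for p in points) / len(points) for k in range(dim))
    for _ in range(iterations):
        weighted_sum = [0.0] * dim
        weight_total = 0.0
        for p in points:
            # Inverse-distance weight; eps avoids dividing by zero at a data point.
            weight = 1.0 / max(math.dist(p, current), eps)
            weight_total += weight
            for k in range(dim):
                weighted_sum[k] += weight * p[k]
        current = tuple(s / weight_total for s in weighted_sum)
    return current


# A unit square plus one far-away outlier.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
center = geometric_median(cloud)
centroid = tuple(sum(c) / len(cloud) for c in zip(*cloud))
```

On this sample the centroid is dragged to (20.4, 20.4) by the outlier, while the Weiszfeld estimate stays inside the unit square.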

— Yaroslav Bulatov
Source

Also stats.stackexchange.com/questions/113239/…, stats.stackexchange.com/questions/89676/…

— kjetil b halvorsen

6

The Tukey halfspace median can be extended to >2 dimensions using DEEPLOC, an algorithm due to Struyf and Rousseeuw; see here for details.

The algorithm is used to approximate the point of greatest depth efficiently; naive methods which attempt to determine this exactly usually run afoul of (the computational version of) "the curse of dimensionality", where the runtime required to calculate a statistic grows exponentially with the number of dimensions of the space.
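To show what "point of greatest depth" means, here is a rough 2D sketch. It is not DEEPLOC: it approximates the halfspace depth of a point by scanning a fixed number of directions, and it only considers the sample points themselves as candidates. Both simplifications, and the function names, are assumptions for illustration.

```python
import math


def halfspace_depth(theta, points, n_directions=360):
    """Approximate Tukey depth of theta: the smallest fraction of points found in a
    closed halfplane whose boundary passes through theta, over the scanned directions."""
    smallest = len(points)
    for k in range(n_directions):
        angle = 2.0 * math.pi * k / n_directions
        ux, uy = math.cos(angle), math.sin(angle)
        count = sum(
            1 for (px, py) in points
            if ux * (px - theta[0]) + uy * (py - theta[1]) >= 0
        )
        smallest = min(smallest, count)
    return smallest / len(points)


def deepest_sample_point(points, n_directions=360):
    """Sample point with the largest approximate depth (a stand-in for the Tukey median)."""
    return max(points, key=lambda p: halfspace_depth(p, points, n_directions))


cloud = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0), (0.0, 0.0)]
deepest = deepest_sample_point(cloud)
depth_of_deepest = halfspace_depth(deepest, cloud)
```

Scanning a finite set of directions can only overestimate the true depth, and the number of directions needed grows quickly with the dimension, which is one face of the computational problem described above.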

— Gary Campbell
Source

3

A definition that comes close to it, for unimodal distributions, is the Tukey halfspace median.

— user603
Source

0

I do not know if any such definition exists but I will try and extend the standard definition of the median to $R^2$ . I will use the following notation:

$X$ , $Y$ : the random variables associated with the two dimensions.

$m_x$ , $m_y$ : the corresponding medians.

$f(x,y)$ : the joint pdf for our random variables

To extend the definition of the median to $R^2$ , we choose $m_x$ and $m_y$ to minimize the following:

$E(|(x,y) - (m_x,m_y)|)$

The problem now is that we need a definition for what we mean by:

$|(x,y) - (m_x,m_y)|$

The above is in a sense a distance metric and several possible candidate definitions are possible.

Euclidean Metric

$|(x,y) - (m_x,m_y)| = \sqrt{(x-m_x)^2 + (y-m_y)^2}$

Computing the median under the Euclidean metric will require computing the expectation of the above with respect to the joint density $f(x,y)$.

Taxicab Metric

$|(x,y) - (m_x,m_y)| = |x-m_x| + |y-m_y|$

Computing the median in the case of the taxicab metric involves computing the median of $X$ and $Y$ separately as the metric is separable in $x$ and $y$ .
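For a sample rather than a density, the separability can be checked directly: the total taxicab distance splits into one sum per coordinate, so each coordinate can be minimized on its own with the ordinary median. A minimal sketch, with hypothetical function names:

```python
from statistics import median


def taxicab_median(points):
    """Coordinate-wise median: minimises the total L1 (taxicab) distance to the sample."""
    return tuple(median(coordinate) for coordinate in zip(*points))


def total_taxicab_distance(theta, points):
    """Sum over the sample of |x - theta_x| + |y - theta_y| (any dimension)."""
    return sum(sum(abs(c - t) for c, t in zip(p, theta)) for p in points)


sample = [(0.0, 0.0), (1.0, 5.0), (2.0, 1.0), (10.0, 2.0), (3.0, 3.0)]
m = taxicab_median(sample)
cost_at_median = total_taxicab_distance(m, sample)
cost_at_mean = total_taxicab_distance((3.2, 2.2), sample)  # (3.2, 2.2) is the sample mean
```

Note that the result depends on the choice of axes, since rotating the data changes which coordinates are being medianed.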

Srikant:> No. The definition has to have two important features of the univariate median: a) invariance to monotone transformations of the data, b) robustness to contamination by outliers. None of the extensions you propose have these. The Tukey depth has these qualities.

— user603

@kwak What you say makes sense.

@Srikant:> Check the R&S paper cited by Gary Campbell above ;). Best,

— user603

@kwak On thinking some more, the taxicab metric does have the features you mentioned as it basically reduces to univariate medians. no?

2

@Srikant:> there is no incorrect answer to phv's question because there are no 'good answers' either; this area of research is still under development. I simply wanted to point out why it is still an open problem.

— user603