Does a sample version of the one-sided Chebyshev inequality exist?

I am interested in the following one-sided Cantelli version of the Chebyshev inequality:

$\mathbb P(X - \mathbb E (X) \geq t) \leq \frac{\mathrm{Var}(X)}{\mathrm{Var}(X) + t^2} \,.$

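Basically, if you know the population mean and variance, you can calculate an upper bound on the probability of observing a value at least $t > 0$ above the mean. (That was my understanding, at least.)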
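To make the population version concrete, here is a minimal sketch in Python. The function names `cantelli_bound` and `empirical_tail` are made up for illustration, and the Exponential(1) population (mean 1, variance 1) is an assumed example, not something taken from the question.

```python
import random


def cantelli_bound(variance: float, t: float) -> float:
    """Upper bound on P(X - E[X] >= t) for t > 0, given the population variance."""
    if t <= 0:
        raise ValueError("t must be positive")
    return variance / (variance + t * t)


def empirical_tail(draws, mean, t):
    """Fraction of draws with X - mean >= t."""
    return sum(x - mean >= t for x in draws) / len(draws)


# Assumed example population: Exponential(1), which has mean 1 and variance 1.
rng = random.Random(0)
draws = [rng.expovariate(1.0) for _ in range(100_000)]

for t in (0.5, 1.0, 2.0):
    # Each line: t, simulated tail frequency, Cantelli upper bound.
    print(t, empirical_tail(draws, 1.0, t), cantelli_bound(1.0, t))
```

The simulated frequencies should sit well below the bound, since the inequality has to hold for every distribution with that variance.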

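However, I would like to use the sample mean and sample variance instead of the actual population mean and variance.

I am guessing that, since this introduces more uncertainty, the upper bound would increase.

Is there an inequality analogous to the one above, but that uses the sample mean and variance?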

Edit: A "sample" analog of the Chebyshev inequality (not one-sided) has been worked out. The Wikipedia page has some details. However, I am not sure how it would translate to the one-sided case above.

— カサンドラ

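Thanks, Glen_b. This is quite an interesting problem. I have always thought Chebyshev's inequality was powerful (since it lets you do statistical inference without requiring a probability distribution). So being able to use it with the sample mean and variance would be really nice.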

— カサンドラ, '14

Answers:

Yes, we can get an analogous result using the sample mean and variance, with perhaps a couple of slight surprises emerging in the process.

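First, we need to refine the question statement just a little and set out a few assumptions. Importantly, it should be clear that we cannot hope to replace the population variance with the sample variance on the right-hand side, since the latter is random. So, we refocus our attention on the equivalent inequality

$\mathbb P\left( X - \mathbb E X \geq t \sigma \right) \leq \frac{1}{1+t^2} \>.$

In case it is not clear that these are equivalent, note that we have simply replaced $t$ with $t \sigma$ in the original inequality, without loss of generality.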

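Second, we assume that we have a random sample $X_1,\ldots,X_n$ and we are interested in an upper bound for the analogous quantity $\mathbb P(X_1 - \bar X \geq t S)$, where $\bar X$ is the sample mean and $S$ is the sample standard deviation.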

A half-step forward

Note that already by applying the original one-sided Chebyshev inequality to $X_1 - \bar X$, we get that

$\mathbb P( X_1 - \bar X \geq t\sigma ) \leq \frac{1}{1 + \frac{n}{n-1}t^2}$

where $\sigma^2 = \mathrm{Var}(X_1)$, which is smaller than the right-hand side of the original version. This makes sense! Any particular realization of a random variable from a sample will tend to be (slightly) closer to the sample mean, to which it contributes, than to the population mean. As we shall see below, we get to replace $\sigma$ by $S$ under even more general assumptions.
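For readers who want the intermediate step spelled out, here is a sketch of the calculation behind this half-step. It assumes, for this step only, that the $X_i$ are independent and identically distributed with finite variance $\sigma^2$:

$\begin{aligned}
X_1 - \bar X &= \Bigl(1 - \tfrac{1}{n}\Bigr) X_1 - \tfrac{1}{n} \sum_{i=2}^n X_i \,, \qquad \mathbb E (X_1 - \bar X) = 0 \,, \\
\mathrm{Var}(X_1 - \bar X) &= \Bigl(1 - \tfrac{1}{n}\Bigr)^2 \sigma^2 + \tfrac{n-1}{n^2}\, \sigma^2 = \tfrac{n-1}{n}\, \sigma^2 \,, \\
\mathbb P(X_1 - \bar X \geq t\sigma) &\leq \frac{\frac{n-1}{n}\sigma^2}{\frac{n-1}{n}\sigma^2 + t^2\sigma^2} = \frac{1}{1 + \frac{n}{n-1}t^2} \,.
\end{aligned}$

The last line is the one-sided inequality from the question applied to $X_1 - \bar X$ with threshold $t\sigma$.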

A sample version of one-sided Chebyshev

Claim: Let $X_1,\ldots,X_n$ be a random sample such that $\mathbb P(S = 0) = 0$. Then,

$\mathbb P(X_1 - \bar X \geq t S) \leq \frac{1}{1 + \frac{n}{n-1} t^2}\>.$

In particular, the sample version of the bound is tighter than the original population version.

Note: We do not assume that the $X_i$ have either finite mean or finite variance!
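The claim can be probed numerically. The sketch below is only an illustration, not part of the answer: the helper names are hypothetical, and the standard Cauchy distribution is an assumed test case chosen because it has neither a mean nor a variance.

```python
import math
import random


def sample_bound(n: int, t: float) -> float:
    """Right-hand side 1 / (1 + n/(n-1) * t^2) of the claimed sample inequality."""
    return 1.0 / (1.0 + n / (n - 1) * t * t)


def centered_and_sd(xs):
    """Return the deviations x_i - mean and the bias-corrected standard deviation S."""
    n = len(xs)
    mean = sum(xs) / n
    ys = [x - mean for x in xs]
    s = math.sqrt(sum(y * y for y in ys) / (n - 1))
    return ys, s


def standard_cauchy(rng):
    """Inverse-CDF draw from a standard Cauchy (no finite mean or variance)."""
    return math.tan(math.pi * (rng.random() - 0.5))


# Monte Carlo estimate of P(X_1 - mean >= t * S) over repeated samples of size n.
rng = random.Random(1)
n, t, reps = 10, 1.5, 20_000
hits = 0
for _ in range(reps):
    ys, s = centered_and_sd([standard_cauchy(rng) for _ in range(n)])
    hits += ys[0] >= t * s
estimate = hits / reps

print(estimate, sample_bound(n, t))
```

The estimate should come out below the bound even though the population mean and variance do not exist, which is the point of the note above.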

Proof. The idea is to adapt the proof of the original one-sided Chebyshev inequality and employ symmetry in the process. First, set $Y_i = X_i - \bar X$ for notational convenience. Then, observe that

$\mathbb P( Y_1 \geq t S ) = \frac{1}{n} \sum_{i=1}^n \mathbb P( Y_i \geq t S ) = \mathbb E \frac{1}{n} \sum_{i=1}^n \mathbf 1_{(Y_i \geq t S)} \>.$

Now, for any $c > 0$ , on $\{S > 0\}$ ,

$\mathbf 1_{(Y_i \geq t S)} = \mathbf 1_{(Y_i + t c S \geq t S (1+c))} \leq \mathbf 1_{((Y_i + t c S)^2 \geq t^2 (1+c)^2 S^2)} \leq \frac{(Y_i + t c S)^2}{t^2(1+c)^2 S^2}\>.$

Then,

$\frac{1}{n} \sum_i \mathbf 1_{(Y_i \geq t S)} \leq \frac{1}{n} \sum_i \frac{(Y_i + t c S)^2}{t^2(1+c)^2 S^2} = \frac{(n-1)S^2 + n t^2 c^2 S^2}{n t^2 (1+c)^2 S^2} = \frac{(n-1) + n t^2 c^2}{n t^2 (1+c)^2} \>,$

since $\bar Y = 0$ and $\sum_i Y_i^2 = (n-1)S^2$.

The right-hand side is a constant (!), so taking expectations on both sides yields,

$\mathbb P(X_1 - \bar X \geq t S) \leq \frac{(n-1) + n t^2 c^2}{n t^2 (1+c)^2} \>.$

Finally, minimizing over $c$ yields $c = \frac{n-1}{n t^2}$, which after a little algebra establishes the result.
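The "little algebra" can be sketched as follows, writing $f(c)$ (a label introduced here for convenience) for the right-hand side of the last bound:

$\begin{aligned}
f(c) &= \frac{(n-1) + n t^2 c^2}{n t^2 (1+c)^2} \,, \\
f'(c) = 0 &\iff n t^2 c\,(1+c) = (n-1) + n t^2 c^2 \iff c = \frac{n-1}{n t^2} \,, \\
(n-1) + n t^2 c^2 &= (n-1)\Bigl(1 + \frac{n-1}{n t^2}\Bigr) = (n-1)(1+c) \quad \text{at this } c \,, \\
f(c) &= \frac{n-1}{n t^2 (1+c)} = \frac{n-1}{n t^2 + (n-1)} = \frac{1}{1 + \frac{n}{n-1} t^2} \,.
\end{aligned}$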

That pesky technical condition

Note that we had to assume $\mathbb P(S = 0) = 0$ in order to be able to divide by $S^2$ in the analysis. This is no problem for absolutely continuous distributions, but poses an inconvenience for discrete ones. For a discrete distribution, there is some probability that all observations are equal, in which case $0 = Y_i = t S = 0$ for all $i$ and $t > 0$ .

We can wiggle our way out by setting $q = \mathbb P(S = 0)$ . Then, a careful accounting of the argument shows that everything goes through virtually unchanged and we get

Corollary 1. For the case $q = \mathbb P(S = 0) > 0$ , we have
$\mathbb P(X_1 - \bar X \geq t S) \leq (1-q) \frac{1}{1 + \frac{n}{n-1} t^2} + q \>.$

Proof. Split on the events $\{S > 0\}$ and $\{S = 0\}$ . The previous proof goes through for $\{S > 0\}$ and the case $\{S = 0\}$ is trivial.

A slightly cleaner inequality results if we replace the nonstrict inequality in the probability statement with a strict version.

Corollary 2. Let $q = \mathbb P(S = 0)$ (possibly zero). Then,
$\mathbb P(X_1 - \bar X > t S) \leq (1-q) \frac{1}{1 + \frac{n}{n-1} t^2} \>.$
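Both corollaries can be checked exactly on a small discrete case. The following sketch assumes an i.i.d. Bernoulli($p$) sample, which is an illustrative choice and not taken from the answer; the function name `exact_probabilities` is hypothetical. It enumerates all outcomes with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product


def exact_probabilities(n, p, t):
    """Exact P(S = 0), P(X1 - mean >= t*S) and P(X1 - mean > t*S) for n iid Bernoulli(p)."""
    q = weak = strict = Fraction(0)
    for xs in product((0, 1), repeat=n):
        prob = Fraction(1)
        for x in xs:
            prob *= p if x == 1 else 1 - p
        mean = Fraction(sum(xs), n)
        y1 = xs[0] - mean
        s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)  # bias-corrected S^2
        if s2 == 0:
            q += prob
        # Compare squares so everything stays rational: for t > 0,
        # y1 >= t*S  <=>  y1 >= 0 and y1^2 >= t^2 * S^2 (same with strict signs).
        if y1 >= 0 and y1 * y1 >= t * t * s2:
            weak += prob
        if y1 > 0 and y1 * y1 > t * t * s2:
            strict += prob
    return q, weak, strict


n, p, t = 3, Fraction(1, 2), Fraction(1)
q, weak, strict = exact_probabilities(n, p, t)
base = 1 / (1 + Fraction(n, n - 1) * t * t)

print(q, weak, (1 - q) * base + q)  # Corollary 1: probability vs. bound
print(strict, (1 - q) * base)       # Corollary 2: probability vs. bound
```

For this small case ($n = 3$, $p = 1/2$, $t = 1$) the enumeration gives $q = 1/4$, with $3/8 \leq 11/20$ for Corollary 1 and $1/8 \leq 3/10$ for Corollary 2.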

Final remark: The sample version of the inequality required no assumptions on $X$ (other than that it not be almost-surely constant in the nonstrict inequality case, which the original version also tacitly assumes), in essence, because the sample mean and sample variance always exist whether or not their population analogs do.

— cardinal

This is just a complement to @cardinal's ingenious answer. Samuelson's inequality states that, for a sample of size $n$, when we have at least three distinct values of the realized $x_i$'s, it holds that

$x_i-\bar x < s\sqrt{n-1},\;\; i=1,\ldots,n$

where $s$ is calculated without the bias correction, $s= \left (\frac 1n\sum_{i=1}^n(x_i-\bar x)^2\right)^{1/2}$.

Then, using the notation of Cardinal's answer we can state that

$\mathbb P\left(X_1-\bar X \ge S\sqrt{n-1}\right) =0 \;\;a.s. \qquad [1]$

Since we require three distinct values, we will have $S\neq 0$ by assumption. So, setting $t=\sqrt{n-1}$ in Cardinal's Inequality (the initial version), we obtain

$\mathbb P\left (X_1 - \bar X \geq S\sqrt{n-1}\right) \leq \frac{1}{1 + n}, \;\; \qquad [2]$

Eq. $[2]$ is of course compatible with eq. $[1]$ . The combination of the two tells us that Cardinal's Inequality is useful as a probabilistic statement for $0< t < \sqrt{n-1}$ .

If Cardinal's Inequality requires $S$ to be calculated bias-corrected (call this $\tilde S$ ) then the equations become

$\mathbb P\left(X_1-\bar X \ge \tilde S\frac{n-1}{\sqrt{n}}\right) =0 \;\;a.s. \qquad [1a]$

and we choose $t= \frac{n-1}{\sqrt{n}}$ to obtain through Cardinal's Inequality

$\mathbb P\left (X_1 - \bar X \geq \tilde S\frac{n-1}{\sqrt{n}}\right) \leq \frac{1}{ n}, \;\; \qquad [2a]$

and the probabilistically meaningful interval for $t$ is

$0< t < \frac{n-1}{\sqrt{n}}.$
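A quick numerical sanity check of these relations, as a sketch only: the helper names are hypothetical and the Gaussian samples are an assumed test case. It confirms that the largest deviation from the sample mean stays within Samuelson's limit, that the two ways of writing the limit ($s\sqrt{n-1}$ and $\tilde S (n-1)/\sqrt{n}$) agree, and that the sample bound evaluated at $t = (n-1)/\sqrt{n}$ equals $1/n$.

```python
import math
import random


def samuelson_limits(xs):
    """Return (max_i (x_i - mean), s*sqrt(n-1), S_tilde*(n-1)/sqrt(n))."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    s = math.sqrt(ss / n)              # no bias correction
    s_tilde = math.sqrt(ss / (n - 1))  # bias-corrected
    max_dev = max(x - mean for x in xs)
    return max_dev, s * math.sqrt(n - 1), s_tilde * (n - 1) / math.sqrt(n)


def sample_bound(n, t):
    """1 / (1 + n/(n-1) * t^2), the bound for the bias-corrected S."""
    return 1.0 / (1.0 + n / (n - 1) * t * t)


rng = random.Random(2)
n = 8
for _ in range(5):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Each line: largest deviation, then the limit written in both ways.
    print(samuelson_limits(xs))

# At the Samuelson threshold the probability bound reduces to 1/n.
print(sample_bound(n, (n - 1) / math.sqrt(n)), 1 / n)
```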

— Alecos Papadopoulos

(+1) Incidentally, as I was first considering this problem, the fact that $\max_i |X_i - \bar X| \leq S\sqrt{n-1}$ was actually the initial clue that the sample inequality should be tighter than the original. I wanted to squeeze that into my post, but couldn't find a (comfortable) place for it. I'm glad to see you mention it (actually a very slight improvement on it) here along with your very nice additional elaboration. Cheers.

— cardinal

Cheers @Cardinal, great answer -just clarify for me -does it matter for your Inequality how one defines the sample variance (bias-corrected or not)?

— Alecos Papadopoulos

Only ever so slightly. I used the bias-corrected sample variance. If you use $n$ instead of $n-1$ to normalize, then you'll end up with $\frac{1+t^2c^2}{t^2(1+c)^2}$ instead of $\frac{(n-1) + n t^2c^2}{nt^2(1+c)^2}$, which means the $n/(n-1)$ term in the final inequality will disappear. Thus, you'll get the same bound as in the original one-sided Chebyshev inequality in that case. (Assuming I've done the algebra correctly.) :-)

— cardinal

@Cardinal ...which means that the relevant equations in my answer are $[1a]$ and $[2a]$, which means that your inequality tells us that, for $t$ chosen to activate Samuelson's inequality, the probability of the event we are examining cannot be greater than $1/n$, i.e. not greater than that of randomly choosing any one realized value from the sample... which somehow makes some hazy intuitive sense: what is proven certainly impossible in deterministic terms, when approached probabilistically its probability bound does not exceed equiprobability... not clear in my mind yet.

— Alecos Papadopoulos