How can I assess the statistical significance of a classifier's accuracy?

8

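My classifier outputs its accuracy as a percentage, together with the number of input samples. Is there a test that can tell me, based on this information, whether the result is statistically significant?

Thanks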

statistical-significance classification

— シャン

Could you give an example?

— Max Gordon

3

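It is not clear what you have or what you want. There are tests of whether a proportion is 0, but that is not a meaningful test of accuracy: an accuracy of 0 is, in a sense, perfect, because the classifier is always wrong!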

— Peter Flom

10

You want to define the distribution of the accuracy obtained by just guessing. Perhaps this is $X/n$ where $X \sim \mbox{Binomial}(n, p)$ for some known $p$ (say 50%).

Then calculate the probability of observing the result you obtained if this null model were true. In R, you can use binom.test directly, or compute it with pbinom.
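As a rough illustration of the null-model calculation above, here is a minimal Python sketch using only the standard library. The function name `binom_tail_pvalue` and the numbers (60 correct out of 100, chance level 50%) are made up for the example. It computes the one-sided tail probability $P(X \ge \text{correct})$, which should match what R's pbinom returns for the upper tail at `correct - 1`.

```python
from math import comb

def binom_tail_pvalue(correct, n, p0=0.5):
    """One-sided exact p-value: P(X >= correct) for X ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(correct, n + 1)
    )

# Hypothetical result: 60 of 100 test samples classified correctly,
# compared against a guessing accuracy of p0 = 0.5.
p_value = binom_tail_pvalue(60, 100, 0.5)
print(f"one-sided p-value under the guessing null: {p_value:.4f}")
```

A small p-value says only that the observed accuracy would be unlikely under pure guessing with the assumed $p$; it says nothing about how good the classifier is.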

Usually you will want to compare the accuracy not to "guessing" but to some alternative method. In that case you can use McNemar's test; in R, mcnemar.test.
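To make the comparison of two methods concrete, the following sketch implements the exact (binomial) variant of McNemar's test in plain Python. This is an assumption on my part: R's mcnemar.test uses a chi-squared approximation, so the numbers will differ slightly. The function name `mcnemar_exact` and the input format (one correct/incorrect flag per test sample for each classifier) are hypothetical.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags.

    Only discordant pairs matter:
      b = samples where A is right and B is wrong,
      c = samples where A is wrong and B is right.
    Under the null, b ~ Binomial(b + c, 0.5).
    """
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if not a and o)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the classifiers never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)

# Made-up example: A beats B on 8 samples, B never beats A.
flags_a = [True] * 8 + [True, False]
flags_b = [False] * 8 + [True, False]
print(mcnemar_exact(flags_a, flags_b))
```

Both classifiers must be evaluated on the same test samples, since the test is built on the paired disagreements.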

— カール

6

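I don't see where a test against complete randomness is helpful. A classifier that can only beat pure random guessing is not very useful. The bigger problem is your use of the proportion classified correctly as your accuracy score. This is a discontinuous, improper scoring rule that is easily manipulated because it is arbitrary and insensitive. One (of many) ways to see its deficiencies is to compute the proportion classified correctly when you have a model with only an intercept. It will be high whenever the outcome prevalence is not close to 0.5.

Once you choose a more appropriate rule, it would be valuable to compute a confidence interval for the index. Statistical significance is of little value.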

— Frank Harrell

By proportion classified correctly, do you mean the standard classification accuracy? Thanks

— Simone

1

Yes; a very problematic measure.

— Frank Harrell, 2011

Yes, it is a very problematic measure. You are right.

— Simone

2

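Classifiers that just barely beat random guessing can be extremely useful in some situations. Thus, having some test that quantifies confidence in a classifier being better than chance is also useful.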

— 2013

3

Sure, you can compute a confidence interval. If $\mbox{acc}$ is the accuracy estimated on a test set of $N$ elements and $p$ is the true accuracy, then approximately

$\frac{\mbox{acc}-p}{\sqrt{p(1-p)/N}} \sim \mathcal{N}(0,1)$

Thus

$P\bigg( \frac{\mbox{acc}-p}{\sqrt{p(1-p)/N}} \in [-z_{\alpha/2},+z_{\alpha/2}]\bigg) \approx 1 - \alpha$

So you can say that:

$P(p \in [l,u]) \approx 1 - \alpha$

For example, you can compute the Wilson interval:

$l = \frac{2 \ N \ \mbox{acc} + z_{\alpha/2}^2 - z_{\alpha/2} \sqrt{z_{\alpha/2}^2+4 \ N \ \mbox{acc}-4 \ N \ \mbox{acc}^2}}{2(N+z_{\alpha/2}^2)}$

$u = \frac{2 \ N \ \mbox{acc} + z_{\alpha/2}^2 + z_{\alpha/2} \sqrt{z_{\alpha/2}^2+4 \ N \ \mbox{acc}-4 \ N \ \mbox{acc}^2}}{2(N+z_{\alpha/2}^2)}$
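The bounds $l$ and $u$ above translate directly into code. Below is a minimal Python sketch using only the standard library; the function name `wilson_interval` and the example numbers (accuracy 0.6 on 100 test elements) are illustrative assumptions, not part of the answer.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(acc, n, alpha=0.05):
    """Wilson interval (l, u) for the true accuracy p.

    acc   : accuracy estimated on the test set (between 0 and 1)
    n     : number of test elements (N in the formulas)
    alpha : 1 - alpha is the nominal coverage
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    root = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    lower = (2 * n * acc + z * z - root) / denom
    upper = (2 * n * acc + z * z + root) / denom
    return lower, upper

# Hypothetical classifier: 60% accuracy on a test set of 100 elements.
l, u = wilson_interval(0.60, 100)
print(f"95% Wilson interval: [{l:.3f}, {u:.3f}]")
```

If the lower bound stays above the accuracy you consider trivial (for instance 0.5 for a balanced binary problem), the interval already answers the significance question, and it also shows how precisely the accuracy is known.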

I think you can quantify how much your performance differs from that of a random classifier by computing the gain. The accuracy of a random classifier is:

$\mbox{acc}_r = \sum_{i=1}^{c} p_i^2$

where $p_i$ is the empirical frequency of class $i$ estimated on the test set, and $c$ is the number of different classes. On average, a random classifier that guesses class $i$ according to the prior probabilities of the test set classifies $p_i\cdot n_i = \frac{n_i}{N} \cdot n_i$ examples of class $i$ correctly, where $n_i$ is the number of records of class $i$ in the test set. Thus

$\mbox{acc}_r = \frac{p_1 \cdot n_1 + \dots + p_c \cdot n_c}{n_1 + \dots + n_c} = \frac{p_1\cdot n_1}{N} + \dots + \frac{p_c\cdot n_c}{N} = \sum_{i=1}^{c} p_i^2$

You might have a look at a question of mine.

The gain is:

$\mbox{gain} = \frac{\mbox{acc}}{\mbox{acc}_r}$

I actually think a statistical test can be sketched. The numerator could be seen as a Normal random variable, $\mathcal{N}(\mbox{acc},p(1-p)/N)$, but you should figure out what kind of random variable the denominator $\mbox{acc}_r$ could be.
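A small Python sketch of the two quantities defined above, $\mbox{acc}_r$ and the gain. The function names and the example data (a test set with 70 records of one class and 30 of another, and a classifier accuracy of 0.8) are hypothetical.

```python
from collections import Counter

def random_accuracy(labels):
    """acc_r = sum_i p_i^2, with p_i the empirical class frequency."""
    n = len(labels)
    return sum((count / n) ** 2 for count in Counter(labels).values())

def gain(acc, labels):
    """Ratio of the observed accuracy to the random-classifier accuracy."""
    return acc / random_accuracy(labels)

# Hypothetical test set: 70 records of class "a", 30 of class "b".
test_labels = ["a"] * 70 + ["b"] * 30
print("acc_r:", random_accuracy(test_labels))  # 0.7^2 + 0.3^2
print("gain :", gain(0.8, test_labels))
```

A gain above 1 means the classifier does better than guessing according to the class priors; how far above 1 it must be to count as significant is the open question raised in the last paragraph.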

— Simone

3

Again I'm not convinced that a statistical test against absolutely no predictive value is of value.

— Frank Harrell

2

Classifiers that just barely beat random guessing can be extremely useful in some situations. Thus, having some test that quantifies confidence in a classifier being better than chance is also useful.

— ely

1

In the vast majority of situations we want to know how well a prediction discriminates, not just whether it discriminates better than random chance.

— Frank Harrell

Not if you are boosting a bunch of weak classifiers, which is a very common activity. You may care about discrimination once you reach the fully boosted final classifier, but there's a lot of work between the start and the finish, and demonstrating that a complicated classifier empirically performs better than chance is important.

— ely

1

And in some application domains, say financial markets, where you get to use the classifier in many, many roughly independent cases, just being a bit better than chance (R-squared values of around 11% or 12% are considered great) can mean a lot. In those cases, if even the boosted classifier has an R-squared of 15%, that might be considered very good, in which case it really matters whether you can statistically resolve that the weak classifiers are definitely better than guessing.

— ely

1

You may be interested in the following papers:

Eric W. Noreen, Computer-intensive Methods for Testing Hypotheses: An Introduction, John Wiley & Sons, New York, NY, USA, 1989.
Alexander Yeh, More accurate tests for the statistical significance of result differences, in: Proceedings of the 18th International Conference on Computational Linguistics, Volume 2, pages 947-953, 2000.

I think they cover what Dimitrios Athanasakis talks about.

I implemented one option of Yeh in the manner that I understand it:

http://www.clips.uantwerpen.be/~vincent/software#art

— vvasch

0

I think that one thing you could try is a permutation test. Simply put, randomly permute the input/desired-output pairs you feed to your classifier a number of times. If it fails to reproduce anything at the same level over 100 different permutations, then the result is significant at roughly the 99% level, and so on. This is basically the same process used to obtain p-values (which correspond to the probability of obtaining a linear correlation of the same magnitude after randomly permuting the data).
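A minimal sketch of such a permutation test in Python follows. It makes one simplifying assumption that is not in the answer: instead of retraining the classifier on each permuted data set, it keeps the predictions fixed and shuffles only the true labels of the test set. The function names, the number of permutations and the seed are all hypothetical choices.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of positions where the label and prediction agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_pvalue(y_true, y_pred, n_perm=1000, seed=0):
    """Estimate how often shuffled labels score at least as well.

    Simplification: predictions stay fixed and only the labels are
    permuted; a fuller version would refit the classifier each time.
    """
    rng = random.Random(seed)
    observed = accuracy(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if accuracy(shuffled, y_pred) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one permutation.
    return observed, (hits + 1) / (n_perm + 1)

# Made-up data: 40 test samples, predictions that match every label.
labels = [0, 1] * 20
obs, p = permutation_pvalue(labels, list(labels))
print(f"observed accuracy {obs:.2f}, permutation p-value {p:.4f}")
```

With the +1 correction, the smallest p-value you can report is 1 / (n_perm + 1), so 100 permutations cannot support a claim stronger than about the 1% level, in line with the answer's rule of thumb.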

— Dimitrios Athanasakis

Could you elaborate on what you mean by input/desired-output pairs?

— Simone