In statistical learning theory, isn't there a problem of overfitting on the test set?

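Let's consider the problem of classifying the MNIST dataset.

According to Yann LeCun's MNIST webpage, Ciresan et al. obtained a 0.23% error rate on the MNIST test set using a convolutional neural network.

Let's denote the MNIST training set as $D_{train}$, the MNIST test set as $D_{test}$, the final hypothesis they obtained using $D_{train}$ as $h_{1}$, and their error rate on the MNIST test set using $h_{1}$ as $E_{test}(h_{1}) = 0.0023$.

From their viewpoint, since $D_{test}$ is a test set sampled randomly from the input space independently of $h_{1}$, they can claim that the out-of-sample error of their final hypothesis, $E_{out}(h_{1})$, is bounded as follows by Hoeffding's inequality: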

$P[|E_{out}(h_{1}) - E_{test}(h_{1})| < \epsilon] \geq 1 - 2e^{-2\epsilon^{2}N_{test}}$

where $N_{test}=|D_{test}|$.

In other words, with probability at least $1-\delta$,

$E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}\ln{2\over\delta}}$
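As a rough worked example of this bound, assume $\delta = 0.05$ (a value picked here purely for illustration; it is not part of the question) and $N_{test} = 10000$, the size of the MNIST test set:

$\sqrt{\frac{1}{2 \cdot 10000}\ln\frac{2}{0.05}} = \sqrt{\frac{\ln 40}{20000}} \approx 0.0136$

$E_{out}(h_1) \leq 0.0023 + 0.0136 = 0.0159$

So under these assumptions the bound only guarantees an out-of-sample error below roughly 1.6%, even though the measured test error is 0.23%.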

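Let's consider another viewpoint. Suppose that some person wants to classify the MNIST test set well. So he first looked at Yann LeCun's MNIST webpage and found the following results obtained by other people using 8 different models,

[Image: MNIST classification results]

and picked the model $g$ that performs best on the MNIST test set among the 8 models.

For him, the learning process was picking the hypothesis $g$ that performs best on the test set $D_{test}$ from the hypothesis set $H_{trained}=\{h_1, h_2, .. ,h_8\}$.

Thus, the error on the test set, $E_{test}(g)$, is the "in-sample" error for this learning process, so he can apply the VC bound for finite hypothesis sets as the following inequality.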

$P[|E_{out}(g)-E_{in}(g)|<\epsilon] \geq 1 - 2|H_{trained}|e^{-2\epsilon^{2}N_{test}}$
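The factor $|H_{trained}|$ comes from a union bound over the candidate hypotheses. A sketch of that step, writing $M = |H_{trained}|$ (a shorthand introduced here) and using the fact that $E_{in}$ is $E_{test}$ for this learning process:

$P[|E_{out}(g) - E_{test}(g)| \geq \epsilon] \leq P[\bigcup_{m=1}^{M} \{|E_{out}(h_m) - E_{test}(h_m)| \geq \epsilon\}]$

$\leq \sum_{m=1}^{M} P[|E_{out}(h_m) - E_{test}(h_m)| \geq \epsilon]$

$\leq 2Me^{-2\epsilon^{2}N_{test}}$

The first step holds because $g$ is always one of $h_1, \dots, h_M$, whichever one the test set happens to select; the last step applies Hoeffding's inequality to each fixed $h_m$.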

In other words, with probability at least $1-\delta$,

$E_{out}(g) \leq E_{test}(g) + \sqrt{{1 \over 2N_{test}}\ln{2|H_{trained}|\over\delta}}$
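To get a feel for how much the selection step costs, here is a minimal Python sketch that evaluates the margin term for a single fixed hypothesis and for the best of 8. The function name `hoeffding_margin` and the value $\delta = 0.05$ are assumptions made for illustration; only $N_{test} = 10000$ and the 8 models come from the setting above.

```python
import math


def hoeffding_margin(n_test, delta, num_hypotheses=1):
    """Margin sqrt(ln(2*M/delta) / (2*N)) of the finite-hypothesis bound.

    With num_hypotheses=1 this is the plain Hoeffding margin for one
    hypothesis that was fixed before the test set was looked at.
    """
    return math.sqrt(math.log(2 * num_hypotheses / delta) / (2 * n_test))


N_TEST = 10_000  # size of the MNIST test set
DELTA = 0.05     # assumed confidence parameter, not taken from the question

single = hoeffding_margin(N_TEST, DELTA)
best_of_8 = hoeffding_margin(N_TEST, DELTA, num_hypotheses=8)

print(f"margin for one fixed hypothesis: {single:.4f}")
print(f"margin for the best of 8:        {best_of_8:.4f}")
```

Under these assumptions the margin grows from about 0.0136 to about 0.0170, so choosing among 8 models on the test set loosens the guarantee only slightly, because $|H_{trained}|$ enters through a logarithm.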

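This result implies that there could be overfitting on the test set if we pick the model that performs best among several models.

In this case, the person might pick $h_{1}$, which has the lowest error rate, $E_{test}(h_{1}) = 0.0023$. Since $h_{1}$ is the best hypothesis among the 8 models on this particular test set $D_{test}$, there is some possibility that $h_{1}$ is a hypothesis overfitted to the MNIST test set.

Thus, this person can claim the following inequality.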

$E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}\ln{2|H_{trained}|\over\delta}}$

Consequently, we have two inequalities:

$P[\;E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}\ln{2\over\delta}}\;] \geq 1-\delta$

$P[\;E_{out}(h_1) \leq E_{test}(h_1) + \sqrt{{1 \over 2N_{test}}\ln{2|H_{trained}|\over\delta}}\;] \geq 1-\delta$

However, it is obvious that these two inequalities are incompatible.

Where am I going wrong? Which one is right and which one is wrong?

If the latter is wrong, what is the right way to apply the VC bound for finite hypothesis sets in this case?

— asqdf

Of those two inequalities, I think the latter is wrong. In brief, what's wrong here is the identity $g=h_1$: $g$ is a function of the test data, while $h_1$ is a model that is independent of the test data.

In fact, $g$ is the one of the 8 models in $H_{trained} = \{ h_1, h_2,..., h_8 \}$ that best predicts the test set $D_{test}$.

Therefore, $g$ is a function of $D_{test}$. For a specific test set $D^*_{test}$ (like the one you mentioned), it could happen that $g(D^*_{test}) = h_1$, but in general, depending on the test set, $g(D_{test})$ could take any value in $H_{trained}$. On the other hand, $h_1$ is just one value in $H_{trained}$.
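A small simulation can make this distinction concrete. The sketch below is a toy setup, not MNIST: it assumes 8 hypothetical models that all share the same true error rate and make errors independently of each other (real models are correlated), and it redraws the test set many times. The function name `simulate_selection_bias` and all parameter values are made up for illustration.

```python
import random


def simulate_selection_bias(true_error=0.05, n_test=500, n_models=8,
                            n_trials=500, seed=0):
    """Return (mean test error of fixed h_1, mean test error of selected g).

    Every model has the same true error rate, so any gap between the two
    averages is caused purely by selecting g on the test set.
    """
    rng = random.Random(seed)
    fixed_total = 0.0
    selected_total = 0.0
    for _ in range(n_trials):
        # Fresh test set: each model errs independently on each test point.
        test_errors = [
            sum(rng.random() < true_error for _ in range(n_test)) / n_test
            for _ in range(n_models)
        ]
        fixed_total += test_errors[0]       # h_1 is chosen before seeing D_test
        selected_total += min(test_errors)  # g is chosen after seeing D_test
    return fixed_total / n_trials, selected_total / n_trials


if __name__ == "__main__":
    fixed, selected = simulate_selection_bias()
    print(f"average test error of fixed h_1:  {fixed:.4f}")
    print(f"average test error of selected g: {selected:.4f}")
```

In this toy setting the average test error of the fixed $h_1$ stays close to the true error rate, while the average test error of $g$ comes out lower, even though no model is actually better than the others. That optimism is what the extra $|H_{trained}|$ factor in the bound for $g$ accounts for.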

For the other question:

If the latter is wrong, what is the right way to apply the VC bound for finite hypothesis sets in this case?

Just don't replace $g$ by $h_1$. You will then get the correct bound (for $g$, of course), and it will not conflict with the other bound (which is for $h_1$).

— Tĩnh Trần