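Clarification about the perceptron rule vs. gradient descent vs. stochastic gradient descent implementation

I experimented a bit with different perceptron implementations and want to make sure that I understand the "iterations" correctly.

Rosenblatt's original perceptron rule

As far as I understand, in Rosenblatt's classic perceptron algorithm, the weights are updated simultaneously after every training example via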

$\Delta{w}^{(t+1)} = \Delta{w}^{(t)} + \eta(target - actual)x_i$

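where $\eta$ is the learning rate, and target and actual are both thresholded (-1 or 1). I implemented it as 1 iteration = 1 pass over the training samples, but the weight vector is updated after each training sample.

I calculate the "actual" value as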

$sign ({\pmb{w}^T\pmb{x}}) = sign( w_0 + w_1 x_1 + ... + w_d x_d)$

Stochastic gradient descent

$\Delta{w}^{(t+1)} = \Delta{w}^{(t)} + \eta(target - actual)x_i$

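This is the same as the perceptron rule, except that target and actual are not thresholded but real values. Again, I count an "iteration" as one pass over the training samples.

Both SGD and the classic perceptron rule converge in this linearly separable case. However, I am having trouble with the gradient descent implementation.

Gradient descent

Here, I go over the training samples, sum up the weight changes for one pass over the training samples, and update the weights afterwards, e.g.,

for each training sample: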

$\Delta{w_{new}} \mathrel{{+}{=}} \Delta{w}^{(t)} + \eta(target - actual)x_i$

...

after 1 pass over the training set:

$\Delta{w} \mathrel{{+}{=}} \Delta{w_{new}}$

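I am wondering whether this assumption is correct or whether I am missing something. I tried various (down to infinitely small) learning rates but could never get it to show any sign of convergence, so I am wondering whether I have misunderstood something here.

Thanks, Sebastian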

optimization gradient-descent perceptron

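The $\Delta$ symbols in your update rules are out of place: each rule should update the weights $\pmb{w}$ themselves, not a change in the weights.

Perceptron: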

$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \eta_t (y^{(i)} - \hat{y}^{(i)}) \pmb{x}^{(i)}$

where $\hat{y}^{(i)} = \text{sign} ({\pmb{w}^\top\pmb{x}^{(i)}})$ is the prediction on the $i^{th}$ training example.
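As a minimal sketch of this update (not the asker's actual code), the following pure-Python snippet runs the perceptron rule on a small made-up data set. The helper names `sign`, `dot`, and `perceptron_epoch`, the toy data, and the choice to map a score of exactly 0 to -1 are all assumptions for illustration.

```python
def sign(z):
    # Assumption for this sketch: a score of exactly 0 is mapped to -1.
    return 1 if z > 0 else -1


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def perceptron_epoch(w, X, y, eta=1.0):
    """One pass over the data, updating the weights after every example."""
    w = list(w)
    for xi, yi in zip(X, y):
        y_hat = sign(dot(w, xi))
        step = eta * (yi - y_hat)  # zero when the prediction is correct
        w = [wj + step * xj for wj, xj in zip(w, xi)]
    return w


# Made-up, linearly separable data; the leading 1.0 is the bias input x_0.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w = [0.0, 0.0, 0.0]
for _ in range(10):
    w = perceptron_epoch(w, X, y)

predictions = [sign(dot(w, xi)) for xi in X]
```

Note that the weights themselves are overwritten at every example; nothing is accumulated across examples.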

This can be viewed as a stochastic subgradient descent method on the following "perceptron loss" function*:

Perceptron loss:

$L_{\pmb{w}}(y^{(i)}) = \max(0, -y^{(i)} \pmb{w}^\top\pmb{x}^{(i)})$

$\partial L_{\pmb{w}}(y^{(i)}) = \begin{cases} \{ 0 \}, & \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0 \\ \{ -y^{(i)} \pmb{x}^{(i)} \}, & \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0 \\ [-1, 0] \times y^{(i)} \pmb{x}^{(i)}, & \text{if } \pmb{w}^\top\pmb{x}^{(i)} = 0 \end{cases}$ .
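To make the loss and its subdifferential concrete, here is a small sketch in pure Python. The function names are hypothetical, and in the tie case the code always returns $-y^{(i)} \pmb{x}^{(i)}$, which is only one valid element of the set $[-1, 0] \times y^{(i)} \pmb{x}^{(i)}$ (a simplification chosen for this sketch).

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def perceptron_loss(w, x, y):
    """max(0, -y * w.x): zero whenever the example is on the correct side."""
    return max(0.0, -y * dot(w, x))


def perceptron_subgradient(w, x, y):
    """Return one element of the subdifferential with respect to w."""
    margin = y * dot(w, x)
    if margin > 0:
        return [0.0 for _ in x]
    # margin < 0: the unique subgradient is -y * x.
    # margin == 0: any point in [-1, 0] * y * x is valid; this sketch picks -y * x.
    return [-y * xj for xj in x]


w = [1.0, -1.0]
x = [2.0, 1.0]  # w.x = 1.0

loss_correct = perceptron_loss(w, x, 1)        # label agrees with the sign of w.x
loss_wrong = perceptron_loss(w, x, -1)         # label disagrees
grad_correct = perceptron_subgradient(w, x, 1)
grad_wrong = perceptron_subgradient(w, x, -1)
grad_tie = perceptron_subgradient([0.0, 0.0], x, 1)
```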

Since perceptron already is a form of SGD, I'm not sure why the SGD update should be different than the perceptron update. The way you've written the SGD step, with non-thresholded values, you suffer a loss if you predict an answer too correctly. That's bad.
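A quick numeric illustration of that last point, with made-up numbers: the example below is already classified correctly (positive label, positive score), yet the non-thresholded update shrinks the score toward the target value 1. The values of `w`, `x`, and `eta` are arbitrary assumptions.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


w = [1.0, 1.0]
x = [2.0, 1.0]
y = 1        # target
eta = 0.1    # arbitrary learning rate for this sketch

score = dot(w, x)              # raw, non-thresholded output; here 3.0
step = eta * (y - score)       # negative, although sign(score) == y
w_new = [wj + step * xj for wj, xj in zip(w, x)]
new_score = dot(w_new, x)      # pulled back toward 1 (roughly 2.0 here)
```

With thresholded outputs, `y - sign(score)` would be 0 for this example and the weights would stay put.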

Your batch gradient step is wrong because you're using "+=" when you should be using "=". The current weights are added for each training instance. In other words, the way you've written it,

$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \sum_{i=1}^n \{\pmb{w}^{(t)} - \eta_t \partial L_{\pmb{w}^{(t)}}(y^{(i)}) \}$ .

What it should be is:

$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} - \eta_t \sum_{i=1}^n {\partial L_{\pmb{w}^{(t)}}(y^{(i)}) }$ .
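A minimal sketch of this corrected batch step, assuming the perceptron loss and treating ties as mistakes. The function `batch_step` and the toy data are hypothetical. The key point is that only the subgradients are summed over the pass, all evaluated at the same fixed weights, and the weights are then changed once.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def batch_step(w, X, y, eta):
    """Sum the subgradients at the fixed w, then take a single step."""
    grad_sum = [0.0] * len(w)
    for xi, yi in zip(X, y):
        if yi * dot(w, xi) <= 0:  # misclassified, or a tie counted as a mistake
            # The subgradient for this example is -y * x; add it to the sum.
            grad_sum = [g - yi * xj for g, xj in zip(grad_sum, xi)]
    # The current weights enter exactly once here, not once per example.
    return [wj - eta * g for wj, g in zip(w, grad_sum)]


# Made-up, linearly separable data; the leading 1.0 is the bias input x_0.
X = [[1.0, 2.0, 1.0], [1.0, 1.0, 3.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w = [0.0, 0.0, 0.0]
for _ in range(5):
    w = batch_step(w, X, y, eta=0.5)

all_correct = all(yi * dot(w, xi) > 0 for xi, yi in zip(X, y))
```

On this toy data a single step already separates the classes, so the later steps leave the weights unchanged.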

Also, for the algorithm to converge on any data set, you should decrease your learning rate on a schedule, such as $\eta_t = \frac{\eta_0}{\sqrt{t}}$ .
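Such a schedule could be sketched as follows, assuming the step counter `t` starts at 1; the function name `learning_rate` and the default `eta0` are placeholders.

```python
import math


def learning_rate(t, eta0=1.0):
    """Decaying step size eta_t = eta0 / sqrt(t), for t = 1, 2, 3, ..."""
    if t < 1:
        raise ValueError("t must start at 1")
    return eta0 / math.sqrt(t)


rates = [learning_rate(t) for t in (1, 4, 100)]
```

The value returned for step `t` would then replace the fixed `eta` in whichever update you are running.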

* The perceptron algorithm is not exactly the same as SSGD on the perceptron loss. Usually in SSGD, in the case of a tie ( $\pmb{w}^\top\pmb{x}^{(i)} = 0$ ), $\partial L= [-1, 0] \times y^{(i)} \pmb{x}^{(i)}$ , so $\pmb{0} \in \partial L$ , so you would be allowed to not take a step. Accordingly, perceptron loss can be minimized at $\pmb{w} = \pmb{0}$ , which is useless. But in the perceptron algorithm, you are required to break ties, and use the subgradient direction $-y^{(i)} \pmb{x}^{(i)} \in \partial L$ if you choose the wrong answer.

So they're not exactly the same, but if you work from the assumption that the perceptron algorithm is SGD for some loss function, and reverse engineer the loss function, perceptron loss is what you end up with.

— Sam Thomson

Thank you Sam, and I do apologize for my messy question. I don't know where the deltas come from, but the "+=" was the thing that went wrong. I completely overlooked that part. Thanks for the thorough answer!