Perceptron:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta_t \left(y^{(i)} - \hat{y}^{(i)}\right)\mathbf{x}^{(i)}, \qquad \hat{y}^{(i)} = \operatorname{sign}\left(\mathbf{w}^\top \mathbf{x}^{(i)}\right),$$

where $\left(\mathbf{x}^{(i)}, y^{(i)}\right)$ is the $i$-th training example.
This can be viewed as stochastic subgradient descent on the following "perceptron loss" function*:
Perceptron loss:

$$L_{\mathbf{w}}\left(y^{(i)}\right) = \max\left(0,\; -y^{(i)}\, \mathbf{w}^\top \mathbf{x}^{(i)}\right)$$

$$\partial L_{\mathbf{w}}\left(y^{(i)}\right) =
\begin{cases}
\{\mathbf{0}\} & \text{if } y^{(i)}\, \mathbf{w}^\top \mathbf{x}^{(i)} > 0 \\
\{-y^{(i)} \mathbf{x}^{(i)}\} & \text{if } y^{(i)}\, \mathbf{w}^\top \mathbf{x}^{(i)} < 0 \\
[-1, 0] \times y^{(i)} \mathbf{x}^{(i)} & \text{if } \mathbf{w}^\top \mathbf{x}^{(i)} = 0.
\end{cases}$$
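To make the correspondence concrete, here is a minimal NumPy sketch (the function and variable names are mine, not from your code): on a misclassified example both updates move $\mathbf{w}$ in the direction $y^{(i)}\mathbf{x}^{(i)}$; the perceptron's step is twice as large because $y^{(i)} - \hat{y}^{(i)} = 2y^{(i)}$, and that factor of 2 can be absorbed into $\eta_t$.

```python
import numpy as np

def perceptron_update(w, x, y, eta):
    """Classic perceptron step on one example (x, y), with y in {-1, +1}."""
    y_hat = 1.0 if w @ x >= 0 else -1.0   # one tie-breaking convention (see the footnote below)
    return w + eta * (y - y_hat) * x      # (y - y_hat) is 0 if correct, 2*y if wrong

def perceptron_loss_subgradient_update(w, x, y, eta):
    """Stochastic subgradient step on L_w(y) = max(0, -y * w.x)."""
    if y * (w @ x) < 0:                   # misclassified: the subgradient is -y * x
        return w - eta * (-y * x)
    return w                              # correct: 0 is a valid (sub)gradient, no step
```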
Since the perceptron already is a form of SGD, I'm not sure why the SGD update should be different from the perceptron update. The way you've written the SGD step, with non-thresholded values, you suffer a loss if you predict an answer "too correctly". That's bad.
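(Concretely, assuming your update uses the raw score, something like $\mathbf{w} \leftarrow \mathbf{w} + \eta_t\left(y^{(i)} - \mathbf{w}^\top\mathbf{x}^{(i)}\right)\mathbf{x}^{(i)}$, which I'm only guessing at: take $y^{(i)} = +1$ and $\mathbf{w}^\top\mathbf{x}^{(i)} = 5$. The prediction is correct with a big margin, yet the error term is $1 - 5 = -4$, so the update pushes $\mathbf{w}$ away from $\mathbf{x}^{(i)}$.)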
Your batch gradient step is wrong because you're using "+=" when you should be using "=": as written, the current weight vector gets added back once for every training instance. In other words, the way you've written it,
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \sum_{i=1}^{n}\left\{\mathbf{w}^{(t)} - \eta_t\, \partial L_{\mathbf{w}^{(t)}}\!\left(y^{(i)}\right)\right\}.$$
What it should be is:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \sum_{i=1}^{n} \partial L_{\mathbf{w}^{(t)}}\!\left(y^{(i)}\right).$$
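A quick way to see the difference in code (again a sketch with made-up names: `X` stacks the examples as rows, `Y` holds the $\pm 1$ labels):

```python
import numpy as np

def batch_subgradient(w, X, Y):
    """Sum of perceptron-loss subgradients over the whole batch.
    X: (n, d) matrix of examples, Y: (n,) labels in {-1, +1}."""
    margins = Y * (X @ w)              # y_i * w.x_i for every example
    mis = margins < 0                  # misclassified examples (ties skipped; see footnote)
    return -(X[mis].T @ Y[mis])        # sum of -y_i * x_i over the misclassified ones

# Wrong: the "+=" inside the loop re-adds the current weights n times:
#   for i in range(n):
#       w_new += w - eta * subgradient_i
#
# Right: accumulate only the subgradients, then take a single step.
def batch_step(w, X, Y, eta):
    return w - eta * batch_subgradient(w, X, Y)
```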
Also, in order for the algorithm to converge on any data set, you should decrease your learning rate on a schedule, like $\eta_t = \eta_0/\sqrt{t}$.
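Continuing the sketch above (`eta0`, `num_epochs`, `X`, `Y`, and the initial `w` are placeholder names, and this schedule is just one standard choice):

```python
eta0 = 1.0                              # initial rate; a tuning knob, not a prescribed value
for t in range(1, num_epochs + 1):
    eta_t = eta0 / np.sqrt(t)           # eta_t = eta_0 / sqrt(t)
    w = batch_step(w, X, Y, eta_t)      # one full-batch subgradient step per iteration
```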
* The perceptron algorithm is not exactly the same as SSGD on the perceptron loss. Usually in SSGD, in the case of a tie ($\mathbf{w}^\top \mathbf{x}^{(i)} = 0$), $\partial L = [-1, 0] \times y^{(i)} \mathbf{x}^{(i)}$, so $\mathbf{0} \in \partial L$, so you would be allowed to not take a step. Accordingly, perceptron loss can be minimized at $\mathbf{w} = \mathbf{0}$, which is useless. But in the perceptron algorithm, you are required to break ties, and use the subgradient direction $-y^{(i)} \mathbf{x}^{(i)} \in \partial L$ if you choose the wrong answer.
So they're not exactly the same, but if you work from the assumption that the perceptron algorithm is SGD for some loss function, and reverse engineer the loss function, perceptron loss is what you end up with.
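In code, the footnote boils down to the test in the update condition. A valid SSGD step may use a strict inequality and skip ties, whereas a common perceptron implementation treats a tie as a mistake (again a sketch, one common convention):

```python
def ssgd_step(w, x, y, eta):
    # Strict "<": at w.x == 0 the zero vector is a legal subgradient, so skipping is allowed.
    return w + eta * y * x if y * (w @ x) < 0 else w

def perceptron_step(w, x, y, eta):
    # "<=": ties count as mistakes, so a step is forced at w.x == 0.
    return w + eta * y * x if y * (w @ x) <= 0 else w
```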