Why does lowering the SGD learning rate dramatically improve accuracy?



In papers such as this one, I often see this kind of training curve:

[Figure: training and validation error curves, with a sharp drop each time the learning rate is reduced]

In this case, SGD was used with a momentum factor of 0.9, and the learning rate was divided by 10 every 30 epochs.
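For reference, a schedule like that could be written roughly as in the following sketch. This is only an illustration of the described setup: the stand-in model, the elided training pass, and the initial learning rate of 0.1 are my assumptions, not taken from the paper.

```python
import torch

# Minimal sketch of SGD with momentum 0.9 and a step-decay schedule.
# `model` is a stand-in and the initial lr of 0.1 is an assumption.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Divide the learning rate by 10 every 30 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one pass over the training data would go here ...
    scheduler.step()                      # apply the step decay once per epoch
    print(epoch, scheduler.get_last_lr())  # lr drops at epochs 30 and 60
```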

  • Why does changing the learning rate cause such a large drop in error?
  • Why does the validation error start to increase after the first drop, while the training error keeps decreasing?
  • Would the same results be obtained if the second and later learning-rate reductions were brought closer together? In other words, why is there a delay before the error can decrease further?

Answers:



With a higher learning rate, you take bigger steps towards the solution. However, when you are close, you might jump over the solution, and then on the next step jump over it again, causing an oscillation around the solution. If you then lower the learning rate appropriately, the oscillation stops and you continue towards the solution, until you start oscillating again at the new scale. Keep in mind that a larger learning rate can jump over smaller local minima and so help you find a better minimum, one it cannot jump over. Also, it is generally the training error that keeps improving, while the validation error gets worse as you start to overfit the training data.
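To make the oscillation picture concrete, here is a minimal toy sketch of my own (not from the answer): it minimizes f(x) = x² with noisy gradients, mimicking SGD. The noise keeps the iterate bouncing around the minimum in a neighborhood whose size scales with the learning rate, so dividing the rate by 10 makes the loss drop sharply, much like the curves above.

```python
import numpy as np

rng = np.random.default_rng(0)

x = 5.0        # starting point (arbitrary)
lr = 0.1       # initial learning rate (assumed value)
history = []

for step in range(200):
    if step == 100:
        lr /= 10                                # decay the learning rate
    grad = 2 * x + rng.normal(scale=1.0)        # true gradient + noise
    x -= lr * grad
    history.append(x ** 2)

# Loss plateaus at a level set by the learning rate and the gradient noise,
# then drops to a lower plateau after the decay.
print("mean loss, steps 50-100 :", np.mean(history[50:100]))
print("mean loss, steps 150-200:", np.mean(history[150:200]))
```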



Because the smaller learning rate allows the optimizer to escape saddle points, which is what happens at each cliff, instead of overshooting them. The validation error oscillated while approaching the second saddle point. The noise makes it difficult to say that it increased with statistical significance, but if it did, it could be due to overfitting. I do not know of any result concerning the separation between saddle points, so the delay could be arbitrary. At some point you reach the bottom, of course.


Sorry, do you mean that a larger learning rate allows escaping saddle points? Is that also what @Carl talks about in the other answer?
HelloWorld

No, smaller. Same subject. Imagine that the manifold connecting one local minimum to another passes through a narrow hole. You are unlikely to go through it if you take big steps.
Emre
Licensed under cc by-sa 3.0 with attribution required.