Deep Neural Network - Backpropagation with ReLU

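I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure whether I am on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the real value and $\hat y$ is the predicted value. Also assume that $x$ > 0 always.

1 Layer ReLU, where the weight at the 1st layer is $w_1$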

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 Layer ReLU, where the weight at the 1st layer is $w_2$ and the weight at the 2nd layer is $w_1$, and I wanted to update the 1st layer $w_2$

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 Layer ReLU, where the weight at the 1st layer is $w_3$, the 2nd layer $w_2$ and the 3rd layer $w_1$

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

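$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

Here the chain rule only involves 2 derivatives, whereas with a sigmoid it could be as long as the number of layers $n$.

Say I wanted to update all 3 layer weights, where $w_1$ is the 3rd layer, $w_2$ is the 2nd layer, and $w_3$ is the 1st layer: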

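$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? With sigmoid the equation contains many multiplications by 0.25, whereas ReLU has no constant-value multiplication. If there are thousands of layers, there would be a lot of multiplication due to weights, so wouldn't this cause a vanishing or exploding gradient?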

neural-network backpropagation

— user1157751

@NeilSlater Thanks for your reply! Could you elaborate?

— user1157751

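Ah, I think I know what you mean. The reason I raised this question is that I am not sure whether the derivation is correct. I searched around and could not find an example of ReLU backpropagation derived completely from scratch.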

— user1157751

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

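The derivative is the unit step function. This ignores a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula the derivative at 0 is 1, but you could equally treat it as 0 or 0.5 with no real impact on neural network performance.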
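As a minimal sketch of these working definitions in plain Python (the function names `relu` and `relu_grad` and the `at_zero` parameter are illustrative choices, not part of the original answer), the value used at $x=0$ can be made an explicit argument:

```python
def relu(x: float) -> float:
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return x if x >= 0 else 0.0


def relu_grad(x: float, at_zero: float = 1.0) -> float:
    """Unit-step derivative of ReLU.

    `at_zero` is the (arbitrary) value used at x == 0, where the
    derivative is not strictly defined; 1.0 matches the formula above.
    """
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), relu_grad(x))
```

Passing `at_zero=0.0` or `at_zero=0.5` gives the alternative conventions mentioned above.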

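Simplified network

With those definitions, let's take a look at your example networks.

You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have defined $R$ as the output of the artificial neuron, but you have not defined an input value. I'll add that for completeness: call it $z$, add some indexing by layer, and, since I prefer lower case for vectors and upper case for matrices, write $r^{(1)}$ for the output of the first layer, $z^{(1)}$ for its input, and $W^{(0)}$ for the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix; the reason will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are: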

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function w.r.t. the estimate for a single example is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$

Using the chain rule to back-propagate to the pre-transform ($z$) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

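This $\frac{\partial C}{\partial z^{(1)}}$ is an interim stage that links the backpropagation steps together.

To get the gradient with respect to the weight $W^{(0)}$, apply the chain rule once more: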

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

because $z^{(1)} = W^{(0)}x$, and therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$.

That is the full solution for your simplest network.
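As a hedged sanity check of the single-neuron result, the sketch below compares the analytic gradient $(ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$ against a central finite difference. The values of `w`, `x`, `y` and the helper names are illustrative assumptions, not taken from the question.

```python
def relu(z: float) -> float:
    return max(z, 0.0)


def step(z: float) -> float:
    # Derivative of ReLU, using the convention Step(0) = 1.
    return 1.0 if z >= 0 else 0.0


def cost(w: float, x: float, y: float) -> float:
    # C = 1/2 (y - y_hat)^2 with y_hat = ReLU(w * x)
    return 0.5 * (y - relu(w * x)) ** 2


def grad_w(w: float, x: float, y: float) -> float:
    # dC/dW = (ReLU(z) - y) * Step(z) * x, with z = w * x
    z = w * x
    return (relu(z) - y) * step(z) * x


# Illustrative values only (x > 0, as assumed in the question).
w, x, y = 0.8, 2.0, 1.0
eps = 1e-6
numeric = (cost(w + eps, x, y) - cost(w - eps, x, y)) / (2 * eps)
analytic = grad_w(w, x, y)
print(analytic, numeric)
```

Note that the sign comes out as $(\hat{y} - y)$, not $(y - \hat{y})$, and that the step factor is part of the result.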

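However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

If we add in more generic terms, then we can work with two arbitrary layers. Call them Layer $(k)$ indexed by $i$, and Layer $(k+1)$ indexed by $j$. The weights are now a matrix. So our feed-forward equations look like this: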

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

In the output layer, the initial gradient w.r.t. $r^{output}_j$ is still $r^{output}_j - y_j$. However, ignore that for now, and look at the generic way to back propagate, assuming we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$; just note that this is ultimately where the output cost function gradients enter. Then there are 3 equations we can write out following the chain rule:

First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

And we need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for whatever the derivative function is of your current activation function - this is the only place where it affects the calculations.
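The three equations above can be sketched for one pair of layers in plain Python. This is an illustration under stated assumptions: the function name `layer_backward`, the list-of-lists layout `W[i][j]`, and the example numbers are all hypothetical, chosen only to mirror the index convention used in the equations.

```python
def step(z: float) -> float:
    # Derivative of ReLU, using the convention Step(0) = 1.
    return 1.0 if z >= 0 else 0.0


def layer_backward(dC_dr_next, z_next, r_prev, W):
    """Back-propagate through one ReLU layer.

    W[i][j] connects neuron i of layer k to neuron j of layer k+1.
    Returns (dC/dr for layer k, dC/dW for the weights between the layers).
    """
    n_prev, n_next = len(r_prev), len(z_next)
    # Equation 1: gradient at the pre-transform inputs of layer k+1.
    dC_dz = [dC_dr_next[j] * step(z_next[j]) for j in range(n_next)]
    # Equation 2: gradient w.r.t. the outputs of layer k (sum over j).
    dC_dr_prev = [
        sum(dC_dz[j] * W[i][j] for j in range(n_next)) for i in range(n_prev)
    ]
    # Equation 3: gradient w.r.t. each weight W[i][j].
    dC_dW = [[dC_dz[j] * r_prev[i] for j in range(n_next)] for i in range(n_prev)]
    return dC_dr_prev, dC_dW


# Illustrative forward pass: 2 neurons feeding 2 output neurons.
r_prev = [1.0, 2.0]
W = [[0.5, -1.0], [0.25, 0.25]]
z_next = [sum(W[i][j] * r_prev[i] for i in range(2)) for j in range(2)]
r_next = [max(z, 0.0) for z in z_next]
y = [0.0, 1.0]
# Output layer: dC/dr_j = r_j - y_j
dC_dr_next = [r_next[j] - y[j] for j in range(2)]

dC_dr_prev, dC_dW = layer_backward(dC_dr_next, z_next, r_prev, W)
print(dC_dr_prev, dC_dW)
```

Swapping `step` for another activation's derivative is the only change needed to use a different transfer function, as noted above. In this example the second output neuron has a negative pre-transform input, so it contributes no gradient.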

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$, applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$. The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
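To make the per-layer factor concrete, here is a small sketch that multiplies the activation derivative across several layers. It is a simplification: weights are ignored, every layer is assumed to sit at the same pre-transform value, and the helper `activation_factor` is a hypothetical name introduced for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    y = sigmoid(x)
    return y * (1.0 - y)


def relu_grad(z: float) -> float:
    return 1.0 if z >= 0 else 0.0


def activation_factor(grad_fn, z: float, n_layers: int) -> float:
    """Product of activation derivatives over n_layers, all evaluated at z.

    Assumption: weights are left out, so this isolates the activation's
    contribution to the back-propagated gradient.
    """
    return grad_fn(z) ** n_layers


for n in (1, 5, 10, 20):
    best_sigmoid = activation_factor(sigmoid_grad, 0.0, n)  # 0.25 per layer
    active_relu = activation_factor(relu_grad, 1.0, n)      # 1 per layer
    print(n, best_sigmoid, active_relu)
```

Even in the sigmoid's best case the factor shrinks as $0.25^n$, while an active ReLU path contributes a factor of exactly 1 per layer. An inactive ReLU unit contributes 0, which is why this is not a guarantee.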

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes, this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
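The effect of the weights alone can be sketched with a chain of one-neuron ReLU layers, $r_k = ReLU(w_k r_{k-1})$. The function name `chain_input_gradient` and the chosen weights are illustrative assumptions; with positive input and positive weights every unit stays active, so the gradient is just the product of the weights.

```python
def chain_input_gradient(weights, x: float) -> float:
    """d(output)/d(input) for a chain of 1-neuron ReLU layers.

    Each layer computes r_k = ReLU(w_k * r_{k-1}), starting from r_0 = x,
    so the gradient is the product of w_k * Step(z_k) over all layers.
    """
    r = x
    grad = 1.0
    for w in weights:
        z = w * r
        grad *= w * (1.0 if z >= 0 else 0.0)
        r = max(z, 0.0)
    return grad


shrinking = chain_input_gradient([0.5] * 50, 1.0)  # weights below 1
growing = chain_input_gradient([1.5] * 50, 1.0)    # weights above 1
blocked = chain_input_gradient([1.0, -1.0, 1.0], 1.0)  # one inactive unit
print(shrinking, growing, blocked)
```

With 50 layers the gradient is tiny for weights of 0.5 and very large for weights of 1.5, independent of the activation's own derivative, which is the point made above.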

— Neil Slater

Was a chain rule performed on $\frac{dC}{d \hat y}$?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function C is simple enough that you can take its derivative immediately. The only thing I haven't shown there is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$, so don't we need to perform the chain rule in order to take the derivative with respect to $\hat y$? That is, $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$, where $U = y - \hat y$. Apologies for asking really simple questions, my maths ability is probably causing trouble for you :(

— user1157751

If expanding makes things simpler, then please do expand the square.

— user1157751

@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.

— Neil Slater
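
For reference, the chain-rule route raised in the comments can be written out as a short worked equation. With $U = y - \hat y$ the cost is $C = \frac{1}{2}U^2$, so:

$\frac{dC}{d \hat y} = \frac{dC}{dU}\frac{dU}{d \hat y} = U \cdot (-1) = -(y - \hat y) = \hat y - y$

This agrees with the $r^{(1)} - y$ obtained in the answer by expanding the square, since $\hat{y} = r^{(1)}$.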