# What is the time complexity of training a neural network using backpropagation?


Suppose an NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity of training this NN using back-propagation?


### Time complexity of matrix multiplication

The time complexity of the matrix multiplication $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Note that we are assuming the simplest multiplication algorithm here: there exist some other algorithms with somewhat better time complexity.
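The cubic cost is easiest to see in the naive triple-loop algorithm, a minimal sketch (function name is my own):

```python
# Naive triple-loop matrix multiplication: multiplying an (i x j)
# matrix by a (j x k) matrix performs exactly i*j*k scalar
# multiplications, hence the O(i*j*k) complexity above.
def matmul_naive(A, B):
    i, j = len(A), len(A[0])
    j2, k = len(B), len(B[0])
    assert j == j2, "inner dimensions must match"
    C = [[0.0] * k for _ in range(i)]
    for r in range(i):          # i iterations
        for c in range(k):      # k iterations
            for m in range(j):  # j iterations -> i*j*k total
                C[r][c] += A[r][m] * B[m][c]
    return C
```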

### Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to $j$, you do



$S_j = W_{ji} * Z_i$



$Z_j = f(S_j)$
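The two equations above can be sketched in NumPy for a single example (the function name and the choice of $\tanh$ as activation are my own assumptions):

```python
import numpy as np

# One feedforward step: S_j = W_ji * Z_i (matrix product),
# then Z_j = f(S_j) (element-wise activation).
def forward_layer(W_ji, Z_i, f=np.tanh):
    S_j = W_ji @ Z_i   # O(j*i) multiplications for one example
    return f(S_j)      # O(j) element-wise operations
```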

### 例

Since we have $4$ layers, we need $3$ matrices to represent the weights between these layers. Let us denote them by $W_{ji}$, $W_{kj}$, and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. To propagate from layer $i$ to $j$, we first have



$S_{jt} = W_{ji} * Z_{it}$

This operation (i.e., the matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function



$Z_{jt} = f(S_{jt})$

Since this is an element-wise operation, it has $\mathcal{O}(j*t)$ time complexity.
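The batched propagation just described can be sketched as follows: $Z_{it}$ stacks the $t$ training examples as columns, so the matrix product costs $\mathcal{O}(j*i*t)$ and the activation $\mathcal{O}(j*t)$ (the sizes below are toy values of my own choosing):

```python
import numpy as np

j, i, t = 4, 3, 5            # toy layer sizes and batch size (assumed)
W_ji = np.random.randn(j, i)
Z_it = np.random.randn(i, t)  # t examples stacked as columns

S_jt = W_ji @ Z_it            # matrix product: O(j*i*t)
Z_jt = np.tanh(S_jt)          # element-wise activation: O(j*t)
```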

So, in total, we have



$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$

In total, the time complexity for feedforward propagation will be



$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

This sum cannot be simplified into a single product: $\mathcal{O}(t*i*j*k*l)$ would be a valid but much looser upper bound, so we keep $\mathcal{O}(t*(ij + jk + kl))$.
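The full forward-pass count can be checked with a small helper (function name is my own):

```python
# Count the scalar multiplications of one full forward pass for a
# 4-layer net with layer sizes i, j, k, l over t examples,
# reproducing the t*(i*j + j*k + k*l) total derived above.
def forward_mults(i, j, k, l, t):
    return t * (i * j + j * k + k * l)
```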

### Back-propagation algorithm



$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns, meaning each column is the error signal for one training example.

We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)



$D_{lk} = E_{lt} * Z_{tk}$

where $Z_{tk}$ is the transpose of $Z_{kt}$.



$W_{lk} = W_{lk} - D_{lk}$

For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$.
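The output-layer step can be sketched in NumPy (toy sizes, $\tanh$ activation, and all variable names besides those in the equations are my own assumptions):

```python
import numpy as np

l, k, t = 3, 4, 5                        # toy layer sizes (assumed)
S_lt = np.random.randn(l, t)             # pre-activations of the output layer
Z_lt = np.tanh(S_lt)                     # network outputs
O_lt = np.random.randn(l, t)             # targets
Z_kt = np.random.randn(k, t)             # previous layer's activations
W_lk = np.random.randn(l, k)

f_prime = lambda s: 1.0 - np.tanh(s)**2  # derivative of tanh (assumed activation)
E_lt = f_prime(S_lt) * (Z_lt - O_lt)     # element-wise: O(l*t)
D_lk = E_lt @ Z_kt.T                     # matrix product: O(l*t*k)
W_lk = W_lk - D_lk                       # element-wise update: O(l*k)
```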

Now, going back from $k \to j$, we first have



$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

Then



$D_{kj} = E_{kt} * Z_{tj}$

And then



$W_{kj} = W_{kj} - D_{kj}$

where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t*(l+j))$.
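The hidden-layer step just described can be sketched the same way: the error is propagated back through the transpose of $W_{lk}$ before forming the delta weights (toy sizes and names assumed):

```python
import numpy as np

l, k, j, t = 3, 4, 5, 6                  # toy layer sizes (assumed)
E_lt = np.random.randn(l, t)             # error signal from the layer above
W_lk = np.random.randn(l, k)
S_kt = np.random.randn(k, t)             # this layer's pre-activations
Z_jt = np.random.randn(j, t)             # previous layer's activations
W_kj = np.random.randn(k, j)

f_prime = lambda s: 1.0 - np.tanh(s)**2  # tanh derivative (assumed)
E_kt = f_prime(S_kt) * (W_lk.T @ E_lt)   # product O(k*l*t), then element-wise O(k*t)
D_kj = E_kt @ Z_jt.T                     # matrix product: O(k*t*j)
W_kj = W_kj - D_kj                       # element-wise update: O(k*j)
```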

And finally, for $j \to i$, we have $\mathcal{O}(j*t*(k+i))$. In total, we have



$\mathcal{O}(ltk + tk(l+j) + tj(k+i)) = \mathcal{O}(t*(lk + kj + ji))$

which is the same as for the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be

$\mathcal{O}(t*(ij + jk + kl)).$

This time complexity is then multiplied by the number of iterations (epochs). So, we have

$\mathcal{O}(n*t*(ij + jk + kl)),$
where $n$ is the number of iterations.
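Putting the per-epoch cost and the epoch count together, a minimal sketch (function name and the explicit factor of 2 for forward plus backward are my own):

```python
# Total scalar-multiplication count for training: n epochs over t
# examples, layer sizes i, j, k, l. Forward and backward passes have
# the same order of cost, so the total is O(n*t*(i*j + j*k + k*l));
# the factor 2 below is the constant hidden by the big-O.
def training_mults(n, t, i, j, k, l):
    per_pass = t * (i * j + j * k + k * l)
    return n * 2 * per_pass
```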

### Notes

Note that these matrix operations can be greatly parallelized on GPUs.

### Conclusion

We tried to find the time complexity of training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$.

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise, and hence they will not affect the time complexity of the algorithm.

I'm not sure what the results would be using other optimizers such as RMSprop.

### Sources

The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although this implementation uses "row major" ordering, the time complexity is not affected by this.

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

Your answer is great. I could not find any ambiguity so far, but you forgot the number-of-iterations part, just add it... and if no one answers in 5 days I'll surely accept your answer
DuttaA

@DuttaA I tried to put everything I knew. It may not be 100% correct, so feel free to leave this unaccepted :) I'm also waiting for other answers to see what points I missed.
M.kazem Akhgary


For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons and have $\mathcal{O}(w)$, where $w$ is the number of weights, i.e., $n * n_i$, assuming full connectivity between your layers.
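Counting the weights of a fully connected net is straightforward, a sketch (function name is my own):

```python
# For a fully connected net, the number of weights w is the sum of
# products of adjacent layer sizes; w dominates the O(w) per-pattern
# cost claimed above, and the total training cost is O(w*m*e).
def count_weights(layer_sizes):
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
```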

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity of learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$.

The bad news is that there's no formula telling you what number of epochs $e$ you need.

From the above answer, don't you think it depends on more factors?
DuttaA

@DuttaA No. There's a constant amount of work per weight, which gets repeated $e$ times for each of $m$ examples. I didn't bother to compute the number of weights; I guess that's the difference.
maaartinus
