# What is the time complexity of training a neural network using backpropagation?


Suppose an NN contains $n$ hidden layers, $m$ training examples, $x$ features, and $n_i$ nodes in each layer. What is the time complexity of training this NN using back-propagation?


### Time complexity of matrix multiplication

The time complexity of the matrix multiplication $M_{ij} * M_{jk}$ is simply $\mathcal{O}(i*j*k)$.

Note that we are assuming the simplest multiplication algorithm here: there exist some other algorithms with somewhat better time complexity.
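The cubic cost is easiest to see in the naive triple-loop algorithm, a minimal sketch (function name is my own):

```python
# Naive triple-loop matrix multiplication: multiplying an (i x j)
# matrix by a (j x k) matrix performs exactly i*j*k scalar
# multiplications, hence the O(i*j*k) complexity above.
def matmul_naive(A, B):
    i, j = len(A), len(A[0])
    j2, k = len(B), len(B[0])
    assert j == j2, "inner dimensions must match"
    C = [[0.0] * k for _ in range(i)]
    for r in range(i):          # i iterations
        for c in range(k):      # k iterations
            for m in range(j):  # j iterations -> i*j*k total
                C[r][c] += A[r][m] * B[m][c]
    return C
```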

### Feedforward pass algorithm

The feedforward propagation algorithm is as follows.

First, to go from layer $i$ to $j$, you do



$S_j = W_{ji} * Z_i$



$Z_j = f(S_j)$
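The two equations above can be sketched in NumPy for a single example (the function name and the choice of $\tanh$ as activation are my own assumptions):

```python
import numpy as np

# One feedforward step: S_j = W_ji * Z_i (matrix product),
# then Z_j = f(S_j) (element-wise activation).
def forward_layer(W_ji, Z_i, f=np.tanh):
    S_j = W_ji @ Z_i   # O(j*i) multiplications for one example
    return f(S_j)      # O(j) element-wise operations
```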

### 例

Since we have $4$ layers, we need $3$ matrices to represent the weights between these layers. Let us denote them by $W_{ji}$, $W_{kj}$, and $W_{lk}$, where $W_{ji}$ is a matrix with $j$ rows and $i$ columns ($W_{ji}$ thus contains the weights going from layer $i$ to layer $j$).

Assume you have $t$ training examples. To propagate from layer $i$ to $j$, we first have



$S_{jt} = W_{ji} * Z_{it}$

This operation (i.e., the matrix multiplication) has $\mathcal{O}(j*i*t)$ time complexity. Then we apply the activation function



$Z_{jt} = f(S_{jt})$

Since this is an element-wise operation, it has $\mathcal{O}(j*t)$ time complexity.
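The batched propagation just described can be sketched as follows: $Z_{it}$ stacks the $t$ training examples as columns, so the matrix product costs $\mathcal{O}(j*i*t)$ and the activation $\mathcal{O}(j*t)$ (the sizes below are toy values of my own choosing):

```python
import numpy as np

j, i, t = 4, 3, 5            # toy layer sizes and batch size (assumed)
W_ji = np.random.randn(j, i)
Z_it = np.random.randn(i, t)  # t examples stacked as columns

S_jt = W_ji @ Z_it            # matrix product: O(j*i*t)
Z_jt = np.tanh(S_jt)          # element-wise activation: O(j*t)
```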

So, in total, we have



$\mathcal{O}(j*i*t + j*t) = \mathcal{O}(j*t*(i + 1)) = \mathcal{O}(j*i*t)$

In total, the time complexity for feedforward propagation will be



$\mathcal{O}(j*i*t + k*j*t + l*k*t) = \mathcal{O}(t*(ij + jk + kl))$

This sum cannot be simplified into a single product: $\mathcal{O}(t*i*j*k*l)$ would be a valid but much looser upper bound, so we keep $\mathcal{O}(t*(ij + jk + kl))$.
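The full forward-pass count can be checked with a small helper (function name is my own):

```python
# Count the scalar multiplications of one full forward pass for a
# 4-layer net with layer sizes i, j, k, l over t examples,
# reproducing the t*(i*j + j*k + k*l) total derived above.
def forward_mults(i, j, k, l, t):
    return t * (i * j + j * k + k * l)
```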

### Back-propagation algorithm



$E_{lt} = f'(S_{lt}) \odot (Z_{lt} - O_{lt})$

where $\odot$ means element-wise multiplication. Note that $E_{lt}$ has $l$ rows and $t$ columns, meaning each column is the error signal for one training example.

We then compute the "delta weights", $D_{lk} \in \mathbb{R}^{l \times k}$ (between layer $l$ and layer $k$)



$D_{lk} = E_{lt} * Z_{tk}$

where $Z_{tk}$ is the transpose of $Z_{kt}$.



$W_{lk} = W_{lk} - D_{lk}$

For $l \to k$, we thus have the time complexity $\mathcal{O}(lt + lt + ltk + lk) = \mathcal{O}(l*t*k)$.
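The output-layer step can be sketched in NumPy (toy sizes, $\tanh$ activation, and all variable names besides those in the equations are my own assumptions):

```python
import numpy as np

l, k, t = 3, 4, 5                        # toy layer sizes (assumed)
S_lt = np.random.randn(l, t)             # pre-activations of the output layer
Z_lt = np.tanh(S_lt)                     # network outputs
O_lt = np.random.randn(l, t)             # targets
Z_kt = np.random.randn(k, t)             # previous layer's activations
W_lk = np.random.randn(l, k)

f_prime = lambda s: 1.0 - np.tanh(s)**2  # derivative of tanh (assumed activation)
E_lt = f_prime(S_lt) * (Z_lt - O_lt)     # element-wise: O(l*t)
D_lk = E_lt @ Z_kt.T                     # matrix product: O(l*t*k)
W_lk = W_lk - D_lk                       # element-wise update: O(l*k)
```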

Now, going back from $k \to j$, we first have



$E_{kt} = f'(S_{kt}) \odot (W_{kl} * E_{lt})$

Then



$D_{kj} = E_{kt} * Z_{tj}$

And then



$W_{kj} = W_{kj} - D_{kj}$

where $W_{kl}$ is the transpose of $W_{lk}$. For $k \to j$, we have the time complexity $\mathcal{O}(kt + klt + ktj + kj) = \mathcal{O}(k*t*(l+j))$.
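The hidden-layer step just described can be sketched the same way: the error is propagated back through the transpose of $W_{lk}$ before forming the delta weights (toy sizes and names assumed):

```python
import numpy as np

l, k, j, t = 3, 4, 5, 6                  # toy layer sizes (assumed)
E_lt = np.random.randn(l, t)             # error signal from the layer above
W_lk = np.random.randn(l, k)
S_kt = np.random.randn(k, t)             # this layer's pre-activations
Z_jt = np.random.randn(j, t)             # previous layer's activations
W_kj = np.random.randn(k, j)

f_prime = lambda s: 1.0 - np.tanh(s)**2  # tanh derivative (assumed)
E_kt = f_prime(S_kt) * (W_lk.T @ E_lt)   # product O(k*l*t), then element-wise O(k*t)
D_kj = E_kt @ Z_jt.T                     # matrix product: O(k*t*j)
W_kj = W_kj - D_kj                       # element-wise update: O(k*j)
```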

And finally, for $j \to i$, we have $\mathcal{O}(j*t*(k+i))$. In total, we have



$\mathcal{O}(ltk + tk(l+j) + tj(k+i)) = \mathcal{O}(t*(lk + kj + ji))$

which is the same as for the feedforward pass algorithm. Since they are the same, the total time complexity for one epoch will be

$\mathcal{O}(t*(ij + jk + kl)).$

This time complexity is then multiplied by the number of iterations (epochs). So, we have

$\mathcal{O}(n*t*(ij + jk + kl)),$
where $n$ is the number of iterations.
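Putting the per-epoch cost and the epoch count together, a minimal sketch (function name and the explicit factor of 2 for forward plus backward are my own):

```python
# Total scalar-multiplication count for training: n epochs over t
# examples, layer sizes i, j, k, l. Forward and backward passes have
# the same order of cost, so the total is O(n*t*(i*j + j*k + k*l));
# the factor 2 below is the constant hidden by the big-O.
def training_mults(n, t, i, j, k, l):
    per_pass = t * (i * j + j * k + k * l)
    return n * 2 * per_pass
```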

### Notes

Note that these matrix operations can be greatly parallelized on GPUs.

### Conclusion

We tried to find the time complexity of training a neural network that has 4 layers with respectively $i$, $j$, $k$ and $l$ nodes, with $t$ training examples and $n$ epochs. The result was $\mathcal{O}(nt*(ij + jk + kl))$.

We assumed the simplest form of matrix multiplication, which has cubic time complexity. We used the batch gradient descent algorithm. The results for stochastic and mini-batch gradient descent should be the same. (Let me know if you think otherwise: batch gradient descent is the general form; with little modification, it becomes stochastic or mini-batch.)

Also, if you use momentum optimization, you will have the same time complexity, because the extra matrix operations required are all element-wise, and hence they will not affect the time complexity of the algorithm.

I'm not sure what the results would be using other optimizers such as RMSprop.

### Sources

The following article, http://briandolhansky.com/blog/2014/10/30/artificial-neural-networks-matrix-form-part-5, describes an implementation using matrices. Although this implementation uses "row major" ordering, the time complexity is not affected by this.

http://briandolhansky.com/blog/2013/9/27/artificial-neural-networks-backpropagation-part-4

Your answer is great. I could not find any ambiguity so far, but you forgot the number-of-iterations part, just add it... and if no one answers in 5 days I'll surely accept your answer
DuttaA

@DuttaA I tried to put everything I knew. It may not be 100% correct, so feel free to leave this unaccepted :) I'm also waiting for other answers to see what points I missed.
M.kazem Akhgary


For the evaluation of a single pattern, you need to process all weights and all neurons. Given that every neuron has at least one weight, we can ignore the neurons and have $\mathcal{O}(w)$, where $w$ is the number of weights, i.e., $n * n_i$, assuming full connectivity between your layers.
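Counting the weights of a fully connected net is straightforward, a sketch (function name is my own):

```python
# For a fully connected net, the number of weights w is the sum of
# products of adjacent layer sizes; w dominates the O(w) per-pattern
# cost claimed above, and the total training cost is O(w*m*e).
def count_weights(layer_sizes):
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
```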

The back-propagation has the same complexity as the forward evaluation (just look at the formula).

So, the complexity of learning $m$ examples, where each gets repeated $e$ times, is $\mathcal{O}(w*m*e)$.

The bad news is that there's no formula telling you what number of epochs $e$ you need.

From the above answer, don't you think it depends on more factors?
DuttaA

@DuttaA No. There's a constant amount of work per weight, which gets repeated $e$ times for each of $m$ examples. I didn't bother to compute the number of weights; I guess that's the difference.
maaartinus
