Matrix form of backpropagation with batch normalization

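Batch normalization has been credited with substantial performance improvements in deep neural nets. Plenty of material on the internet shows how to implement it on an activation-by-activation basis. I've already implemented backprop using matrix algebra, and given that I'm working in high-level languages (while relying on Rcpp, and eventually GPUs, for dense matrix multiplication), ripping everything out and resorting to for-loops would probably slow my code substantially, in addition to being a huge pain.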

The batch normalization function is

$b(x_p) = \gamma \left(x_p - \mu_{x_p}\right) \sigma^{-1}_{x_p} + \beta$

where

$x_p$ is the $p$th node, before it gets activated
$\gamma$ and $\beta$ are scalar parameters
$\mu_{x_p}$ and $\sigma_{x_p}$ are the mean and SD of $x_p$. (Note that the square root of the variance plus a fudge factor is normally used; let's assume nonzero elements for compactness.)

In matrix form, batch normalization for a whole layer would be

$b(\mathbf{X}) = \left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X} - \mu_{\mathbf{X}}\right) \odot\sigma^{-1}_{\mathbf{X}} + \left(\beta\otimes\mathbf{1}_N\right)$

where

$\mathbf{X}$ is $N\times p$
$\mathbf{1}_N$ is a column vector of ones
$\gamma$ and $\beta$ are now row $p$-vectors of the per-layer normalization parameters
$\mu_{\mathbf{X}}$ and $\sigma_{\mathbf{X}}$ are $N \times p$ matrices, where each column is an $N$-vector of columnwise means and standard deviations
$\otimes$ is the Kronecker product and $\odot$ is the elementwise (Hadamard) product
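As a rough illustration of the matrix form above, here is a minimal R sketch. It assumes the Kronecker products use $\mathbf{1}_N$ and that $\sigma$ is the sample standard deviation (denominator $N-1$, as R's `sd` uses), with no fudge factor. The function names `bn_matrix` and `bn_columnwise` and the parameter values are hypothetical, chosen only for the example.

```r
set.seed(42)
N <- 6; p <- 3
X   <- matrix(rnorm(N * p), nrow = N)
gam <- c(2, 3, 4)    # gamma, one entry per column (hypothetical values)
bet <- c(-1, 0, 1)   # beta, one entry per column (hypothetical values)

# Matrix form: (gamma %x% 1_N) * (X - mu_X) / sigma_X + (beta %x% 1_N)
bn_matrix <- function(X, gam, bet) {
  ones <- matrix(1, nrow = nrow(X), ncol = 1)
  mu   <- ones %*% t(colMeans(X))       # N x p matrix of column means
  sig  <- ones %*% t(apply(X, 2, sd))   # N x p matrix of column SDs
  (t(gam) %x% ones) * (X - mu) / sig + (t(bet) %x% ones)
}

# Column-by-column version for comparison
bn_columnwise <- function(X, gam, bet) {
  out <- X
  for (j in seq_len(ncol(X))) {
    out[, j] <- gam[j] * (X[, j] - mean(X[, j])) / sd(X[, j]) + bet[j]
  }
  out
}

B1 <- bn_matrix(X, gam, bet)
B2 <- bn_columnwise(X, gam, bet)
max(abs(B1 - B2))   # difference between the two versions
```

If the two versions agree, the Kronecker and Hadamard products are doing nothing more than broadcasting $\gamma$, $\beta$, $\mu$ and $\sigma$ down the rows.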

A very simple one-layer neural net with no batch normalization and a continuous outcome is

$y = a\left(\mathbf{X\Gamma}_1\right)\Gamma_2 + \epsilon$

where

$\Gamma_1$ is $p_1 \times p_2$
$\Gamma_2$ is $p_2 \times 1$
$a(.)$ is the activation function

If the loss is $R = N^{-1}\displaystyle\sum\left(y - \hat{y}\right)^2$, then the gradients (dropping the constant factor $N^{-1}$) are

$\begin{array}{lr} \frac{\partial R}{\partial \Gamma_2} = -2\mathbf{V}^T \hat\epsilon\\ \frac{\partial R}{\partial \Gamma_1} = \mathbf{X}^T \left(a'(\mathbf{X}\mathbf{\Gamma}_1) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right) \\ \end{array}$

where

$\mathbf{V} = a\left(\mathbf{X}\Gamma_1\right)$
$\hat{\epsilon} = y-\hat{y}$
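The two gradient expressions can be sanity-checked against finite differences. Matching by shape, the $p_2 \times 1$ expression $-2\mathbf{V}^T\hat\epsilon$ pairs with $\Gamma_2$ and the $p_1 \times p_2$ expression pairs with $\Gamma_1$. The sketch below is only an illustration under stated assumptions: it uses `tanh` as a smooth stand-in for $a(.)$ so that the numerical derivative is not disturbed by a kink, takes the loss as a plain sum of squares (no $N^{-1}$ factor), and the helper `num_grad` is a hypothetical name.

```r
set.seed(1)
N <- 8; p1 <- 4; p2 <- 3
X  <- matrix(rnorm(N * p1), nrow = N)
G1 <- matrix(rnorm(p1 * p2), nrow = p1)
G2 <- matrix(rnorm(p2), ncol = 1)
y  <- matrix(rnorm(N), ncol = 1)

act  <- tanh                        # smooth stand-in activation (assumption)
actp <- function(v) 1 - tanh(v)^2   # its derivative

# Sum-of-squares loss, i.e. R without the constant 1/N factor
loss <- function(G1, G2) sum((y - act(X %*% G1) %*% G2)^2)

V    <- act(X %*% G1)
ehat <- y - V %*% G2
grad_G2 <- -2 * t(V) %*% ehat                                  # p2 x 1
grad_G1 <- t(X) %*% (actp(X %*% G1) * (-2 * ehat %*% t(G2)))   # p1 x p2

# Central finite differences, perturbing one parameter at a time
num_grad <- function(f, M, h = 1e-6) {
  g <- M * 0
  for (k in seq_along(M)) {
    Mp <- M; Mp[k] <- Mp[k] + h
    Mm <- M; Mm[k] <- Mm[k] - h
    g[k] <- (f(Mp) - f(Mm)) / (2 * h)
  }
  g
}
fd_G1 <- num_grad(function(M) loss(M, G2), G1)
fd_G2 <- num_grad(function(M) loss(G1, M), G2)

max(abs(grad_G1 - fd_G1))   # discrepancy for Gamma_1
max(abs(grad_G2 - fd_G2))   # discrepancy for Gamma_2
```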

Under batch normalization, the net becomes

$y = a\left(b\left(\mathbf{X}\Gamma_1\right)\right)\Gamma_2$

or

$y = a\Big(\left(\gamma\otimes\mathbf{1}_N\right)\odot \left(\mathbf{X\Gamma_1} - \mu_{\mathbf{X\Gamma_1}}\right) \odot\sigma^{-1}_{\mathbf{X\Gamma_1}} + \left(\beta\otimes\mathbf{1}_N\right)\Big)\mathbf{\Gamma_2}$

I have no idea how to compute the derivatives of Hadamard and Kronecker products. On the subject of Kronecker products, the literature gets fairly arcane.

Is there a practical way of computing $\partial R/\partial \gamma$, $\partial R/\partial \beta$, and $\partial R/\partial \mathbf{\Gamma_1}$ within the matrix framework? A simple expression, without resorting to node-by-node computation?

Update 1:

I've figured out $\partial R/\partial \beta$, sort of. It is:

$\mathbf{1}_{N}^T \left(a'(b(\mathbf{X}\mathbf{\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T\right)$

Some R code demonstrates that this is equivalent to the looping way to do it. First set up the fake data:

set.seed(1)
library(dplyr)
library(foreach)

# numbers of observations, input variables, and hidden units
N <- 10
p1 <- 7
p2 <- 4
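# activation function (ReLU)
a <- function (v) {
  v[v < 0] <- 0
  v
}
# intended as the ReLU derivative; note that, as written, every entry becomes 1,
# because the negatives zeroed on the first line also satisfy v >= 0
ap <- function (v) {
  v[v < 0] <- 0
  v[v >= 0] <- 1
  v
}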

# parameters
G1 <- matrix(rnorm(p1*p2), nrow = p1)
G2 <- rnorm(p2)
gamma <- 1:p2+1
beta <- (1:p2+1)*-1
# error
u <- rnorm(N)

# matrix batch norm function
b <- function(x, bet = beta, gam = gamma){
  xs <- scale(x)
  gk <- t(matrix(gam)) %x% matrix(rep(1, N))
  bk <- t(matrix(bet)) %x% matrix(rep(1, N))
  gk*xs+bk
}
# activation-wise batch norm function
bi <- function(x, i){
  xs <- scale(x)
  gk <- t(matrix(gamma[i]))
  bk <- t(matrix(beta[i]))
  suppressWarnings(gk*xs[,i]+bk)
}

X <- round(runif(N*p1, -5, 5)) %>% matrix(nrow = N)
# the neural net
y <- a(b(X %*% G1)) %*% G2 + u

Then compute the derivatives:

# drdbeta -- the matrix way
drdb <- matrix(rep(1, N*1), nrow = 1) %*% (-2*u %*% t(G2) * ap(b(X%*%G1)))
drdb
           [,1]      [,2]    [,3]        [,4]
[1,] -0.4460901 0.3899186 1.26758 -0.09589582
# the looping way
foreach(i = 1:4, .combine = c) %do%{
  sum(-2*u*matrix(ap(bi(X[,i, drop = FALSE]%*%G1[i,], i)))*G2[i])
}
[1] -0.44609015  0.38991862  1.26758024 -0.09589582

They match. But I'm still confused, because I don't really know why this works. The MatCalc notes referenced by @Mark L. Stone say that the derivative of $\beta \otimes \mathbf{1}_N$ should be

$\frac{\partial A \otimes B}{\partial A} = \left(I_{nq} \otimes T_{mp}\right)\left(I_n\otimes vec(B) \otimes I_m\right)$

where the subscripts $m$, $n$, $p$, and $q$ are the dimensions of $A$ and $B$. $T$ is the commutation matrix, which is just 1 here because both inputs are vectors. I try this and get a result that doesn't seem helpful:

# playing with the Kronecker derivative rule
A <- t(matrix(beta)) 
B <- matrix(rep(1, N))
diag(rep(1, ncol(A) *ncol(B))) %*% diag(rep(1, ncol(A))) %x% (B) %x% diag(nrow(A))
     [,1] [,2] [,3] [,4]
 [1,]    1    0    0    0
 [2,]    1    0    0    0
 snip
[13,]    0    1    0    0
[14,]    0    1    0    0
snip
[28,]    0    0    1    0
[29,]    0    0    1    0
snip
[39,]    0    0    0    1
[40,]    0    0    0    1

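I'm still stuck on the other derivatives, those for $\gamma$ and $\mathbf{\Gamma_1}$, which are harder because they don't enter additively the way $\beta \otimes \mathbf{1}$ does.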
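One way to read the $\beta$ result in vec form: since $vec(\beta \otimes \mathbf{1}_N) = (I \otimes \mathbf{1}_N)\beta^T$, the Jacobian with respect to $\beta$ is the $Np_2 \times p_2$ matrix $I \otimes \mathbf{1}_N$, which has the same block pattern as the matrix printed above. Its transpose is conformable with the vec of any $N \times p_2$ upstream matrix, and the product collapses to column sums. A minimal sketch, using a randomly generated matrix as a hypothetical stand-in for the upstream term:

```r
set.seed(2)
N <- 10; p2 <- 4
bet    <- c(-2, -3, -4, -5)                        # same values as beta in the post
stub   <- matrix(rnorm(N * p2), nrow = N)          # hypothetical upstream N x p2 matrix
ones_N <- matrix(1, nrow = N, ncol = 1)

# Jacobian of vec(beta %x% 1_N) with respect to beta
J <- diag(p2) %x% ones_N
dim(J)                                             # Np2 x p2

# vec(beta %x% 1_N) is linear in beta with coefficient matrix J
lhs <- as.vector(t(bet) %x% ones_N)
rhs <- as.vector(J %*% bet)
max(abs(lhs - rhs))

# Chain rule in vec form versus the compact expression 1_N^T stub
grad_vec    <- t(J) %*% as.vector(stub)            # p2 x 1
grad_matrix <- t(ones_N) %*% stub                  # 1 x p2
max(abs(as.vector(grad_vec) - as.vector(grad_matrix)))
```

Under this reading the Kronecker rule and the $\mathbf{1}_N^T(\cdot)$ expression are the same computation, written once on vectorized matrices and once on the matrices themselves.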

Update 2

It looks like $\partial R/\partial \Gamma_1$ and $\partial R/\partial \gamma$ will require the vec() operator. For example, $\partial R/\partial \Gamma_1$ involves taking the derivative of $w\odot\mathbf{X\Gamma_1}$ with respect to $\mathbf{\Gamma_1}$, where $w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$.

The obvious guess is $w\odot\mathbf{X}$, but $w$ is not conformable with $\mathbf{X}$. The product rule for the Hadamard product is

$\partial(A \odot B) = \partial A \odot B + A \odot \partial B$

and, from this,

$\frac{\partial vec(w \odot \mathbf{X\Gamma_1})}{\partial vec(\mathbf{\Gamma_1})^T} = vec(\mathbf{X\Gamma_1})I\frac{\partial vec(w)}{\partial vec(\mathbf{\Gamma_1})^T} + vec(w)I\frac{\partial vec(\mathbf{X\Gamma_1})}{\partial vec(\mathbf{\Gamma_1})^T}$
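To make the second term concrete: if $w$ is held fixed, then $vec(w \odot \mathbf{X\Gamma_1}) = \mathrm{diag}(vec(w))\,(I \otimes \mathbf{X})\,vec(\mathbf{\Gamma_1})$, so the Jacobian is $\mathrm{diag}(vec(w))\,(I \otimes \mathbf{X})$. The sketch below checks that identity and the resulting chain-rule product on small random matrices. Holding $w$ fixed is an assumption of the sketch, and `upstream` is a hypothetical name for whatever $N \times p_2$ matrix arrives from the rest of the chain rule.

```r
set.seed(3)
N <- 5; p1 <- 3; p2 <- 2
X  <- matrix(rnorm(N * p1), nrow = N)
G1 <- matrix(rnorm(p1 * p2), nrow = p1)
w  <- matrix(runif(N * p2, 0.5, 2), nrow = N)    # treated as a constant N x p2 matrix
upstream <- matrix(rnorm(N * p2), nrow = N)      # hypothetical dR/d(w * X G1)

# Jacobian of vec(w * (X G1)) with respect to vec(G1), with w held fixed
J <- diag(as.vector(w)) %*% (diag(p2) %x% X)     # Np2 x p1p2

# The map is linear in vec(G1), so J reproduces it exactly
lhs <- as.vector(w * (X %*% G1))
rhs <- as.vector(J %*% as.vector(G1))
max(abs(lhs - rhs))

# Chain rule: t(J) vec(upstream), reshaped to p1 x p2, versus t(X) (upstream * w)
grad_vec <- matrix(t(J) %*% as.vector(upstream), nrow = p1)
grad_mat <- t(X) %*% (upstream * w)
max(abs(grad_vec - grad_mat))
```

The large Jacobian never has to be formed in practice; the compact product on the last line gives the same numbers.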

Update 3

Making progress here. I woke up at 2 AM last night with this idea. Math is not good for sleep.

Here is $\partial R/\partial \mathbf{\Gamma_1}$, with some notational shorthand:

$w \equiv (\gamma \otimes \mathbf{1}) \odot \sigma_{\mathbf{X\Gamma_1}}^{-1}$
$\text{"stub"} \equiv a'(b(\mathbf{X\Gamma}_1)) \odot -2\hat\epsilon \mathbf{\Gamma}_2^T$

At the end of the chain rule:

$\frac{\partial R}{\partial \Gamma_1} = \frac{\partial w \odot \mathbf{X\Gamma}_1}{\partial \Gamma_1}\left(\text{"stub"}\right)$

Doing this elementwise, $i$ and $j$ index columns and $\mathbf{I}$ is a conformable identity matrix, so that $\mathbf{I} w_j$ stands for the diagonal matrix with $w_j$ on its diagonal:

$\frac{\partial R}{\partial \Gamma_{ij}} = \left(w_j \odot \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)$

$\frac{\partial R}{\partial \Gamma_{ij}} = \left(\mathbf{I} w_j \mathbf{X_i}\right)^T\left(\text{"stub"}_j\right)$

$\frac{\partial R}{\partial \Gamma_{ij}} = \mathbf{X_i}^T\mathbf{I} w_j\left(\text{"stub"}_j\right)$

This should be equivalent to:

$\frac{\partial R}{\partial \Gamma} = \mathbf{X}^T\left(\text{"stub"}\odot w\right)$

And, in fact, it is:

stub <- (-2*u %*% t(G2) * ap(b(X%*%G1)))
# note: this multiplies by the column SDs, whereas w as defined above uses sigma^{-1}
w <- t(matrix(gamma)) %x% matrix(rep(1, N)) * (apply(X%*%G1, 2, sd) %>% t %x% matrix(rep(1, N)))
drdG1 <- t(X) %*% (stub*w)

loop_drdG1 <- drdG1*NA
for (i in 1:p1){
  for (j in 1:p2){
    # element (i, j): column i of X against column j of w and of the stub
    loop_drdG1[i,j] <- t(X[,i]) %*% diag(w[,j]) %*% (stub[,j])
  }
}

> loop_drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
> drdG1
           [,1]       [,2]       [,3]       [,4]
[1,] -61.531877  122.66157  360.08132 -51.666215
[2,]   7.047767  -14.04947  -41.24316   5.917769
[3,] 124.157678 -247.50384 -726.56422 104.250961
[4,]  44.151682  -88.01478 -258.37333  37.072659
[5,]  22.478082  -44.80924 -131.54056  18.874078
[6,]  22.098857  -44.05327 -129.32135  18.555655
[7,]  79.617345 -158.71430 -465.91653  66.851965
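Agreement between the matrix and loop versions shows that the two are the same algebra; it does not by itself test whether holding $w$ fixed yields the gradient of $R$. One rough way to probe that is with finite differences. The sketch below assumes `tanh` as a smooth stand-in activation, a sum-of-squares loss, and $w$ built from reciprocal sample SDs following the definition of $w$; the helpers `bn` and `num_grad` are hypothetical names.

```r
set.seed(4)
N <- 10; p1 <- 5; p2 <- 3
X   <- matrix(rnorm(N * p1), nrow = N)
G1  <- matrix(rnorm(p1 * p2), nrow = p1)
G2  <- matrix(rnorm(p2), ncol = 1)
y   <- matrix(rnorm(N), ncol = 1)
gam <- c(2, 3, 4); bet <- c(-1, 0, 1)
actp <- function(v) 1 - tanh(v)^2    # derivative of the stand-in activation

# Batch norm of Z using the column means and SDs of a reference matrix Zref
bn <- function(Z, Zref = Z) {
  Zs <- sweep(sweep(Z, 2, colMeans(Zref)), 2, apply(Zref, 2, sd), "/")
  sweep(sweep(Zs, 2, gam, "*"), 2, bet, "+")
}
# Loss with mu and sigma frozen at their values for the baseline G1
loss_frozen <- function(M) sum((y - tanh(bn(X %*% M, X %*% G1)) %*% G2)^2)
# Loss with mu and sigma recomputed from X %*% M
loss_full   <- function(M) sum((y - tanh(bn(X %*% M)) %*% G2)^2)

Z    <- X %*% G1
ehat <- y - tanh(bn(Z)) %*% G2
stub <- actp(bn(Z)) * (-2 * ehat %*% t(G2))
w    <- matrix(gam / apply(Z, 2, sd), nrow = N, ncol = p2, byrow = TRUE)
drdG1_fixed_w <- t(X) %*% (stub * w)

# Central finite differences, perturbing one parameter at a time
num_grad <- function(f, M, h = 1e-6) {
  g <- M * 0
  for (k in seq_along(M)) {
    Mp <- M; Mp[k] <- Mp[k] + h
    Mm <- M; Mm[k] <- Mm[k] - h
    g[k] <- (f(Mp) - f(Mm)) / (2 * h)
  }
  g
}
err_frozen <- max(abs(drdG1_fixed_w - num_grad(loss_frozen, G1)))
err_full   <- max(abs(drdG1_fixed_w - num_grad(loss_full, G1)))
c(err_frozen, err_full)
```

The first discrepancy should sit at noise level and the second generally does not, since $\mu$ and $\sigma$ also move with $\Gamma_1$.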

Update 4

Now for $\partial R / \partial \gamma$. Define

$\widetilde{\mathbf{X\Gamma}} \equiv \left(\mathbf{X\Gamma} - \mu_{\mathbf{X\Gamma}}\right)\odot \sigma^{-1}_\mathbf{X\Gamma}$
$\tilde\gamma \equiv \gamma \otimes\mathbf{1}_N$

As before, the chain rule gives

$\frac{\partial R}{\partial \tilde\gamma} = \frac{\partial \tilde\gamma \odot \widetilde{\mathbf{X\Gamma}}}{\partial \tilde\gamma}\left(\text{"stub"}\right)$

Looping over columns gives

$\frac{\partial R}{\partial \tilde\gamma_i} = (\widetilde{\mathbf{X\Gamma}})_i^T \mathbf{I}\tilde\gamma_i \left(\text{"stub"}_i\right)$

Which, like before, is basically pre-multiplying the stub. It should therefore be equivalent to:

$\frac{\partial R}{\partial \tilde\gamma} = (\widetilde{\mathbf{X\Gamma}})^T \left(\text{"stub"} \odot \tilde\gamma \right)$

It sort of matches:

drdg <- t(scale(X %*% G1)) %*% (stub * t(matrix(gamma)) %x% matrix(rep(1, N)))

loop_drdg <- foreach(i = 1:4, .combine = c) %do% {
  t(scale(X %*% G1)[,i]) %*% (stub[,i, drop = FALSE] * gamma[i])
}

> drdg
           [,1]      [,2]       [,3]       [,4]
[1,]  0.8580574 -1.125017  -4.876398  0.4611406
[2,] -4.5463304  5.960787  25.837103 -2.4433071
[3,]  2.0706860 -2.714919 -11.767849  1.1128364
[4,] -8.5641868 11.228681  48.670853 -4.6025996
> loop_drdg
[1]   0.8580574   5.9607870 -11.7678486  -4.6025996

The diagonal on the first is the same as the vector on the second. But really since the derivative is with respect to a matrix -- albeit one with a certain structure, the output should be a similar matrix with the same structure. Should I take the diagonal of the matrix approach and simply take it to be $\gamma$ ? I'm not sure.

It seems that I have answered my own question but I am unsure whether I am correct. At this point I will accept an answer that rigorously proves (or disproves) what I've sort of hacked together.

while(not_answered){
  print("Bueller?")
  Sys.sleep(1)
}

— generic_user

Chapter 9, section 14 of "Matrix Differential Calculus with Applications in Statistics and Econometrics" by Magnus and Neudecker, 3rd edition (janmagnus.nl/misc/mdc2007-3rdedition) covers differentials of Kronecker products and concludes with an exercise on the differential of the Hadamard product. "Notes on Matrix Calculus" by Paul L. Fackler (www4.ncsu.edu/~pfackler/MatCalc.pdf) has a lot of material on differentiating Kronecker products.

— Mark L. Stone

Thanks for the references. I've found those MatCalc notes before, but they don't cover Hadamard, and anyway I'm never certain whether a rule from non-matrix calculus (product rules, chain rules, etc.) applies or doesn't apply to the matrix case. I'll look into the book. I'd accept an answer that points me to all of the ingredients I need to pencil it out myself...

— generic_user

Why are you doing this? Why not use frameworks such as Keras/TensorFlow? It's a waste of productive time to implement these low-level algorithms, time that you could spend on solving actual problems.

— Aksakal almost surely binary

More precisely, I'm fitting networks that exploit known parametric structure -- both in terms of linear-in-parameters representations of input data, as well as longitudinal/panel structure. Established frameworks are so heavily optimized as to be beyond my ability to hack/modify. Plus math is helpful generally. Plenty of codemonkeys have no idea what they're doing. Likewise learning enough Rcpp to implement it efficiently is useful.

— generic_user

@MarkL.Stone not only is it theoretically sound, it's practically easy! A more or less mechanical process! &%#$!

— generic_user

Not a complete answer, but to demonstrate what I suggested in my comment: if

$b(X)=(X-e_N\mu_X^T)\Gamma\Sigma_X^{-1/2}+e_N\beta^T$

where $\Gamma=\mathop{\mathrm{diag}}(\gamma)$, $\Sigma_X^{-1/2}=\mathop{\mathrm{diag}}(\sigma_{X_1}^{-1},\sigma_{X_2}^{-1},\dots)$ and $e_N$ is a vector of ones, then by the chain rule

$\nabla_\beta R=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(I\otimes e_N)]^T$

Noting that $-2\hat{\epsilon}(\Gamma_2^T\otimes I)=\mathop{\mathrm{vec}}(-2\hat{\epsilon}\Gamma_2^T)^T$ and $J_X(a)=\mathop{\mathrm{diag}}(\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))))$, we see that

$\nabla_\beta R=(I\otimes e_N^T)\mathop{\mathrm{vec}}(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)=e_N^T(a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T)$

via the identity $\mathop{\mathrm{vec}}(AXB)=(B^T\otimes A)\mathop{\mathrm{vec}}(X)$. Similarly,

$\begin{aligned}\nabla_\gamma R&=[-2\hat{\epsilon}(\Gamma_2^T\otimes I)J_X(a)(\Sigma_{X\Gamma_1}^{-1/2}\otimes (X\Gamma_1-e_N\mu_{X\Gamma_1}^T))K]^T\\&=K^T\mathop{\mathrm{vec}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})\\&=\mathop{\mathrm{diag}}((X\Gamma_1-e_N\mu_{X\Gamma_1}^T)^TW\Sigma^{-1/2}_{X\Gamma_1})\end{aligned}$

where $W=a^\prime(b(X\Gamma_1))\odot-2\hat{\epsilon}\Gamma_2^T$ (the "stub") and $K$ is an $Np\times p$ binary matrix that selects the columns of the Kronecker product corresponding to the diagonal elements of a square matrix. This follows from the fact that $d\Gamma_{i\neq j}=0$. Unlike the first gradient, this expression is not equivalent to the expression you derived. Considering that $b$ is a linear function w.r.t. $\gamma_i$, there should not be a factor of $\gamma_i$ in the gradient. I leave the gradient of $\Gamma_1$ to the OP, but I will say that a derivation with fixed $w$ creates the "explosion" the writers of the article seek to avoid. In practice, you will also need to find the Jacobians of $\Sigma_X$ and $\mu_X$ w.r.t. $X$ and use the product rule.
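These expressions can be checked numerically. The sketch below is an illustration under stated assumptions: `tanh` stands in for $a(.)$, the loss is a plain sum of squares, and $\sigma$ is the sample SD with denominator $N-1$. In R terms, $\nabla_\beta R$ becomes the column sums of $W$ and $\nabla_\gamma R$ the column sums of the standardized $X\Gamma_1$ times $W$. For $\Gamma_1$ it applies the product rule through $\mu$ and $\sigma$ in the form commonly used for the batch-norm backward pass; that step is not derived above and is included as an assumption to be tested by the finite-difference comparison. The names `forward`, `loss` and `num_grad` are hypothetical.

```r
set.seed(5)
N <- 10; p1 <- 5; p2 <- 3
X   <- matrix(rnorm(N * p1), nrow = N)
G1  <- matrix(rnorm(p1 * p2), nrow = p1)
G2  <- matrix(rnorm(p2), ncol = 1)
y   <- matrix(rnorm(N), ncol = 1)
gam <- c(2, 3, 4); bet <- c(-1, 0, 1)
actp <- function(v) 1 - tanh(v)^2    # derivative of the stand-in activation

forward <- function(G1, gam, bet) {
  Z   <- X %*% G1
  sig <- apply(Z, 2, sd)
  Zh  <- sweep(sweep(Z, 2, colMeans(Z)), 2, sig, "/")   # standardized columns
  B   <- sweep(sweep(Zh, 2, gam, "*"), 2, bet, "+")     # batch-norm output
  list(Zh = Zh, B = B, sig = sig, yhat = tanh(B) %*% G2)
}
loss <- function(G1, gam, bet) sum((y - forward(G1, gam, bet)$yhat)^2)

f <- forward(G1, gam, bet)
W <- actp(f$B) * (-2 * (y - f$yhat) %*% t(G2))          # the "stub"

grad_beta  <- colSums(W)
grad_gamma <- colSums(f$Zh * W)      # diagonal of t(Zh) %*% W, no gamma factor

# Gamma_1: propagate through mu and sigma as well as the centered values
Gz <- sweep(W, 2, gam, "*")                              # dR/dZh
dZ <- sweep(Gz, 2, colMeans(Gz)) -
  sweep(f$Zh, 2, colSums(Gz * f$Zh) / (N - 1), "*")
dZ <- sweep(dZ, 2, f$sig, "/")
grad_G1 <- t(X) %*% dZ

# Central finite differences, perturbing one parameter at a time
num_grad <- function(fun, M, h = 1e-6) {
  g <- M * 0
  for (k in seq_along(M)) {
    Mp <- M; Mp[k] <- Mp[k] + h
    Mm <- M; Mm[k] <- Mm[k] - h
    g[k] <- (fun(Mp) - fun(Mm)) / (2 * h)
  }
  g
}
err_beta  <- max(abs(grad_beta  - num_grad(function(v) loss(G1, gam, v), bet)))
err_gamma <- max(abs(grad_gamma - num_grad(function(v) loss(G1, v, bet), gam)))
err_G1    <- max(abs(grad_G1    - num_grad(function(M) loss(M, gam, bet), G1)))
c(err_beta, err_gamma, err_G1)
```

Everything stays in dense matrix operations and column reductions, so no $Np \times Np$ Jacobian is ever formed.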

— deasmhumnha