# What is the Bellman operator in reinforcement learning?


Some papers related to this topic are Feature-Based Methods for Large Scale Dynamic Programming (John N. Tsitsiklis and Benjamin Van Roy, 1996), An Analysis of Temporal-Difference Learning with Function Approximation (John N. Tsitsiklis and Benjamin Van Roy, 1997), and Least-Squares Policy Iteration (Michail G. Lagoudakis and Ronald Parr, 2003).

Start with the Bellman expectation equation for the state-value function of a policy $\pi$:

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_\pi(s') \right) \tag{1}$$

If we let

$$\mathcal{P}_{ss'}^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{P}_{ss'}^a \tag{2}$$
and

$$\mathcal{R}_s^\pi = \sum_{a \in \mathcal{A}} \pi(a|s) \mathcal{R}_s^a \tag{3}$$

then we can rewrite $(1)$ as



$$v_\pi(s) = \mathcal{R}_s^\pi + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^\pi v_\pi(s') \tag{4}$$

This can be written in matrix form:



$$\begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} = \begin{bmatrix} \mathcal{R}_1^\pi \\ \vdots \\ \mathcal{R}_n^\pi \end{bmatrix} + \gamma \begin{bmatrix} \mathcal{P}_{11}^\pi & \dots & \mathcal{P}_{1n}^\pi \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1}^\pi & \dots & \mathcal{P}_{nn}^\pi \end{bmatrix} \begin{bmatrix} v_\pi(1) \\ \vdots \\ v_\pi(n) \end{bmatrix} \tag{5}$$

Or, more compactly,



$$v_\pi = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v_\pi \tag{6}$$

Notice that both sides of $(6)$ are $n$-dimensional vectors. Here $n = |\mathcal{S}|$ is the size of the state space. We can then define an operator $\mathcal{T}^\pi : \mathbb{R}^n \to \mathbb{R}^n$ as



$$\mathcal{T}^\pi(v) = \mathcal{R}^\pi + \gamma \mathcal{P}^\pi v \tag{7}$$

for any $v \in \mathbb{R}^n$. This is the expected Bellman operator.
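Since $(6)$ is linear in $v_\pi$, the fixed point can be obtained either by iterating $\mathcal{T}^\pi$ or directly as $v_\pi = (I - \gamma \mathcal{P}^\pi)^{-1} \mathcal{R}^\pi$. Here is a minimal NumPy sketch of both; the 2-state, 2-action MDP and the policy are made-up assumptions for illustration, not part of the original answer.

```python
import numpy as np

# A minimal sketch, assuming a made-up 2-state, 2-action MDP.
# P[a, s, s'] = P^a_{ss'}: transition probabilities for action a.
# R[a, s]     = R^a_s:     expected immediate reward for action a in state s.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5],      # pi[s, a] = pi(a|s), a fixed stochastic policy
               [0.2, 0.8]])

# Build R^pi and P^pi as in equations (3) and (2).
R_pi = np.einsum("sa,as->s", pi, R)    # R^pi_s     = sum_a pi(a|s) R^a_s
P_pi = np.einsum("sa,ast->st", pi, P)  # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}

def T_pi(v):
    """Expected Bellman operator, equation (7)."""
    return R_pi + gamma * P_pi @ v

# Because (6) is linear, v_pi also has the closed form
# v_pi = (I - gamma * P^pi)^{-1} R^pi.
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(np.allclose(T_pi(v_pi), v_pi))  # True: v_pi is the fixed point of T^pi
```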

Similarly, you can rewrite the Bellman optimality equation



$$v_*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right) \tag{8}$$

as the Bellman optimality operator



$$\mathcal{T}^*(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a + \gamma \mathcal{P}^a v \right) \tag{9}$$

The Bellman operators are "operators" in that they are mappings from one point to another within the vector space of state values, $\mathbb{R}^n$.
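The optimality operator $(9)$ is a per-state maximum over actions: compute $\mathcal{R}^a + \gamma \mathcal{P}^a v$ for every action $a$ and take the elementwise maximum. A hedged sketch, reusing the same illustrative MDP arrays as in the previous snippet:

```python
import numpy as np

# A sketch of the Bellman optimality operator (9). The MDP arrays below are
# the same made-up assumptions as before: P[a, s, s'] are transition
# probabilities and R[a, s] expected rewards.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def T_star(v):
    """(T* v)(s) = max_a (R^a_s + gamma * sum_s' P^a_{ss'} v(s'))."""
    q = R + gamma * P @ v   # q[a, s]: one-step value of taking action a in state s
    return q.max(axis=0)    # elementwise maximum over actions

print(T_star(np.zeros(2)))  # one application of T* to the zero vector
```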

Rewriting the Bellman equations as operators is useful for proving that certain dynamic programming algorithms (e.g. policy iteration, value iteration) converge to a unique fixed point. This usefulness comes from a body of existing work in operator theory, which lets us exploit special properties of the Bellman operators.
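For example, the expected Bellman operator is a $\gamma$-contraction in the max norm: for any $u, v \in \mathbb{R}^n$,

$$\|\mathcal{T}^\pi(u) - \mathcal{T}^\pi(v)\|_\infty = \gamma \|\mathcal{P}^\pi (u - v)\|_\infty \le \gamma \|u - v\|_\infty,$$

because every row of $\mathcal{P}^\pi$ is a probability distribution, so multiplying by $\mathcal{P}^\pi$ cannot increase the max norm. A similar argument applies to $\mathcal{T}^*$.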

Specifically, the fact that the Bellman operators are contractions gives the useful results that, for any policy $\pi$ and any initial vector $v$,



$$\lim_{k \to \infty} (\mathcal{T}^\pi)^k v = v_\pi \tag{10}$$



$$\lim_{k \to \infty} (\mathcal{T}^*)^k v = v_* \tag{11}$$

where $v_\pi$ is the value of policy $\pi$ and $v_*$ is the value of an optimal policy $\pi^*$. The proof follows from the contraction mapping (Banach fixed-point) theorem.
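As a sanity check, $(11)$ can be verified numerically: repeatedly applying $\mathcal{T}^*$ to an arbitrary starting vector (which is exactly value iteration) converges geometrically to a fixed point. A small sketch, again using the illustrative MDP assumed earlier:

```python
import numpy as np

# A sanity-check sketch of (11): iterating T* from an arbitrary starting
# vector converges to its fixed point v_* (this is value iteration). The MDP
# numbers remain the made-up assumptions used above.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.1, 0.9], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def T_star(v):
    return (R + gamma * P @ v).max(axis=0)

v = np.zeros(2)  # any initial vector works, per (11)
for k in range(1000):
    v_next = T_star(v)
    if np.max(np.abs(v_next - v)) < 1e-10:  # sup-norm stopping rule
        break
    v = v_next

print(k, v)  # v now approximates v_*, the optimal state-value vector
```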
