Qラーニングが分岐するのはなぜですか？

私のQラーニングアルゴリズムの状態値は無限に発散し続けます。つまり、私の重みも発散しています。値のマッピングにはニューラルネットワークを使用しています。

私はもう試した：

「報酬+割引*アクションの最大値」のクリッピング（最大/最小は50 / -50に設定）
低い学習率の設定（0.00001と私は、重みを更新するために従来のバックプロパゲーションを使用しています）
報酬の価値を下げる
探査率を上げる
入力を1〜100に正規化します（以前は0〜1でした）
割引率を変更する
ニューラルネットワークのレイヤーを減らす（検証のためだけ）

Qラーニングは非線形入力で発散することが知られていると聞きましたが、重みの発散を止めようと試みることができる他に何かありますか？

2017年8月14日の更新＃1：

リクエストがあったため、現在行っていることについて具体的な詳細を追加することにしました。

私は現在、エージェントにシューティングゲームのトップダウンビューで戦う方法を学ばせようとしています。対戦相手は確率的に動くシンプルなボットです。

各キャラクターには、各ターンで選択できる9つのアクションがあります。

上に移動
下に移動
左に移動
右に動く
弾丸を上向きに発射する
弾丸を撃ち落とす
左に弾丸を撃ちます
右に弾丸を撃ちます
何もしない

報酬は次のとおりです。

エージェントがボットを弾丸で打った場合、+ 100（さまざまな値を試してみました）
ボットが発射した弾丸にエージェントが当たった場合、-50（ここでも、さまざまな値を試しました）
弾丸を発射できないときにエージェントが弾丸を発射しようとした場合（例：エージェントが弾丸を発射した直後など）-25（必須ではありませんが、エージェントをより効率的にしたいと思いました）
ボットがアリーナから出ようとした場合は-20（あまり必要ではありませんが、エージェントをより効率的にしたいと思いました）

ニューラルネットワークの入力は次のとおりです。

0から100に正規化されたX軸上のエージェントとボット間の距離
0から100に正規化されたY軸上のエージェントとボット間の距離
エージェントのxとyの位置
ボットのxとyの位置
ボットの弾丸の位置。ボットが弾丸を発射しなかった場合、パラメーターはボットのx位置とy位置に設定されます。

私も入力をいじっています。エージェントの位置（距離ではなく実際の位置）のx値やボットの弾丸の位置などの新しい機能を追加してみました。それらのどれもうまくいきませんでした。

これがコードです：

from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm


#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1

#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)

#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]

#Setup character dimentions
character_size = 50
character_move_speed = 25

#Initialize character stats
character_init_health = 100

#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100

#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))

#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)

initialize = tf.global_variables_initializer()

jList = []
rList = []

init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)

#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()

def param_init():
    """Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y

    agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
    bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
    agent_hp = bot_hp = character_init_health
    agent_beam_fire = bot_beam_fire = False
    agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
    agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0


def screen_blit():
    global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
    agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
    agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width

    disp.fill(aqua)
    draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
                            2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
    draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
                            disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire == True:
        draw.rect(disp, green_yellow, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        draw.rect(disp, energy_blue, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        agent_beam_fire = False

    draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
    draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))

    draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
    draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
                            border + 1, float(bot_hp) / float(character_init_health) * 100, 14))


def bot_take_action():
    return random.randint(1, 9)

def beam_hit_detector(player):
    global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
    bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
    bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size

    if player == "bot":
        if bot_current_action == 1:
            if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 2:
            if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
        elif bot_current_action == 3:
            if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
                return True
            else:
                return False
    else:
        if agent_current_action == 1:
            if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif agent_current_action == 2:
            if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False
        elif agent_current_action == 3:
            if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
                return True
            else:
                return False
        elif bot_current_action == 4:
            if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
                return True
            else:
                return False


def mapping(maximum, number):
    return number#int(number * maximum)

def action(agent_action, bot_action):
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
    bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
    agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size

    agent_current_action = agent_action; bot_current_action = bot_action
    reward = 0; cont = True; successful = False; winner = ""
    if 1 <= bot_action <= 4:
        bot_beam_fire = True
        if bot_action == 1:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
            bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
        elif bot_action == 2:
            bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
        elif bot_action == 3:
            bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
            bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
        elif bot_action == 4:
            bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
            bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width

    elif 5 <= bot_action <= 8:
        if bot_action == 5:
            bot_y -= character_move_speed
            if bot_y <= disp_y/2 - arena_y/2:
                bot_y = disp_y/2 - arena_y/2
            elif agent_y <= bot_y <= agent_y + character_size:
                bot_y = agent_y + character_size
        elif bot_action == 6:
            bot_x += character_move_speed
            if bot_x >= disp_x/2 + arena_x/2 - character_size:
                bot_x = disp_x/2 + arena_x/2 - character_size
            elif agent_x <= bot_x + character_size <= agent_x + character_size:
                bot_x = agent_x - character_size
        elif bot_action == 7:
            bot_y += character_move_speed
            if bot_y + character_size >= disp_y/2 + arena_y/2:
                bot_y = disp_y/2 + arena_y/2 - character_size
            elif agent_y <= bot_y + character_size <= agent_y + character_size:
                bot_y = agent_y - character_size
        elif bot_action == 8:
            bot_x -= character_move_speed
            if bot_x <= disp_x/2 - arena_x/2:
                bot_x = disp_x/2 - arena_x/2
            elif agent_x <= bot_x <= agent_x + character_size:
                bot_x = agent_x + character_size

    if bot_beam_fire == True:
        if beam_hit_detector("bot"):
            #print "Agent Got Hit!"
            agent_hp -= beam_damage
            reward += -50
            bot_beam_size_x = bot_beam_size_y = 0
            bot_beam_x = bot_beam_y = beam_ob
            if agent_hp <= 0:
                cont = False
                winner = "Bot"

    if 1 <= agent_action <= 4:
        agent_beam_fire = True
        if agent_action == 1:
            if agent_y > disp_y/2 - arena_y/2:
                agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
                agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
            else:
                reward += -25
        elif agent_action == 2:
            if agent_x + character_size < disp_x/2 + arena_x/2:
                agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
            else:
                reward += -25
        elif agent_action == 3:
            if agent_y + character_size < disp_y/2 + arena_y/2:
                agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
                agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
            else:
                reward += -25
        elif agent_action == 4:
            if agent_x > disp_x/2 - arena_x/2:
                agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
                agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
            else:
                reward += -25

    elif 5 <= agent_action <= 8:
        if agent_action == 5:
            agent_y -= character_move_speed
            if agent_y <= disp_y/2 - arena_y/2:
                agent_y = disp_y/2 - arena_y/2
                reward += -5
            elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y + character_size
                reward += -2
        elif agent_action == 6:
            agent_x += character_move_speed
            if agent_x + character_size >= disp_x/2 + arena_x/2:
                agent_x = disp_x/2 + arena_x/2 - character_size
                reward += -5
            elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x - character_size
                reward += -2
        elif agent_action == 7:
            agent_y += character_move_speed
            if agent_y + character_size >= disp_y/2 + arena_y/2:
                agent_y = disp_y/2 + arena_y/2 - character_size
                reward += -5
            elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
                agent_y = bot_y - character_size
                reward += -2
        elif agent_action == 8:
            agent_x -= character_move_speed
            if agent_x <= disp_x/2 - arena_x/2:
                agent_x = disp_x/2 - arena_x/2
                reward += -5
            elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
                agent_x = bot_x + character_size
                reward += -2
    if agent_beam_fire == True:
        if beam_hit_detector("agent"):
            #print "Bot Got Hit!"
            bot_hp -= beam_damage
            reward += 50
            agent_beam_size_x = agent_beam_size_y = 0
            agent_beam_x = agent_beam_y = beam_ob
            if bot_hp <= 0:
                successful = True
                cont = False
                winner = "Agent"
    return reward, cont, successful, winner

def bot_beam_dir_detector():
    global bot_current_action
    if bot_current_action == 1:
        bot_beam_dir = 2
    elif bot_current_action == 2:
        bot_beam_dir = 4
    elif bot_current_action == 3:
        bot_beam_dir = 3
    elif bot_current_action == 4:
        bot_beam_dir = 1
    else:
        bot_beam_dir = 0
    return bot_beam_dir

#Parameters
y = 0.75
e = 0.3
num_episodes = 10000
batch_size = 10
complexity = 100
with tf.Session() as sess:
    sess.run(initialize)
    success = 0
    for i in tqdm(range(1, num_episodes)):
        #print "Episode #", i
        rAll = 0; d = False; c = True; j = 0
        param_init()
        samples = []
        while c == True:
            j += 1
            current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            b = bot_take_action()
            if np.random.rand(1) < e or i <= 5:
                a = random.randint(0, 8)
            else:
                a, _ = sess.run([predict, Q],feed_dict={input_layer : current_state})
            r, c, d, winner = action(a + 1, b)
            bot_beam_dir = bot_beam_dir_detector()
            next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y) / float(arena_y)),
                                        mapping(complexity, float(bot_x) / float(arena_x)),
                                        mapping(complexity, float(bot_y) / float(arena_y)),
                                        #mapping(complexity, float(agent_hp) / float(character_init_health)),
                                        #mapping(complexity, float(bot_hp) / float(character_init_health)),
                                        mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
                                        mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
                                        bot_beam_dir
                                        ]])
            samples.append([current_state, a, r, next_state])
            if len(samples) > 10:
                for count in xrange(batch_size):
                    [batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
                    batch_allQ = sess.run(Q, feed_dict={input_layer : batch_current_state})
                    batch_Q1 = sess.run(Q, feed_dict = {input_layer : batch_next_state})
                    batch_maxQ1 = np.max(batch_Q1)
                    batch_targetQ = batch_allQ
                    batch_targetQ[0][a] = reward + y * batch_maxQ1
                    sess.run([updateModel], feed_dict={input_layer : batch_current_state, next_Q : batch_targetQ})
            rAll += r
            screen_blit()
            if d == True:
                e = 1. / ((i / 50) + 10)
                success += 1
                break
            #print agent_hp, bot_hp
            display.update()

        jList.append(j)
        rList.append(rAll)
        print winner

Python環境にpygameとTensorflowとmatplotlibがインストールされている場合、ボットとエージェントの「戦闘」のアニメーションを見ることができるはずです。

更新については触れませんでしたが、誰かが元の一般的な問題と一緒に私の特定の問題にも対処できれば素晴らしいと思います。

ありがとう！

2017年8月18日の更新＃2：

@NeilSlaterのアドバイスに基づいて、モデルにエクスペリエンスリプレイを実装しました。アルゴリズムは改善されましたが、収束を提供するより優れた改善オプションを探します。

2017年8月22日の更新＃3：

エージェントがターンに弾丸でボットを攻撃し、そのターンでボットが行ったアクションが「弾丸を発射」しない場合、間違ったアクションがクレジットされることに気づきました。したがって、私は弾丸をビームに変えて、ビームが発射されたターンにボット/エージェントがダメージを受けるようにしました。

— IronEdward
ソース

最近のネットワークの「凍結された」コピーからの経験リプレイとブートストラップ値を使用していますか？これらはDQNで使用されるアプローチです。安定性のために必要になる場合がありますが、保証はされません。Q（

λ

$\lambda$ ）アルゴリズム、または単一ステップのQ学習？あなたの環境と報酬制度がどのようなものであるかを示すことができますか？報酬が少ない場合、シングルステップQラーニングはうまく機能しません。たとえば、長いエピソードの最後の最終+1または-1報酬です。

— Neil Slater

更新から、私はすぐに経験を再現する必要があることをお勧めします。また、これらは非線形近似による強化学習への影響を安定させるため、ブートストラップ用におそらくネットワークを交互にする必要もあります。私はそれについて詳しく説明し、例を示すためにプロジェクトコードを見て喜んでいますが、その詳細レベルで返答するのに1〜2日かかる場合があります。

— Neil Slater

コードを実行しましたが、理解が正しければ、エージェントは1から4のアクションごとにアクションを選択することで弾丸を「操縦」できます。つまり、エージェントが止まっている間、弾丸を任意の方向に動かすことができます。それは意図的ですか？ボットは、グリッド上でエージェントに位置合わせされたときにのみ起動し、そうする場合は常に同じ方向を選択するため、これを行いません。

— Neil Slater

ほとんど問題ありませんが、ブートストラップされた値は保存せず、後でステップがサンプリングされたときに再計算します。実行したアクションごとに、状態、アクション、次の状態、報酬の4つを保存します。次に、このリストから小さなミニバッチ（ステップごとに1つで十分ですが、たとえば10が一般的です）を取得し、Qラーニングの場合は、新しい最大アクションとその値を計算して、監視付き学習ミニバッチ（別名TDターゲット）。

— Neil Slater

それは「近似器の凍結されたコピー（つまり、ニューラルネットワーク）」である必要があります（引用が私のコメントまたは回答の1つからのものである場合は、私に指摘してください。修正します。非常に簡単です。重量のコピーを2つ保持するだけです。パラメータ

w

$\mathbf{w}$ 、更新する「ライブ」のものと、数百回の更新ごとに「ライブ」のものからコピーする「最近の」もの。TDターゲットを計算するとき

R + γ {max}_{a^{'}} \hat{q} (S^{'}, a^{'}, w)

$R + \gamma \text{max}_{a'} \hat{q}(S',a',\mathbf{w})$ 次に、「古い」コピーを使用して計算します

\hat{q}

$\hat{q}$ 、しかしそれらの値で「ライブ」のものをトレーニングします。

— Neil Slater

重みが発散している場合は、オプティマイザーまたは勾配が適切に動作していません。重みを発散させる一般的な理由は、勾配の爆発です。

レイヤーが多すぎる、または
RNNを使用している場合、繰り返しサイクルが多すぎます。

次のようにして、グラデーションが爆発しているかどうかを確認できます。

grad_magnitude = tf.reduce_sum([tf.reduce_sum(g**2)
                                for g in tf.gradients(loss, weights_list)])**0.5

グラデーションの爆発の問題を解決するいくつかのアプローチは次のとおりです。

RELUまたはELUアクティベーションを使用する
Xavierの初期化を使用する
Deep Residualアーキテクチャを使用します。これにより、グラデーションが後続のレイヤーによって押しつぶされるのを防ぎます。

— デフォルトの画像
ソース