The state values in my Q-learning algorithm keep diverging to infinity, which means my weights are diverging as well. I use a neural network for the value mapping.
Here is what I have already tried:
- Clipping the target "reward + discount * max Q of the next state" to the range -50 to 50 (see the sketch just below this question)
- Setting a low learning rate (0.00001; I use plain backpropagation to update the weights)
- Lowering the reward values
- Increasing the exploration rate
- Normalizing the inputs to the range 1-100 (previously they were 0-1)
- Changing the discount factor
- Reducing the number of layers in the neural network (just to verify)
I have heard that Q-learning is known to diverge with non-linear function approximation, but is there anything else I can try to stop the weights from diverging?
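To make the target clipping above concrete, the idea is roughly the following (a minimal sketch, not my actual code; the function and variable names are only illustrative):

import numpy as np

def clipped_td_target(reward, discount, next_q_values, clip_min=-50.0, clip_max=50.0):
    """Compute the Q-learning target and clip it into [clip_min, clip_max]."""
    # next_q_values holds the network's Q estimates for the next state, one per action
    target = reward + discount * np.max(next_q_values)
    return np.clip(target, clip_min, clip_max)

# Example: even a +100 reward produces a target capped at 50
print(clipped_td_target(100, 0.75, np.array([1.2, -0.3, 4.5])))  # -> 50.0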
Update #1 (August 14, 2017):
Since it was requested, I am adding more specific details about what I am doing.
I am currently trying to get an agent to learn how to fight in a top-down shooter. The opponent is a simple bot that moves stochastically.
Each character has 9 actions to choose from on each turn:
- Move up
- Move down
- Move left
- Move right
- Shoot a bullet upwards
- Shoot a bullet downwards
- Shoot a bullet to the left
- Shoot a bullet to the right
- Do nothing
The rewards are (summarized in the small sketch after this list):
- +100 if the agent hits the bot with a bullet (I have tried many different values)
- -50 if the agent is hit by a bullet the bot fired (again, I have tried many different values)
- -25 if the agent tries to fire a bullet when it cannot (e.g. right after it has just fired one); not strictly necessary, but I wanted the agent to be more efficient
- -20 if the agent tries to move out of the arena (also not strictly necessary, but I wanted the agent to be more efficient)
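Just so the scheme is concrete, this is how I think of the rewards (a tiny sketch; the event names and the helper are only illustrative and not in my actual code, while the constants are the values listed above):

#Illustrative only: these event names and this helper are not in my actual code
REWARDS = {
    "agent_hit_bot": 100,   #agent's bullet hits the bot
    "agent_got_hit": -50,   #bot's bullet hits the agent
    "invalid_shot": -25,    #agent tries to fire when it cannot
    "left_arena": -20,      #agent tries to move out of the arena
}

def reward_for(events):
    """Sum the rewards for all events that happened this turn."""
    return sum(REWARDS[event] for event in events)

print(reward_for(["agent_hit_bot", "invalid_shot"]))  # -> 75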
The inputs to the neural network are:
- The distance between the agent and the bot on the X axis, normalized to 0-100
- The distance between the agent and the bot on the Y axis, normalized to 0-100
- The agent's x and y positions
- The bot's x and y positions
- The position of the bot's bullet (if the bot has not fired a bullet, these parameters are set to the bot's x and y positions)
I have also been experimenting with the inputs. I tried adding new features, such as the x value of the agent's position (the actual position rather than the distance) and the position of the bot's bullet. None of them worked.
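For reference, this is roughly how I picture assembling and scaling the feature vector described above (a sketch; the helper name is mine, and note that in the code further down the 0-100 scaling is currently commented out inside mapping()):

def build_features(agent_x, agent_y, bot_x, bot_y, bullet_x, bullet_y, arena_x, arena_y):
    """Build the network inputs described above, scaled into the 0-100 range."""
    sx = 100.0 / arena_x   #scale factor for x values
    sy = 100.0 / arena_y   #scale factor for y values
    return [
        sx * (agent_x - bot_x),        #X-axis distance between agent and bot
        sy * (agent_y - bot_y),        #Y-axis distance between agent and bot
        sx * agent_x, sy * agent_y,    #agent position
        sx * bot_x, sy * bot_y,        #bot position
        sx * bullet_x, sy * bullet_y,  #bot's bullet (set to the bot's position when there is no bullet)
    ]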
Here is the code:
from pygame import *
from pygame.locals import *
import sys
from time import sleep
import numpy as np
import random
import tensorflow as tf
from pylab import savefig
from tqdm import tqdm
#Screen Setup
disp_x, disp_y = 1000, 800
arena_x, arena_y = 1000, 800
border = 4; border_2 = 1
#Color Setup
white = (255, 255, 255); aqua= (0, 200, 200)
red = (255, 0, 0); green = (0, 255, 0)
blue = (0, 0, 255); black = (0, 0, 0)
green_yellow = (173, 255, 47); energy_blue = (125, 249, 255)
#Initialize character positions
init_character_a_state = [disp_x/2 - arena_x/2 + 50, disp_y/2 - arena_y/2 + 50]
init_character_b_state = [disp_x/2 + arena_x/2 - 50, disp_y/2 + arena_y/2 - 50]
#Setup character dimensions
character_size = 50
character_move_speed = 25
#Initialize character stats
character_init_health = 100
#initialize bullet stats
beam_damage = 10
beam_width = 10
beam_ob = -100    #off-screen position used while a beam is not active
#The Neural Network
input_layer = tf.placeholder(shape=[1,7],dtype=tf.float32)
weight_1 = tf.Variable(tf.random_uniform([7,9],0,0.1))
#weight_2 = tf.Variable(tf.random_uniform([6,9],0,0.1))
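#NOTE: with weight_2 commented out, Q below is a single linear layer that maps the 7 state inputs to 9 action values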
#The calculations, loss function and the update model
Q = tf.matmul(input_layer, weight_1)
predict = tf.argmax(Q, 1)
next_Q = tf.placeholder(shape=[1,9],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(next_Q - Q))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
updateModel = trainer.minimize(loss)
initialize = tf.global_variables_initializer()
jList = []    #episode lengths
rList = []    #total reward per episode
init()
font.init()
myfont = font.SysFont('Comic Sans MS', 15)
myfont2 = font.SysFont('Comic Sans MS', 150)
myfont3 = font.SysFont('Gothic', 30)
disp = display.set_mode((disp_x, disp_y), 0, 32)
#CHARACTER/BULLET PARAMETERS
agent_x = agent_y = int()
bot_x = bot_y = int()
agent_hp = bot_hp = int()
bot_beam_dir = int()
agent_beam_fire = bot_beam_fire = bool()
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = int()
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = int()
bot_current_action = agent_current_action = int()
def param_init():
"""Initializes parameters"""
    global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y
agent_x = list(init_character_a_state)[0]; agent_y = list(init_character_a_state)[1]
bot_x = list(init_character_b_state)[0]; bot_y = list(init_character_b_state)[1]
agent_hp = bot_hp = character_init_health
agent_beam_fire = bot_beam_fire = False
agent_beam_x = bot_beam_x = agent_beam_y = bot_beam_y = beam_ob
agent_beam_size_x = agent_beam_size_y = bot_beam_size_x = bot_beam_size_y = 0
def screen_blit():
global disp, disp_x, disp_y, arena_x, arena_y, border, border_2, character_size, agent_x, \
agent_y, bot_x, bot_y, character_init_health, agent_hp, bot_hp, red, blue, aqua, green, black, green_yellow, energy_blue, \
agent_beam_fire, bot_beam_fire, agent_beam_x, agent_beam_y, bot_beam_x, bot_beam_y, agent_beam_size_x, agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width
disp.fill(aqua)
draw.rect(disp, black, (disp_x / 2 - arena_x / 2 - border, disp_y /
2 - arena_y / 2 - border, arena_x + border * 2, arena_y + border * 2))
draw.rect(disp, green, (disp_x / 2 - arena_x / 2,
disp_y / 2 - arena_y / 2, arena_x, arena_y))
    if bot_beam_fire == True:
        #draw the bot's beam
        draw.rect(disp, green_yellow, (bot_beam_x, bot_beam_y, bot_beam_size_x, bot_beam_size_y))
        bot_beam_fire = False
    if agent_beam_fire == True:
        #draw the agent's beam
        draw.rect(disp, energy_blue, (agent_beam_x, agent_beam_y, agent_beam_size_x, agent_beam_size_y))
        agent_beam_fire = False
draw.rect(disp, red, (agent_x, agent_y, character_size, character_size))
draw.rect(disp, blue, (bot_x, bot_y, character_size, character_size))
draw.rect(disp, red, (disp_x / 2 - 200, disp_y / 2 + arena_y / 2 +
border + 1, float(agent_hp) / float(character_init_health) * 100, 14))
draw.rect(disp, blue, (disp_x / 2 + 200, disp_y / 2 + arena_y / 2 +
border + 1, float(bot_hp) / float(character_init_health) * 100, 14))
def bot_take_action():
    """The opponent bot simply picks a uniformly random action (1-9) each turn."""
    return random.randint(1, 9)
def beam_hit_detector(player):
global agent_x, agent_y, bot_x, bot_y, agent_beam_fire, bot_beam_fire, agent_beam_x, \
bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, agent_beam_size_y, \
bot_beam_size_x, bot_beam_size_y, bot_current_action, agent_current_action, beam_width, character_size
if player == "bot":
if bot_current_action == 1:
if disp_y/2 - arena_y/2 <= agent_y <= bot_y and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 2:
if bot_x <= agent_x <= disp_x/2 + arena_x/2 and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
elif bot_current_action == 3:
if bot_y <= agent_y <= disp_y/2 + arena_y/2 and (agent_x < bot_beam_x + beam_width < agent_x + character_size or agent_x < bot_beam_x < agent_x + character_size):
return True
else:
return False
elif bot_current_action == 4:
if disp_x/2 - arena_x/2 <= agent_x <= bot_x and (agent_y < bot_beam_y + beam_width < agent_y + character_size or agent_y < bot_beam_y < agent_y + character_size):
return True
else:
return False
else:
if agent_current_action == 1:
if disp_y/2 - arena_y/2 <= bot_y <= agent_y and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
elif agent_current_action == 2:
if agent_x <= bot_x <= disp_x/2 + arena_x/2 and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
elif agent_current_action == 3:
if agent_y <= bot_y <= disp_y/2 + arena_y/2 and (bot_x < agent_beam_x + beam_width < bot_x + character_size or bot_x < agent_beam_x < bot_x + character_size):
return True
else:
return False
        elif agent_current_action == 4:
if disp_x/2 - arena_x/2 <= bot_x <= agent_x and (bot_y < agent_beam_y + beam_width < bot_y + character_size or bot_y < agent_beam_y < bot_y + character_size):
return True
else:
return False
def mapping(maximum, number):
    #NOTE: the scaling is currently disabled, so this returns the raw 0-1 value unchanged
    return number  #int(number * maximum)
def action(agent_action, bot_action):
global agent_x, agent_y, bot_x, bot_y, agent_hp, bot_hp, agent_beam_fire, \
bot_beam_fire, agent_beam_x, bot_beam_x, agent_beam_y, bot_beam_y, agent_beam_size_x, \
agent_beam_size_y, bot_beam_size_x, bot_beam_size_y, beam_width, agent_current_action, bot_current_action, character_size
agent_current_action = agent_action; bot_current_action = bot_action
reward = 0; cont = True; successful = False; winner = ""
if 1 <= bot_action <= 4:
bot_beam_fire = True
if bot_action == 1:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = disp_y/2 - arena_y/2
bot_beam_size_x = beam_width; bot_beam_size_y = bot_y - disp_y/2 + arena_y/2
elif bot_action == 2:
bot_beam_x = bot_x + character_size; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = disp_x/2 + arena_x/2 - bot_x - character_size; bot_beam_size_y = beam_width
elif bot_action == 3:
bot_beam_x = bot_x + character_size/2 - beam_width/2; bot_beam_y = bot_y + character_size
bot_beam_size_x = beam_width; bot_beam_size_y = disp_y/2 + arena_y/2 - bot_y - character_size
elif bot_action == 4:
bot_beam_x = disp_x/2 - arena_x/2; bot_beam_y = bot_y + character_size/2 - beam_width/2
bot_beam_size_x = bot_x - disp_x/2 + arena_x/2; bot_beam_size_y = beam_width
elif 5 <= bot_action <= 8:
if bot_action == 5:
bot_y -= character_move_speed
if bot_y <= disp_y/2 - arena_y/2:
bot_y = disp_y/2 - arena_y/2
elif agent_y <= bot_y <= agent_y + character_size:
bot_y = agent_y + character_size
elif bot_action == 6:
bot_x += character_move_speed
if bot_x >= disp_x/2 + arena_x/2 - character_size:
bot_x = disp_x/2 + arena_x/2 - character_size
elif agent_x <= bot_x + character_size <= agent_x + character_size:
bot_x = agent_x - character_size
elif bot_action == 7:
bot_y += character_move_speed
if bot_y + character_size >= disp_y/2 + arena_y/2:
bot_y = disp_y/2 + arena_y/2 - character_size
elif agent_y <= bot_y + character_size <= agent_y + character_size:
bot_y = agent_y - character_size
elif bot_action == 8:
bot_x -= character_move_speed
if bot_x <= disp_x/2 - arena_x/2:
bot_x = disp_x/2 - arena_x/2
elif agent_x <= bot_x <= agent_x + character_size:
bot_x = agent_x + character_size
if bot_beam_fire == True:
if beam_hit_detector("bot"):
#print "Agent Got Hit!"
agent_hp -= beam_damage
reward += -50
bot_beam_size_x = bot_beam_size_y = 0
bot_beam_x = bot_beam_y = beam_ob
if agent_hp <= 0:
cont = False
winner = "Bot"
if 1 <= agent_action <= 4:
agent_beam_fire = True
if agent_action == 1:
if agent_y > disp_y/2 - arena_y/2:
agent_beam_x = agent_x - beam_width/2; agent_beam_y = disp_y/2 - arena_y/2
agent_beam_size_x = beam_width; agent_beam_size_y = agent_y - disp_y/2 + arena_y/2
else:
reward += -25
elif agent_action == 2:
if agent_x + character_size < disp_x/2 + arena_x/2:
agent_beam_x = agent_x + character_size; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = disp_x/2 + arena_x/2 - agent_x - character_size; agent_beam_size_y = beam_width
else:
reward += -25
elif agent_action == 3:
if agent_y + character_size < disp_y/2 + arena_y/2:
agent_beam_x = agent_x + character_size/2 - beam_width/2; agent_beam_y = agent_y + character_size
agent_beam_size_x = beam_width; agent_beam_size_y = disp_y/2 + arena_y/2 - agent_y - character_size
else:
reward += -25
elif agent_action == 4:
if agent_x > disp_x/2 - arena_x/2:
agent_beam_x = disp_x/2 - arena_x/2; agent_beam_y = agent_y + character_size/2 - beam_width/2
agent_beam_size_x = agent_x - disp_x/2 + arena_x/2; agent_beam_size_y = beam_width
else:
reward += -25
elif 5 <= agent_action <= 8:
if agent_action == 5:
agent_y -= character_move_speed
if agent_y <= disp_y/2 - arena_y/2:
agent_y = disp_y/2 - arena_y/2
reward += -5
elif bot_y <= agent_y <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y + character_size
reward += -2
elif agent_action == 6:
agent_x += character_move_speed
if agent_x + character_size >= disp_x/2 + arena_x/2:
agent_x = disp_x/2 + arena_x/2 - character_size
reward += -5
elif bot_x <= agent_x + character_size <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x - character_size
reward += -2
elif agent_action == 7:
agent_y += character_move_speed
if agent_y + character_size >= disp_y/2 + arena_y/2:
agent_y = disp_y/2 + arena_y/2 - character_size
reward += -5
elif bot_y <= agent_y + character_size <= bot_y + character_size and bot_x <= agent_x <= bot_x + character_size:
agent_y = bot_y - character_size
reward += -2
elif agent_action == 8:
agent_x -= character_move_speed
if agent_x <= disp_x/2 - arena_x/2:
agent_x = disp_x/2 - arena_x/2
reward += -5
elif bot_x <= agent_x <= bot_x + character_size and bot_y <= agent_y <= bot_y + character_size:
agent_x = bot_x + character_size
reward += -2
if agent_beam_fire == True:
if beam_hit_detector("agent"):
#print "Bot Got Hit!"
bot_hp -= beam_damage
reward += 50
agent_beam_size_x = agent_beam_size_y = 0
agent_beam_x = agent_beam_y = beam_ob
if bot_hp <= 0:
successful = True
cont = False
winner = "Agent"
return reward, cont, successful, winner
def bot_beam_dir_detector():
global bot_current_action
if bot_current_action == 1:
bot_beam_dir = 2
elif bot_current_action == 2:
bot_beam_dir = 4
elif bot_current_action == 3:
bot_beam_dir = 3
elif bot_current_action == 4:
bot_beam_dir = 1
else:
bot_beam_dir = 0
return bot_beam_dir
#Parameters
y = 0.75            #discount factor (gamma)
e = 0.3             #exploration rate (epsilon)
num_episodes = 10000
batch_size = 10
complexity = 100    #scaling constant passed to mapping()
with tf.Session() as sess:
sess.run(initialize)
success = 0
for i in tqdm(range(1, num_episodes)):
#print "Episode #", i
rAll = 0; d = False; c = True; j = 0
param_init()
samples = []
while c == True:
j += 1
current_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
b = bot_take_action()
if np.random.rand(1) < e or i <= 5:
a = random.randint(0, 8)
else:
                a, _ = sess.run([predict, Q], feed_dict={input_layer : current_state})
                a = int(a[0])    #predict returns a length-1 array; use a plain int index
            r, c, d, winner = action(a + 1, b)
bot_beam_dir = bot_beam_dir_detector()
next_state = np.array([[mapping(complexity, float(agent_x) / float(arena_x)),
mapping(complexity, float(agent_y) / float(arena_y)),
mapping(complexity, float(bot_x) / float(arena_x)),
mapping(complexity, float(bot_y) / float(arena_y)),
#mapping(complexity, float(agent_hp) / float(character_init_health)),
#mapping(complexity, float(bot_hp) / float(character_init_health)),
mapping(complexity, float(agent_x - bot_x) / float(arena_x)),
mapping(complexity, float(agent_y - bot_y) / float(arena_y)),
bot_beam_dir
]])
samples.append([current_state, a, r, next_state])
if len(samples) > 10:
for count in xrange(batch_size):
[batch_current_state, action_taken, reward, batch_next_state] = samples[random.randint(0, len(samples) - 1)]
batch_allQ = sess.run(Q, feed_dict={input_layer : batch_current_state})
batch_Q1 = sess.run(Q, feed_dict = {input_layer : batch_next_state})
batch_maxQ1 = np.max(batch_Q1)
batch_targetQ = batch_allQ
                    #train toward the target for the action taken in this sampled transition
                    batch_targetQ[0][action_taken] = reward + y * batch_maxQ1
sess.run([updateModel], feed_dict={input_layer : batch_current_state, next_Q : batch_targetQ})
rAll += r
screen_blit()
if d == True:
e = 1. / ((i / 50) + 10)
success += 1
break
#print agent_hp, bot_hp
display.update()
jList.append(j)
rList.append(rAll)
print winner
If you have pygame, TensorFlow, matplotlib and tqdm installed in your Python environment, you should be able to watch an animation of the bot and the agent "fighting".
Don't mind the updates too much, but it would be great if someone could address my specific problem along with the original, more general one.
Thanks!
Update #2 (August 18, 2017):
Based on @NeilSlater's advice, I implemented experience replay in my model. The algorithm has improved, but I am still looking for further improvements that would give convergence.
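For anyone curious, the experience replay I added is roughly along these lines (a minimal, self-contained sketch with illustrative names, not a copy of my actual implementation):

import random
from collections import deque

class ReplayBuffer(object):
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  #oldest transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        #uniform random sampling breaks the correlation between consecutive transitions
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

#Usage: store every transition during play, then train on random minibatches:
#    replay.add(current_state, a, r, next_state, not c)
#    for state, action, reward, next_state, done in replay.sample(batch_size):
#        ...compute the target and run one gradient step, as in the training loop above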
Update #3 (August 22, 2017):
I noticed that the wrong action was getting credited when the agent hit the bot with a bullet on a turn whose chosen action was not "fire a bullet". So I changed the bullets into beams, so that the bot/agent takes damage on the same turn the beam is fired.
エージェントがターンに弾丸でボットを攻撃し、そのターンでボットが行ったアクションが「弾丸を発射」しない場合、間違ったアクションがクレジットされることに気づきました。したがって、私は弾丸をビームに変えて、ビームが発射されたターンにボット/エージェントがダメージを受けるようにしました。