上文中提到，在知道强化学习转移概率和reward函数的情况下，我们可以使用dp的方法来求解强化学习问题。那么，在我们不知道转移概率和reward函数的时候，我们需要怎么来求解呢。

0x1 蒙特卡洛学习(Monte Carlo Learning, MC Learning)

蒙特卡洛一词主要指代使用随机采样的方法，用采样得到的平均水平来代替期望。这里主要使用蒙特卡洛来估计value function。
对于某种策略 $\pi$ ，我们从起始状态 $s_0$ 出发，根据该策略获取状态的轨迹： $s_0,a_0,r_1,s_1,a_1,r_2,...,s_{T-1},a_{T-1},r_T,s_T$ , 这里 $a_t$ 表示 $t$ 时刻的动作， $r_t$ 表示的 $t$ 时刻的reward。我们把每一个这种采样的序列称为轨迹（trajectory），这样的一个过程称为episode，或者rollout。表示一个事件从开始到结束，通过采样多个episode，我们可以得到多个状态序列。在MC prediction中，计算Value函数有两种方法，第一种是first-visit,第二种是evert-visit，其中first-visit就是说，对于在每个episode中，选取该episode第一次出现状态 $s$ 以后的回报来计算 $V(s)$ ,然后对所有的episode中的 $V(s)$ 按照出现次数取平均值，就可以得到 $V(s)$ 。另一种是说取每一个episode中 $s$ 以后的所有回报(接下来的每一个s的回报的和)，再对所有episode得到的求和取平均，作为 $V(s)$ 。
first visit:

$V(s)=\frac{G_1(s)+G_2(s)+...}{N_(s)}$

every visit:

$V(s)=\frac{G_{11}(s)+G_{12}(s)+...+G_{21}(s)+G_{22}(s)+...}{N_(s)}$

在强化学习中，有一点十分重要，就是兼顾探索（exploration）和利用（exploitation），这就引出了我们常用的ε-greedy策略，ε是一个介于0和1之间的值，我们有ε的概率随机探索（动作随机执行），1-ε的概率使用贪心策略（执行当前状态下Q最大的动作），随着策略的学习，ε的值逐渐减小，即策略逐渐由贪心策略主导。

我们使用蒙特卡洛方法和ε-greedy的伪代码如下：
pseudocode of monte carlo learning

import numpy as np
from env import Grid
import matplotlib.pyplot as plt
def get_action(q_table,eps,x,y):
    if np.random.rand() < eps:
        a = np.random.randint(4)
    else:
        a = np.argmax(q_table[x,y])
    return a
grid = Grid()
width = grid.width
height = grid.height
# Q= np.random.rand(width,height,4)
Q= np.zeros((width,height,4))
N = np.zeros((width,height,4))

GAMMA = 0.95
EPOCH = 1000
MIN_EPS = 0.05
for epoch in range(EPOCH):
    eps = 1 - (1 - MIN_EPS)/EPOCH * epoch
    pos = grid.reset()
    ret = []
    a_s = []
    traject = []
    dic={}
    for i in range(width):
        for j in range(height):
            dic[(i,j)] = 0
    while True:
        traject.append(pos)
        a = get_action(Q,eps,*pos)
        a_s.append(a)
        r,pos_,done = grid.step(a)
        ret.append(r)
        pos = pos_
        if done:
            break
    for i in range(len(ret) - 2, -1,-1):
        ret[i] +=  GAMMA * ret[i + 1]

    # print(ret)
    #     self.move = {0:(0,1),1:(0,-1),2:(1,0),3:(-1,0)}
    #     0:down 1:up 2:right 3:left
    for idx,i in enumerate(traject):
        if dic[i] == 0:
            Q[i[0],i[1],a_s[idx]] +=  (ret[idx] - Q[i[0],i[1],a_s[idx]]) / (N[i[0],i[1],a_s[idx]] + 1)
            N[i[0],i[1],a_s[idx]] += 1
            dic[i] = 1

def generate_policy(width, height,Q,goal):
    dic={0:'D',1:'U',2:'R',3:'L'}
    for i in range(height):
        s = ""
        for j in range(width):
            if (j,i) == goal:
                s += "S"
            else:
                s += dic[np.argmax(Q[j,i])]
        print(s)
    print()

goal=grid.goal
generate_policy(width,height,Q,goal)

0x2 时序差分学习(Temporal Difference Learning，TD Learning)

蒙特卡罗强化学习算法通过考虑采样轨迹，克服了模型未知给策略估计造成的困难，不过蒙特卡罗方法有一个缺点，就是每次需要采样完一个轨迹之后才能更新策略。蒙特卡洛方法没有充分利用学习任务的MDP结构，而时序差分学习方法就充分利用了MDP结构，效率比MC要高。

SARSA

sarsa是一种on-policy的TD Learning，其进行决策的策略和更新的策略是相同的。与MC不同的是它不使用采样整条轨迹来估计价值函数，而是使用bellman期望方程中 $V_{\pi}=\sum_a \pi(a|s) (R_{s}^a+\gamma \sum_{s^{’} \in S} P_{ss^{’}}^a V_{\pi}(s^{’}))$
，也就是用即时奖励R与接下来的状态的价值作为对当前状态的价值的估计。这个估计称为TD target。但是，只估计V值还不足以直接得到策略，因此实际上我们一般使用Q值来进行估计，也就是 $Q(s,a) = R + Q(s^{’},a^{’})$

pseudocode of sarsa

import numpy as np
from env import Grid
import matplotlib.pyplot as plt
grid = Grid()
width = grid.width
height = grid.height
Q= np.zeros((width,height,4))
def get_action(q_table,eps,x,y):
    if np.random.rand() < eps:
        a = np.random.randint(4)
    else:
        a = np.argmax(q_table[x,y])
    return a


EPOCH = 10000
MIN_EPS = 0.05
GAMMA = 0.95
LR = 0.1
rets = []
for i in range(EPOCH):
    eps = 1 - (1 - MIN_EPS)/EPOCH * i
    pos = grid.reset()
    a = get_action(Q,eps,*pos)
    ret=0
    # print("pos",pos)
    while True:
        # print("a",a)
        r,pos_,flag = grid.step(a)
        # print("pos_",pos_)
        ret += r
        a_ = get_action(Q,eps,*pos_)
        Q[pos[0],pos[1],a] += LR*(r + GAMMA * Q[pos_[0],pos_[1],a_] - Q[pos[0],pos[1],a])
        pos = pos_
        a = a_

        if flag:
            break
    if i % (EPOCH/200) == 0:
        rets.append(ret)


def generate_policy(width, height,Q,goal):
    dic={0:'D',1:'U',2:'R',3:'L'}
    for i in range(height):
        s = ""
        for j in range(width):
            if (j,i) == goal:
                s += "S"
            else:
                s += dic[np.argmax(Q[j,i])]
        print(s)
    print()

goal=grid.goal
generate_policy(width,height,Q,goal)

fig  = plt.figure()
x = np.arange(0,EPOCH,EPOCH/200)
plt.plot(x,rets)
plt.show()

Q-Learning

与Sarsa不同，Q-Learning是一种off-policy的方法，也就是说更新的策略和实际上采用的策略不是一个策略。Q-Learning的更新方式为 $Q(s,a) = R + \max_{a} Q(s^{’},a)$ , 注意这里与SARSA不同的是，这里只是取了下一个状态中Q值最大的项，但是并没有执行。而SARSA是根据当前策略在 $s^{’}$ 继续执行了一步 $a^{’}$ ，拿这个执行的 $Q(s^{’},a^{’}$ 更新的。

pseudocode of Q-learning

import numpy as np
from env import Grid
import matplotlib.pyplot as plt
grid = Grid()
width = grid.width
height = grid.height
Q= np.zeros((width,height,4))
def get_action(q_table,eps,x,y):
    if np.random.rand() < eps:
        a = np.random.randint(4)
    else:
        a = np.argmax(q_table[x,y])
    return a

EPOCH = 10000
MIN_EPS = 0.05
GAMMA = 0.95
LR = 0.1
rets = []
for i in range(EPOCH):
    eps = 1 - (1 - MIN_EPS)/EPOCH * i
    pos = grid.reset()
    ret = 0
    while True:
        a = get_action(Q,eps,*pos)
        # print("a",a)
        r,pos_,flag = grid.step(a)
        ret += r
        # print("pos_",pos_)

        a_ = get_action(Q,0,*pos_)
        Q[pos[0],pos[1],a] += LR*(r + GAMMA * Q[pos_[0],pos_[1],a_] - Q[pos[0],pos[1],a])
        pos = pos_

        if flag:
            break
    if i % (EPOCH/200) == 0:
        rets.append(ret)


def generate_policy(width, height,Q,goal):
    dic={0:'D',1:'U',2:'R',3:'L'}
    for i in range(height):
        s = ""
        for j in range(width):
            if (j,i) == goal:
                s += "S"
            else:
                s += dic[np.argmax(Q[j,i])]
        print(s)
    print()

goal=grid.goal
generate_policy(width,height,Q,goal)
fig  = plt.figure()
x = np.arange(0,EPOCH,EPOCH/200)
plt.plot(x,rets)
plt.show()