book: Reinforcement Learning: An Introduction. Sutton and Barto, 1998
book: Algorithms for Reinforcement Learning. Szepesvari
About Reinforcement Learning
Reinforcement learning sits at the intersection of many disciplines; at its core it is the science of decision making, and its goal is to make decisions in the best possible way.
In engineering the same problem appears as optimal control, where a great deal of effort is spent finding the best controller.
The same idea goes by different names in different fields, but it always amounts to carrying out a sequence of actions in order to end up with a good result.
In neuroscience, one of the main recent findings is an understanding of how the human brain makes decisions: it relies heavily on the dopamine system, and the way the neurotransmitter dopamine is transmitted mirrors the main algorithms studied in this course.
Related fields also include psychology, mathematics (operations research and equivalent formulations), and economics (game theory).
Reinforcement learning is different from supervised learning, but it is not unsupervised learning either.
Differences:
- There is no supervisor, only a reward signal. No one tells us the right action to take; instead it is more like a child learning by trial and error. Nothing tells you which action is best; at most you are told whether an action was wrong or right, or you are given a score.
- Feedback is delayed, not instantaneous. In reinforcement learning, after you make a decision it may take many, many steps before you learn whether it was a good decision or a bad one. Only with the passage of time, looking back over past decisions, do you realise that a choice may have been a mistake, because a decision made at that point can turn out, many steps later, to bring disastrous losses.
- Time really matters (sequential, non-i.i.d. data). In sequential decision making the agent acts one step at a time, observes how much reward its actions bring, and then tries to adjust its policy so that it eventually collects as much reward as possible.
This is not the usual supervised or unsupervised learning setting, where you simply hand the machine a set of independent and identically distributed data points and let it learn on its own. Here we have to deal with a dynamical system in which the agent interacts with the external environment. For reinforcement learning the i.i.d. assumption is broken: the agent acts in response to the environment, and every decision it makes affects the data it subsequently receives. It is an active learning process; rather than learning from one fixed dataset, the agent effectively generates its own data as it goes.
Rewards
- A reward R_t is a scalar feedback signal (a single scalar value)
- Indicates how well the agent is doing at step t
- The agent's job is to maximise cumulative reward.
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward.
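As a rough illustration of the reward hypothesis (a minimal sketch, not from the lecture; the helper name and the discount factor gamma are assumptions), cumulative reward can be computed from a sequence of scalar rewards like this:

```python
# Minimal sketch: a cumulative (optionally discounted) return from scalar rewards.
def cumulative_return(rewards, gamma=1.0):
    """R_1 + gamma*R_2 + gamma^2*R_3 + ... for a list of scalar rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An episode with delayed reward: small costs now, a larger payoff later.
print(cumulative_return([-1, -1, -1, +10]))             # undiscounted: 7
print(cumulative_return([-1, -1, -1, +10], gamma=0.9))  # discounted: ~4.58
```

The agent's objective can then be phrased as maximising the expected value of such a return.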
Sequential Decision Making
Goal: select actions to maximise total future reward.
- Actions may have long term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward.
Examples:
- A financial investment.
- Refuelling a helicopter
History and State
The history is the sequence of observations, actions, rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
The history is everything the agent knows so far: at every step it takes an action, receives an observation, and receives a reward, i.e. all observable variables up to time t.
What we are modelling is something like a brain: the agent perceives what it "sees" through its sensors; the input is what the agent observes and the output is the decision it makes. There has to be a clean interface between the agent and the environment, and controlling that stream of actions, observations and rewards is what we care about.
What happens next depends on the history:
- The agent selects actions (the algorithm we build is a mapping from history to action: it picks the next action given the history, so the agent's next action depends entirely on the history)
- The environment selects observations/rewards (the environment's response also depends on the history)
State is the information used to determine what happens next
Formally, state is a function of the history: S_t = f(H_t)
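A minimal sketch of that interface (assumed names and a toy environment, not the lecture's notation), recording the history and deriving the state as a function of it:

```python
# Sketch: record the history H_t of observations, rewards and actions, and
# define the agent's state as a function of that history, S_t = f(H_t).
import random

def state_fn(history):
    """One possible f(H_t): keep only the most recent observation."""
    return history[-1]["observation"] if history else None

history = []                      # H_t: everything the agent has seen so far
observation, reward = 0, 0.0      # toy initial observation and reward

for t in range(5):
    state = state_fn(history + [{"observation": observation, "reward": reward}])
    # Placeholder policy that ignores `state`; a real agent would act on it.
    action = random.choice(["N", "E", "S", "W"])
    history.append({"observation": observation, "reward": reward, "action": action})
    # Toy environment response; a real environment is external to the agent.
    observation += 1 if action == "E" else -1
    reward = -1.0
```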
Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]
In other words, you can throw away all of the previous states and keep only the current one; the current state is a sufficient statistic of the future.
- The future is independent of the past given the present
That is, the state can stand in for the entire history, so the history itself can be discarded.
Example
- What we predict will happen next depends on how we choose to represent the state.
Fully Observable Environments
**Full observability:** agent directly observes the environment state
- Agent state = Environment state = Information state
- Formally, this is a Markov decision process (MDP)
Partially Observable Environments
**Partial observability:** agent indirectly observes the environment:
- A robot with camera vision isn't told its absolute location
- A trading agent only observes current prices
- A poker playing agent only observes public cards
Now agent state ≠ environment state.
Formally, this is a partially observable Markov decision process (POMDP)
Agent must construct its own state representation S_t^a, e.g.:
- Complete history: S_t^a = H_t
- Beliefs of environment state: S_t^a = (P[S_t^e = s^1], ..., P[S_t^e = s^n])
- Recurrent neural network: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
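A minimal sketch (assumed shapes and a made-up likelihood, not from the lecture) of two of these constructions, a belief-state update and a recurrent summary of the history:

```python
import numpy as np

n_states, obs_dim, state_dim = 4, 3, 8
rng = np.random.default_rng(0)

# 1) Belief state: a probability distribution over the hidden environment state,
#    reweighted here by an assumed observation likelihood and renormalised.
belief = np.full(n_states, 1.0 / n_states)
likelihood = rng.random(n_states)            # P(observation | hidden state), assumed
belief = belief * likelihood
belief /= belief.sum()

# 2) Recurrent state: S_t = sigma(S_{t-1} W_s + O_t W_o), a learned summary of
#    the history rather than an explicit distribution (sigma = tanh here).
W_s = rng.normal(size=(state_dim, state_dim))
W_o = rng.normal(size=(obs_dim, state_dim))
s_prev = np.zeros(state_dim)
o_t = rng.normal(size=obs_dim)
s_t = np.tanh(s_prev @ W_s + o_t @ W_o)
```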
Inside An RL Agent
- An RL agent may include one or more of these components:
- Policy: agent's behaviour function.
- Value function: how good is each state and/or action
- Model: agent's representation of the environment (the model is the agent's picture of how the environment behaves)
Policy
- A policy is the agent's behaviour
- It is a map from state to action, e.g.
- Deterministic policy: a = π(s)
  (the current state s is mapped through the function π to the action a that will be taken)
- Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
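A minimal sketch (toy states and probabilities, not from the lecture) of both kinds of policy:

```python
import random

# Deterministic policy: a = pi(s), here a simple lookup table.
deterministic_pi = {"s0": "E", "s1": "N", "s2": "W"}
action = deterministic_pi["s0"]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], a distribution per state.
stochastic_pi = {
    "s0": {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
    "s1": {"N": 0.25, "E": 0.25, "S": 0.25, "W": 0.25},
}
probs = stochastic_pi["s0"]
action = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```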
Value Function
- Value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g. v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s]
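A minimal sketch (made-up reward sequences, not from the lecture) of estimating a state's value as the average discounted return observed from that state:

```python
gamma = 0.9   # assumed discount factor

def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Returns from three hypothetical episodes that all started in the same state.
sampled_rewards = [[-1, -1, +10], [-1, -1, -1, +10], [-1, +10]]
returns = [discounted_return(r, gamma) for r in sampled_rewards]
value_estimate = sum(returns) / len(returns)   # crude Monte Carlo estimate of v(s)
```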
Model
- A model predicts what the environment will do next. (The model is not the environment itself, but it matters for predicting how the environment changes; the agent learns a model of the environment's dynamics and can then use it to plan, so the model is very useful for choosing the next action.)
- Transitions: P predicts the next state (i.e. the dynamics), e.g. P^a_ss' = P[S_{t+1} = s' | S_t = s, A_t = a]
- Rewards: R predicts the next (immediate) reward, e.g. R^a_s = E[R_{t+1} | S_t = s, A_t = a]
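A minimal sketch (toy transition and reward tables, not from the lecture) of what such a model could look like, and how it supports one-step lookahead:

```python
transitions = {                  # P: (s, a) -> {s': probability of moving to s'}
    ("s0", "E"): {"s1": 0.9, "s0": 0.1},
    ("s1", "E"): {"s2": 1.0},
}
rewards = {                      # R: (s, a) -> expected immediate reward
    ("s0", "E"): -1.0,
    ("s1", "E"): -1.0,
}

def one_step_lookahead(state, action, value):
    """R(s,a) + sum over s' of P(s'|s,a) * value(s'), using only the model."""
    return rewards[(state, action)] + sum(
        p * value.get(s_next, 0.0)
        for s_next, p in transitions[(state, action)].items()
    )

print(one_step_lookahead("s0", "E", {"s1": 5.0, "s2": 10.0}))   # -1 + 0.9*5 = 3.5
```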
Maze Example
- Reward: -1 per time step
- Actions: N, E, S, W
- States: Agent's location
Policy
Mapping from state to action: each state (each grid cell the agent can occupy) has an arrow showing which direction the agent will move next from that state.
Value Function
...
Model
An immediate reward of -1 is received at every step.
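A minimal sketch (a made-up 3x3 grid, not the lecture's maze) showing how the value of each location could be computed under the reward of -1 per time step:

```python
# Value iteration on a tiny deterministic gridworld with reward -1 per step.
N, E, S, W = (-1, 0), (0, 1), (1, 0), (0, -1)
actions = [N, E, S, W]
rows, cols, goal = 3, 3, (2, 2)

values = {(r, c): 0.0 for r in range(rows) for c in range(cols)}

def step(state, action):
    r, c = state[0] + action[0], state[1] + action[1]
    # Bumping into the boundary leaves the agent where it is.
    return (r, c) if 0 <= r < rows and 0 <= c < cols else state

for _ in range(100):             # sweep until the values stop changing
    for s in values:
        if s == goal:
            continue
        values[s] = max(-1.0 + values[step(s, a)] for a in actions)

print(values[(0, 0)])   # -4.0: four steps, each costing -1, to reach the goal
```

The policy's arrow in each cell then simply points toward the neighbouring cell with the highest value.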
Categorizing RL agents (1)
- Value Based
    - No Policy (implicit)
    - Value Function
- Policy Based
    - Policy
    - No Value Function
- Actor Critic
    - Policy
    - Value Function
Categorizing RL agents (2)
Model-free means we do not try to understand the environment and do not build a model of its dynamics; working directly with a policy and/or a value function is enough to know how to act so as to collect the most reward, without knowing how the environment's state changes.
- Model Free
    - Policy and/or Value Function
    - No Model
By contrast, we can build model-based RL agents: the first step is to build a model of how the environment works (for example, a dynamics model of a helicopter); with that model we can predict what will happen next and search for the best way to act.
- Model Based
    - Policy and/or Value Function
    - Model
RL Agent Taxonomy
....
Problems within Reinforcement Learning
Two fundamental problems in sequential decision making
- Reinforcement learning
    - The environment is initially unknown
    - The agent interacts with the environment
    - The agent improves its policy
- Planning
    - A model of the environment is known
    - The agent performs computations with its model (without any external interaction)
    - The agent improves its policy
Exploration and Exploitation(1)
Exploration vs. exploitation: reinforcement learning is a trial-and-error style of learning.
Exploration and Exploitation(2)
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward.
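A minimal sketch (made-up value estimates, not from the lecture) of epsilon-greedy action selection, one simple way to balance the two:

```python
import random

# Assumed current estimates of how good each choice is (e.g. restaurants).
q_estimates = {"favourite_restaurant": 4.2, "new_restaurant": 3.8, "untried_restaurant": 0.0}
epsilon = 0.1   # assumed exploration rate

def epsilon_greedy(q, eps):
    if random.random() < eps:
        return random.choice(list(q))    # explore: pick any option at random
    return max(q, key=q.get)             # exploit: pick the best-known option

action = epsilon_greedy(q_estimates, epsilon)
```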
Prediction and Control
- Prediction: evaluate the future
    - Given a policy
- Control: optimise the future
    - Find the best policy
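Stated a bit more formally (a sketch in standard notation, assuming a discounted setting): prediction evaluates a fixed policy, while control searches over policies for the best one.

```latex
% Prediction: evaluate a given policy \pi
v_\pi(s) = \mathbb{E}_\pi\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]

% Control: find the optimal value function and an optimal policy
v_*(s) = \max_\pi v_\pi(s), \qquad \pi_* = \arg\max_\pi v_\pi
```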