RL【1】: Basic Concepts
Series Contents
Table of Contents
- Series Contents
- Preface
- Fundamental concepts in Reinforcement Learning
- Markov decision process (MDP)
- Summary
Preface
This series records my study notes for Prof. Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" (强化学习的数学原理) on Bilibili. For the full course content, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Fundamental concepts in Reinforcement Learning
- State: The status of the agent with respect to the environment.
- State space: the set of all states, $S = \{ s_i \}_{i=1}^{N}$.
- Action: For each state, an action is a possible move or operation that the agent can take.
- Action space of a state: the set of all possible actions of a state, $A(s_i) = \{ a_i \}_{i=1}^{N}$.
- State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
- Policy $\pi$: a policy tells the agent what actions to take at a state.
- Reward: a real number we get after taking an action.
  - A positive reward encourages the agent to take such actions.
  - A negative reward punishes (discourages) such actions.
  - Reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect.
  - The reward depends on the current state and action, but not on the next state.
- Trajectory: A trajectory is a state-action-reward chain.
- Return: the return of a trajectory is the sum of all the rewards collected along that trajectory.
- Discount rate: the discount rate is a scalar $\gamma \in [0, 1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared to immediate rewards.
  - A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing immediate rewards.
  - A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
- Discounted return $G_t$: the discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$ (a small computational sketch is given after this list).
  - $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
  - It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains.
- Episode: when interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
  - An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
  - Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
  - In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
    - Option 1: treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $r = 0$.
    - Option 2: treat the target state as a normal state with a policy. The agent can still leave the target state and gains a positive reward (e.g., $r = +1$) each time it enters the target state.
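To make the discounted return concrete, here is a minimal Python sketch (my own illustration, not code from the course; the function name `discounted_return` and the reward values are made up). It evaluates $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ for a finite reward list; under Option 1, every reward after the absorbing state is zero, so this finite sum equals the infinite discounted sum.

```python
# Minimal sketch: computing the discounted return G_t of a reward sequence.
# `discounted_return` and the sample rewards are illustrative, not from the course code.

def discounted_return(rewards, gamma=0.9):
    """rewards[k] holds R_{t+k+1}; returns G_t = sum_k gamma^k * rewards[k]."""
    g = 0.0
    # Accumulate backwards so that at every step g satisfies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 1, 1]  # hypothetical rewards collected along one trajectory
print(discounted_return(rewards, gamma=0.9))  # 0.9**2 + 0.9**3 = 1.539
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward R_{t+1} = 0.0

# Under Option 2, the agent keeps re-entering the target and collects r = +1 forever;
# the discounted return is then the geometric series 1 + gamma + gamma**2 + ... = 1 / (1 - gamma),
# which stays finite precisely because gamma < 1.
print(discounted_return([1.0] * 200, gamma=0.9))  # approximately 1 / (1 - 0.9) = 10
```

A smaller $\gamma$ shrinks the contribution of later rewards, which is exactly the short-sighted behaviour described above.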
Markov decision process (MDP)
- Sets:
  - State: the set of states $\mathcal{S}$
  - Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
  - Reward: the set of rewards $\mathcal{R}(s, a)$
- Probability distribution:
  - State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
  - Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
- Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
- Markov property: memoryless property
  - $p(s_{t+1} | a_t, s_t, \dots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
  - $p(r_{t+1} | a_t, s_t, \dots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
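To tie these definitions together, below is a toy tabular sketch (my own illustration, not the course's code; the states `s1`-`s3`, the dictionaries `P`, `R`, `PI`, and the helpers `sample`/`step` are all made up). It stores $p(s' | s, a)$, $p(r | s, a)$, and $\pi(a | s)$ as explicit dictionaries and samples a short state-action-reward chain; each step looks only at the current state and action, which is exactly the Markov property.

```python
import random

# Toy tabular MDP; all states, actions, and probabilities below are made up.
# State transition probability p(s' | s, a): keyed by (state, action) -> {next_state: probability}
P = {
    ("s1", "right"): {"s2": 1.0},
    ("s2", "right"): {"s3": 0.9, "s2": 0.1},
    ("s3", "stay"):  {"s3": 1.0},   # target treated as an absorbing state (Option 1)
}
# Reward probability p(r | s, a): keyed by (state, action) -> {reward: probability}
R = {
    ("s1", "right"): {0.0: 1.0},
    ("s2", "right"): {1.0: 0.9, 0.0: 0.1},
    ("s3", "stay"):  {0.0: 1.0},
}
# Policy pi(a | s): keyed by state -> {action: probability}
PI = {
    "s1": {"right": 1.0},
    "s2": {"right": 1.0},
    "s3": {"stay": 1.0},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def step(s):
    """One interaction step: choose a ~ pi(.|s), then sample s' and r.
    Both distributions depend only on the current (s, a) -- the Markov property."""
    a = sample(PI[s])
    s_next = sample(P[(s, a)])
    r = sample(R[(s, a)])
    return a, r, s_next

# Roll out a short state-action-reward chain (a trajectory) starting from s1.
s, trajectory = "s1", []
for _ in range(5):
    a, r, s_next = step(s)
    trajectory.append((s, a, r))
    s = s_next
print(trajectory)
```

Storing the model as explicit dictionaries only scales to tiny problems, but it mirrors the tabular definitions above and makes the roles of the three distributions easy to see.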
Summary
This first lecture systematically introduces the common concepts in RL with accessible, intuitive explanations, and then uses the Markov decision process to restate each concept in precise mathematical language, laying the groundwork for the lectures that follow.