RL【1】: Basic Concepts
Series Contents
Table of Contents
- Series Contents
- Preface
- Fundamental concepts in Reinforcement Learning
- Markov decision process (MDP)
- Summary
Preface
This series records my study notes for Prof. Shiyu Zhao's course "Mathematical Foundations of Reinforcement Learning" (强化学习的数学原理) on Bilibili. For the full course content, see:
Bilibili video: 【强化学习的数学原理】课程:从零开始到透彻理解(完结)
GitHub course materials: Book-Mathematical-Foundation-of-Reinforcement-Learning
Fundamental concepts in Reinforcement Learning
- State: The status of the agent with respect to the environment.
- State space: the set of all states, $S = \{ s_i \}_{i=1}^{N}$.
- Action: For each state, an action is a possible move or operation that the agent can take.
- Action space of a state: the set of all possible actions of a state, $A(s_i) = \{ a_i \}_{i=1}^{N}$.
- State transition: When taking an action, the agent may move from one state to another. Such a process is called state transition. State transition defines the interaction with the environment.
- Policy $\pi$: a policy tells the agent what actions to take at a state.
- Reward: a real number we get after taking an action.
  - A positive reward encourages the agent to take such actions.
  - A negative reward punishes (discourages) such actions.
  - Reward can be interpreted as a human-machine interface, with which we can guide the agent to behave as we expect.
  - The reward depends on the current state and action, but not on the next state.
- Trajectory: A trajectory is a state-action-reward chain.
- Return: the return of a trajectory is the sum of all the rewards collected along that trajectory.
- Discount rate: the discount rate is a scalar $\gamma \in [0, 1)$ that determines the present value of future rewards. It specifies how much importance the agent assigns to rewards received in the future compared to immediate rewards.
  - A smaller value of $\gamma$ makes the agent more short-sighted, emphasizing immediate rewards.
  - A value closer to 1 encourages long-term planning by valuing distant rewards nearly as much as immediate ones.
- Discounted return $G_t$: the discounted return is the cumulative reward the agent aims to maximize, defined as the weighted sum of future rewards starting from time step $t$ (a small computational sketch is given after this list).
  - $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
  - It captures both immediate and future rewards while incorporating the discount rate $\gamma$, thereby balancing short-term and long-term gains.
- Episode: when interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
  - An episode is usually assumed to be a finite trajectory. Tasks with episodes are called episodic tasks.
  - Some tasks may have no terminal states, meaning the interaction with the environment will never end. Such tasks are called continuing tasks.
  - In fact, we can treat episodic and continuing tasks in a unified mathematical way by converting episodic tasks to continuing tasks.
    - Option 1: treat the target state as a special absorbing state. Once the agent reaches an absorbing state, it never leaves, and all subsequent rewards are $r = 0$.
    - Option 2: treat the target state as a normal state with a policy. The agent can still leave the target state and gains a positive reward (e.g., $r = +1$) each time it enters the target state.
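To make the discounted return concrete, here is a minimal Python sketch (my own illustration, not code from the course; the function name `discounted_return` and the reward values are made up). It evaluates $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$ for a finite reward list; under Option 1, every reward after the absorbing state is zero, so this finite sum equals the infinite discounted sum.

```python
# Minimal sketch: computing the discounted return G_t of a reward sequence.
# `discounted_return` and the sample rewards are illustrative, not from the course code.

def discounted_return(rewards, gamma=0.9):
    """rewards[k] holds R_{t+k+1}; returns G_t = sum_k gamma^k * rewards[k]."""
    g = 0.0
    # Accumulate backwards so that at every step g satisfies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [0, 0, 1, 1]  # hypothetical rewards collected along one trajectory
print(discounted_return(rewards, gamma=0.9))  # 0.9**2 + 0.9**3 = 1.539
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward R_{t+1} = 0.0

# Under Option 2, the agent keeps re-entering the target and collects r = +1 forever;
# the discounted return is then the geometric series 1 + gamma + gamma**2 + ... = 1 / (1 - gamma),
# which stays finite precisely because gamma < 1.
print(discounted_return([1.0] * 200, gamma=0.9))  # approximately 1 / (1 - 0.9) = 10
```

A smaller $\gamma$ shrinks the contribution of later rewards, which is exactly the short-sighted behaviour described above.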
Markov decision process (MDP)
- Sets:
  - State: the set of states $\mathcal{S}$
  - Action: the set of actions $\mathcal{A}(s)$ associated with state $s \in \mathcal{S}$
  - Reward: the set of rewards $\mathcal{R}(s, a)$
- Probability distribution:
  - State transition probability: at state $s$, taking action $a$, the probability of transitioning to state $s'$ is $p(s' | s, a)$
  - Reward probability: at state $s$, taking action $a$, the probability of receiving reward $r$ is $p(r | s, a)$
- Policy: at state $s$, the probability of choosing action $a$ is $\pi(a | s)$
- Markov property: memoryless property
  - $p(s_{t+1} | a_t, s_t, \dots, a_0, s_0) = p(s_{t+1} | a_t, s_t)$
  - $p(r_{t+1} | a_t, s_t, \dots, a_0, s_0) = p(r_{t+1} | a_t, s_t)$
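To tie these definitions together, below is a toy tabular sketch (my own illustration, not the course's code; the states `s1`-`s3`, the dictionaries `P`, `R`, `PI`, and the helpers `sample`/`step` are all made up). It stores $p(s' | s, a)$, $p(r | s, a)$, and $\pi(a | s)$ as explicit dictionaries and samples a short state-action-reward chain; each step looks only at the current state and action, which is exactly the Markov property.

```python
import random

# Toy tabular MDP; all states, actions, and probabilities below are made up.
# State transition probability p(s' | s, a): keyed by (state, action) -> {next_state: probability}
P = {
    ("s1", "right"): {"s2": 1.0},
    ("s2", "right"): {"s3": 0.9, "s2": 0.1},
    ("s3", "stay"):  {"s3": 1.0},   # target treated as an absorbing state (Option 1)
}
# Reward probability p(r | s, a): keyed by (state, action) -> {reward: probability}
R = {
    ("s1", "right"): {0.0: 1.0},
    ("s2", "right"): {1.0: 0.9, 0.0: 0.1},
    ("s3", "stay"):  {0.0: 1.0},
}
# Policy pi(a | s): keyed by state -> {action: probability}
PI = {
    "s1": {"right": 1.0},
    "s2": {"right": 1.0},
    "s3": {"stay": 1.0},
}

def sample(dist):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def step(s):
    """One interaction step: choose a ~ pi(.|s), then sample s' and r.
    Both distributions depend only on the current (s, a) -- the Markov property."""
    a = sample(PI[s])
    s_next = sample(P[(s, a)])
    r = sample(R[(s, a)])
    return a, r, s_next

# Roll out a short state-action-reward chain (a trajectory) starting from s1.
s, trajectory = "s1", []
for _ in range(5):
    a, r, s_next = step(s)
    trajectory.append((s, a, r))
    s = s_next
print(trajectory)
```

Storing the model as explicit dictionaries only scales to tiny problems, but it mirrors the tabular definitions above and makes the roles of the three distributions easy to see.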
Summary
This first lecture systematically introduces the common concepts in RL with accessible, intuitive explanations, and then uses the Markov decision process to restate each concept in precise mathematical language, laying the groundwork for the lectures that follow.