Deriving Bellman's Equation in Reinforcement Learning

32

In "Reinforcement Learning: An Introduction" I see the following equation, but I do not follow the step highlighted in blue below. How exactly is this step derived?

expected-value reinforcement-learning

7

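This is the answer for everybody who wonders about the clean, structured math behind it (i.e. if you belong to the group of people who know what a random variable is, and that you must show or assume that a random variable has a density, then this is the answer for you ;-)):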

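First of all, we need the Markov decision process to have only a finite number of $L^1$-rewards, i.e. there must exist a finite set $E$ of densities, each belonging to an $L^1$ variable, i.e. $\int_{\mathbb{R}} |x| \cdot e(x) dx < \infty$ for all $e \in E$, and a map $F : A \times S \to E$ such that

$p(r_t|a_t, s_t) = F(a_t, s_t)(r_t)$

(i.e. in the automaton behind the MDP there may be infinitely many states, but only finitely many $L^1$-reward distributions are attached to the possibly infinitely many transitions between the states).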

Theorem 1: Let $X \in L^1(\Omega)$ (i.e. an integrable real random variable) and let $Y$ be another random variable such that $X,Y$ have a common density. Then

$E[X|Y=y] = \int_\mathbb{R} x p(x|y) dx$

Proof: Essentially proven here by Stefan Hansen.

Theorem 2: Let $X \in L^1(\Omega)$ and let $Y,Z$ be further random variables such that $X,Y,Z$ have a common density. Then

$E[X|Y=y] = \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz$

where $\mathcal{Z}$ is the range of $Z$.

Proof:

$\begin{align*} E[X|Y=y] &= \int_{\mathbb{R}} x p(x|y) dx \\ &~~~~\text{(by Thm. 1)}\\ &= \int_{\mathbb{R}} x \frac{p(x,y)}{p(y)} dx \\ &= \int_{\mathbb{R}} x \frac{\int_{\mathcal{Z}} p(x,y,z) dz}{p(y)} dx \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x \frac{ p(x,y,z) }{p(y)} dx dz \\ &= \int_{\mathcal{Z}} \int_{\mathbb{R}} x p(x|y,z)p(z|y) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) \int_{\mathbb{R}} x p(x|y,z) dx dz \\ &= \int_{\mathcal{Z}} p(z|y) E[X|Y=y,Z=z] dz \\ &~~~~\text{(by Thm. 1)} \end{align*}$
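As a sanity check of Theorem 2, here is a minimal sketch in a discrete setting, where sums replace the integrals and a joint probability table plays the role of the common density. The table `p`, the value list `x_vals` and the helper names are hypothetical and chosen only for illustration.

```python
import itertools
import random

random.seed(0)

x_vals = [-1.0, 0.5, 2.0]   # values taken by X (illustrative)
n_y, n_z = 2, 4             # number of values of Y and Z (illustrative)

# Hypothetical joint pmf p[(i, y, z)] for X = x_vals[i], Y = y, Z = z.
weights = {k: random.random()
           for k in itertools.product(range(3), range(n_y), range(n_z))}
total = sum(weights.values())
p = {k: w / total for k, w in weights.items()}


def cond_exp_x(y, z=None):
    """E[X | Y=y] if z is None, otherwise E[X | Y=y, Z=z]."""
    zs = range(n_z) if z is None else [z]
    mass = sum(p[i, y, zz] for i in range(3) for zz in zs)
    return sum(x_vals[i] * p[i, y, zz] for i in range(3) for zz in zs) / mass


def p_z_given_y(z, y):
    """p(z | y) = p(y, z) / p(y)."""
    p_yz = sum(p[i, y, z] for i in range(3))
    p_y = sum(p[i, y, zz] for i in range(3) for zz in range(n_z))
    return p_yz / p_y


y = 1
lhs = cond_exp_x(y)                                                   # E[X | Y=y]
rhs = sum(p_z_given_y(z, y) * cond_exp_x(y, z) for z in range(n_z))   # Thm. 2
print(lhs, rhs)
```

The two printed numbers should agree up to floating-point rounding. This only illustrates the identity in the finite case; it says nothing about the existence of densities in the general case.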

Put $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k}$ and $G_t^{(K)} = \sum_{k=0}^K \gamma^k R_{t+k}$. Using the fact that the MDP has only finitely many $L^1$-rewards, one can show that $G_t^{(K)}$ converges, and since the function $\sum_{k=0}^\infty \gamma^k |R_{t+k}|$ is still in $L^1(\Omega)$ (i.e. integrable), one can also show (by the usual combination of the monotone convergence theorem followed by dominated convergence, applied to the defining equations for [the factorizations of] the conditional expectation) that

$\lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[G_t | S_t=s_t]$

Now one shows that

$E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1}^{(K-1)} | S_{t+1}=s_{t+1}] ds_{t+1}$

using $G_t^{(K)} = R_t + \gamma G_{t+1}^{(K-1)}$, Thm. 2 above, then Thm. 1 on $E[G_{t+1}^{(K-1)}|S_{t+1}=s', S_t=s_t]$, and then a straightforward marginalization argument showing that $p(r_q|s_{t+1}, s_t) = p(r_q|s_{t+1})$ for all $q \geq t+1$. Now we need to apply the limit $K \to \infty$ to both sides of the equation. In order to pull the limit into the integral over the state space $S$, we need some additional assumptions:

Either the state space is finite (then $\int_S = \sum_S$ and the sum is finite), or all the rewards are positive (then we use monotone convergence), or all the rewards are negative (then we put a minus sign in front of the equation and use monotone convergence again), or all the rewards are bounded (then we use dominated convergence). Then, by applying $\lim_{K \to \infty}$ to both sides of the partial / finite Bellman equation above, we obtain

$E[G_t | S_t=s_t] = \lim_{K \to \infty} E[G_t^{(K)} | S_t=s_t] = E[R_{t} | S_t=s_t] + \gamma \int_S p(s_{t+1}|s_t) E[G_{t+1} | S_{t+1}=s_{t+1}] ds_{t+1}$

and the rest is the usual density manipulation.
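The limiting argument can be illustrated numerically in the simplest of the cases listed above: a finite state space with bounded rewards. The sketch below uses the convention of this answer (the reward $R_t$ belongs to time $t$) and a hypothetical 3-state chain; the matrix `P` and the vector `r` are made up for illustration.

```python
gamma = 0.9

# Hypothetical 3-state chain: P[s][s2] = p(s2 | s), r[s] = E[R_t | S_t = s].
P = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.0, 0.2, 0.8]]
r = [1.0, -2.0, 0.5]


def truncated_values(K):
    """Return [E[G_t^(K) | S_t = s] for each s] via the finite Bellman recursion."""
    v = list(r)                      # K = 0: G_t^(0) = R_t
    for _ in range(K):
        v = [r[s] + gamma * sum(P[s][s2] * v[s2] for s2 in range(3))
             for s in range(3)]
    return v


v_200 = truncated_values(200)
v_201 = truncated_values(201)

# Sup-norm gap between consecutive truncations. Because v_201 is one backup of
# v_200, this gap is also the residual of the limiting equation at v_200.
gap = max(abs(a - b) for a, b in zip(v_200, v_201))
print(v_200, gap)
```

Here the gap shrinks like $\gamma^K$, which is the finite-state, bounded-reward version of pulling the limit inside the sum.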

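REMARK: Even in very simple tasks the state space can be infinite! One example is the 'pole balancing' task. The state is essentially the angle of the pole (a value in $[0, 2\pi)$, an uncountably infinite set!).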

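REMARK: People might comment 'this proof can be shortened a lot if you just use the density of $G_t$ directly and show that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$'... BUT... my questions would be:

How do you even know that $G_{t+1}$ has a density?
How do you even know that $G_{t+1}$ has a common density together with $S_{t+1}, S_t$?
How do you infer that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? This is not only the Markov property: the Markov property only tells you something about the marginal distributions, but these do not necessarily determine the whole distribution (see e.g. multivariate Gaussians)!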

— Fabian Werner

10

Let the total sum of discounted rewards after time $t$ be:
$G_t = R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...$

The utility value of starting in state $s$ at time $t$ is the expected sum of discounted rewards $R$ obtained by executing policy $\pi$ from state $s$ onwards.
$U_\pi(S_t=s) = E_\pi[G_t|S_t = s]$
$= E_\pi[(R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+...)|S_t = s]$ (by definition of $G_t$)
$= E_\pi[(R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3}+...))|S_t = s]$
$= E_\pi[(R_{t+1}+\gamma (G_{t+1}))|S_t = s]$
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[ G_{t+1}|S_t = s]$ (by linearity of expectation)
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[E_\pi(G_{t+1}|S_{t+1} = s')|S_t = s]$ (by the law of total expectation)
$= E_\pi[R_{t+1}|S_t = s]+\gamma E_\pi[U_\pi(S_{t+1}= s')|S_t = s]$ (by definition of $U_\pi$)
$= E_\pi[R_{t+1} + \gamma U_\pi(S_{t+1}= s')|S_t = s]$ (by linearity of expectation)

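Assuming that the process satisfies the Markov property:
The probability $Pr$ of ending up in state $s'$ having started from state $s$ and taken action $a$ is
$Pr(s'|s,a) = Pr(S_{t+1} = s' | S_t=s,A_t = a)$
and the reward $R$ of ending up in state $s'$ having started from state $s$ and taken action $a$ is
$R(s,a,s') = E[R_{t+1}|S_t = s, A_t = a, S_{t+1}= s']$

Therefore we can rewrite the utility equation above as
$U_\pi(S_t=s) = \sum_a \pi(a|s) \sum_{s'} Pr(s'|s,a)[R(s,a,s')+ \gamma U_\pi(S_{t+1}=s')]$

where $\pi(a|s)$ is the probability of taking action $a$ when in state $s$ under a stochastic policy. For a deterministic policy, $\sum_a \pi(a|s)= 1$.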
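The final utility equation can be checked numerically on a toy model. The following is a minimal sketch, assuming a made-up MDP with two states and two actions; the tables `pi`, `Pr` and `R` mirror the symbols $\pi(a|s)$, $Pr(s'|s,a)$ and $R(s,a,s')$ but their numbers are purely illustrative. It iterates the right-hand side until it stops changing and then reports how well the fixed point satisfies the equation.

```python
gamma = 0.8
S, A = range(2), range(2)

# Hypothetical model: pi[s][a], Pr[s][a][s2], R[s][a][s2].
pi = [[0.3, 0.7],
      [1.0, 0.0]]                     # state 1 uses a deterministic choice
Pr = [[[0.9, 0.1], [0.2, 0.8]],
      [[0.5, 0.5], [0.0, 1.0]]]
R = [[[1.0, 0.0], [2.0, -1.0]],
     [[0.0, 3.0], [1.0, 1.0]]]


def backup(U):
    """Right-hand side of the utility equation, evaluated for every state."""
    return [sum(pi[s][a] * sum(Pr[s][a][s2] * (R[s][a][s2] + gamma * U[s2])
                               for s2 in S)
                for a in A)
            for s in S]


U = [0.0, 0.0]
for _ in range(500):                  # repeated backups converge since gamma < 1
    U = backup(U)

residual = max(abs(u - b) for u, b in zip(U, backup(U)))
print(U, residual)
```

Note that each row of `pi` sums to 1 whether it is stochastic or deterministic; the deterministic row simply puts all of its weight on one action.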

— Ntabgoba

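Just a few notes: the sum over $\pi$ is equal to 1 even for a stochastic policy, but for a deterministic policy there is just one action that receives the full weight (i.e. $\pi(a|s) = 1$) and the rest receive weight 0, so those terms drop out of the equation. Also, in the line where you used the law of total expectation, the order of the conditionals is reversed.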

— Gilad Peleg

1

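I am pretty sure that this answer is incorrect. Let us follow the equations just until the line involving the law of total expectation. There the left-hand side does not depend on $s'$ while the right-hand side does... i.e. if the equations are correct, then for which $s'$ are they correct? You must have some kind of integral over $s'$ already at that stage. The reason is probably a misunderstanding of the difference between $E[X|Y]$ (a random variable) and its factorization $E[X|Y=y]$ (a deterministic function!).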

— Fabian Werner

@FabianWerner I agree that this is not correct. The answer from Jie Shi is the right answer.

— teucer

@teucer This answer can be fixed because there is just some missing "symmetrization", i.e. $E[A|C=c] = \int_{\text{range}(B)} p(b|c) E[A|B=b, C=c] dP_B(b)$. But still, the question is the same as for Jie Shi's answer: why is $E[G_{t+1}|S_{t+1}=s_{t+1}, S_t=s_t] = E[G_{t+1}|S_{t+1}=s_{t+1}]$? This is not only the Markov property, because $G_{t+1}$ is a really complicated RV: does it even converge? If so, where? What is the common density $p(g_{t+1}, s_{t+1}, s_t)$? We know this expression only for finite sums (a complicated convolution), but what about the infinite case?

— Fabian Werner

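@FabianWerner Not sure if I can answer all the questions. Some pointers below. For the convergence of $G_{t+1}$: given that it is a sum of discounted rewards, it is reasonable to assume that the series converges (the discount factor is $<1$, and where it converges to does not really matter). I am not concerned with the density (one can always define a joint density as long as we have random variables); it only matters that it is well defined, and in this case it is.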

— teucer

8

Here is my proof. It is based on the manipulation of conditional distributions, which makes it easier to follow. I hope it helps you.

$\begin{align} v_{\pi}(s)&=E{\left[G_t|S_t=s\right]} \nonumber \\ &=E{\left[R_{t+1}+\gamma G_{t+1}|S_t=s\right]} \nonumber \\ &= \sum_{s'}\sum_{r}\sum_{g_{t+1}}\sum_{a}p(s',r,g_{t+1}, a|s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r,g_{t+1} |a, s)(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}\sum_{g_{t+1}}p(s',r|a, s)p(g_{t+1}|s', r, a, s)(r+\gamma g_{t+1}) \nonumber \\ &\text{Note that $p(g_{t+1}|s', r, a, s)=p(g_{t+1}|s')$ by assumption of MDP} \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\sum_{g_{t+1}}p(g_{t+1}|s')(r+\gamma g_{t+1}) \nonumber \\ &= \sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)(r+\gamma\sum_{g_{t+1}}p(g_{t+1}|s')g_{t+1}) \nonumber \\ &=\sum_{a}p(a|s)\sum_{s'}\sum_{r}p(s',r|a, s)\left(r+\gamma v_{\pi}(s')\right) \label{eq2} \end{align}$

This is the famous Bellman equation.
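The last line is written with the four-argument dynamics $p(s',r|a,s)$, whereas the earlier answer uses a transition probability $Pr(s'|s,a)$ and an expected reward $R(s,a,s')$. The sketch below shows, on a hypothetical table `p`, how the second form follows from the first by marginalizing over $r$, and that both give the same one-step backup. All numbers and helper names are assumptions made for illustration.

```python
# Hypothetical dynamics: p[(s, a)] maps (s_next, r) -> probability.
p = {
    (0, 0): {(0, 1.0): 0.6, (1, 1.0): 0.1, (1, 5.0): 0.3},
    (0, 1): {(1, 0.0): 1.0},
    (1, 0): {(0, -1.0): 0.5, (1, 2.0): 0.5},
    (1, 1): {(0, 0.0): 0.25, (0, 4.0): 0.75},
}
states = (0, 1)


def transition_prob(s, a, s_next):
    """Pr(s' | s, a) = sum_r p(s', r | s, a)."""
    return sum(q for (s2, _), q in p[s, a].items() if s2 == s_next)


def expected_reward(s, a, s_next):
    """R(s, a, s') = sum_r r * p(s', r | s, a) / Pr(s' | s, a)."""
    num = sum(r * q for (s2, r), q in p[s, a].items() if s2 == s_next)
    return num / transition_prob(s, a, s_next)


def backup_four_arg(s, a, v, gamma):
    """sum_{s', r} p(s', r | s, a) (r + gamma v(s'))."""
    return sum(q * (r + gamma * v[s2]) for (s2, r), q in p[s, a].items())


def backup_three_arg(s, a, v, gamma):
    """sum_{s'} Pr(s' | s, a) (R(s, a, s') + gamma v(s')), skipping unreachable s'."""
    return sum(transition_prob(s, a, s2) * (expected_reward(s, a, s2) + gamma * v[s2])
               for s2 in states if transition_prob(s, a, s2) > 0)


v = [1.5, -0.5]   # any candidate value function works for this comparison
diffs = [abs(backup_four_arg(s, a, v, 0.9) - backup_three_arg(s, a, v, 0.9))
         for (s, a) in p]
print(max(diffs))
```

The agreement holds for any `v`, since it only relies on linearity of the sum over $r$.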

— Jie Shi

Do you mind explaining this comment 'Note that ...' a little more? Why do these random variables $G_{t+1}$ and the state and action variables even have a common density? If so, how do you know the property that you are using? I can see that it is true for a finite sum, but if the random variable is a limit... ???

— Fabian Werner

To Fabian: First let's recall what $G_{t+1}$ is. $G_{t+1}=R_{t+2}+R_{t+3}+\cdots$. Note that $R_{t+2}$ only directly depends on $S_{t+1}$ and $A_{t+1}$, since $p(s', r|s, a)$ captures all the transition information of an MDP (more precisely, $R_{t+2}$ is independent of all states, actions, and rewards before time $t+1$ given $S_{t+1}$ and $A_{t+1}$). Similarly, $R_{t+3}$ only depends on $S_{t+2}$ and $A_{t+2}$. As a result, $G_{t+1}$ is independent of $S_t$, $A_t$, and $R_t$ given $S_{t+1}$, which explains that line.

— Jie Shi

Sorry, that only 'motivates' it, it doesn't actually explain anything. For example: What is the density of $G_{t+1}$? Why are you sure that $p(g_{t+1}|s_{t+1}, s_t) = p(g_{t+1}|s_{t+1})$? Why do these random variables even have a common density? You know that a sum transforms into a convolution in densities, so what... $G_{t+1}$ should have an infinite number of integrals in the density??? There is absolutely no candidate for the density!

— Fabian Werner

To Fabian: I do not get your question. 1. You want the exact form of the marginal distribution $p(g_{t+1})$? I do not know it and we do not need it in this proof. 2. Why is $p(g_{t+1}|s_{t+1}, s_t)=p(g_{t+1}|s_{t+1})$? Because, as I mentioned earlier, $g_{t+1}$ and $s_t$ are independent given $s_{t+1}$. 3. What do you mean by "common density"? You mean joint distribution? You want to know why these random variables have a joint distribution? All random variables in this universe can have a joint distribution. If this is your question, I would suggest you find a probability theory book and read it.

— Jie Shi

Let us move this discussion to chat: chat.stackexchange.com/rooms/88952/bellman-equation

— Fabian Werner

2

What's with the following approach?

$\begin{align} v_\pi(s) & = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \sum_a \pi(a \mid s) \sum_{s'} \sum_r p(s', r \mid s, a) \cdot \,\\ & \qquad \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_{t} = s, A_{t} = a, S_{t+1} = s', R_{t+1} = r\right] \\ & = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[r + \gamma v_\pi(s')\right]. \end{align}$

The sums are introduced in order to retrieve $a$ , $s'$ and $r$ from $s$ . After all, the possible actions and possible next states can be . With these extra conditions, the linearity of the expectation leads to the result almost directly.

I am not sure how rigorous my argument is mathematically, though. I am open for improvements.

— Mr Tsjolder

The last line only works because of the MDP property.

— teucer

2

This is just a comment/addition to the accepted answer.

I was confused at the line where law of total expectation is being applied. I don't think the main form of law of total expectation can help here. A variant of that is in fact needed here.

If $X,Y,Z$ are random variables and assuming all the expectation exists, then the following identity holds:

$E[X|Y] = E[E[X|Y,Z]|Y]$

In this case, $X= G_{t+1}$ , $Y = S_t$ and $Z = S_{t+1}$ . Then

$E[G_{t+1}|S_t=s] = E[E[G_{t+1}|S_t=s, S_{t+1}=s']|S_t=s]$ , which by the Markov property equals $E[E[G_{t+1}|S_{t+1}=s']|S_t=s]$

From there, one could follow the rest of the proof from the answer.
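One way to read this identity at the level of factorizations (functions of $s$ rather than random variables) is sketched below, under the assumption of a countable state space so that sums replace integrals. The weights $p(s' \mid s)$ are introduced here for illustration; they denote the policy-averaged transition probabilities. The point of the sketch is that the inner conditional expectation is averaged over $s'$, so no free $s'$ remains on the right-hand side.

```latex
\begin{align*}
E[G_{t+1} \mid S_t = s]
  &= \sum_{s'} p(s' \mid s)\, E[G_{t+1} \mid S_t = s,\ S_{t+1} = s'] \\
  &= \sum_{s'} p(s' \mid s)\, E[G_{t+1} \mid S_{t+1} = s']
     && \text{(Markov property)} \\
  &= \sum_{s'} p(s' \mid s)\, v_\pi(s').
\end{align*}
```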

— Mehdi Golari

1

Welcome to CV! Please use the answers only for answering the question. Once you have enough reputation (50), you can add comments.

— Frans Rodenburg

Thank you. Yes, since I could not comment due to not having enough reputation, I thought it might be useful to add the explanation to the answers. But I will keep that in mind.

— Mehdi Golari

I upvoted but still, this answer is missing details: even if $E[X|Y]$ satisfies this crazy relationship, nobody guarantees that this is true for the factorizations of the conditional expectations as well! I.e. as in the case of Ntabgoba's answer: the left-hand side does not depend on $s'$ while the right-hand side does. This equation cannot be correct!

— Fabian Werner

1

$\mathbb{E}_\pi(\cdot)$ usually denotes the expectation assuming the agent follows policy $\pi$ . In this case $\pi(a|s)$ seems non-deterministic, i.e. returns the probability that the agent takes action $a$ when in state $s$ .

It looks like $r$ , lower-case, is replacing $R_{t+1}$ , a random variable. The second expectation replaces the infinite sum, to reflect the assumption that we continue to follow $\pi$ for all future $t$ . $\sum_{s',r} r \cdot p(s',r|s,a)$ is then the expected immediate reward on the next time step; the second expectation, which becomes $v_\pi$ , is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$ .

Thus, the expectation accounts for the policy probability as well as the transition and reward functions, here expressed together as $p(s', r|s,a)$ .

— Sean Easter

Thanks. Yes, what you mentioned about $\pi(a|s)$ is correct (it's the probability of the agent taking action $a$ when in state $s$ ).

— Amelio Vazquez-Reina

What I don't follow is what terms exactly get expanded into what terms in the second step (I'm familiar with probability factorization and marginalization, but not so much with RL). Is $R_t$ the term being expanded? I.e. what exactly in the previous step equals what exactly in the next step?

— Amelio Vazquez-Reina

1

It looks like $r$ , lower-case, is replacing $R_{t+1}$ , a random variable, and the second expectation replaces the infinite sum (probably to reflect the assumption that we continue to follow $\pi$ for all future $t$ ). $\Sigma p(s',r|s,a)r$ is then the expected immediate reward on the next time step, and the second expectation, which becomes $v_\pi$ , is the expected value of the next state, weighted by the probability of winding up in state $s'$ having taken $a$ from $s$ .

— Sean Easter

1

Even though the correct answer has already been given and some time has passed, I thought the following step-by-step guide might be useful:
By linearity of the expected value we can split $E[R_{t+1} + \gamma G_{t+1}|S_{t}=s]$ into $E[R_{t+1}|S_t=s]$ and $\gamma E[G_{t+1}|S_{t}=s]$ .
I will outline the steps only for the first part, as the second part follows by the same steps combined with the law of total expectation.

$\begin{align} E[R_{t+1}|S_t=s]&=\sum_r{ r P[R_{t+1}=r|S_t =s]} \\ &= \sum_a{ \sum_r{ r P[R_{t+1}=r, A_t=a|S_t=s]}} \qquad \text{(III)} \\ &=\sum_a{ \sum_r{ r P[R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s]}} \\ &= \sum_{s^{'}}{ \sum_a{ \sum_r{ r P[S_{t+1}=s^{'}, R_{t+1}=r| A_t=a, S_t=s] P[A_t=a|S_t=s] }}} \\ &=\sum_a{ \pi(a|s) \sum_{s^{'},r}{p(s^{'},r|s,a)\, r} } \end{align}$

where (III) follows from:

$\begin{align} P[A,B|C]&=\frac{P[A,B,C]}{P[C]} \\ &= \frac{P[A,B,C]}{P[C]} \frac{P[B,C]}{P[B,C]}\\ &= \frac{P[A,B,C]}{P[B,C]} \frac{P[B,C]}{P[C]}\\ &= P[A|B,C] P[B|C] \end{align}$
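The chain of equalities for $E[R_{t+1}|S_t=s]$ can be traced on a small example. This is a minimal sketch for one fixed state $s$, assuming a made-up policy `pi_s` and dynamics `p_s`; it computes the expectation once from the marginal $P[R_{t+1}=r|S_t=s]$ (first line of the derivation) and once from the factored form (last line).

```python
from collections import defaultdict

# Hypothetical policy and dynamics for one fixed state s.
pi_s = {"left": 0.25, "right": 0.75}                      # pi(a | s)
p_s = {                                                   # p(s', r | s, a)
    "left":  {("s1", 0.0): 0.5, ("s2", 2.0): 0.5},
    "right": {("s1", 2.0): 0.2, ("s2", 2.0): 0.3, ("s2", 10.0): 0.5},
}

# Marginal P[R_{t+1} = r | S_t = s]: sum the joint over actions and next states.
p_r = defaultdict(float)
for a, pa in pi_s.items():
    for (_, r), q in p_s[a].items():
        p_r[r] += pa * q

# First line of the derivation: sum_r r P[R_{t+1} = r | S_t = s].
direct = sum(r * q for r, q in p_r.items())

# Last line: sum_a pi(a|s) sum_{s', r} p(s', r | s, a) r.
factored = sum(pa * sum(q * r for (_, r), q in p_s[a].items())
               for a, pa in pi_s.items())

print(direct, factored)
```

Both quantities come out as 4.75 for these numbers (up to floating-point rounding), because the factored form is just the marginal sum regrouped by action.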

— Adsertor Justitia

1

I know there is already an accepted answer, but I wish to provide a probably more concrete derivation. I would also like to mention that although @Jie Shi's trick somewhat makes sense, it makes me feel very uncomfortable :(. We need to consider the time dimension to make this work. It is important to note that the expectation is actually taken over the entire infinite horizon, rather than just over $s$ and $s'$ . Let us assume we start from $t=0$ (in fact, the derivation is the same regardless of the starting time; I do not want to contaminate the equations with another subscript $k$ ).

$\begin{align} v_{\pi}(s_0)&=\mathbb{E}_{\pi}[G_{0}|s_0]\\ G_0&=\sum_{t=0}^{T-1}\gamma^tR_{t+1}\\ \mathbb{E}_{\pi}[G_{0}|s_0]&=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(\sum_{t=0}^{T-1}\gamma^tr_{t+1}\Big)\bigg)\\ &=\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\\ &\times\Big(r_1+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)\bigg) \end{align}$

Note that the above equation holds even if $T\rightarrow\infty$ ; in fact it will be true until the end of the universe (maybe a bit exaggerated :) ).
At this stage, I believe most of us should already have in mind how the above leads to the final expression: we just need to apply the sum-product rule ( $\sum_a\sum_b\sum_cabc\equiv\sum_aa\sum_bb\sum_cc$ ) painstakingly. Let us apply the linearity of expectation to each term inside $\Big(r_{1}+\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\Big)$ .

Part 1

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\times r_1\bigg)$

Well this is rather trivial, all probabilities disappear (actually sum to 1) except those related to $r_1$ . Therefore, we have

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times r_1$

Part 2
Guess what, this part is even more trivial--it only involves rearranging the sequence of summations.

$\sum_{a_0}\pi(a_0|s_0)\sum_{a_{1},...a_{T}}\sum_{s_{1},...s_{T}}\sum_{r_{1},...r_{T}}\bigg(\prod_{t=0}^{T-1}\pi(a_{t+1}|s_{t+1})p(s_{t+1},r_{t+1}|s_t,a_t)\bigg)\\=\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\bigg(\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg)$

And Eureka!! We recover a recursive pattern inside the big parentheses. Let us combine it with $\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}$ , and we obtain $v_{\pi}(s_1)=\mathbb{E}_{\pi}[G_1|s_1]$

$\gamma\mathbb{E}_{\pi}[G_1|s_1]=\sum_{a_1}\pi(a_1|s_1)\sum_{a_{2},...a_{T}}\sum_{s_{2},...s_{T}}\sum_{r_{2},...r_{T}}\bigg(\prod_{t=0}^{T-2}\pi(a_{t+2}|s_{t+2})p(s_{t+2},r_{t+2}|s_{t+1},a_{t+1})\bigg)\bigg(\gamma\sum_{t=0}^{T-2}\gamma^tr_{t+2}\bigg)$
and part 2 becomes

$\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \gamma v_{\pi}(s_1)$

Part 1 + Part 2

$v_{\pi}(s_0) =\sum_{a_0}\pi(a_0|s_0)\sum_{s_1,r_1}p(s_1,r_1|s_0,a_0)\times \Big(r_1+\gamma v_{\pi}(s_1)\Big)$

Now we can drop the time index and recover the general recursive formula

$v_{\pi}(s) =\sum_a \pi(a|s)\sum_{s',r} p(s',r|s,a)\times \Big(r+\gamma v_{\pi}(s')\Big)$

Final confession, I laughed when I saw people above mention the use of law of total expectation. So here I am
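For a finite horizon $T$ the trajectory sum above can be evaluated by brute force and compared with the recursive form. The sketch below does this for a hypothetical two-state, two-action MDP (the tables `pi` and `p` are made up). It simplifies the indexing slightly by attaching $\pi(a_t|s_t)$ to each step, and it only covers finite $T$, so it says nothing about the infinite-horizon case discussed in the comments.

```python
import itertools

gamma = 0.9
states, actions = (0, 1), (0, 1)

# Hypothetical policy pi[s][a] and dynamics p[(s, a)][(s_next, r)].
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.7, 1: 0.3}}
p = {
    (0, 0): {(0, 1.0): 0.5, (1, 0.0): 0.5},
    (0, 1): {(1, 2.0): 1.0},
    (1, 0): {(0, -1.0): 0.3, (1, 1.0): 0.7},
    (1, 1): {(0, 0.0): 0.9, (1, 5.0): 0.1},
}
rewards = sorted({r for d in p.values() for (_, r) in d})


def brute_force(s0, T):
    """Sum over every length-T trajectory of probability times discounted return."""
    total = 0.0
    # One step of a trajectory is a triple (a_t, s_{t+1}, r_{t+1}).
    steps = list(itertools.product(actions, states, rewards))
    for traj in itertools.product(steps, repeat=T):
        prob, ret, s = 1.0, 0.0, s0
        for t, (a, s_next, r) in enumerate(traj):
            prob *= pi[s][a] * p[s, a].get((s_next, r), 0.0)
            ret += gamma ** t * r
            s = s_next
        total += prob * ret
    return total


def recursive(s, T):
    """Finite-horizon recursion: v_T(s) = sum pi p (r + gamma v_{T-1}(s'))."""
    if T == 0:
        return 0.0
    return sum(pi[s][a] * q * (r + gamma * recursive(s2, T - 1))
               for a in actions for (s2, r), q in p[s, a].items())


for s0 in states:
    print(s0, brute_force(s0, 3), recursive(s0, 3))
```

The brute-force sum grows exponentially in $T$ (here 20 possible steps per time index), which is one practical reason to prefer the recursive form.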

— Karlsson Yu

Erm... what is the symbol ' $\sum_{a_0, ..., a_{\infty}}$ ' supposed to mean? There is no $a_\infty$ ...

— Fabian Werner

Another question: Why is the very first equation true? I know $E[f(X)|Y=y] = \int_{\mathcal{X}} f(x) p(x|y) dx$ but in our case, $X$ would be an infinite sequence of random variables $(R_0, R_1, R_2, ........)$ so we would need to compute the density of this variable (consisting of an infinite number of variables of which we know the density) together with something else (namely the state)... how exactly do you do that? I.e. what is $p(r_0, r_1, ....)$ ?

— Fabian Werner

@FabianWerner. Take a deep breath to calm your brain first :). Let me answer your first question. $\sum_{a_0,...,a_{\infty}} \equiv \sum_{a_0}\sum_{a_1},...,\sum_{a_{\infty}}$ . If you recall the definition of the value function, it is actually a summation of discounted future rewards. If we consider an infinite horizon for our future rewards, we then need to sum an infinite number of times. A reward is the result of taking an action from a state; since there is an infinite number of rewards, there should be an infinite number of actions, hence $a_{\infty}$ .

— Karlsson Yu

1

Let us assume that I agree that there is some weird $a_\infty$ (which I still doubt; usually, students in the very first semester in math tend to confuse the limit with some construction that actually involves an infinite element)... I still have one simple question: how is " $\sum_{a_1} ... \sum_{a_\infty}$ " defined? I know what this expression is supposed to mean with a finite number of sums... but infinitely many of them? What do you understand this expression to do?

— Fabian Werner

1

internet. Could you refer me to a page or any place that defines your expression? If not then you actually defined something new and there is no point in discussing that because it is just a symbol that you made up (but there is no meaning behind it)... you agree that we are only able to discuss about the symbol if we both know what it means, right? So, I do not know what it means, please explain...

— Fabian Werner

1

There are already a great many answers to this question, but most involve few words describing what is going on in the manipulations. I'm going to answer it using way more words, I think. To start,

$G_{t} \doteq \sum_{k=t+1}^{T} \gamma^{k-t-1} R_{k}$

is defined in equation 3.11 of Sutton and Barto, with a constant discount factor $0 \leq \gamma \leq 1$ and we can have $T = \infty$ or $\gamma = 1$ , but not both. Since the rewards, $R_{k}$ , are random variables, so is $G_{t}$ as it is merely a linear combination of random variables.

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] \\ & = \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] + \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] \end{align}$

That last line follows from the linearity of expectation values. $R_{t+1}$ is the reward the agent gains after taking action at time step $t$ . For simplicity, I assume that it can take on a finite number of values $r \in \mathcal{R}$ .

Work on the first term. In words, I need to compute the expectation value of $R_{t+1}$ given that we know that the current state is $s$ . The formula for this is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} r p(r|s). \end{align}$

In other words the probability of the appearance of reward $r$ is conditioned on the state $s$ ; different states may have different rewards. This $p(r|s)$ distribution is a marginal distribution of a distribution that also contained the variables $a$ and $s'$ , the action taken at time $t$ and the state at time $t+1$ after the action, respectively:

$\begin{align} p(r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',a,r|s) = \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s) p(s',r | a,s). \end{align}$

Here I have used $\pi(a|s) \doteq p(a|s)$ , following the book's convention. If that last equality is confusing, forget the sums, suppress the $s$ (the probability now looks like a joint probability), use the law of multiplication and finally reintroduce the condition on $s$ in all the new terms. It is now easy to see that the first term is

$\begin{align} \mathbb{E}_{\pi}\left[ R_{t+1} | S_t = s \right] = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} r \pi(a|s) p(s',r | a,s), \end{align}$

as required. On to the second term, where I assume that $G_{t+1}$ is a random variable that takes on a finite number of values $g \in \Gamma$ . Just like the first term:

$\begin{align} \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] = \sum_{g \in \Gamma} g p(g|s). \qquad\qquad\qquad\qquad (*) \end{align}$

Once again, I "un-marginalize" the probability distribution by writing (law of multiplication again)

$\begin{align} p(g|s) & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(s',r,a,g|s) = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r, a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s', r, a, s) p(s', r | a, s) \pi(a | s) \\ & = \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} p(g | s') p(s', r | a, s) \pi(a | s) \qquad\qquad\qquad\qquad (**) \end{align}$

The last line in there follows from the Markovian property. Remember that $G_{t+1}$ is the sum of all the future (discounted) rewards that the agent receives after state $s'$ . The Markovian property is that the process is memory-less with regards to previous states, actions and rewards. Future actions (and the rewards they reap) depend only on the state in which the action is taken, so $p(g | s', r, a, s) = p(g | s')$ , by assumption. Ok, so the second term in the proof is now

$\begin{align} \gamma \mathbb{E}_{\pi}\left[ G_{t+1} | S_t = s \right] & = \gamma \sum_{g \in \Gamma} \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} g p(g | s') p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} \mathbb{E}_{\pi}\left[ G_{t+1} | S_{t+1} = s' \right] p(s', r | a, s) \pi(a | s) \\ & = \gamma \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} \sum_{a \in \mathcal{A}} v_{\pi}(s') p(s', r | a, s) \pi(a | s) \end{align}$

as required, once again. Combining the two terms completes the proof

$\begin{align} v_\pi(s) & \doteq \mathbb{E}_\pi\left[G_t \mid S_t = s\right] \\ & = \sum_{a \in \mathcal{A}} \pi(a | s) \sum_{r \in \mathcal{R}} \sum_{s' \in \mathcal{S}} p(s', r | a, s) \left[ r + \gamma v_{\pi}(s') \right]. \end{align}$

UPDATE

I want to address what might look like a sleight of hand in the derivation of the second term. In the equation marked with $(*)$ , I use a term $p(g|s)$ and then later in the equation marked $(**)$ I claim that $g$ doesn't depend on $s$ , by arguing the Markovian property. So, you might say that if this is the case, then $p(g|s) = p(g)$ . But this is not true. I can take $p(g | s', r, a, s) \rightarrow p(g | s')$ because the probability on the left side of that statement says that this is the probability of $g$ conditioned on $s'$ , $a$ , $r$ , and $s$ . Because we either know or assume the state $s'$ , none of the other conditionals matter, because of the Markovian property. If you do not know or assume the state $s'$ , then the future rewards (the meaning of $g$ ) will depend on which state you begin at, because that will determine (based on the policy) which state $s'$ you start at when computing $g$ .

If that argument doesn't convince you, try to compute what $p(g)$ is:

$\begin{align} p(g) & = \sum_{s' \in \mathcal{S}} p(g, s') = \sum_{s' \in \mathcal{S}} p(g | s') p(s') \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', a, r, s) \\ & = \sum_{s' \in \mathcal{S}} p(g | s') \sum_{s,a,r} p(s', r | a, s) p(a, s) \\ & = \sum_{s \in \mathcal{S}} p(s) \sum_{s' \in \mathcal{S}} p(g | s') \sum_{a,r} p(s', r | a, s) \pi(a | s) \\ & \doteq \sum_{s \in \mathcal{S}} p(s) p(g|s) = \sum_{s \in \mathcal{S}} p(g,s) = p(g). \end{align}$

As can be seen in the last line, it is not true that $p(g|s) = p(g)$ . The expected value of $g$ depends on which state you start in (i.e. the identity of $s$ ), if you do not know or assume the state $s'$ .
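The dependence of the expected return on the starting state can also be seen empirically. The following is a rough Monte Carlo sketch on a hypothetical two-state MDP (the tables `pi` and `step` are invented for illustration): it samples truncated returns from each start state and compares their averages with the values obtained by iterating the Bellman equation derived above. The truncation horizon, episode count and seed are arbitrary choices.

```python
import random

random.seed(0)
gamma = 0.5

# Hypothetical policy pi[s][a] and dynamics step[(s, a)] = [(prob, s_next, reward), ...].
pi = {0: [0.5, 0.5], 1: [0.2, 0.8]}
step = {
    (0, 0): [(0.7, 0, 1.0), (0.3, 1, 0.0)],
    (0, 1): [(1.0, 1, 2.0)],
    (1, 0): [(0.4, 0, 0.0), (0.6, 1, -1.0)],
    (1, 1): [(0.5, 0, 3.0), (0.5, 1, 1.0)],
}


def sample_return(s, horizon=25):
    """One sampled return G_t from S_t = s, truncated after `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = 0 if random.random() < pi[s][0] else 1
        u, acc = random.random(), 0.0
        for q, s_next, r in step[s, a]:
            acc += q
            if u < acc:
                break                 # falls through to the last outcome otherwise
        g += discount * r
        discount *= gamma
        s = s_next
    return g


def exact_values(iters=200):
    """Iterate v(s) = sum_a pi(a|s) sum_{s', r} p(s', r | s, a) (r + gamma v(s'))."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [sum(pi[s][a] * q * (r + gamma * v[s2])
                 for a in (0, 1) for q, s2, r in step[s, a])
             for s in (0, 1)]
    return v


n = 10000
v_exact = exact_values()
v_mc = [sum(sample_return(s) for _ in range(n)) / n for s in (0, 1)]
print(v_exact, v_mc)
```

The sample averages should land close to the exact values, and the two states have visibly different values, which is the point made above: the expected value of $g$ depends on $s$ unless $s'$ is known.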

— Finncent Price