Why is a baseline conditional on the state at some timestep unbiased?

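In robotics, reinforcement learning techniques are used to find control patterns for a robot. Unfortunately, most policy gradient methods are statistically biased, which can put the robot in an unsafe situation; see page 2 of Jan Peters and Stefan Schaal: Reinforcement learning of motor skills with policy gradients, 2008.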

With motor primitive learning, the problem can be overcome because policy gradient parameter optimization directs the learning steps toward the goal.

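Quote: "If the gradient estimate is unbiased and the learning rates fulfill sum(a) = 0, the learning process is guaranteed to converge to at least a local minimum [...] Therefore, we need to estimate the policy gradient only from data generated during the execution of a task." (page 4 of the same paper)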

Problem 1 of the Berkeley RL class homework asks you to show that the policy gradient is still unbiased if the subtracted baseline is a function of the state at timestep t:

$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = 0$

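I am struggling with what the first step of such a proof might be. Can someone point me in the right direction? My initial thought was to somehow use the law of total expectation to make the expectation of $b(s_t)$ conditional on T, but I am not sure. Thanks in advance :)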

(link to the original .png of the equation)

reinforcement-learning

— Laura C

Welcome to SE:AI! (I took the liberty of converting the equation to MathJax. The original .png is linked at the bottom.)

— DukeZhou

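I don't have much time to write out the exact equations and format them in LaTeX (maybe later, if this is still unanswered), but here is a hint. You want the sum not to depend on the policy, so that the derivative is 0. So try to express p(s, a) in terms of the policy. By the way, the answer can also be found in the policy gradient chapter of Sutton's RL intro book.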

— Hai Nguyen

Thank you very much! I will use that hint to get started, and thanks for telling me it is in Sutton's RL book. I am reading that book and it is excellent!

— Laura C

@LauraC If you find the answer before anyone else does, please do come back and post it here as a formal answer (people definitely like this question :)

— DukeZhou

I added context information to the question.

— Manuel Rodriguez

Answers:

Using the law of iterated expectations, we have

$\nabla_\theta \sum_{t=1}^T \mathbb{E}_{(s_t,a_t) \sim p(s_t,a_t)} [b(s_t)] = \nabla_\theta \sum_{t=1}^T \mathbb{E}_{s_t \sim p(s_t)} \left[ \mathbb{E}_{a_t \sim \pi_\theta(a_t | s_t)} \left[ b(s_t) \right]\right] =$

Writing it out with integrals and moving the gradient inside (linearity), we get

$= \sum_{t=1}^T \int_{s_t} p(s_t) \left(\int_{a_t} \nabla_\theta b(s_t) \pi_\theta(a_t | s_t) da_t \right)ds_t =$

Now we can move $\nabla_\theta$ (due to linearity) and $b(s_t)$ (which does not depend on $a_t$) out of the inner integral:

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta \left(\int_{a_t} \pi_\theta(a_t | s_t) da_t \right)ds_t=$

$\pi_\theta(a_t | s_t)$ is a (conditional) probability density function, so integrating it over all $a_t$ for a given fixed state $s_t$ gives $1$:

$= \sum_{t=1}^T \int_{s_t} p(s_t) b(s_t) \nabla_\theta 1 ds_t =$

Now $\nabla_\theta 1 = 0$, which concludes the proof.
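The key step, that $\int_{a_t} \pi_\theta(a_t | s_t) da_t = 1$ has zero gradient, can be sanity-checked numerically. The sketch below is not part of the proof. It assumes a softmax policy over a small discrete action set for one fixed state (so the integral becomes a sum), and the names `softmax` and `expected_baseline_term` are made up for illustration. It evaluates $\sum_a \nabla_\theta \pi_\theta(a|s)\, b(s)$ exactly, written in the equivalent form $\sum_a \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)\, b(s)$.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exps = np.exp(shifted)
    return exps / exps.sum()


def expected_baseline_term(theta, baseline_value):
    """Exact sum over actions of pi(a|s) * grad_theta log pi(a|s) * b(s).

    Assumption: the policy for this fixed state is softmax(theta),
    with one logit per discrete action.
    """
    probs = softmax(theta)
    n_actions = len(theta)
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
    # Row a of this matrix is the gradient for action a.
    grad_log_pi = np.eye(n_actions) - probs
    # Weight each action's gradient by its probability, then scale by b(s),
    # which is constant with respect to the action.
    return baseline_value * (probs[:, None] * grad_log_pi).sum(axis=0)


rng = np.random.default_rng(0)
theta = rng.normal(size=5)  # arbitrary policy parameters
term = expected_baseline_term(theta, baseline_value=3.7)
print(np.allclose(term, 0.0))
```

The result vanishes for any `theta` and any baseline value, because the action probabilities always sum to one regardless of the parameters.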

— Andrei Poehlmann

It appears that the homework was due two days prior to this answer's writing, but in case it is still relevant in some way, the relevant class notes (which would have been useful if provided in the question along with the homework) are here.

The first instance of expectation placed on the student is, "Please show equation 12 by using the law of iterated expectations, breaking $\mathbb{E}_{\tau \sim p_\theta(\tau)}$ by decoupling the state-action marginal from the rest of the trajectory." Equation 12 is this.

$\sum_{t = 1}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)} [\nabla_\theta \log \pi_\theta(a_t|s_t) \, b(s_t)] = 0$

The class notes identify $\pi_\theta(a_t|s_t)$ as the state-action marginal. It is not a proof sought, but a sequence of algebraic steps to perform the decoupling and show the degree to which independence of the state-action marginal can be achieved.
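One possible sequence of such steps for a single term of Equation 12 is sketched below. This is a reading of the exercise, not the course's official solution. It assumes that the gradient and the integral can be interchanged and that $\pi_\theta$ is differentiable and positive, and it writes the state marginal as $p_\theta(s_t)$, a notation that does not appear in the question.

```latex
\begin{align*}
\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t)\right]
&= \mathbb{E}_{s_t \sim p_\theta(s_t)}\left[ b(s_t)\,
   \mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\right]\right], \\
\mathbb{E}_{a_t \sim \pi_\theta(a_t|s_t)}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\right]
&= \int_{a_t} \pi_\theta(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, da_t
 = \int_{a_t} \nabla_\theta \pi_\theta(a_t|s_t)\, da_t \\
&= \nabla_\theta \int_{a_t} \pi_\theta(a_t|s_t)\, da_t = \nabla_\theta 1 = 0.
\end{align*}
```

The first line holds because the integrand depends only on $(s_t, a_t)$, so the rest of the trajectory integrates out. Since the inner expectation is zero for every $s_t$, each term of the sum over $t$ vanishes.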

This exercise is a preparation for the next step in the homework and draws only on the review of CS189, Berkeley's Introduction to Machine Learning course, which does not contain the Law of Total Expectation in its syllabus or class notes.

All the relevant information is in the above link for class notes and requires only intermediate algebra.

— Douglas Daseeco