Value Learning from Different Perspectives

Introduction

In the last few years, reinforcement learning (RL) has made remarkable progress, including beating world-champion Go players, controlling robotic hands, and even painting pictures.

One of the key sub-problems of RL is value estimation – learning the long-term consequences of being in a state. This can be tricky because future returns are generally noisy, affected by many things other than the present state, and this becomes more pronounced the further we look into the future.

But while difficult, estimating value is also essential to many approaches to RL.

Monte Carlo Value Estimation

The natural way to estimate the value of a state is as the average return you observe from that state. We call this Monte Carlo value estimation.

Cliff World is a classic RL example, where the agent learns to walk along a cliff to reach a goal.

Sometimes the agent reaches its goal. Other times it falls off the cliff. Monte Carlo averages over trajectories where they intersect.

If a state is visited by only one episode, Monte Carlo says its value is the return of that episode. If multiple episodes visit a state, Monte Carlo estimates its value as the average of their returns.
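To make this concrete, here is a minimal every-visit Monte Carlo sketch in Python. The episode format (a list of (state, reward) pairs) and the function name monte_carlo_values are illustrative assumptions rather than code from the article:

from collections import defaultdict

def monte_carlo_values(episodes, gamma=1.0):
    # Every-visit Monte Carlo: estimate each state's value as the average
    # discounted return observed after visiting it. (Illustrative sketch.)
    total_return = defaultdict(float)
    visit_count = defaultdict(int)

    for episode in episodes:
        ret = 0.0
        # Walk the episode backwards so `ret` holds the return from each state onward.
        for state, reward in reversed(episode):
            ret = reward + gamma * ret
            total_return[state] += ret
            visit_count[state] += 1

    # A state visited by a single episode gets that episode's return;
    # a state visited by several episodes gets the average over them.
    return {s: total_return[s] / visit_count[s] for s in visit_count}

For example, if two Cliff World episodes share the same early states, with one reaching the goal and one falling off the cliff, those shared states get a value halfway between the two outcomes (assuming no discounting).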

Update Rules

Let’s write Monte Carlo a bit more formally. In RL, we often describe algorithms with update rules, which tell us how estimates change with one more episode. We’ll use an “updates toward” (\hookleftarrow) operator to keep equations simple.
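As a minimal sketch in this notation (standard Monte Carlo, with symbols that are our assumptions rather than quoted from the article), the update nudges a visited state's value estimate toward the return observed after it:

V(s_t) \hookleftarrow R_t, \qquad R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots

Here \gamma is the discount factor and R_t is the return observed from time t onward. Repeated over many episodes, with an appropriately decreasing step size, updating toward observed returns simply averages them, recovering the Monte Carlo estimate described above.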

Conclusion

In this article we introduced a new way to think about TD learning. It helps us see why TD learning can be beneficial, why it can be effective for off-policy learning, and why there can be challenges in combining TD learning with function approximators.

We encourage you to use the playground below to build on these intuitions, or to try an experiment of your own.

Gridworld playground

FAQs

Q: What is the main idea of this article?

A: It introduces a new way to think about TD learning: why it can be beneficial, why it can be effective for off-policy learning, and why combining it with function approximators can be challenging.

Q: What is the purpose of the gridworld example?

A: It illustrates Monte Carlo value estimation and its limitations.

Q: What are the challenges in combining TD learning with function approximators?

A: The article discusses several, including the need to anneal the lambda parameter over the course of training.
