Reinforcement Learning with PyTorch
May 31, 2020
This post serves as an introduction to some basic deep reinforcement learning methods, including policy gradients, actor-critic, and deep Q-networks, along with a PyTorch implementation of each.
I’m watching Berkeley’s Deep Reinforcement Learning course (CS 285) by Professor Sergey Levine, a fantastic course. The course website for the Fall 2019 offering can be found here. The course offers well-structured implementations of the major deep RL algorithms (Policy Gradients, Actor-Critic, DQN, etc.) in TensorFlow. With the goal of honing my RL skills, I decided to write PyTorch-based implementations of these deep RL algorithms. The first ones include Policy Gradients, A2C, and DQN. I may implement some model-based RL methods and other algorithms in the future as well. The code structure is based on Berkeley’s course material in this repo. Here is my implementation.
Reinforcement Learning Basics
I will first introduce some key RL concepts and algorithms based on Professor Levine’s lectures. If you are interested, please check out the course. The professor is a great explainer!
An RL problem can be thought of as an agent playing an interactive game with a model. Our agent performs an action $a_t$ in response to an observation $o_t$ of the model’s state $s_t$. For example, we may somehow model the earth’s dynamics: our agent observes that it is raining ($o_t$), while some complex state of the earth like atmospheric dynamics is producing the rain ($s_t$), and our agent decides to take an umbrella ($a_t$). The model transitions to the next state according to our agent’s action and provides feedback in the form of a reward $r_t$. The whole process is referred to as a Markov decision process (MDP). Sometimes, when the game is really simple, we observe the full state of the environment, so $o_t$ coincides with $s_t$. Other times, we only have partial observations, in which case the process is a partially observed Markov decision process (POMDP). I will deal with the former case in this post. Our goal is to win the game, or to receive as much reward as possible. We want to come up with a policy which, given an observation (or state), tells us which action to perform. The policy in the RL setting is usually stochastic: given a state, rather than producing a deterministic action, we produce a distribution over all possible actions.
In summary, and in more formal mathematical language, we define our problem with the following sets and functions:
- $\mathcal{A}$, a set of possible actions that can be taken by our agent;
- $\mathcal{S}$, a set of possible states of the model;
- $p(s_{t+1} \mid s_t, a_t)$, the transition probability for the next state given the current state and the agent’s action. Note that the model is stochastic rather than deterministic, so our optimal policy should be correspondingly stochastic.
- $r(s_t, a_t)$, the reward function given the current state and the agent’s action. The range is usually $\mathbb{R}$; the transition probability and the reward function together define the environment, or the model.
- $\pi(a_t \mid s_t)$, the policy. $\pi(\cdot \mid s_t)$ produces a distribution over actions.
We assess our work by the total reward obtained in expectation, and we often discount future steps with an exponentially decreasing factor $\gamma \in (0, 1]$. For a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$, the total discounted reward is:

$$r(\tau) = \sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t)$$
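As a quick sanity check, the discounted return of a single finite rollout can be computed directly from the reward sequence. A minimal sketch (the function name and arguments are illustrative, not part of the repo):

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # r(tau) = sum_t gamma^(t-1) * r(s_t, a_t) for a single finite rollout
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# e.g. discounted_return([1.0, 0.0, 2.0], gamma=0.99) == 1 + 0 + 0.99**2 * 2 ~= 2.96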
Given a policy $\pi$, we can sample trajectories by repeatedly feeding the sampled action to the model and the resulting next state back to the policy, and we write $\tau \sim p_\pi(\tau)$ as a shorthand notation for the induced distribution over trajectories:

$$p_\pi(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
We need to represent our set of possible policies with some function class; in deep RL, we represent a policy with a neural network. Letting our policies be parametrized by $\theta$, our objective is to find the optimal parameters:

$$\theta^\star = \arg\max_\theta\; E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ r(\tau) \right]$$
Model-free RL Algorithms
There are two main classes of RL algorithms: model-free methods and model-based methods.
- Model-free: directly learn the optimal policy without learning the dynamics of the model, i.e., the transition probability $p(s_{t+1} \mid s_t, a_t)$ and the reward function $r(s_t, a_t)$. Nevertheless, we may want to learn value functions or Q-functions (we’ll talk about these later), which are somewhat related to the reward function.
- Model-based: try to learn the model dynamics, then plan the corresponding actions given the initial state.
I’ll focus on several model-free algorithms.
Policy Gradient
As we want to optimize some value w.r.t. $\theta$, perhaps the most straightforward solution is to do gradient ascent on that value. Fortunately, the gradient is indeed tractable.
Let $J(\theta) = E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ r(\tau) \right]$. Using the identity $\nabla_\theta p_{\pi_\theta}(\tau) = p_{\pi_\theta}(\tau)\, \nabla_\theta \log p_{\pi_\theta}(\tau)$, and noting that the dynamics terms in $\log p_{\pi_\theta}(\tau)$ do not depend on $\theta$:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) r(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) r(\tau_i),$$

where in the last step we approximate the expectation by sampling $N$ trajectories from the current policy. This is an unbiased and consistent estimator, but it generally has high variance, especially for small $N$. It also makes intuitive sense: it is the maximum-likelihood gradient of the policy on the sampled actions, weighted by the total discounted rewards.
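In code, rather than assembling this gradient by hand, we can form a scalar "pseudo-loss" whose autograd gradient matches the estimator above (the same trick the MLPPolicy implementation uses later). A minimal sketch, where log_probs and returns are illustrative tensor names:

import torch

def pg_pseudo_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for every sampled step, shape (num_steps,)
    # returns:   the weight multiplying each log-prob, e.g. r(tau) of that step's trajectory
    # Negated so that minimizing this loss performs gradient ascent on J(theta);
    # the gradient matches the estimator above up to a constant 1/N scale.
    return -(log_probs * returns.detach()).sum()

# Usage: loss = pg_pseudo_loss(log_probs, returns); loss.backward(); optimizer.step()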
Another remark is that policy-gradient-based methods are on-policy, meaning that we need to keep generating samples from the current policy during training. Off-policy methods, in contrast, can learn from a batch of data generated beforehand (or by older policies) and don’t need to sample additional trajectories from the current policy at every training step.
Actor-Critic
We’ve briefly mentioned the high-variance problem, and we will now try to remedy it. To decrease the variance of the gradient estimate, we want to shrink the reward term. Intuitively, we want to increase the probability of actions with higher relative rewards, not necessarily high absolute rewards. Say we have a good action that returns a reward of 10001 and a bad action that returns a reward of 10000: our current gradient will tend to increase the probability of both actions. We should somehow rescale the reward term.
- Rewards-to-go

The first thing to notice is that it makes no sense to always weight every action by the cumulative reward from the very first step: the action at time $t$ cannot affect the rewards received before time $t$, so those rewards shouldn’t enter its gradient term. We can therefore reduce the variance of our estimate by rewriting the gradient formula with the reward-to-go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'}) \right)$$
Somewhat surprisingly, this still produces an unbiased gradient estimator.
- Baseline

The second thing to do ties back to our discussion of relative rewards, which we will from now on call advantages.
A very easy way to estimate the advantages is to simply subtract a baseline $b$ from the total rewards. A useful baseline is the expectation of the total discounted reward, estimated by the sample average:

$$b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$$

and

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_{\pi_\theta}(\tau_i)\, \left( r(\tau_i) - b \right)$$

This still gives an unbiased estimator of the gradient; in fact, you can work out the math to find that the gradient estimate remains unbiased as long as $b$ does not depend on the actions.
These two fixes will make Policy Gradient more stable, but we can do better with our baseline. Let’s take a step back and reformulate our definition of advantages.
We can define the expected reward-to-go under a policy $\pi$ at a state-action pair with the Q-function:

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}\left[ \gamma^{t'-t}\, r(s_{t'}, a_{t'}) \mid s_t, a_t \right],$$

where the subscript $\pi$ is an informal shorthand for the joint distribution induced by the policy $\pi(a_t \mid s_t)$ and the transition probability $p(s_{t+1} \mid s_t, a_t)$. The Q-function is exactly the (expected) total discounted reward-to-go we used above.
We can also define the value function as the state marginal of the Q-function:

$$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$$
Now we can formally define the advantage function, which intuitively describes the relative reward of an action at a given state:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Now our goal is to estimate the policy gradient in the following way:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$
Even though the baseline now depends on the state $s_t$, our estimator is still unbiased, because we are in effect subtracting a state marginal value, the value function, which does not depend on the chosen action. (In fact, I haven’t gone through the detailed proof of this statement; a detailed derivation can be found in the papers referenced below.)
Now, how do we estimate the advantage function? We have expressed the value function in terms of the Q-function, but we can also express the Q-function in terms of the value function:

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[ V^\pi(s_{t+1}) \right] \approx r(s_t, a_t) + \gamma\, V^\pi(s_{t+1}),$$

where in the last step we approximate the expectation with our sampled next state $s_{t+1}$.

Now we can approximate the advantage by:

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma\, V^\pi(s_{t+1}) - V^\pi(s_t)$$
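Concretely, once we have some estimate of the value function (a neural net, as discussed next), each sampled transition yields a one-sample advantage estimate. A minimal sketch, where value_net, rewards, terminals, and the other names are illustrative:

import torch

def estimate_advantages(value_net, obs, rewards, next_obs, terminals, gamma=0.99):
    # A^pi(s_t, a_t) ~= r(s_t, a_t) + gamma * V(s_{t+1}) - V(s_t);
    # the bootstrap term is zeroed out at terminal states.
    with torch.no_grad():
        v_s = value_net(obs).squeeze(-1)
        v_next = value_net(next_obs).squeeze(-1)
    return rewards + gamma * v_next * (1 - terminals) - v_s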
We only need to model the value function now! Again, whenever we need to model a complex function, we can use a neural net. We also need some kind of evaluation target to improve our model of the value function, so we can iteratively improve the model with a supervised approach.
- Goal: fit a model $\hat{V}^\pi_\phi(s)$ with parameters $\phi$
- Evaluations (targets $y_{i,t}$ for $V^\pi(s_{i,t})$):
  - Monte-Carlo evaluation: sample a trajectory and approximate $V^\pi(s_{i,t})$ by the rewards encountered, $y_{i,t} = \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'})$
  - Bootstrapped evaluation: $y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma\, \hat{V}^\pi_\phi(s_{i,t+1})$
- Training (a minimal sketch of this loop follows the list):
  - Training data: $\left\{ \left( s_{i,t},\, y_{i,t} \right) \right\}$
  - Supervised learning: $\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}^\pi_\phi(s_{i,t}) - y_{i,t} \right\|^2$
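Here is a minimal sketch of that supervised loop with bootstrapped targets, a simplified stand-in for the BootstrappedCritic shown later (names and shapes are assumptions, not the repo's exact code):

import torch
import torch.nn as nn

def fit_value_function(value_net, optimizer, obs, rewards, next_obs, terminals,
                       gamma=0.99, num_grad_steps=50):
    # Bootstrapped targets y_t = r_t + gamma * V_phi(s_{t+1}), held fixed for these steps.
    with torch.no_grad():
        targets = rewards + gamma * value_net(next_obs).squeeze(-1) * (1 - terminals)
    loss_fn = nn.MSELoss()
    for _ in range(num_grad_steps):
        loss = loss_fn(value_net(obs).squeeze(-1), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss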
Now for a bit of terminology (finally tying back to the section’s title): our value function estimate is the critic, which tells us how good our actions are, and our policy is the actor. Hence, the Actor-Critic method.
Deep Q-Network (DQN)
But wait a minute: if we can model value functions, or even better Q-functions, why do we still need policy gradients? We can simply take actions according to those values, and if those values are optimal, we get an optimal policy. So the idea of the Deep Q-Network is straightforward: we use a deep neural net to approximate the Q-function and take actions accordingly. Just as we improve our estimate of the value function in Actor-Critic, we take a supervised learning approach to iteratively update the Q-function.
To formalize this idea, we first observe that for any Q-function $Q^\pi$, we can devise a (deterministic) policy $\pi'$ that is at least as good as $\pi$:

$$\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a} Q^\pi(s_t, a) \\ 0 & \text{otherwise} \end{cases}$$

Therefore, we have an optimal Q-function, which I’ll simply refer to as $Q^\star$, corresponding to an optimal (deterministic) policy $\pi^\star$. Our goal is to estimate this Q-function with a deep neural net.
- Goal: fit a model $Q_\phi(s, a)$ with parameters $\phi$
- Evaluations: we observe that
  $$Q_\phi(s_t, a_t) \approx r(s_t, a_t) + \gamma \max_{a'} Q_\phi(s_{t+1}, a')$$
  However, this poses a moving-target problem: our evaluation target keeps changing as our parameters update. (Bootstrapped evaluation for the value function estimation has the same problem, but there we still have a policy gradient, so the problem is bearable.) To fix this, we introduce another set of parameters $\phi'$ that is close to, but not exactly, $\phi$: an easy choice is the $\phi$ from a previous update step, i.e., a target network.
- Training:
  - Training data: sampled transitions $\left\{ \left( s_t, a_t, r_t, s_{t+1} \right) \right\}$
  - Supervised learning: $\mathcal{L}(\phi) = \sum_{t} \left\| Q_\phi(s_t, a_t) - \left[ r(s_t, a_t) + \gamma \max_{a'} Q_{\phi'}(s_{t+1}, a') \right] \right\|^2$
Double Q-Learning
Double Q-Learning is an effective way to reduce the overestimation in our Q-value targets and thereby enhance stability. The issue is that when we evaluate

$$y_t = r(s_t, a_t) + \gamma \max_{a'} Q_{\phi'}(s_{t+1}, a'),$$

we take the max of noisy Q-function values, so we are also maximizing over the noise. We can instead select the action with the current network and score it with the target network:

$$y_t = r(s_t, a_t) + \gamma\, Q_{\phi'}\!\left( s_{t+1},\, \arg\max_{a'} Q_\phi(s_{t+1}, a') \right)$$
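In code, the only change is which network selects the action versus which network scores it. A minimal sketch, with q_net and target_q_net as illustrative names:

import torch

def q_targets(q_net, target_q_net, rewards, next_obs, terminals, gamma=0.99, double_q=True):
    with torch.no_grad():
        next_q = target_q_net(next_obs)  # Q_phi'(s_{t+1}, .)
        if double_q:
            # Action chosen by the online network, value read from the target network.
            best_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
            next_vals = next_q.gather(1, best_actions).squeeze(1)
        else:
            # Vanilla DQN: max over the (noisy) target-network values.
            next_vals = next_q.max(dim=1).values
    return rewards + gamma * next_vals * (1 - terminals)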
PyTorch implementation
Here is the full implementation.
Code structure
├── main.py
├── trainer.py
├── agents
│   ├── ac_agent.py
│   ├── base_agent.py
│   ├── dqn_agent.py
│   └── pg_agent.py
├── actors
│   ├── MLP_policy.py
│   ├── argmax_policy.py
│   └── base_policy.py
├── critics
│   ├── base_critic.py
│   ├── bootstrapped_critic.py
│   └── dqn_critic.py
└── utils
    ├── pytorch_utils.py
    ├── replay_buffer.py
    └── utils.py
main.py
Scripts and the main function.
trainer.py
Trainer class; wraps and trains the agents.
agents
Agents that interact with the environment and seek to optimize rewards. Each agent has an actor (policy) and possibly a critic (value/Q function estimators).
class BaseAgent(object):
    '''
    class template for deep RL agents
    '''
    def __init__(self, **kwargs):
        super(BaseAgent, self).__init__(**kwargs)

    def train(self):
        raise NotImplementedError

    def add_to_replay_buffer(self, paths):
        raise NotImplementedError

    def sample(self, batch_size):
        raise NotImplementedError
Each agent has three major functions:
- train(self): trains the agent’s actor and/or critic.
- sample(self, batch_size): samples some trajectories based on the current policy.
- add_to_replay_buffer(self, paths): adds sampled trajectories to a database, or replay buffer. Methods like policy gradient and actor-critic do not actually require such a replay buffer, but we include it anyway for the sake of simplicity and extensibility (a minimal sketch of such a buffer is shown below).
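The repo's replay_buffer.py is not shown in this post; a minimal sketch of what such a buffer could look like (my simplification, not the actual implementation):

import random
from collections import deque

class SimpleReplayBuffer:
    """Stores (ob, ac, rew, next_ob, terminal) transitions up to a fixed capacity."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, ob, ac, rew, next_ob, terminal):
        self.buffer.append((ob, ac, rew, next_ob, terminal))

    def sample(self, batch_size):
        # Uniformly sample a batch of past transitions.
        batch = random.sample(self.buffer, batch_size)
        obs, acs, rews, next_obs, terminals = zip(*batch)
        return obs, acs, rews, next_obs, terminals

    def __len__(self):
        return len(self.buffer)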
actors
The class template for actors is as follows.
class BasePolicy(object):
    '''
    class template for deep RL policies (actors)
    '''
    def __init__(self, **kwargs):
        super(BasePolicy, self).__init__(**kwargs)

    def get_action(self, obs):
        raise NotImplementedError

    def update(self, obs, acs):
        raise NotImplementedError

    def save(self, filepath):
        raise NotImplementedError

    def restore(self, filepath):
        raise NotImplementedError
We have MLPPolicy, which is used for policy-gradient-based methods, and ArgMaxPolicy, which is used for DQN. The functions are pretty self-explanatory.
critics
The class template for critics is as follows.
class BaseCritic(object):
    '''
    class template for deep RL critics
    '''
    def __init__(self, **kwargs):
        pass

    def update(self, ob_no, next_ob_no, re_n, terminal_n):
        raise NotImplementedError
We have BootstrappedCritic, which implements the bootstrapped evaluation of the value function, and DQNCritic, which is used for DQN.
utils
Some useful util functions, like sampling trajectories. pytorch_utils defines the different network structures that we may use for our algorithms.
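For reference, a network builder along those lines might look like the following sketch (an assumption on my part, not the exact contents of pytorch_utils.py):

import torch.nn as nn

def build_mlp(input_dim, output_dim, hidden_dim=64, num_layers=2, activation=nn.Tanh):
    """Builds a simple fully connected network, e.g. for a policy or a value critic."""
    layers = []
    in_dim = input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), activation()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, output_dim))
    return nn.Sequential(*layers)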
Policy Gradient
We first delve into the MLP policy. At training time, the policy calculates a pseudo-loss

$$L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

and uses autograd to update the parameters.
class MLPPolicy(BasePolicy, nn.Module):
    ...
    def update(self, obs_batch, actions, adv_vals):
        obs_batch = torch.from_numpy(obs_batch).type(self.dtype)
        adv_vals = torch.from_numpy(adv_vals).type(self.dtype)
        logits_n = self.policy(Variable(obs_batch.view(obs_batch.shape[0], -1)))

        # Calculate log(pi(a|s))
        log_probs = Variable(torch.Tensor())
        if self.discrete:
            for i in range(len(obs_batch)):
                action_distribution = Categorical(logits_n[i])
                action = torch.tensor(actions[i])
                log_prob = action_distribution.log_prob(action)
                if len(log_probs) > 0:
                    log_probs = torch.cat([log_probs, log_prob.reshape(1)])
                else:
                    log_probs = log_prob.reshape(1)
        else:
            pass

        loss = (torch.sum(torch.mul(log_probs, Variable(adv_vals)).mul(-1), -1))

        # Update network weights
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss
For get_action, the policy runs the network on the observations, creates a distribution over actions, and samples an action. (I haven’t implemented the case for a continuous action space; a possible sketch follows the code below.)
class MLPPolicy(BasePolicy, nn.Module):
    ...
    def get_action(self, obs):
        with torch.no_grad():
            obs = torch.from_numpy(obs).type(self.dtype)
            action_probs = self.policy(obs.view(-1))
            if self.discrete:
                action_distribution = Categorical(action_probs)
                action = action_distribution.sample()
            else:
                pass
        return action.data.numpy()
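For completeness, here is a sketch of what the continuous-action branch could look like, sampling from a Gaussian instead of a Categorical (this is an assumption of mine, not code from the repo; self.log_std is a hypothetical learned parameter):

import torch
from torch.distributions import Normal

# Hypothetical continuous-action variant of get_action (not in the repo): the network
# outputs the mean of a Gaussian, and a learned log-std parameter sets its scale.
def get_action_continuous(self, obs):
    with torch.no_grad():
        obs = torch.from_numpy(obs).type(self.dtype)
        mean = self.policy(obs.view(-1))
        action_distribution = Normal(mean, self.log_std.exp())
        action = action_distribution.sample()
    return action.numpy()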
We then delve into the agent to see how we calculate the advantages. We use a mean baseline and rewards-to-go.
class PGAgent(BaseAgent):
    ...
    def train(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        q_values = self.calculate_q_vals(re_batch)
        advantage_values = self.estimate_advantage(q_values)
        loss = self.actor.update(ob_batch, ac_batch, advantage_values)
        return loss

    def calculate_q_vals(self, rews_list):
        # calculate reward_to_go q-values
        q_values = np.concatenate([self.calc_reward_to_go(r) for r in rews_list])
        return q_values

    def calc_reward_to_go(self, rewards):
        """
        Input:
            a list of rewards {r_0, r_1, ..., r_t', ... r_{T-1}} from a single rollout of length T
        Output:
            a list of length T where the entry at each index t is sum_{t'=t}^{T-1} gamma^(t'-t) * r_{t'}
        """
        all_discounted_cumsums = []
        for start_time_index in range(rewards.shape[0]):
            indices = np.arange(start_time_index, len(rewards))
            discounts = np.power(self.gamma, indices - start_time_index)
            discounted_rtg = discounts * rewards[indices]
            sum_discounted_rtg = np.sum(discounted_rtg)
            all_discounted_cumsums.append(sum_discounted_rtg)
        list_of_discounted_cumsums = np.array(all_discounted_cumsums)
        return list_of_discounted_cumsums

    def estimate_advantage(self, q_values):
        # mean baseline
        advs = (q_values - q_values.mean())
        # standardize advantages
        advs = (advs - np.mean(advs)) / (np.std(advs) + 1e-8)
        return advs
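As a side note, the O(T²) loop in calc_reward_to_go can be replaced by a single backward pass that accumulates the running return; a sketch of that equivalent computation:

import numpy as np

def calc_reward_to_go_fast(rewards, gamma):
    # Same output as calc_reward_to_go, computed in O(T) by accumulating from the end:
    # rtg[t] = r_t + gamma * rtg[t+1]
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg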
Actor-Critic
The actor-critic method is actually very similar to Policy Gradient. We only need to replace the advantage values with values estimated by our BootstrappedCritic network. Here is how we train the BootstrappedCritic network.
class BootstrappedCritic(BaseCritic):
    ...
    def update(self, ob_no, next_ob_no, re_n, terminal_n):
        ob_no = torch.from_numpy(ob_no).type(self.dtype)
        next_ob_no = torch.from_numpy(next_ob_no).type(self.dtype)
        for i in range(self.num_grad_steps_per_target_update * self.num_target_updates):
            if i % self.num_grad_steps_per_target_update == 0:
                # recompute the bootstrapped targets every few gradient steps
                targ_val = re_n + self.gamma * self.value_func(next_ob_no).detach().numpy() * (1 - terminal_n)
                targ_val = torch.from_numpy(targ_val).type(self.dtype)
            loss = self.lossCrit(targ_val, self.value_func(ob_no))
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        return loss
DQN
The ArgMaxPolicy is really simple: it runs our Q-network on the observations and takes the argmax of the values (for a discrete action space).
import torch

class ArgMaxPolicy(object):
    ...
    def get_action(self, obs):
        with torch.no_grad():
            obs = torch.from_numpy(obs).type(self.dtype)
            qvals = self.critic.q_func(obs.unsqueeze(0).permute(0, 3, 1, 2))
            action = torch.argmax(qvals)
        return action.data.numpy()
The critic network update is also straightforward if we follow the formula.
class DQNCritic(BaseCritic):
    def __init__(self, params, optimizer=optim.Adam):
        ...
        self.q_func = CovNet(self.ob_dim, self.ac_dim).type(self.dtype)
        self.target_q_func = CovNet(self.ob_dim, self.ac_dim).type(self.dtype)
        self.target_q_func.load_state_dict(self.q_func.state_dict())
        self.target_q_func.eval()
        self.optimizer = optimizer(self.q_func.parameters(), lr=self.lr)
        self.lossCrit = F.smooth_l1_loss

    def update(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        ob_batch = torch.from_numpy(ob_batch).type(self.dtype).permute(0, 3, 1, 2)
        next_ob_batch = torch.from_numpy(next_ob_batch).type(self.dtype).permute(0, 3, 1, 2)
        ac_batch = torch.from_numpy(ac_batch).type(torch.int64).view(-1, 1)

        predicted_vals = self.q_func(Variable(ob_batch)).gather(1, ac_batch)
        next_state_qvals = self.target_q_func(next_ob_batch)
        if self.double_q:
            target_actions = self.q_func(next_ob_batch).argmax(1).view(-1, 1)
            next_state_vals = next_state_qvals.gather(1, target_actions).view(1, -1).detach().numpy()
        else:
            next_state_vals = next_state_qvals.max(1)[0].detach().numpy()
        target_vals = re_batch + self.gamma * next_state_vals * (1 - terminal_batch)
        target_vals = torch.from_numpy(target_vals).type(self.dtype).view(-1, 1).detach()

        loss = self.lossCrit(predicted_vals, target_vals)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss

    def get_qval(self, obs):
        obs = torch.from_numpy(obs).type(self.dtype)
        return self.q_func(Variable(obs.permute(0, 3, 1, 2)))

    def update_target(self):
        self.target_q_func.load_state_dict(self.q_func.state_dict())
In DQNAgent, we need to update our target network once in a while to keep the target estimates accurate.
class DQNAgent(BaseAgent):
    ...
    def train(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        if self.num_param_updates % self.target_update_freq == 0:
            self.critic.update_target()
        self.num_param_updates += 1
        loss = self.critic.update(ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch)
        return loss
Note: there are many possible optimizations to make, and I may have made mistakes; please contact me in either case. Many thanks!
References
[1] CS 285 at UC Berkeley, Deep Reinforcement Learning: slides, lecture recordings, and the course GitHub repository.
[2] Levine & Koltun (2013). Guided Policy Search: deep RL with importance-sampled policy gradients.
[3] Thomas (2014). Bias in Natural Actor-Critic Algorithms. ICML 2014.
[4] Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver & Kavukcuoglu (2016). Asynchronous Methods for Deep Reinforcement Learning: A3C, an asynchronous actor-critic method.
[5] Mnih et al. (2015). Human-Level Control through Deep Reinforcement Learning: Q-learning with convolutional networks for playing Atari.
[6] Van Hasselt, Guez & Silver (2015). Deep Reinforcement Learning with Double Q-Learning: a very effective trick to improve the performance of deep Q-learning.