Reinforcement Learning with PyTorch
May 31, 2020
This post serves as an introduction to some basic deep reinforcement learning methods, including policy gradients, actor-critic, and deep Q-networks, along with a PyTorch implementation of each.
I’m watching Berkeley’s Deep Reinforcement Learning course (CS 285) by Professor Sergey Levine, a fantastic course. The course website for the Fall 2019 offering can be found here. The course offers well-structured implementations of the major deep RL algorithms (Policy Gradients, Actor-Critic, DQN, etc.) in TensorFlow. With the goal of honing my RL skills, I decided to write PyTorch-based implementations of these deep RL algorithms. The first ones include Policy Gradients, A2C, and DQN. I may implement some model-based RL methods and other algorithms in the future as well. The code structure is based on Berkeley’s course material in this repo. Here is my implementation.
Reinforcement Learning Basics
I will first introduce some key RL concepts and algorithms based on Professor Levine’s lectures. If you are interested, please check out the course. The professor is a great explainer!
An RL problem can be thought of as an agent playing an interactive game with a model. Our agent performs an action $a_t$ in response to an observation $o_t$ of the model’s state $s_t$. For example, we may somehow model the earth’s dynamics: our agent observes that it is raining ($o_t$), while some complex state of the earth like atmospheric dynamics is producing the rain ($s_t$), and our agent decides to take an umbrella ($a_t$). The model transitions to the next state according to our agent’s action and provides feedback in the form of a reward $r_t$. The whole process is referred to as a Markov decision process (MDP). Sometimes, when the game is really simple, we observe the full state of the environment, so $o_t$ coincides with $s_t$. Other times, we only have partial observations, in which case the process is a partially observed Markov decision process (POMDP). I will deal with the former case in this post. Our goal is to win the game, or to receive as much reward as possible. We want to come up with a policy which, given an observation (or state), tells us which action to perform. The policy in the RL setting is usually stochastic: given a state, rather than producing a deterministic action, we produce a distribution over all possible actions.
In summary, and in more formal mathematical language, we define our problem with the following sets and functions:
- $\mathcal{A}$, a set of possible actions that can be taken by our agent;
- $\mathcal{S}$, a set of possible states of the model;
- $p(s_{t+1} \mid s_t, a_t)$, the transition probability for the next state given the current state and the agent’s action. Note that the model is stochastic rather than deterministic, so our optimal policy should be correspondingly stochastic.
- $r(s_t, a_t)$, the reward function given the current state and the agent’s action. The range is usually $\mathbb{R}$; the transition probability and the reward function together define the environment, or the model.
- $\pi(a_t \mid s_t)$, the policy. $\pi(\cdot \mid s_t)$ produces a distribution over actions.
We assess our work by the total reward obtained in expectation, and we often discount future steps with an exponentially decreasing factor $\gamma \in (0, 1]$. For a trajectory $\tau = (s_1, a_1, \ldots, s_T, a_T)$, the total discounted reward is:

$$r(\tau) = \sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t)$$
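As a quick sanity check, the discounted return of a single finite rollout can be computed directly from the reward sequence. A minimal sketch (the function name and arguments are illustrative, not part of the repo):

import numpy as np

def discounted_return(rewards, gamma=0.99):
    # r(tau) = sum_t gamma^(t-1) * r(s_t, a_t) for a single finite rollout
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# e.g. discounted_return([1.0, 0.0, 2.0], gamma=0.99) == 1 + 0 + 0.99**2 * 2 ~= 2.96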
Given a policy $\pi$, we can sample trajectories by repeatedly feeding the sampled action to the model and the resulting next state back to the policy, and we write $\tau \sim p_\pi(\tau)$ as a shorthand notation for the induced distribution over trajectories:

$$p_\pi(\tau) = p(s_1) \prod_{t=1}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$$
We need to represent our set of possible policies with some function class; in deep RL, we represent a policy with a neural network. Letting our policies be parametrized by $\theta$, our objective is to find the optimal parameters:

$$\theta^\star = \arg\max_\theta\; E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ r(\tau) \right]$$
Model-free RL Algorithms
There are two main classes of RL algorithms: model-free methods and model-based methods.
- Model-free: directly learn the optimal policy without learning the dynamics of the model, i.e., the transition probability $p(s_{t+1} \mid s_t, a_t)$ and the reward function $r(s_t, a_t)$. Nevertheless, we may want to learn value functions or Q-functions (we’ll talk about these later), which are somewhat related to the reward function.
- Model-based: try to learn the model dynamics, then plan the corresponding actions given the initial state.
I’ll focus on several model-free algorithms.
Policy Gradient
As we want to optimize some value w.r.t. $\theta$, perhaps the most straightforward solution is to do gradient ascent on that value. Fortunately, the gradient is indeed tractable.
Let $J(\theta) = E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ r(\tau) \right]$. Using the identity $\nabla_\theta p_{\pi_\theta}(\tau) = p_{\pi_\theta}(\tau)\, \nabla_\theta \log p_{\pi_\theta}(\tau)$, and noting that the dynamics terms in $\log p_{\pi_\theta}(\tau)$ do not depend on $\theta$:

$$\nabla_\theta J(\theta) = E_{\tau \sim p_{\pi_\theta}(\tau)}\left[ \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right) r(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \left( \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) r(\tau_i),$$

where in the last step we approximate the expectation by sampling $N$ trajectories from the current policy. This is an unbiased and consistent estimator, but it generally has high variance, especially for small $N$. It also makes intuitive sense: it is the maximum-likelihood gradient of the policy on the sampled actions, weighted by the total discounted rewards.
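In code, rather than assembling this gradient by hand, we can form a scalar "pseudo-loss" whose autograd gradient matches the estimator above (the same trick the MLPPolicy implementation uses later). A minimal sketch, where log_probs and returns are illustrative tensor names:

import torch

def pg_pseudo_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs: log pi_theta(a_t | s_t) for every sampled step, shape (num_steps,)
    # returns:   the weight multiplying each log-prob, e.g. r(tau) of that step's trajectory
    # Negated so that minimizing this loss performs gradient ascent on J(theta);
    # the gradient matches the estimator above up to a constant 1/N scale.
    return -(log_probs * returns.detach()).sum()

# Usage: loss = pg_pseudo_loss(log_probs, returns); loss.backward(); optimizer.step()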
Another remark is that policy-gradient-based methods are on-policy, meaning that we need to keep generating samples from the current policy during training. Off-policy methods, in contrast, can learn from a batch of data generated beforehand (or by older policies) and don’t need to sample additional trajectories from the current policy at every training step.
Actor-Critic
We’ve briefly mentioned the high-variance problem, and we will now try to remedy it. To decrease the variance of the gradient estimate, we want to shrink the reward term. Intuitively, we want to increase the probability of actions with higher relative rewards, not necessarily high absolute rewards. Say we have a good action that returns a reward of 10001 and a bad action that returns a reward of 10000: our current gradient will tend to increase the probability of both actions. We should somehow rescale the reward term.
- Rewards-to-go

The first thing to notice is that it makes no sense to always weight every action by the cumulative reward from the very first step: the action at time $t$ cannot affect the rewards received before time $t$, so those rewards shouldn’t enter its gradient term. We can therefore reduce the variance of our estimate by rewriting the gradient formula with the reward-to-go:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left( \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'}) \right)$$
Somewhat surprisingly, this still produces an unbiased gradient estimator.
- Baseline

The second thing to do ties back to our discussion of relative rewards, which we will from now on call advantages.
A very easy way to estimate the advantages is to simply subtract a baseline $b$ from the total rewards. A useful baseline is the expectation of the total discounted reward, estimated by the sample average:

$$b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$$

and

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_{\pi_\theta}(\tau_i)\, \left( r(\tau_i) - b \right)$$

This still gives an unbiased estimator of the gradient; in fact, you can work out the math to find that the gradient estimate remains unbiased as long as $b$ does not depend on the actions.
These two fixes will make Policy Gradient more stable, but we can do better with our baseline. Let’s take a step back and reformulate our definition of advantages.
We can define the expected reward-to-go under a policy $\pi$ at a state-action pair with the Q-function:

$$Q^\pi(s_t, a_t) = \sum_{t'=t}^{T} E_{\pi}\left[ \gamma^{t'-t}\, r(s_{t'}, a_{t'}) \mid s_t, a_t \right],$$

where the subscript $\pi$ is an informal shorthand for the joint distribution induced by the policy $\pi(a_t \mid s_t)$ and the transition probability $p(s_{t+1} \mid s_t, a_t)$. The Q-function is exactly the (expected) total discounted reward-to-go we used above.
We can also define the value function as the state marginal of the Q-function:

$$V^\pi(s_t) = E_{a_t \sim \pi(a_t \mid s_t)}\left[ Q^\pi(s_t, a_t) \right]$$
Now we can formally define the advantage function, which intuitively describes the relative reward of an action at a given state:

$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
Now our goal is to estimate the policy gradient in the following way:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\, A^\pi(s_{i,t}, a_{i,t})$$
Even though the baseline now depends on the state $s_t$, our estimator is still unbiased, because we are in effect subtracting a state marginal value, the value function, which does not depend on the chosen action. (In fact, I haven’t gone through the detailed proof of this statement; a detailed derivation can be found in the papers referenced below.)
Now, how do we estimate the advantage function? We have expressed the value function in terms of the Q-function, but we can also express the Q-function in terms of the value function:

$$Q^\pi(s_t, a_t) = r(s_t, a_t) + \gamma\, E_{s_{t+1} \sim p(s_{t+1} \mid s_t, a_t)}\left[ V^\pi(s_{t+1}) \right] \approx r(s_t, a_t) + \gamma\, V^\pi(s_{t+1}),$$

where in the last step we approximate the expectation with our sampled next state $s_{t+1}$.

Now we can approximate the advantage by:

$$A^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma\, V^\pi(s_{t+1}) - V^\pi(s_t)$$
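Concretely, once we have some estimate of the value function (a neural net, as discussed next), each sampled transition yields a one-sample advantage estimate. A minimal sketch, where value_net, rewards, terminals, and the other names are illustrative:

import torch

def estimate_advantages(value_net, obs, rewards, next_obs, terminals, gamma=0.99):
    # A^pi(s_t, a_t) ~= r(s_t, a_t) + gamma * V(s_{t+1}) - V(s_t);
    # the bootstrap term is zeroed out at terminal states.
    with torch.no_grad():
        v_s = value_net(obs).squeeze(-1)
        v_next = value_net(next_obs).squeeze(-1)
    return rewards + gamma * v_next * (1 - terminals) - v_s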
We only need to model the value function now! Again, whenever we need to model a complex function, we can use a neural net. We also need some kind of evaluation target to improve our model of the value function, so we can iteratively improve the model with a supervised approach.
- Goal: fit a model $\hat{V}^\pi_\phi(s)$ with parameters $\phi$
- Evaluations (targets $y_{i,t}$ for $V^\pi(s_{i,t})$):
  - Monte-Carlo evaluation: sample a trajectory and approximate $V^\pi(s_{i,t})$ by the rewards encountered, $y_{i,t} = \sum_{t'=t}^{T} \gamma^{t'-t}\, r(s_{i,t'}, a_{i,t'})$
  - Bootstrapped evaluation: $y_{i,t} = r(s_{i,t}, a_{i,t}) + \gamma\, \hat{V}^\pi_\phi(s_{i,t+1})$
- Training (a minimal sketch of this loop follows the list):
  - Training data: $\left\{ \left( s_{i,t},\, y_{i,t} \right) \right\}$
  - Supervised learning: $\mathcal{L}(\phi) = \frac{1}{2} \sum_{i,t} \left\| \hat{V}^\pi_\phi(s_{i,t}) - y_{i,t} \right\|^2$
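Here is a minimal sketch of that supervised loop with bootstrapped targets, a simplified stand-in for the BootstrappedCritic shown later (names and shapes are assumptions, not the repo's exact code):

import torch
import torch.nn as nn

def fit_value_function(value_net, optimizer, obs, rewards, next_obs, terminals,
                       gamma=0.99, num_grad_steps=50):
    # Bootstrapped targets y_t = r_t + gamma * V_phi(s_{t+1}), held fixed for these steps.
    with torch.no_grad():
        targets = rewards + gamma * value_net(next_obs).squeeze(-1) * (1 - terminals)
    loss_fn = nn.MSELoss()
    for _ in range(num_grad_steps):
        loss = loss_fn(value_net(obs).squeeze(-1), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss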
Now for a bit of terminology (finally tying back to the section’s title): our value function estimate is the critic, which tells us how good our actions are, and our policy is the actor. Hence, the Actor-Critic method.
Deep Q-Network (DQN)
But wait a minute: if we can model value functions, or even better Q-functions, why do we still need policy gradients? We can simply take actions according to those values, and if those values are optimal, we get an optimal policy. So the idea of the Deep Q-Network is straightforward: we use a deep neural net to approximate the Q-function and take actions accordingly. Just as we improve our estimate of the value function in Actor-Critic, we take a supervised learning approach to iteratively update the Q-function.
To formalize this idea, we first observe that for any Q-function $Q^\pi$, we can devise a (deterministic) policy $\pi'$ that is at least as good as $\pi$:

$$\pi'(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \arg\max_{a} Q^\pi(s_t, a) \\ 0 & \text{otherwise} \end{cases}$$

Therefore, we have an optimal Q-function, which I’ll simply refer to as $Q^\star$, corresponding to an optimal (deterministic) policy $\pi^\star$. Our goal is to estimate this Q-function with a deep neural net.
- Goal: fit a model $Q_\phi(s, a)$ with parameters $\phi$
- Evaluations: we observe that
  $$Q_\phi(s_t, a_t) \approx r(s_t, a_t) + \gamma \max_{a'} Q_\phi(s_{t+1}, a')$$
  However, this poses a moving-target problem: our evaluation target keeps changing as our parameters update. (Bootstrapped evaluation for the value function estimation has the same problem, but there we still have a policy gradient, so the problem is bearable.) To fix this, we introduce another set of parameters $\phi'$ that is close to, but not exactly, $\phi$: an easy choice is the $\phi$ from a previous update step, i.e., a target network.
- Training:
  - Training data: sampled transitions $\left\{ \left( s_t, a_t, r_t, s_{t+1} \right) \right\}$
  - Supervised learning: $\mathcal{L}(\phi) = \sum_{t} \left\| Q_\phi(s_t, a_t) - \left[ r(s_t, a_t) + \gamma \max_{a'} Q_{\phi'}(s_{t+1}, a') \right] \right\|^2$
Double Q-Learning
Double Q-Learning is an effective way to reduce the overestimation in our Q-value targets and thereby enhance stability. The issue is that when we evaluate

$$y_t = r(s_t, a_t) + \gamma \max_{a'} Q_{\phi'}(s_{t+1}, a'),$$

we take the max of noisy Q-function values, so we are also maximizing over the noise. We can instead select the action with the current network and score it with the target network:

$$y_t = r(s_t, a_t) + \gamma\, Q_{\phi'}\!\left( s_{t+1},\, \arg\max_{a'} Q_\phi(s_{t+1}, a') \right)$$
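In code, the only change is which network selects the action versus which network scores it. A minimal sketch, with q_net and target_q_net as illustrative names:

import torch

def q_targets(q_net, target_q_net, rewards, next_obs, terminals, gamma=0.99, double_q=True):
    with torch.no_grad():
        next_q = target_q_net(next_obs)  # Q_phi'(s_{t+1}, .)
        if double_q:
            # Action chosen by the online network, value read from the target network.
            best_actions = q_net(next_obs).argmax(dim=1, keepdim=True)
            next_vals = next_q.gather(1, best_actions).squeeze(1)
        else:
            # Vanilla DQN: max over the (noisy) target-network values.
            next_vals = next_q.max(dim=1).values
    return rewards + gamma * next_vals * (1 - terminals)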
PyTorch implementation
Here is the full implementation.
Code structure
├── main.py
├── trainer.py
├── agents
│   ├── ac_agent.py
│   ├── base_agent.py
│   ├── dqn_agent.py
│   └── pg_agent.py
├── actors
│   ├── MLP_policy.py
│   ├── argmax_policy.py
│   └── base_policy.py
├── critics
│   ├── base_critic.py
│   ├── bootstrapped_critic.py
│   └── dqn_critic.py
└── utils
    ├── pytorch_utils.py
    ├── replay_buffer.py
    └── utils.py
main.py
Scripts and the main function.
trainer.py
Trainer class; wraps and trains the agents.
agents
Agents that interact with the environment and seek to optimize rewards. Each agent has an actor (policy) and possibly a critic (value/Q function estimators).
class BaseAgent(object):
    '''
    class template for deep RL agents
    '''
    def __init__(self, **kwargs):
        super(BaseAgent, self).__init__(**kwargs)

    def train(self):
        raise NotImplementedError

    def add_to_replay_buffer(self, paths):
        raise NotImplementedError

    def sample(self, batch_size):
        raise NotImplementedError
Each agent has three major functions:
- train(self): trains the agent’s actor and/or critic.
- sample(self, batch_size): samples some trajectories based on the current policy.
- add_to_replay_buffer(self, paths): adds sampled trajectories to a database, or replay buffer. Methods like policy gradient and actor-critic do not actually require such a replay buffer, but we include it anyway for the sake of simplicity and extensibility (a minimal sketch of such a buffer is shown below).
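The repo's replay_buffer.py is not shown in this post; a minimal sketch of what such a buffer could look like (my simplification, not the actual implementation):

import random
from collections import deque

class SimpleReplayBuffer:
    """Stores (ob, ac, rew, next_ob, terminal) transitions up to a fixed capacity."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)

    def add(self, ob, ac, rew, next_ob, terminal):
        self.buffer.append((ob, ac, rew, next_ob, terminal))

    def sample(self, batch_size):
        # Uniformly sample a batch of past transitions.
        batch = random.sample(self.buffer, batch_size)
        obs, acs, rews, next_obs, terminals = zip(*batch)
        return obs, acs, rews, next_obs, terminals

    def __len__(self):
        return len(self.buffer)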
actors
The class template for actors is as follows.
class BasePolicy(object):
    '''
    class template for deep RL policies (actors)
    '''
    def __init__(self, **kwargs):
        super(BasePolicy, self).__init__(**kwargs)

    def get_action(self, obs):
        raise NotImplementedError

    def update(self, obs, acs):
        raise NotImplementedError

    def save(self, filepath):
        raise NotImplementedError

    def restore(self, filepath):
        raise NotImplementedError
We have MLPPolicy, which is used for policy-gradient-based methods, and ArgMaxPolicy, which is used for DQN. The functions are pretty self-explanatory.
critics
The class template for critics is as follows.
class BaseCritic(object):
    '''
    class template for deep RL critics
    '''
    def __init__(self, **kwargs):
        pass

    def update(self, ob_no, next_ob_no, re_n, terminal_n):
        raise NotImplementedError
We have BootstrappedCritic, which implements the bootstrapped evaluation of the value function, and DQNCritic, which is used for DQN.
utils
Some useful util functions, like sampling trajectories. pytorch_utils defines the different network structures that we may use for our algorithms.
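For reference, a network builder along those lines might look like the following sketch (an assumption on my part, not the exact contents of pytorch_utils.py):

import torch.nn as nn

def build_mlp(input_dim, output_dim, hidden_dim=64, num_layers=2, activation=nn.Tanh):
    """Builds a simple fully connected network, e.g. for a policy or a value critic."""
    layers = []
    in_dim = input_dim
    for _ in range(num_layers):
        layers += [nn.Linear(in_dim, hidden_dim), activation()]
        in_dim = hidden_dim
    layers.append(nn.Linear(in_dim, output_dim))
    return nn.Sequential(*layers)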
Policy Gradient
We first delve into the MLP policy. At training time, the policy calculates a pseudo-loss

$$L(\theta) = -\sum_{t} \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t$$

and uses autograd to update the parameters.
class MLPPolicy(BasePolicy, nn.Module):
    ...
    def update(self, obs_batch, actions, adv_vals):
        obs_batch = torch.from_numpy(obs_batch).type(self.dtype)
        adv_vals = torch.from_numpy(adv_vals).type(self.dtype)
        logits_n = self.policy(Variable(obs_batch.view(obs_batch.shape[0], -1)))

        # Calculate log(pi(a|s))
        log_probs = Variable(torch.Tensor())
        if self.discrete:
            for i in range(len(obs_batch)):
                action_distribution = Categorical(logits_n[i])
                action = torch.tensor(actions[i])
                log_prob = action_distribution.log_prob(action)
                if len(log_probs) > 0:
                    log_probs = torch.cat([log_probs, log_prob.reshape(1)])
                else:
                    log_probs = log_prob.reshape(1)
        else:
            pass

        loss = (torch.sum(torch.mul(log_probs, Variable(adv_vals)).mul(-1), -1))

        # Update network weights
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss
For get_action, the policy runs the network on the observations, creates a distribution over actions, and samples an action. (I haven’t implemented the case for a continuous action space; a possible sketch follows the code below.)
class MLPPolicy(BasePolicy, nn.Module):
    ...
    def get_action(self, obs):
        with torch.no_grad():
            obs = torch.from_numpy(obs).type(self.dtype)
            action_probs = self.policy(obs.view(-1))
            if self.discrete:
                action_distribution = Categorical(action_probs)
                action = action_distribution.sample()
            else:
                pass
        return action.data.numpy()
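For completeness, here is a sketch of what the continuous-action branch could look like, sampling from a Gaussian instead of a Categorical (this is an assumption of mine, not code from the repo; self.log_std is a hypothetical learned parameter):

import torch
from torch.distributions import Normal

# Hypothetical continuous-action variant of get_action (not in the repo): the network
# outputs the mean of a Gaussian, and a learned log-std parameter sets its scale.
def get_action_continuous(self, obs):
    with torch.no_grad():
        obs = torch.from_numpy(obs).type(self.dtype)
        mean = self.policy(obs.view(-1))
        action_distribution = Normal(mean, self.log_std.exp())
        action = action_distribution.sample()
    return action.numpy()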
We then delve into the agent to see how we calculate the advantages. We use a mean baseline and rewards-to-go.
class PGAgent(BaseAgent):
    ...
    def train(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        q_values = self.calculate_q_vals(re_batch)
        advantage_values = self.estimate_advantage(q_values)
        loss = self.actor.update(ob_batch, ac_batch, advantage_values)
        return loss

    def calculate_q_vals(self, rews_list):
        # calculate reward_to_go q-values
        q_values = np.concatenate([self.calc_reward_to_go(r) for r in rews_list])
        return q_values

    def calc_reward_to_go(self, rewards):
        """
        Input:
            a list of rewards {r_0, r_1, ..., r_t', ... r_{T-1}} from a single rollout of length T
        Output:
            a list of length T where the entry at each index t is sum_{t'=t}^{T-1} gamma^(t'-t) * r_{t'}
        """
        all_discounted_cumsums = []
        for start_time_index in range(rewards.shape[0]):
            indices = np.arange(start_time_index, len(rewards))
            discounts = np.power(self.gamma, indices - start_time_index)
            discounted_rtg = discounts * rewards[indices]
            sum_discounted_rtg = np.sum(discounted_rtg)
            all_discounted_cumsums.append(sum_discounted_rtg)
        list_of_discounted_cumsums = np.array(all_discounted_cumsums)
        return list_of_discounted_cumsums

    def estimate_advantage(self, q_values):
        # mean baseline
        advs = (q_values - q_values.mean())
        # standardize advantages
        advs = (advs - np.mean(advs)) / (np.std(advs) + 1e-8)
        return advs
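As a side note, the O(T²) loop in calc_reward_to_go can be replaced by a single backward pass that accumulates the running return; a sketch of that equivalent computation:

import numpy as np

def calc_reward_to_go_fast(rewards, gamma):
    # Same output as calc_reward_to_go, computed in O(T) by accumulating from the end:
    # rtg[t] = r_t + gamma * rtg[t+1]
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg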
Actor-Critic
The actor-critic method is actually very similar to Policy Gradient. We only need to replace the advantage values with values estimated by our BootstrappedCritic network. Here is how we train the BootstrappedCritic network.
class BootstrappedCritic(BaseCritic):
    ...
    def update(self, ob_no, next_ob_no, re_n, terminal_n):
        ob_no = torch.from_numpy(ob_no).type(self.dtype)
        next_ob_no = torch.from_numpy(next_ob_no).type(self.dtype)
        for i in range(self.num_grad_steps_per_target_update * self.num_target_updates):
            if i % self.num_grad_steps_per_target_update == 0:
                # recompute the bootstrapped targets every few gradient steps
                targ_val = re_n + self.gamma * self.value_func(next_ob_no).detach().numpy() * (1 - terminal_n)
                targ_val = torch.from_numpy(targ_val).type(self.dtype)
            loss = self.lossCrit(targ_val, self.value_func(ob_no))
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
        return loss
DQN
The ArgMaxPolicy is really simple: it runs our Q-network on the observations and takes the argmax of the values (for a discrete action space).
import torch

class ArgMaxPolicy(object):
    ...
    def get_action(self, obs):
        with torch.no_grad():
            obs = torch.from_numpy(obs).type(self.dtype)
            qvals = self.critic.q_func(obs.unsqueeze(0).permute(0, 3, 1, 2))
            action = torch.argmax(qvals)
        return action.data.numpy()
The critic network update is also straightforward if we follow the formula.
class DQNCritic(BaseCritic):
    def __init__(self, params, optimizer=optim.Adam):
        ...
        self.q_func = CovNet(self.ob_dim, self.ac_dim).type(self.dtype)
        self.target_q_func = CovNet(self.ob_dim, self.ac_dim).type(self.dtype)
        self.target_q_func.load_state_dict(self.q_func.state_dict())
        self.target_q_func.eval()
        self.optimizer = optimizer(self.q_func.parameters(), lr=self.lr)
        self.lossCrit = F.smooth_l1_loss

    def update(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        ob_batch = torch.from_numpy(ob_batch).type(self.dtype).permute(0, 3, 1, 2)
        next_ob_batch = torch.from_numpy(next_ob_batch).type(self.dtype).permute(0, 3, 1, 2)
        ac_batch = torch.from_numpy(ac_batch).type(torch.int64).view(-1, 1)

        predicted_vals = self.q_func(Variable(ob_batch)).gather(1, ac_batch)
        next_state_qvals = self.target_q_func(next_ob_batch)
        if self.double_q:
            target_actions = self.q_func(next_ob_batch).argmax(1).view(-1, 1)
            next_state_vals = next_state_qvals.gather(1, target_actions).view(1, -1).detach().numpy()
        else:
            next_state_vals = next_state_qvals.max(1)[0].detach().numpy()
        target_vals = re_batch + self.gamma * next_state_vals * (1 - terminal_batch)
        target_vals = torch.from_numpy(target_vals).type(self.dtype).view(-1, 1).detach()

        loss = self.lossCrit(predicted_vals, target_vals)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss

    def get_qval(self, obs):
        obs = torch.from_numpy(obs).type(self.dtype)
        return self.q_func(Variable(obs.permute(0, 3, 1, 2)))

    def update_target(self):
        self.target_q_func.load_state_dict(self.q_func.state_dict())
In DQNAgent, we need to update our target network once in a while to keep the target estimates accurate.
class DQNAgent(BaseAgent):
    ...
    def train(self, ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch):
        if self.num_param_updates % self.target_update_freq == 0:
            self.critic.update_target()
        self.num_param_updates += 1
        loss = self.critic.update(ob_batch, ac_batch, re_batch, next_ob_batch, terminal_batch)
        return loss
Note: there are many possible optimizations to make, and I may have made mistakes; please contact me in either case. Many thanks!
References
[1] CS 285 at UC Berkeley, Deep Reinforcement Learning: slides, lecture recordings, and the course GitHub repository.
[2] Levine & Koltun (2013). Guided Policy Search: deep RL with importance-sampled policy gradients.
[3] Thomas (2014). Bias in Natural Actor-Critic Algorithms. ICML 2014.
[4] Mnih, Badia, Mirza, Graves, Lillicrap, Harley, Silver & Kavukcuoglu (2016). Asynchronous Methods for Deep Reinforcement Learning: A3C, an asynchronous actor-critic method.
[5] Mnih et al. (2015). Human-Level Control through Deep Reinforcement Learning: Q-learning with convolutional networks for playing Atari.
[6] Van Hasselt, Guez & Silver (2015). Deep Reinforcement Learning with Double Q-Learning: a very effective trick to improve the performance of deep Q-learning.