Policy#

In reinforcement learning, the agent interacts with environments to improve itself. In this tutorial we will concentrate on the agent part. In Tianshou, both the agent and the core DRL algorithm are implemented in the Policy module. Tianshou provides more than 20 Policy modules, each representing one DRL algorithm. See supported algorithms here.

The agents interacting with the environment

All Policy modules inherit from the BasePolicy class and share the same interface.

Creating your own Policy#

We will use the simple REINFORCE algorithm Policy to show the implementation of a Policy Module. The Policy we implement here will be a highly scaled-down version of PGPolicy in Tianshou.

Initialization#

Firstly we create the REINFORCEPolicy by inheriting from BasePolicy in Tianshou.

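class REINFORCEPolicy(BasePolicy):
    """Implementation of REINFORCE algorithm."""

    def __init__(self, action_space: gym.Space):
        # BasePolicy needs the action space, so it is taken as a constructor argument.
        super().__init__(action_space=action_space)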

As we have mentioned, the Policy Module mainly does two things:

policy.forward() receives observation and other information (stored in a Batch) from the environment and returns a new Batch containing the action.
policy.update() receives training data sampled from the replay buffer and updates itself, and then returns logging details.

policy.forward() and policy.update()

We also need to take care of the following things:

Since Tianshou is a deep RL library, our Policy module needs a policy network and a Torch optimizer.
In Tianshou’s BasePolicy, Policy.update() first calls Policy.process_fn() to preprocess the training data and compute quantities such as episodic returns (no gradients are involved in this step), and then calls Policy.learn() to perform back-propagation.
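The call order described above can be sketched without Tianshou at all. The class below is a hypothetical stand-in: SketchPolicy and its method bodies are illustrative assumptions, not Tianshou's BasePolicy, and the batch is a plain dict. It only records which hook runs when.

```python
class SketchPolicy:
    """Hypothetical stand-in that mimics the update() -> process_fn() -> learn() order."""

    def __init__(self):
        self.calls = []  # records the order in which the hooks run

    def process_fn(self, batch):
        # Gradient-free preprocessing, e.g. attaching returns to the batch.
        self.calls.append("process_fn")
        return {**batch, "returns": list(batch["rew"])}

    def learn(self, batch):
        # The gradient step would happen here; this sketch only reports a dummy loss.
        self.calls.append("learn")
        return {"loss": [0.0]}

    def update(self, batch):
        # Preprocess first, then hand the processed batch to learn().
        processed = self.process_fn(batch)
        return self.learn(processed)


sketch = SketchPolicy()
stats = sketch.update({"rew": [1.0, 1.0, 1.0]})
print(sketch.calls)  # ['process_fn', 'learn']
```

A subclass following this pattern usually only overrides the two hooks and leaves update() alone.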

Then we get the implementation below.

class REINFORCEPolicy(BasePolicy):
    """Implementation of REINFORCE algorithm."""

    def __init__(
        self, model: torch.nn.Module, optim: torch.optim.Optimizer, action_space: gym.Space
    ):
        super().__init__(action_space=action_space)
        self.actor = model
        self.optim = optim

    def forward(self, batch: Batch) -> Batch:
        """Compute action over the given batch data."""
        act = None
        return Batch(act=act)

    def process_fn(self, batch: Batch, buffer: ReplayBuffer, indices: np.ndarray) -> Batch:
        """Compute the discounted returns for each transition."""
        pass

    def learn(self, batch: Batch, batch_size: int, repeat: int) -> Dict[str, List[float]]:
        """Perform the back-propagation."""
        return

Policy.forward()#

According to the equation of the REINFORCE algorithm in Spinning Up’s documentation, we need to map the observation to an action distribution over the action space using a neural network (self.actor).
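For reference, the gradient estimate that the code on this page works towards can be written as below. This is a sketch in return-to-go form with a discount factor, which is the weighting used by the code here; the symbols G_t, γ and T are notation introduced for this block, and Spinning Up's own presentation may differ in detail.

```latex
\nabla_\theta J(\pi_\theta)
  \approx \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t \right],
\qquad
G_t = \sum_{k=t}^{T} \gamma^{\,k-t} \, r_k
```

Here \pi_\theta(a_t \mid s_t) is the probability that the actor network assigns to the action actually taken.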

Let us suppose the action space is discrete, and the distribution is a simple categorical distribution.

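def forward(self, batch: Batch) -> Batch:
    """Compute action over the given batch data."""
    # A categorical distribution over the discrete actions.
    self.dist_fn = torch.distributions.Categorical
    # Tianshou's discrete Actor returns a tuple (output, hidden_state); only the output is needed.
    logits, _ = self.actor(batch.obs)
    dist = self.dist_fn(logits)
    act = dist.sample()
    return Batch(act=act, dist=dist)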
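What sampling from a categorical distribution amounts to can be sketched with the standard library alone. The helpers softmax and sample_action below are hypothetical names used only for illustration; they are not part of Tianshou or PyTorch. Note that in PyTorch the first positional argument of Categorical is probs, so the actor output passed to it is interpreted as probabilities rather than raw scores.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw one action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
probs = softmax([0.0, 0.0])  # equal scores give a uniform distribution
actions = [sample_action(probs, rng) for _ in range(8)]
print(probs)  # [0.5, 0.5]
print(actions)
```

Each entry of actions is an index into the action space, which is what the act field of the returned Batch holds for every observation.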

Policy.process_fn()#

Now that we have defined our actor, if given training data we can set up a loss function and optimize our neural network. However, before that, we must first calculate episodic returns for every step in our training data to construct the REINFORCE loss function.

Calculating the episodic return is not hard, since ReplayBuffer.next() lets us access every subsequent reward in an episode. A more convenient way is to use the built-in method compute_episodic_return() inherited from BasePolicy.

def process_fn(self, batch: Batch, buffer: ReplayBuffer, indices: np.ndarray) -> Batch:
    """Compute the discounted returns for each transition."""
    returns, _ = self.compute_episodic_return(batch, buffer, indices, gamma=0.99, gae_lambda=1.0)
    batch.returns = returns
    return batch

BasePolicy.compute_episodic_return() could also be used to compute GAE. Another similar method is BasePolicy.compute_nstep_return(). Check the source code for more details.
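To make the quantity concrete, here is a plain-Python sketch of discounted returns. It assumes no value bootstrapping and is not Tianshou's compute_episodic_return(); the function name discounted_returns and the use of plain lists are choices made for this example.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Return-to-go for each step; the running sum restarts after a done step."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of its successor.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # nothing after the end of an episode counts
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A three-step episode with a reward of 1 at every step.
rets = discounted_returns([1.0, 1.0, 1.0], [False, False, True])
print([round(g, 4) for g in rets])  # [2.9701, 1.99, 1.0]
```

Earlier steps receive larger returns because more discounted reward still lies ahead of them.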

Policy.learn()#

The data batch returned by Policy.process_fn() flows into Policy.learn(). Finally, we can construct our loss function and perform back-propagation.

def learn(self, batch: Batch, batch_size: int, repeat: int) -> Dict[str, List[float]]:
    """Perform the back-propagation."""
    logging_losses = []
    for _ in range(repeat):
        for minibatch in batch.split(batch_size, merge_last=True):
            self.optim.zero_grad()
            result = self(minibatch)
            dist = result.dist
            act = to_torch_as(minibatch.act, result.act)
            ret = to_torch(minibatch.returns, torch.float, result.act.device)
            log_prob = dist.log_prob(act).reshape(len(ret), -1).transpose(0, 1)
            loss = -(log_prob * ret).mean()
            loss.backward()
            self.optim.step()
            logging_losses.append(loss.item())
    return {"loss": logging_losses}
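The loss line computes the negative mean of log-probability times return. A small stdlib-only sketch with made-up numbers shows the arithmetic; reinforce_loss is a hypothetical helper, and the probabilities and returns are illustrative values rather than output from the policy above.

```python
import math


def reinforce_loss(action_probs, returns):
    """Negative mean of log-probability times return over a minibatch."""
    weighted = [math.log(p) * g for p, g in zip(action_probs, returns)]
    return -sum(weighted) / len(weighted)


# Probability the policy assigned to each action that was actually taken,
# paired with the return that followed it (hypothetical numbers).
loss = reinforce_loss([0.5, 0.5, 0.5], [2.9701, 1.99, 1.0])
print(loss)
```

Minimizing this value raises the log-probability of actions that were followed by high returns.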

Implementation#

Finally we can assemble the implemented methods and form a REINFORCE Policy.

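# Imports used by the code on this page. `gym` is assumed to be Gymnasium,
# which matches the five-value env.step() calls used further down.
from typing import Dict, List

import gymnasium as gym
import numpy as np
import torch

from tianshou.data import Batch, ReplayBuffer, to_torch, to_torch_as
from tianshou.policy import BasePolicy
from tianshou.utils.net.common import Net
from tianshou.utils.net.discrete import Actor


class REINFORCEPolicy(BasePolicy):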
    """Implementation of REINFORCE algorithm."""

    def __init__(
        self,
        *,
        model: torch.nn.Module,
        optim: torch.optim.Optimizer,
        action_space: gym.Space,
    ):
        super().__init__(action_space=action_space)
        self.actor = model
        self.optim = optim
        # action distribution
        self.dist_fn = torch.distributions.Categorical

    def forward(self, batch: Batch) -> Batch:
        """Compute action over the given batch data."""
        logits, _ = self.actor(batch.obs)
        dist = self.dist_fn(logits)
        act = dist.sample()
        return Batch(act=act, dist=dist)

    def process_fn(self, batch: Batch, buffer: ReplayBuffer, indices: np.ndarray) -> Batch:
        """Compute the discounted returns for each transition."""
        returns, _ = self.compute_episodic_return(
            batch, buffer, indices, gamma=0.99, gae_lambda=1.0
        )
        batch.returns = returns
        return batch

    def learn(self, batch: Batch, batch_size: int, repeat: int) -> Dict[str, List[float]]:
        """Perform the back-propagation."""
        logging_losses = []
        for _ in range(repeat):
            for minibatch in batch.split(batch_size, merge_last=True):
                self.optim.zero_grad()
                result = self(minibatch)
                dist = result.dist
                act = to_torch_as(minibatch.act, result.act)
                ret = to_torch(minibatch.returns, torch.float, result.act.device)
                log_prob = dist.log_prob(act).reshape(len(ret), -1).transpose(0, 1)
                loss = -(log_prob * ret).mean()
                loss.backward()
                self.optim.step()
                logging_losses.append(loss.item())
        return {"loss": logging_losses}

Use the policy#

Note that BasePolicy itself inherits from torch.nn.Module. As a result, you can consider all Policy modules as a Torch Module. They share similar APIs.

Firstly we will initialize a new REINFORCE policy.

state_shape = 4
action_shape = 2
# Usually taken from an env by using env.action_space.
# The policy samples from a Categorical distribution, so the space is discrete (two actions).
action_space = gym.spaces.Discrete(2)
net = Net(state_shape, hidden_sizes=[16, 16], device="cpu")
actor = Actor(net, action_shape, device="cpu").to("cpu")
optim = torch.optim.Adam(actor.parameters(), lr=0.0003)

policy = REINFORCEPolicy(model=actor, optim=optim, action_space=action_space)

The REINFORCE policy shares the same APIs as a Torch Module.

print(policy)
print("========================================")
for para in policy.parameters():
    print(para.shape)

REINFORCEPolicy(
  (actor): Actor(
    (preprocess): Net(
      (model): MLP(
        (model): Sequential(
          (0): Linear(in_features=4, out_features=16, bias=True)
          (1): ReLU()
          (2): Linear(in_features=16, out_features=16, bias=True)
          (3): ReLU()
        )
      )
    )
    (last): MLP(
      (model): Sequential(
        (0): Linear(in_features=16, out_features=2, bias=True)
      )
    )
  )
)
========================================
torch.Size([16, 4])
torch.Size([16])
torch.Size([16, 16])
torch.Size([16])
torch.Size([2, 16])
torch.Size([2])

Making decisions#

Given a batch of observations, the policy can return a batch of actions and other data.

obs_batch = Batch(obs=np.ones(shape=(256, 4)))
action = policy(obs_batch)  # forward() method is called
print(action)

Batch(
    act: tensor([0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,
                 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
                 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
                 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0,
                 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0,
                 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
                 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
                 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0,
                 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0,
                 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]),
    dist: Categorical(probs: torch.Size([256, 2])),
)

Save and Load models#

Naturally, Tianshou Policy can be saved and loaded like a normal Torch Network.

torch.save(policy.state_dict(), "policy.pth")
assert policy.load_state_dict(torch.load("policy.pth"))

Algorithm Updating#

We have to collect some data and save it in the ReplayBuffer before updating our agent (policy). Typically we use a Collector to collect data, but we leave that part until later, once we have covered the Collector in Tianshou. For now we generate some fake data.

Generating fake data#

Firstly, we need to “pretend” that we are using the “Policy” to collect data. We plan to collect 10 transitions so that we can update our algorithm.

# a buffer is initialised with its maxsize set to 12.
print("========================================")
buf = ReplayBuffer(size=12)
print(buf)
print("maxsize: {}, data length: {}".format(buf.maxsize, len(buf)))
env = gym.make("CartPole-v1")

========================================
ReplayBuffer()
maxsize: 12, data length: 0
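The difference between maxsize (capacity) and len(buf) (transitions stored so far) can be illustrated with a tiny fixed-capacity buffer. RingBuffer below is a hypothetical sketch written for this page; it is not Tianshou's ReplayBuffer and makes no claim about how that class is implemented.

```python
class RingBuffer:
    """Hypothetical fixed-capacity buffer; not Tianshou's ReplayBuffer."""

    def __init__(self, size):
        self.maxsize = size
        self._data = [None] * size  # storage is allocated up front
        self._next = 0              # slot the next item is written to
        self._len = 0               # number of slots filled so far

    def add(self, item):
        self._data[self._next] = item
        self._next = (self._next + 1) % self.maxsize  # wrap around when full
        self._len = min(self._len + 1, self.maxsize)

    def __len__(self):
        return self._len


ring = RingBuffer(size=12)
for step in range(10):
    ring.add({"id": step})
print("maxsize: {}, data length: {}".format(ring.maxsize, len(ring)))
# maxsize: 12, data length: 10
```

In this sketch the length stops growing once it reaches the capacity, and further additions overwrite the oldest slots.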

Now we are pretending to collect the first episode. The first episode ends at step 3 (perhaps because we are performing too badly).

obs, info = env.reset()
for i in range(3):
    act = policy(Batch(obs=obs[np.newaxis, :])).act.item()
    obs_next, rew, _, truncated, info = env.step(act)
    # pretend ending at step 3
    terminated = i == 2
    info["id"] = i
    buf.add(
        Batch(
            obs=obs,
            act=act,
            rew=rew,
            terminated=terminated,
            truncated=truncated,
            obs_next=obs_next,
            info=info,
        )
    )
    obs = obs_next

print(buf)

ReplayBuffer(
    obs: array([[ 0.00107117,  0.03532007, -0.02566063,  0.03666544],
                [ 0.00177757,  0.23080042, -0.02492732, -0.26400194],
                [ 0.00639358,  0.42626914, -0.03020736, -0.5644418 ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ],
                [ 0.        ,  0.        ,  0.        ,  0.        ]],
               dtype=float32),
    act: array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    rew: array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
    terminated: array([False, False,  True, False, False, False, False, False, False,
                       False, False, False]),
    truncated: array([False, False, False, False, False, False, False, False, False,
                      False, False, False]),
    obs_next: array([[ 0.00177757,  0.23080042, -0.02492732, -0.26400194],
                     [ 0.00639358,  0.42626914, -0.03020736, -0.5644418 ],
                     [ 0.01491896,  0.6218016 , -0.0414962 , -0.8664863 ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ]],
                    dtype=float32),
    info: Batch(
              id: array([0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
          ),
    done: array([False, False,  True, False, False, False, False, False, False,
                 False, False, False]),
)

Now we are pretending to collect the second episode. At step 7 the second episode still doesn’t end, but we are unwilling to wait, so we stop collecting to update the algorithm.

obs, info = env.reset()
for i in range(3, 10):
    act = policy(Batch(obs=obs[np.newaxis, :])).act.item()
    obs_next, rew, _, truncated, info = env.step(act)
    # pretend this episode never ends
    terminated = False
    info["id"] = i
    buf.add(
        Batch(
            obs=obs,
            act=act,
            rew=rew,
            terminated=terminated,
            truncated=truncated,
            obs_next=obs_next,
            info=info,
        )
    )
    obs = obs_next

Our replay buffer looks like this now.

print(buf)
print("maxsize: {}, data length: {}".format(buf.maxsize, len(buf)))

ReplayBuffer(
    obs: array([[ 1.0711700e-03,  3.5320066e-02, -2.5660632e-02,  3.6665443e-02],
                [ 1.7775714e-03,  2.3080042e-01, -2.4927324e-02, -2.6400194e-01],
                [ 6.3935798e-03,  4.2626914e-01, -3.0207362e-02, -5.6444180e-01],
                [-8.3724856e-03, -5.0253076e-03, -2.6548475e-02,  4.5823455e-02],
                [-8.4729921e-03,  1.9046707e-01, -2.5632005e-02, -2.5511611e-01],
                [-4.6636509e-03, -4.2797043e-03, -3.0734329e-02,  2.9373109e-02],
                [-4.7492445e-03, -1.9894773e-01, -3.0146865e-02,  3.1220278e-01],
                [-8.7281996e-03, -3.9362752e-01, -2.3902809e-02,  5.9522796e-01],
                [-1.6600750e-02, -5.8840692e-01, -1.1998251e-02,  8.8028681e-01],
                [-2.8368888e-02, -7.8336382e-01,  5.6074853e-03,  1.1691737e+00],
                [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00],
                [ 0.0000000e+00,  0.0000000e+00,  0.0000000e+00,  0.0000000e+00]],
               dtype=float32),
    act: array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
    rew: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0.]),
    terminated: array([False, False,  True, False, False, False, False, False, False,
                       False, False, False]),
    truncated: array([False, False, False, False, False, False, False, False, False,
                      False, False, False]),
    obs_next: array([[ 0.00177757,  0.23080042, -0.02492732, -0.26400194],
                     [ 0.00639358,  0.42626914, -0.03020736, -0.5644418 ],
                     [ 0.01491896,  0.6218016 , -0.0414962 , -0.8664863 ],
                     [-0.00847299,  0.19046707, -0.02563201, -0.2551161 ],
                     [-0.00466365, -0.0042797 , -0.03073433,  0.02937311],
                     [-0.00474924, -0.19894773, -0.03014687,  0.31220278],
                     [-0.0087282 , -0.39362752, -0.02390281,  0.59522796],
                     [-0.01660075, -0.5884069 , -0.01199825,  0.8802868 ],
                     [-0.02836889, -0.7833638 ,  0.00560749,  1.1691737 ],
                     [-0.04403616, -0.9785583 ,  0.02899096,  1.4636093 ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ],
                     [ 0.        ,  0.        ,  0.        ,  0.        ]],
                    dtype=float32),
    info: Batch(
              id: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 0]),
          ),
    done: array([False, False,  True, False, False, False, False, False, False,
                 False, False, False]),
)
maxsize: 12, data length: 10
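In the printout above, done is True exactly where terminated or truncated is True, which is consistent with done being their logical OR. The sketch below reproduces the flags of the ten stored steps as plain lists and groups the step indices into episodes; split_episodes is a hypothetical helper, not a Tianshou function.

```python
terminated = [False, False, True] + [False] * 7
truncated = [False] * 10

# A step closes an episode if it was either terminated or truncated.
done = [t or u for t, u in zip(terminated, truncated)]


def split_episodes(done_flags):
    """Group step indices into episodes; the last group may be unfinished."""
    episodes, current = [], []
    for i, flag in enumerate(done_flags):
        current.append(i)
        if flag:
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)  # trailing steps of an episode still in progress
    return episodes


episodes = split_episodes(done)
print(episodes)  # [[0, 1, 2], [3, 4, 5, 6, 7, 8, 9]]
```

The second group has no closing done flag, which reflects that the second episode was cut off before it ended.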

Updates#

Now we have a replay buffer holding 10 transitions. We can call Policy.update() to train.

# 0 means sample all data from the buffer
# batch_size=10 defines the training batch size
# repeat=6 means repeating the training 6 times
policy.update(0, buf, batch_size=10, repeat=6)

{'loss': [2.22137713432312,
  2.219578266143799,
  2.2177863121032715,
  2.21600604057312,
  2.2142333984375,
  2.2124671936035156]}
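The loss list has six entries because, with 10 transitions and batch_size=10, each repeat yields a single minibatch. The sketch below counts minibatches under an assumption about merge_last=True, namely that a final chunk smaller than batch_size is folded into the previous one. split_sizes and count_updates are hypothetical helpers, not Tianshou functions.

```python
def split_sizes(length, size, merge_last=True):
    """Sizes of the minibatches when cutting `length` items into chunks of `size`."""
    merge = merge_last and length % size > 0
    sizes = []
    for start in range(0, length, size):
        # Fold a short final chunk into the one before it instead of yielding it alone.
        if merge and start + 2 * size >= length:
            sizes.append(length - start)
            break
        sizes.append(min(size, length - start))
    return sizes


def count_updates(length, batch_size, repeat):
    """Number of optimizer steps: one per minibatch, for every repeat."""
    return repeat * len(split_sizes(length, batch_size))


print(split_sizes(10, 10))       # [10]
print(count_updates(10, 10, 6))  # 6
print(split_sizes(10, 4))        # [4, 6]
```

With a smaller batch size, such as 4, the same buffer would give two minibatches per repeat and therefore twelve loss values.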

Not that difficult, right?