mapolicy

Source code: tianshou/policy/multiagent/mapolicy.py

class MultiAgentPolicyManager(*, policies: list[BasePolicy], env: PettingZooEnv, action_scaling: bool = False, action_bound_method: Optional[Literal['clip', 'tanh']] = 'clip', lr_scheduler: torch.optim.lr_scheduler.LRScheduler | MultipleLRSchedulers | None = None)

Multi-agent policy manager for MARL.

This multi-agent policy manager accepts a list of BasePolicy instances. When “forward” is called, it dispatches the batch data to each of these policies. “process_fn” and “learn” work the same way: each splits the data and feeds the pieces to the corresponding policy. A figure in Multi-Agent Reinforcement Learning illustrates this procedure.

Parameters:

policies – a list of policies.
env – a PettingZooEnv.
action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Only used if the action_space is continuous.
action_bound_method – method to bound action to range [-1, 1]. Only used if the action_space is continuous.
lr_scheduler – if not None, will be called in policy.update().
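
The interaction between action_bound_method and action_scaling can be sketched as follows. This is an illustrative NumPy sketch, not the library's implementation: the helper name bound_and_scale and its low/high arguments are hypothetical, and the linear mapping from [-1, 1] to [low, high] is an assumption about what "scale" means here.

```python
import numpy as np


def bound_and_scale(act, low, high, bound_method="clip", scaling=True):
    """Hypothetical helper: bound a raw action to [-1, 1], then rescale it."""
    act = np.asarray(act, dtype=float)
    if bound_method == "clip":
        # Hard-limit every component to [-1, 1].
        act = np.clip(act, -1.0, 1.0)
    elif bound_method == "tanh":
        # Squash every component smoothly into (-1, 1).
        act = np.tanh(act)
    if scaling:
        # Assumed linear map from [-1, 1] to [low, high].
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


scaled = bound_and_scale([2.0, -1.0, 0.0], low=0.0, high=10.0)
print(scaled)
```

With bound_method="clip", the raw value 2.0 is first limited to 1.0 and then mapped to the upper end of the range.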

exploration_noise(act: numpy.ndarray | BatchProtocol, batch: RolloutBatchProtocol) → numpy.ndarray | BatchProtocol

Add exploration noise from sub-policy onto act.

forward(batch: Batch, state: dict | Batch | None = None, **kwargs: Any) → Batch

Dispatch batch data to every policy’s forward according to obs.agent_id.

Parameters:

state – if None, no agent has a state. If not None, it should contain the keys “agent_1”, “agent_2”, …

Returns:

a Batch with the following contents:

{
    "act": actions corresponding to the input
    "state": {
        "agent_1": output state of agent_1's policy for the state
        "agent_2": xxx
        ...
        "agent_n": xxx}
    "out": {
        "agent_1": output of agent_1's policy for the input
        "agent_2": xxx
        ...
        "agent_n": xxx}
}
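
The dispatch described above can be sketched in plain NumPy. This is a simplified illustration rather than the actual method: dispatch_forward is a hypothetical helper, plain dicts and arrays stand in for Batch, the sub-policies are plain callables that return integer actions, and recurrent state is omitted.

```python
import numpy as np


def dispatch_forward(agent_ids, obs, policies):
    """Hypothetical sketch: route each row to the policy of its agent."""
    agent_ids = np.asarray(agent_ids)
    # -1 marks rows that no policy has filled in.
    act = np.full(len(agent_ids), -1, dtype=np.int64)
    out = {}
    for agent, policy in policies.items():
        # Indices of the rows that belong to this agent.
        idx = np.nonzero(agent_ids == agent)[0]
        if len(idx) == 0:
            continue
        agent_act = policy(obs[idx])
        act[idx] = agent_act      # merge back into the original row order
        out[agent] = agent_act    # keep the per-agent output as well
    return {"act": act, "out": out}


# Stand-in policies (assumptions): agent_1 always picks 0, agent_2 always 1.
policies = {
    "agent_1": lambda o: np.zeros(len(o), dtype=np.int64),
    "agent_2": lambda o: np.ones(len(o), dtype=np.int64),
}
obs = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
result = dispatch_forward(["agent_1", "agent_2", "agent_1"], obs, policies)
print(result["act"])
```

The merged "act" array keeps the row order of the input, while "out" stays keyed by agent, mirroring the structure listed above.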

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) → dict[str, float | list[float]]

Dispatch the data to all policies for learning.

Returns:

a dict with the following contents:

{
    "agent_1/item1": item 1 of agent_1's policy.learn output
    "agent_1/item2": item 2 of agent_1's policy.learn output
    "agent_2/xxx": xxx
    ...
    "agent_n/xxx": xxx
}
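
The key layout above amounts to prefixing each sub-policy's result keys with the agent name. A minimal sketch, assuming each policy's learn output is already available as a plain dict (merge_learn_results is a hypothetical helper, not part of the library):

```python
def merge_learn_results(per_agent_results):
    """Hypothetical sketch: flatten per-agent results into "agent/key" entries."""
    merged = {}
    for agent, result in per_agent_results.items():
        for key, value in result.items():
            merged[f"{agent}/{key}"] = value
    return merged


# Made-up example values, for illustration only.
merged = merge_learn_results(
    {
        "agent_1": {"loss": 0.5},
        "agent_2": {"loss": 0.25, "grad_norm": [0.1, 0.2]},
    }
)
print(merged)
```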

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indice: ndarray) → BatchProtocol

Dispatch batch data to every policy’s process_fn according to obs.agent_id.

The original multi-dimensional rew is saved in “save_rew”. While each agent’s “process_fn” runs, rew is set to that agent’s reward, and the original reward is restored afterwards.
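
The save-and-restore pattern for rew can be sketched as follows. This is an illustration under stated assumptions, not the library code: process_per_agent is a hypothetical helper, a plain dict stands in for the batch, column i of rew is assumed to hold the reward of the i-th agent, and the per-agent splitting by obs.agent_id is left out.

```python
import numpy as np


def process_per_agent(batch, agents, process_fns):
    """Hypothetical sketch: give each process_fn only its own agent's reward."""
    save_rew = batch["rew"]  # original array, shape (batch_size, n_agents)
    results = {}
    try:
        for i, agent in enumerate(agents):
            # Temporarily replace rew with this agent's reward column.
            batch["rew"] = save_rew[:, i]
            results[agent] = process_fns[agent](batch)
    finally:
        # Restore the original multi-dimensional reward.
        batch["rew"] = save_rew
    return results


def sum_rewards(b):
    # Stand-in for a real process_fn (assumption): sum the visible rewards.
    return float(b["rew"].sum())


batch = {"rew": np.array([[1.0, 10.0], [2.0, 20.0]])}
results = process_per_agent(
    batch,
    agents=["agent_1", "agent_2"],
    process_fns={"agent_1": sum_rewards, "agent_2": sum_rewards},
)
print(results, batch["rew"].shape)
```

Restoring in a finally block is a design choice of this sketch, so the batch keeps its original reward even if one process_fn raises.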

replace_policy(policy: BasePolicy, agent_id: int) → None

Replace the “agent_id”th policy in this manager.

train(mode: bool = True) → Self

Set each internal policy in training mode.