Poor Evaluation Performance in PPO #425

Open

sdpkjc opened this issue Oct 17, 2023 · 5 comments

sdpkjc (Collaborator) commented Oct 17, 2023

Problem Description

I have been encountering poor evaluation performance when using the PPO model from HuggingFace.

https://huggingface.co/cleanrl/Hopper-v4-ppo_continuous_action-seed1

mean_reward on Hopper-v4: 3.83 +/- 5.28

[three screenshots omitted]

When I inspected the TensorBoard curves provided in the HuggingFace repository, the rollout data appeared normal, which left me puzzled.

To determine whether this was just randomness, I ran experiments with three different random seeds; sure enough, the evaluation performance remained consistently poor across all of these runs. To probe further, I ran an evaluation over the full course of training, again with three different random seeds. Intriguingly, the evaluation performance started out normal but then deteriorated considerably over time.

This eerily resembles overfitting, although that seemed improbable: I tried to rule out data correlation by running a parallel experiment across four environments, yet the poor evaluation performance persisted.

[two screenshots omitted]

I would appreciate any insights into this issue or possible suggestions towards resolving it.

Checklist

Current Behavior

Expected Behavior

Possible Solution

Steps to Reproduce

vwxyzjn (Owner) commented Oct 17, 2023

My bad for not looking closely in #423. I think the issue is that the normalize wrappers have state that is not saved. See #310 (comment)
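
For reference, a minimal sketch of where that state lives, assuming the gymnasium NormalizeObservation / NormalizeReward wrappers used in ppo_continuous_action.py; the checkpoint layout and file name here are illustrative only, not the current saving code:

import gymnasium as gym
import torch

# Each Normalize wrapper keeps a RunningMeanStd (mean/var/count) that the saved
# model weights do not include.
norm_obs_env = gym.wrappers.NormalizeObservation(gym.make("Hopper-v4"))
env = gym.wrappers.NormalizeReward(norm_obs_env, gamma=0.99)

# ... training loop runs here and updates the wrapper statistics ...

torch.save(
    {
        # "agent": agent.state_dict(),  # the weights that are saved today
        "obs_rms": norm_obs_env.obs_rms,  # observation running mean/variance
        "return_rms": env.return_rms,     # discounted-return running variance
    },
    "ppo_hopper.cleanrl_model",  # illustrative path
)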

sdpkjc (Collaborator, Author) commented Oct 17, 2023

Thank you so much for your response; this is indeed an intriguing issue. I have gone through #310, and it makes sense that we need to save obs_rms and return_rms.

However, pickling them and uploading them directly to HuggingFace might not be the most elegant solution, not to mention that we would need to modify the enjoy.py workflow.

Since these statistics are part of what the agent has learned, I am considering saving these two values on the agent object itself. That way they would be pickled along with the cleanrl_model file when saving, which would make the restoration process simpler.

import numpy as np
import torch
import torch.nn as nn
from torch.distributions.normal import Normal


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal initialization, as in ppo_continuous_action.py.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, np.prod(envs.single_action_space.shape)), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, np.prod(envs.single_action_space.shape)))

        # Running statistics from the Normalize wrappers of the first sub-environment.
        # Note: these are plain attributes, not registered buffers, so they are only
        # persisted if the whole Agent object is pickled (they won't show up in state_dict()).
        self.env_obs_rms = envs.envs[0].obs_rms
        self.env_return_rms = envs.envs[0].return_rms

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)
There is also the question of whether saving obs_rms and return_rms from only the first environment is sufficient when multiple parallel environments are used.
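
For what it's worth, here is a rough sketch of how evaluation could then use the saved statistics without updating them; the loading call and variable names are illustrative (not the actual enjoy.py code), and the +/-10 clipping mirrors the TransformObservation wrapper used during training:

import gymnasium as gym
import numpy as np
import torch

env = gym.make("Hopper-v4")
obs, _ = env.reset(seed=1)

# Assumes the whole Agent object (including env_obs_rms) was pickled, per the proposal above.
agent = torch.load("ppo_hopper.cleanrl_model")
obs_rms = agent.env_obs_rms  # training-time running mean/variance

episodic_return, done = 0.0, False
while not done:
    # Apply the same normalization the policy saw during training,
    # without updating the statistics at evaluation time.
    norm_obs = np.clip((obs - obs_rms.mean) / np.sqrt(obs_rms.var + 1e-8), -10, 10)
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(torch.Tensor(norm_obs).unsqueeze(0))
    obs, reward, terminated, truncated, _ = env.step(action.squeeze(0).numpy())
    episodic_return += reward
    done = terminated or truncated
print(f"episodic_return={episodic_return}")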

vwxyzjn (Owner) commented Oct 17, 2023

Alternatively, we could just implement the normalize wrappers ourselves: https://github.com/openai/phasic-policy-gradient/blob/7295473f0185c82f9eb9c1e17a373135edd8aacc/phasic_policy_gradient/reward_normalizer.py#L8-L39
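
To make that concrete, here is a rough, self-contained sketch in that spirit (not a copy of the linked file; class and method names are made up): it tracks a per-environment discounted return and scales rewards by the running standard deviation of that return, so its entire state is a few numpy arrays that are trivial to pickle next to the model.

import numpy as np


class RunningMeanStd:
    """Running mean/variance with a parallel (Chan-style) update."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total


class RewardNormalizer:
    """Scales rewards by the std of a running discounted return."""

    def __init__(self, num_envs, gamma=0.99, epsilon=1e-8):
        self.return_rms = RunningMeanStd(shape=())
        self.returns = np.zeros(num_envs)
        self.gamma, self.epsilon = gamma, epsilon

    def __call__(self, rewards, dones):
        # Accumulate the discounted return per env, resetting where episodes ended.
        self.returns = self.returns * self.gamma * (1.0 - dones) + rewards
        self.return_rms.update(self.returns)
        return rewards / np.sqrt(self.return_rms.var + self.epsilon)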

sdpkjc (Collaborator, Author) commented Dec 7, 2023

I want to confirm whether we need to save the NormalizeReward wrapper's state as well. If we don't save it, our policy won't be able to continue training after being downloaded and can only be used for inference. Saving it would require adding our own NormalizeObservation and NormalizeReward wrapper implementations to the code, which could make our single-file implementation overly lengthy.

vwxyzjn (Owner) commented Dec 7, 2023

Yeah, so that's a bit unfortunate. I guess an alternative is to load the states into the normalize wrappers along with the saved model.
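
For instance, a sketch of that loading path, assuming the statistics were saved next to the weights as in the earlier snippet (the checkpoint keys and path are illustrative):

import gymnasium as gym
import torch

checkpoint = torch.load("ppo_hopper.cleanrl_model")  # illustrative combined checkpoint

norm_obs_env = gym.wrappers.NormalizeObservation(gym.make("Hopper-v4"))
env = gym.wrappers.NormalizeReward(norm_obs_env, gamma=0.99)

# agent.load_state_dict(checkpoint["agent"])   # restore the policy/value weights
norm_obs_env.obs_rms = checkpoint["obs_rms"]    # restore observation statistics
env.return_rms = checkpoint["return_rms"]       # restore return statistics
# Training (or evaluation) can then continue with consistent normalization.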
