Poor Evaluation Performance in PPO #425
Comments
My bad for not looking closely in #423. I think the issue is that the normalize wrappers have states which are not saved. See #310 (comment).
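For context, a minimal sketch of what saving that wrapper state could look like (the file names and the pickle layout here are placeholder assumptions, not CleanRL's actual saving code):

import pickle
import torch

# `agent` and `envs` as set up in ppo_continuous_action.py
torch.save(agent.state_dict(), "ppo_continuous_action.cleanrl_model")

# obs_rms / return_rms are the RunningMeanStd objects kept by
# gym.wrappers.NormalizeObservation / NormalizeReward; attribute reads are
# forwarded through the wrapper stack, so envs.envs[0].obs_rms reaches them.
norm_state = {
    "obs_rms": envs.envs[0].obs_rms,
    "return_rms": envs.envs[0].return_rms,
}
with open("normalize_state.pkl", "wb") as f:
    pickle.dump(norm_state, f)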
Thank you so much for your response, this indeed is an intriguing issue. I have gone through #310 and it makes sense that we need to save the obs_rms and return_rms states of the normalize wrappers. However, pickling and uploading them directly on HuggingFace might not be the most elegant solution, not to mention we would need to modify the loading code as well. I am considering that, since these are part of what the agent has learnt, we could save these two values within the agent object itself. This way, when saving, they can be pickled along with the rest of the agent:

import numpy as np
import torch
import torch.nn as nn
from torch.distributions.normal import Normal

# layer_init is the orthogonal-initialization helper from ppo_continuous_action.py

class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, np.prod(envs.single_action_space.shape)), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, np.prod(envs.single_action_space.shape)))
        # Proposed addition: keep the normalize wrappers' running statistics on the
        # agent so they are serialized together with the policy.
        self.env_obs_rms = envs.envs[0].obs_rms
        self.env_return_rms = envs.envs[0].return_rms

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)

There is also the question of whether only saving obs_rms would be sufficient for evaluation.
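If the statistics live on the agent like this, evaluation could look roughly like the sketch below (assumptions: the agent was saved with a full torch.save(agent) so the plain attributes survive pickling, eval_env is a fresh, un-normalized Hopper-v4 environment, and the gymnasium 5-tuple step API is used):

import numpy as np
import torch

agent = torch.load("agent.pt")  # full pickle, so env_obs_rms comes along
agent.eval()

obs, _ = eval_env.reset()
done = False
episodic_return = 0.0
while not done:
    # Normalize the raw observation with the training-time statistics and clip,
    # mirroring the NormalizeObservation + clip transforms applied during training.
    norm_obs = np.clip(
        (obs - agent.env_obs_rms.mean) / np.sqrt(agent.env_obs_rms.var + 1e-8), -10, 10
    )
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(
            torch.as_tensor(norm_obs, dtype=torch.float32).unsqueeze(0)
        )
    obs, reward, terminated, truncated, _ = eval_env.step(action.squeeze(0).numpy())
    episodic_return += reward
    done = terminated or truncated

Note that saving only agent.state_dict() would not carry these attributes along, since they are neither registered buffers nor parameters.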
Alternatively, we could just implement the normalize wrappers ourselves: https://github.com/openai/phasic-policy-gradient/blob/7295473f0185c82f9eb9c1e17a373135edd8aacc/phasic_policy_gradient/reward_normalizer.py#L8-L39
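For illustration, a rough sketch of what a hand-rolled normalizer could look like (class and method names here are invented for the example, not taken from the linked file):

import numpy as np


class RunningMeanStd:
    """Tracks a running mean and variance over batches of samples."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        # x has shape (batch, *shape), e.g. observations from a vector env;
        # combine the batch statistics with the running statistics.
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = (self.var * self.count + batch_var * batch_count
              + np.square(delta) * self.count * batch_count / tot_count)
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count


class ObservationNormalizer:
    """Like gym.wrappers.NormalizeObservation, but the state is explicit and
    trivial to save/load next to the model."""

    def __init__(self, shape, epsilon=1e-8):
        self.rms = RunningMeanStd(shape)
        self.epsilon = epsilon

    def __call__(self, obs, update=True):
        if update:  # set update=False at evaluation time to freeze the statistics
            self.rms.update(obs)
        return (obs - self.rms.mean) / np.sqrt(self.rms.var + self.epsilon)

    def state_dict(self):
        return {"mean": self.rms.mean, "var": self.rms.var, "count": self.rms.count}

    def load_state_dict(self, state):
        self.rms.mean, self.rms.var, self.rms.count = state["mean"], state["var"], state["count"]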
I want to confirm whether we need to save the normalize wrappers' statistics.
Yeah so that's a bit unfortunate. I guess an alternative is to load the states in the normalize wrappers along with the saved model.
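A rough sketch of that alternative, reusing the hypothetical normalize_state.pkl from the earlier example (the helper name and file layout are assumptions, and the wrapper import path depends on whether gym or gymnasium is pinned):

import pickle

from gymnasium.wrappers import NormalizeObservation  # or gym.wrappers, depending on the version


def load_obs_rms(env, obs_rms):
    # Attribute *reads* are forwarded through the wrapper stack, but writes are not,
    # so walk down to the NormalizeObservation wrapper and set obs_rms on it directly.
    current = env
    while hasattr(current, "env"):
        if isinstance(current, NormalizeObservation):
            current.obs_rms = obs_rms
            return
        current = current.env
    raise RuntimeError("No NormalizeObservation wrapper found in this env")


with open("normalize_state.pkl", "rb") as f:
    norm_state = pickle.load(f)

for single_env in eval_envs.envs:  # eval_envs: the SyncVectorEnv used for evaluation
    load_obs_rms(single_env, norm_state["obs_rms"])

Depending on the wrapper version, it may also be worth freezing further updates to these statistics during evaluation.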
Problem Description
I have been encountering poor evaluation performance when using the PPO model from HuggingFace:
https://huggingface.co/cleanrl/Hopper-v4-ppo_continuous_action-seed1
mean_reward on Hopper-v4: 3.83 +/- 5.28
Upon inspecting the TensorBoard curves provided within the HuggingFace repository, the roll-out data appeared to be normal, which has left me somewhat puzzled.
To determine whether this was just randomness, I ran experiments with three different random seeds. Sure enough, the evaluation performance remained consistently poor across all of these runs. To probe further, I performed a full-course evaluation experiment on the PPO model, again with three different random seeds. Intriguingly, the evaluation performance started out normal but then deteriorated considerably over time.
This eerily resembles an overfitting issue, although that seemed improbable, since I tried to rule out data correlation by running the experiment in parallel across four environments. Yet the poor evaluation performance persisted.
I would appreciate any insights into this issue or possible suggestions towards resolving it.
Checklist
I have installed dependencies via poetry install (see CleanRL's installation guideline).
Current Behavior
Expected Behavior
Possible Solution
Steps to Reproduce