Poor Evaluation Performance in PPO #425

Open

sdpkjc opened this issue Oct 17, 2023 · 5 comments

sdpkjc (Collaborator) commented Oct 17, 2023

Problem Description

I have been encountering poor evaluation performance when using the PPO model from HuggingFace.

https://huggingface.co/cleanrl/Hopper-v4-ppo_continuous_action-seed1

mean_reward on Hopper-v4: 3.83 +/- 5.28

[three screenshots omitted]

When I inspected the TensorBoard curves provided in the HuggingFace repository, the rollout data appeared normal, which left me puzzled.

To determine whether this was just randomness, I ran experiments with three different random seeds; sure enough, the evaluation performance remained consistently poor across all of these runs. To probe further, I ran an evaluation over the full course of training, again with three different random seeds. Intriguingly, the evaluation performance started out normal but then deteriorated considerably over time.

This eerily resembles overfitting, although that seemed improbable: I tried to rule out data correlation by running a parallel experiment across four environments, yet the poor evaluation performance persisted.

[two screenshots omitted]

I would appreciate any insights into this issue or possible suggestions towards resolving it.

Checklist

Current Behavior

Expected Behavior

Possible Solution

Steps to Reproduce

vwxyzjn (Owner) commented Oct 17, 2023

My bad for not looking closely in #423. I think the issue is that the normalize wrappers have state that is not saved. See #310 (comment)
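
For reference, a minimal sketch of where that state lives, assuming the gymnasium NormalizeObservation / NormalizeReward wrappers used in ppo_continuous_action.py; the checkpoint layout and file name here are illustrative only, not the current saving code:

import gymnasium as gym
import torch

# Each Normalize wrapper keeps a RunningMeanStd (mean/var/count) that the saved
# model weights do not include.
norm_obs_env = gym.wrappers.NormalizeObservation(gym.make("Hopper-v4"))
env = gym.wrappers.NormalizeReward(norm_obs_env, gamma=0.99)

# ... training loop runs here and updates the wrapper statistics ...

torch.save(
    {
        # "agent": agent.state_dict(),  # the weights that are saved today
        "obs_rms": norm_obs_env.obs_rms,  # observation running mean/variance
        "return_rms": env.return_rms,     # discounted-return running variance
    },
    "ppo_hopper.cleanrl_model",  # illustrative path
)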

sdpkjc (Collaborator, Author) commented Oct 17, 2023

Thank you so much for your response; this is indeed an intriguing issue. I have gone through #310, and it makes sense that we need to save obs_rms and return_rms.

However, pickling them and uploading them directly to HuggingFace might not be the most elegant solution, not to mention that we would need to modify the enjoy.py workflow.

Since these statistics are part of what the agent has learned, I am considering saving these two values on the agent object itself. That way they would be pickled along with the cleanrl_model file when saving, which would make the restoration process simpler.

import numpy as np
import torch
import torch.nn as nn
from torch.distributions.normal import Normal


def layer_init(layer, std=np.sqrt(2), bias_const=0.0):
    # Orthogonal initialization, as in ppo_continuous_action.py.
    torch.nn.init.orthogonal_(layer.weight, std)
    torch.nn.init.constant_(layer.bias, bias_const)
    return layer


class Agent(nn.Module):
    def __init__(self, envs):
        super().__init__()
        self.critic = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 1), std=1.0),
        )
        self.actor_mean = nn.Sequential(
            layer_init(nn.Linear(np.array(envs.single_observation_space.shape).prod(), 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, 64)),
            nn.Tanh(),
            layer_init(nn.Linear(64, np.prod(envs.single_action_space.shape)), std=0.01),
        )
        self.actor_logstd = nn.Parameter(torch.zeros(1, np.prod(envs.single_action_space.shape)))

        # Running statistics from the Normalize wrappers of the first sub-environment.
        # Note: these are plain attributes, not registered buffers, so they are only
        # persisted if the whole Agent object is pickled (they won't show up in state_dict()).
        self.env_obs_rms = envs.envs[0].obs_rms
        self.env_return_rms = envs.envs[0].return_rms

    def get_value(self, x):
        return self.critic(x)

    def get_action_and_value(self, x, action=None):
        action_mean = self.actor_mean(x)
        action_logstd = self.actor_logstd.expand_as(action_mean)
        action_std = torch.exp(action_logstd)
        probs = Normal(action_mean, action_std)
        if action is None:
            action = probs.sample()
        return action, probs.log_prob(action).sum(1), probs.entropy().sum(1), self.critic(x)
There is also the question of whether saving obs_rms and return_rms from only the first environment is sufficient when multiple parallel environments are used.
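
For what it's worth, here is a rough sketch of how evaluation could then use the saved statistics without updating them; the loading call and variable names are illustrative (not the actual enjoy.py code), and the +/-10 clipping mirrors the TransformObservation wrapper used during training:

import gymnasium as gym
import numpy as np
import torch

env = gym.make("Hopper-v4")
obs, _ = env.reset(seed=1)

# Assumes the whole Agent object (including env_obs_rms) was pickled, per the proposal above.
agent = torch.load("ppo_hopper.cleanrl_model")
obs_rms = agent.env_obs_rms  # training-time running mean/variance

episodic_return, done = 0.0, False
while not done:
    # Apply the same normalization the policy saw during training,
    # without updating the statistics at evaluation time.
    norm_obs = np.clip((obs - obs_rms.mean) / np.sqrt(obs_rms.var + 1e-8), -10, 10)
    with torch.no_grad():
        action, _, _, _ = agent.get_action_and_value(torch.Tensor(norm_obs).unsqueeze(0))
    obs, reward, terminated, truncated, _ = env.step(action.squeeze(0).numpy())
    episodic_return += reward
    done = terminated or truncated
print(f"episodic_return={episodic_return}")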

vwxyzjn (Owner) commented Oct 17, 2023

Alternatively, we could just implement the normalize wrappers ourselves: https://github.com/openai/phasic-policy-gradient/blob/7295473f0185c82f9eb9c1e17a373135edd8aacc/phasic_policy_gradient/reward_normalizer.py#L8-L39
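
To make that concrete, here is a rough, self-contained sketch in that spirit (not a copy of the linked file; class and method names are made up): it tracks a per-environment discounted return and scales rewards by the running standard deviation of that return, so its entire state is a few numpy arrays that are trivial to pickle next to the model.

import numpy as np


class RunningMeanStd:
    """Running mean/variance with a parallel (Chan-style) update."""

    def __init__(self, shape=()):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total


class RewardNormalizer:
    """Scales rewards by the std of a running discounted return."""

    def __init__(self, num_envs, gamma=0.99, epsilon=1e-8):
        self.return_rms = RunningMeanStd(shape=())
        self.returns = np.zeros(num_envs)
        self.gamma, self.epsilon = gamma, epsilon

    def __call__(self, rewards, dones):
        # Accumulate the discounted return per env, resetting where episodes ended.
        self.returns = self.returns * self.gamma * (1.0 - dones) + rewards
        self.return_rms.update(self.returns)
        return rewards / np.sqrt(self.return_rms.var + self.epsilon)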

sdpkjc (Collaborator, Author) commented Dec 7, 2023

I want to confirm whether we need to save the NormalizeReward wrapper's state as well. If we don't save it, our policy won't be able to continue training after being downloaded and can only be used for inference. Saving it would require adding our own NormalizeObservation and NormalizeReward wrapper implementations to the code, which could make our single-file implementation overly lengthy.

vwxyzjn (Owner) commented Dec 7, 2023

Yeah, so that's a bit unfortunate. I guess an alternative is to load the states into the normalize wrappers along with the saved model.
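
For instance, a sketch of that loading path, assuming the statistics were saved next to the weights as in the earlier snippet (the checkpoint keys and path are illustrative):

import gymnasium as gym
import torch

checkpoint = torch.load("ppo_hopper.cleanrl_model")  # illustrative combined checkpoint

norm_obs_env = gym.wrappers.NormalizeObservation(gym.make("Hopper-v4"))
env = gym.wrappers.NormalizeReward(norm_obs_env, gamma=0.99)

# agent.load_state_dict(checkpoint["agent"])   # restore the policy/value weights
norm_obs_env.obs_rms = checkpoint["obs_rms"]    # restore observation statistics
env.return_rms = checkpoint["return_rms"]       # restore return statistics
# Training (or evaluation) can then continue with consistent normalization.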
