Why is there no evaluation or model-saving module in the design? #310

Open
madlsj opened this issue Nov 3, 2022 · 22 comments

Comments

madlsj commented Nov 3, 2022

Problem Description

Why is there no evaluation or model-saving module in the design?

vwxyzjn (Owner) commented Nov 8, 2022

Hi @madlsj, thanks for raising this issue. We are hoping to gradually enable model saving and evaluation in #292. So far, we have omitted evaluation for several reasons:

  • reduced codebase: Model saving and evaluation are mostly boilerplate code, so removing them trims quite a few lines of code.
  • straightforward for users to implement: For users who need evaluation / model saving, implementing them is very straightforward (e.g., add a torch.save at the end of training; see the sketch after this list). As a result, we have primarily put our attention on keeping the training code succinct.
  • we can already visualize the agent’s gameplay: One primary use case for saving models is to help researchers visualize the agent’s gameplay. However, we already get this with --capture-video, which saves videos to wandb.
  • historical issues with saving models: There are also some tricky issues. For example, ppo_continuous_action.py also needs to save the states of the normalize wrappers, which is awkward to implement. We also support JAX, which requires a slightly different way to run evaluation loops.
  • evaluation methodology: Depending on the evaluation methodology, we don't even need an evaluation module. There are two types of evaluation methods: explicit evaluation and implicit evaluation.
    • An evaluation module enables an explicit evaluation phase: we train a model for some steps and the module evaluates it for some episodes. While an explicit evaluation phase is useful, it can be time-consuming to run and has subtle issues (Machado et al., 2017).
    • Instead, Machado et al. recommend an implicit evaluation phase — simply report the average episodic return of the last $N$ episodes. This evaluation method aligns better with the general goal of RL, which is continual learning, and it also detects issues with catastrophic forgetting. Our current approach already supports this type of evaluation (see Prototype RLops Utility #307, which reports the average return of the last 100 episodes).
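For reference, here is a minimal sketch of that "straightforward" route, assuming the variable names in ppo_continuous_action.py (agent, envs, run_name); handling of obs_rms is discussed further down this thread, and the exact wrapper access may differ by gym/gymnasium version:

# Minimal sketch (not an official CleanRL API): persist the policy weights and the
# observation-normalization statistics at the end of training.
# Assumes variables from ppo_continuous_action.py: `agent`, `envs`, `run_name`.
import os
import pickle

import torch

save_dir = f"runs/{run_name}"
os.makedirs(save_dir, exist_ok=True)
torch.save(agent.state_dict(), os.path.join(save_dir, "agent.pt"))

# The NormalizeObservation running statistics are effectively learned parameters,
# so persist them too (reading `obs_rms` is forwarded through the wrapper chain
# of the first sub-env; see the wrapper-placement discussion further below).
with open(os.path.join(save_dir, "obs_rms.pkl"), "wb") as f:
    pickle.dump(envs.envs[0].obs_rms, f)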

That said, with recent advances such as reincarnating RL, which aim to reuse existing models, model saving and evaluation are becoming increasingly relevant, so we are gradually adopting them.

@JamesKCS

Hi @vwxyzjn, thank you for your excellent work on this awesome library. I have a couple of questions closely related to the above, and thought I'd piggyback on this issue rather than opening my own:

I am attempting to save/load an agent trained with code based on your ppo_continuous_action.py for evaluation. The agent performed significantly worse after reloading than it did during the latter stages of training. Eventually, I figured out what you mentioned above: I need to save and load the state of the NormalizeObservation wrapper, since that is effectively a learned part of the agent.

So I am saving and loading the observation RunningMeanStd.mean, RunningMeanStd.var, and RunningMeanStd.count variables. Since training is parallelized across n environments, and there seem to be n instances of the above variables, I simply grab the ones from the environment with index 0. Q1: Is just grabbing those variables from index 0 a reasonable solution, or do I need to do something more sophisticated?

Before I run more long trials, Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent? (I know that if I want to resume training, I'd have to also save/load the reward normalization RunningMeanStd data, but that shouldn't be necessary just for evaluation, correct?) Thank you!

vwxyzjn (Owner) commented Nov 15, 2022

Hi @JamesKCS, thanks for sharing these issues.

Q1: Is just grabbing those variables from index 0 a reasonable solution, or do I need to do something more sophisticated?

I recommend applying the normalize wrappers at the vectorized-environment level, like

https://github.com/vwxyzjn/envpool-cleanrl/blob/880552a168c08af334b5e5d8868bfbe5ea881445/ppo_continuous_action_envpool.py#L194-L198

which performs better in my experience when num_envs > 1. Then you can grab RunningMeanStd.mean, which is the running mean across all the envs. A better solution in the future is to implement the normalize wrappers on the torch or jax side, so the implementation could also benefit from hardware acceleration and a more straightforward model-saving procedure.
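Roughly, the idea behind the linked lines, adapted to gym.vector.SyncVectorEnv, is sketched below (paraphrased rather than copied verbatim; make_env, args, and run_name are placeholder names, make_env here builds only the base env without normalization wrappers, and wrapper APIs differ slightly across gym/gymnasium versions):

# Sketch: wrap the vectorized env once instead of wrapping each sub-env in make_env,
# so there is a single shared obs_rms / return_rms for all sub-envs.
import gym
import numpy as np

envs = gym.vector.SyncVectorEnv(
    [make_env(args.env_id, i, args.capture_video, run_name) for i in range(args.num_envs)]
)
envs = gym.wrappers.ClipAction(envs)
envs = gym.wrappers.NormalizeObservation(envs)
envs = gym.wrappers.TransformObservation(envs, lambda obs: np.clip(obs, -10, 10))
envs = gym.wrappers.NormalizeReward(envs, gamma=args.gamma)
envs = gym.wrappers.TransformReward(envs, lambda reward: np.clip(reward, -10, 10))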

Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent?

I think ppo_continuous_action.py is the main one where the environment stores some training state. Others should be fine. With #292 we should also have a more quantitative answer to this question.

@JamesKCS

@vwxyzjn Just an update: the above did seem to fix the problem. After retraining and using the learned observation normalization, the newly loaded agent had returns similar to those from the end of training. Thank you again!

@qiuruiyu

Sorry, I have a question.
I have a setpoint control problem and used ppo_continuous_action.py for training. For evaluation, I wrote the following code:

env = gym.vector.SyncVectorEnv(
        [make_env(1, 1, False, "1", 0.99)]
)
state_dict = torch.load('./runs/Shell-v0__ppo_continuous_action__1__1687066897/agent_500.pt')
agent = Agent(env)
agent.load_state_dict(state_dict, strict=True)

s, _ = env.reset() 


while True:
    s_mean = env.envs[0].obs_rms.mean 
    s_var = env.envs[0].obs_rms.var 
    s_epsilon = env.envs[0].epsilon 
    
    # a, log_prob, entropy, value = agent.get_action_and_value(torch.FloatTensor(s))
    a = agent.actor_mean(torch.FloatTensor(s)).detach().numpy()
    s_, r, terminated, truncated, info = env.step(a)

    s_re = s * np.sqrt(s_var + s_epsilon) + s_mean 
    y.append(s_re.reshape(9, -1)[3:6, -1])
    du.append(s_re.reshape(9, -1)[-3:, -1])
    if "final_info" not in info:
                    continue
    for info in info["final_info"]:
        # Skip the envs that are not done
        if info is None:
            continue
        print(f"episodic_return={info['episode']['r']}")
    if truncated:
        break 
    s = s_ 

For one thing, the printed episodic_return is worse than during training. I also found that even at the freshly reset state, the action given by the policy is quite different from the one during training. Is there any problem with my code?

vwxyzjn (Owner) commented Jun 18, 2023

Sorry, I have a question. I have a setpoint control problem and used ppo_continuous_action.py for training. For evaluation, I wrote the following code: [...] Is there any problem with my code?

Maybe @JamesKCS can share his snippet? The issue is that during training you need to save the states of the wrappers in the environment, such as obs_rms.mean, along with agent.pt; this was not done in the snippet you shared.

@qiuruiyu
Copy link

@vwxyzjn To update the question above: I removed all env wrappers except FlattenObservation() and RecordEpisodeStatistics() because I had already done normalization in my env.py. From the printed info, the reward still converges. However, the evaluation is still not normal.

env_eval = gym.vector.SyncVectorEnv(
    [make_env(1, 1, False, "1", 0.99)]
)
epi_r_eval = 0
s_eval, _ = env_eval.reset()
while True:
    a_eval, _, _, _ = agent.get_action_and_value(torch.FloatTensor(s_eval))
    a_eval = a_eval.detach().numpy()
    s_eval_, r_eval, ter, trun, info_eval = env_eval.step(a_eval)
    epi_r_eval += r_eval
    if trun:
        break
    s_evl = s_eval_

I added the code above for evaluation right after writer.add_scalar, but the evaluation result is very different from training. The log is as follows, which is really strange. Please help:

global_step=26400, episodic_return=[-45.999065]
global_step=26500, episodic_return=[-50.956413]
global_step=26600, episodic_return=[-50.758144]
SPS: 2534
----------- EVALUATION ---------------
EPISODE REWARD:  [-300791.29141884]
--------------------------------
global_step=26700, episodic_return=[-102.30504]
global_step=26800, episodic_return=[-36.258324]
global_step=26900, episodic_return=[-54.978573]

vwxyzjn (Owner) commented Jun 20, 2023

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms
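Slightly more concretely, one way to wire this up is sketched below, assuming the wrappers are applied directly on the vectorized envs (as suggested earlier) and that you keep handles to them; make_env, args, run_name, train_obs_norm, and train_rew_norm are placeholders for your own setup:

# Sketch: build the eval env with the same wrapper stack and point its running
# statistics at the ones learned during training. Keep explicit handles to the
# NormalizeObservation / NormalizeReward wrappers so the assignment hits the
# wrapper that actually uses the statistics (attribute writes are not forwarded
# down the wrapper chain the way reads are).
import gym

eval_envs = gym.vector.SyncVectorEnv([make_env(args.env_id, 0, False, run_name)])
eval_obs_norm = gym.wrappers.NormalizeObservation(eval_envs)
eval_rew_norm = gym.wrappers.NormalizeReward(eval_obs_norm, gamma=args.gamma)
eval_envs = eval_rew_norm

# reuse the statistics from the training-time wrapper handles
eval_obs_norm.obs_rms = train_obs_norm.obs_rms
eval_rew_norm.return_rms = train_rew_norm.return_rms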

@qiuruiyu

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms

I have removed these wrappers, but it still shows bad performance during evaluation. Maybe the problem isn't obs_rms?

vwxyzjn (Owner) commented Jun 20, 2023

You should not remove these wrappers because they are what the agent was trained on.

@qiuruiyu

Yes. I tried removing them in the make_env function, which means the agent was trained on the original env, but the result was the same.
You mean that if I set eval_env.obs_rms = env.obs_rms, with the env wrapped by NormalizeObservation, then the agent can be evaluated correctly?

vwxyzjn (Owner) commented Jun 20, 2023

Yes. During training the agent sees normalized observations, but during your evaluation the agent sees unnormalized obs, which is probably the reason its performance was bad.
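In other words, the policy expects inputs in the form produced by NormalizeObservation. If you insist on evaluating with an unwrapped env, you would have to apply the same transform yourself with the saved statistics; roughly (a sketch, using gym's default epsilon of 1e-8):

# What NormalizeObservation does to every observation, using its RunningMeanStd
# statistics. Raw observations must be transformed like this before being fed
# to a policy that was trained on normalized observations.
import numpy as np

def normalize_obs(obs, obs_rms, epsilon=1e-8):
    return (obs - obs_rms.mean) / np.sqrt(obs_rms.var + epsilon)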

@qiuruiyu

ppo_continuous_action.txt
Sorry again. In order to upload the file, I changed the suffix to .txt. At the end of the code I added the part for regular evaluation. I tried your advice of setting the obs_rms of the eval env to be the same as the training env, but it didn't work.
Then I tried Pendulum-v1 and left only env = gym.wrappers.RecordEpisodeStatistics(env) in the make_env function. However, it still performs like random actions or something similar.
Therefore, I think it's not a problem with obs_rms, because I have completely gotten rid of all wrappers in both the training and evaluation envs.

vwxyzjn (Owner) commented Jun 20, 2023

PPO does not solve Pendulum-v1 if I recall correctly.

@qiuruiyu

PPO does not solve Pendulum-v1 if I recall correctly.

Pendulum-v1 is also a continuous-action env, and I tested it. Here is the result:
(screenshot of training results)

@qiuruiyu

PPO does not solve Pendulum-v1 if I recall correctly.

Sorry, but regardless, my question is why the evaluation fails. Please, I really need this.

vwxyzjn (Owner) commented Jun 21, 2023

Ok I gave it a shot at https://wandb.ai/costa-huang/cleanRL/runs/3nhnaboz. It should work.

Edit: it did work

eval_episodic_return=[-838.558]
eval_episodic_return=[-693.1612]
eval_episodic_return=[-122.41518]
eval_episodic_return=[-126.593254]
eval_episodic_return=[-1525.9708]
eval_episodic_return=[-1092.5537]
eval_episodic_return=[-7.024438]
eval_episodic_return=[-249.01135]
eval_episodic_return=[-827.8679]
----------- EVALUATION ---------------
EPISODE REWARD: -587.4559
--------------------------------

Long story short, there were some subtle issues with the way you used gym's API. I'd suggest doing a file diff to identify the differences.

JamesKCS commented Jun 24, 2023

Maybe @JamesKCS can share his snippet?

My solution is hacky, since I just load the wrapper from one environment, rather than the more correct way, which is to use "the running mean of all the envs."

However, I'm happy to share in case it helps anyone else (it shouldn't be hard to modify my solution to be more correct):

Save with:

pickle_item(envs.get_obs_norm_rms_obj(), saved_policies_dir+"trial_%s_obs_rms_obj_pickle" % trial_num)

Load with:

def load_learned_observation_normalization(envs, trial_num):
    rms_obj = load_pickled_item("saved_policies/trial_%s_obs_rms_obj_pickle" % trial_num)
    envs.set_obs_norm_rms_obj(rms_obj)

Helper functions:

class Environment():
    def __init__(self, env_id, num_envs, seed, device, capture_video, run_name, gamma):
        self.gym_sync_vec_env = gym.vector.SyncVectorEnv([self.clean_rl_make_env(env_id, seed + i, i, capture_video, run_name, gamma) for i in range(num_envs)])
        self.capture_video = False
        self.num_envs = num_envs
        self.global_step = 0
        self.device = device

    def get_NormalizeObservation_wrapper(self, env_num=0):
        return self.gym_sync_vec_env.envs[env_num].env.env.env
    
    def get_obs_norm_rms_obj(self, env_num=0):
        return self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms
    
    def set_obs_norm_rms_obj(self, rms_obj, env_num=0):
        self.get_NormalizeObservation_wrapper(env_num=env_num).obs_rms = rms_obj
    
    def get_obs_norm_rms_vars(self, env_num=0):
        rms_obj = self.get_obs_norm_rms_obj(env_num)
        return rms_obj.mean, rms_obj.var, rms_obj.count

import pickle

def pickle_item(item, filename):
    with open(filename, 'wb') as file:
        pickle.dump(item, file)

def load_pickled_item(filename):
    with open(filename, 'rb') as file:
        return pickle.load(file)

Edit: added the relevant bits of my Environment class, and some relevant helper functions. For anyone looking to imitate this, I recommend trying to understand what I did rather than blindly copying, since this was just a quick hack to get things working.
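To make the flow explicit, here is an illustrative end-to-end usage of the helpers above (the file name trial_%s_agent.pt is hypothetical; saved_policies_dir, trial_num, agent, and envs are whatever you use in your own training script):

# Illustrative usage only, matching the snippet's assumptions.
import torch

# At the end of training: save the policy weights and the observation RMS object.
torch.save(agent.state_dict(), saved_policies_dir + "trial_%s_agent.pt" % trial_num)
pickle_item(envs.get_obs_norm_rms_obj(), saved_policies_dir + "trial_%s_obs_rms_obj_pickle" % trial_num)

# Before evaluation: rebuild the wrapped envs and the Agent, then load both pieces.
agent.load_state_dict(torch.load(saved_policies_dir + "trial_%s_agent.pt" % trial_num))
load_learned_observation_normalization(envs, trial_num)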

@qiuruiyu

My solution is hacky, since I just load the wrapper from one environment, rather than the more correct way, which is to use "the running mean of all the envs." [...]

Is your evaluation part of the training process, i.e., do you evaluate once every 10,000 steps?
If I save the agent from one training run and then load it in another Python script or Jupyter notebook, and after creating the env I wrap it with the same NormalizeObservation, do I need to save/load obs_rms again? Or, since the agent receives normalized observations, do I not need any extra processing at evaluation because the obs has already been normalized?

@JamesKCS

Is your evaluation part of the training process, i.e., do you evaluate once every 10,000 steps? [...] Do I need to save/load obs_rms again?

I'm not sure that I fully understand the question, so I'll just walk you through how it works.

  1. When saving the agent, you also have to save the env wrapper (in my code, this is the pickle_item call).
  2. When loading an agent (whether for evaluation or for further training), you have to load the env wrapper. In my code, this is the load_learned_observation_normalization call.

Remember that the environment wrapper is really part of the policy/agent (even though it's not represented that way in the code). So, if you don't load the wrapper correctly when loading a saved policy/agent, you aren't loading the complete policy/agent (since you are effectively "resetting" this component of the policy/agent), and should expect bad results.

Does that clear things up? If you are still confused, it might be worth rereading the top few posts on this thread, since they explain the basic problem and solution. Best of luck with your work! :)

@JamesKCS

@qiuruiyu P.S. I edited the code snippets above to give more context.

@roger-creus

Roughly speaking, you should do

eval_env.obs_rms = env.obs_rms
eval_env.return_rms = env.return_rms

For everyone else: env.envs[0].obs_rms will access the reward RMS wrapper because it is added later. You should overwrite env.envs[0].env.env.env.obs_rms instead.
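Since the exact .env.env.env depth depends on the wrapper order inside make_env, a slightly more robust approach is to walk the wrapper chain until the NormalizeObservation wrapper is found. A sketch (find_wrapper is a hypothetical helper, not part of CleanRL or gym; envs / eval_envs stand for your training and evaluation vector envs):

import gym

def find_wrapper(env, wrapper_cls):
    # Walk down the wrapper chain and return the first wrapper of the given class.
    while not isinstance(env, wrapper_cls):
        if not hasattr(env, "env"):
            raise ValueError(f"{wrapper_cls.__name__} not found in the wrapper chain")
        env = env.env
    return env

# overwrite the statistics on the wrapper that actually uses them
train_norm = find_wrapper(envs.envs[0], gym.wrappers.NormalizeObservation)
eval_norm = find_wrapper(eval_envs.envs[0], gym.wrappers.NormalizeObservation)
eval_norm.obs_rms = train_norm.obs_rms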
