Why is there no design evaluation and save model module? #310
Comments
Hi @madlsj, thanks for raising this issue. We are hoping to gradually enable model saving and evaluation in #292. So far, we have omitted evaluation for a list of reasons.
That said, with recent advances such as reincarnating RL that aim to reuse existing models, model saving / evaluation is becoming increasingly relevant, so we are gradually adopting it.
Hi @vwxyzjn, thank you for your excellent work on this awesome library. I have a couple of questions closely related to the above, and thought I'd piggyback on this issue rather than opening my own. I am attempting to save/load an agent trained with code based on your ppo_continuous_action.py script, so I am saving and loading the observation normalization statistics as well. Before I run more long trials: Q2: Are there any other similar "gotchas" that could also cause worse performance when saving/loading an agent? (I know that if I want to resume training, I'd have to also save/load the reward normalization RunningMeanStd data, but that shouldn't be necessary just for evaluation, correct?) Thank you!
Hi @JamesKCS, thanks for sharing these issues.
I recommend applying the normalize wrappers at the vectorized environment level, which performs better in my experience.
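(Editorial sketch, not the snippet Costa linked: one way to apply the normalization wrappers at the vectorized-environment level, assuming a gym/gymnasium version whose NormalizeObservation / NormalizeReward wrappers accept vector environments. The env id, number of envs, and gamma are placeholders.)

```python
import gym
import numpy as np


def make_env(env_id):
    def thunk():
        env = gym.make(env_id)
        env = gym.wrappers.FlattenObservation(env)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        env = gym.wrappers.ClipAction(env)
        return env
    return thunk


gamma = 0.99
envs = gym.vector.SyncVectorEnv([make_env("Pendulum-v1") for _ in range(4)])
# Normalization is applied once, on the vector env, so there is a single
# obs_rms / return_rms object to save and restore later.
envs = gym.wrappers.NormalizeObservation(envs)
envs = gym.wrappers.TransformObservation(envs, lambda obs: np.clip(obs, -10, 10))
envs = gym.wrappers.NormalizeReward(envs, gamma=gamma)
envs = gym.wrappers.TransformReward(envs, lambda reward: np.clip(reward, -10, 10))

obs, _ = envs.reset(seed=0)
print(envs.obs_rms.mean.shape)  # one set of statistics shared by all sub-envs
```

With this layout the normalization state lives in one place (envs.obs_rms) instead of one copy per sub-environment.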
@vwxyzjn Just an update: the above did seem to fix the problem. After retraining and using the learned observation normalization statistics, the newly loaded agent had returns similar to those from the end of training. Thank you again!
Sorry, I have a question.

```python
import gym
import numpy as np
import torch

# make_env and Agent are defined in the training script (ppo_continuous_action)
y, du = [], []  # logs of selected de-normalized observation components

env = gym.vector.SyncVectorEnv(
    [make_env(1, 1, False, "1", 0.99)]
)
state_dict = torch.load('./runs/Shell-v0__ppo_continuous_action__1__1687066897/agent_500.pt')
agent = Agent(env)
agent.load_state_dict(state_dict, strict=True)
s, _ = env.reset()
while True:
    # observation-normalization statistics of the (freshly created) env
    s_mean = env.envs[0].obs_rms.mean
    s_var = env.envs[0].obs_rms.var
    s_epsilon = env.envs[0].epsilon
    # a, log_prob, entropy, value = agent.get_action_and_value(torch.FloatTensor(s))
    a = agent.actor_mean(torch.FloatTensor(s)).detach().numpy()
    s_, r, terminated, truncated, infos = env.step(a)
    # undo the observation normalization for logging
    s_re = s * np.sqrt(s_var + s_epsilon) + s_mean
    y.append(s_re.reshape(9, -1)[3:6, -1])
    du.append(s_re.reshape(9, -1)[-3:, -1])
    s = s_  # advance to the next observation every step
    if "final_info" not in infos:
        continue
    for info in infos["final_info"]:
        # Skip the envs that are not done
        if info is None:
            continue
        print(f"episodic_return={info['episode']['r']}")
    if truncated:
        break
```

For one thing, the printed episodic_return is worse than during training. I also found that, even at the freshly reset state, the action given by the policy is quite different from what it was during training. Is there any problem with my code?
Maybe @JamesKCS can share his snippet? The issue is that during training you need to save the state of the wrappers in the environment, such as obs_rms.mean, along with agent.pt, which was not done in the snippet you shared.
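(Editorial sketch of what Costa describes above, not code from this thread: persist the NormalizeObservation statistics next to agent.pt during training, then write them back into the freshly created evaluation env before any rollout. It assumes the CleanRL-style setup where each sub-env of the SyncVectorEnv is wrapped with NormalizeObservation; agent, envs, eval_envs, and the paths are placeholders from that context.)

```python
import numpy as np
import torch

# --- during training: save wrapper state together with the model weights ---
torch.save(agent.state_dict(), "runs/example/agent.pt")  # hypothetical path
obs_rms = envs.envs[0].obs_rms                            # NormalizeObservation state
np.savez("runs/example/obs_rms.npz", mean=obs_rms.mean, var=obs_rms.var, count=obs_rms.count)

# --- before evaluation: restore both the model and the wrapper state ---
agent.load_state_dict(torch.load("runs/example/agent.pt"))
data = np.load("runs/example/obs_rms.npz")
for e in eval_envs.envs:  # a freshly created eval env starts with reset statistics
    e.obs_rms.mean = data["mean"]
    e.obs_rms.var = data["var"]
    e.obs_rms.count = float(data["count"])
```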
@vwxyzjn To update the question above: I removed all env wrappers except FlattenObservation() and RecordEpisodeStatistics(), because I already do normalization in my env.py. From the printed info, the reward still converges during training, but the test is still not normal. I added the following code for evaluation afterwards:

```python
env_eval = gym.vector.SyncVectorEnv(
    [make_env(1, 1, False, "1", 0.99)]
)
epi_r_eval = 0
s_eval, _ = env_eval.reset()
while True:
    a_eval, _, _, _ = agent.get_action_and_value(torch.FloatTensor(s_eval))
    a_eval = a_eval.detach().numpy()
    s_eval_, r_eval, ter, trun, info_eval = env_eval.step(a_eval)
    epi_r_eval += r_eval
    s_eval = s_eval_  # advance to the next observation
    if trun:
        break
```
Roughly speaking, you should do
I have removed these wrappers, but the agent still performs badly during evaluation. Maybe the problem isn't obs_rms after all?
You should not remove these wrappers, because they are what the agent was trained on.
Yes. I tried removing them in the make_env function, which means the agent was trained on the original env, but the result was the same.
Yes. During training, the agent sees normalized observations, but during your evaluation the agent sees unnormalized observations, which is probably the reason its performance was bad.
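(Purely illustrative, to make the point above concrete: if the evaluation env carries no NormalizeObservation wrapper, the saved training statistics have to be applied by hand before each observation is fed to the policy. The variable names here are hypothetical.)

```python
import numpy as np

EPS = 1e-8  # matches the wrapper's default epsilon

def normalize_obs(raw_obs, obs_mean, obs_var):
    # the same transform NormalizeObservation applies during training
    return (raw_obs - obs_mean) / np.sqrt(obs_var + EPS)
```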
ppo_continuous_action.txt
PPO does not solve
Sorry, but regardless, my question is why the evaluation fails. Please, I really need this.
OK, I gave it a shot at https://wandb.ai/costa-huang/cleanRL/runs/3nhnaboz. It should work. Edit: it did work, with eval_episodic_return=[-838.558]. Long story short, there were some subtle issues with the way you use gym's API. I'd suggest doing a file diff to identify those differences.
My solution is hacky, since I just load the wrapper state from one environment, rather than the more correct way, which is to use
However, I'm happy to share in case it helps anyone else (it shouldn't be hard to modify my solution to be more correct).

Save with:

Load with:

Helper functions:

Edit: added the relevant bits of my Environment class, and some relevant helper functions. For anyone looking to imitate this, I recommend trying to understand what I did rather than blindly copying it, since this was just a quick hack to get things working.
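(The snippets referenced above are not preserved in this thread. Below is an editorial, hypothetical sketch of helpers in the same spirit, copying the RunningMeanStd state out of one wrapped vector env and into another; it is not the original poster's code, and the names are made up.)

```python
import numpy as np


def get_obs_rms_state(vec_env):
    """Read the observation-normalization statistics from the first sub-env."""
    rms = vec_env.envs[0].obs_rms
    return {"mean": rms.mean.copy(), "var": rms.var.copy(), "count": rms.count}


def set_obs_rms_state(vec_env, state):
    """Write previously saved statistics into every sub-env of a vector env."""
    for e in vec_env.envs:
        e.obs_rms.mean = state["mean"].copy()
        e.obs_rms.var = state["var"].copy()
        e.obs_rms.count = state["count"]


# Usage sketch: save next to the model checkpoint, restore before evaluation.
# np.savez("obs_rms.npz", **get_obs_rms_state(envs))
# set_obs_rms_state(eval_envs, dict(np.load("obs_rms.npz")))
```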
Is your evaluation wrapped into the training process? Like, do you evaluate once every 10,000 steps?
I'm not sure that I fully understand the question, so I'll just walk you through how it works.
Remember that the environment wrapper is really part of the policy/agent (even though it's not represented that way in the code). So, if you don't load the wrapper correctly when loading a saved policy/agent, you aren't loading the complete policy/agent (since you are effectively "resetting" this component of the policy/agent), and should expect bad results. Does that clear things up? If you are still confused, it might be worth rereading the top few posts on this thread, since they explain the basic problem and solution. Best of luck with your work! :)
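(Another editorial sketch, addressing the "evaluate every N steps" question: a small in-training evaluation loop that copies the training env's current normalization statistics into a separate eval env before each evaluation. evaluate, eval_every, and the rms helpers from the sketch above are hypothetical names, not CleanRL APIs; it assumes the agent and tensors live on CPU and that the eval envs are wrapped with RecordEpisodeStatistics.)

```python
import numpy as np
import torch


def evaluate(agent, eval_envs, n_episodes=5):
    """Roll out the current policy for a few episodes and return the mean episodic return."""
    returns = []
    obs, _ = eval_envs.reset()
    while len(returns) < n_episodes:
        with torch.no_grad():
            action, _, _, _ = agent.get_action_and_value(torch.as_tensor(obs, dtype=torch.float32))
        obs, _, _, _, infos = eval_envs.step(action.cpu().numpy())
        if "final_info" in infos:
            for info in infos["final_info"]:
                if info is not None:
                    returns.append(float(info["episode"]["r"]))
    return float(np.mean(returns))


# Inside the training loop (sketch):
# if global_step % eval_every == 0:
#     set_obs_rms_state(eval_envs, get_obs_rms_state(envs))  # sync normalization statistics
#     print(f"step={global_step} eval_return={evaluate(agent, eval_envs):.2f}")
```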
@qiuruiyu P.S. I edited the code snippets above to give more context.
For everyone else, |
Problem Description
Why is there no design evaluation and save model module?