
Bug in RND Intrinsic Reward Normalization #360

Open
akarshkumar0101 opened this issue Feb 17, 2023 · 1 comment
Problem Description

CleanRL's RND implementation has the following lines of code:

curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy().T]
)
mean, std, count = (
    np.mean(curiosity_reward_per_env),
    np.std(curiosity_reward_per_env),
    len(curiosity_reward_per_env),
)
reward_rms.update_from_moments(mean, std**2, count)

curiosity_rewards /= np.sqrt(reward_rms.var)

curiosity_rewards has shape (num_steps, num_envs).
So when we loop over curiosity_rewards.cpu().data.numpy().T, we are looping over the num_envs axis, not the num_steps axis.
As a result, we are computing the intrinsic "return" as a running exponential sum of rewards across environments, when it should be a running exponential sum of rewards across timesteps (see the sketch below).

In practice this does not matter too much, since we only need an approximate scale of the rewards to normalize them, but it is still wrong.
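To make the axis mix-up concrete, here is a minimal, self-contained sketch (my own paraphrase of the usual RewardForwardFilter, i.e. a running exponential sum where each update() call is meant to be one timestep; the shapes and values are arbitrary):

import numpy as np

class RewardForwardFilter:
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        # rews is meant to be the per-env reward vector for a single timestep
        self.rewems = rews if self.rewems is None else self.rewems * self.gamma + rews
        return self.rewems

num_steps, num_envs = 4, 3
curiosity_rewards = np.arange(num_steps * num_envs, dtype=np.float32).reshape(num_steps, num_envs)

# Current loop: iterating over .T yields num_envs rows of shape (num_steps,),
# so the exponential sum accumulates across environments.
f_bug = RewardForwardFilter(gamma=0.99)
buggy = np.array([f_bug.update(row) for row in curiosity_rewards.T])
print(buggy.shape)  # (num_envs, num_steps)

# Intended loop: iterate over timesteps and feed a (num_envs,) vector each call.
f_ok = RewardForwardFilter(gamma=0.99)
intended = np.array([f_ok.update(row) for row in curiosity_rewards])
print(intended.shape)  # (num_steps, num_envs)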

In the original implementation of RND, they do rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T]), but their buf_rews_int has shape (num_envs, num_steps), so the transpose makes sense there.
In CleanRL, the shape is already reversed, so transposing makes the loop run over environments.

Another issue is that the original implementation updates the RMS with every reward in the batch via self.I.rff_rms_int.update(rffs_int.ravel()), i.e. with num_envs * num_steps samples, whereas CleanRL updates it with count=len(curiosity_reward_per_env), which is only num_envs (see the sketch below).
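For reference, here is a hedged sketch of a baselines-style RunningMeanStd (I'm assuming reward_rms follows this standard moment-merge form); it shows how batch_count weights the incoming batch, which is why passing count = num_envs instead of num_steps * num_envs under-weights each update:

import numpy as np

class RunningMeanStd:
    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x):
        # Convenience wrapper: derive the batch moments from a flat array of samples.
        self.update_from_moments(np.mean(x), np.var(x), x.size)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        # Chan et al. parallel-variance merge; batch_count sets the weight of the new batch.
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = self.var * self.count + batch_var * batch_count + delta**2 * self.count * batch_count / tot_count
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count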

Possible Solution

I have fixed both of these issues in the following possible solution:

curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy()]
)
reward_rms.update(curiosity_reward_per_env.flatten())

curiosity_rewards /= np.sqrt(reward_rms.var)
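
As a quick sanity check of the shapes (using the sketched RewardForwardFilter and RunningMeanStd from above; the values are arbitrary):

import numpy as np

num_steps, num_envs = 128, 8
fake_rewards = np.random.rand(num_steps, num_envs).astype(np.float32)

discounted_reward = RewardForwardFilter(gamma=0.99)
reward_rms = RunningMeanStd()

# Iterating over the (num_steps, num_envs) array directly yields one (num_envs,)
# vector per timestep, so the exponential sum now runs over time as intended.
per_step = np.array([discounted_reward.update(r) for r in fake_rewards])
assert per_step.shape == (num_steps, num_envs)

# All num_steps * num_envs samples go into the RMS, matching rffs_int.ravel()
# in the original implementation.
reward_rms.update(per_step.flatten())
normalized = fake_rewards / np.sqrt(reward_rms.var)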

I also opened up a pull request with these fixes.

@yooceii @vwxyzjn


yooceii commented Feb 19, 2023

Hi Akarsh, it's indeed a bug. Can you provide a before/after comparison plot for the fix? (I don't see the PR, though.)
Also, could you rename curiosity_reward_per_env to curiosity_reward_per_step?
