
Bug in RND Intrinsic Reward Normalization #360

Open
akarshkumar0101 opened this issue Feb 17, 2023 · 1 comment
Problem Description

CleanRL's RND implementation has the following lines of code:

curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy().T]
)
mean, std, count = (
    np.mean(curiosity_reward_per_env),
    np.std(curiosity_reward_per_env),
    len(curiosity_reward_per_env),
)
reward_rms.update_from_moments(mean, std**2, count)

curiosity_rewards /= np.sqrt(reward_rms.var)

curiosity_rewards has shape (num_steps, num_envs).
So when we loop over curiosity_rewards.cpu().data.numpy().T, we are looping over the num_envs axis, not the num_steps axis.
As a result, we are computing the intrinsic "return" as a running exponential sum of rewards across environments, when it should be a running exponential sum of rewards across timesteps (see the sketch below).

In practice this does not matter too much, since we only need an approximate scale of the rewards to normalize them, but it is still wrong.
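To make the axis mix-up concrete, here is a minimal, self-contained sketch (my own paraphrase of the usual RewardForwardFilter, i.e. a running exponential sum where each update() call is meant to be one timestep; the shapes and values are arbitrary):

import numpy as np

class RewardForwardFilter:
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        # rews is meant to be the per-env reward vector for a single timestep
        self.rewems = rews if self.rewems is None else self.rewems * self.gamma + rews
        return self.rewems

num_steps, num_envs = 4, 3
curiosity_rewards = np.arange(num_steps * num_envs, dtype=np.float32).reshape(num_steps, num_envs)

# Current loop: iterating over .T yields num_envs rows of shape (num_steps,),
# so the exponential sum accumulates across environments.
f_bug = RewardForwardFilter(gamma=0.99)
buggy = np.array([f_bug.update(row) for row in curiosity_rewards.T])
print(buggy.shape)  # (num_envs, num_steps)

# Intended loop: iterate over timesteps and feed a (num_envs,) vector each call.
f_ok = RewardForwardFilter(gamma=0.99)
intended = np.array([f_ok.update(row) for row in curiosity_rewards])
print(intended.shape)  # (num_steps, num_envs)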

In the original implementation of RND, they do rffs_int = np.array([self.I.rff_int.update(rew) for rew in self.I.buf_rews_int.T]), but their buf_rews_int has shape (num_envs, num_steps), so the transpose makes sense there.
In CleanRL, the shape is already reversed, so transposing makes the loop run over environments.

Another issue is that the original implementation updates the RMS with every reward in the batch via self.I.rff_rms_int.update(rffs_int.ravel()), i.e. with num_envs * num_steps samples, whereas CleanRL updates it with count=len(curiosity_reward_per_env), which is only num_envs (see the sketch below).
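For reference, here is a hedged sketch of a baselines-style RunningMeanStd (I'm assuming reward_rms follows this standard moment-merge form); it shows how batch_count weights the incoming batch, which is why passing count = num_envs instead of num_steps * num_envs under-weights each update:

import numpy as np

class RunningMeanStd:
    def __init__(self, epsilon=1e-4):
        self.mean, self.var, self.count = 0.0, 1.0, epsilon

    def update(self, x):
        # Convenience wrapper: derive the batch moments from a flat array of samples.
        self.update_from_moments(np.mean(x), np.var(x), x.size)

    def update_from_moments(self, batch_mean, batch_var, batch_count):
        # Chan et al. parallel-variance merge; batch_count sets the weight of the new batch.
        delta = batch_mean - self.mean
        tot_count = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot_count
        m2 = self.var * self.count + batch_var * batch_count + delta**2 * self.count * batch_count / tot_count
        self.mean, self.var, self.count = new_mean, m2 / tot_count, tot_count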

Possible Solution

I have fixed both of these issues in the following possible solution:

curiosity_reward_per_env = np.array(
    [discounted_reward.update(reward_per_step) for reward_per_step in curiosity_rewards.cpu().data.numpy()]
)
reward_rms.update(curiosity_reward_per_env.flatten())

curiosity_rewards /= np.sqrt(reward_rms.var)
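
As a quick sanity check of the shapes (using the sketched RewardForwardFilter and RunningMeanStd from above; the values are arbitrary):

import numpy as np

num_steps, num_envs = 128, 8
fake_rewards = np.random.rand(num_steps, num_envs).astype(np.float32)

discounted_reward = RewardForwardFilter(gamma=0.99)
reward_rms = RunningMeanStd()

# Iterating over the (num_steps, num_envs) array directly yields one (num_envs,)
# vector per timestep, so the exponential sum now runs over time as intended.
per_step = np.array([discounted_reward.update(r) for r in fake_rewards])
assert per_step.shape == (num_steps, num_envs)

# All num_steps * num_envs samples go into the RMS, matching rffs_int.ravel()
# in the original implementation.
reward_rms.update(per_step.flatten())
normalized = fake_rewards / np.sqrt(reward_rms.var)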

I also opened up a pull request with these fixes.

@yooceii @vwxyzjn


yooceii commented Feb 19, 2023

Hi Akarsh, it's indeed a bug. Can you provide a before/after comparison plot for the fix? (I don't see the PR, though.)
Also, could you rename curiosity_reward_per_env to curiosity_reward_per_step?
