Fall off state bug #50

Open
GregHydeDartmouth opened this issue Dec 21, 2022 · 0 comments
Opening an issue for a potential bug I experienced using your library.
After running value iteration on the officeworld problem with reward machine 3, I noticed unexpected behavior from the agent when it acted greedily with respect to the learned values (an e-greedy policy with e=0.0).
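
(For reference, a trace of this form can be produced by a plain greedy rollout loop. The sketch below is only illustrative: `env`, `rm`, `q`, and their methods are hypothetical placeholders, not this library's actual API.)

```python
import numpy as np

def greedy_rollout(env, rm, q, max_steps=100):
    """Roll out the greedy (e = 0.0) policy and record (s, u, a, r, s', u') tuples."""
    trace = []
    s, u = env.reset(), rm.u0                 # grid position and initial RM state (placeholders)
    for _ in range(max_steps):
        a = int(np.argmax(q[(tuple(s), u)]))  # greedy action from the value-iteration Q-table
        s2, true_props = env.step(a)          # next grid position and the propositions it triggers
        u2, r, done = rm.step(u, true_props)  # RM transition, reward, and termination flag
        trace.append((s, u, a, r, s2, u2))
        s, u = s2, u2
        if done:
            break
    return trace
```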

A trace of that trajectory, with tuples of the form (s, u, a, r, s', u'), is below:
[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 0, 1.0, [7, 7, 0], -1]

As you can see, the agent behaves correctly until it arrives in u=4. From there it moves up from the mail location (self.objects[(7,4)] = "e" # MAIL) toward the plant location (self.objects[(7,7)] = "n" # PLANT) and, surprisingly, receives a reward of 1 for stepping onto the plant.

I believe the bug stems from the _compute_next_state function in reward_machine.py. When the agent "falls off" (no transition matches the true propositions), the function returns self.terminal_u as its base case, and that value is passed along as u'. However, the reward machine defines a valid reward of 1 for transitioning from u=4 to u'=-1. This is seemingly why the reward machine returns a reward of 1 even though the agent fails by walking into a plant.

A simple fix for this is to add a class attribute self.fall_off = -2 in the RewardMachine constructor. Then, in the _compute_next_state function, return self.fall_off as the base case instead. Finally, in the step function, change the done indicator to:
done = (u2 == self.terminal_u or u2 == self.fall_off)

This still triggers a terminal-state-like effect, but yields no reward machine reward, since no reward is defined for the transition from u=4 to u'=-2.
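
For concreteness, here is a minimal sketch of the change, assuming the relevant parts of RewardMachine look roughly like the current code (signatures simplified; helper names such as delta_u, evaluate_dnf, _get_next_state, and _get_reward may differ slightly from the repository's exact code):

```python
class RewardMachine:
    def __init__(self, file):
        self.terminal_u = -1   # existing terminal-state id
        self.fall_off = -2     # new: separate id for "falling off" the machine
        # ... rest of the existing constructor ...

    def _compute_next_state(self, u1, true_props):
        for u2 in self.delta_u[u1]:
            if evaluate_dnf(self.delta_u[u1][u2], true_props):
                return u2
        return self.fall_off   # was: return self.terminal_u

    def step(self, u1, true_props, s_info):
        u2 = self._get_next_state(u1, true_props)
        # Falling off still ends the episode, but no reward is defined for
        # (u1, self.fall_off), so the reward lookup below yields 0.
        done = (u2 == self.terminal_u or u2 == self.fall_off)
        rew = self._get_reward(u1, u2, s_info)
        return u2, rew, done
```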

See the following output after the change:

[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 3, 0.0, [6, 6, 0], 4]
[[6, 6, 0], 4, 0, 0.0, [6, 7, 0], 4]
[[6, 7, 0], 4, 3, 0.0, [5, 7, 0], 4]
[[5, 7, 0], 4, 2, 0.0, [5, 6, 0], 4]
[[5, 6, 0], 4, 3, 0.0, [4, 6, 0], 4]
[[4, 6, 0], 4, 2, 0.0, [4, 5, 0], 4]
[[4, 5, 0], 4, 2, 1.0, [4, 4, 0], -1]

Here the agent correctly arrives at the office location (self.objects[(4,4)] = "g" # OFFICE) from u=4.

I'm sure there is a more elegant solution than adding another terminal-state indicator; however, I found this to be the easiest fix, given that the reward machine's reward output is agnostic to the true propositions handed to the u-transition function.

I've only assessed this for this single problem instance, so I'm not sure whether the bug appears in other reward machine problems (though I suspect it likely could).
