I'm opening an issue for a potential bug I encountered while using your library.
After running value iteration on the officeworld problem with reward machine 3, I noticed unexpected behavior when the agent acted optimally, i.e. with an e-greedy policy with e=0.0.
Here is a trace of that trajectory, with each entry of the form (s, u, a, r, s', u'):
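For context, "acting optimally" here just means always taking the argmax action from the value-iteration Q-values. A minimal sketch of that selection rule (the `q` table keyed by `(s, u)` pairs is a hypothetical layout for illustration, not the library's actual structure):

```python
import numpy as np

def egreedy_action(q, s, u, epsilon=0.0, rng=None):
    """Pick an action e-greedily from a Q-table keyed by (s, u) pairs.

    With epsilon=0.0 this is pure greedy action selection.
    """
    rng = rng or np.random.default_rng()
    values = q[(s, u)]
    if rng.random() < epsilon:               # explore (never taken at epsilon=0)
        return int(rng.integers(len(values)))
    return int(np.argmax(values))            # exploit: greedy action
```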
[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 0, 1.0, [7, 7, 0], -1]
As you can see, the agent performs correctly until it arrives in u=4. From there it moves from the mail location (self.objects[(7,4)] = "e" # MAIL) to the plant location (self.objects[(7,7)] = "n" # PLANT) and, surprisingly, receives a reward of 1 for doing so.
I believe the bug lies in the _compute_next_state function in reward_machine.py. When the agent "falls off", the base case returns self.terminal_u as u'. However, there is also a legitimate reward of 1 for transitioning from u=4 to u'=-1, which is seemingly why the reward machine pays out 1 when the agent fails by walking into a plant.
A simple fix is to add a class attribute self.fall_off = -2 in the RewardMachine init. Then, in _compute_next_state, return self.fall_off as the base case instead. Finally, in the step function, change the done indicator to:
done = (u2 == self.terminal_u or u2 == self.fall_off)
This should still trigger a terminal-state-like effect, but yield no reward machine reward, since no reward is defined for the transition from u=4 to u=-2.
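To make the proposed change concrete, here is a minimal, self-contained sketch; the delta_u/delta_r dictionaries and their layout are assumptions for the example, not the library's actual internals:

```python
class RewardMachine:
    def __init__(self):
        self.terminal_u = -1   # intended terminal state (e.g. u=4 -> office)
        self.fall_off = -2     # NEW: sink state with no associated reward
        # u=4 transitions to the terminal state when the office prop "g" holds
        self.delta_u = {4: {"g": -1}}
        # reward of 1 is defined ONLY for the valid transition (4 -> -1)
        self.delta_r = {(4, -1): 1.0}

    def _compute_next_state(self, u, true_props):
        for prop, u_next in self.delta_u.get(u, {}).items():
            if prop in true_props:
                return u_next
        return self.fall_off   # was: return self.terminal_u (the bug)

    def step(self, u, true_props):
        u2 = self._compute_next_state(u, true_props)
        r = self.delta_r.get((u, u2), 0.0)   # no entry for (4, -2) -> reward 0
        done = (u2 == self.terminal_u or u2 == self.fall_off)
        return u2, r, done
```

With this change, falling off from u=4 (e.g. the plant prop "n" being true) ends the episode in u2=-2 with reward 0, while reaching the office ("g") still transitions to u2=-1 with reward 1.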
See the following output after the change:
[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 3, 0.0, [6, 6, 0], 4]
[[6, 6, 0], 4, 0, 0.0, [6, 7, 0], 4]
[[6, 7, 0], 4, 3, 0.0, [5, 7, 0], 4]
[[5, 7, 0], 4, 2, 0.0, [5, 6, 0], 4]
[[5, 6, 0], 4, 3, 0.0, [4, 6, 0], 4]
[[4, 6, 0], 4, 2, 0.0, [4, 5, 0], 4]
[[4, 5, 0], 4, 2, 1.0, [4, 4, 0], -1]
Here the agent correctly arrives in the office location (self.objects[(4,4)] = "g" # OFFICE) from u=4.
I'm sure there is a more elegant solution than adding another terminal-state indicator; however, I found this the easiest fix, since the reward machine's reward output is agnostic to the true propositions being handed to the u transition function.
I've only verified this for this single problem instance, so I'm not sure whether the bug appears in other reward machine problems (though I suspect it could).