Fall off state bug #50

Open
GregHydeDartmouth opened this issue Dec 21, 2022 · 0 comments
Opening an issue for a potential bug I experienced using your library.
After running value iteration on the officeworld problem with reward machine 3, I noticed unexpected behavior from the agent when it acted greedily with respect to the learned values (an e-greedy policy with e=0.0).
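
(For reference, a trace of this form can be produced by a plain greedy rollout loop. The sketch below is only illustrative: `env`, `rm`, `q`, and their methods are hypothetical placeholders, not this library's actual API.)

```python
import numpy as np

def greedy_rollout(env, rm, q, max_steps=100):
    """Roll out the greedy (e = 0.0) policy and record (s, u, a, r, s', u') tuples."""
    trace = []
    s, u = env.reset(), rm.u0                 # grid position and initial RM state (placeholders)
    for _ in range(max_steps):
        a = int(np.argmax(q[(tuple(s), u)]))  # greedy action from the value-iteration Q-table
        s2, true_props = env.step(a)          # next grid position and the propositions it triggers
        u2, r, done = rm.step(u, true_props)  # RM transition, reward, and termination flag
        trace.append((s, u, a, r, s2, u2))
        s, u = s2, u2
        if done:
            break
    return trace
```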

A trace of that trajectory, with tuples of the form (s, u, a, r, s', u'), is below:
[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 0, 1.0, [7, 7, 0], -1]

As you can see, the agent behaves correctly until it arrives in u=4. From there it moves up from the mail location (self.objects[(7,4)] = "e" # MAIL) toward the plant location (self.objects[(7,7)] = "n" # PLANT) and, surprisingly, receives a reward of 1 for stepping onto the plant.

I believe the bug stems from the _compute_next_state function in reward_machine.py. When the agent "falls off" (no transition matches the true propositions), the function returns self.terminal_u as its base case, and that value is passed along as u'. However, the reward machine defines a valid reward of 1 for transitioning from u=4 to u'=-1. This is seemingly why the reward machine returns a reward of 1 even though the agent fails by walking into a plant.

A simple fix for this is to add a class attribute self.fall_off = -2 in the RewardMachine constructor. Then, in the _compute_next_state function, return self.fall_off as the base case instead. Finally, in the step function, change the done indicator to:
done = (u2 == self.terminal_u or u2 == self.fall_off)

This still triggers a terminal-state-like effect, but yields no reward machine reward, since no reward is defined for the transition from u=4 to u'=-2.
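
For concreteness, here is a minimal sketch of the change, assuming the relevant parts of RewardMachine look roughly like the current code (signatures simplified; helper names such as delta_u, evaluate_dnf, _get_next_state, and _get_reward may differ slightly from the repository's exact code):

```python
class RewardMachine:
    def __init__(self, file):
        self.terminal_u = -1   # existing terminal-state id
        self.fall_off = -2     # new: separate id for "falling off" the machine
        # ... rest of the existing constructor ...

    def _compute_next_state(self, u1, true_props):
        for u2 in self.delta_u[u1]:
            if evaluate_dnf(self.delta_u[u1][u2], true_props):
                return u2
        return self.fall_off   # was: return self.terminal_u

    def step(self, u1, true_props, s_info):
        u2 = self._get_next_state(u1, true_props)
        # Falling off still ends the episode, but no reward is defined for
        # (u1, self.fall_off), so the reward lookup below yields 0.
        done = (u2 == self.terminal_u or u2 == self.fall_off)
        rew = self._get_reward(u1, u2, s_info)
        return u2, rew, done
```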

See the following output after the change:

[[2, 1, 0], 0, 3, 0.0, [1, 1, 0], 0]
[[1, 1, 0], 0, 0, 0.0, [1, 2, 0], 0]
[[1, 2, 0], 0, 0, 0.0, [1, 3, 0], 0]
[[1, 3, 0], 0, 3, 0.0, [0, 3, 0], 0]
[[0, 3, 0], 0, 0, 0.0, [0, 4, 0], 0]
[[0, 4, 0], 0, 0, 0.0, [0, 5, 0], 0]
[[0, 5, 0], 0, 1, 0.0, [1, 5, 0], 0]
[[1, 5, 0], 0, 0, 0.0, [1, 6, 0], 0]
[[1, 6, 0], 0, 1, 0.0, [2, 6, 0], 0]
[[2, 6, 0], 0, 0, 0.0, [2, 7, 0], 0]
[[2, 7, 0], 0, 1, 0.0, [3, 7, 0], 0]
[[3, 7, 0], 0, 2, 0.0, [3, 6, 0], 3]
[[3, 6, 0], 3, 1, 0.0, [4, 6, 0], 3]
[[4, 6, 0], 3, 1, 0.0, [5, 6, 0], 3]
[[5, 6, 0], 3, 0, 0.0, [5, 7, 0], 3]
[[5, 7, 0], 3, 1, 0.0, [6, 7, 0], 3]
[[6, 7, 0], 3, 2, 0.0, [6, 6, 0], 3]
[[6, 6, 0], 3, 1, 0.0, [7, 6, 0], 3]
[[7, 6, 0], 3, 2, 0.0, [7, 5, 0], 3]
[[7, 5, 0], 3, 2, 0.0, [7, 4, 0], 4]
[[7, 4, 0], 4, 0, 0.0, [7, 5, 0], 4]
[[7, 5, 0], 4, 0, 0.0, [7, 6, 0], 4]
[[7, 6, 0], 4, 3, 0.0, [6, 6, 0], 4]
[[6, 6, 0], 4, 0, 0.0, [6, 7, 0], 4]
[[6, 7, 0], 4, 3, 0.0, [5, 7, 0], 4]
[[5, 7, 0], 4, 2, 0.0, [5, 6, 0], 4]
[[5, 6, 0], 4, 3, 0.0, [4, 6, 0], 4]
[[4, 6, 0], 4, 2, 0.0, [4, 5, 0], 4]
[[4, 5, 0], 4, 2, 1.0, [4, 4, 0], -1]

Here the agent correctly arrives at the office location (self.objects[(4,4)] = "g" # OFFICE) from u=4.

I'm sure there is a more elegant solution than adding another terminal-state indicator; however, I found this to be the easiest fix, given that the reward machine's reward output is agnostic to the true propositions handed to the u-transition function.

I've only assessed this for this single problem instance, so I'm not sure whether the bug appears in other reward machine problems (though I suspect it likely could).
