
Add fail_hard decorator for worker methods #6210

Merged (18 commits, Apr 29, 2022)
Conversation

@mrocklin (Member) commented Apr 26, 2022

We add a decorator that we can apply to a worker method that closes the
worker if an exception is encountered in that method.

This is still a work in progress, and is just up here as a proof of
concept.
@mrocklin changed the title from "Add fail_hard decorator for worker methods" to "WIP: Add fail_hard decorator for worker methods" on Apr 26, 2022
@gjoseph92 (Collaborator) left a comment:

Any other methods that we think aren't covered here, but are common causes of issues?

This should probably be on everything that gets called via loop.add_callback. In worker.py, I see that that's:

I'd also like to see this in core.py, but that's a bigger story:

It's interesting that transitions isn't on that list, because it just gets called indirectly, likely via handle_scheduler, gather_dep, etc. If we just catch the concurrency entrypoints, we wouldn't actually need to wrap these internal functions like transitions or validate_state.

In fact, I wonder if doing that could make things even worse. For example, a transitions error stemming from a compute-task op will propagate up to handle_scheduler. handle_scheduler will already log and reraise the error... but then try to reconnect to the scheduler (or close!).

If we add a @fail_hard to transitions, this code path will still exist. But while handle_scheduler is trying to reconnect or await self.close, a separate self.close callback will already be running in the background. So we're trying to reconnect and close at the same time. Which one is even proper behavior? (A while ago we discovered this reconnection logic is also broken #5480, so it's extra unclear what should happen.) But brittle as it is, reconnection usually works right now. If there's also a self.close running in the background, I bet it won't. And what if the attempt to reconnect causes the self.close to also throw an error? Now the worker is in an even more broken state, and it's still not closed.

I assume this isn't special to handle_scheduler; many other places could have similar race conditions.

I do want to see more structure to our concurrency. I do think errors happening at all these places you've annotated should cause the worker to close. But I'm worried that, with how tangled things are, adding more unstructured concurrency, well-meaning and reasonable as it seems, could sometimes make things even worse.

except Exception as e:
    logger.exception(e)
    # TODO: send event to scheduler
    self.loop.add_callback(self.close, nanny=False, executor_wait=False)
@gjoseph92 (Collaborator):

What if self.close encounters an error? :)

@mrocklin (Member, Author):

Then we probably wouldn't close? I'm not sure that we can guarantee against this case. Do you have some recommendation?

async def wrapper(self, *args, **kwargs):
    try:
        return await method(self, *args, **kwargs)
    except Exception as e:
@gjoseph92 (Collaborator):

Suggested change:
- except Exception as e:
+ except BaseException as e:

This would fix #5958 if you also wrapped execute (you should wrap execute either way I think).

The fact that BaseExceptions in callbacks aren't propagated by Tornado is pretty crazy. If we're going to add manual support for propagating exceptions like this, I don't see why we'd let BaseExceptions be ignored.
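As a rough illustration of that suggestion (a sketch only; the decorator body mirrors the PR's except block quoted above, and the application to execute is hypothetical):

import functools
import logging

logger = logging.getLogger(__name__)


def fail_hard(method):
    """Close the worker if ``method`` raises anything, including a BaseException."""

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except BaseException as e:  # also catches KeyboardInterrupt, SystemExit, CancelledError
            logger.exception(e)
            # TODO: send event to scheduler
            self.loop.add_callback(self.close, nanny=False, executor_wait=False)
            raise

    return wrapper


# Hypothetical application to the execute coroutine mentioned above:
# class Worker:
#     @fail_hard
#     async def execute(self, key, *, stimulus_id): ...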

@fjetter (Member):

I'm OK with catching BaseException for tasks, i.e. in apply_function, et al. to fix #5958

However, I would be worried about closing workers upon an asyncio.CancelledError. While I don't think we're using cancellation in many places right now, this would be very confusing behavior if that ever changes.
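One way to reconcile the two positions would be to catch BaseException but let cancellation pass through. A minimal sketch, assuming cancellation should simply propagate; the decorator name is made up for illustration:

import asyncio
import functools
import logging

logger = logging.getLogger(__name__)


def fail_hard_except_cancelled(method):
    """Close the worker on unexpected exceptions, but let cancellation propagate untouched."""

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except asyncio.CancelledError:
            # Cancellation is ordinary control flow, not a reason to kill the worker.
            raise
        except BaseException as e:
            logger.exception(e)
            self.loop.add_callback(self.close, nanny=False, executor_wait=False)
            raise

    return wrapper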

@mrocklin (Member, Author):

I do want to see more structure to our concurrency. I do think errors happening at all these places you've annotated should cause the worker to close. But I'm worried that, with how tangled things are, adding more unstructured concurrency, well-meaning and reasonable as it seems, could sometimes make things even worse

@gjoseph92 to be clear, are you suggesting that we shouldn't make this change? Or are you just bringing up the general concern that we're not certain about this change, but still think that something in this direction probably makes sense to do?

@fjetter (Member) commented Apr 26, 2022

I think we might be mixing up a few things. I agree that dealing with all add_callbacks properly would be great but I would not go as far as to close the worker upon every exception, e.g. is a failed heartbeat cause for this escalation?

Putting this stuff into core.py / Server is also an interesting approach, but I think I would prefer a more structured approach there that thinks about how we deal with asyncio overall. The same is true for PeriodicCallbacks. That all relates, obviously, to #6201.

I consider this specific change a "very crude exception handler for state validation exceptions". For that, I would consider it useful to cover:

  • Worker.transitions
  • Worker.transition
  • Worker.validate_*
  • Worker.handle_*

@gjoseph92 (Collaborator):

are you suggesting that we shouldn't make this change?

I don't think we should make the change currently in this PR. I think the sorts of race conditions I pointed out have the potential to make things even worse.

still think that something in this direction probably makes sense to do?

I think we can come up with something that works though. The real problem we're trying to solve is that Tornado drops exceptions from callbacks. Let's focus on that problem. A few ideas:

  • only use fail_hard on functions being passed into loop.add_callback. Probably via functools.partial instead of a decorator (to be sure that, if the function is called directly somewhere, it's not accidentally wrapped). Don't put it on transitions, for example. Whatever transitions error occurs will eventually propagate up to an add_callback point, if it's not handled above. Because we know anything running via add_callback cannot have error handling above it, we should be safe to implement our own.
  • use fail_hard wherever we want, but don't re-raise the exception. Even if there's exception handling logic above the call site, assume that we're always right and the worker should always close. This avoids race conditions with other error handlers that also might use add_callback. This is probably a little simpler and closer to the spirit of your PR.
  • If a synchronous function is passed to loop.add_callback and we want to fail_hard on it, fail_hard should wrap it in an async function, which just calls the sync function synchronously, but then does await self.close rather than loop.add_callback(self.close, ...). That way, we can handle errors from self.close. If the self.close call fails, we should do something drastic, probably os._exit(1) (I think sys.exit won't work because the BaseException it raises will just be ignored, though reading this in asyncio I'm no longer sure). Simply by being in fail_hard, something was already wrong, so if our last-ditch error handling is failing, we really need the worker to shut down immediately.
    • Or, if self.close is itself wrapped in fail_hard (with some recursion-preventing logic so that fail_hard won't call close again if close fails, but goes straight into the os._exit branch), then we can safely continue using loop.add_callback in fail_hard for synchronous functions, since if the close call eventually fails, we know it'll still result in an os._exit; we don't have to await around for that.

Or we could go in a different direction, and set the worker event loop's exception handler. We keep saying "Tornado drops exceptions", but actually it's not Tornado's fault. Tornado is just using asyncio, and asyncio drops exceptions by default. But we could customize that behavior and have it shut down the worker. This is the most correct and reliable thing to do, since it would automatically apply to everything. But it also wouldn't be possible to do with gen_cluster tests, where multiple workers share an event loop, without first fixing #4959 (since the error handler would need to know which worker is currently "active" and needs to be closed).
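For concreteness, a minimal sketch of that event-loop-handler idea (illustrative only, with a made-up installer function; note fjetter's caveat later in the thread that Tornado may log and swallow exceptions before they ever reach such a handler):

import asyncio
import logging

logger = logging.getLogger(__name__)


def install_fail_hard_handler(worker):
    """Close ``worker`` whenever the running event loop sees an otherwise-unhandled exception."""
    loop = asyncio.get_running_loop()

    def handler(loop, context):
        # ``context`` always has a "message" key and usually an "exception" key.
        exc = context.get("exception")
        logger.critical(
            "Unhandled exception on worker event loop: %s", context["message"], exc_info=exc
        )
        loop.create_task(worker.close(nanny=False, executor_wait=False))

    loop.set_exception_handler(handler)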

@mrocklin (Member, Author):

I think I agree with @fjetter here (or at least with my interpretation of what @fjetter is saying): I support avoiding add_callback more generally, but I think we can make incremental progress here. I think it comes down to this statement:

I think the sorts of race conditions I pointed out have the potential to make things even worse.

Yes, I agree that this is true. However, I also think that it is unlikely. In general worker.close calls are pretty good at closing the worker, and I don't anticipate any particular complication there. Do we have reason to anticipate something there? I think that I'm just generally a bit less concerned here (although that's a common theme 🙂 ).

I'm also trying to get something in by next release so that we can have a stop-gap measure for current state machine issues (and an easy way to provide future stop-gap solutions) while we spend time investigating.

Given that, I'm also open to other approaches. Let me go through some of the options above with the "let's get something in in the next day or two" timeline in mind.

  1. We could apply this only to gather_dep, which is the source of some issues, is already called in add_callback, and is async, so we won't be calling add_callback any more than we already are. I think that this would meet Gabe's constraint of "no more add_callback" (or at least that's how I'm interpreting Gabe's concern).

    This is a special-case of what I'm hearing above in Gabe's statement,

    only use fail_hard on functions being passed into loop.add_callback

  2. Another option from Gabe:

    use fail_hard wherever we want, but don't re-raise the exception. Even if there's exception handling logic above the call site, assume that we're always right and the worker should always close. This avoids race conditions with other error handlers that also might use add_callback. This is probably a little simpler and closer to the spirit of your PR.

    I'd prefer to reject this for now. It's a large enough change that I suspect it would have unforeseen effects that I can't reason through completely in a short timeline. I think that it's a reasonable idea, but not with my current timeline constraint.

  3. If a synchronous function is passed to loop.add_callback and we want to fail_hard on it, fail_hard should wrap it in an async function, which just calls the sync function synchronously, but then does await self.close rather than loop.add_callback(self.close, ...)

    Eh, don't forget that these aren't sync in the client sense. They're normal def functions, but they're running in the event loop. There is no easy way to run async code within them.

  4. Asyncio exception handler. I like this idea. We could do a lot with it I think. I'd like to defer this to after this release though.

@mrocklin (Member, Author):

To option 1, which was "just use this on gather_dep", I'll also add handle_scheduler and all of the handle methods. I think that the only cases where @fjetter's and @gjoseph92's constraints differ here are on internal functions like transitions and validation methods. I'm going to restrict to the shared subset so that, hopefully, this can go in quickly.

@mrocklin (Member, Author):

OK, for now I've removed fail_hard on anything non-asynchronous. I don't actually necessarily think that this is the right approach long-term, but I'm fine starting here, and I think that it's ok to start with something agreeable and leave discussion for later.

@mrocklin changed the title from "WIP: Add fail_hard decorator for worker methods" to "Add fail_hard decorator for worker methods" on Apr 26, 2022
@mrocklin (Member, Author):

Removing WIP title

@mrocklin (Member, Author):

I've been playing with this with test_chaos_rechunk and it definitely causes it to be more noisy in ways that I think people here will appreciate.

@mrocklin (Member, Author):

Hrm, never mind. It's not capturing something failing in ensure_communicating yet (which should be captured, because it's in handle_stream/handle_scheduler). 👀

@mrocklin (Member, Author):

Yeah, there is still a bit of work to do here. It's fun to dive into this though. I'm hopeful that we can get both the chaos testing to be much more sensitive (with this and other recent scheduler-event signals) and hopefully also expose other things that have been going on at the same time.

@@ -1199,6 +1226,7 @@ async def heartbeat(self):
        finally:
            self.heartbeat_active = False

    @fail_hard
@gjoseph92 (Collaborator):

Once the decorator is here, what's the point of having this block anymore?

finally:
    if self.reconnect and self.status in Status.ANY_RUNNING:
        logger.info("Connection to scheduler broken. Reconnecting...")
        self.loop.add_callback(self.heartbeat)
    else:
        logger.info(
            "Connection to scheduler broken. Closing without reporting. Status: %s",
            self.status,
        )
        await self.close(report=False)

The decorator is going to close the worker as soon as the finally block completes. So what's the point of trying to reconnect if we're going to close either way? These seem like opposing behaviors.

I think we want to either remove the try/except entirely from handle_scheduler (because all we want to do on error is close the worker, and @fail_hard will do that for us anyway), or not use @fail_hard here, if we do in fact still want to reconnect in the face of errors.
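For reference, removing the try/except would reduce the method to something like this (a sketch only, assuming the fail_hard decorator from this PR; as the later discussion notes, the merged version ultimately kept the reconnect behavior):

class Worker:
    @fail_hard
    async def handle_scheduler(self, comm):
        # No local try/except/finally: any exception (comm failure, state error, ...)
        # propagates to the decorator, which logs and closes the worker.
        await self.handle_stream(comm, every_cycle=[self.ensure_communicating])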

@gjoseph92 (Collaborator):

It's not capturing something failing in ensure_communicating yet

First off, there's this:

except OSError:
    # FIXME: This is silently ignored, is this intentional?
    pass

Also, sometimes ensure_communicating can be scheduled as a callback from other entrypoints you might not be accounting for. For example, busy workers in gather_dep schedule self._readd_busy_worker to run later

self.io_loop.call_later(0.15, self._readd_busy_worker, worker)

which then calls ensure_communicating:

@log_errors
def _readd_busy_worker(self, worker: str) -> None:
    self.busy_workers.remove(worker)
    self.ensure_communicating()

Same with find_missing, which is running in a PeriodicCallback.

Additionally, ensure_communicating gets called to handle UnpauseEvent:

@handle_event.register
def _(self, ev: UnpauseEvent) -> RecsInstrs:
    """Emerge from paused status. Do not send this event directly. Instead, just set
    Worker.status back to running.
    """
    assert self.status == Status.running
    self.ensure_communicating()

I think the fact that _handle_instructions currently only processes SendMessageToScheduler and Execute means that ensure_communicating won't be called as part of a done_callback on an asyncio task. (If that were the case, the problem would be that, although _handle_stimulus_from_task does check the task's result to capture any exception the task may have raised, an unexpected exception raised there would just be propagated to the ether of the event loop, because _handle_stimulus_from_task is itself just a callback.)

But UnpauseEvent is handled in the Worker.status setter. So any place we set self.status = Status.running, we could be also transitively calling ensure_communicating.


I think all that code-tracing above makes it clear that it's really hard, and tedious, to find every entrypoint to these functions we care about and add a @fail_hard to them. We'd much rather just add @fail_hard to the places we actually care about.

I think that this would meet Gabe's constraint of "no more add_callback" (or at least that's how I'm interpreting Gabe's concern)

I'm not too concerned about add_callback, because as I mentioned we can easily avoid it. I was specifically concerned about

try:
    return await method(self, *args, **kwargs)
except Exception:
    ...
    self.loop.add_callback(self.close, nanny=False, executor_wait=False)
    # ^ what if self.close fails in the callback? then we're letting the worker continue to run in a broken state.
    # to even reach this place in code, things that we thought were "unlikely" have already happened.
    # this is our ejector seat. nobody pulls the ejector seat unless the airplane's already seriously broken.
    # we need to make sure ejection still happens, even if something goes wrong in the process.
    raise  # <-- why re-raise? we're shutting down.

We are taking the very final action here of closing the worker if an error happens. What other error-handling logic could possibly need to run? Whatever error handling logic used to happen before this PR, it's irrelevant now, because the worker will be dead in the next few seconds.

use fail_hard wherever we want, but don't re-raise the exception

It's a large enough change that I suspect it would have unforeseen effects that I can't reason through completely in a short timeline. I think that it's a reasonable idea, but not with my current timeline constraint.

I guess I'm confused by this concern, because regardless of how it's implemented, this large change is what this PR is doing. Hopefully my comment about whether handle_scheduler should try to reconnect illustrates this. We're now shutting down the whole worker. We used to not do that. I don't think there's much to reason through—the worker will be shutting down, so existing error-handling logic will be superfluous. We could reraise, but that just adds even more to reason about (what will the error handlers above us do? Could they possibly conflict with self.close?).

@gjoseph92 (Collaborator):

Anyway, this is how I'd feel most comfortable implementing it:

import asyncio
import functools
import logging
import os
from inspect import iscoroutinefunction

logger = logging.getLogger(__name__)


async def _force_close(self):
    try:
        await asyncio.wait_for(self.close(nanny=False, executor_wait=False), 30)
    except (Exception, BaseException):  # <-- include BaseException here or not??
        # Worker is in a very broken state if closing fails. We need to shut down immediately,
        # to ensure things don't get even worse and this worker potentially deadlocks the cluster.
        logger.critical(
            "Error trying close worker in response to broken internal state. "
            "Forcibly exiting worker NOW",
            exc_info=True
        )
        # use `os._exit` instead of `sys.exit` because of uncertainty
        # around propagating `SystemExit` from asyncio callbacks
        os._exit(1)


def fail_hard(method):
    """
    Decorator to close the worker if this method encounters an exception.
    """
    if iscoroutinefunction(method):

        @functools.wraps(method)
        async def wrapper(self, *args, **kwargs):
            try:
                return await method(self, *args, **kwargs)
            except Exception as e:
                logger.exception(e)
                # TODO: send event to scheduler
                await _force_close(self)
                
                # Could re-raise here if we wanted, but what's the point? We're closing.
                # Main argument would be if the caller expects a return value, it'll
                # fail anyway if we return None.

    else:
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            try:
                return method(self, *args, **kwargs)
            except Exception as e:
                logger.exception(e)
                # TODO: send event to scheduler
                self.loop.add_callback(_force_close, self)
                # ^ Normally, you'd never know if an `add_callback` is actually going to succeed.
                # But using `_force_close`, we can be confident the worker will get closed no matter what.

                # Could re-raise here if we wanted, but what's the point? We're closing.

    return wrapper

Then we could use @fail_hard on whatever functions we wanted. Ideally, we'd also remove the dead code of any try/except blocks around functions decorated with @fail_hard.

I do realize one good reason for re-raising the exception though: if we don't, the caller might immediately error anyway because we returned None instead of a real value. I still don't like letting other error handlers do unknown things while we're trying to close, but _force_close would make me pretty confident that the worker will shut down one way or another.
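As a usage illustration (method names are taken from this thread; signatures are abbreviated with *args/**kwargs, so this is only a sketch), decorating both coroutine and plain methods would select the two wrapper branches above automatically:

class Worker:
    @fail_hard
    async def gather_dep(self, *args, **kwargs):
        ...  # coroutine: failures await _force_close directly

    @fail_hard
    async def handle_scheduler(self, comm):
        ...  # coroutine: same async branch

    @fail_hard
    def transitions(self, *args, **kwargs):
        ...  # plain function on the event loop: failures schedule _force_close via add_callback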

@github-actions (bot) commented Apr 27, 2022

Unit Test Results

16 files (+4), 16 suites (+4), 7h 38m 22s ⏱️ (+2h 7m 27s)
2,741 tests (+4): 2,655 ✔️ (+8), 80 💤 (−10), 6 failed (+6)
21,829 runs (+5,439): 20,776 ✔️ (+5,169), 1,047 💤 (+264), 6 failed (+6)

For more details on these failures, see this check.

Results for commit 50c31ef. ± Comparison against base commit b837003.

♻️ This comment has been updated with latest results.

@fjetter (Member) commented Apr 27, 2022

I would like to repeat what I said yesterday, because I am surprised that so much of the conversation here was about add_callback.

I do not think this PR is a good approach to structured concurrency. It is true that we are missing any proper means of dealing with exceptions. For the record, I tried asyncio exception handlers a long time ago already, and it doesn't work because, when using Tornado, there are no unhandled exceptions: Tornado basically wraps all callbacks with a log_error and doesn't re-raise.
IMO, this issue should not be about how to capture and deal with exceptions. This issue should be about "What is the exception handler for state validation errors".

Think pseudo-code:

try:
    some_code()
except StateCorruptionDetected:
    handle_state_corruption()

Even with structured concurrency or an asyncio exception handler we'd need to talk about "What happens to an unhandled state corruption error, assuming we can detect it".

From what I've heard so far, the requirements for handle_state_corruption are:

  • Log something
  • Try to tell the scheduler
  • Try to close gracefully (e.g. evict data)
  • Make sure this worker is dead

This handle_state_corruption is the body of the fail_hard decorator.

Regarding what this body should look like, I like the approach in #6210 (comment), using the following (see the sketch after this list):

  • Log a message
  • Send an event to the scheduler
  • Close gracefully with timeout (Don't wait for the executor since this can block the event loop #XXX; there is a PR about this)
  • If this fails or times out, kill the interpreter.
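A compact sketch of that body, following the four steps above (the function name comes from the pseudo-code earlier in this comment; Worker.log_event is assumed to be available for telling the scheduler, and the 30-second timeout is illustrative):

import asyncio
import logging
import os

logger = logging.getLogger(__name__)


async def handle_state_corruption(worker, exc):
    # 1. Log something
    logger.error("Corrupted worker state detected", exc_info=exc)
    # 2. Try to tell the scheduler (best effort)
    try:
        worker.log_event("worker-fail-hard", {"error": repr(exc)})
    except Exception:
        pass
    # 3. Try to close gracefully, with a timeout
    try:
        await asyncio.wait_for(worker.close(nanny=False, executor_wait=False), 30)
    except Exception:
        # 4. Make sure this worker is dead
        logger.critical("Closing failed; exiting the interpreter", exc_info=True)
        os._exit(1)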

include BaseException here or not??

I don't see a reason why it should be included, but at the same time I don't see a lot of harm.


To the second question of where to handle this exception (task groups or whatever would be nice, but we don't have them), I think it's the same problem as for all other exception handlers: try to be as targeted with exception handling as possible. The places where I see the possibility for us to detect these state violation errors are:

  • Worker.transitions
  • Worker.transition
  • Worker.validate_*
  • Worker.handle_*

Once we have a better approach to exception handling overall we may pull this to an outer layer, but for now these are the functions that matter. I think all concerns about race conditions are moot if we're calling os._exit.

@fjetter linked an issue Apr 27, 2022 that may be closed by this pull request.

Also set the chaos_rechunk test to sys.exit, to focus on situations where another worker fails rather than this worker.
@mrocklin (Member, Author):

OK, I've used the formulation recommended above by @gjoseph92 but modified to ...

  1. Send an event to the scheduler
  2. Not trigger if we're already closing. I think that we should maybe reverse this at some point, but for now I want us to focus on failures that are in normal operation.

Every 20 runs or so on my laptop I'm now getting the following error, which is great news!

Worker: tcp://127.0.0.1:36529
InvalidTransition: ('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895) :: released->missing
  Story:
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-task-exists', 'released', 'compute-task-1651095913.7592962', 1651095913.7619393)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'fetch', 'fetch', {}, 'compute-task-1651095913.7592962', 1651095913.763571)
    ('gather-dependencies', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.7789662)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1651095913.7787864', 1651095913.7791023)
    ('request-dep', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.780679)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'flight', 'released', 'cancelled', {}, 'steal-58', 1651095913.8156393)
    ('receive-dep', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.8474061)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'memory', 'cancelled', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)": 'released'}, 'ensure-communicating-1651095913.7787864', 1651095913.8480968)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'release-key', 'ensure-communicating-1651095913.7787864', 1651095913.8481023)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'released', 'released', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)": 'forgotten'}, 'ensure-communicating-1651095913.7787864', 1651095913.8481138)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1651095913.7787864', 1651095913.8481185)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-task-exists', 'released', 'compute-task-1651095919.0555253', 1651095919.0676618)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'fetch', 'fetch', {}, 'compute-task-1651095919.0555253', 1651095919.0678046)
    ('gather-dependencies', 'tcp://127.0.0.1:45327', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)"}, 'ensure-communicating-1651095919.1077273', 1651095919.107835)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1651095919.1077273', 1651095919.1078584)
    ('request-dep', 'tcp://127.0.0.1:45327', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)"}, 'ensure-communicating-1651095919.1077273', 1651095919.1186914)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'flight', 'released', 'cancelled', {}, 'steal-194', 1651095919.2378554)
    ('receive-dep-failed', 'tcp://127.0.0.1:45327', {"('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 395)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 369)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 389)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 387)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 368)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 392)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 412)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 373)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 377)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 385)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 363)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 370)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 372)", "('rechunk-split-2f03aa8100cc9be2cc9038da818a1e26', 337)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 366)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 398)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 371)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 367)"}, 'ensure-communicating-1651095919.1077273', 1651095919.5865068)
    ('missing-who-has', 'tcp://127.0.0.1:45327', "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-communicating-1651095919.1077273', 1651095919.5865407)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'release-key', 'ensure-communicating-1651095919.1077273', 1651095919.5866287)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'released', 'released', {}, 'ensure-communicating-1651095919.1077273', 1651095919.5866582)
Worker: tcp://127.0.0.1:36529
InvalidTransition: ('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895) :: cancelled->missing
  Story:
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-task-exists', 'released', 'compute-task-1651095913.7592962', 1651095913.7619393)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'fetch', 'fetch', {}, 'compute-task-1651095913.7592962', 1651095913.763571)
    ('gather-dependencies', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.7789662)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1651095913.7787864', 1651095913.7791023)
    ('request-dep', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.780679)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'flight', 'released', 'cancelled', {}, 'steal-58', 1651095913.8156393)
    ('receive-dep', 'tcp://127.0.0.1:37115', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1824)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1859)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1900)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1812)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1851)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1883)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1827)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1816)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1847)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1844)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1868)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1839)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1430)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1450)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1803)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1848)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1852)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1907)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1903)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1880)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1892)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1864)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1899)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1867)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1836)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1840)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1807)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1832)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1855)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1828)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1831)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1908)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1808)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1904)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1860)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1911)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1875)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1804)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1856)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1896)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1390)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1884)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1871)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1872)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1820)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1835)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1879)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1800)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1876)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1823)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1811)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1863)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1819)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1815)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1887)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1843)", 
"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1888)"}, 'ensure-communicating-1651095913.7787864', 1651095913.8474061)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'memory', 'cancelled', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)": 'released'}, 'ensure-communicating-1651095913.7787864', 1651095913.8480968)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'release-key', 'ensure-communicating-1651095913.7787864', 1651095913.8481023)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'released', 'released', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)": 'forgotten'}, 'ensure-communicating-1651095913.7787864', 1651095913.8481138)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'forgotten', 'forgotten', {}, 'ensure-communicating-1651095913.7787864', 1651095913.8481185)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-task-exists', 'released', 'compute-task-1651095919.0555253', 1651095919.0676618)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'released', 'fetch', 'fetch', {}, 'compute-task-1651095919.0555253', 1651095919.0678046)
    ('gather-dependencies', 'tcp://127.0.0.1:45327', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)"}, 'ensure-communicating-1651095919.1077273', 1651095919.107835)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'fetch', 'flight', 'flight', {}, 'ensure-communicating-1651095919.1077273', 1651095919.1078584)
    ('request-dep', 'tcp://127.0.0.1:45327', {"('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)"}, 'ensure-communicating-1651095919.1077273', 1651095919.1186914)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'flight', 'released', 'cancelled', {}, 'steal-194', 1651095919.2378554)
    ('receive-dep-failed', 'tcp://127.0.0.1:45327', {"('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 395)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 369)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 389)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 387)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 368)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 392)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 412)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1891)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 373)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 377)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 385)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 363)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 370)", "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 372)", "('rechunk-split-2f03aa8100cc9be2cc9038da818a1e26', 337)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 366)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 398)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 371)", "('rechunk-merge-42d027b592cefa31c0593225bdda7a8b', 0, 367)"}, 'ensure-communicating-1651095919.1077273', 1651095919.5865068)
    ('missing-who-has', 'tcp://127.0.0.1:45327', "('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'ensure-communicating-1651095919.1077273', 1651095919.5865407)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'release-key', 'ensure-communicating-1651095919.1077273', 1651095919.5866287)
    ("('rechunk-split-42d027b592cefa31c0593225bdda7a8b', 1895)", 'cancelled', 'released', 'released', {}, 'ensure-communicating-1651095919.1077273', 1651095919.5866582)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../mambaforge/envs/dask-distributed/lib/python3.10/site-packages/tornado/ioloop.py:530: in run_sync
    return future_cell[0].result()
distributed/utils_test.py:1096: in coro
    await end_cluster(s, workers)
distributed/utils_test.py:937: in end_cluster
    check_invalid_worker_transitions(s)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

s = <Scheduler 'tcp://127.0.0.1:39109', workers: 0, cores: 0, tasks: 0>

    def check_invalid_worker_transitions(s: Scheduler) -> None:
        if not s.events.get("invalid-worker-transition"):
            return
    
        for timestamp, msg in s.events["invalid-worker-transition"]:
            worker = msg.pop("worker")
            print("Worker:", worker)
            print(InvalidTransition(**msg))
    
>       raise ValueError(
            "Invalid worker transitions found", len(s.events["invalid-worker-transition"])
        )
E       ValueError: ('Invalid worker transitions found', 2)

@mrocklin (Member, Author):

Hrm, so this is interesting

    @fail_hard
    async def handle_scheduler(self, comm):
        try:
            await self.handle_stream(comm, every_cycle=[self.ensure_communicating])
        except Exception as e:
            logger.exception(e)
            raise
        finally:
            if self.reconnect and self.status in Status.ANY_RUNNING:
                logger.info("Connection to scheduler broken.  Reconnecting...")
                self.loop.add_callback(self.heartbeat)
            else:
                logger.info(
                    "Connection to scheduler broken. Closing without reporting.  Status: %s",
                    self.status,
                )
                await self.close(report=False)

Currently handle_scheduler tries to reconnect on an exception. This makes some sense. Comm errors are reasonable, and reconnection in that case is reasonable. However, it might be cleaner if we just bail and let the Nanny handle restart+reconnect on a failure. This would be a change in current behavior, but maybe ok.

@mrocklin (Member, Author):

I should note that the error above isn't actually an error generated by these changes (the gather_dep errors are more rare). Rather, some of the other changes, like using sys.exit instead of close_gracefully in the chaos test, plus the cleanliness around closing well, seem to be letting these more relevant errors come up more easily.

If we merge this, this test is likely to start lighting up on CI. I think that I'm ok with that very short term. This seems like good pain to feel.

@gjoseph92 (Collaborator):

Currently handle_scheduler tries to reconnect on an exception

Yeah, that's what I was asking about in https://github.com/dask/distributed/pull/6210/files#r859199361.

Comm errors are reasonable, and reconnection in that case is reasonable

except Exception as e is catching more than just comm errors though #5483 :) Reconnection in the case of state errors may not be as reasonable?

it might be cleaner if we just bail and let the Nanny handle restart+reconnect on a failure

I'm in favor of this, mostly because we've seen bugs with reconnection in the past: #5480 #5481 #5457.

@mrocklin (Member, Author):

Yeah, I think that the question here is "should handle_scheduler handle some exceptions?" Are there some CommClosedErrors that we're actually ok with, and don't want to raise up to fail_hard, but actually want to handle locally?

Short term my guess is that we just want to cancel on everything and restart the worker. Longer term, I wouldn't be surprised if we want workers to be a little bit more resilient than this, but I also wouldn't be surprised if it goes the other direction.

@mrocklin (Member, Author):

@fjetter as far as state machine failures go, I think that a lot of what I was seeing before were failures when we were shutting down. These are genuine and we should care about them, but recently I've shifted the chaos test to just use sys.exit, which makes us blind to these (which I think is ok short term).

As a result, if you're still looking at state machine issues I'd like to focus you on the failure in #6210 (comment) , which comes up consistently with this test, while the others don't. This one seems to happen to workers when other workers fail, which seems like something that we should care about much more.

@mrocklin (Member, Author):

OK, so there is a behavior change here around reconnection. If we're ok with that then I think that this is good to go. I'm not feeling a huge rush to get this in, but wouldn't mind getting it in this release.

@mrocklin (Member, Author):

To share my thoughts on that calculus, this does identify and resolve a state machine transition failure (and presumably resolve others as well). Because of this, I'd very much like to get it in.

I'm a little nervous about the reconnect change, but not that nervous.

As I think more about this, I think I'm more in favor of merging before the release. @gjoseph92 if you have time to look this over again I would welcome a 👍 or 👎

@fjetter (Member) commented Apr 28, 2022

I'm on board with removing application-layer reconnects and think we should replace them with a more robust network layer. I think there is tons of code we can clean up if we remove this feature, e.g.:

if nbytes:
    assert isinstance(nbytes, dict)
    already_released_keys = []
    for key in nbytes:
        ts: TaskState = self.tasks.get(key)  # type: ignore
        if ts is not None and ts.state != "released":
            if ts.state == "memory":
                self.add_keys(worker=address, keys=[key])
            else:
                t: tuple = self._transition(
                    key,
                    "memory",
                    stimulus_id,
                    worker=address,
                    nbytes=nbytes[key],
                    typename=types[key],
                )
                recommendations, client_msgs, worker_msgs = t
                self._transitions(
                    recommendations, client_msgs, worker_msgs, stimulus_id
                )
                recommendations = {}
        else:
            already_released_keys.append(key)
    if already_released_keys:
        if address not in worker_msgs:
            worker_msgs[address] = []
        worker_msgs[address].append(
            {
                "op": "remove-replicas",
                "keys": already_released_keys,
                "stimulus_id": stimulus_id,
            }
        )

However, I am nervous about this too and I would like to not rush this. I would definitely prefer a dedicated PR.

Is this absolutely required to get the fail hard in?

@mrocklin (Member, Author):

Is this absolutely required to get the fail hard in?

I'll see if I can separate. I'll probably allow CommClosedErrors and OSErrors to pass through.
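A sketch of what that pass-through could look like (the CommClosedError import path is assumed; since CommClosedError is an OSError subclass, catching OSError alone would already cover it, and both are listed only to mirror the comment above):

import functools
import logging

from distributed.comm.core import CommClosedError

logger = logging.getLogger(__name__)


def fail_hard(method):
    """Close the worker on unexpected exceptions, but let comm/network errors propagate."""

    @functools.wraps(method)
    async def wrapper(self, *args, **kwargs):
        try:
            return await method(self, *args, **kwargs)
        except (CommClosedError, OSError):
            # Expected on network trouble; leave it to the caller
            # (handle_scheduler's reconnect logic, or the Nanny) to deal with.
            raise
        except Exception as e:
            logger.exception(e)
            await self.close(nanny=False, executor_wait=False)
            raise

    return wrapper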

@mrocklin (Member, Author):

However, I am nervous about this too and I would like to not rush this. I would definitely prefer a dedicated PR.
Is this absolutely required to get the fail hard in?

All is well now. The behavior is still there, and it's a bit cleaner now.

Successfully merging this pull request may close these issues:

Close worker when we discover an internal error