You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
We've run into async cancellation future problems in omicron several times; one of the more painful ones is oxidecomputer/omicron#3140. Quickly summarizing: the endpoint called a function which locked a tokio mutex and then performed a sequence of long-running operations. If the client disconnected early, the future encompassing the endpoint was cancelled, which unlocked the tokio mutex and left the data protected by the mutex in an unexpected and difficult-to-debug state.
Async cancellation is tricky at best, and currently dropshot's behavior means the futures returned by endpoint functions may be cancelled under the control of the client. We had a live discussion today, and decided that a reasonable short-term bandaid (as opposed to auditing all endpoints for cancel safety and figuring out some way to keep those audits up to date) is to give dropshot servers a configuration option to tokio::spawn the endpoint futures, preventing from being cancelled if the client disconnects. This is not foolproof and may have its own problems, but seems preferable to the kinds of data corruption we've seen.
For omicron in the short term, we can put in a server-wide configuration option. In the medium term, we could add annotations to allow individual endpoints to opt into specific behavior (either "spawn" or "drop on client disconnect"). This is very much a short term quick fix that will not address all problems, as dropshot clients are not the only source of future cancellations (tokio::select! is another common one).
The text was updated successfully, but these errors were encountered:
We've run into async cancellation future problems in omicron several times; one of the more painful ones is oxidecomputer/omicron#3140. Quickly summarizing: the endpoint called a function which locked a
tokio
mutex and then performed a sequence of long-running operations. If the client disconnected early, the future encompassing the endpoint was cancelled, which unlocked the tokio mutex and left the data protected by the mutex in an unexpected and difficult-to-debug state.Async cancellation is tricky at best, and currently dropshot's behavior means the futures returned by endpoint functions may be cancelled under the control of the client. We had a live discussion today, and decided that a reasonable short-term bandaid (as opposed to auditing all endpoints for cancel safety and figuring out some way to keep those audits up to date) is to give dropshot servers a configuration option to
tokio::spawn
the endpoint futures, preventing from being cancelled if the client disconnects. This is not foolproof and may have its own problems, but seems preferable to the kinds of data corruption we've seen.For omicron in the short term, we can put in a server-wide configuration option. In the medium term, we could add annotations to allow individual endpoints to opt into specific behavior (either "spawn" or "drop on client disconnect"). This is very much a short term quick fix that will not address all problems, as dropshot clients are not the only source of future cancellations (
tokio::select!
is another common one).The text was updated successfully, but these errors were encountered: