Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix withTimeout helper function #1293

Merged
merged 2 commits into from Oct 11, 2022
Merged

Conversation

edoardopirovano
Copy link
Contributor

Noticed while looking at logs during the rollout of TRAP caching.

It turns out that I had slightly misunderstood the semantics of Promise.race. While it returns the value of the first promise that finishes, it does not actually halt the other promise (there is, as far as I now understand things, no way to do this in JavaScript). This means we were eventually calling the timeout callback even if the task completed. This isn't a big deal because the timeout callback just logged an informational message. Still, it's not ideal and creates confusing logs since we log a message about the download timing out some time after it has in fact already finished.

This PR fixes that, and adds a unit test that would've caught this. It's still not ideal that when we do timeout we don't halt the download in the background but I think that is okay since if we've timed out it probably means that has got stuck somehow so isn't using network resources. Actually cancelling the download would require re-writing parts of the Actions cache library to support cancellation, which is outside of our team's scope (although it's certainly a sensible feature request we should consider filing with the Actions team).

Merge / deployment checklist

  • Confirm this change is backwards compatible with existing workflows.
  • Confirm the readme has been updated if necessary.
  • Confirm the changelog has been updated if necessary.

@edoardopirovano edoardopirovano requested a review from a team as a code owner October 11, 2022 09:10
@jbj
Copy link

jbj commented Oct 11, 2022

It's still not ideal that when we do timeout we don't halt the download in the background but I think that is okay since if we've timed out it probably means that has got stuck somehow so isn't using network resources.

If the download keeps being stuck in the background, will the NodeJS runtime then be allowed to shut down when the action is done with all its foreground tasks? Can you verify this?

@edoardopirovano
Copy link
Contributor Author

If the download keeps being stuck in the background, will the NodeJS runtime then be allowed to shut down when the action is done with all its foreground tasks? Can you verify this?

Yikes, an excellent point, I've made a commit on top of this PR where I added a very long task with a timeout, and spawned a test run with this. Sure enough it continues with the foreground tasks but then gets stuck at the end of the step with the long task after finishing those 🙁.

So, unless there's something I'm missing (more input from JS experts would be welcome), it sounds like we can't add a timeout on top of the existing code as we're trying to do here and we really will need the proper solution of having a configurable timeout or cancellation mechanism threaded into the Actions cache library code which is unfortunate since it means having proper handling of outages in the Actions cache now requires work on another team.

@jbj
Copy link

jbj commented Oct 11, 2022

Would it help to call process.exit at the end of runWrapper?

@edoardopirovano
Copy link
Contributor Author

Would it help to call process.exit at the end of runWrapper?

This works, but is a little scary because we might also terminate other async tasks that are still pending.

I had a conversation with @henrymercer about this, and he spotted that actions/toolkit#1140 added a timeout option for the downloading, which seems like a better idea than our workaround. I've bumped the version of the library we pull in so that we have that PR and started using that new option. The timeout is per-segment rather than for the whole cache, but as we are bounding our caches to 1GB they will always fit in one segment (which is 2GB on a 64-bit machine) so this isn't an issue.

For the uploading, there is no such option. I propose we leave the workaround in place there. It's not ideal that we won't terminate the workflow until we hit the timeout for the run, but at least with our workaround the code will run to completion before we hang which means that results will get uploaded to the Code Scanning, the database will make it to MRVA if applicable, etc. This should still hopefully be a rare case, and we can monitor with our telemetry to see if it becomes a concern and then perhaps think about more aggressive solutions like using process.exit() or spawning a new process for the upload that we can kill independently.

In the future, of course, the ideal solution would be for the Actions library to just expose a timeout option in the upload case too.

Copy link
Contributor

@aeisenberg aeisenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable.

Copy link
Contributor

@henrymercer henrymercer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small suggestion around documenting some of what we learnt. Unfortunately accepting this requires rebuilding the Action and reapproving the PR, so feel free to skip it.

Comment on lines +902 to +903
* Note that this does NOT cancel the original promise, so that promise will
* continue in the background even after the timeout has expired.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Note that this does NOT cancel the original promise, so that promise will
* continue in the background even after the timeout has expired.
* Important: This does NOT cancel the original promise, so that promise will continue in the
* background even after the timeout has expired. If the original promise hangs, then this will
* prevent the process terminating.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, that is indeed better. Will do this in a follow-up to avoid dismissing the approvals here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants