Refactor status handler, remove taskDefined handler #5665

lotas · 2022-09-07T11:01:18Z

This resolves an issue where two different handlers tried to check/create/update a check_run at the same time, which led to several check_runs being created and status reports being sent to the wrong one.

Now a single status handler manages all state transitions, including task definition, running, and all of the completion states.

Besides that, if multiple events try to update the same taskId, they are guaranteed to execute strictly in order, so that an older event cannot overwrite a newer one and no duplicate check runs are created.

This also removes the /rerequested functionality for re-run tasks, as this feature is broken on the GitHub side: the status remains completed forever no matter what. Instead, a new check run is created when a task is rerun.

petemoore · 2022-09-07T11:20:35Z

This looks good, but I'm wondering how best we can test the various types of race conditions that caused us pain before, aside from testing it in production. For example, could we deploy it to staging before releasing it, and then mirror production traffic to staging, to see how it handles things?

I'm guessing it is going to be pretty tricky to mock production-like conditions without building a big framework to emulate heavy load etc.

lotas · 2022-09-07T11:36:44Z

I will be testing it on dev.alpha right now, with a single connected repo, to confirm that the refactored code still works.
Plus, all the unit tests still give us a good signal that the expected functionality and outputs are unchanged.

The goal of this refactoring was to avoid concurrent handlers trying to create/update check runs for a single task.
This was possible in the past (although extremely rare) because of two different handlers: taskDefined and status. And as we saw in this task, it happened because both of those handlers were creating a check run just 0.005s apart :)

Now both events will be sent to the same queue, and as a result only one event will be handled at a time. PulseConsumer waits for the handler to finish before receiving a new message from that queue. As long as we don't scale github-worker to multiple instances it should be fine ;) And if we do, we'd have to come up with some distributed state machines .. huh :)
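
To make the ordering argument concrete, here is a rough sketch of what "one event at a time" means. `consumeSequentially` is a hypothetical stand-in and not the real PulseConsumer API; the event names are only illustrative.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for a consumer that delivers one message at a time.
// This is NOT the real PulseConsumer API, only the ordering idea.
async function consumeSequentially(messages, handler) {
  for (const message of messages) {
    // The next message is not picked up until this handler settles.
    await handler(message);
  }
}

async function demoSequential() {
  const log = [];
  const handler = async ({ event }) => {
    log.push(`start ${event}`);
    // Simulate a slow "create check run" call for the first event.
    await sleep(event === 'taskDefined' ? 20 : 1);
    log.push(`end ${event}`);
  };
  await consumeSequentially([
    { taskId: 'abc', event: 'taskDefined' },
    { taskId: 'abc', event: 'taskRunning' },
  ], handler);
  return log;
}
```

Even though the first handler is slower, the second one cannot start before the first has finished, so the two never race to create a check run.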

matt-boris

Lookin good!

changelog/ZuGEp6tWRWW4ED2sVXTkuQ.md

matt-boris · 2022-09-07T18:27:27Z

services/github/src/handlers/status.js

+    const taskDefinition = await this.queueClient.task(taskId);
+    const fetchArtifact = async (artifactPath) => {
+      if (taskDefined || !runId) {
+        // when task is being defined, there will be no artifacts, so we fake the call and return empty response


matt-boris · 2022-09-07T18:30:56Z

services/github/src/handlers/status.js

-      build,
-      scopes: taskDefinition.scopes,
-    });
+    const [liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([


Suggested change

-    const [liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([
+    const [ liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([

services/github/src/handlers/status.js

matt-boris

Just a couple small suggestions! But overall - 💪🏻 💪🏻

matt-boris · 2022-09-08T13:43:02Z

changelog/ZuGEp6tWRWW4ED2sVXTkuQ.md

+
+Refactored github status checks handler to do handle task status transitions in single place.
+
+Previous implementaition relied on two handlers: taskDefined and statusChanged.


Suggested change

-Previous implementaition relied on two handlers: taskDefined and statusChanged.
+Previous implementation relied on two handlers: taskDefined and statusChanged.

services/github/src/github-auth.js

matt-boris · 2022-09-08T13:45:41Z

services/github/src/handlers/index.js

-
-    const callHandler = (name, handler) => message => {
+    // handler returned must be async, as timedHandler will not be able to time it correctly
+    const callHandler = (name, handler) => message =>


🦅 👁️

This caused the function to return early, so all monitor.timedHandler() calls were reporting over-optimistic durations :)
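
A small sketch of why the missing return skews timings. The `timed` wrapper below is a hypothetical simplification; the real monitor.timedHandler() may work differently. `dropsPromise` and `returnsPromise` are illustrative names for the "before" and "after" shapes of callHandler.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical timing wrapper; the real monitor.timedHandler() may differ.
const timed = (fn) => async (message) => {
  const start = Date.now();
  await fn(message);
  return Date.now() - start;
};

const slowHandler = async () => { await sleep(50); };

// Block body without `return`: the promise is dropped, so the wrapper
// receives `undefined` and stops the clock almost immediately.
const dropsPromise = (handler) => (message) => {
  handler(message).catch(() => {});
};

// Expression body: the handler's promise is returned and awaited by the wrapper.
const returnsPromise = (handler) => (message) => handler(message);

async function demoTiming() {
  const early = await timed(dropsPromise(slowHandler))({});
  const full = await timed(returnsPromise(slowHandler))({});
  return { early, full };
}
```

With the dropped promise the measured duration is close to zero, while the returning variant measures roughly the full 50 ms of the handler.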

lotas · 2022-09-08T14:43:11Z

services/github/src/handlers/status.js

+  /**
+   * Github has a limit of 64Kb for the whole payload
+   */
+  getRemainingMaxSize() {


This will ensure that if someone includes custom annotations, we will consider their size too.
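
For illustration, a minimal sketch of this kind of size budgeting. The 64 KB figure comes from the code comment above; `remainingBytes` and `tailToFit` are hypothetical helpers and not the actual getRemainingMaxSize() implementation.

```javascript
// Assumed limit, taken from the code comment above ("64Kb for the whole payload").
const MAX_PAYLOAD_BYTES = 64 * 1024;

// Hypothetical helper: bytes left for the log excerpt once the other text
// fields (including custom annotations) are accounted for.
function remainingBytes(fields, limit = MAX_PAYLOAD_BYTES) {
  const used = fields.reduce(
    (sum, text) => sum + Buffer.byteLength(text || '', 'utf8'),
    0,
  );
  return Math.max(0, limit - used);
}

// Keep the tail of the text so that it fits into maxBytes.
// Caveat: slicing bytes can split a multi-byte character, which then decodes
// as U+FFFD, so this is only exact for ASCII input.
function tailToFit(text, maxBytes) {
  const buf = Buffer.from(text, 'utf8');
  if (buf.length <= maxBytes) {
    return text;
  }
  return buf.subarray(buf.length - maxBytes).toString('utf8');
}
```

Usage would look something like `tailToFit(liveLogText, remainingBytes([customCheckRunText, customCheckRunAnnotationsText]))`, so that a large annotations blob shrinks the room left for the log.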

lotas · 2022-09-08T20:53:28Z

services/github/src/handlers/status.js

+  const { reasonResolved } = runs[runId] || {};
+  const taskDefined = state === undefined;
+
+  await qLock.acquire(taskId);


This ensures we only run a single status update for a given task. Subsequent handlers will pause here until the previous one is done.
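
Roughly, the locking idea looks like the sketch below. Only `acquire(name)`, `release(name)` and the `this.queue[name]` array of resolvers are visible in the diff; everything else here (class name, details of `acquire`) is an assumption for illustration.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Minimal per-key lock sketch; not the actual QueuedLock implementation.
class QueuedLockSketch {
  constructor() {
    this.queue = {};
  }

  async acquire(name) {
    if (!this.queue[name]) {
      // Lock is free: take it and create the (empty) waiter list.
      this.queue[name] = [];
      return;
    }
    // Lock is held: park until release() hands it over.
    await new Promise((resolve) => this.queue[name].push(resolve));
  }

  release(name) {
    const nextResolver = this.queue[name].shift();
    if (nextResolver) {
      nextResolver(); // hand the lock to the oldest waiter
    } else {
      delete this.queue[name]; // nobody waiting: free the lock
    }
  }
}

async function demoLock() {
  const lock = new QueuedLockSketch();
  const order = [];
  const update = async (label, ms) => {
    await lock.acquire('task-1');
    try {
      order.push(`${label}:start`);
      await sleep(ms);
      order.push(`${label}:end`);
    } finally {
      lock.release('task-1');
    }
  };
  await Promise.all([update('a', 20), update('b', 1)]);
  return order;
}
```

Update `b` is much faster, but it still waits for `a` to release the lock, so updates for the same task never interleave.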

lotas · 2022-09-08T20:58:06Z

services/github/src/queue-lock.js

+  }
+
+  release(name) {
+    const nextResolver = this.queue[name].shift();


This is the only part that can cause issues: if client code doesn't call .release(), all pending callers will be stuck.
At the moment the status handler ensures that it always calls it before the end.

For the future, we might want to implement some fail-safe watchdog timers to force-release the lock after some timeout.

I don't know if a timer is the right fix here. If something is stuck and taking forever (like the GitHub API), probably waiting for it to fail is the right approach. The risk is that a return gets added to the function without a corresponding release. One way to avoid that is to have the acquire take an async callback which is called with the lock held, and unconditionally release the lock when the callback returns.

Yeah, I was also thinking about this callback approach, but didn't want to make code more complicated on the handler side
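
For reference, a sketch of what the callback approach could look like. `KeyedMutex` and `run` are hypothetical names, and this variant chains promises per key instead of keeping a list of resolvers; it is not what the PR implements.

```javascript
// Hypothetical callback-style lock: callers cannot forget to release.
class KeyedMutex {
  constructor() {
    // name -> promise that settles when the last queued callback is done
    this.tails = new Map();
  }

  run(name, callback) {
    const previous = this.tails.get(name) || Promise.resolve();
    // Start the callback once the previous holder has settled.
    const result = previous.then(() => callback());
    // Errors are swallowed for chaining only, so a failure does not block the queue.
    const tail = result.catch(() => {});
    this.tails.set(name, tail);
    // Drop the map entry once nothing else has queued behind this callback.
    tail.then(() => {
      if (this.tails.get(name) === tail) {
        this.tails.delete(name);
      }
    });
    return result; // the caller still sees the callback's value or error
  }
}

async function demoCallbackLock() {
  const mutex = new KeyedMutex();
  const order = [];
  const first = mutex.run('task-1', async () => {
    order.push('first');
    throw new Error('boom'); // the lock is still released
  });
  const second = mutex.run('task-1', async () => {
    order.push('second');
    return 'ok';
  });
  const results = await Promise.allSettled([first, second]);
  return { order, statuses: results.map((r) => r.status) };
}
```

The trade-off is the one mentioned above: the handler body has to move into a callback, but an early return or a thrown error can no longer leave the lock held.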

lotas · 2022-09-08T20:58:58Z

services/github/src/queue-lock.js

+/**
+ * Implements locked queue to allow one routine running at a time
+ */
+class QueuedLock {


is there a better name?

I wonder if something from https://github.com/sindresorhus/promise-fun would serve this purpose, or at least could be used for the map values?

I wasn't able to find anything that matched my use-case

petemoore · 2022-09-09T11:37:45Z

I'm not feeling too confident about being a good reviewer for this change. @djmitche Do you have any thoughts on this?

lotas · 2022-09-12T07:32:20Z

services/github/config.yml

@@ -5,7 +5,6 @@ defaults:
    deprecatedResultStatusQueue: 'stat-result'
    deprecatedInitialStatusQueue: 'stat-init'
    resultStatusQueue: 'ch-result'
-    initialStatusQueue: 'ch-init'


I guess we'll also need to drop this queue manually

djmitche

This looks pretty solid -- my notes aren't anything serious.

How clear is it to someone deploying TC that they can't run more than one handler process? I suspect that previously this wasn't necessary but wasn't expressly forbidden. Now it must be forbidden.

services/github/src/github-auth.js

djmitche · 2022-09-13T20:24:51Z

services/github/src/handlers/status.js

-  let debug = makeDebug(this.monitor, { taskGroupId, taskId });
-  debug(`Handling state change for task ${taskId} in group ${taskGroupId}, reason=${reasonResolved || state}`);
+  let debug = makeDebug(this.monitor, { taskGroupId, taskId, id: `id-${counter}` });
+  counter += 1;


Could this be an instance variable instead?

thanks! Realized that it is even better to move it to makeDebug() so it would be applied to all handlers using this utility function, and not only for status handler 👍
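
Something along these lines, as a sketch only. The real makeDebug() takes a monitor as well, but `monitor.log` and the shape of the logged object here are assumptions.

```javascript
// Module-level counter shared by every handler that uses makeDebug().
let counter = 0;

// Hypothetical makeDebug(): the real helper may log differently.
function makeDebug(monitor, attrs = {}) {
  // Each call gets its own id, so interleaved handler logs can be told apart.
  const id = `id-${counter}`;
  counter += 1;
  return (message) => monitor.log({ ...attrs, id, message });
}
```

Each handler invocation then gets a distinct `id` without the handler code having to track the counter itself.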

djmitche · 2022-09-13T20:30:58Z

services/github/src/queue-lock.js

+  }
+
+  release(name) {
+    const nextResolver = this.queue[name].shift();


I don't know if a timer is the right fix here. If something is stuck and taking forever (like the GitHub API), probably waiting for it to fail is the right approach. The risk is that a return gets added to the function without a corresponding release. One way to avoid that is to have the acquire take an async callback which is called with the lock held, and unconditionally release the lock when the callback returns.

djmitche · 2022-09-13T20:33:11Z

services/github/src/queue-lock.js

+/**
+ * Implements locked queue to allow one routine running at a time
+ */
+class QueuedLock {


I wonder if something from https://github.com/sindresorhus/promise-fun would serve this purpose, or at least could be used for the map values?

Changes how status handlers are called after the changes introduced in #5665. Consumer handlers need to stay sync and return early to allow the queue to send new messages. This also improves timedHandler usage for handlers and adds periodic debug stats: number of running/error/total handlers by queue. Fixes #5728

lotas requested a review from a team as a code owner September 7, 2022 11:01

lotas requested review from petemoore and matt-boris and removed request for a team September 7, 2022 11:01

lotas force-pushed the feature/github-status-handlers branch 2 times, most recently from c95f390 to a63ad93 Compare September 7, 2022 11:51

lotas changed the title ~~feat(github): Refactor status handler, remove taskDefined handler~~ WIP: [ci skip] Refactor status handler, remove taskDefined handler Sep 7, 2022

lotas force-pushed the feature/github-status-handlers branch 3 times, most recently from cc492d6 to d031f11 Compare September 7, 2022 12:40

lotas changed the title ~~WIP: [ci skip] Refactor status handler, remove taskDefined handler~~ Refactor status handler, remove taskDefined handler Sep 7, 2022

matt-boris reviewed Sep 7, 2022

lotas force-pushed the feature/github-status-handlers branch 3 times, most recently from d41ea41 to e47b721 Compare September 8, 2022 13:27

matt-boris reviewed Sep 8, 2022

lotas marked this pull request as draft September 8, 2022 14:34

lotas commented Sep 8, 2022

lotas force-pushed the feature/github-status-handlers branch 2 times, most recently from 63dff6d to 530a832 Compare September 8, 2022 20:48

lotas marked this pull request as ready for review September 8, 2022 20:52

lotas commented Sep 8, 2022

lotas force-pushed the feature/github-status-handlers branch from 530a832 to fe0bdb8 Compare September 8, 2022 20:54

lotas commented Sep 8, 2022

petemoore removed their request for review September 9, 2022 11:38

lotas requested a review from djmitche September 12, 2022 07:31

lotas commented Sep 12, 2022

djmitche previously approved these changes Sep 13, 2022

feat(github): Refactor status handler, remove taskDefined handler

5239527

lotas dismissed djmitche’s stale review via 5239527 September 15, 2022 10:00

lotas force-pushed the feature/github-status-handlers branch from fe0bdb8 to 5239527 Compare September 15, 2022 10:00

lotas merged commit 13096b4 into main Sep 15, 2022

lotas deleted the feature/github-status-handlers branch September 15, 2022 10:17

lotas mentioned this pull request Oct 19, 2022

Github status handler slowing down on FxCI #5728

Closed

lotas mentioned this pull request Oct 25, 2022

Github status handler fix #5733

Merged
