Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor status handler, remove taskDefined handler #5665

Merged
merged 1 commit into from
Sep 15, 2022

Conversation

lotas
Copy link
Contributor

@lotas lotas commented Sep 7, 2022

This resolves an issue where two different handlers were trying to check/create/update check_run at the same time, which led to several check_runs being created and status reports were being sent to the wrong one.

Now single status handler manages all state transitions including task definition, running, and all of the completion states.

Besides that, if multiple events try to update same taskId, it will be ensured that they are executed strictly in order to not overwrite future events or create multiple check runs.

This also removes /rerequested functionality for re-run tasks, as this feature is broken on github side, status remains forever completed no matter what. Instead, new check run is being created if task was rerun.

@lotas lotas requested a review from a team as a code owner September 7, 2022 11:01
@lotas lotas requested review from petemoore and matt-boris and removed request for a team September 7, 2022 11:01
@petemoore
Copy link
Member

This looks good, but I'm wondering how best we can test the various types of race conditions that caused us pains before, aside from testing it in production. For example, could we deploy it to staging before releasing it, and then mirror production traffic to staging, to see how it handles things?

I'm guessing it is going to be pretty tricky to mock production like conditions without building a big framework to emulate heavy load etc.

@lotas
Copy link
Contributor Author

lotas commented Sep 7, 2022

This looks good, but I'm wondering how best we can test the various types of race conditions that caused us pains before, aside from testing it in production. For example, could we deploy it to staging before releasing it, and then mirror production traffic to staging, to see how it handles things?

I'm guessing it is going to be pretty tricky to mock production like conditions without building a big framework to emulate heavy load etc.

I will be testing it on dev.alpha right now, with the single connected repo to confirm that refactoring still works.
Plus all unit tests still gives us a good signal that the expected functionality and outputs are still the same.

Goal of this refactoring were to avoid concurrent handlers trying to create/update check runs for a single task.
This was possible in the past (although extremely rare) because of two different handlers - taskDefined and status. And as we saw in this task, it happened because both of those handlers were creating check run just 0.005s apart :)
image

Now, both events will be sent to the same queue, and as a result only one event will be handled at a time. PulseConsumer is waiting for handler to finish before receiving new message from that queue. As long as we don't scale github-worker to multiple instances it should be fine ;) And if we do, we'd have to come up with some distributed state machines .. huh :)

@lotas lotas force-pushed the feature/github-status-handlers branch 2 times, most recently from c95f390 to a63ad93 Compare September 7, 2022 11:51
@lotas lotas changed the title feat(github): Refactor status handler, remove taskDefined handler WIP: [ci skip] Refactor status handler, remove taskDefined handler Sep 7, 2022
@lotas lotas force-pushed the feature/github-status-handlers branch 3 times, most recently from cc492d6 to d031f11 Compare September 7, 2022 12:40
@lotas lotas changed the title WIP: [ci skip] Refactor status handler, remove taskDefined handler Refactor status handler, remove taskDefined handler Sep 7, 2022
Copy link
Contributor

@matt-boris matt-boris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lookin good!

changelog/ZuGEp6tWRWW4ED2sVXTkuQ.md Outdated Show resolved Hide resolved
const taskDefinition = await this.queueClient.task(taskId);
const fetchArtifact = async (artifactPath) => {
if (taskDefined || !runId) {
// when task is being defined, there will be no artifacts, so we fake the call and return empty response
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏻

build,
scopes: taskDefinition.scopes,
});
const [liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const [liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([
const [ liveLogText, customCheckRunText, customCheckRunAnnotationsText ] = await Promise.all([

services/github/src/handlers/status.js Outdated Show resolved Hide resolved
@lotas lotas force-pushed the feature/github-status-handlers branch 3 times, most recently from d41ea41 to e47b721 Compare September 8, 2022 13:27
Copy link
Contributor

@matt-boris matt-boris left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple small suggestions! But overall - 💪🏻 💪🏻


Refactored github status checks handler to do handle task status transitions in single place.

Previous implementaition relied on two handlers: taskDefined and statusChanged.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Previous implementaition relied on two handlers: taskDefined and statusChanged.
Previous implementation relied on two handlers: taskDefined and statusChanged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :)

services/github/src/github-auth.js Show resolved Hide resolved

const callHandler = (name, handler) => message => {
// handler returned must be async, as timedHandler will not be able to time it correctly
const callHandler = (name, handler) => message =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🦅 👁️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this caused function to return early, and all monitor.timedHandler() were returning over-optimistic durations :)

@lotas lotas marked this pull request as draft September 8, 2022 14:34
/**
* Github has a limit of 64Kb for the whole payload
*/
getRemainingMaxSize() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will ensure that if someone includes custom annotations, we will consider its size too

@lotas lotas force-pushed the feature/github-status-handlers branch 2 times, most recently from 63dff6d to 530a832 Compare September 8, 2022 20:48
@lotas lotas marked this pull request as ready for review September 8, 2022 20:52
const { reasonResolved } = runs[runId] || {};
const taskDefined = state === undefined;

await qLock.acquire(taskId);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ensures we only run single status update for a given task. Consequent handlers will pause here until the previous is done.

@lotas lotas force-pushed the feature/github-status-handlers branch from 530a832 to fe0bdb8 Compare September 8, 2022 20:54
}

release(name) {
const nextResolver = this.queue[name].shift();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only part that can cause issues, if client code doesn't call .release() all pending callers will be stuck.
At the moment status handler ensures that it always calls it before the end.

For the future we might want to implement some fail-safe watchdog timers to force-release lock after some timeout

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if a timer is the right fix here. If something is stuck and taking forever (like the GitHub API), probably waiting for it to fail is the right approach. The risk is that a return gets added to the function without a corresponding release. One way to avoid that is to have the acquire take an async callback which is called with the lock held, and unconditionally release the lock when the callback returns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was also thinking about this callback approach, but didn't want to make code more complicated on the handler side

/**
* Implements locked queue to allow one routine running at a time
*/
class QueuedLock {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a better name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if something from https://github.com/sindresorhus/promise-fun would serve this purpose, or at least could be used for the map values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't able to find anything that matched my use-case

@petemoore
Copy link
Member

I'm not feeling too confident about being a good reviewer for this change. @djmitche Do you have any thoughts on this?

@petemoore petemoore removed their request for review September 9, 2022 11:38
@lotas lotas requested a review from djmitche September 12, 2022 07:31
@@ -5,7 +5,6 @@ defaults:
deprecatedResultStatusQueue: 'stat-result'
deprecatedInitialStatusQueue: 'stat-init'
resultStatusQueue: 'ch-result'
initialStatusQueue: 'ch-init'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we'll also need to drop this queue manually

djmitche
djmitche previously approved these changes Sep 13, 2022
Copy link
Collaborator

@djmitche djmitche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty solid -- my notes aren't anything serious.

How clear is it to someone deploying TC that they can't run more than one handler process? I suspect that previously this wasn't necessary but wasn't expressly forbidden. Now it must be forbidden.

services/github/src/github-auth.js Show resolved Hide resolved
let debug = makeDebug(this.monitor, { taskGroupId, taskId });
debug(`Handling state change for task ${taskId} in group ${taskGroupId}, reason=${reasonResolved || state}`);
let debug = makeDebug(this.monitor, { taskGroupId, taskId, id: `id-${counter}` });
counter += 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be an instance variable instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! Realized that it is even better to move it to makeDebug() so it would be applied to all handlers using this utility function, and not only for status handler 👍

}

release(name) {
const nextResolver = this.queue[name].shift();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if a timer is the right fix here. If something is stuck and taking forever (like the GitHub API), probably waiting for it to fail is the right approach. The risk is that a return gets added to the function without a corresponding release. One way to avoid that is to have the acquire take an async callback which is called with the lock held, and unconditionally release the lock when the callback returns.

/**
* Implements locked queue to allow one routine running at a time
*/
class QueuedLock {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if something from https://github.com/sindresorhus/promise-fun would serve this purpose, or at least could be used for the map values?

@lotas lotas force-pushed the feature/github-status-handlers branch from fe0bdb8 to 5239527 Compare September 15, 2022 10:00
@lotas lotas merged commit 13096b4 into main Sep 15, 2022
@lotas lotas deleted the feature/github-status-handlers branch September 15, 2022 10:17
lotas added a commit that referenced this pull request Oct 25, 2022
Changes the way how status handlers are being called after changes
introduced in #5665.
Consumer handlers need to stay sync and return early to allow queue send
new messages.
This also improves timedHandler usage for handlers and adds periodic
debug stats: number of running/error/total handlers by queue.

Fixes #5728
@lotas lotas mentioned this pull request Oct 25, 2022
lotas added a commit that referenced this pull request Oct 25, 2022
Changes the way how status handlers are being called after changes
introduced in #5665.
Consumer handlers need to stay sync and return early to allow queue send
new messages.
This also improves timedHandler usage for handlers and adds periodic
debug stats: number of running/error/total handlers by queue.

Fixes #5728
lotas added a commit that referenced this pull request Oct 25, 2022
Changes the way how status handlers are being called after changes
introduced in #5665.
Consumer handlers need to stay sync and return early to allow queue send
new messages.
This also improves timedHandler usage for handlers and adds periodic
debug stats: number of running/error/total handlers by queue.

Fixes #5728
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants