
3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025

Closed · 2 of 3 tasks
sjhewitt opened this issue Oct 17, 2023 · 111 comments · Fixed by #12736, #12912 or #13021
Labels: area/api (Argo Server API) · area/workflow-archive · P1 (high priority: all bugs with >=5 thumbs up that aren't P0, plus any other bugs deemed high priority) · type/bug · type/regression (regression from previous behavior, a specific type of bug)

Comments

@sjhewitt

sjhewitt commented Oct 17, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out.

Scanning the code, it looks like the LoadWorkflows code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.

As a workaround, we've reduced the archive TTL from 14 days to 1 day; the endpoint now responds before timing out, but is still pretty slow.
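In rough terms, the flow being described looks like this (a minimal sketch with hypothetical types and helpers, not the actual argo-server code):

```go
package sketch

import (
	"sort"
	"time"
)

// Workflow and the fetch helpers are hypothetical stand-ins.
type Workflow struct{ StartedAt time.Time }

func fetchAllArchived() []Workflow { return nil } // SELECT * over the archive table: 200k+ rows
func fetchAllLive() []Workflow     { return nil } // full LIST against the k8s API

// listWorkflows shows the problematic shape: everything is loaded and
// merged first, and the page limit is applied only at the very end.
func listWorkflows(limit int) []Workflow {
	all := append(fetchAllArchived(), fetchAllLive()...)
	sort.Slice(all, func(i, j int) bool {
		return all[i].StartedAt.After(all[j].StartedAt) // newest first
	})
	if len(all) > limit {
		all = all[:limit]
	}
	return all
}
```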

Version

v3.5.0

--- edits below by agilgur5 to add updates since this is a (very) popular issue ---

Updates

@terrytangyuan
Member

Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method that first loads all live workflows and archived workflows and then performs pagination. cc @sunyeongchoi who worked on this in #11761.

@sunyeongchoi
Member

Hello. I will test the issue as soon as possible and think about a solution.
Thank you.

@agilgur5 agilgur5 added type/regression Regression from previous behavior (a specific type of bug) P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority area/api Argo Server API labels Oct 18, 2023
@agilgur5
Member

agilgur5 commented Oct 18, 2023

For posterity, this was actually discussed yesterday at the Contributors Meeting.
@jmeridth had been looking into it as it is blocking his team from upgrading as well and had eventually traced it to this PR discussion: #11761 (comment).

(I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is unrelated in this case, and the refactor actually decreased the number of network requests.
Also #11840 removed a default date filter, but that was entirely client-side anyway, so did not impact any networking.)

@jmeridth
Member

jmeridth commented Oct 18, 2023

@sjhewitt thank you for filing this. I'm having the same issue. 30+ second page loads. I'm gathering data now and will post here once obtained.

@sunyeongchoi thanks for taking a look.

Can confirm this isn't related to the "past month" default being removed from the UI search box, as @agilgur5 states. That is only client-side.

@terrytangyuan
Member

terrytangyuan commented Oct 18, 2023

There are some optimizations we can do, e.g. #12030.

However, this issue is more specific to argo-server/backend. The root cause is that the pagination method we implemented for ListWorkflows() requires retrieval of the entire list of workflows at once.

@sunyeongchoi
Member

@sjhewitt @jmeridth
Hello. I have a question about the circumstances under which this bug occurs.
Does this issue occur as soon as you first enter the Workflows list page, or only when moving between pages?

@jmeridth
Member

@suryaoruganti every visit.

@sunyeongchoi
Member

@jmeridth Ok! Thanks for your answer.

@terrytangyuan
Member

terrytangyuan commented Oct 23, 2023

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

I propose that we add a dropdown to allow users to select whether to display:

  1. Only live workflows;
  2. Only archived workflows;
  3. Both live and archived workflows (with notice/warning that this is only suitable when the number of workflows is not too large);

Additional requirements:

  1. This dropdown box is only available if there are both live and archived workflows. The UI should be smart enough to figure this out.
  2. We should also consider the ability for an admin to disable the third option, to avoid degrading cluster performance.
  3. The "archived" column should only be displayed when appropriate. For example, it's evident that a workflow is archived if the user is only viewing archived workflows.

Motivations for this proposal:

  1. Some users only care about one of these types of workflows;
  2. Since there are performance issues that we cannot get around when viewing both types of workflows, we should provide this option only with caution;
  3. Keep using the original pagination implementation for live workflows or archived workflows where the logic is much more precise while keeping the front-end codebase simple;
  4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

Any thoughts?

@rwong2888
Contributor

rwong2888 commented Oct 23, 2023

Just reverted to v3.5.0-rc1 due to this. Perhaps we can add the default started-time filter back?

@terrytangyuan
Member

terrytangyuan commented Oct 23, 2023

I am reverting the related change in #12068 for now, since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal #12025 (comment) here. We can probably release a patch version next week: #11997 (comment).

@agilgur5
Member

agilgur5 commented Oct 23, 2023

Just reverted back to v3.5.0-rc1 due to this. Perhaps we can add the default started time back?

If you're referring to #11840, it is mentioned above that it is unrelated to this issue, as that filter is entirely client-side (the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR).

@agilgur5
Member

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server.
For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.

4. The first two options are almost identical to previous versions, but the UI should be less buggy since they now share most of the implementation; the third option is an addition to previous versions.

I feel like separate pages are a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.

@sjhewitt
Author

sjhewitt commented Oct 23, 2023

@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.).

What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. For instance, in the worst-case, if a user has 20 Workflows per page set, we can retrieve 20 from k8s and 20 from the Archive DB, which is not horrendous (but for sure could be optimized).

I'm similarly curious...
I wonder if it would be possible to use a cursor that encodes two offsets, one for the k8s API and one for the DB, then fetch limit rows from both sources at the given offsets, merge the results together, and apply the limit to the combined list.

something like:

```ts
// cursor encodes two offsets: [k8s offset, DB offset]
const limit = 20;
let cursor: [number, number] = [0, 0];

// fetch one page worth from each source, using that source's offset
const k8sResults = await fetchK8s(cursor[0], limit, filters, orderBy);
const dbResults = await fetchDB(cursor[1], limit, filters, orderBy);

// merge the two (already sorted) lists, then trim to a single page
const results = mergeResults(k8sResults, dbResults).slice(0, limit);

// the next cursor resumes after the last item taken from each source
const newCursor: [number, number] = [
  getLastK8sResult(results),
  getLastDbResult(results),
];
```

@sunyeongchoi
Member

sunyeongchoi commented Oct 24, 2023

@agilgur5 @sjhewitt

Hello. I kept thinking about a way to merge archived and live workflows into one page instead of splitting them into two pages.

However, two problems continue to hinder a solution.
First, archived and live workflows may overlap with each other, so deduplication is necessary.
Second, we have no control over the pagination of Kubernetes resources (live workflows).

Before this issue occurred, the logic removed duplicates by fetching 20 archived workflows and 20 live workflows whenever 20 items were needed on one page.

However, in that case, some live workflow data may be missing when viewing the next page.

This is because Kubernetes' own pagination does not allow us to decide which item the next page starts with.

This problem could be solved if Kubernetes' own pagination logic could be used as-is (Kubernetes code analysis is required).

But if that's not possible, I don't know of any better way.

In conclusion: to know what data to start with on the next page, we cannot control this unless we know exactly the continue logic (for pagination) of Kubernetes and can implement it the same way, or unless we retrieve the entire set of live workflows.

Do you have any other good suggestions?
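For reference, this is the contract being described: the client can only hand an opaque continue token back to the API server and cannot choose where the next page starts. A sketch using the generated Argo clientset (treat the wiring as illustrative):

```go
package sketch

import (
	"context"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// listPage fetches one page of live workflows. The continue token is opaque:
// the API server alone decides where the next page starts, so the caller
// cannot resume from an arbitrary row the way a SQL keyset cursor can.
func listPage(ctx context.Context, wfClient wfclientset.Interface, ns, token string) (string, error) {
	list, err := wfClient.ArgoprojV1alpha1().Workflows(ns).List(ctx, metav1.ListOptions{
		Limit:    20,
		Continue: token, // "" for the first page
	})
	if err != nil {
		return "", err
	}
	// Items come back in etcd key order (namespace/name), not by
	// creationTimestamp, so server-side sorting is unavailable here.
	return list.Continue, nil
}
```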

@sjhewitt
Author

Ahh, I see. I didn't have much knowledge of the k8s API, so I didn't realize it doesn't really support filtering/ordering/pagination.

The 3 options I see are:

  • separate the k8s backed api/ui from the db backed api/ui (reverting to previous behaviour)
  • fetch the whole k8s data and merge it with a subset of the db data
  • persist workflows in the DB from the moment they are submitted, updating their state as they are scheduled/change state; then (if the DB archive is enabled) make the UI rely solely on querying the DB. In this case, the data could be augmented with data from the k8s API while the workflows are still running...

@sunyeongchoi
Member

sunyeongchoi commented Oct 24, 2023

An idea suddenly came to mind while writing a comment.

I think this problem can be solved by knowing which data to start with on the next page.

In that case, I could solve the problem using the function I implemented previously that performs pagination based on the resourceVersion.

Existing: merge all archived workflows and all live workflows, then apply the cursorPaginationByResourceVersion function.

Suggestion: if 20 items are needed on one page, fetch 20 archived workflows, fetch 20 live workflows, and then merge them. Use the cursorPaginationByResourceVersion function when fetching live workflow data for the next page.

However, even with this method, all live workflows will be fetched on every page (but not all archived workflows).

I think it will be more efficient than the previous method.

I'll listen to your opinions, and if you think it's a good method, I'll try it and share the results.

@Guillermogsjc

Guillermogsjc commented Oct 24, 2023

It is not only breaking the UI and making it unusable on v3.5.0; it is also crashing badly with OOM on a pod deployed with 3200 MiB guaranteed memory, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows, with working GC).

The issue is on the main view where all workflows are listed.

Also, as part of this pagination work, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve the argo-server workflow list. This, together with the "show archived" flag you are discussing, would help a lot.

@terrytangyuan
Member

terrytangyuan commented Oct 24, 2023

Suggestion: If 20 pieces of data are needed on one page, fetch 20 Archived Workflows, fetch 20 Workflows, and then merge them. Use the cursorPaginationByResourceVersion function when searching Workflows data for the next page.

@sunyeongchoi That might be a good optimization for the third option I listed in #12025 (comment). It could help a lot when there are many more archived workflows than live workflows, although we would still need to fetch the entire list of live workflows on every page.

Also, as part of this pagination work, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve the argo-server workflow list. This, together with the "show archived" flag you are discussing, would help a lot.

@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.

@jessesuen
Member

jessesuen commented Oct 24, 2023

With the unified workflow list API for both live + archived workflows we have to do the following:

  1. we need all workflows in the cluster, because the Kubernetes API server does not support sorting by creation timestamp, but argo-server does.
  2. we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from the archive is nonsensical.

Number 1 is a challenge because performing a LIST of all workflows for every request will be taxing on the K8s API server as users use the UI. Listing all workflows for the purpose of showing only 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD:

Since LIST all workflows is the requirement and also the problem, we can use an informer cache on Argo API server to avoid the additional LIST calls to k8s and have all workflows ready to be returned by API server from in-memory informer cache. When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.

Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with archive policy and gc. And the benefits of a unified live+archive API, as well as reducing load on K8s API server, outweigh the extra memory requirements of argo-server.

If we are worried about expensive processing / sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, DeleteFunc registered to the informer. But I consider this an optimization.
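A rough shape of that technique, assuming Argo's generated informer/lister packages (a sketch, not the eventual implementation):

```go
package sketch

import (
	"sort"
	"time"

	wfv1 "github.com/argoproj/argo-workflows/v3/pkg/apis/workflow/v1alpha1"
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	wfinformers "github.com/argoproj/argo-workflows/v3/pkg/client/informers/externalversions"
	"k8s.io/apimachinery/pkg/labels"
)

// startInformer builds an informer whose in-memory cache replaces per-request
// LIST calls to the k8s API server; the resync period periodically rebuilds it.
func startInformer(client wfclientset.Interface, stopCh <-chan struct{}) wfinformers.SharedInformerFactory {
	factory := wfinformers.NewSharedInformerFactory(client, 20*time.Minute)
	factory.Argoproj().V1alpha1().Workflows().Informer() // register the type before Start
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return factory
}

// listLiveFromCache answers a list request from the cache, sorting in memory
// by creation timestamp (which the k8s API cannot do server-side).
func listLiveFromCache(factory wfinformers.SharedInformerFactory, ns string) ([]*wfv1.Workflow, error) {
	wfs, err := factory.Argoproj().V1alpha1().Workflows().Lister().Workflows(ns).List(labels.Everything())
	if err != nil {
		return nil, err
	}
	sort.Slice(wfs, func(i, j int) bool {
		return wfs[i].CreationTimestamp.Time.After(wfs[j].CreationTimestamp.Time)
	})
	return wfs, nil
}
```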

@Guillermogsjc

Guillermogsjc commented Oct 24, 2023

Would a daily partition label on etcd for the workflow objects be a mad idea?

workflows.argoproj.io/daily_index: {{workflow.creationTimestamp.%Y-%m-%d}}

Partitioning indexes to allow performant queries on database objects is often the way to go to avoid devastating monolithic queries, like the one you describe against the k8s API/etcd, which does not support sort-and-limit by creation timestamp.

With a daily partition label, the Argo server would be able to compute the partitions of interest and filter on them in its queries against the k8s API.

Anyway, the problem has to be that monolithic query against the backend database: if it brings back everything instead of filtering, the load on the index page can be enormous.

In our case at least (following the documentation's recommended good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows in PostgreSQL and only the last 6 hours of workflows in etcd, it is clear that the described full query against the database backend is the killer.
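For illustration, such a label would let the server filter server-side with a set-based selector. A sketch (the daily_index label key is the hypothetical one proposed above):

```go
package sketch

import (
	"context"
	"fmt"
	"strings"
	"time"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// listRecentDays lists only workflows carrying the proposed daily partition
// label for the last n days, so the API server filters instead of returning
// every workflow in the cluster.
func listRecentDays(ctx context.Context, wfClient wfclientset.Interface, ns string, days int) error {
	dates := make([]string, 0, days)
	for i := 0; i < days; i++ {
		dates = append(dates, time.Now().AddDate(0, 0, -i).Format("2006-01-02"))
	}
	_, err := wfClient.ArgoprojV1alpha1().Workflows(ns).List(ctx, metav1.ListOptions{
		// set-based selector: one call covers the whole date range
		LabelSelector: fmt.Sprintf("workflows.argoproj.io/daily_index in (%s)", strings.Join(dates, ",")),
	})
	return err
}
```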

@sunyeongchoi
Member

Thank you all; so many people have suggested good ideas.

First, I will start with optimizing archived workflows:

we should only query X workflows from the archive, where X is the requested page size. The underlying database does support filtering and sorting, so this is efficient. The fact that we query everything from archive is nonsensical.

After that I will investigate the informer cache :)

When API server is returning a unified list of live + archive, it would call List() against the informer cache rather than List against K8s API, and then filter/merge/sort before returning results back to the caller.
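What "only query X workflows from the archive" can look like at the SQL level (a sketch; the actual schema, dialect, and placeholders may differ):

```go
package sketch

import "database/sql"

// pageArchived pushes filtering, ordering, and the page size down into the
// database instead of loading the whole archive table into memory.
// Table and column names here are assumptions for illustration.
func pageArchived(db *sql.DB, namespace string, limit, offset int) (*sql.Rows, error) {
	return db.Query(`
		SELECT uid, name, startedat, finishedat
		  FROM argo_archived_workflows
		 WHERE namespace = $1
		 ORDER BY startedat DESC
		 LIMIT $2 OFFSET $3`,
		namespace, limit, offset)
}
```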

@terrytangyuan
Member

@sunyeongchoi Great! Let me know if you need any help.

@Guillermogsjc

Guillermogsjc commented Apr 10, 2024

hi @jiachengxu

On v5 I'm still seeing this bug of missing workflows (the workflows are not badly ordered, they just don't appear anywhere); there are plenty of archived workflows less than 7 days old that should appear in this picture:

[screenshot]

@jiachengxu
Member

hi @jiachengxu

On v5 I'm still seeing this bug of missing workflows (the workflows are not badly ordered, they just don't appear anywhere); there are plenty of archived workflows less than 7 days old that should appear in this picture:

[screenshot]

@Guillermogsjc This is weird, I couldn't reproduce the issue on my side. Have you seen this issue with fix-v3?

@Guillermogsjc

@jiachengxu it is an issue that has been around all the time in all fixes :(

@agilgur5
Member

agilgur5 commented Apr 10, 2024

@Guillermogsjc to clarify, that issue doesn't exist in 3.5.5, right?

The gap with multi-day old Workflows sounds like the initial List didn't capture them into the SQLite DB.

If there is a gap with Workflows that ran after you started the Server with the fix, that is potentially more concerning, as the Watch can miss events (which is why Informers rebuild their cache on a set interval).

EDIT: the gap is with archived workflows, not live workflows, see below
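(The resync interval referred to here is the standard client-go knob; a generic sketch:)

```go
package sketch

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// A non-zero resync period makes the informer periodically re-deliver every
// object in its local cache to its handlers; separately, when a watch breaks
// or falls behind, the reflector relists from the API server, which is what
// repairs a cache that missed events.
func newFactory(client kubernetes.Interface) informers.SharedInformerFactory {
	return informers.NewSharedInformerFactory(client, 10*time.Minute)
}
```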

@Guillermogsjc

@agilgur5 I can confirm this behaviour is not seen, at least on our deployment, in v3.5.5.

v3.5.5 shows all the workflows from both sources. The SQLite versions look like they are missing a large number of archived workflows (etcd workflows appear normally).

@agilgur5
Member

agilgur5 commented Apr 10, 2024

Also, to confirm, have you seen any OOMs or memory "exploding" (as you mentioned previously) in fix-v4 or fix-v5?
I am hoping that #12912 fixes the last performance regression in 3.5 (#12736 is still necessary for correctness regressions and feature regressions)

@agilgur5
Member

The SQLite versions look like they are missing a large number of archived workflows (etcd workflows appear normally).

Oh, from archived. Hmm, ok, that would at least be easier to fix. That would suggest the cursor querying might be off.

@jiachengxu
Member

The SQLite versions look like they are missing a large number of archived workflows (etcd workflows appear normally).

@agilgur5 @Guillermogsjc I finally reproduced the issue, and it is indeed related to the cursor querying.
In the pagination method I mentioned above, I was querying only the archived workflows older than the startedAt of the oldest live workflow, and this caused archived workflows that started after that oldest live workflow, but had already completed and been archived, to go missing.

The original reason I used this oldest startedAt to query archived workflows was to avoid duplicates, because workflows with a newer startedAt can exist as both a live workflow and an archived workflow. However, because of the change I made here (the SQLite DB only stores live workflows that don't have workflows.argoproj.io/workflow-archiving-status set to Archived), we already exclude workflows which are archived but still in the k8s cluster at the SQLite-based informer level, so we don't have the duplications anymore.

I pushed a new image that contains the above fix/improvement to quay.io/jiachengxu/argocli:fix-v6, please give it a try.
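The deduplication rule described above, sketched as an informer-level filter (the label key is the one named in this thread; the factory wiring is illustrative):

```go
package sketch

import (
	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
	wfinformers "github.com/argoproj/argo-workflows/v3/pkg/client/informers/externalversions"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newLiveOnlyFactory excludes already-archived workflows from the live
// (informer) side, so a merged live+archived list never sees the same
// workflow from both sources.
func newLiveOnlyFactory(client wfclientset.Interface) wfinformers.SharedInformerFactory {
	return wfinformers.NewSharedInformerFactoryWithOptions(client, 0,
		wfinformers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/workflow-archiving-status!=Archived"
		}))
}
```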

@Guillermogsjc

Guillermogsjc commented Apr 11, 2024

because of the change I made here (the SQLite DB only stores live workflows that don't have workflows.argoproj.io/workflow-archiving-status set to Archived), we already exclude workflows which are archived but still in the k8s cluster at the SQLite-based informer level, so we don't have the duplications anymore.

This partitioning looks optimal.

On v6 all archived workflows are present and it looks fine.

It uses plenty of memory when paginating:

[screenshot]

but it seems bounded somehow, so it would be tweakable per deployment (the important point is having bounded memory usage on operations, which depends on the mean size of the workflow column in the tables).

Just to note: v6 shows all non-archived workflows first, then archived ones, so this can feel weird if you have very old etcd workflows (mostly because the TTL worker errored deleting them, I guess), but this is probably not a main concern.

@jiachengxu
Member

jiachengxu commented Apr 11, 2024

Just to note: v6 shows all non-archived workflows first, then archived ones, so this can feel weird if you have very old etcd workflows (mostly because the TTL worker errored deleting them, I guess), but this is probably not a main concern.

@Guillermogsjc Are you sure about this behaviour? In my case the workflows are sorted by started time per page, and the behaviour is defined by the UI.
[screenshot]

It uses plenty of memory when paginating.

How does it compare to the previous version? I am expecting that the memory usage should be lower since I did the optimization mentioned here and also this PR.

@Guillermogsjc

@Guillermogsjc Are you sure about this behaviour? In my case the workflows are sorted by started time per page, and the behaviour is defined by the UI.

Sure, look at this deployment of v6 in our production environment, and those non-archived workflows that are many days old. Never mind, @jiachengxu: this does not happen in the other environment, so maybe those are corrupt manifests somehow; don't bother with it.

[screenshot]

How does it compare to the previous version? I am expecting that the memory usage should be lower since I did the optimization mentioned in #12025 (comment) and also #12912.

Memory load was exceeding 2700 MiB on older versions; now, in our case, it stays bounded around 1.6 GB per pod.
I would say it is significantly lower, but stricter checks are welcome.

@ryancurrah
Contributor

@terrytangyuan, @agilgur5 was doing the bulk of the review here; why did it get merged before he gave his final review?

@terrytangyuan
Member

terrytangyuan commented Apr 29, 2024

@ryancurrah What bulk review? I don’t see any blocking review comments from @agilgur5 in the PR.

@ryancurrah
Contributor

Typically the reviewer's approval is required before merging a pull request. I'm a bit confused about why this one was merged without Anton's okay, since he did the review. Would you mind elaborating on that decision?

@terrytangyuan
Member

Please correct me if I missed anything:

  1. All comments from @agilgur5 in "feat: add sqlite-based memory store for live workflows. Fixes #12025" (#12736) were addressed and marked as resolved by @agilgur5.
  2. There were no unresolved comments. No requested changes from reviewers.
  3. The PR was approved by 3 approvers. We actually only need one approval to merge a PR.

If you see any additional issues, feel free to follow up by commenting here or submitting a new issue.

@agilgur5 agilgur5 changed the title ListWorkflows causes server to hang when there are lots of archived workflows 3.5 ListWorkflows causes server to hang when there are lots of archived workflows Apr 29, 2024
@agilgur5
Member

agilgur5 commented Apr 29, 2024

I was the main analyzer here in the issue, but I haven't had time to give #12736 an in-depth review unfortunately. Getting the approach right here was always my top priority though and I think we ironed the heck out of it.
I didn't see any glaring issues either after the CGO removal, otherwise I would've left a blocking comment.
If other Approver+ folks gave it a thorough look-over, I think it's had sufficient due diligence. If there are follow-up issues or follow-up code review comments, we can follow up on them (see also #12912 (comment) as an example of that). I wouldn't anticipate any substantial changes though; that's why I was trying to iron out the approach after all.

We otherwise don't have firm rules on the process outside of the branch protection requirements (we don't have many Approvers either, see also the Sustainability Effort).

So I think Terry made the right call there (and also gave some more time than strictly necessary for others to take a look too, which I'm pretty sure was intentional). Would be great if I could have taken a deeper look too, but unfortunately we have bottlenecks and this is the top P1, so we all have to use our best available judgment often.

@ryancurrah
Contributor

Thank you both for helping me gain a clearer understanding of the current review process.

Joibel added a commit to Joibel/argo-workflows that referenced this issue May 7, 2024
…Fixes argoproj#12025 (argoproj#12736)"

This reverts commit f1ab5aa.

Signed-off-by: Alan Clucas <alan@clucas.org>
@Joibel
Member

Joibel commented May 8, 2024

This fix has been reverted in #13018, so re-opening this until a new PR is merged.
