3.5 ListWorkflows causes server to hang when there are lots of archived workflows #12025
Thanks for this issue. This is a known issue if you have a lot of archived workflows. It's caused by the pagination method, which first loads all live and archived workflows and then performs pagination. cc @sunyeongchoi, who worked on this in #11761.
Hello. I will test the issue as soon as possible and think about a solution.
For posterity, this was actually discussed yesterday at the Contributors Meeting. (I was involved as I made substantial refactors to the UI for 3.5 -- #11891 in particular -- in case those were the cause, but the UI is actually unrelated in this case, and the refactor actually decreased the number of network requests.)
@sjhewitt thank you for filing this. I'm having the same issue: 30+ second page loads. I'm gathering data now and will post here once obtained. @sunyeongchoi thanks for taking a look. Can confirm this isn't related to the "past month" default being removed from the UI search box, as @agilgur5 states; that is only client-side.
There are some optimizations we can do, e.g. #12030. However, this issue is more specific to the argo-server/backend. The root cause is the pagination method we implemented for the combined live + archived workflow list.
@suryaoruganti Every visit.
@jmeridth Ok! Thanks for your answer.
@sunyeongchoi and I have discussed a couple of potential solutions. Unfortunately, there are edge cases that we cannot get around with those approaches (due to the complexity of pagination, deduplication, sorting, etc.). I propose that we add a dropdown to allow users to select whether to display:
Additional requirements:
Motivations for this proposal:
Any thoughts?
Just reverted back to
I am reverting the related change #12068 for now, since otherwise the UI is not usable when there are many workflows. In the meantime, we can continue discussing proposal #12025 (comment) here. We can probably release a patch version next week #11997 (comment).
If you're referring to #11840, it is mentioned above that it is unrelated to this issue, as that filter is entirely client-side (the k8s API server has no native way of doing date filtering, though that logic also pre-dates me and that PR).
What are some of those edge cases? We can still over/under fetch so long as it does not overload the server. Did the previous Archived Workflows page not have pagination? If so, I would think it would have been similarly susceptible to this, just not as frequently hit since it was a separate page.
I feel like separate pages are a better UX than a drop-down. If the APIs are identical, some careful refactoring could make them share a UI implementation.
I'm similarly curious... something like:
|
Hello. I kept thinking of a way to merge the live and archived workflow lists. However, there are two problems that continue to hinder a solution.

Before this issue occurred, the logic removed duplicates by searching the 20 items on a page. However, in this case, some items can go missing from a page. This is because Kubernetes' own pagination does not let us decide which data to start the next page with. This problem could be solved if Kubernetes' own pagination logic could be used as is (Kubernetes code analysis is required). But if that's not possible, I don't know of a better way.

In conclusion, I thought that in order to know what data to start with on the next page, it would be impossible for us to control this unless we knew exactly where the previous page ended. Do you have any other good suggestions?
ahh, I see - I didn't have much knowledge of the k8s API, so I didn't realize it doesn't really support filtering/ordering/pagination. The 3 options I see are:
|
There is an idea that suddenly came to mind while writing a comment. I think this problem can be solved by knowing which data to start with on the next page. In that case, I thought I could solve the problem by using a function that performs pagination based on that starting point.

Existing: use the current approach of loading everything and then paginating. Suggestion: if 20 pieces of data are needed on one page, fetch only 20 items starting from where the previous page ended, instead of everything.

However, even with this method, all live workflows would still need to be fetched. I think it will be more efficient than the previous method. I'll listen to your opinions, and if you think it's a good method, I'll try it and share it with you.
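A rough Go sketch of this suggestion (hypothetical names; in practice the live list still has to be fetched in full from the K8s API before this step, since K8s cannot sort by start time): each page is defined by the last item of the previous page rather than by an offset:

```go
package main

import (
	"fmt"
	"sort"
)

// wf is a hypothetical minimal workflow record.
type wf struct {
	uid     string
	started int64
}

// less defines a total order: newest-first, ties broken by UID.
func less(a, b wf) bool {
	if a.started != b.started {
		return a.started > b.started
	}
	return a.uid < b.uid
}

// nextPage returns one page plus the cursor for the next call. Each source is
// assumed pre-sorted by `less`; we take at most `limit` not-yet-seen items
// strictly after the cursor from each source, merge, and keep the top `limit`.
// A nil cursor means "first page".
func nextPage(live, archived []wf, cursor *wf, limit int) (page []wf, next *wf) {
	seen := map[string]bool{}
	var cand []wf
	for _, src := range [][]wf{live, archived} {
		taken := 0
		for _, w := range src {
			if cursor != nil && !less(*cursor, w) {
				continue // not yet past the cursor
			}
			if seen[w.uid] {
				continue // duplicate across sources
			}
			seen[w.uid] = true
			cand = append(cand, w)
			taken++
			if taken == limit {
				break
			}
		}
	}
	sort.Slice(cand, func(i, j int) bool { return less(cand[i], cand[j]) })
	if len(cand) > limit {
		cand = cand[:limit]
	}
	if len(cand) > 0 {
		last := cand[len(cand)-1]
		next = &last
	}
	return cand, next
}

func main() {
	live := []wf{{"a", 100}, {"c", 80}}
	archived := []wf{{"b", 90}, {"c", 80}, {"d", 70}}
	p1, cur := nextPage(live, archived, nil, 2)
	p2, _ := nextPage(live, archived, cur, 2)
	for _, w := range append(p1, p2...) {
		fmt.Println(w.uid) // a, b, c, d across the two pages
	}
}
```

The key property is that the cursor (started time plus UID as a tiebreaker) identifies an exact position in a total order, so no item is skipped or repeated between pages.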
It is not only breaking and making the UI unusable on v3.5.0, it is also crashing badly with OOM on a pod deployed with 3200 MiB guaranteed, with a lot of archived workflows (PostgreSQL) and few in etcd (6-hour TTL on workflows, with working GC). The issue is on the main view where all workflows are listed.

Also, alongside this pagination work, it would be useful to change the default time range shown. Currently it is one month, but a default of 1 or 2 days would probably be better, to relieve the argo-server list-workflows call. This, together with the "show archived" flag you are discussing, would help a lot.
@sunyeongchoi That might be a good optimization for the third option I listed in #12025 (comment). It could help a lot when there are a lot more archived workflows vs live workflows. Although we still need to fetch the entire list of live workflows on all pages.
@Guillermogsjc Thanks for the suggestion. However, we still need to make the same list query in the backend and then filter in the front-end.
With the unified workflow list API for both live + archived workflows we have to do the following:
Number 1 is a challenge because performing a LIST of all workflows for every request will be taxing on the K8s API server as users use the UI. Listing all workflows for the purpose of only showing 20 is extremely inefficient. To get around this, I propose a technique we use in Argo CD: since LIST all workflows is both the requirement and the problem, we can use an informer cache in the Argo API server to avoid the additional LIST calls to K8s and have all workflows ready to be returned by the API server from the in-memory informer cache.

When the API server returns a unified list of live + archived workflows, it would call List() against the informer cache rather than LIST against the K8s API, and then filter/merge/sort before returning results back to the caller.

Note that this does balloon the memory requirements of argo-server because of the informer cache. But I feel this is manageable with an archive policy and GC, and the benefits of a unified live + archive API, as well as reducing load on the K8s API server, outweigh the extra memory requirements of argo-server.

If we are worried about expensive processing/sorting of the results of List() calls we make against the informer cache, we could consider maintaining our own ordered workflow list (by creationTimestamp) that is automatically updated with UpdateFunc, AddFunc, and DeleteFunc registered on the informer. But I consider this an optimization.
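The handler-driven cache could be sketched roughly as follows; this is a simplified in-memory stand-in (hypothetical names, no client-go) where add/update/delete handlers keep a UID-indexed map current, sorting is done at List() time, and List() never touches the K8s API:

```go
package main

import (
	"fmt"
	"sort"
)

// wf is a hypothetical minimal workflow record.
type wf struct {
	uid     string
	created int64 // creationTimestamp as unix seconds
}

// orderedCache is a toy stand-in for an informer-backed cache: the event
// handlers keep a UID-indexed map current, so List() never calls the K8s API.
type orderedCache struct {
	byUID map[string]wf
}

func newOrderedCache() *orderedCache {
	return &orderedCache{byUID: map[string]wf{}}
}

// OnAdd/OnUpdate/OnDelete mirror the AddFunc/UpdateFunc/DeleteFunc handlers
// that would be registered on a real informer.
func (c *orderedCache) OnAdd(w wf)    { c.byUID[w.uid] = w }
func (c *orderedCache) OnUpdate(w wf) { c.byUID[w.uid] = w }
func (c *orderedCache) OnDelete(w wf) { delete(c.byUID, w.uid) }

// List returns all cached workflows newest-first. Maintaining a pre-sorted
// structure instead of sorting here is the further optimization mentioned
// in the comment above.
func (c *orderedCache) List() []wf {
	out := make([]wf, 0, len(c.byUID))
	for _, w := range c.byUID {
		out = append(out, w)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].created > out[j].created })
	return out
}

func main() {
	c := newOrderedCache()
	c.OnAdd(wf{"a", 1})
	c.OnAdd(wf{"b", 3})
	c.OnAdd(wf{"c", 2})
	c.OnDelete(wf{"a", 1})
	for _, w := range c.List() {
		fmt.Println(w.uid) // b then c
	}
}
```

A real implementation would also need locking, since informer handlers and API requests run concurrently.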
Would a daily partition label on the workflow objects in etcd be a mad idea?
Partitioning indexes to allow performant queries on database objects is often the way to go to avoid devastating monolithic queries, like the one you describe against the k8s API/etcd, which does not support sort-and-limit by creation timestamp. With a daily partition label, the Argo server could target only the relevant partitions when filtering against the k8s API. In any case, the problem must be that monolithic query against the backend database: if it brings back everything instead of filtering, the load on the index page can be enormous. In our case at least (following the documented good practice of short TTLs for workflow objects), where we have tons and tons of archived workflows in PostgreSQL and only the last 6 hours of workflows in etcd, it is clear that the described full query against the database backend is the killer.
Thank you all for suggesting so many good ideas. I will start with optimizing Archived Workflows first.
After that, I will investigate the informer cache :)
|
@sunyeongchoi Great! Let me know if you need any help. |
hi @jiachengxu, on v5 I'm still seeing this bug about missing workflows (the workflows are not badly ordered, they are just not appearing anywhere); there are plenty of archived workflows that should appear in this picture with less than 7 days of age:
@Guillermogsjc This is weird, I couldn't reproduce the issue on my side. Have you seen this issue with |
@jiachengxu it is an issue that has been around all the time in all fixes :( |
@Guillermogsjc to clarify, that issue doesn't exist in 3.5.5, right?

The gap with multi-day-old Workflows sounds like the initial list. If there is a gap with Workflows that ran after you started the Server with the fix, that is potentially more concerning.

EDIT: the gap is with archived workflows, not live workflows; see below
@agilgur5 I can confirm that this behaviour is not seen, at least on our deployment, in v3.5.5. v3.5.5 shows all the workflows from both sources. The SQLite versions look like they're missing a big number of workflows from the archive (etcd workflows appear normally).
Also, to confirm, have you seen any OOMs or memory "exploding" (as you mentioned previously) in
Oh from archived. Hmm ok, that would at least be easier to fix. That would suggest the cursor querying might be off |
@agilgur5 @Guillermogsjc I finally reproduced the issue, and it is indeed related to the cursor querying. The original reason that I used the oldest record as the cursor no longer applies. I pushed a new image that contains the above fix/improvement.
This partitioning looks optimal. On v6 all the archived workflows are there and look fine. It uses plenty of memory when paginating, but it seems bounded somehow, so it would be tweakable per deployment (the important point is to have bounded memory usage on these operations).

Just to note: the v6 behaviour shows all non-archived workflows first, then archived ones, so this can feel weird if you have very old etcd workflows (mostly because the TTL worker errored deleting them, I guess), but this is probably not a main concern.
@Guillermogsjc Are you sure about this behaviour? In my case the workflows are sorted by start time per page, and that behaviour is defined by the UI.
How does it compare to the previous version? I expect the memory usage to be lower, since I did the optimization mentioned here and also this PR.
Sure, look at this deployment of v6 in the prod environment, with those non-archived workflows that are many days old. Never mind @jiachengxu, this does not happen in the other environment, so maybe those are corrupt manifests somehow; don't bother with it.
Memory loads were breaking 2700 MiB on older versions; now in our case they stay bounded around 1.6 GB per pod.
@terrytangyuan, @agilgur5 was doing the bulk of the review here; why did it get merged before he gave his final review?
@ryancurrah What bulk review? I don’t see any blocking review comments from @agilgur5 in the PR. |
Typically the reviewer's approval is required before merging a pull request. I'm a bit confused about why this one was merged without Anton's okay, since he did the review. Would you mind elaborating on that decision? |
Please correct me if I missed anything:
If you see any additional issues, feel free to follow up by commenting here or submitting a new issue. |
I was the main analyzer here in the issue, but I haven't had time to give #12736 an in-depth review unfortunately. Getting the approach right here was always my top priority though and I think we ironed the heck out of it. We otherwise don't have firm rules on the process outside of the branch protection requirements (we don't have many Approvers either, see also the Sustainability Effort). So I think Terry made the right call there (and also gave some more time than strictly necessary for others to take a look too, which I'm pretty sure was intentional). Would be great if I could have taken a deeper look too, but unfortunately we have bottlenecks and this is the top P1, so we all have to use our best available judgment often. |
Thank you both for helping me gain a clearer understanding of the current review process. |
…Fixes argoproj#12025 (argoproj#12736)" This reverts commit f1ab5aa. Signed-off-by: Alan Clucas <alan@clucas.org>
This fix has been reverted in #13018, so re-opening this until a new PR is merged. |
Pre-requisites
What happened/what you expected to happen?
We had >200,000 rows in the workflow archive table, and when trying to view the new combined workflow/archived workflow list page in the UI, the server times out
Scanning the code, it looks like the `LoadWorkflows` code loads all rows from the archive table, combines them with the k8s results, and then applies sorting and limiting.

As a workaround, we've reduced the archive TTL from 14 days to 1 day, and the endpoint now responds before timing out, but is still pretty slow.
Version
v3.5.0
--- edits below by agilgur5 to add updates since this is a (very) popular issue ---
Updates