Performance enhancements #599

Open
6 of 12 tasks
praveenrewar opened this issue Sep 6, 2022 · 6 comments · Fixed by #659
Assignees
Labels
  • carvel accepted: This issue should be considered for future work and the triage process has been completed
  • enhancement: This issue is a feature request
  • priority/important-soon: Must be staffed and worked on currently or soon.

Comments

@praveenrewar
Member

praveenrewar commented Sep 6, 2022

Describe the problem/challenge you have
We rely on list API calls to get information from the cluster, which can put a burden on the API server when the number of objects returned is high. As the number of apps being deployed with kapp grows (for example, with kapp-controller packages), this becomes a problem: past a certain point the time taken to deploy the apps increases even though the CPU and memory of the cluster nodes are not under any pressure.

  • Throttling warnings when multiple kapp apps are being used at the same time.
  • `socket: too many open files` errors when ulimit is set to a low number (256)

Describe the solution you'd like
We need to minimise the number of list calls as much as possible (replacing them with get or watch calls is also an option).

Tasks

  • Instead of trying to get all the server resources, get only the ones related to the available GKs (we get rid of the others later anyway, so they are never used) - Cancelled for the time being

    • Spike: This requires more code changes and makes the code less readable, so we will revisit it later if required.
  • When we deploy an app, we first list the labeled resources (GVs) and then get, one by one, the non-labeled resources that were not found in the first step. When an app is deployed for the first time, the first step would always return nil, so maybe we could skip it? (See the list-then-get sketch after this list.)

    • Spike: How would this fit into the kapp code base? -> PR
  • Use watch instead of get and list while waiting for resources to reconcile. Using watch will be helpful for resources that take more time to reconcile (for example, Deployments), but for resources that reconcile almost immediately (for example, ConfigMaps), it might add some overhead. (See the watch-based wait sketch after this list.)

    • Spike: Test whether tuning the wait check interval helps with this.
      We increased the wait-check-interval to 3s, as it reduces API calls without affecting deployment time much.
      The PR for the same is here and the data collected during the spike can be found here.
    • Spike: Test whether using watch impacts performance. -> Since we have increased the wait-check-interval and it is giving better results, we can look at watch later. Prioritising the other steps for now.
  • When we have a CRD and a CR present in the same manifest, we try to fetch the server resources again to find the CRD (since it wasn't present in the cached server resources). We should somehow avoid doing this, as we won't find the CRD this time either. (No need to work on this if we already work on the first item.)

  • Now that we have added the resource namespaces to fallbackAllowedNamespaces, should we always use fallbackAllowedNamespaces instead of checking resources cluster-wide?

    • Spike: Figure out if scoping to fallbackAllowedNamespaces could have any side effects (testing).
  • Currently we store the unique GKs in the meta ConfigMap and do a list on those GKs. Since list calls are more expensive, we can check whether doing get calls for all the resources is cheaper than list calls for the unique GKs.

  • Improve performance specifically during the diff stage. Go profiling showed too many calls to deepCopy and AsYAMLBytes. (See the profiling sketch after this list.) PR
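
To make the list-then-get flow concrete, here is a minimal client-go sketch of the lookup described in the second task above: one labeled list per resource type, followed by individual get calls for manifest resources the list did not return. This is an illustration only; the function name, the use of the dynamic client, and the parameters are assumptions, not kapp's actual implementation.

```go
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// listLabeledThenGet (hypothetical) first lists resources carrying the app
// label, then falls back to per-resource get calls for anything declared in
// the manifest but not returned by the labeled list.
func listLabeledThenGet(ctx context.Context, dyn dynamic.Interface,
	gvr schema.GroupVersionResource, ns, labelSelector string,
	wanted []string) ([]*unstructured.Unstructured, error) {

	// Step 1: one LIST per resource type, filtered by the app label.
	list, err := dyn.Resource(gvr).Namespace(ns).List(ctx, metav1.ListOptions{
		LabelSelector: labelSelector,
	})
	if err != nil {
		return nil, err
	}
	found := map[string]*unstructured.Unstructured{}
	for i := range list.Items {
		item := list.Items[i]
		found[item.GetName()] = &item
	}

	// Step 2: GET each manifest resource the labeled list did not return.
	var out []*unstructured.Unstructured
	for _, name := range wanted {
		if res, ok := found[name]; ok {
			out = append(out, res)
			continue
		}
		res, err := dyn.Resource(gvr).Namespace(ns).Get(ctx, name, metav1.GetOptions{})
		if errors.IsNotFound(err) {
			continue // not on the cluster yet; it will be created
		}
		if err != nil {
			return nil, err
		}
		out = append(out, res)
	}
	return out, nil
}
```

On a first deploy the labeled list is expected to come back empty, which is why skipping it in that case could save one list call per resource type.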
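
For the watch-based wait, here is a rough sketch of what replacing a get/poll loop with a single watch could look like for a Deployment. Again, this is a generic client-go illustration with an assumed function name and a simplified readiness check, not kapp's wait logic (which handles many resource types and failure conditions).

```go
package example

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentWithWatch (hypothetical) opens a single WATCH scoped to one
// Deployment and blocks until it looks reconciled, instead of polling with GET.
func waitForDeploymentWithWatch(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	w, err := client.AppsV1().Deployments(ns).Watch(ctx, metav1.ListOptions{
		FieldSelector: fields.OneTermEqualSelector("metadata.name", name).String(),
	})
	if err != nil {
		return err
	}
	defer w.Stop()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case ev, ok := <-w.ResultChan():
			if !ok {
				return fmt.Errorf("watch closed before %s/%s reconciled", ns, name)
			}
			dep, ok := ev.Object.(*appsv1.Deployment)
			if !ok {
				continue // bookmarks, status errors, etc.
			}
			// Simplified readiness check: generation observed and all replicas available.
			if dep.Generation <= dep.Status.ObservedGeneration &&
				dep.Spec.Replicas != nil &&
				dep.Status.AvailableReplicas >= *dep.Spec.Replicas {
				return nil
			}
		}
	}
}
```

In practice a long-lived watch also needs to handle channel closure and resume from a resourceVersion (for example via k8s.io/client-go/tools/watch), which is part of why the overhead trade-off for fast-reconciling resources is worth measuring.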
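
Finally, for the diff-stage item, the hotspots were found with Go profiling; one minimal way to capture such a CPU profile around a suspect code path is sketched below. The wrapper and file name are made up for illustration and are not the approach used in the linked PR.

```go
package example

import (
	"os"
	"runtime/pprof"
)

// profileDiff (hypothetical) wraps a hot code path with a CPU profile so that
// heavy callers such as deep copies and YAML serialization show up in
// `go tool pprof`.
func profileDiff(run func() error) error {
	f, err := os.Create("diff.pprof")
	if err != nil {
		return err
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()

	return run() // e.g. the diff calculation under investigation
}
```

Running `go tool pprof diff.pprof` and then `top` lists the most expensive functions in the captured profile.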

Anything else you would like to add:
It might be worth understanding API priority and fairness.


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" at the top right of this comment to vote.

👍 "I would like to see this addressed as soon as possible"
👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

@praveenrewar praveenrewar added enhancement This issue is a feature request carvel triage This issue has not yet been reviewed for validity labels Sep 6, 2022
@100mik
Contributor

100mik commented Sep 6, 2022

Do we have reason to believe that it is the list call adding to the burden rather than get calls in the wait stage?
I believe those would be higher in number.

@100mik
Contributor

100mik commented Sep 6, 2022

Not sure if it will be helpful, but this KEP elaborates on the thought process and goals of API fairness and priority in detail.

@evankanderson

In particular, both list and get calls will be counted against the API fairness and priority budget in a way that watch calls are not (there's a separate budget for those, but the assumption is that they are long-running and the cost of the initial population is amortized over the duration of the watch, possibly in conjunction with the golang informer cache).
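
As a concrete illustration of that amortization (a generic client-go sketch, not something kapp does today): a shared informer pays the initial LIST once, keeps a long-running WATCH open, and serves later reads from its local cache instead of issuing new API calls.

```go
package example

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startDeploymentInformer (hypothetical) starts a shared informer: one initial
// LIST populates a local cache, and a long-running WATCH keeps it up to date,
// so subsequent lookups do not hit the API server at all.
func startDeploymentInformer(ctx context.Context, client kubernetes.Interface) (cache.SharedIndexInformer, error) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second) // resync period
	inf := factory.Apps().V1().Deployments().Informer()

	factory.Start(ctx.Done())
	// Block until the initial LIST has populated the cache.
	if !cache.WaitForCacheSync(ctx.Done(), inf.HasSynced) {
		return nil, fmt.Errorf("informer cache failed to sync")
	}
	return inf, nil
}
```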

@renuy renuy added carvel accepted This issue should be considered for future work and that the triage process has been completed and removed carvel triage This issue has not yet been reviewed for validity labels Sep 12, 2022
@praveenrewar praveenrewar added carvel accepted This issue should be considered for future work and that the triage process has been completed and removed carvel accepted This issue should be considered for future work and that the triage process has been completed labels Sep 12, 2022
@renuy renuy added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Sep 12, 2022
@renuy renuy added priority/important-soon Must be staffed and worked on currently or soon. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Sep 26, 2022
@praveenrewar praveenrewar changed the title Long running: Performance enhancements Performance enhancements Sep 28, 2022
@evankanderson

Is the one change listed here for the initial list the only performance change needed?

Do you need help setting up a test environment?

@github-actions github-actions bot added the carvel triage This issue has not yet been reviewed for validity label Jan 19, 2023
@praveenrewar praveenrewar reopened this Jan 19, 2023
@praveenrewar
Member Author

Hi @evankanderson, I didn't mean to close it, but it got closed along with the PR. We are still working on some of the items from the list (although we are not able to spend many cycles on them). Thank you so much for the help :)

@github-actions

This issue is being marked as stale due to a long period of inactivity and will be closed in 5 days if there is no response.

@github-actions github-actions bot added the stale This issue has had no activity for a while and will be closed soon label Mar 13, 2023
@praveenrewar praveenrewar removed the stale This issue has had no activity for a while and will be closed soon label Mar 13, 2023
@renuy renuy removed the carvel triage This issue has not yet been reviewed for validity label Apr 12, 2023