New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable deferred cluster creation #50785
enable deferred cluster creation #50785
Conversation
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the motivation for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question, when I tried in #49868 I saw no benefits
I am seeing improvement in server initialization times with two static clusters and concurrency of 50 server.initialization_time_ms: P0(nan,100) P25(nan,102.5) P50(nan,105) P75(nan,107.5) P90(nan,109) P95(nan,109.5) P99(nan,109.9) P99.5(nan,109.95) P99.9(nan,109.99) P100(nan,110) server.initialization_time_ms: P0(nan,140) P25(nan,142.5) P50(nan,145) P75(nan,147.5) P90(nan,149) P95(nan,149.5) P99(nan,149.9) P99.5(nan,149.95) P99.9(nan,149.99) P100(nan,150) Will do more tests and report back |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
Here are the results of my test Concurrency 2 Clusters 3600 Without Deferred creation server.memory_allocated: 759852680 server.initialization_time_ms: P0(nan,46000) P25(nan,46250) P50(nan,46500) P75(nan,46750) P90(nan,46900) P95(nan,46950) P99(nan,46990) P99.5(nan,46995) P99.9(nan,46999) P100(nan,47000) With Deferred creation server.memory_allocated: 749908528 server.initialization_time_ms: P0(nan,36000) P25(nan,36250) P50(nan,36500) P75(nan,36750) P90(nan,36900) P95(nan,36950) P99(nan,36990) P99.5(nan,36995) P99.9(nan,36999) P100(nan,37000) Concurrency 10 Clusters 3600 Without Deferred creation server.memory_allocated: 901796592 server.initialization_time_ms: P0(nan,52000) P25(nan,52250) P50(nan,52500) P75(nan,52750) P90(nan,52900) P95(nan,52950) P99(nan,52990) P99.5(nan,52995) P99.9(nan,52999) P100(nan,53000) With server.memory_allocated: 783817120 server.initialization_time_ms: P0(nan,30000) P25(nan,30250) P50(nan,30500) P75(nan,30750) P90(nan,30900) P95(nan,30950) P99(nan,30990) P99.5(nan,30995) P99.9(nan,30999) P100(nan,31000) Summary: We see some improvements both in memory and init times. Out of these 3600 clusters, 400 clusters STRICT_DNS type of clusters and deferred creation does not work for them. |
@howardjohn can you please share /stats?usedonly endpoint output of your test when you saw no difference? Please enable all stats of Envoy. |
please wait if possible. i want to take a look,but i am on holiday this week |
sure it is not urgent |
How does this work in envoy, will it ceate cluster on flight, so will it influence the latency? |
Yes. When a request comes on the worker thread, it creates it inline with the already existing object https://github.com/envoyproxy/envoy/blob/8d4cf00e2980aeba11b9b74230a53f92086419bc/source/common/upstream/cluster_manager_impl.cc#L1355. Should not incur lot of latency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this change - and any 'on demand'/'delayed' or scale improving PR - but absolutely not as a default behavior change.
Add an annotation or option or meta - off by default - and it would be great. Gradually we can make it the default - maybe for waypoints first ( no backward compat risks), some classes of gateways - and maybe even sidecars, but over few releases.
This needs to be gated by a feature flag and reversible. Otherwise, this is probably a good change. |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@kyessenov @costinm added feature flag. PTAL @howardjohn when you get chance, please share stats of your experiment |
I jsut have #49868 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I am fine with turning it on by default with "turn off for compatibility version". Or we can get more testing then flip it
- | | ||
**Added** an experimental feature to enable cluster creation on worker threads inline during requests. | ||
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread. | ||
This can be enabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in Istiod Deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually in the agent not istiod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the feature flag should have its own home not under pilot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes. I forgot just copy pasted some other flag :-)
Then the feature flag should have its own home not under pilot
We will have to do that refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any latency test result
- | | ||
**Added** an experimental feature to enable cluster creation on worker threads inline during requests. | ||
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread. | ||
This can be enabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in Istiod Deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the feature flag should have its own home not under pilot
latency is very negligible based on my measurements |
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
compatibility check added. @costinm can you please review? |
- | | ||
**Added** an experimental feature to enable cluster creation on worker threads inline during requests. | ||
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread. | ||
This can be disabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in agent Deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ENABLE_DEFERRED_CLUSTER_CREATION
starts with enable
, but true
means disable
? This confuses me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry. good point, That needs to change after the default has been changed. fixed now
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
- | | ||
**Added** an experimental feature to enable cluster creation on worker threads inline during requests. | ||
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread. | ||
This can be disabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to false in agent Deployment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need update, now it is not enabled by default
I doubt this conclusion |
Please see what Envoy is doing when creating cluster in line https://github.com/envoyproxy/envoy/blob/8d4cf00e2980aeba11b9b74230a53f92086419bc/source/common/upstream/cluster_manager_impl.cc#L1355 - I do not think it is doing any significant work on a worker. But feel free to measure with your service |
@costinm gentle ping |
Enables deferred cluster creation on worker threads