Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable deferred cluster creation #50785

Merged

Conversation

ramaraochavali
Copy link
Contributor

@ramaraochavali ramaraochavali commented May 1, 2024

Enables deferred cluster creation on worker threads

  • Ambient
  • Configuration Infrastructure
  • Docs
  • Dual Stack
  • Installation
  • Networking
  • Performance and Scalability
  • Policies and Telemetry
  • Security
  • Test and Release
  • User Experience
  • Developer Infrastructure

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali requested a review from a team as a code owner May 1, 2024 10:40
@istio-policy-bot istio-policy-bot added the release-notes-none Indicates a PR that does not require release notes. label May 1, 2024
@istio-testing istio-testing added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label May 1, 2024
Copy link
Contributor

@keithmattix keithmattix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this?

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question, when I tried in #49868 I saw no benefits

@ramaraochavali
Copy link
Contributor Author

I am seeing improvement in server initialization times with two static clusters and concurrency of 50

server.initialization_time_ms: P0(nan,100) P25(nan,102.5) P50(nan,105) P75(nan,107.5) P90(nan,109) P95(nan,109.5) P99(nan,109.9) P99.5(nan,109.95) P99.9(nan,109.99) P100(nan,110)

server.initialization_time_ms: P0(nan,140) P25(nan,142.5) P50(nan,145) P75(nan,147.5) P90(nan,149) P95(nan,149.5) P99(nan,149.9) P99.5(nan,149.95) P99.9(nan,149.99) P100(nan,150)

Will do more tests and report back

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@istio-testing istio-testing added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels May 2, 2024
@ramaraochavali
Copy link
Contributor Author

ramaraochavali commented May 2, 2024

Here are the results of my test

Concurrency 2 Clusters 3600

Without Deferred creation

server.memory_allocated: 759852680
server.memory_heap_size: 1067294720
server.memory_physical_size: 834822144

server.initialization_time_ms: P0(nan,46000) P25(nan,46250) P50(nan,46500) P75(nan,46750) P90(nan,46900) P95(nan,46950) P99(nan,46990) P99.5(nan,46995) P99.9(nan,46999) P100(nan,47000)

With Deferred creation

server.memory_allocated: 749908528
server.memory_heap_size: 1056219136
server.memory_physical_size: 853909504

server.initialization_time_ms: P0(nan,36000) P25(nan,36250) P50(nan,36500) P75(nan,36750) P90(nan,36900) P95(nan,36950) P99(nan,36990) P99.5(nan,36995) P99.9(nan,36999) P100(nan,37000)

Concurrency 10 Clusters 3600

Without Deferred creation

server.memory_allocated: 901796592
server.memory_heap_size: 1266057216
server.memory_physical_size: 981704704

server.initialization_time_ms: P0(nan,52000) P25(nan,52250) P50(nan,52500) P75(nan,52750) P90(nan,52900) P95(nan,52950) P99(nan,52990) P99.5(nan,52995) P99.9(nan,52999) P100(nan,53000)

With

server.memory_allocated: 783817120
server.memory_heap_size: 1104920576
server.memory_physical_size: 858816512

server.initialization_time_ms: P0(nan,30000) P25(nan,30250) P50(nan,30500) P75(nan,30750) P90(nan,30900) P95(nan,30950) P99(nan,30990) P99.5(nan,30995) P99.9(nan,30999) P100(nan,31000)

Summary:

We see some improvements both in memory and init times. Out of these 3600 clusters, 400 clusters STRICT_DNS type of clusters and deferred creation does not work for them.

@ramaraochavali
Copy link
Contributor Author

@howardjohn can you please share /stats?usedonly endpoint output of your test when you saw no difference? Please enable all stats of Envoy.

@hzxuzhonghu
Copy link
Member

please wait if possible. i want to take a look,but i am on holiday this week

@ramaraochavali
Copy link
Contributor Author

sure it is not urgent

@ramaraochavali ramaraochavali added the do-not-merge Block automatic merging of a PR. label May 2, 2024
@hzxuzhonghu
Copy link
Member

How does this work in envoy, will it ceate cluster on flight, so will it influence the latency?

@ramaraochavali
Copy link
Contributor Author

Yes. When a request comes on the worker thread, it creates it inline with the already existing object https://github.com/envoyproxy/envoy/blob/8d4cf00e2980aeba11b9b74230a53f92086419bc/source/common/upstream/cluster_manager_impl.cc#L1355. Should not incur lot of latency

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this change - and any 'on demand'/'delayed' or scale improving PR - but absolutely not as a default behavior change.

Add an annotation or option or meta - off by default - and it would be great. Gradually we can make it the default - maybe for waypoints first ( no backward compat risks), some classes of gateways - and maybe even sidecars, but over few releases.

@kyessenov
Copy link
Contributor

This needs to be gated by a feature flag and reversible. Otherwise, this is probably a good change.

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@istio-testing istio-testing added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 9, 2024
Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@kyessenov @costinm added feature flag. PTAL

@howardjohn when you get chance, please share stats of your experiment

@howardjohn
Copy link
Member

I jsut have #49868 (comment)

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I am fine with turning it on by default with "turn off for compatibility version". Or we can get more testing then flip it

- |
**Added** an experimental feature to enable cluster creation on worker threads inline during requests.
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.
This can be enabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in Istiod Deployment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually in the agent not istiod?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the feature flag should have its own home not under pilot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes. I forgot just copy pasted some other flag :-)

Then the feature flag should have its own home not under pilot

We will have to do that refactor

Copy link
Member

@hzxuzhonghu hzxuzhonghu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any latency test result

- |
**Added** an experimental feature to enable cluster creation on worker threads inline during requests.
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.
This can be enabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in Istiod Deployment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the feature flag should have its own home not under pilot

@ramaraochavali
Copy link
Contributor Author

any latency test result

latency is very negligible based on my measurements

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
@ramaraochavali ramaraochavali requested a review from a team as a code owner May 10, 2024 04:56
@ramaraochavali
Copy link
Contributor Author

compatibility check added.

@costinm can you please review?

- |
**Added** an experimental feature to enable cluster creation on worker threads inline during requests.
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.
This can be disabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to true in agent Deployment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ENABLE_DEFERRED_CLUSTER_CREATION starts with enable, but true means disable? This confuses me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry. good point, That needs to change after the default has been changed. fixed now

Signed-off-by: Rama Chavali <rama.rao@salesforce.com>
- |
**Added** an experimental feature to enable cluster creation on worker threads inline during requests.
This will save memory and CPU cycles in cases where there are lots of inactive clusters and > 1 worker thread.
This can be disabled by setting ENABLE_DEFERRED_CLUSTER_CREATION to false in agent Deployment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need update, now it is not enabled by default

@hzxuzhonghu
Copy link
Member

latency is very negligible based on my measurements

I doubt this conclusion

@ramaraochavali
Copy link
Contributor Author

Please see what Envoy is doing when creating cluster in line https://github.com/envoyproxy/envoy/blob/8d4cf00e2980aeba11b9b74230a53f92086419bc/source/common/upstream/cluster_manager_impl.cc#L1355 - I do not think it is doing any significant work on a worker. But feel free to measure with your service

@ramaraochavali ramaraochavali removed the do-not-merge Block automatic merging of a PR. label May 14, 2024
@ramaraochavali
Copy link
Contributor Author

@costinm gentle ping

@istio-testing istio-testing merged commit 1e73925 into istio:master May 14, 2024
28 checks passed
@ramaraochavali ramaraochavali deleted the fix/enable_deferred_cluster branch May 15, 2024 05:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes-none Indicates a PR that does not require release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants