
optimize/operator: the server side actively detects whether grpc connection is available #3890

Merged: 15 commits into dapr:master on Nov 29, 2021

Conversation

1046102779 (Member)

When a large number of pods are destroyed frequently, the server side needs to actively detect whether the connection is available to prevent the leakage of operator channel resources.

#3752
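
For context, the pattern this PR targets can be illustrated with the minimal, self-contained Go sketch below. It is not the actual code in pkg/operator/api/api.go, and the names (connectionPool, serveUpdates) are hypothetical; it only shows the idea of keeping one update channel per connected sidecar and removing it as soon as the connection's context ends, instead of letting stale channels accumulate when pods are destroyed.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// connectionPool keeps one update channel per connected sidecar
// (hypothetical names; not the actual operator code).
type connectionPool struct {
	mu    sync.Mutex
	chans map[string]chan string
}

func (p *connectionPool) add(id string) chan string {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan string, 1)
	p.chans[id] = ch
	return ch
}

func (p *connectionPool) remove(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.chans, id)
}

func (p *connectionPool) size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.chans)
}

// serveUpdates mimics a streaming RPC handler: ctx plays the role of the
// stream context that gRPC cancels when the client (the sidecar in a
// destroyed pod) disconnects.
func (p *connectionPool) serveUpdates(ctx context.Context, id string) {
	updates := p.add(id)
	defer p.remove(id) // active cleanup: the entry never outlives the connection

	for {
		select {
		case <-ctx.Done():
			return
		case u := <-updates:
			fmt.Printf("sent update %q to %s\n", u, id)
		}
	}
}

func main() {
	pool := &connectionPool{chans: map[string]chan string{}}

	ctx, cancel := context.WithCancel(context.Background())
	go pool.serveUpdates(ctx, "sidecar-1")

	time.Sleep(100 * time.Millisecond)
	cancel() // simulate the pod being destroyed; the pool entry is removed promptly
	time.Sleep(100 * time.Millisecond)

	fmt.Println("remaining channels:", pool.size())
}
```

In gRPC-Go, a streaming handler's stream context is canceled when the client disconnects, which is what makes this kind of active, server-side cleanup possible.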

1046102779 changed the title from "optimize/operator: the server side actively detects whether grpc conn…" to "optimize/operator: the server side actively detects whether grpc connection is available" on Nov 12, 2021

codecov bot commented Nov 12, 2021

Codecov Report

Merging #3890 (d8e5092) into master (1549dc7) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #3890      +/-   ##
==========================================
- Coverage   62.40%   62.37%   -0.04%     
==========================================
  Files         101      101              
  Lines        9510     9515       +5     
==========================================
  Hits         5935     5935              
- Misses       3109     3114       +5     
  Partials      466      466              
Impacted Files            Coverage Δ
pkg/operator/api/api.go   18.54% <0.00%> (-0.78%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1549dc7...d8e5092.

1046102779 (Member, Author)

/cc @artursouza @yaron2

darren-wang

@1046102779 Hi, we are trying to use Dapr in our company. Do you have an idea which version is planned to include this patch, and when it will be released?

yaron2 commented Nov 15, 2021

If this goes in, it will be part of 1.6 and released in December.

1046102779 (Member, Author)

@yaron2 @artursouza An interested company is willing to use Dapr; can you help get this PR into the next version?

The review discussion below is on this hunk in pkg/operator/api/api.go:

```go
		Component: b,
	})
	if err != nil {
		log.Warnf("error updating sidecar with component %s (%s): %s", c.GetName(), c.Spec.Type, err)
```
A reviewer (Member) commented:

I think debug level is enough to ensure useful logs are not overwhelmed.

1046102779 (Member, Author) replied:

OK, I'll change it right away.

At present, the Dapr ecosystem libraries use various log levels. Is there any specification for this? I can take a unified look.
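
For reference, the change discussed in this thread (downgrading the log line in the hunk above from warn to debug) would look roughly like the following. This is a sketch of the suggestion, not the exact final diff, and it assumes the operator's logger exposes a Debugf method:

```go
// Downgrade the per-sidecar send failure from Warn to Debug so routine
// disconnects of destroyed pods do not drown out more useful operator logs.
log.Debugf("error updating sidecar with component %s (%s): %s", c.GetName(), c.Spec.Type, err)
```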

darren-wang mentioned this pull request on Nov 26, 2021.

darren-wang

Can we fix this in 1.5.1? Our adoption process is stuck and 1.6 in Jan 2022 is just too far away.

darren-wang

@1046102779 Hi, can you merge this fix into release-1.5, according to #3939?

yaron2 commented Nov 26, 2021

Can you explain exactly what you are seeing and when, and why this is a blocker for you?

darren-wang

In our company we have to run tests to check availability and stability before adopting open-source infrastructure like Dapr.

As replied in #3752, I tested dapr-1.4.2 by deleting a fixed number of app pods and restarting them every 30 minutes over 48 hours, to mimic the production situation where pods are created and restarted continuously.

I found that the memory used by the sidecar-injectors stayed steady, but the operators' memory grew the whole time until it hit the allocated limit. After reaching the limit, the operator's memory dropped drastically, then grew again, and this cycle repeated over and over.

Since the total count of running app pods was fixed the whole time and pods were only recreated during the test, I suspected a resource leak, so I searched the issues and found #3752 by @1046102779. Thanks to @1046102779, #3752 was merged and released in 1.5.0, but I found it does not fully solve the problem, so @1046102779 made this PR.

The cause of this bug is that when app pods restart, the old connections become unavailable and new connections are created. However, the operator does not actively close the unavailable channels and reclaim the unused memory until the memory limit is reached.

#3752 mitigates the problem by closing unused channels when components are created or updated. This PR solves it fully by actively closing unused channels on the fly.

We have confirmed there is a resource leak, but we are not clear about the consequences of the operators' drastic memory reclamation after hitting the memory limit (the operators seem to keep working after the memory is reclaimed, but we do not think this is normal). I cannot adopt 1.4.2 or 1.5.0 with known issues, only to wait for a final fix and then test the newly released version to see whether the leak is finally resolved.

To be honest, we are Java engineers and not very familiar with Go's memory management. In our experience, a memory leak can lead to serious consequences, so we are humbly asking the community for a reply. If it is common practice to leave reclamation to the GC with such drastic memory churn, we will re-examine the test results.

yaron2 commented Nov 29, 2021

Thanks @darren-wang for explaining.

Based on this being a regression, I recommend and support cherry-picking this into the upcoming hotfix release.

cc @artursouza @berndverst

artursouza merged commit 44cdddb into dapr:master on Nov 29, 2021
artursouza modified the milestone: v1.6 on Nov 29, 2021
berndverst pushed a commit to berndverst/dapr that referenced this pull request on Nov 29, 2021:

optimize/operator: the server side actively detects whether grpc connection is available (dapr#3890)

Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>

artursouza added a commit that referenced this pull request on Nov 29, 2021:

optimize/operator: the server side actively detects whether grpc connection is available (#3890) (#3963)

Co-authored-by: yellow chicks <seachen@tencent.com>
Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>

x-shadow-man pushed a commit to x-shadow-man/dapr that referenced this pull request on Jan 4, 2022:

optimize/operator: the server side actively detects whether grpc connection is available (dapr#3890)

Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
Signed-off-by: x-shadow-man <1494445739@qq.com>