
optimize/operator: the server side actively detects whether grpc connection is available #3890

Merged: 15 commits into dapr:master on Nov 29, 2021

Conversation

1046102779 (Member)

When a large number of pods are destroyed frequently, the server side needs to actively detect whether the connection is available to prevent the leakage of operator channel resources.

#3752
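
For context, the pattern this PR targets can be illustrated with the minimal, self-contained Go sketch below. It is not the actual code in pkg/operator/api/api.go, and the names (connectionPool, serveUpdates) are hypothetical; it only shows the idea of keeping one update channel per connected sidecar and removing it as soon as the connection's context ends, instead of letting stale channels accumulate when pods are destroyed.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// connectionPool keeps one update channel per connected sidecar
// (hypothetical names; not the actual operator code).
type connectionPool struct {
	mu    sync.Mutex
	chans map[string]chan string
}

func (p *connectionPool) add(id string) chan string {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan string, 1)
	p.chans[id] = ch
	return ch
}

func (p *connectionPool) remove(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	delete(p.chans, id)
}

func (p *connectionPool) size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.chans)
}

// serveUpdates mimics a streaming RPC handler: ctx plays the role of the
// stream context that gRPC cancels when the client (the sidecar in a
// destroyed pod) disconnects.
func (p *connectionPool) serveUpdates(ctx context.Context, id string) {
	updates := p.add(id)
	defer p.remove(id) // active cleanup: the entry never outlives the connection

	for {
		select {
		case <-ctx.Done():
			return
		case u := <-updates:
			fmt.Printf("sent update %q to %s\n", u, id)
		}
	}
}

func main() {
	pool := &connectionPool{chans: map[string]chan string{}}

	ctx, cancel := context.WithCancel(context.Background())
	go pool.serveUpdates(ctx, "sidecar-1")

	time.Sleep(100 * time.Millisecond)
	cancel() // simulate the pod being destroyed; the pool entry is removed promptly
	time.Sleep(100 * time.Millisecond)

	fmt.Println("remaining channels:", pool.size())
}
```

In gRPC-Go, a streaming handler's stream context is canceled when the client disconnects, which is what makes this kind of active, server-side cleanup possible.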

1046102779 changed the title from "optimize/operator: the server side actively detects whether grpc conn…" to "optimize/operator: the server side actively detects whether grpc connection is available" on Nov 12, 2021

codecov bot commented Nov 12, 2021

Codecov Report

Merging #3890 (d8e5092) into master (1549dc7) will decrease coverage by 0.03%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #3890      +/-   ##
==========================================
- Coverage   62.40%   62.37%   -0.04%     
==========================================
  Files         101      101              
  Lines        9510     9515       +5     
==========================================
  Hits         5935     5935              
- Misses       3109     3114       +5     
  Partials      466      466              
Impacted Files            Coverage Δ
pkg/operator/api/api.go   18.54% <0.00%> (-0.78%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1549dc7...d8e5092.

1046102779 (Member, Author)

/cc @artursouza @yaron2

darren-wang

@1046102779 Hi, we are trying to use Dapr in our company. Do you have an idea which version is planned to include this patch, and when it will be released?

yaron2 commented Nov 15, 2021

If this goes in, it will be part of 1.6 and released in December.

1046102779 (Member, Author)

@yaron2 @artursouza An interested company is willing to use Dapr; can you help get this PR into the next version?

The review discussion below is on this hunk in pkg/operator/api/api.go:

```go
		Component: b,
	})
	if err != nil {
		log.Warnf("error updating sidecar with component %s (%s): %s", c.GetName(), c.Spec.Type, err)
```
A reviewer (Member) commented:

I think debug level is enough to ensure useful logs are not overwhelmed.

1046102779 (Member, Author) replied:

OK, I'll change it right away.

At present, the Dapr ecosystem libraries use various log levels. Is there any specification for this? I can take a unified look.
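
For reference, the change discussed in this thread (downgrading the log line in the hunk above from warn to debug) would look roughly like the following. This is a sketch of the suggestion, not the exact final diff, and it assumes the operator's logger exposes a Debugf method:

```go
// Downgrade the per-sidecar send failure from Warn to Debug so routine
// disconnects of destroyed pods do not drown out more useful operator logs.
log.Debugf("error updating sidecar with component %s (%s): %s", c.GetName(), c.Spec.Type, err)
```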

darren-wang mentioned this pull request on Nov 26, 2021.

darren-wang

Can we fix this in 1.5.1? Our adoption process is stuck and 1.6 in Jan 2022 is just too far away.

darren-wang

@1046102779 Hi, can you merge this fix into release-1.5, according to #3939?

yaron2 commented Nov 26, 2021

Can you explain exactly what you are seeing and when, and why this is a blocker for you?

darren-wang

In our company we have to run tests to check availability and stability before adopting open-source infrastructure like Dapr.

As replied in #3752, I tested dapr-1.4.2 by deleting a fixed number of app pods and restarting them every 30 minutes over 48 hours, to mimic the production situation where pods are created and restarted continuously.

I found that the memory used by the sidecar-injectors stayed steady, but the operators' memory grew the whole time until it hit the allocated limit. After reaching the limit, the operator's memory dropped drastically, then grew again, and this cycle repeated over and over.

Since the total count of running app pods was fixed the whole time and pods were only recreated during the test, I suspected a resource leak, so I searched the issues and found #3752 by @1046102779. Thanks to @1046102779, #3752 was merged and released in 1.5.0, but I found it does not fully solve the problem, so @1046102779 made this PR.

The cause of this bug is that when app pods restart, the old connections become unavailable and new connections are created. However, the operator does not actively close the unavailable channels and reclaim the unused memory until the memory limit is reached.

#3752 mitigates the problem by closing unused channels when components are created or updated. This PR solves it fully by actively closing unused channels on the fly.

We have confirmed there is a resource leak, but we are not clear about the consequences of the operators' drastic memory reclamation after hitting the memory limit (the operators seem to keep working after the memory is reclaimed, but we do not think this is normal). I cannot adopt 1.4.2 or 1.5.0 with known issues, only to wait for a final fix and then test the newly released version to see whether the leak is finally resolved.

To be honest, we are Java engineers and not very familiar with Go's memory management. In our experience, a memory leak can lead to serious consequences, so we are humbly asking the community for a reply. If it is common practice to leave reclamation to the GC with such drastic memory churn, we will re-examine the test results.

yaron2 commented Nov 29, 2021

Thanks @darren-wang for explaining.

Based on this being a regression, I recommend and support cherry-picking this into the upcoming hotfix release.

cc @artursouza @berndverst

artursouza merged commit 44cdddb into dapr:master on Nov 29, 2021
artursouza modified the milestone: v1.6 on Nov 29, 2021
berndverst pushed a commit to berndverst/dapr that referenced this pull request on Nov 29, 2021:

optimize/operator: the server side actively detects whether grpc connection is available (dapr#3890)

Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>

artursouza added a commit that referenced this pull request on Nov 29, 2021:

optimize/operator: the server side actively detects whether grpc connection is available (#3890) (#3963)

Co-authored-by: yellow chicks <seachen@tencent.com>
Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>

x-shadow-man pushed a commit to x-shadow-man/dapr that referenced this pull request on Jan 4, 2022:

optimize/operator: the server side actively detects whether grpc connection is available (dapr#3890)

Co-authored-by: Long Dai <long.dai@intel.com>
Co-authored-by: Artur Souza <artursouza.ms@outlook.com>
Co-authored-by: Dapr Bot <56698301+dapr-bot@users.noreply.github.com>
Signed-off-by: x-shadow-man <1494445739@qq.com>