Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resolved race condition in collector when calling Shutdown #4878

Merged

Conversation

cpheps
Copy link
Contributor

@cpheps cpheps commented Feb 16, 2022

Description:

Link to tracking Issue: Fixes #4843

Testing: Verified existing tests still work. Added unit test to verify calling Shutdown while collector is not running will cause an immediate shutdown when Run is called.

@cpheps cpheps requested a review from a team as a code owner February 16, 2022 19:57
@project-bot project-bot bot added this to In progress in Collector Feb 16, 2022
@codecov
Copy link

codecov bot commented Feb 16, 2022

Codecov Report

Merging #4878 (b8cbfa2) into main (5f2d874) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #4878   +/-   ##
=======================================
  Coverage   91.02%   91.02%           
=======================================
  Files         178      178           
  Lines       10597    10600    +3     
=======================================
+ Hits         9646     9649    +3     
  Misses        734      734           
  Partials      217      217           
Impacted Files Coverage Δ
service/collector.go 75.65% <100.00%> (+0.49%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5f2d874...b8cbfa2. Read the comment docs.

service/collector.go Show resolved Hide resolved
@@ -135,6 +135,53 @@ func TestCollector_Start(t *testing.T) {
}, time.Second*2, time.Millisecond*200)
}

// Calling shutdown before collector is running should not stop it from running
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The comment here should start with the func name, alternatively the comment could be moved inside func

Collector automation moved this from In progress to Review in progress Feb 17, 2022
Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR changes more than needed to fix the problem.

The adding of a new state is completely independent and should be a separate PR.

@cpheps
Copy link
Contributor Author

cpheps commented Feb 17, 2022

I think this PR changes more than needed to fix the problem.
The adding of a new state is completely independent and should be a separate PR.

@bogdandrutu So I added a new state because with out it the Collector is in a Starting state but hasn't actually started. If I call Shutdown before I call Run what should the collector do? If we don't include a new state calling Shutdown before calling Run will prevent the collector from actually starting. It puts it in a weird state where you have a collector that says it's in the Starting state but has no hope of ever running.

You're right the new state isn't technically needed to solve this problem but without a way to tell that the collector has been created and not run ,Shutdown can put the collector in a weird state.

I'm curious if you have another solution to this were we can avoid a new state.

@buptubuntu
Copy link

I think this PR changes more than needed to fix the problem.

The adding of a new state is completely independent and should be a separate PR.

Maybe we change the title of the PR to Add state in collector lifecycle? And the Created can resolve condition race.

@cpheps what do you think?

@cpheps
Copy link
Contributor Author

cpheps commented Feb 18, 2022

@buptubuntu I think that's fair. I think without the new state we trade a race condition a less than ideal behavior pattern for the collector.

@cpheps cpheps changed the title Refactored collector shutdown pattern to remove race condition Add state in collector lifecycle Feb 18, 2022
@cpheps
Copy link
Contributor Author

cpheps commented Feb 23, 2022

@bogdandrutu Curious if I can get your thoughts on my reply and @buptubuntu's suggestion?

@@ -48,6 +48,7 @@ const (
Running
Closing
Closed
Created
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why at the end? I don't think we guarantee the numbers.

service/collector.go Show resolved Hide resolved
@bogdandrutu
Copy link
Member

bogdandrutu commented Feb 26, 2022

If I call Shutdown before I call Run what should the collector do?

@cpheps this is probably the most important question.

I think we should do what the "language pattern" is, so probably we should do exactly what http server does for shutdown https://pkg.go.dev/net/http#Server.Shutdown and document the behavior better.

@cpheps
Copy link
Contributor Author

cpheps commented Feb 26, 2022

I think we should do what the "language pattern" is, so probably we should do exactly what http server does for shutdown https://pkg.go.dev/net/http#Server.Shutdown and document the behavior better.

@bogdandrutu I think that's fair. Looks like if you call shutdown on a server it will not allow it to start or will cause any listen function to immediately close. So for the collector it should do the same. I think this eliminates the need for a new state. I'll remove the created state and rename the PR back to the way it was.

Do you have any concerns on how fast the collector shutdown if Shutdown is called before Run? Seems like it will go through the entire startup process change it's state to Running then shut down. All this should be more or less instant to a user and honestly calling Shutdown pre run is not a normal use case. I think it should be fine but wanted to pose the question.

@cpheps
Copy link
Contributor Author

cpheps commented Feb 28, 2022

@bogdandrutu Removed the extra state. Do you mind taking a re-review?

@cpheps
Copy link
Contributor Author

cpheps commented Feb 28, 2022

Looks like CI failures are unrelated to changes.

service/collector.go Show resolved Hide resolved
Collector automation moved this from Review in progress to Reviewer approved Feb 28, 2022
@cpheps
Copy link
Contributor Author

cpheps commented Mar 1, 2022

Probably in a separate PR, but we should block for shutdown to return when complete correct?

Do you mean block in the shutdown function? I suppose we can, in http.server it waits for all the listeners to close down. If it's called before a Listenandserve type function there are no listeners to wait for so it exits.

For us I think we need to do some refactoring to figure out what shutdown is waiting on and ensure it doesn't block forever if called before Run. I don't think it can be the collector state as that currently won't change if Run is not executed.

Also RUN should probably not start everything if shutdown was called.

I agree on this but I felt like it was out of the scope of the PR as we probably need to restructure the startup logic.

I'll write up issues for these. I think the former needs more discussion.

@bogdandrutu bogdandrutu changed the title Add state in collector lifecycle Resolved race condition in collector when calling Shutdown Mar 1, 2022
@bogdandrutu
Copy link
Member

@cpheps I think your test catch you :)

2022-03-01T17:53:33.131Z	warn	zapgrpc/zapgrpc.go:191	[core] grpc: addrConn.createTransport failed to connect to {locahost:4317 locahost:4317 <nil> <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp: operation was canceled"	{"grpc_log": true}
2022-03-01T17:53:33.133Z	info	service/collector.go:266	Shutdown complete.
    testing.go:1152: race detected during execution of test
--- FAIL: TestCollector_ShutdownBeforeRun (0.24s)
=== RUN   TestCollector_ReportError

@cpheps
Copy link
Contributor Author

cpheps commented Mar 1, 2022

I'll double check. They were running locally even the service one that keeps failing. It might be flaky I'll check.

@cpheps
Copy link
Contributor Author

cpheps commented Mar 1, 2022

Oh I see it's the race detector I'll take a look.

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
Corbin Phelps added 2 commits March 1, 2022 19:33
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
…ine with language standards

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
@cpheps
Copy link
Contributor Author

cpheps commented Mar 2, 2022

@bogdandrutu This is actually interesting as you brought this up on my last PR. The data race is due to access to the logger pointer is not atomic. I wrote up and issue to track the atomic log pointer at #4939.

Would you be opposed to removing the log statement in the recover for shutdown? It's not a user friendly log and doesn't seem very helpful from a user's standpoint since it's referring to an unexported channel.

We could add it back once the log pointer is atomic but I don't think the message is useful enough to block merging this PR.

@bogdandrutu
Copy link
Member

+1 on removing the log

Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
@cpheps cpheps force-pushed the collector-shutdown-refactor branch from 5a0f6e4 to df4dab0 Compare March 2, 2022 12:58
Signed-off-by: Corbin Phelps <corbin.phelps@bluemedora.com>
@bogdandrutu bogdandrutu merged commit 02d0fe5 into open-telemetry:main Mar 2, 2022
@cpheps cpheps deleted the collector-shutdown-refactor branch March 2, 2022 17:58
@Chinwendu20
Copy link
Contributor

Hi @cpheps please I was looking through this and I was wondering how you came up with time in the assert.eventually() function

@cpheps
Copy link
Contributor Author

cpheps commented Oct 11, 2022

@Chinwendu20 the time is a bit arbitrary. Shutdown should be almost instant. I picked 2 seconds because that should be more than enough time for Shutdown to work. It's also short enough where it shutdown fails it does not slow the entire test suit down significantly.

@Chinwendu20
Copy link
Contributor

Thank you so much @cpheps. I see this PR is related to the shutdown not being honored if called right after run. I would like to please know why you called Shutdown() before run in this test case. I see the comment on the code right after, that call but I still do not get it.

image

Thanks a lot

@cpheps
Copy link
Contributor Author

cpheps commented Oct 11, 2022

@Chinwendu20 This specific test case is to ensure if you call Shutdown before Run it means run will exit. Looking at the test I think the col.Shutdown() on line 211 is not suppose to be there. It's not contaminating the test I don't think but might be muddling the intent.

The order of operations should be this:

  1. Create collector
  2. Shutdown collector
  3. Run collector

We expect that collector should exit run within a reasonable amount of time (hence the 2 second timer). Currently though the collector has to run through its entire startup then detect the shutdown and exit. The issue you're working on should make it so as soon as Run is called it should check to see if Shutdown has been called and if so just exit rather than running any logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Collector
  
Reviewer approved
Development

Successfully merging this pull request may close these issues.

Collector Shutdown will not be honored if called to early
5 participants