Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rearchitect status handling #605

Closed
MattBrittan opened this issue Aug 9, 2022 · 5 comments
Closed

Rearchitect status handling #605

MattBrittan opened this issue Aug 9, 2022 · 5 comments

Comments

@MattBrittan
Copy link
Contributor

MattBrittan commented Aug 9, 2022

Currently this package stores the connection status in a uint32 protected by a RWMutex:

status       uint32 // see const definitions at top of file for possible values
sync.RWMutex        // Protects the above two variables (note: atomic writes are also used somewhat inconsistently)

Unfortunately the status is often accessed without locking the Mutex e.g. here. While these accesses are usually via atomic.LoadUint32 they are not particularly threadsafe and the code has been complicated to work around the deficiencies (e.g. here). This is especially apparent with auto reconnection (what happens if Disconnect is called while the system is attempting to reconnect?).

Currently the available statuses are:

const (
	disconnected uint32 = iota
	connecting
	reconnecting
	connected
)

This is fairly limiting. For instance the addition of a disconnecting status would make it obvious that the disconnection process was underway avoiding the need for workarounds.

Whilst I believe that workarounds are in place for all known issues (i.e. it works) the code can be difficult to follow which discourages new contributors (e.g. this PR) and makes checking/testing contributions more difficult then it needs to be.

I have been putting off changing this for some time, due to the potential to introduce subtle bugs, but feel that the time has come where a change is needed so will be submitting a PR shortly. I would expect this change to require a few iterations; the initial change will introduce new status handling code and follow-up PR's will (not necessarily by me) will refactor code to remove workarounds introduced due to the limitations of the old mechanism.

Note: The new code will focus on being readable over performance (for example I'm aiming to remove uses of atomic until benchmarking provides a solid rationale for using them). This change will not change the API and should make no difference to users (unless bugs are introduced!).

MattBrittan added a commit that referenced this issue Aug 10, 2022
This is a major change that moves handling of the status (i.e. disconnected, connecting etc) out to status.go. It also introduces the new status disconnecting.

All existing tests (including 10000 runs through Test_DisconnectWhileProcessingIncomingPublish) pass and I have added new ones to specifically test the connectionStatus functionality. However this is a major change so bugs may have been introduced!

The major aim of this change is to simplify future work, It has been difficult to implement changes/refactor because the status handling was fragile (and not fully thread safe in some instances).
@MattBrittan
Copy link
Contributor Author

Changes have been merged but I'll leave this issue open for any issues arising.

@MattBrittan
Copy link
Contributor Author

I've been running this live in four systems (some with very poor connections) for a couple of weeks without experiencing any issues. Will leave the issue open until after the next release but it appears that the change might be mostly bug free (but I would not be surprised if there is an edge case I have missed).
Pull requests to clean up old work arounds would be welcome (there are a few places where steps were taken to avoid deadlocks that are no longer needed because of the centrally controlled status).

@peterhoneder
Copy link

peterhoneder commented Jan 26, 2023

Hi!

We currently receive sporadic errors of the following kind when subscribing to topics:
"not currently connected and ResumeSubs not set"

The issue here is, if you tcpdump, there is never an interruption or a reconnect or anything happening while this error comes up. ResumeSubs is not set like the error already correctly says and in our case the application does this a lot:

  • subscribe
  • send something
  • wait for an answer on the subscribed topic
  • unsubscribe

Do you know any good ways to debug this further within the library. This is very sporadic and mostly happens on production work load and very infrequently, we still would like to see if we can track it down.

I'm kind of assuming right now, that this might relate to the merged PR from August, since its description fits our symptoms.

Thanks!

@MattBrittan
Copy link
Contributor Author

@peterhoneder looking at the code this can only happen if !c.IsConnectionOpen() and I cannot see any way that can occur unless the connection has dropped. Of course I may be missing something and, as I wrote the code, it would be good to have another set of eyes run through it. Can you please confirm what version you are running? (if anything I would expect the changes to decrease the likelihood of false positives on this).

This would probably be better logged as a separate issue but would require a lot more information to be actionable (ideally code extracts and debug logs). This kind of issue can be very hard to track down (and is next to impossible without debug logs when I cannot duplicate the problem).

@MattBrittan
Copy link
Contributor Author

Closing this issue off as the changes to status handling appear to be working OK (without more details I'm unable to follow up on the one comment above).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants