Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MQTT bindings conformance tests have a ~1% failure rate on Create test #1282

Closed
CodeMonkeyLeet opened this issue Nov 6, 2021 · 1 comment
Assignees
Labels
kind/bug Something isn't working
Milestone

Comments

@CodeMonkeyLeet
Copy link
Contributor

Expected Behavior

MQTT bindings conformance tests should pass reliably.

Actual Behavior

MQTT bindings conformance tests have a ~1% failure rate on Create test with the following failure signature:

=== RUN   TestBindingsConformance/mqtt-emqx/create
time="2021-11-03T23:35:17.345828789Z" level=info msg="Create test running ..." instance=fv-az206-587 scope=bindingsTest type=log ver=unknown
time="2021-11-03T23:35:17.347180905Z" level=debug msg="mqtt publishing topic testBindingTopic with data: [91 123 34 101 118 101 110 116 84 121 112 101 34 58 34 116 101 115 116 34 44 34 101 118 101 110 116 84 105 109 101 34 58 32 34 50 48 49 56 45 48 49 45 50 53 84 50 50 58 49 50 58 49 57 46 52 53 53 54 56 49 49 90 34 44 34 115 117 98 106 101 99 116 34 58 34 100 97 112 114 45 99 111 110 102 45 116 101 115 116 115 34 44 34 105 100 34 58 34 65 50 51 52 45 49 50 51 52 45 49 50 51 52 34 44 34 100 97 116 97 34 58 34 114 111 111 116 47 62 34 125 93]" instance=fv-az206-587 scope=testLogger type=log ver=unknown
    bindings.go:191: 
        	Error Trace:	bindings.go:191
        	Error:      	Received unexpected error:
        	            	mqtt error from publish: write tcp [::1]:32810->[::1]:1884: use of closed network connection
        	Test:       	TestBindingsConformance/mqtt-emqx/create
        	Messages:   	expected no error invoking output binding
time="2021-11-03T23:35:17.348703823Z" level=info msg="Create test done." instance=fv-az206-587 scope=bindingsTest type=log ver=unknown
        --- FAIL: TestBindingsConformance/mqtt-emqx/create (0.00s)

Tracing the MQTT code, this seems internal to the paho.mqtt.golang lib, and may be related to eclipse/paho.mqtt.golang#410. Dapr uses the default AutoReconnect value (true) for the MQTT client configuration and from the traces it looks like MQTT is constantly reconnecting in the background.

In the failure cases, it looks like there is a race condition between the disconnect and when publish is called to queue a new outbound message. The most minimal example I've seen is:

...
[DEBUG] [client]   enter Publish
[DEBUG] [client]   sending publish message, topic: testBindingTopic
[ERROR] [client]   Connect comms goroutine - error triggered EOF
[DEBUG] [net]      obound msg to write 1
[DEBUG] [client]   internalConnLost called
[DEBUG] [client]   stopCommsWorkers called
[ERROR] [net]      outgoing obound reporting error  write tcp 127.0.0.1:42870->127.0.0.1:1884: use of closed network connection
...

And depending on the threading and run:

...
[DEBUG] [client]   enter Publish
[DEBUG] [client]   sending publish message, topic: testBindingTopic
[DEBUG] [net]      obound msg to write 1
[DEBUG] [net]      incoming complete
[DEBUG] [net]      startIncomingComms: got msg on ibound
[DEBUG] [net]      logic waiting for msg on ibound
[DEBUG] [net]      startIncomingComms: ibound complete
[DEBUG] [net]      startIncomingComms goroutine complete
[ERROR] [client]   Connect comms goroutine - error triggered EOF
[DEBUG] [client]   internalConnLost called
[DEBUG] [client]   stopCommsWorkers called
[DEBUG] [router]   matchAndDispatch exiting
[DEBUG] [client]   startCommsWorkers output redirector finished
[DEBUG] [pinger]   keepalive stopped
[DEBUG] [client]   stopCommsWorkers waiting for workers
[DEBUG] [client]   stopCommsWorkers waiting for comms
[DEBUG] [client]   internalConnLost waiting on workers
[DEBUG] [net]      received connack
[DEBUG] [client]   startCommsWorkers called
[DEBUG] [client]   client is connected/reconnected
[DEBUG] [net]      incoming started
[DEBUG] [net]      startIncomingComms started
[DEBUG] [net]      outgoing started
[DEBUG] [net]      startComms started
[DEBUG] [client]   startCommsWorkers done
[DEBUG] [store]    enter Resume
[DEBUG] [store]    exit resume
[DEBUG] [net]      logic waiting for msg on ibound
[DEBUG] [net]      startIncomingComms: inboundFromStore complete
[DEBUG] [net]      logic waiting for msg on ibound
[DEBUG] [net]      outgoing waiting for an outbound message
[DEBUG] [pinger]   keepalive starting
[ERROR] [net]      outgoing obound reporting error  write tcp 127.0.0.1:60916->127.0.0.1:1884: use of closed network connection

Where in success cases, the trace is usually:

...
[DEBUG] [client]   enter Publish
[DEBUG] [client]   sending publish message, topic: testBindingTopic
[DEBUG] [net]      obound msg to write 1
[DEBUG] [net]      obound wrote msg, id: 1
[DEBUG] [net]      outgoing waiting for an outbound message
...

Additional observations:

  • The failure repros if only using the output binding, so does not appear to be an interaction between the input & output binding clients in the same component.
  • The MQTT bindings and pubsub look like they have exactly the same code that invokes publish but there has not been a single instance of the failure signature in the pubsub conformance test. This suggests there's some additional factor to the failure.

Steps to Reproduce the Problem

  1. It is useful to enable paho.mqtt.golang lib's tracing for this investigation, which can be added to the Init() method in mqtt.go:
    // DEBUG: Enable MQTT internal logging
    mqtt.ERROR = log.New(os.Stdout, "[ERROR] ", 0)
    mqtt.CRITICAL = log.New(os.Stdout, "[CRIT] ", 0)
    mqtt.WARN = log.New(os.Stdout, "[WARN]  ", 0)
    mqtt.DEBUG = log.New(os.Stdout, "[DEBUG] ", 0)
  2. The tracing produces a lot of spew and does not distinguish between the input and output binding's mqtt clients, so it also helpful to mock out the Read() method so it's not actually using MQTT (e.g. with a simple timer loop that calls the app's read handler).
  3. Run the mqtt conformance test in a tight loop until failure to capture the trace:
    while go test -v -tags=conftests -count=1 -timeout=1m ./tests/conformance -run=TestBindingsConformance/mqtt.emqx > mqtt-emqx.log; do :; done

Release Note

RELEASE NOTE: FIX MQTT output bindings sporadic failure on Create operation.

@CodeMonkeyLeet CodeMonkeyLeet added the kind/bug Something isn't working label Nov 6, 2021
@artursouza artursouza added this to the v1.6 milestone Nov 19, 2021
@CodeMonkeyLeet
Copy link
Contributor Author

No failures since the patch fix went in, resolving as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants