Malformed memberlist gossip messages sent with >=127 ingesters #4326
Comments
Does the issue still exist if we lower |
Unfortunately it does; you can see in the log that the size of the offending message is only 2201 bytes. The default |
Proposed fix: hashicorp/memberlist#239
I'm not running 127 ingesters, I'm running 75, but I have a lot of other nodes in the memberlist, more than 127 total. Similar errors in the logs:
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791965958Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791980313Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791992495Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (119) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792003049Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (100) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792013388Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (98) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.79202562Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (114) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792059365Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792074547Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (32) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792087473Z"}
Good point - I was only testing with ingesters, but it's actually related to the number of memberlist members, not ring members.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
Describe the bug
When starting >= 127 ingesters, the following message will appear in the logs of other ingesters:
To Reproduce
Steps to reproduce the behavior:
-ring.store=memberlist
(9aa910f)
Expected behavior
The log message should not appear, indicating that malformed packets are not being sent.
Environment:
Storage Engine
Additional Context
The messages coincide precisely with the data sent when the 127th ingester is started up. With transport logging enabled, we can see the data being received which causes the errors. It's interesting to note that lots of errors appear to originate from parsing a single packet.
full output
We can trace back to see that the bad packet is sent out on startup of ingester 127. (Note that it is also sent to five other instances, all of which report the same messages as above, but the:
The packet is sent 177ms after the "joined memberlist cluster" message, which is close to the default gossip interval of 200ms. This indicates that the first gossip attempt is sending the bad packet. Looking at the gossip function, we see that a number of pending broadcasts are obtained and combined into a single "compound" message. The function that does this, however, has an unchecked limit: it can only encode 255 messages.
Presumably this is normally fine, but because we increase the UDPBufferSize drastically, we can fit well over 255 messages in a packet. I suspect this is the problem: something about 127 instances means that more than 255 broadcasts end up being queued.