
Malformed memberlist gossip messages sent with >=127 ingesters #4326

Closed
1 of 2 tasks
stevesg opened this issue Jun 29, 2021 · 6 comments

stevesg commented Jun 29, 2021

Describe the bug
When starting >= 127 ingesters, the following message will appear in the logs of other ingesters:

ts=2021-06-25T18:05:35.393200406Z caller=memberlist_logger.go:74 level=error msg="Failed to decode ping request: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2) from=10.128.0.82:12127"
ts=2021-06-25T18:05:35.393225486Z caller=memberlist_logger.go:74 level=error msg="msg type (89) not supported from=10.128.0.82:12127"

To Reproduce
Steps to reproduce the behavior:

  1. Start 127 ingesters with -ring.store=memberlist (9aa910f)
  2. Witness the log message in several of the first 126 ingesters

Expected behavior
The log message should not appear, indicating that malformed packets are not being sent.

Environment:

  • Infrastructure: VM w/ 64 vCPU
  • Deployment tool: Scripted test case

Storage Engine

  • Blocks
  • Chunks

Additional Context

The messages coincide precisely with the data sent when the 127th ingester is started up. With transport logging enabled, we can see the received data that causes the errors. It's interesting to note that many of the errors appear to originate from parsing a single packet.

level=debug ts=2021-06-25T18:05:35.392866804Z caller=tcp_transport.go:234 msg="TCPTransport: New connection" addr=10.128.0.82:54624
level=debug ts=2021-06-25T18:05:35.392946734Z caller=tcp_transport.go:302 msg="TCPTransport: Received packet" addr=10.128.0.82:12127 size=2201 hash=50add2a5ede947ef1c0a78da3d67b9ea
ts=2021-06-25T18:05:35.393200406Z caller=memberlist_logger.go:74 level=error msg="Failed to decode ping request: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2) from=10.128.0.82:12127"
ts=2021-06-25T18:05:35.393225486Z caller=memberlist_logger.go:74 level=error msg="msg type (89) not supported from=10.128.0.82:12127"
ts=2021-06-25T18:05:35.39324799Z caller=memberlist_logger.go:74 level=error msg="Failed to decode ping request: codec.decoder: Only encoded map or array can be decoded into a struct. (valueType: 2) from=10.128.0.82:12127"
..

full output

We can trace back to see that the bad packet is sent out on startup of ingester 127 (note that it is also sent to five other instances, all of which report the same messages as above):

level=info ts=2021-06-25T18:05:35.01401412Z caller=memberlist_client.go:512 msg="joined memberlist cluster" reached_nodes=1
stderr.cortex-127\055:level=debug ts=2021-06-25T18:05:35.191879946Z caller=tcp_transport.go:495 msg="WriteTo: packet sent" addr=10.128.0.82:12110 size=2201 hash=50add2a5ede947ef1c0a78da3d67b9ea
..

The packet is sent 177ms after the "joined memberlist cluster" message, which is close to the default gossip interval of 200ms. This would indicate that the first gossip attempt is sending this packet. Looking at the gossip function, we see that a number of pending broadcasts are obtained and combined into a single "compound" message. The function that does this, however, has an unchecked limit of 255 messages per compound message.

Presumably this is normally fine, but because we increase the UDPBufferSize drastically, we can fit well over 255 messages in a packet. I suspect this is the problem: something about running 127 instances means that more than 255 broadcasts end up queued at once.
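
To illustrate the failure mode, here is a rough, self-contained Go sketch of the compound-message layout described above (not the actual memberlist code; the function name and the type value are made up). Because the message count is written as a single byte, queueing more than 255 broadcasts wraps the declared count, so a receiver stops reading length entries too early and interprets the leftover length bytes as message payloads, which lines up with the "msg type (...) not supported" and decode errors above.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// makeCompound mimics the compound-message layout discussed above:
// [type byte][count byte][uint16 length per message][message bytes...].
// The count is a single byte, so anything above 255 silently wraps.
// (Illustrative only; the name and type value are not taken from memberlist.)
func makeCompound(msgs [][]byte) *bytes.Buffer {
	buf := bytes.NewBuffer(nil)
	buf.WriteByte(7)                // placeholder "compound" message type
	buf.WriteByte(uint8(len(msgs))) // wraps around when len(msgs) > 255
	for _, m := range msgs {
		binary.Write(buf, binary.BigEndian, uint16(len(m)))
	}
	for _, m := range msgs {
		buf.Write(m)
	}
	return buf
}

func main() {
	// 300 one-byte broadcasts: the count byte records 300 % 256 = 44, so a
	// receiver would read only 44 length entries and then misinterpret the
	// remaining length bytes as message payloads.
	msgs := make([][]byte, 300)
	for i := range msgs {
		msgs[i] = []byte{0x01}
	}
	buf := makeCompound(msgs)
	fmt.Printf("declared count: %d, actual messages: %d, packet size: %d bytes\n",
		buf.Bytes()[1], len(msgs), buf.Len())
}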

@pracucci

Does the issue still exist if we lower UDPBufferSize?

stevesg commented Jun 29, 2021

Unfortunately it does; you can see in the log that the offending message is only 2201 bytes. The default UDPBufferSize is 1400, presumably to avoid fragmentation with a standard ~1500-byte MTU.
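
As a side note, the default can be checked straight from memberlist's config. This is just a small sketch; the 10 MiB override below is an arbitrary illustrative value rather than the exact number Cortex uses:

package main

import (
	"fmt"

	"github.com/hashicorp/memberlist"
)

func main() {
	cfg := memberlist.DefaultLANConfig()
	fmt.Println("default UDPBufferSize:", cfg.UDPBufferSize) // 1400 by default

	// Raising the buffer (as Cortex does) lets many more broadcasts be packed
	// into one gossip "packet", which is how more than 255 messages can end up
	// in a single compound message. The value below is illustrative only.
	cfg.UDPBufferSize = 10 * 1024 * 1024
	fmt.Println("raised UDPBufferSize:", cfg.UDPBufferSize)
}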

stevesg commented Jul 8, 2021

Proposed fix: hashicorp/memberlist#239

@cabrinha

I'm not running 127 ingesters (I'm running 75), but I have a lot of other nodes in the memberlist, more than 127 total. I see similar errors in the logs:

{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791965958Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791980313Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.791992495Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (119) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792003049Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (100) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792013388Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (98) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.79202562Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (114) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792059365Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (45) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792074547Z"}
{"caller":"memberlist_logger.go:74","level":"error","msg":"msg type (32) not supported from=192.168.40.68:7946","ts":"2021-09-21T22:47:06.792087473Z"}

stevesg commented Sep 23, 2021

Good point - I was only testing with ingesters, but it's actually related to the number of memberlist members, not ring members.

stale bot commented Dec 25, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Dec 25, 2021
stale bot closed this as completed Jan 9, 2022