Connection Slot Exhaustion with Passive Nodes #29329

lyciumlee · 2024-03-24T14:24:32Z

System information

Geth version: geth version 1.13.14
CL client & version: e.g. lighthouse/nimbus/prysm@v1.0.0
OS & Version: Windows/Linux/OSX
Commit hash : (if develop)

This issue was reported to Fredrik Svantes, who suggested that I open an issue here. @fredriksvantes

Short description:

The ETH Wire Protocol lacks a mechanism for periodically disconnecting passive nodes, i.e., nodes that receive messages but do not disseminate them to others. This allows an attacker to exhaust the inbound connection slots of all public nodes with few IP and bandwidth resources, thereby preventing new nodes from joining the network.

Attack scenario:
Connection exhaustion attacks have a long history in p2p systems. This attack is cheaper in Ethereum than in most other p2p networks as the Ethereum network does not evict passive nodes. These nodes are operated with extremely low costs as they do not disseminate blocks and transaction messages to the others. Neither do they download historical blockchain data when joining the network.

The attacker deploys dozens of modified Geth nodes in the Ethereum network, which differ from ordinary Geth nodes in the following two aspects. First, these nodes have their outbound connection limit removed and constantly try to establish connections with all known reachable Ethereum nodes. Optionally, these nodes have a large limit on inbound connections, allowing them to accept as many connections from newly joined nodes as possible. Second, they do not store or propagate any blockchain data. This significantly lowers the storage and bandwidth cost of the attack.

Under the current protocol, these nodes are considered benign by the network. Once they complete the ETH Wire Protocol handshake with honest nodes, they only receive ETH Wire Protocol messages without actively sending any themselves. This way, the attacker continually occupies the honest nodes’ connection slots, preventing new nodes from joining the network.

The process described above constitutes a low-cost DoS attack. In an ideal scenario, an attacker would only need the computational resources of 34 nodes—the number of inbound connection slots—to attack the entire Ethereum Mainnet network.

The root cause of this attack is that the Wire Protocol has no challenge-response or reputation mechanism to verify whether peers in the ETH Wire Protocol are honest or passive nodes.

Impact:
Due to all inbound connections being occupied by attacking nodes, new nodes are unable to join the Ethereum Mainnet, and nodes that have dropped off cannot rejoin the network.

Components:
The Peer module under the eth protocol in Geth does not differentiate between active and passive nodes, so it is impossible to determine which nodes are good and which are bad.

After ETH 2.0, execution clients are only responsible for relaying transaction-related messages to each other, as can be seen in the source code file eth/protocols/eth/peer.go. The broadcastTransactions and announceTransactions methods of the peer struct are responsible for the forwarding and handling of new transactions.

// NewPeer creates a wrapper for a network connection and negotiated protocol
// version.
func NewPeer(version uint, p *p2p.Peer, rw p2p.MsgReadWriter, txpool TxPool) *Peer {
	peer := &Peer{
		id:              p.ID().String(),
		Peer:            p,
		rw:              rw,
		version:         version,
		knownTxs:        newKnownCache(maxKnownTxs),
		knownBlocks:     newKnownCache(maxKnownBlocks),
		queuedBlocks:    make(chan *blockPropagation, maxQueuedBlocks),
		queuedBlockAnns: make(chan *types.Block, maxQueuedBlockAnns),
		txBroadcast:     make(chan []common.Hash),
		txAnnounce:      make(chan []common.Hash),
		reqDispatch:     make(chan *request),
		reqCancel:       make(chan *cancel),
		resDispatch:     make(chan *response),
		txpool:          txpool,
		term:            make(chan struct{}),
	}
	// Start up all the broadcasters
	go peer.broadcastBlocks()
	go peer.broadcastTransactions()
	go peer.announceTransactions()
	go peer.dispatcher()

	return peer
}

It can be observed that neither these two functions nor the corresponding handler functions (handleNewPooledTransactionHashes, handleTransactions, and handlePooledTransactions) check whether the peer nodes are active.

Reproduction:
To implement an attacking node that does not forward any transaction messages, it is only necessary to remove the calls that forward transactions in broadcastTransactions and announceTransactions.

The following code shows what the attacker would modify.

eth/handler.go

// BroadcastTransactions will propagate a batch of transactions
// - To a square root of all peers for non-blob transactions
// - And, separately, as announcements to all peers which are not known to
// already have the given transaction.
func (h *handler) BroadcastTransactions(txs types.Transactions) {
	var (
		blobTxs  int // Number of blob transactions to announce only
		largeTxs int // Number of large transactions to announce only

		directCount int // Number of transactions sent directly to peers (duplicates included)
		directPeers int // Number of peers that were sent transactions directly
		annCount    int // Number of transactions announced across all peers (duplicates included)
		annPeers    int // Number of peers announced about transactions

		txset = make(map[*ethPeer][]common.Hash) // Set peer->hash to transfer directly
		annos = make(map[*ethPeer][]common.Hash) // Set peer->hash to announce
	)
	// Broadcast transactions to a batch of peers not knowing about it
	for _, tx := range txs {
		peers := h.peers.peersWithoutTransaction(tx.Hash())

		var numDirect int
		switch {
		case tx.Type() == types.BlobTxType:
			blobTxs++
		case tx.Size() > txMaxBroadcastSize:
			largeTxs++
		default:
			numDirect = int(math.Sqrt(float64(len(peers))))
		}
		// Send the tx unconditionally to a subset of our peers
		for _, peer := range peers[:numDirect] {
			txset[peer] = append(txset[peer], tx.Hash())
		}
		// For the remaining peers, send announcement only
		for _, peer := range peers[numDirect:] {
			annos[peer] = append(annos[peer], tx.Hash())
		}
	}
	for _, hashes := range txset {
		directPeers++
		directCount += len(hashes)
		// Attacker modification: the call peer.AsyncSendTransactions(hashes) is removed.
	}
	for _, hashes := range annos {
		annPeers++
		annCount += len(hashes)
		// Attacker modification: the call peer.AsyncSendPooledTransactionHashes(hashes) is removed.
	}
	log.Debug("Distributed transactions", "plaintxs", len(txs)-blobTxs-largeTxs, "blobtxs", blobTxs, "largetxs", largeTxs,
		"bcastpeers", directPeers, "bcastcount", directCount, "annpeers", annPeers, "anncount", annCount)
}

eth/protocols/eth/broadcast.go

// announceTransactions is a write loop that schedules transaction broadcasts
// to the remote peer. The goal is to have an async writer that does not lock up
// node internals and at the same time rate limits queued data.
func (p *Peer) announceTransactions() {
	var (
		queue  []common.Hash         // Queue of hashes to announce as transaction stubs
		done   chan struct{}         // Non-nil if background announcer is running
		fail   = make(chan error, 1) // Channel used to receive network error
		failed bool                  // Flag whether a send failed, discard everything onward
	)
	for {
		// If there's no in-flight announce running, check if a new one is needed
		if done == nil && len(queue) > 0 {
			// Pile transaction hashes until we reach our allowed network limit
			var (
				count        int
				pending      []common.Hash
				pendingTypes []byte
				pendingSizes []uint32
				size         common.StorageSize
			)
			for count = 0; count < len(queue) && size < maxTxPacketSize; count++ {
				if tx := p.txpool.Get(queue[count]); tx != nil {
					pending = append(pending, queue[count])
					pendingTypes = append(pendingTypes, tx.Type())
					pendingSizes = append(pendingSizes, uint32(tx.Size()))
					size += common.HashLength
				}
			}
			// Shift and trim queue
			queue = queue[:copy(queue, queue[count:])]

			// If there's anything available to transfer, fire up an async writer
			if len(pending) > 0 {
				done = make(chan struct{})
				go func() {
					// Attacker modification: the send below is removed, so nothing is announced.
					// if err := p.sendPooledTransactionHashes(pending, pendingTypes, pendingSizes); err != nil {
					// 	fail <- err
					// 	return
					// }
					close(done)
					p.Log().Trace("Sent transaction announcements", "count", len(pending))
				}()
			}
		}
		// Transfer goroutine may or may not have been started, listen for events
		select {
		case hashes := <-p.txAnnounce:
			// If the connection failed, discard all transaction events
			if failed {
				continue
			}
			// New batch of transactions to be broadcast, queue them (with cap)
			queue = append(queue, hashes...)
			if len(queue) > maxQueuedTxAnns {
				// Fancy copy and resize to ensure buffer doesn't grow indefinitely
				queue = queue[:copy(queue, queue[len(queue)-maxQueuedTxAnns:])]
			}

		case <-done:
			done = nil

		case <-fail:
			failed = true

		case <-p.term:
			return
		}
	}
}

The following changes are an optimization that lets the exploit reach ideal conditions.
We modify func (d *downloader.Downloader) RegisterPeer(id string, version uint, peer downloader.Peer) and func (s *snap.Syncer) Register(peer snap.SyncPeer) so that they return nil immediately. These two functions register peers with the blockchain synchronization protocols.

// RegisterPeer injects a new download peer into the set of block source to be
// used for fetching hashes and blocks from.
func (d *Downloader) RegisterPeer(id string, version uint, peer Peer) error {
	// Attacker modification: the original body below is removed, so no sync
	// peer is ever registered with the downloader.
	//
	// var logger log.Logger
	// if len(id) < 16 {
	// 	// Tests use short IDs, don't choke on them
	// 	logger = log.New("peer", id)
	// } else {
	// 	logger = log.New("peer", id[:8])
	// }
	// logger.Trace("Registering sync peer")
	// if err := d.peers.Register(newPeerConnection(id, version, peer, logger)); err != nil {
	// 	logger.Error("Failed to register sync peer", "err", err)
	// 	return err
	// }
	return nil
}


// Register injects a new data source into the syncer's peerset.
func (s *Syncer) Register(peer SyncPeer) error {
	// Attacker modification: the original body below is removed, so the peer
	// is never added to the syncer's peerset.
	//
	// // Make sure the peer is not registered yet
	// id := peer.ID()
	// s.lock.Lock()
	// if _, ok := s.peers[id]; ok {
	// 	log.Error("Snap peer already registered", "id", id)
	// 	s.lock.Unlock()
	// 	return errors.New("already registered")
	// }
	// s.peers[id] = peer
	// s.rates.Track(id, msgrate.NewTracker(s.rates.MeanCapacities(), s.rates.MedianRoundTrip()))
	// // Mark the peer as idle, even if no sync is running
	// s.accountIdlers[id] = struct{}{}
	// s.storageIdlers[id] = struct{}{}
	// s.bytecodeIdlers[id] = struct{}{}
	// s.trienodeHealIdlers[id] = struct{}{}
	// s.bytecodeHealIdlers[id] = struct{}{}
	// s.lock.Unlock()
	//
	// // Notify any active syncs that a new peer can be assigned data
	// s.peerJoin.Send(id)
	return nil
}

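According to data from ethernodes.org, there are approximately 7000 nodes in the entire network, so we need to set MaxPeers in node/defaults.go to 7000 * 3 = 21000. The factor of 3 presumably accounts for the default dial ratio, under which only one third of MaxPeers is available for outbound connections.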

Fix:
To fix this security vulnerability, we see two options:

1. Add a response reputation variable to the peer structure. Each time an ETH Wire Protocol message is received from a node, its score is increased. A certain number of points is then deducted periodically, for example every 5 minutes. If a node's score drops below a certain threshold within an hour, the node is marked as malicious.
2. Add challenge-response to the implementation of the ETH Wire Protocol, which currently has none. Challenge-response means that honest nodes randomly request known messages from their peers to detect whether those peers are actively responsive.


weiihann · 2024-03-24T16:07:07Z

If an honest geth node connects to a malicious node and attempts to sync with it, the malicious node will be dropped because it cannot provide the correct data; see this code section.

learnerLj · 2024-03-24T16:46:51Z

I am also looking into this problem. See #29327.

lyciumlee · 2024-03-24T16:53:53Z

@weiihann Dear weiihann, the situation you described is partially correct, but there are two scenarios in which passive nodes will still exist in large numbers.

1. Honest nodes that are starting up try to select the neighbor with the highest block height for block synchronization, so malicious nodes can persist as long as they do not announce a very high block height.
2. Nodes that are synchronizing with the network will not start syncing from other honest nodes because, in the case of ETH 2.0, blocks are inserted into Geth by Prysm.

weiihann · 2024-03-26T10:59:44Z

I'm curious why this hasn't occurred long ago, if the attack cost is relatively low.

Btw, just in case you missed it, I'd suggest submitting this to the bug bounty program.

fjl · 2024-03-26T11:25:35Z

It was already submitted to the bug bounty program, and then got sent here.

learnerLj · 2024-03-26T12:35:42Z

> I'm curious why this hasn't occurred long ago, if the attack cost is relatively low.

This problem may have first been raised around 2014 or earlier. However, as karalabe mentioned in #29327 (comment), #29327 (comment), and #29034 (comment), the early developers encountered a very severe error in which peers rejected each other because of a similar "improvement". Since then, they have tried not to make such changes and consider the current behavior acceptable.

weiihann · 2024-03-26T12:42:28Z

> > I'm curious why this hasn't occurred long ago, if the attack cost is relatively low.
>
> This problem may have first been raised around 2014 or earlier. However, as karalabe mentioned in #29327 (comment), #29327 (comment), and #29034 (comment), the early developers encountered a very severe error in which peers rejected each other because of a similar "improvement". Since then, they have tried not to make such changes and consider the current behavior acceptable.

Thanks for sharing!

fjl · 2024-03-26T13:01:03Z

The attack is mostly a theoretical one. However, it would still be nice to fix it. @lyciumlee if you have a fix in mind, I am open to discussing it!

lyciumlee · 2024-03-26T13:26:18Z

@weiihann Dear weiihann, I have already submitted the report to the Ethereum Bug Bounty Program before disclosing it to the community, and I have email correspondence records. Fredrik Svantes suggested that I post the report here to discuss the issue with everyone.

lyciumlee · 2024-03-26T13:43:17Z

@fjl Dear fjl, I am primarily dedicated to the research of blockchain network protocols and have a keen interest in this area. I am also very willing to participate in the activities of building the Ethereum community. To address this issue, a score-based Peer mechanism can be used. The reasons are as follows:

1. The score-based Peer mechanism is a passive system, so it can be introduced while the entire network of nodes upgrades gradually.
2. The score-based Peer mechanism aligns with the expected behavior of honest nodes.

For the first reason, even if some nodes forget to upgrade, this mechanism will not affect the non-upgraded nodes, nor will it cause the network to be segmented or forked. Because this mechanism is entirely a passive behavior marking mechanism, an honest node will transfer and broadcast many messages to the network through the ETH Wire Protocol and protocols based on the ETH Wire Protocol. Therefore, the upgraded nodes know that these non-upgraded nodes are honest nodes, which will not lead to the network being split.

For the second reason, Execution Layer Nodes primarily use the ETH Wire Protocol for broadcasting Transactions-related messages, meaning when a new Transaction enters the Mempool of an honest node, it notifies other nodes that are unaware of the transaction. Therefore, upon receiving the message, upgraded nodes under the score-based Peer mechanism will recognize that the node is an honest node. Nodes that only receive messages without broadcasting any are actually harmful to the network. Although they do nothing but accept all the network messages they receive, they occupy connection slots of other nodes. The score-based mechanism can identify those that haven't generated any beneficial ETH Wire Protocol messages for a long time. If these malicious nodes are forced to participate in the message forwarding process, we also achieve the purpose of preventing this malicious behavior.

fjl · 2024-03-26T13:54:41Z

One problem with your solution suggestion is the existence of syncing nodes. While the node is syncing, it cannot relay transactions.

lyciumlee · 2024-03-26T14:03:57Z

@fjl Dear fjl, I understand the sync process you mentioned.
During the sync process, nodes indeed do not send any transaction-related messages. However, from another perspective, this is because honest nodes are attempting to synchronize with the network. Moreover, interactions among these nodes occur in pairs, such as GetStorageRangesMsg and StorageRangesMsg in the sync protocol. Synchronizing with the network is itself an intentional, active interaction with the network. Therefore, the score-based mechanism would count all such messages as activity. This is in contrast to the one-way propagation mechanism, which existed in the ETH 1.0 era for block propagation and continues today for transaction propagation, and which allows completely passive nodes.

holiman · 2024-03-26T14:29:59Z

It's easy to build a naive peer scoring system.
It's easy to bypass (send random "get storagerangemsg" to simulate a syncing node, but still be passive)
It's easy to improve the peer scoring system (detect the faked storage ranges, improve the faker-detection)
It's easy to bypass (use a different set of messages to fake "active" peer)
and on, and on

I'm a bit scared that if we build a peer scoring system, we're entering a game which has no end. Every 6 months, a new whitepaper will be presented on how some researchers bypassed geth's peer scoring system. And at some point, a scoring-system will unintentionally cut off syncing nodes, or client X (besu / nethermind), or clients with limited tx pool capacity, or something else.

lyciumlee · 2024-03-26T14:47:42Z

@holiman Dear holiman, I understand the concerns of the community. It seems we need to find a balance between simplicity and security.

learnerLj · 2024-03-26T14:49:21Z

> Every 6 months, a new whitepaper will be presented on how some researchers bypassed geth's peer scoring system.

haha, such an interesting insight. Many companies have the motivation and incentive to bypass geth's peer scoring system. I might even secure a job with a higher salary, as they would be willing to pay for expertise in countering the scoring system in the future😃.

fjl · 2024-03-26T16:11:49Z

I think we should just add a system where we disconnect a random peer every so often. It doesn't need any score/rules.

weiihann · 2024-03-27T05:09:44Z

> I think we should just add a system where we disconnect a random peer every so often. It doesn't need any score/rules.

But the peer can/would just retry the connection immediately upon disconnection?

karalabe · 2024-04-02T07:40:42Z

The reason we haven't added a reputation system or an algorithmic disconnect is because they are too easy to game. Since Geth's code is public, it's trivial to see what the rules are and how to fake them. What you'll end up with is non-zero probability of unintended side effects dropping legitimate peers and close-zero probability of actual effect on malicious peers who just fake some traffic. As @holiman mentioned, it becomes a game of whack-a-mole.

Whilst I agree that your concern is legitimate, IMO it's very hard to find a solution to a non-existing problem (as in not-actively exploited), because we just don't know what the problem would look like and what the actual solution would be. Fixing every possible attack scenario anyone can ever dream up in the future is a questionable effort (whilst noble).

An alternative line of thought is what the probabilities and gains are for such an attack. Currently block propagation is handled by consensus nodes, so filling connection slots doesn't really block the network from functioning. Transaction propagation can be impacted, but there are many MEV private pools that could be injected into directly (which many do already), so it's not obvious what the gain would be to block one mechanism whilst the other is still going strong.

My 2 cents are that we need resilience more than fool-proof-ness. For example, for the discovery protocol, we have two mechanisms: the DHT itself and the DNS discovery. Both could in theory be attacked, but doing it simultaneously is probably non-trivial and would be quickly detected. We've added 2 to have each be a backup/fallback in case the other has some issues. Sure, we want to make both as robust as meaningful, but neither needs to be absolutely perfectly resilient.

For the transactions, we again have the two mechanisms (txpools, mev pools) that act as one another's backup. Of course they are not serving the same purpose, but they do provide resiliency.

The sync code at some point was quite aggressive with dropping "useless" peers, so it should be kind of hard to eclipse syncing nodes off from the network - at least as much as EL is concerned.

IMO it is more valuable to have robust monitoring to detect anomalous behavior and course correct (on top of the resilient mechanisms) rather than to cover all possible bases all the time, investing infinite resources.

lyciumlee · 2024-04-11T05:33:07Z

Thank you all for your opinions, and thanks to the Ethereum developers for their enthusiastic answers. I have benefited immensely from everyone's responses.

lyciumlee added the type:bug label Mar 24, 2024
