
Support intercepting onion messages for offline peers #2973

Merged
merged 9 commits into lightningdevkit:main on May 9, 2024

Conversation

valentinewallace (Contributor, Author)

As part of implementing async payments, we need a way to store onion message forwards on behalf of often-offline next-hop peers. This allows payers to send a held_htlc_available onion message to the mobile recipient, where the recipient's counterparty holds onto this onion message until the recipient comes back online.

Rather than storing onion messages internally in LDK, we generate events on OM interception and later signal to the user to re-inject these onion messages for forwarding.

This implements basic offline peer OM interception support. As a follow-up, we should support indicating to users when an intercepted OM has successfully been forwarded and is safe to delete, see #2950.

Partially addresses #2950, partially addresses #2298.
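A rough sketch of the user-side flow this enables, for illustration only (the `MyEvent` enum, the byte-vector message type, and the commented-out `forward_onion_message` call are stand-ins; the real LDK types and signatures are not shown in this thread):

```rust
use std::collections::HashMap;

// Stand-ins for the two event variants added by this PR; the real events carry
// LDK types rather than raw arrays and byte vectors.
enum MyEvent {
    OnionMessageIntercepted { peer_node_id: [u8; 33], message: Vec<u8> },
    OnionMessagePeerConnected { peer_node_id: [u8; 33] },
}

// Intercepted onion messages the user chose to hold, keyed by next-hop peer.
struct OmStore(HashMap<[u8; 33], Vec<Vec<u8>>>);

fn handle_event(store: &mut OmStore, event: MyEvent) {
    match event {
        // An OM arrived for an offline peer we serve: stash it for later.
        MyEvent::OnionMessageIntercepted { peer_node_id, message } => {
            store.0.entry(peer_node_id).or_default().push(message);
        },
        // The peer reconnected: re-inject everything held for it, e.g. via the
        // messenger's forwarding method in the real API.
        MyEvent::OnionMessagePeerConnected { peer_node_id } => {
            for _message in store.0.remove(&peer_node_id).unwrap_or_default() {
                // onion_messenger.forward_onion_message(_message, peer_node_id)?;
            }
        },
    }
}
```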

@@ -921,6 +921,17 @@ where
		}
		msgs
	}

	fn enqueue_event(&self, event: Event) {
		const MAX_EVENTS_BUFFER_SIZE: usize = (1 << 10) * 256;
Contributor Author

Would like to get some thoughts on this event buffer size (currently 0.34MB). I can also add logging if an event is dropped, though I believe that should be rare.

Collaborator

We should define it as N / ::core::mem::size_of::<Event>(), but 1MiB seems fine? I don't think we're worried about a machine running with this that is super duper memory constrained.
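A minimal sketch of what that could look like, assuming a stand-in `Event` type and an illustrative 1 MiB budget (neither is LDK's actual code):

```rust
// Stand-in with a non-zero inline size; LDK's real Event enum is larger.
struct Event {
    _inline_payload: [u8; 200],
}

// Derive the entry cap from a byte budget and the inline size of the event type.
const TARGET_BUFFER_BYTES: usize = 1 << 20; // ~1 MiB, illustrative
const MAX_EVENTS_BUFFER_SIZE: usize = TARGET_BUFFER_BYTES / ::core::mem::size_of::<Event>();

fn main() {
    // With the stand-in above this prints 5242 (1_048_576 / 200).
    println!("cap: {} events", MAX_EVENTS_BUFFER_SIZE);
}
```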

Contributor

Wouldn't that exclude the majority of the bytes in the onion packet, which are on the heap? Are we trying to limit the number of events or the total size of those events?
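To illustrate the point (schematic only; `InterceptedEvent` is a made-up stand-in, not LDK's event): `size_of` reports only the inline struct size, so a heap-allocated onion payload would not count toward a cap derived from it.

```rust
use core::mem::size_of;

struct InterceptedEvent {
    peer_node_id: [u8; 33],
    hop_data: Vec<u8>, // the bulk of an onion packet lives here, on the heap
}

fn main() {
    let ev = InterceptedEvent { peer_node_id: [2u8; 33], hop_data: vec![0u8; 1300] };
    // The inline size stays the same no matter how large hop_data grows.
    println!("inline: {} bytes, heap payload: {} bytes",
        size_of::<InterceptedEvent>(), ev.hop_data.len());
}
```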

Contributor

@valentinewallace Not sure if this was missed.

Contributor Author

Sorry about that, I thought we were trying to limit the total size of the events. I'm not sure I'm understanding @TheBlueMatt's suggestion :(

Collaborator

Ah, no, @jkczyz is right, I was too used to serialization buffers.

@@ -983,6 +1065,13 @@ where
				e.get_mut().enqueue_message(onion_message);
				log_trace!(logger, "Forwarding an onion message to peer {}", next_node_id);
			},
			_ if self.intercept_oms_for_offline_peers => {
				self.enqueue_event(
Contributor Author

Note: we could avoid introducing a lock order dependency between message_recipients and pending_events by pushing the pending events at the end of the method. I didn't bother but lmk if that seems worth it.
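A sketch of the alternative described above, with placeholder field types (the real members are a peer map and a `Vec<Event>`): collect new events locally while holding the first lock, and only touch `pending_events` after it is released, so no lock order is established between the two mutexes.

```rust
use std::sync::Mutex;

struct Messenger {
    message_recipients: Mutex<Vec<String>>, // placeholder for the real peer map
    pending_events: Mutex<Vec<String>>,     // placeholder for Vec<Event>
}

impl Messenger {
    fn handle_onion_message(&self, msg: String) {
        let mut new_events = Vec::new();
        {
            let _recipients = self.message_recipients.lock().unwrap();
            // ...determine that the next hop is offline and intercept...
            new_events.push(format!("intercepted: {msg}"));
        } // message_recipients lock dropped here
        // pending_events is only locked once the other lock is gone.
        self.pending_events.lock().unwrap().append(&mut new_events);
    }
}
```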

Contributor

I know nobody has responded to your comment, but I think it should be fine as is. Any thoughts, @TheBlueMatt, @jkczyz?

Contributor

Yeah, should be fine.

@TheBlueMatt (Collaborator)

Basically LGTM, but we should figure out how to pipe the events through for default users. Generally our users don't expect to actually hook up to events providers manually, they expect to do it via the BP, so maybe we need to take an Optional OnionMessenger to the BP?

@valentinewallace (Contributor, Author)

> Basically LGTM, but we should figure out how to pipe the events through for default users. Generally our users don't expect to actually hook up to events providers manually, they expect to do it via the BP, so maybe we need to take an Optional OnionMessenger to the BP?

I believe that was already done when we started generating the ConnectionNeeded events, see: https://github.com/lightningdevkit/rust-lightning/blob/main/lightning-background-processor/src/lib.rs#L305

@arik-so (Contributor)

arik-so commented Mar 28, 2024

Looks good to me, too!

@jkczyz self-requested a review on March 28, 2024 at 17:24
@shaavan (Contributor) left a comment

The PR looks great! 🚀
Just a few points I would love to discuss!

Comment on lines 804 to 894
		match message_recipients.entry(*peer_node_id) {
			hash_map::Entry::Occupied(mut e) if e.get().is_connected() => {
				e.get_mut().enqueue_message(message);
				Ok(())
			},
			_ => Err(SendError::InvalidFirstHop(*peer_node_id))
Contributor

I was giving some thought to how we are handling this situation. It might be possible that the peer is known but just not connected at the moment. Sending an InvalidFirstHop error might confuse users, since it reads as "The first hop is not a peer and …". Maybe we could introduce a new error type for cases like this?

Contributor Author

Hmm, why is that confusing? At this stage we don't have any way of knowing whether the peer is known or not. I think "not a peer" captures that pretty well?

Contributor

If we're not sure if the peer's known or not, InvalidFirstHop could work. But in this code snippet, we're checking if the peer's connected with if e.get().is_connected(), not if it's known. So, something like a PeerNotConnected error would make more sense for the error message. Still, I'm cool with whatever you think works best!
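For concreteness, the suggestion amounts to something like the following (hypothetical shape; LDK's real `SendError` variants carry a `PublicKey`, and `PeerNotConnected` does not exist in the codebase):

```rust
// Hypothetical error shape distinguishing "not a peer" from "known but offline".
enum SendError {
    InvalidFirstHop([u8; 33]),  // existing idea: the first hop is not a peer at all
    PeerNotConnected([u8; 33]), // proposed: peer is known but currently disconnected
}
```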

lightning/src/onion_message/messenger.rs — outdated review thread, resolved
Comment on lines 742 to 826
/// LDK will not rate limit how many [`Event::OnionMessageForOfflinePeer`]s
/// are generated, so it is the caller's responsibility to limit how many
Contributor

Technically, that's correct, but we do have a hard cap on the total number of events we can store (enforced by enqueue_event). Do you think it would be a good idea to update the docs to clarify that users can't set an infinitely high cap on the number of events they can receive?

Contributor Author

The events queue should never get close to filling up; if it does, we have bigger problems and the user's offer implementation is probably broken :) Not sure what you mean about users setting a cap on the number of events they can receive.

Contributor

Well, that makes sense! The upper cap enforced by enqueue_event is more of a fail-safe than a hard cap, so it makes sense not to mention it here!

> Not sure what you mean about users setting a cap on the number of events they can receive.

So I was trying to point to this line:

> "it is the caller's responsibility to limit how many..."

But considering the context, I don't think we need to make any changes here.
Thanks for clarifying it! ✌️

@codecov-commenter
codecov-commenter commented Apr 16, 2024

Codecov Report

Attention: Patch coverage is 85.89744%, with 33 lines in your changes missing coverage. Please review.

Project coverage is 91.37%. Comparing base (3b1b0a5) to head (a5ada64).
Report is 130 commits behind head on main.

Files Patch % Lines
lightning/src/events/mod.rs 28.12% 21 Missing and 2 partials ⚠️
lightning/src/onion_message/messenger.rs 93.02% 5 Missing and 1 partial ⚠️
lightning/src/onion_message/packet.rs 0.00% 3 Missing ⚠️
lightning/src/onion_message/functional_tests.rs 99.11% 1 Missing ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2973      +/-   ##
==========================================
+ Coverage   89.35%   91.37%   +2.02%     
==========================================
  Files         117      118       +1     
  Lines       96335   109415   +13080     
  Branches    96335   109415   +13080     
==========================================
+ Hits        86080    99980   +13900     
+ Misses       8033     6972    -1061     
- Partials     2222     2463     +241     


		}

		match message_recipients.entry(*peer_node_id) {
			hash_map::Entry::Occupied(mut e) if e.get().is_connected() => {
Contributor

Would it be useful to hit the ConnectionNeeded path, too? Say a peer comes back online, the user gets the OnionMessagePeerConnected event, but the peer goes offline before forward_onion_message is called. When the user calls forward_onion_message, it errors and consumes the OnionMessage, which is now lost.

Contributor Author

Yeah, it should be rare that a peer is disconnected by the time we go to forward, but it is currently possible. As mentioned in the PR description, planning to address this in follow-up per the race condition TODO here: #2950

Contributor

Ah, thanks for pointing that out. How would the ping/pong solution prevent storing data for a non-existent peer? (e.g., when a malicious actor asks us to forward a message to a made up node id)

Contributor Author

Not sure I understand your question, but it's up to the LDK user's event handling code to decide which peers it should store OM forwards for and which should be discarded. If the user decides an OM is worth storing and later calls forward_onion_message when the outbound peer comes back online, then we would go through the ping/pong flow to let them know when the OM is officially safe to delete.
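A sketch of that user-side decision (the served-peer set and storage map are hypothetical; they stand in for whatever persistence the node operator already uses):

```rust
use std::collections::{HashMap, HashSet};

fn handle_intercepted(
    peer_node_id: [u8; 33],
    serialized_om: Vec<u8>,
    served_peers: &HashSet<[u8; 33]>,
    store: &mut HashMap<[u8; 33], Vec<Vec<u8>>>,
) {
    // Only hold onion messages for peers we actually have a relationship with;
    // forwards addressed to unknown or made-up node ids are simply dropped.
    if served_peers.contains(&peer_node_id) {
        store.entry(peer_node_id).or_default().push(serialized_om);
    }
}
```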

Contributor

Oh, I think I misunderstood the proposed solution. I thought LDK would manage whether to forget the message (in OnionMessenger) based on whether the ping/pong completed. But, IIUC, the code in this PR would remain as is and LDK would instead surface some event when a peer comes online and has completed a ping/pong. Is that correct?

So, in the scenario I gave, the user would simply not store the message in the first place?

Contributor Author

> the code in this PR would remain as is and LDK would instead surface some event when a peer comes online and has completed a ping/pong. Is that correct?

> So, in the scenario I gave, the user would simply not store the message in the first place?

Yes to both!

lightning/src/onion_message/messenger.rs — review thread, resolved
lightning/src/onion_message/messenger.rs — outdated review thread, resolved
@@ -175,6 +175,8 @@ where
	message_router: MR,
	offers_handler: OMH,
	custom_handler: CMH,
	intercept_oms_for_offline_peers: bool,
	pending_events: Mutex<Vec<Event>>,
Contributor

Why use a separate queue if these aren't persisted? If we instead use message_recipients for buffering -- with a new variant -- then we can re-use outbound_buffer_full and generate events in the same manner as ConnectionNeeded. Otherwise, we are left with two different systems for buffering messages and generating events.

Contributor Author

Hmm, I believe with this approach an attacker may be able to fragment our heap by continually sending bogus forwards to a new fake peer each time (where each fake forward results in a VecDeque allocation). I think my other reasoning is that because the events generated may all be bogus, they shouldn't be allowed to count towards outbound_buffer_full, since currently all of that traffic is known to be more legitimate (either an outbound OM initiated by the LDK user or a forward to a peer that is connected).
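To illustrate the concern (entirely schematic; the map type and message sizes are made up): if each bogus forward to a previously-unseen node id created its own buffered entry, every attacker message would cost a fresh small allocation.

```rust
use std::collections::{HashMap, VecDeque};

fn main() {
    // Stand-in for a per-peer message buffer keyed by (fake) node id.
    let mut message_recipients: HashMap<u64, VecDeque<Vec<u8>>> = HashMap::new();
    for fake_peer_id in 0u64..10_000 {
        // Each never-to-connect "peer" gets its own VecDeque plus a payload
        // allocation, scattering small allocations across the heap.
        message_recipients
            .entry(fake_peer_id)
            .or_insert_with(VecDeque::new)
            .push_back(vec![0u8; 1300]); // stand-in for an onion packet
    }
    println!("buffered entries for fake peers: {}", message_recipients.len());
}
```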

Contributor

> Hmm, I believe with this approach an attacker may be able to fragment our heap by continually sending bogus forwards to a new fake peer each time (where each fake forward results in a VecDeque allocation).

Yeah, though this seems to be an issue already with onion messages, FWIW. Processing an OnionMessage involves allocating a Vec<u8> for Packet::hop_data.

> I think my other reasoning is that because the events generated may all be bogus, they shouldn't be allowed to count towards outbound_buffer_full, since currently all of that traffic is known to be more legitimate (either an outbound OM initiated by the LDK user or a forward to a peer that is connected).

Right, though this makes me wonder if the user should specify which peers it supports forwarding to. See other comment about the race condition.

Contributor

Discussed offline. Yeah, I suppose specifying which peers support forwarding adds a bit of a burden to the user. They are already doing this when processing events, so there's no sense forcing them to duplicate it here and risk inconsistencies.

@valentinewallace (Contributor, Author)

Addressed feedback.

lightning/src/onion_message/messenger.rs — outdated review thread, resolved

lightning/src/events/mod.rs — outdated review thread, resolved
@valentinewallace (Contributor, Author)

Thanks @jkczyz, addressed feedback.

lightning/src/events/mod.rs — outdated review thread, resolved

Comment on lines +1199 to +1201
				self.enqueue_event(
					Event::OnionMessagePeerConnected { peer_node_id: *their_node_id }
				);
Contributor

Seems like it could be a problem if we enqueue a bunch of Event::OnionMessageForOfflinePeer but end up dropping an Event::OnionMessagePeerConnected for the corresponding peer because of a full buffer. Maybe events for a different peer cause the buffer to fill up, for instance. Should we unconditionally enqueue the latter type of events?
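One possible shape of that suggestion, purely illustrative (the `exempt_from_cap` flag is invented for this sketch and is not part of the PR):

```rust
fn enqueue_event(events: &mut Vec<String>, event: String, exempt_from_cap: bool) {
    const MAX_EVENTS: usize = 1024; // illustrative cap, not LDK's constant
    if exempt_from_cap || events.len() < MAX_EVENTS {
        events.push(event);
    }
    // Otherwise the event is silently dropped; peer-connected events would pass
    // exempt_from_cap = true so they never hit this path.
}
```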

Contributor Author

As mentioned in #2973 (comment), the events queue should never get close to filling up, so I'm not sure it's worth adding special handling for event-dropping cases?

@arik-so (Contributor) left a comment

Looks good to me overall! I was about to ask a question about peers disconnecting prior to the message getting forwarded, but realized you already addressed it: #2973 (comment)

Should be good to squash imo.

@jkczyz (Contributor)

jkczyz commented May 9, 2024

Feel free to squash

@valentinewallace (Contributor, Author)

Squashed.

Comment on lines 1115 to 1121
			_ if self.intercept_messages_for_offline_peers => {
				self.pending_events.lock().unwrap().push(
					Event::OnionMessageIntercepted {
						peer_node_id: next_node_id, message: onion_message
					}
				);
			},
Contributor

Could we log_trace here like the other arms?

Contributor Author

I omitted it because of our past discussions of avoiding adding logs that would allow someone to adversarially blow up our logs on disk (there are some pre-existing ones in this method, ofc). So I'm not sure it's a good idea?

Contributor

Hmm... I thought that was only applicable for higher-level logging like log_error. Will let @TheBlueMatt provide guidance.

Contributor Author

Sure, happy to add it in follow-up too, no reason to hold the PR up.

Comment on lines +1313 to +1325
			&Event::OnionMessageIntercepted { ref peer_node_id, ref message } => {
				37u8.write(writer)?;
				write_tlv_fields!(writer, {
					(0, peer_node_id, required),
					(2, message, required),
				});
			},
			&Event::OnionMessagePeerConnected { ref peer_node_id } => {
				39u8.write(writer)?;
				write_tlv_fields!(writer, {
					(0, peer_node_id, required),
				});
			}
Contributor

IIUC, these are never persisted by LDK, but this lets us check the buffer size. Clients could persist them, I suppose, but would probably want to persist the parts rather than the enum.
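If a client did want to persist these, storing the parts might look roughly like this (a hypothetical client-side record, not anything LDK defines):

```rust
// Persist the pieces of an intercepted-OM event rather than the Event enum itself.
struct StoredInterceptedOm {
    peer_node_id: [u8; 33],       // stand-in for the peer's public key
    onion_message_bytes: Vec<u8>, // the serialized onion message payload
}
```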

@arik-so (Contributor) left a comment

🚢

@arik-so merged commit 8f1dc54 into lightningdevkit:main on May 9, 2024
16 checks passed