Make timer_tick_occurred a second long timer #3065

Open
shaavan wants to merge 2 commits into main

Conversation

@shaavan (Contributor) commented May 14, 2024

This PR updates timer_tick_occurred to operate on a second-long timer, providing finer timing control.

Reasoning:

  1. This PR is a prerequisite for Introduce Retry InvoiceRequest Flow #3010, which adds the ability to retry the InvoiceRequest message within a finite time if we have not received the corresponding BOLT12 Invoice back.
  2. Those retries need to fire faster than a standard minute-long timer allows.
  3. The discussion was between introducing a new counter vs. changing the timer_tick duration (see comment). This PR is an attempt at the second approach.

…ogress

1. This function implicitly assumed expiry after one timer_tick_occurred call.
2. Introduce an explicit timer tick counter so that the TIMER_LIMIT can be set
   to an arbitrary number of ticks (a sketch of the pattern follows below).
3. This addition is utilized in the following commit.
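
For illustration, a minimal sketch of the pattern this commit describes: replacing an implicit one-tick expiry with an explicit counter compared against a configurable limit. All names here are hypothetical, not the actual LDK internals.

```rust
/// Hypothetical limit; with an explicit counter it can be raised to any number of ticks.
const TIMER_LIMIT: u8 = 1;

/// Stand-in for an entry that previously expired on its first timer tick.
struct PendingEntry {
    ticks_elapsed: u8,
}

impl PendingEntry {
    /// Called from each `timer_tick_occurred`; returns true once the entry expires.
    fn tick(&mut self) -> bool {
        self.ticks_elapsed = self.ticks_elapsed.saturating_add(1);
        self.ticks_elapsed >= TIMER_LIMIT
    }
}
```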
@codecov-commenter commented May 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 90.03%. Comparing base (1890e80) to head (9a344b6).
Report is 11 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3065      +/-   ##
==========================================
+ Coverage   89.82%   90.03%   +0.20%     
==========================================
  Files         116      116              
  Lines       96466    98074    +1608     
  Branches    96466    98074    +1608     
==========================================
+ Hits        86655    88305    +1650     
+ Misses       7264     7244      -20     
+ Partials     2547     2525      -22     


@jkczyz (Contributor) commented May 14, 2024

> This enhancement enhances the system's flexibility and accuracy in timing operations.

This doesn't really say anything. Please clearly state the practical reason for the change.

@@ -2260,15 +2260,15 @@ const CHECK_CLTV_EXPIRY_SANITY: u32 = MIN_CLTV_EXPIRY_DELTA as u32 - LATENCY_GRA
const CHECK_CLTV_EXPIRY_SANITY_2: u32 = MIN_CLTV_EXPIRY_DELTA as u32 - LATENCY_GRACE_PERIOD_BLOCKS - 2*CLTV_CLAIM_BUFFER;

/// The number of ticks of [`ChannelManager::timer_tick_occurred`] until expiry of incomplete MPPs
-pub(crate) const MPP_TIMEOUT_TICKS: u8 = 3;
+pub(crate) const MPP_TIMEOUT_TICKS: u8 = 3 * 60;
Collaborator

Let's define a constant for the ticks per minute or whatever so that we can change it in the future.
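
For illustration, a minimal sketch of the suggested constant, assuming one tick per second; the exact name and placement are up to the PR.

```rust
/// Number of `timer_tick_occurred` calls per minute, assuming one call per second.
pub(crate) const TICKS_PER_MINUTE: u8 = 60;

/// The number of ticks until expiry of incomplete MPPs, i.e. three minutes' worth.
pub(crate) const MPP_TIMEOUT_TICKS: u8 = 3 * TICKS_PER_MINUTE;
```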

@wpaulino (Contributor)

What functionality do we plan to add that will need sub-minute ticks? Also, how do we ensure users migrate to the new duration on their end? This is basically a "silent" breaking change to the API.

@TheBlueMatt (Collaborator)

The motivation here is to retry invoice_requests. For non-mobile nodes, in a world where we can route onion messages, it's probably not useful, since we can send the request along many paths and retrying adds little. On mobile it's kinda useful, especially before we can route, where we may have lost connectivity and just need to retry a second later. Sadly, there's not really a great way to do this sans-std.

@TheBlueMatt (Collaborator)

The alternative options here are basically (a) do it during message handling using std to detect time passing, (b) do it during message handling without time passing, (c) this (maybe with renaming the method to break the API). I'm a bit split between (b) and (c), honestly.

@wpaulino (Contributor)

I see, (b) is probably still not reliable enough for our needs though if we're not receiving messages (pings don't make it through the ChannelManager)? I guess (c) is fine, but have we explored doing this in OnionMessageHandler::release_pending_messages instead?

1. Previously, timer_tick_occurred was designed to be called every minute, so the TIMER_LIMITS were effectively denominated in minutes.
2. However, this made it impossible to express timer limits at sub-minute granularity.
3. This commit updates timer_tick_occurred to be called every second instead of every minute (a sketch of the caller-side change follows below).
4. All TIMER_LIMITS are scaled accordingly to reflect this change.
5. Additionally, a test is updated to ensure successful compilation post-update.
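
Under this commit, the main change for callers is the shorter tick interval. A sketch of a driver loop under that assumption, using tokio, with `tick` standing in for `ChannelManager::timer_tick_occurred`; this is not the background processor's actual code.

```rust
use std::time::Duration;

/// Hypothetical driver: after this commit, the only change for callers is the
/// tick interval, which drops from one minute to one second.
async fn drive_timer(tick: impl Fn()) {
    let mut interval = tokio::time::interval(Duration::from_secs(1)); // previously 60
    loop {
        interval.tick().await;
        tick(); // e.g. channel_manager.timer_tick_occurred()
    }
}
```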
@shaavan (Contributor, Author) commented May 15, 2024

Updated from pr3065.01 to pr3065.02 (diff):
Addressed @TheBlueMatt's comment:

Changes:

  1. Introduce a new constant TICKS_PER_MINUTE and use it as the multiplying factor for the other constants. This keeps the multiplying factor in one place and makes maintenance easier if we change it later.

@TheBlueMatt
I think renaming the function to cause an explicit API failure would be a good idea. However, I haven't implemented it yet; I'd love to confirm it's the right move before I proceed with the name change.
Thanks for the discussion, suggestions, and pointers, everyone.

@jkczyz (Contributor) commented May 15, 2024

> I guess (c) is fine, but have we explored doing this in OnionMessageHandler::release_pending_messages instead?

Hmm... that may work, as ChannelManager can query its PendingOutboundPayments for anything still awaiting an invoice that hasn't timed out. Though when initially sending the message, it will already be in the AwaitingInvoice state. Maybe we only retry if release_pending_messages would return an empty Vec, as a simple way of not retrying before the initial request is sent? Even so, would sufficient time pass between calls to next_onion_message_for_peer before retrying?
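
A rough sketch of the "only retry once the queue has drained" idea being floated here; none of these names are the actual LDK API, it only illustrates the gating.

```rust
use std::sync::Mutex;

/// Hypothetical queue standing in for the pending offers messages.
struct OffersQueue {
    pending_messages: Mutex<Vec<String>>,
}

impl OffersQueue {
    fn release_pending_messages(&self) -> Vec<String> {
        let mut pending = self.pending_messages.lock().unwrap();
        if pending.is_empty() {
            // The initial sends have already drained; anything still awaiting
            // a BOLT12 invoice could be re-enqueued here, subject to rate
            // limiting, e.g.:
            // pending.extend(self.requests_still_awaiting_invoice());
        }
        std::mem::take(&mut *pending)
    }
}
```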

@wpaulino (Contributor)

It gets called in PeerManager::do_attempt_write_data, which can happen every time a ping needs to be sent out (currently every 10 secs via PeerManager::timer_tick_occurred), and on every call to PeerManager::process_events and PeerManager::write_buffer_space_avail. So, it may actually get called too often, but some simple rate limiting per retry request in release_pending_messages would help.

@jkczyz (Contributor) commented May 16, 2024

> It gets called in PeerManager::do_attempt_write_data, which can happen every time a ping needs to be sent out (currently every 10 secs via PeerManager::timer_tick_occurred), and on every call to PeerManager::process_events and PeerManager::write_buffer_space_avail. So, it may actually get called too often, but some simple rate limiting per retry request in release_pending_messages would help.

Yeah... though I'm not sure how we'd do that rate limiting. Would an exponential backoff be fine? Also note that OffersMessageHandler::release_pending_messages is called each time next_onion_message_for_peer is called. So if you have a lot of peers, it will be called more frequently than if you have few. I guess ChannelManager knows how many peers it has, so it could take that into account?

We also may want to consider using different reply paths on retries (and maybe the initial sends for that matter) in case the reply path used is the issue.
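
A self-contained sketch of the kind of per-request exponential backoff raised above; all names are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Tracks when a pending invoice request may next be retried, doubling the
/// wait after each attempt (simple exponential backoff).
struct RetryState {
    next_retry: Instant,
    backoff: Duration,
}

impl RetryState {
    fn new(initial: Duration) -> Self {
        Self { next_retry: Instant::now() + initial, backoff: initial }
    }

    /// Returns true if enough time has passed to retry, and pushes the next
    /// attempt twice as far out.
    fn should_retry(&mut self, now: Instant) -> bool {
        if now < self.next_retry {
            return false;
        }
        self.backoff *= 2;
        self.next_retry = now + self.backoff;
        true
    }
}
```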

@wpaulino (Contributor)

Another alternative could be to extend the onion message handler to signal, when we read a message from the peer, that retries should be queued.

@TheBlueMatt (Collaborator)

ISTM if we want to avoid this we'll want a new method on ChannelMessageHandler indicating that we've received any message from any peer for any handler, which can then trigger the retry here. This avoids any kind of weird API nonsense where the OnionMessenger receives the same signal and then just passes that info to the ChannelManager via the message fetching (to trigger a retry). This would need to be handled delicately to avoid a performance impact; specifically, we'll want to avoid a Mutex here and gate the message retrying on a relaxed AtomicBool read.

The other option, which @wpaulino suggested offline, is to hook deeply into the PeerManager, provide it some kind of flag saying "hey, I'm sending an important message here to peer A, please send A a ping immediately on the same socket and let me know if they responded, indicating they received the message because they responded to a later one", then use that indicator to gate the retry. I'm not sure it's worth that level of complexity, though.
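
A minimal sketch of the relaxed-AtomicBool gating described above, with hypothetical names; a real implementation would keep the flag on the ChannelManager rather than in a static.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Set from the receive path of every message handler; cheap enough to do per
/// message because it is a single relaxed store, with no Mutex involved.
static MESSAGE_RECEIVED: AtomicBool = AtomicBool::new(false);

fn on_any_peer_message() {
    MESSAGE_RECEIVED.store(true, Ordering::Relaxed);
}

/// Called from the retry path: only consider retrying invoice requests if we
/// have heard from some peer since the last check, and clear the flag.
fn should_consider_retry() -> bool {
    MESSAGE_RECEIVED.swap(false, Ordering::Relaxed)
}
```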

@shaavan (Contributor, Author) commented May 24, 2024

Hi @TheBlueMatt,
I’ve been diving into the approach you suggested and I have a few questions I could use your guidance on:

  1. When retrying, should we go through all the PaymentPending messages, or should we prioritize which ones to retry first?
  2. How can we make sure we’re not retrying too frequently?
  3. This method seems great for situations where we switch from offline to online (like coming out of the subway), but do you think it’s as effective for other scenarios, like when the receiver is the one going offline?

Thanks a lot for your guidance!

@TheBlueMatt (Collaborator)

> When retrying, should we go through all the PaymentPending messages, or should we prioritize which ones to retry first?

I don't see why we wouldn't use the normal flow - it tends to get drained very quickly.

> How can we make sure we’re not retrying too frequently?

We should probably only retry once.

> This method seems great for situations where we switch from offline to online (like coming out of the subway), but do you think it’s as effective for other scenarios, like when the receiver is the one going offline?

Indeed, it won't help in those cases; however, I'm not entirely clear on what we can do for them. If the recipient is often offline, they probably shouldn't be using BOLT12 to receive (pre-async-payments, or unless they have a fairly robust notification/store-and-forward mechanism), and if they are at least somewhat robust, trying again one second later seems unlikely to help any more than trying again via a separate path or some small time delta later.
