
VSS Integration Tracking #246

Open
tnull opened this issue Feb 7, 2024 · 2 comments

tnull commented Feb 7, 2024

Most of the work is already merged, here we add some tracking for the remaining tasks:

  • Expose the capability to set up with VssStore in bindings (WIP: Expose VssStore in bindings #245).
  • Figure out how to handle the remaining 'panic-on-persistence-failure' cases, especially where we can't return an error, such as in the EventHandler. This might be a tough one.
    • Short term, we could consider just logging persistence failures when VssStore is configured, but it's probably safer to keep panicking. In any case, we need to add a huge warning that VSS is considered experimental and that either data might be lost at any time, or that LDK Node / the app using it might crash at any time if the connection is lost.
    • Mid- to longer term we need to switch to a primary/secondary storage model where we also persist to a local on-device database from which we can also handle restarts even if remote persistence fails. IMO this should happen ASAP, as neither data loss nor crashing is acceptable UX.
  • Integrate a simple-to-use auth method.
    • Possibly also expose a custom auth method via the AuthMethod trait, but this is already an advanced feature, as it requires users to understand the interface and write matching code instead of just providing some credentials.
  • Remove the cfg-gate.

(cc @G8XSU)
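To illustrate the short-term trade-off from the second bullet (panicking vs. merely logging on a failed remote write), here is a minimal Rust sketch. The `KvStore` trait and all names here are simplified stand-ins for illustration only, not LDK Node's actual `KVStore` interface:

```rust
use std::io;

// Hypothetical, simplified stand-in for a key-value store trait; the real
// interface has more parameters (namespaces, etc.).
trait KvStore {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()>;
}

/// On a failed remote write we can either panic (safe, but crashes the app)
/// or merely log (the app keeps running, but the update may be lost).
struct RemoteStoreGuard<S: KvStore> {
    inner: S,
    panic_on_failure: bool,
}

impl<S: KvStore> RemoteStoreGuard<S> {
    /// Returns `true` if the update was persisted, `false` if it was dropped.
    fn persist(&self, key: &str, value: &[u8]) -> bool {
        match self.inner.write(key, value) {
            Ok(()) => true,
            Err(e) if self.panic_on_failure => {
                panic!("Failed to persist {}: {}", key, e)
            }
            Err(e) => {
                // "Log-only" mode: the node keeps running, but this update is lost.
                eprintln!("WARNING: dropping update for {}: {}", key, e);
                false
            }
        }
    }
}

// Mock stores used for illustration.
struct AlwaysFails;
impl KvStore for AlwaysFails {
    fn write(&self, _key: &str, _value: &[u8]) -> io::Result<()> {
        Err(io::Error::new(io::ErrorKind::ConnectionReset, "remote unreachable"))
    }
}

struct AlwaysOk;
impl KvStore for AlwaysOk {
    fn write(&self, _key: &str, _value: &[u8]) -> io::Result<()> {
        Ok(())
    }
}
```

With `panic_on_failure: false` a lost connection only produces a log line, which is exactly the silent-data-loss risk the warning above is about.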

@tnull tnull added this to the 0.3 milestone Feb 7, 2024
@tnull tnull mentioned this issue Feb 7, 2024
@tnull tnull modified the milestones: 0.3, 0.4 Feb 7, 2024

G8XSU commented May 15, 2024

> event-handling-panic-persistence-failure

event-handling-panic-persistence-failure could be treated the same as a persistence failure anywhere else, where we currently panic and restart. I agree we need to fix this, but in any case this holds true for any remote persistence that users can implement for ldk/ldk-node afaiu, so it shouldn't block vss integration and can be tracked separately.

> Mid- to longer term we need to switch to a primary/secondary storage model

There might be some misunderstanding about how the primary/secondary storage model will work.
Even with primary/secondary, we will always store to primary storage first and, only if that succeeds, store to secondary.
This will not help with transient connection failures to the primary; it is only useful for vss-phase-2/multi-device.
If we do it the other way around (secondary-before-primary), we compromise on the promise of live backups and cannot ensure safe multi-device usage.
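The primary-first ordering described above could be sketched roughly as follows. The store trait and names are hypothetical simplifications, not LDK Node's actual API:

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::io;

// Hypothetical simplified store trait; illustrative only.
trait KvStore {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()>;
}

struct PrimarySecondaryStore<P: KvStore, S: KvStore> {
    primary: P,   // e.g., the remote VSS instance
    secondary: S, // e.g., a local on-device database
}

impl<P: KvStore, S: KvStore> KvStore for PrimarySecondaryStore<P, S> {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()> {
        // Primary first: if the remote write fails we surface the error and
        // do NOT fall back, preserving the live-backup/multi-device guarantees.
        self.primary.write(key, value)?;
        // Best-effort mirror to the local secondary; a failure here is logged
        // rather than propagated, since the authoritative copy already landed.
        if let Err(e) = self.secondary.write(key, value) {
            eprintln!("WARNING: secondary write for {} failed: {}", key, e);
        }
        Ok(())
    }
}

// In-memory store used for illustration.
#[derive(Default)]
struct MemStore(RefCell<HashMap<String, Vec<u8>>>);
impl KvStore for MemStore {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()> {
        self.0.borrow_mut().insert(key.to_string(), value.to_vec());
        Ok(())
    }
}

struct Unreachable;
impl KvStore for Unreachable {
    fn write(&self, _k: &str, _v: &[u8]) -> io::Result<()> {
        Err(io::Error::new(io::ErrorKind::TimedOut, "primary unreachable"))
    }
}
```

Note how a transient failure of the primary still fails the whole write, which is exactly why this ordering doesn't help with flaky connections.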

> need to add some huge warning that VSS is considered experimental and that either data might be lost at any time or that LDK Node / the app using it might crash at any time if the connection is lost.

IIUC, we will replay the events if there was a panic during event handling, so there shouldn't be any data loss. But yes, the node will crash on persistence failure; this is true for channel-manager and channel-monitor persistence as well.
Retries in persistence should address some of the pain, but can we move vss-alpha forward in the current state of ldk-node/ldk and address the fallible events in parallel?
If vss-integration isn't there and users want a remote live backup, they are forced to roll out their own KVStore, which sadly will still be plagued by this same issue.


tnull commented May 16, 2024

> event-handling-panic-persistence-failure could be treated the same as persistence failure during any other persistence, where we currently panic and restart.

For one, we generally try to avoid it and return an error wherever possible. But more importantly, persistence failure is currently ~never expected (essentially only on hardware failure), while with remote storage it will be very common, especially on mobile.

> I agree we need to fix this, but in any case, this holds true for any remote persistence that can be implemented by user for ldk/ldk-node afaiu, so this shouldn't block vss integration, and can be tracked separately.

I disagree. We need to find a solution for this before we deem the VSS integration 'stable' / fully expose it in the stable API (and hence in bindings, as UniFFI doesn't support cfg flags or similar). We can't just treat this as an independent issue.

> There might be some misunderstanding on how primary/secondary storage model will work. Even in primary/secondary, we will always first store to primary storage and if it is successful, we can store it in secondary. This will not help in case of transient connection failures to primary, it is only useful for vss-phase-2/multi-device. If we choose to do it the other way around (secondary-before-primary), we compromise on the promise of live backups and cannot ensure safe multi-device usage.

This might be true for the channel monitor updates, where proceeding without being sure an update was persisted is crucial. For basically everything else it would be highly beneficial to write to a local cache, proceed, and try to sync the cache with the remote server, restarting/resetting in case of any conflicts that can't be resolved. I'm indeed a bit dubious about the benefit of the secondary local storage if we can't ever use it in the first place. I think we've discussed this many times before, and I still keep my hopes up that we'd eventually get a better solution for it. But granted, it won't be solved by the VSS primary/secondary model as currently planned (and as required for monitor persistence).
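For non-critical data, the cache-first approach described here — write locally, proceed immediately, sync opportunistically — could look roughly like this (all names hypothetical, not any actual LDK Node API):

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::io;

// Hypothetical remote store; illustrative only.
trait RemoteStore {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()>;
}

#[derive(Default)]
struct WriteBackCache {
    local: RefCell<HashMap<String, Vec<u8>>>,
    dirty: RefCell<Vec<String>>, // keys not yet synced to the remote
}

impl WriteBackCache {
    // Always succeeds locally; the caller proceeds without waiting on the network.
    fn write_local(&self, key: &str, value: &[u8]) {
        self.local.borrow_mut().insert(key.to_string(), value.to_vec());
        self.dirty.borrow_mut().push(key.to_string());
    }

    // Periodically attempt to flush dirty entries; on failure they simply
    // remain queued for the next attempt. Returns the number synced.
    fn sync<R: RemoteStore>(&self, remote: &R) -> usize {
        let pending = self.dirty.borrow_mut().split_off(0);
        let mut remaining = Vec::new();
        let mut synced = 0;
        for key in pending {
            let value = self.local.borrow().get(&key).cloned();
            if let Some(v) = value {
                match remote.write(&key, &v) {
                    Ok(()) => synced += 1,
                    Err(_) => remaining.push(key),
                }
            }
        }
        self.dirty.borrow_mut().extend(remaining);
        synced
    }
}

// In-memory remote used for illustration.
#[derive(Default)]
struct MemRemote(RefCell<HashMap<String, Vec<u8>>>);
impl RemoteStore for MemRemote {
    fn write(&self, key: &str, value: &[u8]) -> io::Result<()> {
        self.0.borrow_mut().insert(key.to_string(), value.to_vec());
        Ok(())
    }
}
```

This is precisely the model that is unsafe for channel monitor updates (the remote copy lags the local one) but fine for data where a stale remote copy only costs a bit of state on restore.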

> IIUC, we will replay the events if there was a panic during event handling, so there shouldn't be any data loss. but yes the node will crash if there is persistence failure, this is true for channel-manager and channel-monitor persistence as well. Retries in persistence should address some of the pain, but can we move forward vss-alpha in current state of ldk-node/ldk, and we can address the fallible events in parallel?

Well, yeah, I started working on fallible event handling in lightningdevkit/rust-lightning#2995 and intend to get it into the next LDK release. But as mentioned above, I don't think we can deem VSS stable / fully expose it as long as we just crash and burn on persistence failure.
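The general direction of fallible event handling — a handler returning an error so the event stays queued and is replayed later, instead of panicking — could be illustrated like this (purely hypothetical types and names, not the actual rust-lightning API):

```rust
use std::collections::VecDeque;

// Hypothetical event type; illustrative only.
#[derive(Clone, Debug, PartialEq)]
enum Event {
    PaymentReceived { amount_msat: u64 },
}

// A fallible handler returns Err to signal "replay me later".
struct EventQueue {
    pending: VecDeque<Event>,
}

impl EventQueue {
    fn process<F>(&mut self, mut handler: F)
    where
        F: FnMut(&Event) -> Result<(), ()>,
    {
        let mut replay = VecDeque::new();
        while let Some(event) = self.pending.pop_front() {
            if handler(&event).is_err() {
                // E.g. persisting the handling result failed: keep the event
                // queued so it is replayed on the next processing run.
                replay.push_back(event);
            }
        }
        self.pending = replay;
    }
}
```

The key property is that a transient remote-persistence failure degrades to a delayed event rather than a process abort.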

> If vss-integration is not there, and user wants to use remote live backup then they are forced to rollout their own kvStore, which sadly will still be plagued by this same issue.

The keyword here is live. I assume that, at least for many of our users, a regular remote backup of the SQLite store (taken while the node is stopped and no payments are in flight) might be preferable and safer than exposing their users to arbitrary crashes, especially in environments where Lightning integration / LDK Node is only one of the things going on. Imagine a social media or gaming app just arbitrarily crashing because we disconnected from the internet for a second: unacceptable.
