on Reliability - x-port from Discord #3585
EthnTuttle started this conversation in General
"I just wanted to share some thoughts on reliability and on running Fedimint on Raspberry Pis as they relate to Aleph BFT. I'm not sure this is the right place for it, since we usually just post questions and issues here; this post is purely informational. The topic came up briefly in Thursday's dev call and did not seem to be common knowledge. Please note that I have not tested the following capabilities myself yet, but they should work, and they were explicit design goals of mine for Aleph BFT.
Before Aleph BFT, a crash of too many guardians at once could leave the federation unable to start up again automatically. Because of this, HBBFT required a mechanism for a coordinated shutdown of the federation after a certain epoch. Aleph BFT does not require this: any number of guardians may crash at any time, and the system should recover automatically as soon as enough guardians come back online, at least in theory.
Furthermore, a guardian's config is all it needs to recover its state from its peers if its database gets corrupted completely. There are no extra code paths for this and no actions the guardian has to take. The only requirement is that no more than f guardians (the maximum number of faulty guardians the federation tolerates) corrupt their database in the same session and also lose their Aleph BFT backup of that session; otherwise, the session may get stuck.
In addition, only valid transactions are saved to disk, each exactly once, and we now have client-side ecash aggregation, which should cut disk space and compute requirements by an order of magnitude or more. As a consequence of these changes, I think it is no longer reckless to run the system on unreliable hardware like Raspberry Pis, at least from the consensus's point of view. A federation with a small number of guardians, like 4 to 6, can only tolerate the loss of one guardian, so guardians may be well advised to run two nodes each on independent hardware, allowing for the hardware failure of two to three nodes.
Though paranoia is a good quality for engineers working in the bitcoin space, I just wanted to make sure the advantages of non-malicious fault tolerance are not lost on us because of it. In many situations, tolerating non-malicious failures may be just as important as full Byzantine fault tolerance against malicious nodes, since far more bitcoin has probably been lost than stolen over the years.
Cashu, in contrast, requires the custodian to run a Lightning node, which constitutes a single point of failure and therefore requires much more reliable hardware. Consequently, Cashu is aimed primarily at commercial users providing payment services rather than at custody of large amounts, such as the bitcoin of your whole family and friends."
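The guardian counts in the post follow from the usual BFT bound n >= 3f + 1, under which a federation of n nodes tolerates f = floor((n - 1) / 3) faulty nodes. The sketch below works through that arithmetic. The function names are hypothetical, the liveness check assumes the common quorum of n - f online nodes, and none of this is taken from the Fedimint or Aleph BFT code bases.

```python
def max_faulty(n: int) -> int:
    """Largest f with n >= 3f + 1 (assumed BFT bound): faults tolerated by n nodes."""
    return (n - 1) // 3


def can_make_progress(n: int, online: int) -> bool:
    """Assumption: consensus proceeds once at least n - f nodes are back online."""
    return online >= n - max_faulty(n)


def session_recoverable(n: int, lost_db_and_backup: int) -> bool:
    """Per the post: a session may get stuck if more than f guardians lose both
    their database and their Aleph BFT backup of that same session."""
    return lost_db_and_backup <= max_faulty(n)


# Compare one node per guardian against two nodes per guardian.
for guardians in range(4, 7):
    print(guardians, max_faulty(guardians), max_faulty(2 * guardians))
```

Running the loop prints `4 1 2`, `5 1 3` and `6 1 3`: each of 4 to 6 single-node guardians tolerates only one failure, while doubling to 8 to 12 nodes tolerates two or three hardware failures, matching the figures above.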