[BUG] Harvester/Disconnects from Peers during Plot copy onto harvester via Samba #8094

Closed
whitetechnologies opened this issue Aug 16, 2021 · 9 comments
Labels
bug: Something isn't working
stale-issue: flagged as stale and will be closed in 7 days if not updated

Comments

@whitetechnologies

whitetechnologies commented Aug 16, 2021

Describe the bug
It is a normal use case for plots to be made on one machine and then copied to another machine to be farmed. Various methods for this exist. In most cases, it involves mounting the farmer/harvester's plot directory on the plotting machine. This can be accomplished via SSHFS (where the plotter is Linux and the farmer/harvester is Linux) or Samba (where the plotter is Windows and the farmer/harvester is Linux). When a plot is being copied to the farmer/harvester, it may be detected (since the filename is set immediately) but won't be considered a valid plot until it's fully copied. If the farmer/harvester tries to read the file, it will be an invalid plot and should simply be ignored.
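
One common way to avoid the "filename is set immediately" problem is to copy under a temporary name and rename only once the copy has finished. The sketch below illustrates that idea; it is not taken from Chia. `copy_then_rename` and the `.tmp` suffix are hypothetical names, and it assumes the harvester only picks up files ending in `.plot` and that the destination is a mounted share where a rename within the same directory is cheap.

```python
import os
import shutil
from pathlib import Path


def copy_then_rename(src: Path, dest_dir: Path, tmp_suffix: str = ".tmp") -> Path:
    """Copy src into dest_dir under a temporary name, then rename it.

    While the copy is in progress the file is called e.g. "foo.plot.tmp",
    so a scanner that only looks for "*.plot" never sees a partial file.
    """
    final_path = dest_dir / src.name
    tmp_path = dest_dir / (src.name + tmp_suffix)

    # Slow part: stream the whole file to the temporary name.
    shutil.copyfile(src, tmp_path)

    # Fast part: rename within the same directory once the data is complete.
    os.replace(tmp_path, final_path)
    return final_path
```

Usage would look like `copy_then_rename(Path("D:/plots/example.plot"), Path("Z:/farm"))`, where both paths are placeholders.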

It appears that when a plot is copied over SSHFS, the copy process locks the plot from being read. Therefore, the farmer/harvester just skips over that file (it doesn't even try to read it). That is of course fine. On the other hand, when a plot is copied over Samba, the file is write-locked but not read-locked (this is normal Samba/Windows behavior). This allows the farmer/harvester to try to read the plot file and detect it as invalid.

RESULT: When copying plots via Samba to a farmer/harvester, the farmer/harvester drops connections to peers and loses sync. It eventually returns to normal after the plot copy is done. However, this is a BIG issue for pools, because during the disconnect the pool cannot receive partials or scale up difficulty, so it will not properly credit points to the farmer.

While this could initially be dismissed as an "edge case" relating only to Samba, the issue appears to be the harvester crashing/losing sync because it finds an invalid plot (i.e., the incompletely copied plot). This means that potentially any invalid plot (not just one being copied over via Samba) can cause this crash, thereby affecting a large number of users.

To Reproduce

Steps to reproduce the behavior:

  1. Set up a Linux machine with a Samba share that contains plot files
  2. Run the Chia client (chia start farmer) and observe that it connects normally and stays in sync
  3. Copy a plot from a Windows machine to the Samba share while the Chia client is running
  4. Observe the farmer immediately lose sync and become unable to connect to any other peers, gradually dropping them from the peers table
  5. The plot copy from the Windows machine to the Linux farmer finishes.
  6. The Linux farmer returns to normal, gradually reconnects to peers and gets back into sync. However, because partials were not submitted during the copy period, the pool does not credit points and scales back difficulty. In real-life testing, someone with, say, 1000 plots in a pool loses credit for about half of them during the copy phase, and pool difficulty then does not scale back up quickly enough to credit the full 1000 plots for 4+ hours.

Expected behavior
Copying a plot file to a farmer should not cause disconnects or sync issues. The farmer should just skip over the incomplete file until the copy is done.
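
To illustrate what "skip over the incomplete file" could look like, here is a minimal sketch that ignores any plot file modified within the last minute, on the assumption that a file still being copied keeps getting a fresh modification time. This is not how the Chia harvester works, as far as this report shows; `is_probably_complete`, `candidate_plots` and the 60-second threshold are hypothetical.

```python
import os
import time
from pathlib import Path
from typing import List, Optional


def is_probably_complete(
    path: Path, min_age_seconds: float = 60.0, now: Optional[float] = None
) -> bool:
    """Heuristic: treat a file as complete if it is non-empty and has not
    been modified for at least min_age_seconds."""
    try:
        stat = path.stat()
    except OSError:
        # Missing or unreadable file: not a usable plot right now.
        return False
    if stat.st_size == 0:
        return False
    current = time.time() if now is None else now
    return (current - stat.st_mtime) >= min_age_seconds


def candidate_plots(plot_dir: Path, min_age_seconds: float = 60.0) -> List[Path]:
    """Return the *.plot files in plot_dir that look fully copied."""
    return sorted(
        p for p in plot_dir.glob("*.plot") if is_probably_complete(p, min_age_seconds)
    )
```

A file that fails the check is simply left out of the current scan and picked up on a later one, once its modification time has stopped changing.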

POSSIBLE FIX is in PR #8009

SW Versions
Ubuntu 20.04
Windows 10 Pro
CHIA VERSION 1.2.3 (not tested on previous version)
Samba version latest (from Linux repo)

@whitetechnologies whitetechnologies added the bug Something isn't working label Aug 16, 2021
@github-actions
Contributor

This issue has been flagged as stale as there has been no activity on it in 14 days. If this issue is still affecting you and in need of review, please update it to keep it open.

@github-actions github-actions bot added the stale-issue flagged as stale and will be closed in 7 days if not updated label Aug 31, 2021
@ESP4Ever

ESP4Ever commented Sep 4, 2021

Hello

I'm experiencing the same issue when copying plots via Samba to/from a full node machine: stale partials appear, then the node goes out of sync
System: Synology DS1819+ 32G RAM
Dockerized Ubuntu 20 LTS + Chia v1.2.5 (on v 1.2.3 it was even worse)

@github-actions github-actions bot removed the stale-issue flagged as stale and will be closed in 7 days if not updated label Sep 4, 2021
@whitetechnologies
Author

whitetechnologies commented Sep 6, 2021

> Hello
>
> I'm experiencing the same issue when copying plots via Samba to/from a full node machine: stale partials appear, then the node goes out of sync
> System: Synology DS1819+ 32G RAM
> Dockerized Ubuntu 20 LTS + Chia v1.2.5 (on v 1.2.3 it was even worse)

Yes, it's definitely a Samba/Chia issue. Have you tried posting on the Samba forums? I'm not sure they can help much in any case, because they won't know Chia. Samba is known to work pretty well, so this is really a Chia issue and the Chia team should look at fixing it. It doesn't seem too complex: it's almost certainly the reason I outlined above (the farmer getting stuck on an incompletely copied plot due to locking issues), which should be fixable on the Chia side.

@jack60612
Contributor

> > Hello
> > I'm experiencing the same issue when copying plots via Samba to/from a full node machine: stale partials appear, then the node goes out of sync
> > System: Synology DS1819+ 32G RAM
> > Dockerized Ubuntu 20 LTS + Chia v1.2.5 (on v 1.2.3 it was even worse)
>
> Yes, it's definitely a Samba/Chia issue. Have you tried posting on the Samba forums? I'm not sure they can help much in any case, because they won't know Chia. Samba is known to work pretty well, so this is really a Chia issue and the Chia team should look at fixing it. It doesn't seem too complex: it's almost certainly the reason I outlined above (the farmer getting stuck on an incompletely copied plot due to locking issues), which should be fixable on the Chia side.

PR #8004 was merged. Did it fix it?

@emlowe
Contributor

emlowe commented Sep 8, 2021

It's unlikely there is any fix possible on the Chia side. Your node requires a stable internet connection to peers in order to respond to signage points and other network events. While you are copying your plots over the network, you are saturating the network on your node, which is then likely timing out on its peer connections and can no longer respond properly to network events. You may be able to throttle your network bandwidth with some settings on the Linux side of Samba. My recommendation would be not to saturate the network connection to your node.
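
For anyone who wants to test the saturation theory from the plotter side rather than through Samba settings, a rate-limited copy is one way to do it. This is only a rough sketch: `throttled_copy` and its parameters are hypothetical, and it caps the average throughput of a single copy rather than shaping traffic at the network level.

```python
import time
from typing import BinaryIO


def throttled_copy(
    src: BinaryIO,
    dst: BinaryIO,
    max_bytes_per_second: int,
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst, sleeping as needed so the average rate stays
    at or below max_bytes_per_second. Returns the number of bytes copied."""
    if max_bytes_per_second <= 0:
        raise ValueError("max_bytes_per_second must be positive")

    copied = 0
    start = time.monotonic()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)

        # How long the copy *should* have taken so far at the target rate.
        expected_elapsed = copied / max_bytes_per_second
        actual_elapsed = time.monotonic() - start
        if expected_elapsed > actual_elapsed:
            time.sleep(expected_elapsed - actual_elapsed)
    return copied
```

It would be called with two files opened in binary mode, for example a local plot opened with `"rb"` and a destination on the mounted share opened with `"wb"`. If the disconnects disappear at a lower rate, that points toward saturation; if they persist, it points away from it.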

@whitetechnologies
Author

> It's unlikely there is any fix possible on the Chia side. Your node requires a stable internet connection to peers in order to respond to signage points and other network events. While you are copying your plots over the network, you are saturating the network on your node, which is then likely timing out on its peer connections and can no longer respond properly to network events. You may be able to throttle your network bandwidth with some settings on the Linux side of Samba. My recommendation would be not to saturate the network connection to your node.

Sorry, but you are wrong. That was my initial thought too, but this IS a Chia issue. Network saturation has nothing to do with it. When using SSHFS to copy a plot, there are zero errors. The copy time is the same as with Samba. Ergo, the network load of an SSHFS copy equals that of a Samba copy, so if an SSHFS copy is not enough to "saturate" the network, then neither is a Samba copy. After a lot of research it's clear that it's a file lock issue. SSHFS locks files differently than Samba. A file locked by Samba causes Chia to crash. I've also tried changing Samba's file locking options, but no tweaks work. I am sure the Samba devs are not going to change their code to get one program (Chia) to work.

On the other hand, the Chia harvester should play nicely with all accepted standards on Linux/Windows systems, which includes Samba. It's also easy to imagine that if the harvester crashes due to this file lock from Samba, it will do so in other cases as well.

If there is something I missed and, despite equal file copy speed, Samba saturates the network more than SSHFS, then I'm open to reviewing it.

@emlowe
Contributor

emlowe commented Sep 8, 2021

The initial report says:

> Observe farmer immediately lose sync and become unable to connect to any other peers, gradually dropping them from the peers table
> Plot copy finishes from windows machine to Linux farmer.
> Linux farmer returns to normal and gradually reconnects to peers and gets back into sync

This is not a crash. It strongly suggests the issue is network-related: when the copying is finished, the node goes back to normal.

@github-actions
Contributor

This issue has been flagged as stale as there has been no activity on it in 14 days. If this issue is still affecting you and in need of review, please update it to keep it open.

@github-actions github-actions bot added the stale-issue flagged as stale and will be closed in 7 days if not updated label Sep 23, 2021
@github-actions
Contributor

github-actions bot commented Oct 1, 2021

This issue was automatically closed because it has been flagged as stale and subsequently passed 7 days with no further activity.

@github-actions github-actions bot closed this as completed Oct 1, 2021