0.16.0-rc1: AcceleratedDHTClient possible issue? #9309
Comments
I am not able to reproduce this issue.
I decided to do a test. I ran multiple versions of the binary one after the other on the same computer, same connection, same .ipfs folder. Here are the results:
As you can see, the CPU spikes much higher on 0.16 and there's definitely a problem with discovery compared to the other two. I'd say this problem is reproducible. Now, the key is to reproduce it on a different machine.
@ylempereur this is really interesting, thx a lot.
@ylempereur can you please attach reproduction instructions and profile results?
@ylempereur can you try running
@Jorropo I ran it for 60 secs right when it was over 500%. The file is too big for posting here (won't let me), so I posted it on my site at: |
@julian88110 As soon as I have one, I’ll post it here, but right now, I’m still in the dark as to what’s causing it.
@ylempereur I see you are running on macOS, and there is very high (Go) scheduler load in your profile. @marten-seemann I know you recently worked on improving QUIC performance but had issues on macOS; do you think this could be related?
There's nothing in the v0.29.0 release that changed any of the run-loop logic in quic-go: https://github.com/lucas-clemente/quic-go/releases/tag/v0.29.0. Looking at the profile, it looks like we're resetting the timer really frequently:
Whether or not that's a bug likely depends on how many QUIC connections you're handling at the same time.
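To make the timer observation concrete, here is a minimal, hypothetical Go sketch (this is not quic-go's actual run loop, and not the profile excerpt referenced above; the names runLoop and nextDeadline are made up) of the pattern being discussed: an event loop that re-arms its deadline timer on every incoming event, which is cheap per call but adds up at high packet rates:

```go
// Hypothetical illustration only, NOT quic-go's code. It shows why a run loop
// that re-arms its deadline timer on every event spends noticeable CPU on
// timer churn and scheduler work when events arrive very frequently.
package main

import "time"

func runLoop(events <-chan struct{}, nextDeadline func() time.Time) {
	timer := time.NewTimer(time.Until(nextDeadline()))
	defer timer.Stop()
	for {
		select {
		case _, ok := <-events:
			if !ok {
				return
			}
			// Handle the event, then recompute the deadline and reset the
			// timer. At thousands of packets per second, this Stop/Reset
			// cycle itself becomes a significant cost.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(time.Until(nextDeadline()))
		case <-timer.C:
			// Deadline expired: handle timeouts, then re-arm.
			timer.Reset(time.Until(nextDeadline()))
		}
	}
}

func main() {
	events := make(chan struct{})
	go func() {
		for i := 0; i < 5; i++ {
			events <- struct{}{}
		}
		close(events)
	}()
	runLoop(events, func() time.Time { return time.Now().Add(50 * time.Millisecond) })
}
```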
I see >525% CPU with barely over 1,000 QUIC connections opened. In fact, I'm still seeing >300% with only 19 QUIC connections on the tail end of the scan.
How do you know how many QUIC connections were opened?
That gives you the QUIC connections that (libp2p thinks) are currently open. There might be a bug in how we close (or fail to properly close) connections. Alternatively, this could be the result of a quic-go bug.
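As a purely illustrative aside (this is not the command used above, which was not captured here): inside a go-libp2p application the same count can be approximated by filtering the host's open connections by their multiaddr, roughly like this:

```go
// Illustrative sketch (assumption: a go-libp2p host is available in-process).
// Counts connections whose remote multiaddr contains "/quic", which is roughly
// what "open QUIC connections" means from libp2p's point of view.
package main

import (
	"fmt"
	"strings"

	"github.com/libp2p/go-libp2p"
)

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	quicConns := 0
	for _, c := range h.Network().Conns() {
		if strings.Contains(c.RemoteMultiaddr().String(), "/quic") {
			quicConns++
		}
	}
	fmt.Println("open QUIC connections:", quicConns)
}
```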
I'm in the process of testing on another macOS machine to see if I get the same results. For now, I don't really know the cause, so I have no way to answer that.
Could you run your node with qlog enabled (set the QLOGDIR environment variable to a folder on your system)? This will create a qlog file for every connection. If we're dealing with a quic-go bug, one of the qlog files would be vastly larger than the others. Could you post that file here?
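One way to pick out the suspicious file (a hedged helper sketch, not part of kubo or quic-go) is simply to look for the largest file in the QLOGDIR folder:

```go
// Helper sketch (assumption: QLOGDIR points at the folder that was used when
// running the daemon). Prints the largest file in that folder, on the theory
// that a busy-looping connection produces a much bigger qlog than the others.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	dir := os.Getenv("QLOGDIR")
	if dir == "" {
		fmt.Fprintln(os.Stderr, "QLOGDIR is not set")
		os.Exit(1)
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		panic(err)
	}

	var biggest string
	var biggestSize int64
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		info, err := e.Info()
		if err != nil {
			continue
		}
		if info.Size() > biggestSize {
			biggestSize = info.Size()
			biggest = filepath.Join(dir, e.Name())
		}
	}
	fmt.Printf("largest qlog file: %s (%d bytes)\n", biggest, biggestSize)
}
```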
I'll give that a try. I just finished the test on the other machine. I do see the CPU spike there as well, but I don't see a drop in discovered peers like on the other machine (it found 9,067 on that scan, which is average for me). The config on that machine is almost the default config; I only changed a couple of things (such as the use of the accelerated DHT client).
Lol, 53,835 logs were created during the initial scan. I posted the file at: |
All these files are basically the same size; I think the logging is not detailed enough to log in the (suspected) busy-looping code path. Can you rebuild your node using the
That's a problem. While I am a backend developer, I'm not a developer on this project and have no experience with Go (nor do I have the tools to build a Go project). So, either you send me the binary you want me to test (I need kubo_v0.16.0-rc1_darwin-amd64.tar.gz) or you point me to a resource that explains the tools and procedures needed to build this project (I use git for my work, so that part I have covered).

Some observations:

- I was on a Zoom call for work yesterday and Kubo started its hourly run in the background. Zoom came to a grinding halt and wouldn't work at all until the scan was finished (frozen video, no sound). Needless to say, 0.14 and 0.15 never caused such a problem. This could explain why the number of discovered peers drops dramatically: Kubo might actually affect itself during the run in the same way it affected Zoom, preventing itself from doing the thing it's trying to do. Either way, it isn't usable in its current state, and I can't be the only one this is/will be happening to.
- The CPU returns to normal when not doing the scan (<50%), even if there are many open connections. However, during the scan, even when there are very few connections open, the CPU is pegged. This points the finger at the scanning code more than the connection code. Also, since 0.14 and 0.15 do not have this problem (a scan rarely goes above 100%), this is clearly caused by code that changed between 0.15 and 0.16. Only such code should be scrutinized.
- I see the problem on two separate machines (both macOS 12.6), and the second machine has a nearly default Kubo config (the accelerated DHT client is enabled on both). The original runs in dhtserver mode, but I ran the other one in dhtclient mode just to see. Same problem. So, this could just be a Mac-specific issue and require nothing special beyond that.
Bit of extra info: I tried using "cpulimit -l 150 -p `pgrep -x ipfs`" right after launching the daemon. This not only prevents it from taking over my machine like before, but seems to help it with the scan, as it discovers the expected number of peers this way (I tried multiple times). Not something I would want to do normally, but interesting nonetheless...
OK, I figured out how to build Kubo from source, but I don't see how to force a specific branch on a dependency (which is what you want me to do); I'm not a Go programmer. You're gonna have to give me a hint on that part :P
@ylempereur In the kubo directory, run
Here is the happy winner. It's ... meaty :P
Thank you @ylempereur! That log was very helpful. I found and fixed a bug in quic-go's ACK generation logic: quic-go/quic-go#3566. I'm just not 100% sure that this is the fix for the problem you're seeing ;) Could you update the branch and run the test again (same instructions as in #9309 (comment), just make sure to re-run the
Unfortunately, this doesn't fix my specific problem. In fact, it appears to have introduced a new problem where it now announces p2p-circuit addresses instead of my configured direct connections (in both ip4 and ip6, when using
And the latest log file:
I switched over to the release version of 0.16 and the problem is still there (as expected). So, I decided to try something and ran the following command:

Another thing (which probably belongs in 3567) is that the CPU spike happens when there is very little traffic going on in either direction (the scan rarely uses more than 10% of my bandwidth, but the CPU spike happens after the traffic drops to less than 1%). So, your queue isn't backing up for lack of bandwidth; something else is going on (which doesn't happen in 0.14 or 0.15).

P.S. It appears that only the first (or first two) scans (after a daemon restart) are having discovery problems; subsequent scans do discover the proper number of peers. I need to do this a few more times to make sure (I've only done it twice so far). This adds to the confusion :P
@ylempereur I think I found a fix for the bug in quic-go: quic-go/quic-go#3570. Could I ask you to rebuild kubo with the
Please let me know if you want me to test anything else, and thank you!
@ylempereur: can you please test with 0.17-rc2? It has all the latest go-libp2p fixes.
The full-on spikes (>500%) are gone, but there's still something screwy going on that wasn't happening in 0.15 and prior, which causes the first (or first two) DHT scans to find fewer DHT servers than expected. Those scans look different from normal scans: the traffic drops to near zero long before the scan ends, and the CPU spikes during that time. Normal scans (third and up) do not behave that way; the traffic slowly tapers until the scan ends and the CPU doesn't spike. I've included a screenshot to illustrate a bad scan.
The problem is still present in v0.18.0-rc1.
@ylempereur: 0.18 is fully released now. Is it still present?
Sorry, it will be a few more days before I can verify this properly.
I just tested |
Checklist
Installation method
ipfs-update or dist.ipfs.tech
Version
Config
Description
Compared to previous versions (0.15 and prior), the hourly network scan done by the AcceleratedDHTClient causes a much higher CPU spike and finds far fewer peers (as reported by ipfs stats dht wan: 1K-3K vs 8K-10K, around the same time). Something has changed (not for the better), and I made sure to turn off ResourceMgr in both cases.
Please advise.