0.16.0-rc1: AcceleratedDHTClient possible issue? #9309

Closed
ylempereur opened this issue Sep 26, 2022 · 37 comments
Labels
kind/bug (A bug in existing code, including security flaws) · P1 (High: Likely tackled by core team if no one steps up) · status/in-progress (In progress)

Comments

@ylempereur


Installation method

ipfs-update or dist.ipfs.tech

Version

Kubo version: 0.16.0-rc1
Repo version: 12
System version: amd64/darwin
Golang version: go1.19.1

Config

"ResourceMgr": {
    "Enabled": false
},

Description

Compared to previous versions (0.15 and earlier), the hourly network crawl performed by the AcceleratedDHTClient causes a much higher CPU spike and finds far fewer peers (1K-3K vs 8K-10K, as reported by ipfs stats dht wan at around the same time of day).

Something has changed (not for the better), and I made sure to turn off ResourceMgr in both cases.

Please advise.

@ylempereur ylempereur added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Sep 26, 2022
@Jorropo
Contributor

Jorropo commented Sep 27, 2022

I am not able to reproduce this issue.
I recently had similar findings, but with v0.15.0: I went from 18K down to 11K peers, so it might be related to connection speed?

@ylempereur
Author

I decided to do a test. I ran multiple versions of the binary one after the other on the same computer, same connection, same .ipfs folder. Here are the results:

Version     Peers   CPU
0.16.0-rc1   2638   >500% + fan
0.15.0       9347   <100%
0.14.0       9521   <100%
0.16.0-rc1      9   >200% + fan
0.16.0-rc1     13   >350% + fan
0.16.0-rc1      1   >250% + fan
0.15.0       9648   <150%
0.14.0       9333   <125%
0.16.0-rc1     83   >400% + fan
Rebooted
0.16.0-rc1   3299   >500% + fan

As you can see, the CPU spikes much higher on 0.16 and there's definitely a problem with discovery compared to the other two. I'd say this problem is reproducible. Now, the key is to reproduce it on a different machine.
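For reference, this is roughly how each row was collected (a sketch, not an exact transcript; top's -pid flag is macOS-specific):

# start the daemon with the binary under test (same ~/.ipfs repo every time)
ipfs daemon &
# watch the CPU usage of the ipfs process during the hourly DHT crawl (macOS top)
top -pid $(pgrep -x ipfs)
# once the crawl settles, check how many peers the accelerated DHT client found
ipfs stats dht wan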

@Jorropo
Contributor

Jorropo commented Sep 27, 2022

@ylempereur this is really interesting, thanks a lot.

@julian88110

@ylempereur can you please attach reproduction instructions and profile results?
Thanks!

@Jorropo
Contributor

Jorropo commented Sep 27, 2022

@ylempereur can you try running ipfs diag profile while the CPU is spiking, please? (It runs for 30s; you want most of those 30s covered by the spike.) Then upload the generated .zip.
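For example (a sketch; the --profile-time flag lets you extend the default 30s window, assuming your build has it):

# run this while the CPU is spiking; it writes a .zip profile into the current directory
ipfs diag profile --profile-time=60s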

@ylempereur
Author

@Jorropo I ran it for 60 seconds right when it was over 500%. The file is too big to post here (GitHub won't let me), so I posted it on my site at:
/ipns/yves.lempereur.name/files/ipfs-profile-2022-09-27T11_25_37-07_00.zip

@ylempereur
Author

@julian88110 As soon as I have one, I’ll post it here, but right now, I’m still in the dark as to what’s causing it.

@Jorropo
Contributor

Jorropo commented Sep 28, 2022

@ylempereur I see you are running on macOS, and there is very high (Go) scheduler load in your profile.
I also see QUIC is heavy in the profile on selectgo (channels are known to generate a lot of scheduler load, too).

@marten-seemann I know you recently worked on improving QUIC performance but had issues on macOS; do you think this could be related?

@marten-seemann
Member

There's nothing in the v0.29.0 release that changed any of the run-loop logic in quic-go: https://github.com/lucas-clemente/quic-go/releases/tag/v0.29.0.

Looking at the profile, it looks like we're resetting the timer really frequently:

ROUTINE ======================== github.com/lucas-clemente/quic-go.(*connection).maybeResetTimer in connection.go
     710ms     19.87s (flat, cum)  8.46% of Total
         .          .    748:		return time.Time{}
         .          .    749:	}
         .          .    750:	return s.lastPacketReceivedTime.Add(s.keepAliveInterval)
         .          .    751:}
         .          .    752:
      50ms       50ms    753:func (s *connection) maybeResetTimer() {
         .          .    754:	var deadline time.Time
         .          .    755:	if !s.handshakeComplete {
         .       10ms    756:		deadline = utils.MinTime(
         .       20ms    757:			s.creationTime.Add(s.config.handshakeTimeout()),
         .       30ms    758:			s.idleTimeoutStartTime().Add(s.config.HandshakeIdleTimeout),
         .          .    759:		)
         .          .    760:	} else {
      70ms      900ms    761:		if keepAliveTime := s.nextKeepAliveTime(); !keepAliveTime.IsZero() {
         .          .    762:			deadline = keepAliveTime
         .          .    763:		} else {
      20ms      160ms    764:			deadline = s.idleTimeoutStartTime().Add(s.idleTimeout)
         .          .    765:		}
         .          .    766:	}
         .          .    767:
     100ms      1.05s    768:	if ackAlarm := s.receivedPacketHandler.GetAlarmTimeout(); !ackAlarm.IsZero() {
      90ms      330ms    769:		deadline = utils.MinTime(deadline, ackAlarm)
         .          .    770:	}
     160ms      400ms    771:	if lossTime := s.sentPacketHandler.GetLossDetectionTimeout(); !lossTime.IsZero() {
      30ms      250ms    772:		deadline = utils.MinTime(deadline, lossTime)
         .          .    773:	}
      60ms      130ms    774:	if !s.pacingDeadline.IsZero() {
      30ms      100ms    775:		deadline = utils.MinTime(deadline, s.pacingDeadline)
         .          .    776:	}
         .          .    777:
      50ms     16.39s    778:	s.timer.Reset(deadline)
      50ms       50ms    779:}

Whether or not that's a bug likely depends on how many QUIC connections you're handling at the same time.

@ylempereur
Author

ylempereur commented Sep 28, 2022

I see >525% CPU with barely over 1,000 QUIC connections open. In fact, I'm still seeing >300% with only 19 QUIC connections open on the tail end of the scan.

@marten-seemann
Member

I see >525% CPU with barely over 1,000 QUIC connections open

How do you know how many QUIC connections were open?

@ylempereur
Author

ipfs swarm peers --direction | grep '/quic/' | wc -l

@marten-seemann
Member

That gives you the QUIC connections that libp2p thinks are currently open. There might be a bug in how we close (or fail to properly close) connections.

Alternatively, this could be the result of a quic-go bug.
How can I reproduce this on my machine?

@ylempereur
Author

I'm in the process of testing on another macOS machine to see if I get the same results. For now, I don't really know the cause, so I have no way to answer that.

@marten-seemann
Member

Could you run your node with qlog enabled (set the QLOGDIR environment variable to a folder on your system)? This will create a qlog file for every connection. If we're dealing with a quic-go bug, one of the qlog files should be vastly larger than the others. Could you post that file here?
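Something like this should do it (a sketch; ~/qlogs is an arbitrary directory, and one file is written per connection, so make sure you have some disk space):

# create a directory for the qlog files and start the daemon with qlog enabled
mkdir -p ~/qlogs
QLOGDIR=~/qlogs ipfs daemon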

@ylempereur
Author

I'll give that a try. I just finished the test on the other machine. I do see the CPU spike there as well, but I don't see a drop in discovered peers like on the first machine (it found 9,067 on that scan, which is average for me). The config on that machine is almost the default config; I only changed a couple of things (such as enabling the accelerated DHT client).
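For completeness, the only non-default bit that matters here is the accelerated client itself; on this release the switch still lives under Experimental, if I'm not mistaken (a sketch):

# enable the accelerated DHT client; takes effect after a daemon restart
ipfs config --json Experimental.AcceleratedDHTClient true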

@ylempereur
Author

Lol, 53,835 log files were created during the initial scan. I posted the archive at:
https://yves.lempereur.name/files/logs.tgz

@marten-seemann
Member

All these files are basically the same size; I think the logging is not detailed enough to capture the (suspected) busy-looping code path.

Can you rebuild your node using the debug-timer-resets branch of quic-go and run again? No need to send all 50k files; just the one that's unusually large should be sufficient. ls -laSh sorts by file size (descending).

@ylempereur
Author

That's a problem. While I am a backend developer, I'm not a developer on this project and have no experience with Go (nor do I have the tools to build a Go project). So either send me the binary you want me to test (I need kubo_v0.16.0-rc1_darwin-amd64.tar.gz), or point me to a resource that explains the tools and procedures for building this project (I use git for my work, so that part I have covered).

Some observations:

I was on a Zoom call for work yesterday and Kubo started its hourly run in the background. Zoom came to a grinding halt and wouldn't work at all until the scan was finished (frozen video, no sound). Needless to say, 0.14 and 0.15 never caused such a problem. This could explain why the number of discovered peers drops dramatically: Kubo might be affecting itself during the run the same way it affected Zoom, preventing itself from doing the very thing it's trying to do. Either way, it isn't usable in its current state; I can't be the only one this is (or will be) happening to.

The CPU returns to normal when not doing the scan (<50%), even if there are many open connections. However, during the scan, even when there are very few connections open, the CPU is pegged. This points the finger at the scanning code more than the connection code. Also, since 0.14 and 0.15 do not have this problem (a scan rarely goes above 100%), this is clearly caused by code that changed between 0.15 and 0.16; only that code should need scrutiny.

I see the problem on two separate machines (both macOS 12.6), and the second machine has a nearly default Kubo config (the accelerated DHT client is enabled on both). The original machine runs in dhtserver mode, but I ran the other one in dhtclient mode just to see. Same problem. So this could just be a Mac-specific issue and require nothing special beyond that.

@ylempereur
Author

ylempereur commented Sep 29, 2022

A bit of extra info:

I tried running "cpulimit -l 150 -p `pgrep -x ipfs`" right after launching the daemon. This not only prevents it from taking over my machine like before, but also seems to help with the scan: it discovers the expected number of peers this way (I tried multiple times). Not something I would want to do normally, but interesting nonetheless...
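In case anyone wants to try the same workaround, this is roughly what I ran (a sketch; cpulimit is a third-party tool, and 150% is just the cap I happened to pick):

# start the daemon, then cap the ipfs process at roughly 150% CPU
ipfs daemon &
cpulimit -l 150 -p $(pgrep -x ipfs)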

@ylempereur
Author

OK, I figured out how to build Kubo from source, but I don't see how to force a specific branch on a dependency (which is what you want me to do); I'm not a Go programmer. You're going to have to give me a hint on that part :P

@marten-seemann
Member

@ylempereur In the kubo directory, run go get github.com/lucas-clemente/quic-go@debug-timer-resets. Then run make install to install kubo from source.
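Spelled out, something like this (a sketch; it assumes Go and make are installed and that the go bin directory, typically ~/go/bin, is on your PATH):

git clone https://github.com/ipfs/kubo.git
cd kubo
git checkout release-v0.16.0
# swap the quic-go dependency for the debug branch
go get github.com/lucas-clemente/quic-go@debug-timer-resets
# build and install the ipfs binary from source
make install
# confirm which binary/version you're now running
ipfs version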

@ajnavarro
Member

ajnavarro commented Sep 30, 2022

I cannot reproduce the problem on a Linux machine. I wanted to do a git bisect to find the problem, but I got comparable peer counts on both versions:

  • v0.15.0: 10763 peers
  • v0.16.0-rc1: 11248 peers

But there is a slight increase in CPU usage (>360% peaks on v0.16.0-rc1 vs >280% peaks on v0.15.0):

[screenshots: cpu15, cpu16]

@ajnavarro
Member

Repeating the test several times, I can't get consistent results comparing v0.15.0 and v0.16.0-rc1; I see the same CPU usage variation on both versions.

Example using v0.16.0-rc1:

[screenshot: cpu16-3]

@ylempereur
Author

@marten-seemann
Member

Thank you @ylempereur! That log was very helpful.

I found and fixed a bug in quic-go's ACK generation logic: quic-go/quic-go#3566.

I'm just not 100% sure that this is the fix for the problem you're seeing ;) Could you update the branch and run the test again (same instructions as in #9309 (comment); just make sure to re-run the go get command)? Please also paste the output of grep quic-go go.mod here.

@ylempereur
Author

Unfortunately, this doesn't fix my specific problem. In fact, it appears to have introduced a new problem where it now announces p2p-circuit addresses instead of my configured direct connections (in both ip4 and ip6, when using ipfs id). Anyway, here is the requested info:

>  kubo git:(release-v0.16.0) ✗ git status
On branch release-v0.16.0
Your branch is up to date with 'origin/release-v0.16.0'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   go.mod
	modified:   go.sum

no changes added to commit (use "git add" and/or "git commit -a")
>  kubo git:(release-v0.16.0) ✗ grep quic-go go.mod
	github.com/lucas-clemente/quic-go v0.29.2-0.20221002181507-0d8cfa56b7ec // indirect
>  kubo git:(release-v0.16.0) ✗ 

And the latest log file:
https://yves.lempereur.name/files/log_2022-10-02T20-46-47.568997UTC_client_f7ea855484bcfce444be.qlog.zst

@ylempereur
Author

ylempereur commented Oct 4, 2022

I switched over to the release version of 0.16 and the problem is still there (as expected). So, I decided to try something and ran the following command:
ipfs config --json Swarm.Transports.Network.QUIC false
I then tried running the daemon through the scan several times in a row. The good news is that the CPU spike is completely gone (it stays around 200% during the scan, then drops back under 50% after it's done), which tells us that you are probably chasing the right cause for it. The bad news is that the peer discovery is still dismal (it finds 200-400 peers, as reported by ipfs stats dht wan, when my other node running 0.15 finds 9000-10000). I think we are chasing two separate problems...

Another thing (which probably belongs in 3567) is that the CPU spike happens when there is very little traffic going on in either direction (the scan rarely uses more than 10% of my bandwidth, but the CPU spike happens after the traffic drops to less than 1%). So your queue isn't backing up for lack of bandwidth; something else is going on (which doesn't happen in 0.14 or 0.15).

P.S. It appears that only the first (or first two) scans after a daemon restart have discovery problems; subsequent scans do discover the proper number of peers. I need to do this a few more times to make sure (I've only done it twice so far). This adds to the confusion :P

@guseggert guseggert added P1 High: Likely tackled by core team if no one steps up status/in-progress In progress and removed need/triage Needs initial labeling and prioritization labels Oct 6, 2022
@marten-seemann
Member

@ylempereur I think I found a fix for the bug in quic-go: quic-go/quic-go#3570

Could I ask you to rebuild kubo with the monotonous-timer-tracing branch, and re-run your tests (don't forget to re-enable QUIC, now that you've disabled it)?
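Concretely, something like this, reusing the earlier build steps (a sketch):

# re-enable the QUIC transport that was disabled for the earlier test
ipfs config --json Swarm.Transports.Network.QUIC true
# point quic-go at the new branch and rebuild from the kubo checkout
go get github.com/lucas-clemente/quic-go@monotonous-timer-tracing
make install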

@ylempereur
Author

>  kubo git:(release) ✗ git status
On branch release
Your branch is up to date with 'origin/release'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   go.mod
	modified:   go.sum

no changes added to commit (use "git add" and/or "git commit -a")
>  kubo git:(release) ✗ grep quic-go go.mod
	github.com/lucas-clemente/quic-go v0.29.2-0.20221006193423-e16d0a8a89d1 // indirect

  • CPU spikes: fixed. Yay!
  • Too few DHT peers: still a problem. The first two scans always report fewer than half the expected peers; subsequent scans report reasonable numbers. This does not happen in 0.15 and 0.14.
  • p2p-circuit addresses: this one was my bad. I forgot to allow the newly created binary through the macOS application firewall. Sorry!

Please let me know if you want me to test anything else, and thank you!

@BigLep
Contributor

BigLep commented Nov 17, 2022

@ylempereur : can you please test with 0.17-rc2? It has all the latest go-libp2p fixes.

@ylempereur
Author

The full-on spikes (>500%) are gone, but there's still something screwy going on that wasn't happening in 0.15 and prior, which causes the first (or first two) DHT scans to find fewer DHT servers than expected. Those scans look different from normal scans: the traffic drops to near zero long before the scan ends, and the CPU spikes during that time. Normal scans (the third and up) don't behave that way: the traffic slowly tapers until the scan ends, and the CPU doesn't spike. I've included a screenshot to illustrate a bad scan.

[Screenshot 2022-11-17 at 8:38:56 AM]

@ylempereur
Author

Also, the scanner appears to stall at times. I suspect the two problems are related.

[Screenshot 2022-11-29 at 10:32:17 PM]

@ylempereur
Author

The problem is still present in v0.18.0-rc1

@BigLep
Contributor

BigLep commented Jan 24, 2023

@ylempereur : 0.18 is fully released now. Is it still present?

@ylempereur
Author

Sorry, it will be a few more days before I can verify this properly.

@ylempereur
Author

I just tested 0.18.1 and all the symptoms appear to have been fixed.
You can consider this matter closed.
Thank you
