0.16.0-rc1: AcceleratedDHTClient possible issue? #9309

Closed
ylempereur opened this issue Sep 26, 2022 · 37 comments
Labels
kind/bug (A bug in existing code, including security flaws) · P1 (High: Likely tackled by core team if no one steps up) · status/in-progress (In progress)

Comments

@ylempereur


Installation method

ipfs-update or dist.ipfs.tech

Version

Kubo version: 0.16.0-rc1
Repo version: 12
System version: amd64/darwin
Golang version: go1.19.1

Config

"ResourceMgr": {
    "Enabled": false
},

Description

Compared to previous versions (0.15 and earlier), the hourly network crawl performed by the AcceleratedDHTClient causes a much higher CPU spike and finds far fewer peers (1K-3K vs 8K-10K, as reported by ipfs stats dht wan at around the same time of day).

Something has changed (not for the better), and I made sure to turn off ResourceMgr in both cases.

Please advise.

@ylempereur ylempereur added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels Sep 26, 2022
@Jorropo
Contributor

Jorropo commented Sep 27, 2022

I am not able to reproduce this issue.
I recently had similar findings, but with v0.15.0: I went from 18K down to 11K peers, so it might be related to connection speed?

@ylempereur
Author

I decided to do a test. I ran multiple versions of the binary one after the other on the same computer, same connection, same .ipfs folder. Here are the results:

Version     Peers   CPU
0.16.0-rc1   2638   >500% + fan
0.15.0       9347   <100%
0.14.0       9521   <100%
0.16.0-rc1      9   >200% + fan
0.16.0-rc1     13   >350% + fan
0.16.0-rc1      1   >250% + fan
0.15.0       9648   <150%
0.14.0       9333   <125%
0.16.0-rc1     83   >400% + fan
Rebooted
0.16.0-rc1   3299   >500% + fan

As you can see, the CPU spikes much higher on 0.16 and there's definitely a problem with discovery compared to the other two. I'd say this problem is reproducible. Now, the key is to reproduce it on a different machine.
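For reference, this is roughly how each row was collected (a sketch, not an exact transcript; top's -pid flag is macOS-specific):

# start the daemon with the binary under test (same ~/.ipfs repo every time)
ipfs daemon &
# watch the CPU usage of the ipfs process during the hourly DHT crawl (macOS top)
top -pid $(pgrep -x ipfs)
# once the crawl settles, check how many peers the accelerated DHT client found
ipfs stats dht wan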

@Jorropo
Contributor

Jorropo commented Sep 27, 2022

@ylempereur this is really interesting, thanks a lot.

@julian88110

@ylempereur can you please attach reproduction instructions and profile results?
Thanks!

@Jorropo
Contributor

Jorropo commented Sep 27, 2022

@ylempereur can you try running ipfs diag profile while the CPU is spiking, please? (It runs for 30s; you want most of those 30s covered by the spike.) Then upload the generated .zip.
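For example (a sketch; the --profile-time flag lets you extend the default 30s window, assuming your build has it):

# run this while the CPU is spiking; it writes a .zip profile into the current directory
ipfs diag profile --profile-time=60s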

@ylempereur
Author

@Jorropo I ran it for 60 seconds right when it was over 500%. The file is too big to post here (GitHub won't let me), so I posted it on my site at:
/ipns/yves.lempereur.name/files/ipfs-profile-2022-09-27T11_25_37-07_00.zip

@ylempereur
Author

@julian88110 As soon as I have one, I’ll post it here, but right now, I’m still in the dark as to what’s causing it.

@Jorropo
Contributor

Jorropo commented Sep 28, 2022

@ylempereur I see you are running on macOS, and there is very high (Go) scheduler load in your profile.
I also see QUIC is heavy in the profile on selectgo (channels are known to generate a lot of scheduler load, too).

@marten-seemann I know you recently worked on improving QUIC performance but had issues on macOS; do you think this could be related?

@marten-seemann
Member

There's nothing in the v0.29.0 release that changed any of the run-loop logic in quic-go: https://github.com/lucas-clemente/quic-go/releases/tag/v0.29.0.

Looking at the profile, it looks like we're resetting the timer really frequently:

ROUTINE ======================== github.com/lucas-clemente/quic-go.(*connection).maybeResetTimer in connection.go
     710ms     19.87s (flat, cum)  8.46% of Total
         .          .    748:		return time.Time{}
         .          .    749:	}
         .          .    750:	return s.lastPacketReceivedTime.Add(s.keepAliveInterval)
         .          .    751:}
         .          .    752:
      50ms       50ms    753:func (s *connection) maybeResetTimer() {
         .          .    754:	var deadline time.Time
         .          .    755:	if !s.handshakeComplete {
         .       10ms    756:		deadline = utils.MinTime(
         .       20ms    757:			s.creationTime.Add(s.config.handshakeTimeout()),
         .       30ms    758:			s.idleTimeoutStartTime().Add(s.config.HandshakeIdleTimeout),
         .          .    759:		)
         .          .    760:	} else {
      70ms      900ms    761:		if keepAliveTime := s.nextKeepAliveTime(); !keepAliveTime.IsZero() {
         .          .    762:			deadline = keepAliveTime
         .          .    763:		} else {
      20ms      160ms    764:			deadline = s.idleTimeoutStartTime().Add(s.idleTimeout)
         .          .    765:		}
         .          .    766:	}
         .          .    767:
     100ms      1.05s    768:	if ackAlarm := s.receivedPacketHandler.GetAlarmTimeout(); !ackAlarm.IsZero() {
      90ms      330ms    769:		deadline = utils.MinTime(deadline, ackAlarm)
         .          .    770:	}
     160ms      400ms    771:	if lossTime := s.sentPacketHandler.GetLossDetectionTimeout(); !lossTime.IsZero() {
      30ms      250ms    772:		deadline = utils.MinTime(deadline, lossTime)
         .          .    773:	}
      60ms      130ms    774:	if !s.pacingDeadline.IsZero() {
      30ms      100ms    775:		deadline = utils.MinTime(deadline, s.pacingDeadline)
         .          .    776:	}
         .          .    777:
      50ms     16.39s    778:	s.timer.Reset(deadline)
      50ms       50ms    779:}

Whether or not that's a bug likely depends on how many QUIC connections you're handling at the same time.

@ylempereur
Author

ylempereur commented Sep 28, 2022

I see >525% CPU with barely over 1,000 QUIC connections open. In fact, I'm still seeing >300% with only 19 QUIC connections open on the tail end of the scan.

@marten-seemann
Member

I see >525% CPU with barely over 1,000 QUIC connections open

How do you know how many QUIC connections were open?

@ylempereur
Author

ipfs swarm peers --direction | grep '/quic/' | wc -l

@marten-seemann
Member

That gives you the QUIC connections that libp2p thinks are currently open. There might be a bug in how we close (or fail to properly close) connections.

Alternatively, this could be the result of a quic-go bug.
How can I reproduce this on my machine?

@ylempereur
Author

I'm in the process of testing on another macOS machine to see if I get the same results. For now, I don't really know the cause, so I have no way to answer that.

@marten-seemann
Member

Could you run your node with qlog enabled (set the QLOGDIR environment variable to a folder on your system)? This will create a qlog file for every connection. If we're dealing with a quic-go bug, one of the qlog files should be vastly larger than the others. Could you post that file here?
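Something like this should do it (a sketch; ~/qlogs is an arbitrary directory, and one file is written per connection, so make sure you have some disk space):

# create a directory for the qlog files and start the daemon with qlog enabled
mkdir -p ~/qlogs
QLOGDIR=~/qlogs ipfs daemon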

@ylempereur
Author

I'll give that a try. I just finished the test on the other machine. I do see the CPU spike there as well, but I don't see a drop in discovered peers like on the first machine (it found 9,067 on that scan, which is average for me). The config on that machine is almost the default config; I only changed a couple of things (such as enabling the accelerated DHT client).
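For completeness, the only non-default bit that matters here is the accelerated client itself; on this release the switch still lives under Experimental, if I'm not mistaken (a sketch):

# enable the accelerated DHT client; takes effect after a daemon restart
ipfs config --json Experimental.AcceleratedDHTClient true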

@ylempereur
Author

Lol, 53,835 log files were created during the initial scan. I posted the archive at:
https://yves.lempereur.name/files/logs.tgz

@marten-seemann
Member

All these files are basically the same size; I think the logging is not detailed enough to capture the (suspected) busy-looping code path.

Can you rebuild your node using the debug-timer-resets branch of quic-go and run again? No need to send all 50k files; just the one that's unusually large should be sufficient. ls -laSh sorts by file size (descending).

@ylempereur
Author

That's a problem. While I am a backend developer, I'm not a developer on this project and have no experience with Go (nor do I have the tools to build a Go project). So either send me the binary you want me to test (I need kubo_v0.16.0-rc1_darwin-amd64.tar.gz), or point me to a resource that explains the tools and procedures for building this project (I use git for my work, so that part I have covered).

Some observations:

I was on a Zoom call for work yesterday and Kubo started its hourly run in the background. Zoom came to a grinding halt and wouldn't work at all until the scan was finished (frozen video, no sound). Needless to say, 0.14 and 0.15 never caused such a problem. This could explain why the number of discovered peers drops dramatically: Kubo might be affecting itself during the run the same way it affected Zoom, preventing itself from doing the very thing it's trying to do. Either way, it isn't usable in its current state; I can't be the only one this is (or will be) happening to.

The CPU returns to normal when not doing the scan (<50%), even if there are many open connections. However, during the scan, even when there are very few connections open, the CPU is pegged. This points the finger at the scanning code more than the connection code. Also, since 0.14 and 0.15 do not have this problem (a scan rarely goes above 100%), this is clearly caused by code that changed between 0.15 and 0.16; only that code should need scrutiny.

I see the problem on two separate machines (both macOS 12.6), and the second machine has a nearly default Kubo config (the accelerated DHT client is enabled on both). The original machine runs in dhtserver mode, but I ran the other one in dhtclient mode just to see. Same problem. So this could just be a Mac-specific issue and require nothing special beyond that.

@ylempereur
Author

ylempereur commented Sep 29, 2022

A bit of extra info:

I tried running "cpulimit -l 150 -p `pgrep -x ipfs`" right after launching the daemon. This not only prevents it from taking over my machine like before, but also seems to help with the scan: it discovers the expected number of peers this way (I tried multiple times). Not something I would want to do normally, but interesting nonetheless...
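In case anyone wants to try the same workaround, this is roughly what I ran (a sketch; cpulimit is a third-party tool, and 150% is just the cap I happened to pick):

# start the daemon, then cap the ipfs process at roughly 150% CPU
ipfs daemon &
cpulimit -l 150 -p $(pgrep -x ipfs)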

@ylempereur
Author

OK, I figured out how to build Kubo from source, but I don't see how to force a specific branch on a dependency (which is what you want me to do); I'm not a Go programmer. You're going to have to give me a hint on that part :P

@marten-seemann
Member

@ylempereur In the kubo directory, run go get github.com/lucas-clemente/quic-go@debug-timer-resets. Then run make install to install kubo from source.
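Spelled out, something like this (a sketch; it assumes Go and make are installed and that the go bin directory, typically ~/go/bin, is on your PATH):

git clone https://github.com/ipfs/kubo.git
cd kubo
git checkout release-v0.16.0
# swap the quic-go dependency for the debug branch
go get github.com/lucas-clemente/quic-go@debug-timer-resets
# build and install the ipfs binary from source
make install
# confirm which binary/version you're now running
ipfs version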

@ajnavarro
Member

ajnavarro commented Sep 30, 2022

I cannot reproduce the problem on a Linux machine. I wanted to do a git bisect to find the problem, but I got comparable peer counts on both versions:

  • v0.15.0: 10763 peers
  • v0.16.0-rc1: 11248 peers

But there is a slight increase in CPU usage (>360% peaks on v0.16.0-rc1 vs >280% peaks on v0.15.0):

[screenshots: cpu15, cpu16]

@ajnavarro
Member

Repeating the test several times, I can't get consistent results comparing v0.15.0 and v0.16.0-rc1; I see the same CPU usage variation on both versions.

Example using v0.16.0-rc1:

[screenshot: cpu16-3]

@ylempereur
Author

@marten-seemann
Member

Thank you @ylempereur! That log was very helpful.

I found and fixed a bug in quic-go's ACK generation logic: quic-go/quic-go#3566.

I'm just not 100% sure that this is the fix for the problem you're seeing ;) Could you update the branch and run the test again (same instructions as in #9309 (comment); just make sure to re-run the go get command)? Please also paste the output of grep quic-go go.mod here.

@ylempereur
Author

Unfortunately, this doesn't fix my specific problem. In fact, it appears to have introduced a new problem where it now announces p2p-circuit addresses instead of my configured direct connections (in both ip4 and ip6, when using ipfs id). Anyway, here is the requested info:

>  kubo git:(release-v0.16.0) ✗ git status
On branch release-v0.16.0
Your branch is up to date with 'origin/release-v0.16.0'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   go.mod
	modified:   go.sum

no changes added to commit (use "git add" and/or "git commit -a")
>  kubo git:(release-v0.16.0) ✗ grep quic-go go.mod
	github.com/lucas-clemente/quic-go v0.29.2-0.20221002181507-0d8cfa56b7ec // indirect
>  kubo git:(release-v0.16.0) ✗ 

And the latest log file:
https://yves.lempereur.name/files/log_2022-10-02T20-46-47.568997UTC_client_f7ea855484bcfce444be.qlog.zst

@ylempereur
Author

ylempereur commented Oct 4, 2022

I switched over to the release version of 0.16 and the problem is still there (as expected). So, I decided to try something and ran the following command:
ipfs config --json Swarm.Transports.Network.QUIC false
I then tried running the daemon through the scan several times in a row. The good news is that the CPU spike is completely gone (it stays around 200% during the scan, then drops back under 50% after it's done), which tells us that you are probably chasing the right cause for it. The bad news is that the peer discovery is still dismal (it finds 200-400 peers, as reported by ipfs stats dht wan, when my other node running 0.15 finds 9000-10000). I think we are chasing two separate problems...

Another thing (which probably belongs in 3567) is that the CPU spike happens when there is very little traffic going on in either direction (the scan rarely uses more than 10% of my bandwidth, but the CPU spike happens after the traffic drops to less than 1%). So your queue isn't backing up for lack of bandwidth; something else is going on (which doesn't happen in 0.14 or 0.15).

P.S. It appears that only the first (or first two) scans after a daemon restart have discovery problems; subsequent scans do discover the proper number of peers. I need to do this a few more times to make sure (I've only done it twice so far). This adds to the confusion :P

@guseggert guseggert added P1 High: Likely tackled by core team if no one steps up status/in-progress In progress and removed need/triage Needs initial labeling and prioritization labels Oct 6, 2022
@marten-seemann
Member

@ylempereur I think I found a fix for the bug in quic-go: quic-go/quic-go#3570

Could I ask you to rebuild kubo with the monotonous-timer-tracing branch, and re-run your tests (don't forget to re-enable QUIC, now that you've disabled it)?
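Concretely, something like this, reusing the earlier build steps (a sketch):

# re-enable the QUIC transport that was disabled for the earlier test
ipfs config --json Swarm.Transports.Network.QUIC true
# point quic-go at the new branch and rebuild from the kubo checkout
go get github.com/lucas-clemente/quic-go@monotonous-timer-tracing
make install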

@ylempereur
Author

>  kubo git:(release) ✗ git status
On branch release
Your branch is up to date with 'origin/release'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   go.mod
	modified:   go.sum

no changes added to commit (use "git add" and/or "git commit -a")
>  kubo git:(release) ✗ grep quic-go go.mod
	github.com/lucas-clemente/quic-go v0.29.2-0.20221006193423-e16d0a8a89d1 // indirect

  • CPU spikes: fixed. Yay!
  • Too few DHT peers: still a problem. The first two scans always report fewer than half the expected peers; subsequent scans report reasonable numbers. This does not happen in 0.15 and 0.14.
  • p2p-circuit addresses: this one was my bad. I forgot to allow the newly created binary through the macOS application firewall. Sorry!

Please let me know if you want me to test anything else, and thank you!

@BigLep
Contributor

BigLep commented Nov 17, 2022

@ylempereur : can you please test with 0.17-rc2? It has all the latest go-libp2p fixes.

@ylempereur
Author

The full-on spikes (>500%) are gone, but there's still something screwy going on that wasn't happening in 0.15 and prior, which causes the first (or first two) DHT scans to find fewer DHT servers than expected. Those scans look different from normal scans: the traffic drops to near zero long before the scan ends, and the CPU spikes during that time. Normal scans (the third and up) don't behave that way: the traffic slowly tapers until the scan ends, and the CPU doesn't spike. I've included a screenshot to illustrate a bad scan.

[Screenshot 2022-11-17 at 8:38:56 AM]

@ylempereur
Author

Also, the scanner appears to stall at times. I suspect the two problems are related.

[Screenshot 2022-11-29 at 10:32:17 PM]

@ylempereur
Author

The problem is still present in v0.18.0-rc1

@BigLep
Contributor

BigLep commented Jan 24, 2023

@ylempereur : 0.18 is fully released now. Is it still present?

@ylempereur
Author

Sorry, it will be a few more days before I can verify this properly.

@ylempereur
Author

I just tested 0.18.1 and all the symptoms appear to have been fixed.
You can consider this matter closed.
Thank you
