
CI: fix more flakes, move itests to GitHub (except ARM itest) #5811

Merged · 6 commits merged into lightningnetwork:master from itest-flake-fix on Oct 5, 2021

Conversation

@guggero (Collaborator) commented Sep 30, 2021

Depends on btcsuite/btcd#1752.

Fixes two problems in the itest:

  • Some instances of premature channel announcements caused by not all subsystems being in sync --> we fix this by always waiting 50 (or maybe 20 would be enough?) milliseconds after each mined block.
  • Some instances of the mining btcd node and the chain backend btcd node losing their connection because of the peer stall detection in btcd --> we fix this by disabling stall detection.

@guggero guggero force-pushed the itest-flake-fix branch 13 times, most recently from 93274ef to 4f4d48d on October 1, 2021 11:28
@guggero guggero changed the title wip: more itest flake fixes CI: fix more flakes, move itests to GitHub (except ARM itest) Oct 1, 2021
@guggero guggero marked this pull request as ready for review October 1, 2021 12:18
@guggero (Collaborator, Author) commented Oct 1, 2021

Wow, all GitHub itests green on the first run 😮

@joostjager (Collaborator)

> Wow, all GitHub itests green on the first run

Is this good or bad?

@carlaKC carlaKC self-requested a review October 1, 2021 12:26
@guggero (Collaborator, Author) commented Oct 1, 2021

> Wow, all GitHub itests green on the first run
>
> Is this good or bad?

I would say this is very good. Not sure why you would think it wasn't?

@carlaKC (Collaborator) left a review comment:

Looks good, just one q about a comment I'm unsure of.

@@ -31,7 +31,7 @@ var (
"lndexec", itestLndBinary, "full path to lnd binary",
)

-	slowMineDelay = 50 * time.Millisecond
+	slowMineDelay = 20 * time.Millisecond
Collaborator:

re commit message: why does decreasing this value slow things down?

Collaborator (Author):

We already had the mineBlocksSlow function that used the 50ms delay. By replacing all instances of mineBlocks with mineBlocksSlow, we slow everything down. To reduce the overall slowdown, we decrease the delay from 50ms to 20ms.

Collaborator (Author):

Going to update the commit message to make this clearer.

Comment on lines +1602 to +1604
// Did the event chan close in the meantime? We want to
// avoid a "close of closed channel" panic since we're
// re-using the same event chan for multiple requests.
Collaborator:

Not really understanding this comment? Would this channel get closed when chanWatchRequests is finished with it? Also, are we always sure this is a close and not another channel policy update?

Member:

I think it's that if the channel has already been closed here, and we send in another request, it'll end up double closing.

Collaborator (Author):

Yes, exactly.
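The guard this thread describes can be sketched as a select with a default branch (a minimal standalone sketch of the pattern, not the exact lnd code):

```go
package main

import "fmt"

// closeOnce closes eventChan unless it has already been closed. Nothing
// is ever sent on the chan, only close() is called, so a receive that
// succeeds immediately means the chan is already closed; otherwise the
// default branch runs and it is safe to close.
func closeOnce(eventChan chan struct{}) {
	select {
	case <-eventChan:
		// Already closed: closing again would panic with
		// "close of closed channel", so do nothing.
	default:
		// Still open and empty: safe to close.
		close(eventChan)
	}
}

func main() {
	ch := make(chan struct{})
	closeOnce(ch)
	closeOnce(ch) // second call is a no-op instead of panicking
	fmt.Println("no double-close panic")
}
```

Note this pattern only works because the chan is used purely as a close-signal; if values were ever sent on it, the receive case could consume one by mistake.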

@joostjager (Collaborator)

> I would say this is very good. Not sure why you would think it wasn't?

If the tests were previously flaky because they ran on slow test machines, it could be that they uncovered issues that only show up on slow production machines. So perhaps those are missed now with GitHub Actions. I have to admit I don't even know for sure that the tests being green is caused by faster test machines.

Sorry about the lack of threading, I should have put my initial comment on a specific line.

@Roasbeef (Member) commented Oct 1, 2021

> If the tests were previously flakey because of them running on slow test machines, it could be that they uncovered issues that only show on slow production machines.

I think it's mainly the lack of consistent timing w/ the series of timeouts we have. When we run w/ Travis (and their potato cluster) we end up with several processes (replicated db, 2x full node, up to 6 lnd nodes in some tests), so it's understandable that we run into some CPU scheduling weirdness that causes these flakes at times. At the same time, we've also eliminated a ton of flakes over the past 2 months due to flake hunting szn.

Travis as a service has consistently degraded over the past year or so, and then they had that massive security failure on top of that. We've given them enough chances to get their service together after being acquired by that PE firm, IMO.

@Roasbeef (Member) commented Oct 1, 2021

I think the bigger gain here is the restoration of all the lost developer time (sitting there babysitting the tests to restart them, odd failures w/ the machine (?) itself) due to Travis.

@Roasbeef (Member) commented Oct 1, 2021

Also worth noting this brings in the btcd fix to disable the stall handler on simnet, which caused disconnections from the main miner node, which in turn caused a ton of issues w.r.t. transactions not properly propagating.

@Roasbeef (Member) left a review comment:

LGTM 🥻


@@ -47,71 +47,16 @@ jobs:
- GOGC=30 make lint

- stage: Integration Test
name: Btcd Integration
Member:

cy@ Travis 🤡

@guggero (Collaborator, Author) commented Oct 4, 2021

Rebased. But still blocked by btcsuite/btcd#1752.

@Roasbeef (Member) commented Oct 5, 2021

Interceptor tests need a wait.Predicate somewhere:

    lnd_rpc_middleware_interceptor_test.go:417: 
        	Error Trace:	lnd_rpc_middleware_interceptor_test.go:417
        	            				lnd_rpc_middleware_interceptor_test.go:125
        	Error:      	"rpc error: code = Unknown desc = the RPC server is in the process of starting up, but not yet ready to accept calls" does not contain "middleware 'itest-interceptor' is currently not registered"
        	Test:       	TestLightningNetworkDaemon/tranche01/84-of-85/btcd/rpc_middleware_interceptor/mandatory_middleware

The latest version of btcd allows its stall handler to be disabled. We use that new config option to make sure the mining btcd node and the lnd chain backend btcd node aren't disconnected if a test takes too long and no new p2p messages are exchanged.

We now redirect the mineBlocks function to the mineBlocksSlow function, which waits after each mined block. To reduce the overall time impact of using that function everywhere, we only wait 20 milliseconds instead of 50ms after each mined block to give all nodes some time to process the block. This still slows everything down a bit but reduces flakes that are caused by different subsystems not being up to date.

Fixes the docker build failure that was caused by docker-library/postgres#884. Using the alpine and version 13 image avoids the problem introduced with postgres 14 and debian bullseye.
@Roasbeef Roasbeef added this to the v0.14.0 milestone Oct 5, 2021
@Roasbeef (Member) commented Oct 5, 2021

Race cond flake is new; notified the OP of that new test, it needs a wait.Predicate there.

@Roasbeef Roasbeef merged commit 5cc10ef into lightningnetwork:master Oct 5, 2021
@guggero guggero deleted the itest-flake-fix branch October 6, 2021 07:28