
CI: fix more flakes, move itests to GitHub (except ARM itest) #5811

Merged · 6 commits merged into lightningnetwork:master from itest-flake-fix on Oct 5, 2021

Conversation

@guggero (Collaborator) commented Sep 30, 2021

Depends on btcsuite/btcd#1752.

Fixes two problems in the itest:

  • Some instances of premature channel announcements caused by not all subsystems being in sync --> we fix this by always waiting 50 (or maybe 20 would be enough?) milliseconds after each mined block.
  • Some instances of the mining btcd node and the chain backend btcd node losing their connection because of the peer stall detection in btcd --> we fix this by disabling stall detection.

@guggero guggero force-pushed the itest-flake-fix branch 13 times, most recently from 93274ef to 4f4d48d on October 1, 2021 11:28
@guggero guggero changed the title wip: more itest flake fixes CI: fix more flakes, move itests to GitHub (except ARM itest) Oct 1, 2021
@guggero guggero marked this pull request as ready for review October 1, 2021 12:18
@guggero (Collaborator, Author) commented Oct 1, 2021

Wow, all GitHub itests green on the first run 😮

@joostjager (Collaborator)

> Wow, all GitHub itests green on the first run

Is this good or bad?

@carlaKC carlaKC self-requested a review October 1, 2021 12:26
@guggero (Collaborator, Author) commented Oct 1, 2021

> Wow, all GitHub itests green on the first run
>
> Is this good or bad?

I would say this is very good. Not sure why you would think it wasn't?

@carlaKC (Collaborator) left a review comment:

Looks good, just one q about a comment I'm unsure of.

@@ -31,7 +31,7 @@ var (
"lndexec", itestLndBinary, "full path to lnd binary",
)

-	slowMineDelay = 50 * time.Millisecond
+	slowMineDelay = 20 * time.Millisecond
Collaborator:

re commit message: why does decreasing this value slow things down?

Collaborator (Author):

We already had the mineBlocksSlow function that used the 50ms delay. By replacing all instances of mineBlocks with mineBlocksSlow, we slow everything down. To reduce the overall slowdown, we decrease the delay from 50ms to 20ms.

Collaborator (Author):

Going to update the commit message to make this clearer.

Comment on lines +1602 to +1604
// Did the event chan close in the meantime? We want to
// avoid a "close of closed channel" panic since we're
// re-using the same event chan for multiple requests.
Collaborator:

Not really understanding this comment? Would this channel get closed when chanWatchRequests is finished with it? Also, are we always sure this is a close and not another channel policy update?

Member:

I think it's that if the channel has already been closed here, and we send in another request, it'll end up double closing.

Collaborator (Author):

Yes, exactly.
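The guard this thread describes can be sketched as a select with a default branch (a minimal standalone sketch of the pattern, not the exact lnd code):

```go
package main

import "fmt"

// closeOnce closes eventChan unless it has already been closed. Nothing
// is ever sent on the chan, only close() is called, so a receive that
// succeeds immediately means the chan is already closed; otherwise the
// default branch runs and it is safe to close.
func closeOnce(eventChan chan struct{}) {
	select {
	case <-eventChan:
		// Already closed: closing again would panic with
		// "close of closed channel", so do nothing.
	default:
		// Still open and empty: safe to close.
		close(eventChan)
	}
}

func main() {
	ch := make(chan struct{})
	closeOnce(ch)
	closeOnce(ch) // second call is a no-op instead of panicking
	fmt.Println("no double-close panic")
}
```

Note this pattern only works because the chan is used purely as a close-signal; if values were ever sent on it, the receive case could consume one by mistake.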

@joostjager (Collaborator)

> I would say this is very good. Not sure why you would think it wasn't?

If the tests were previously flaky because they ran on slow test machines, it could be that they uncovered issues that only show up on slow production machines. So perhaps those are missed now with GitHub Actions. I have to admit I don't even know for sure that the tests being green is caused by faster test machines.

Sorry about the lack of threading, I should have put my initial comment on a specific line.

@Roasbeef (Member) commented Oct 1, 2021

> If the tests were previously flakey because of them running on slow test machines, it could be that they uncovered issues that only show on slow production machines.

I think it's mainly the lack of consistent timing w/ the series of timeouts we have. When we run w/ Travis (and their potato cluster) we end up with several processes (replicated db, 2x full node, up to 6 lnd nodes in some tests), so it's understandable that we run into some CPU scheduling weirdness that causes these flakes at times. At the same time, we've also eliminated a ton of flakes over the past 2 months due to flake hunting szn.

Travis as a service has consistently degraded over the past year or so, and then they had that massive security failure on top of that. We've given them enough chances to get their service together after being acquired by that PE firm, IMO.

@Roasbeef (Member) commented Oct 1, 2021

I think the bigger gain here is the restoration of all the lost developer time (sitting there babysitting the tests to restart them, odd failures w/ the machine (?) itself) due to Travis.

@Roasbeef (Member) commented Oct 1, 2021

Also worth noting this brings in the btcd fix to disable the stall handler on simnet, which caused disconnections from the main miner node, which in turn caused a ton of issues w.r.t. transactions not properly propagating.

@Roasbeef (Member) left a review comment:

LGTM 🥻


@@ -47,71 +47,16 @@ jobs:
- GOGC=30 make lint

- stage: Integration Test
name: Btcd Integration
Member:

cy@ Travis 🤡

@guggero (Collaborator, Author) commented Oct 4, 2021

Rebased. But still blocked by btcsuite/btcd#1752.

@Roasbeef (Member) commented Oct 5, 2021

Interceptor tests need a wait.Predicate somewhere:

    lnd_rpc_middleware_interceptor_test.go:417: 
        	Error Trace:	lnd_rpc_middleware_interceptor_test.go:417
        	            				lnd_rpc_middleware_interceptor_test.go:125
        	Error:      	"rpc error: code = Unknown desc = the RPC server is in the process of starting up, but not yet ready to accept calls" does not contain "middleware 'itest-interceptor' is currently not registered"
        	Test:       	TestLightningNetworkDaemon/tranche01/84-of-85/btcd/rpc_middleware_interceptor/mandatory_middleware

The latest version of btcd allows its stall handler to be disabled. We use that new config option to make sure the mining btcd node and the lnd chain backend btcd node aren't disconnected if a test takes too long and no new p2p messages are exchanged.

We now redirect the mineBlocks function to the mineBlocksSlow function, which waits after each mined block. To reduce the overall time impact of using that function everywhere, we only wait 20 milliseconds instead of 50ms after each mined block to give all nodes some time to process the block. This still slows everything down a bit but reduces flakes that are caused by different subsystems not being up to date.

Fixes the docker build failure that was caused by docker-library/postgres#884. Using the alpine and version 13 image avoids the problem introduced with postgres 14 and debian bullseye.
@Roasbeef Roasbeef added this to the v0.14.0 milestone Oct 5, 2021
@Roasbeef (Member) commented Oct 5, 2021

Race cond flake is new; notified the OP of that new test, it needs a wait.Predicate there.

@Roasbeef Roasbeef merged commit 5cc10ef into lightningnetwork:master Oct 5, 2021
@guggero guggero deleted the itest-flake-fix branch October 6, 2021 07:28