Unable to retrieve events for certain block heights #5810

Open
vishalchangrani opened this issue Apr 29, 2024 · 8 comments · May be fixed by #5969
Assignees
Labels
Bug Something isn't working S-Access

Comments

@vishalchangrani
Contributor

🐞 Bug Report

Requests to retrieve events for certain block heights fail.

flow events get --start=68225795 --end=68225795 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc =

While this is related to the bug https://github.com/dapperlabs/flow-go/issues/6959, it points to a different issue.
Currently, EN1 is set to return a ResourceExhausted error when queried for events. However, the fact that the GetEvents call fails consistently indicates that the public access nodes always query only EN1. This would happen if the access node received only one execution receipt for the block, and that receipt came from EN1. Hence the core issue here is that the access node is most likely missing execution receipts from the other execution nodes.

What is the severity of this bug?

important

Critical - Urgent: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). Whole team should drop what they're doing and work on this.

Critical: We can't do anything if this isn't actioned immediately (product doesn't function without this, it's blocking us or users, or it resolves a high severity security issue). One person should look at this right now.

Important: We have to do this before we ship, but it can wait until the next sprint (product or feature won't function without it, but it's not blocking us or users right now). Team should do this in the next sprint.

Should have: It would be better if we do this before we ship, but it's OK if we don't (product functions without this, but it's a better user experience). Consider adding to a future sprint.

Could have: It really doesn't matter if we do this (product functions without this, impact to user is minimal).

Reproduction steps

Steps to reproduce the behaviour:

$ flow events get --start=68225796 --end=68225796 -n mainnet A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn

❌ Command Error: client: rpc error: code = ResourceExhausted desc = failed to retrieve events from execution nodes: 1 error occurred:
	* rpc error: code = ResourceExhausted desc = 

Expected behaviour

Events should be returned.

Workaround

Access nodes 7 and 8, run by the foundation, serve events locally and respond without an error for those block heights.

$ flow events get --start=68225795 --end=68225795 --host access-008.mainnet24.nodes.onflow.org:9000 A.d0bcefdf1e67ea85.HWGaragePMV2.AirdropBurn


Add any other context about the problem here.

@vishalchangrani vishalchangrani added the Bug Something isn't working label Apr 29, 2024
@vishalchangrani
Contributor Author

PR #5764 fixes the issue of EN1 missing events. Once that fix is rolled out, the client should no longer receive an error, since EN1 will have all the events.
However, the root cause of this issue would still persist and needs to be fixed.

@peterargue
Contributor

peterargue commented Apr 29, 2024

The next step is to reproduce this against a single AN, then inspect the receipts for the block to see how many there are and which nodes they came from.

ANs index execution receipts in the ingestion engine here:

func (e *Engine) handleExecutionReceipt(_ flow.Identifier, r *flow.ExecutionReceipt) error {

Then choose an execution node based on receipts in storage here:

func findAllExecutionNodes(

It's possible for an AN to have received a receipt for a block from only a single EN, or even from none. In this case, the AN should just try any EN.

I think we're running into a special case in this situation. If an AN is configured with a list of "preferred execution nodes", it will select one or more nodes from that list that it has receipts from. However, if that selection contains only a single node and the request to that node fails, it will not retry on another node.

@peterargue
Contributor

peterargue commented May 1, 2024

There are 2 flags an AN can use to control which EN to use:

  • --preferred-execution-node-ids: if this is set, the AN will prefer to use a node from this list if it has a receipt from any of them. Otherwise, it will fall back to using any EN.
  • --fixed-execution-node-ids: if this is set the AN will only use nodes from this list.

Otherwise, the node will try with any execution node.

Here's the logic:

func chooseExecutionNodes(state protocol.State, executorIDs flow.IdentifierList) (flow.IdentitySkeletonList, error) {
	allENs, err := state.Final().Identities(filter.HasRole[flow.Identity](flow.RoleExecution))
	if err != nil {
		return nil, fmt.Errorf("failed to retreive all execution IDs: %w", err)
	}
	// first try and choose from the preferred EN IDs
	var chosenIDs flow.IdentityList
	if len(preferredENIdentifiers) > 0 {
		// find the preferred execution node IDs which have executed the transaction
		chosenIDs = allENs.Filter(filter.And(filter.HasNodeID[flow.Identity](preferredENIdentifiers...),
			filter.HasNodeID[flow.Identity](executorIDs...)))
		if len(chosenIDs) > 0 {
			return chosenIDs.ToSkeleton(), nil
		}
	}
	// if no preferred EN ID is found, then choose from the fixed EN IDs
	if len(fixedENIdentifiers) > 0 {
		// choose fixed ENs which have executed the transaction
		chosenIDs = allENs.Filter(filter.And(
			filter.HasNodeID[flow.Identity](fixedENIdentifiers...),
			filter.HasNodeID[flow.Identity](executorIDs...)))
		if len(chosenIDs) > 0 {
			return chosenIDs.ToSkeleton(), nil
		}
		// if no such ENs are found then just choose all fixed ENs
		chosenIDs = allENs.Filter(filter.HasNodeID[flow.Identity](fixedENIdentifiers...))
		return chosenIDs.ToSkeleton(), nil
	}
	// If no preferred or fixed ENs have been specified, then return all executor IDs i.e. no preference at all
	return allENs.Filter(filter.HasNodeID[flow.Identity](executorIDs...)).ToSkeleton(), nil
}

This issue comes up when an access node only has receipts from a single EN. In this case, if that node is offline or returns an error, the AN will not retry on any other node. This can create the situation where data for some blocks effectively becomes unavailable on that node.

ANs receive receipts directly from ENs as they execute blocks, and from the blocks themselves as they are received from consensus nodes. In some situations an AN can end up with only a single receipt for a block in its store, so that situation should be handled.
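
To make the failure mode concrete, here is a minimal sketch of a "try each candidate in order" loop. It is not the flow-go implementation: NodeID, queryFn, fakeQuery and getEventsWithFailover are hypothetical names introduced for illustration, and the fake query simply assumes EN1 always fails the way it does in this issue. With a single candidate the loop has nothing to fail over to, so the EN's error reaches the client; with a second candidate the same loop recovers.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeID is a simplified stand-in for flow.Identifier (assumption).
type NodeID string

// errResourceExhausted mimics the error EN1 returns in this issue.
var errResourceExhausted = errors.New("resource exhausted")

// queryFn is a hypothetical stand-in for the AN's events request to one EN.
type queryFn func(node NodeID) ([]string, error)

// getEventsWithFailover tries each candidate in order and returns the first
// successful response. If every candidate fails, it returns the joined errors.
func getEventsWithFailover(candidates []NodeID, query queryFn) ([]string, error) {
	var errs []error
	for _, node := range candidates {
		events, err := query(node)
		if err == nil {
			return events, nil
		}
		errs = append(errs, fmt.Errorf("node %s: %w", node, err))
	}
	if len(errs) == 0 {
		return nil, errors.New("no execution nodes to query")
	}
	return nil, errors.Join(errs...)
}

// fakeQuery simulates EN1 failing while every other EN succeeds.
func fakeQuery(node NodeID) ([]string, error) {
	if node == "EN1" {
		return nil, errResourceExhausted
	}
	return []string{"event served by " + string(node)}, nil
}

func main() {
	// Only one receipt (from EN1) is known, so there is no fallback.
	_, err := getEventsWithFailover([]NodeID{"EN1"}, fakeQuery)
	fmt.Println("single candidate:", err)

	// With a second candidate, the same loop recovers.
	events, err := getEventsWithFailover([]NodeID{"EN1", "EN2"}, fakeQuery)
	fmt.Println("two candidates:", events, err)
}
```

The loop itself is the same in both calls; only the length of the candidate list decides whether a single failing EN makes the block's data unavailable.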

@peterargue
Contributor

peterargue commented May 1, 2024

I think we should update the behavior so that, when --preferred-execution-node-ids is set and fewer than

const maxFailedRequestCount = 3

nodes are selected, the list is padded up to 3 nodes using the following methods (in order):

  1. Use any EN with a receipt
  2. Use any preferred node not already selected
  3. Use any EN not already selected

This would ensure there are enough fallbacks to handle cases where ENs are unavailable.
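
The padding order above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change made in the linked PR: NodeID and padExecutionNodes are hypothetical names, node IDs are plain strings, and the three input lists (ENs with a receipt, preferred ENs, all ENs) are assumed to be already available to the caller. Only maxFailedRequestCount is taken from the thread.

```go
package main

import "fmt"

// NodeID is a simplified stand-in for flow.Identifier (assumption).
type NodeID string

// maxFailedRequestCount is the constant quoted in the comment above.
const maxFailedRequestCount = 3

// padExecutionNodes (hypothetical helper) returns selected unchanged if it
// already holds at least maxFailedRequestCount nodes. Otherwise it pads the
// list, without duplicates, in the proposed order:
//  1. any EN with a receipt
//  2. any preferred node not already selected
//  3. any EN not already selected
func padExecutionNodes(selected, withReceipt, preferred, all []NodeID) []NodeID {
	if len(selected) >= maxFailedRequestCount {
		return selected
	}

	result := make([]NodeID, 0, maxFailedRequestCount)
	seen := make(map[NodeID]struct{})

	// add appends nodes from src that are not yet in result, stopping once
	// the limit is reached.
	add := func(src []NodeID) {
		for _, id := range src {
			if len(result) >= maxFailedRequestCount {
				return
			}
			if _, ok := seen[id]; ok {
				continue
			}
			seen[id] = struct{}{}
			result = append(result, id)
		}
	}

	add(selected)
	add(withReceipt)
	add(preferred)
	add(all)
	return result
}

func main() {
	selected := []NodeID{"EN1"}           // preferred node that has a receipt
	withReceipt := []NodeID{"EN1", "EN4"} // ENs the AN holds receipts from
	preferred := []NodeID{"EN1", "EN2"}   // --preferred-execution-node-ids
	all := []NodeID{"EN1", "EN2", "EN3", "EN4"}

	fmt.Println(padExecutionNodes(selected, withReceipt, preferred, all))
}
```

If fewer than three distinct ENs exist in total, the sketch simply returns as many as it can find rather than failing.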

@vishalchangrani
Contributor Author

  • Use any EN with a receipt
  • Use any preferred node not already selected
  • Use any EN not already selected

Shouldn't the order be:

  1. Use any preferred node not already selected
  2. Use any EN with a receipt
  3. Use any EN not already selected

The operator wants the preferred nodes to be given more weight.

@peterargue
Contributor

My thinking is that "preferred" implies that the node will try to use these if one of them has executed the block; otherwise it will use another node.

If we failed over to any preferred EN, I think we'd be more likely to see delays responding to queries when there are other ENs that have reported executing the block. I'm OK with either approach.

@vishalchangrani
Contributor Author

  • Use any EN with a receipt
  • Use any preferred node not already selected
  • Use any EN not already selected

You are right - I mistakenly assumed preferred nodes would always be among the ENs with a receipt.
Good with the order you suggested.

One question though - is the AN capable of differentiating between an EN responding with a not-found error and an EN responding with any other error?

@peterargue
Contributor

One question though - is the AN capable of differentiating between EN responding with a not-found error versus an EN responding with any other error?

In some cases it does, but we can certainly add it where needed. Did you have a case in mind that should be checked?
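
As a rough illustration of what such a check could look like, the sketch below separates a not-found response from any other failure. Everything in it is an assumption made for the example: Code, rpcError and isNotFound are hypothetical stand-ins rather than flow-go or gRPC types, and what the AN should do differently in each case is not settled in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a hypothetical stand-in for an RPC status code.
type Code int

const (
	CodeNotFound Code = iota + 1
	CodeResourceExhausted
	CodeUnavailable
)

// rpcError is a hypothetical error type carrying a status code.
type rpcError struct {
	Code Code
	Msg  string
}

func (e *rpcError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.Code, e.Msg)
}

// isNotFound reports whether err, or any error it wraps, is an rpcError
// with CodeNotFound. Any other error, including nil, yields false.
func isNotFound(err error) bool {
	var rpcErr *rpcError
	return errors.As(err, &rpcErr) && rpcErr.Code == CodeNotFound
}

func main() {
	notFound := fmt.Errorf("EN2: %w", &rpcError{Code: CodeNotFound, Msg: "block not executed"})
	exhausted := fmt.Errorf("EN1: %w", &rpcError{Code: CodeResourceExhausted, Msg: ""})

	for _, err := range []error{notFound, exhausted} {
		if isNotFound(err) {
			fmt.Println("EN does not have the data:", err)
		} else {
			fmt.Println("EN failed for another reason:", err)
		}
	}
}
```

Using errors.As means the check still works when the EN's error has been wrapped with extra context on its way back to the caller.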

@AndriiDiachuk AndriiDiachuk linked a pull request May 22, 2024 that will close this issue