Firestore index intermittent failure creating resources - 404 Job ID not found #528

Open
evamaxfield opened this issue Jun 8, 2022 · 23 comments · Fixed by CouncilDataProject/cdp-backend#194
Labels
kind/bug (Some behavior is incorrect or out of spec) · upstream/service

Comments

@evamaxfield

What happened?

Certain tasks seem to be losing track of their Job IDs.

google-native:firestore/v1:Index event_minutes_item-event_ref_ASCENDING_index_DESCENDING **creating failed** error: waiting for completion / read state googleapi: Error 400: Invalid index resource id "S0E0MW1pSmdBQ0lDDCoDIGFjYjEwNDYwMGUzZC03Yzc5LWJiZjQtMjFiYy1hMTlmODBlMSQadGx1YWZlZAcSMXJoLXJleGVkbmktbmltZGERClIS".: polling operation status: googleapi: Error 404: Job ID not found: 1e08f91a-cb12-4fbb-97c7-d3e006401bca

Steps to reproduce

I can try to make a reproduction tomorrow, but this same workflow was working on v3.33.2.

Full log of failing on v3.34.0: https://github.com/CouncilDataProject/abqnm/runs/6803735092?check_suite_focus=true

Full log of different instance but same stack specification working on v3.33.2: https://github.com/CouncilDataProject/boston/runs/6783350773?check_suite_focus=true

Expected Behavior

No lost Job IDs / Succeeds

Actual Behavior

Fails with errors that are unfortunately too hard for me to debug :(

Versions used

pulumi: 3.34.0 (and working in v3.33.2)
pulumi_gcp: 6.26.0 (same version used in the successful deployment on pulumi base v3.33.2)
pulumi_google_native: 0.20.0 (same version used in the successful deployment on pulumi base v3.33.2)

python: 3.9.13

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@evamaxfield added the kind/bug (Some behavior is incorrect or out of spec) and needs-triage (Needs attention from the triage team) labels Jun 8, 2022
@evamaxfield changed the title from "Bug introduced in 3.34.0" to "404 Job ID not found -- Bug introduced in 3.34.0" Jun 9, 2022
@evamaxfield
Author

evamaxfield commented Jun 9, 2022

If I have the time next week, I may try to set up a GitHub Action in one of my repos that generates a stack and then tears it down every day, but installs pulumi alphas / betas, so that I can catch bugs like this and pulumi/pulumi#9694 before they are released.

@mikhailshilkov
Member

Thank you @JacksonMaxfield. I have a hard time thinking of a possible reason for this error that could be introduced in CLI 3.34.0... it feels like a provider problem. I expect there is a significant difference between your boston and abqnm projects that may cause it. Just a guess though.

I'll go ahead and move this issue to google-native.

@mikhailshilkov mikhailshilkov transferred this issue from pulumi/pulumi Jun 10, 2022
@evamaxfield
Author

Yea this definitely may be a google problem :/

@jkisk
Contributor

jkisk commented Jun 10, 2022

Thank you for opening the issue; there is some ongoing work in this provider that will hopefully address this.

@jkisk removed the needs-triage (Needs attention from the triage team) label Jun 10, 2022
@evamaxfield
Author

Still happening: https://github.com/CouncilDataProject/abq/runs/6885881433?check_suite_focus=true

Any way I can help? More debug logs?

@viveklak
Contributor

@JacksonMaxfield - seems like a number of the index resources are succeeding but one is not. I can only see the latest failure so can't correlate with prior runs. Some questions:

  1. Is this happening on every run?
  2. Is it the same index resource that is consistently failing?
  3. Any parameters to the index resource that seem to make it more likely the resource fails?
  4. Sounds like you are creating multiple indexes in parallel. On paper this could be an eventual consistency issue on Google's side. Wondering if adding some artificial dependsOn to serialize the index creation helps? (A sketch of that pattern follows below.)
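
For illustration, a minimal sketch of that serialization pattern in Python, assuming the pulumi_google_native firestore/v1 Index resource. The collections and fields below are made up for the example (they mirror names from the error output, not the reporter's actual stack code), and exact argument names may vary by SDK version:

```python
from typing import Optional

import pulumi
from pulumi_google_native.firestore import v1 as firestore

# (collection_group_id, [(field_path, order), ...]) -- illustrative data only.
INDEX_SPECS = [
    ("event_minutes_item", [("event_ref", "ASCENDING"), ("index", "DESCENDING")]),
    ("vote", [("event_ref", "ASCENDING"), ("person_ref", "ASCENDING")]),
]

previous: Optional[pulumi.Resource] = None
for collection, fields in INDEX_SPECS:
    name = f"{collection}-" + "_".join(f"{p}_{o}" for p, o in fields)
    index = firestore.Index(
        name,
        collection_group_id=collection,
        database_id="(default)",
        query_scope="COLLECTION",
        fields=[
            firestore.GoogleFirestoreAdminV1IndexFieldArgs(field_path=p, order=o)
            for p, o in fields
        ],
        # Chain each index on the previous one so only a single index
        # create operation is ever in flight at a time.
        opts=pulumi.ResourceOptions(depends_on=[previous] if previous else None),
    )
    previous = index
```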

@evamaxfield
Author

Sorry! I deleted the old repos because of how out of whack the stack was. Easier to start over than to fix the stack.

  1. Yes (for newly created stacks) -- there are other stacks that have been deployed for months that don't have the problem because they are simply checking for resource existence and not creation. For example, here are our "seattle" instance stack logs: https://github.com/CouncilDataProject/seattle/actions?query=workflow%3A%22Infrastructure%22

Those are succeeding every time.

It's happening when trying to create new stacks / resources for the first time -- it loses track of the Job ID.

run 1: ignorable because the repo didn't have google creds yet
run 2: https://github.com/CouncilDataProject/abq/runs/6885881433?check_suite_focus=true
run 3: https://github.com/CouncilDataProject/abq/runs/6905150159?check_suite_focus=true

  2. It looks like what is happening is that it is simply losing track of the job, because on the second try it has errors like "index already exists": google-native:firestore/v1:Index event_minutes_item-event_ref_ASCENDING_index_DESCENDING creating error: error sending request: googleapi: Error 409: index already exists: "https://firestore.googleapis.com/v1/projects/cdp-abq-1d34e1c4/databases/%28default%29/collectionGroups/event_minutes_item/indexes" map[collectionGroupId:event_minutes_item databaseId:(default) fields:[map[fieldPath:event_ref order:ASCENDING] map[fieldPath:index order:DESCENDING]] project:cdp-abq-1d34e1c4 queryScope:COLLECTION]

But it does also have some "Job ID lost" errors on the second run too. So it seems like some of them get through on the first attempt and some don't?

  3. Some indexes are succeeding and some are failing, so not entirely sure... They are all largely the same... What is interesting is that the pulumi stack isn't finding all the indexes we are sending it? There should be ~30 indexed field sets or something.

  4. I have tried removing the parallelism in prior runs too and that didn't work. I can try to artificially depend on the prior index.


Sidenote: I am wondering what google changed recently, because we have ~9 of these stacks deployed and none have ever had any problems.

@evamaxfield
Author

Update: I am running a stack create locally with the dependsOn suggestion and it is working; it is simply taking a very long time, ~3 minutes per index.

@evamaxfield
Author

Update: I can confirm that setting the indexes to create one after another fixes the problem. Never happened before but 🤷

Normally infrastructure deploys take ~3 minutes with this stack configuration; this setup now takes ~2 hours.

I am going to make the changes on the stack. Thanks for the idea.


I would love to add docs to the firestore index resource that say "recommended to create one index at a time" -- how would I go about doing that? Should I add the same docstring contribution to all the firestore index objects, etc.?

@viveklak
Contributor

@JacksonMaxfield thanks for trying that out! Agree this is definitely painful. I will bring this up with our Google contacts. Another option is to try splitting the indexes up into groups (of 2 or 3) instead of trying them entirely serially. Alternatively, you could remove the artificial ordering and try pulumi up --parallel <X>, which limits the number of concurrent operations to a max of X (example below). However, invoking them serially as you are doing is likely most reliable.
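
For example, with the concurrency cap set to 2 (the value here is arbitrary):

```shell
# Cap this update at 2 concurrent resource operations.
pulumi up --parallel 2
```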

> I would love to add docs to the firestore index resource that say "recommended to create one index at a time" -- how would I go about doing that?

Those files are auto-generated from discovery docs published by Google, so changes to those files will be overwritten. We can add a note with an overlay during our schema generation if necessary, but I'd like to get word back from Google first.

@evamaxfield
Author

> thanks for trying that out! Agree this is definitely painful. I will bring this up with our Google contacts. Another option is to try splitting the indexes up into groups (of 2 or 3) instead of trying them entirely serially. Alternatively, you could remove the artificial ordering and try pulumi up --parallel <X>, which limits the number of concurrent operations to a max of X. However, invoking them serially as you are doing is likely most reliable.

Extra notes: I wondered if it is limited by the collection the index is being attached to, i.e. you can only create a single index per collection at a time, but I tried that and it didn't work.

I have also seen it fail with parallelism = 2, so I am just going to leave it with this weird dependsOn tree right now.

> Those files are auto-generated from discovery docs published by Google, so changes to those files will be overwritten. We can add a note with an overlay during our schema generation if necessary, but I'd like to get word back from Google first.

Sounds good! Would love updates whenever you find out.

@evamaxfield
Author

Unfortunately, even after the recent change and local testing with the dependsOn approach, we are still getting failures... https://github.com/CouncilDataProject/albuquerque/runs/6909101822?check_suite_focus=true

@viveklak changed the title from "404 Job ID not found -- Bug introduced in 3.34.0" to "Firestore index intermittent failure creating resources - 404 Job ID not found" Jun 16, 2022
@viveklak
Contributor

Pulling out the relevant error message:

+  google-native:firestore/v1:Index vote-event_ref_ASCENDING_person_ref_ASCENDING creating error: waiting for completion / read state googleapi: Error 400: Invalid index resource id "S3M1eEZqSmdBQ0lDDCoDIDM3NzI2NGJhNTA3MS1iMTBhLWQyZjQtYTczZC01MmY0NTAyMSQadGx1YWZlZAcSMXJoLXJleGVkbmktbmltZGERClIS".: polling operation status: googleapi: Error 404: Job ID not found: 12054f25-d37a-4f2d-a01b-1705ab462773

@viveklak
Contributor

viveklak commented Jun 16, 2022

@JacksonMaxfield would you mind terribly rerunning the stack with pulumi -d --logflow --logtostderr -v=9 up --yes 2>&1 | tee /tmp/log and tarring up the logs and e-mailing them to me (first-name @ pulumi.com)? That should dump the provider's API payload to Google as well. To make grabbing the output easier, you can remove the artificial dependsOn trick, since it seems not to have conclusively helped.

@evamaxfield
Author

Will do this afternoon likely

@evamaxfield
Author

Sent! Those logs are long. Let me know if you need anything else or want me to try anything. Thank you so much for investigating.

@evamaxfield
Author

@viveklak any update on this? I just tried again and I am still experiencing this bug: https://github.com/CouncilDataProject/oakland/runs/7620442143?check_suite_focus=true

@evamaxfield
Author

I am going to try rolling back the pulumi version to something older, then try running the same script with Terraform; as a last ditch, I don't know, I guess I will rip out the firestore index creation?

@viveklak
Contributor

Apologies for the radio silence. I am just about to cut a v0.23.0 release with a bug fix which might be helpful (not sure). Would you be able to try it once it's out, before trying older releases?

@evamaxfield
Author

No worries on silence! We are all busy and I get it. Sorry if my messages came off as pester-y, I am simply using this issue to track the state of our own build infra too.

Will happily try a new release later today!

@viveklak removed the resolution/fixed (This issue was fixed) label Aug 11, 2022
@evamaxfield
Author

I have this working without pulumi, just with the firebase CLI tools. Unfortunately, I am going to head in that direction. (For reference, a sketch of that setup follows below.)
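
For anyone landing here with the same problem, a minimal sketch of that route, assuming the standard firestore.indexes.json format the firebase CLI consumes (the index shown is one from this issue's error output, for illustration only):

```json
{
  "indexes": [
    {
      "collectionGroup": "event_minutes_item",
      "queryScope": "COLLECTION",
      "fields": [
        { "fieldPath": "event_ref", "order": "ASCENDING" },
        { "fieldPath": "index", "order": "DESCENDING" }
      ]
    }
  ]
}
```

The indexes are then applied with:

```shell
firebase deploy --only firestore:indexes
```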
