releaseWF hits errors at update-marc step #4663

Open
andrewjbtw opened this issue Dec 20, 2023 · 3 comments

@andrewjbtw

In the last couple of weeks, there have been several days on which multiple druids got stuck in releaseWF with an error at the update-marc step:

update-marc : problem with update-marc (BackgroundJob: 30760096): {:errors=>[{:title=>"Update MARC error", :detail=>"Unexpected response: 400 {"message":"Unexpected error in InitProducerIdResponse; The request timed out.","type":"-3","code":"INTERNAL_SERVER_ERROR"}"}]}

When I reset the errors (by using the "reset" link from the workflow grid), most of them resolve on the retry. However, if there are 10 or more errors and they're all retried at the same time, a couple of druids usually fail a second time. I then reset those and they succeed on the next try. It seems like something may have changed on the FOLIO side, since the release process doesn't appear to have changed on the SDR side and this error wasn't happening earlier.

Additional information
I think that error may end up in Honeybadger (HB) as: https://app.honeybadger.io/projects/50568/faults/100168652 There's a noticeable increase after December 1. Note that the error in this ticket is different from sul-dlss/argo#4174, which had to do with the MARC record itself. When that error happens, retrying doesn't succeed because the MARC record itself needs to be updated.

@andrewjbtw andrewjbtw added the bug label Dec 20, 2023
@shelleydoljack

Is this error message the response from FOLIO? {"message":"Unexpected error in InitProducerIdResponse; The request timed out.","type":"-3","code":"INTERNAL_SERVER_ERROR"} I searched the folio-org repos for "InitProducerIdResponse" and didn't find anything, so it is difficult to tell what's going on. The only thing I can say is "sometimes FOLIO is slow, especially the mod-quickmarc endpoints". 🤷

@lwrubel
Contributor

lwrubel commented Jan 19, 2024

@shelleydoljack Unexpected error in InitProducerIdResponse is a Kafka exception, if that helps in looking at logs.

@andrewjbtw
Author

Since retrying the step from Argo seems to eventually work, I wonder if we could automate retries? That's assuming we could do it safely without overloading FOLIO with requests. I've been finding items with this issue and manually retrying approximately every week since this started.
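To make the automated-retry idea concrete, here is a minimal sketch of retrying with exponential backoff and random jitter, so that a batch of druids failing together does not hit FOLIO again at the same moment. Everything named here is hypothetical: `UpdateMarcError`, `transient?`, `with_backoff`, the attempt count, and the delays are assumptions for illustration, not code from the SDR or FOLIO codebases. It only retries errors whose message looks like the timeout quoted in this issue, since errors caused by the MARC record itself (as in sul-dlss/argo#4174) would not succeed on retry.

```ruby
# Hypothetical error class standing in for whatever the update-marc step raises.
class UpdateMarcError < StandardError; end

# Assumption: only the timeout message quoted in this issue is worth retrying.
TRANSIENT_PATTERN = /InitProducerIdResponse|The request timed out/

def transient?(error)
  error.message.match?(TRANSIENT_PATTERN)
end

# Runs the block, retrying transient failures with exponential backoff.
# The wait before retry n is base_delay * 2**(n - 1), scaled by a random
# factor in [1, 2) so simultaneous failures spread out their retries.
# `sleeper` is injectable so the waits can be observed or skipped in tests.
def with_backoff(max_attempts: 4, base_delay: 2.0, sleeper: ->(seconds) { sleep(seconds) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue UpdateMarcError => e
    # Give up on non-transient errors or once the attempt budget is spent.
    raise unless transient?(e) && attempt < max_attempts

    delay = base_delay * (2**(attempt - 1))
    sleeper.call(delay * (1 + rand))
    retry
  end
end
```

A caller would wrap whatever performs the update in `with_backoff { ... }`. Whether a cap of four attempts and a two-second base delay is gentle enough for FOLIO is an open question that would need checking against its actual behavior.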

Status: New Issues (Needs Triage)