
Error trying to harvest ERA via OAI #3435

Open

sfarnel opened this issue Apr 4, 2024 · 9 comments

sfarnel (Member) commented Apr 4, 2024

Describe the bug
EBSCO (who harvests ERA for inclusion in EDS, our discovery service) has recently (beginning mid-March) run into an issue when trying to harvest ERA. This is what they have described:

"We've recently encountered an issue with the OAI data harvests for your Institutional Repository, "ERA: University of Alberta Institutional Repository (ir00008a)". Harvest has been aborted by an error "ERROR: Could not harvest from https://era.library.ualberta.ca/oai: Error while harvesting using setSpec '90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867'. Will not continue.

Some reasons this might be happening:

--If your server is offline for a technical issue?

--If your server is shut down on a regular schedule?

--If we should still use the following base URL for our harvest attempts https://era.library.ualberta.ca/oai

In addition, is it possible that our IP address is blocked by a firewall? In order to access and harvest your IR records via OAI-PMH, we need the following IP addresses added to your network exceptions list:

140.234.252.9

140.234.253.9

140.234.255.9"

If it's of any help, I have been able to use the basic browser approach to get records from this set without an issue; see https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc

Not sure if this was a one-time issue or if something more substantial is happening.
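
For reference, the browser check above can also be scripted; this is only a minimal sketch (not what EBSCO's harvester actually runs), and note that OAI-PMH reports protocol-level problems inside the XML body rather than via HTTP status codes:

require 'net/http'
require 'uri'

uri = URI('https://era.library.ualberta.ca/oai/?verb=ListRecords' \
          '&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867' \
          '&metadataPrefix=oai_dc')

response = Net::HTTP.get_response(uri)
puts "HTTP status: #{response.code}"
puts 'OAI <error> element present in body' if response.body.include?('<error')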

sfarnel added the bug label Apr 4, 2024
pgwillia (Member) commented Apr 16, 2024

Spoke with @nmacgreg because he was working on making the firewall rules permanent. The IP addresses from the EBSCO message are not being blocked.

ConnorSheremeta (Contributor) commented:

I have a hunch that this may be caused by the seemingly irregular 3:00 am Redis reboots, which would bring this service down; I have yet to verify this, however.

jefferya (Contributor) commented Apr 22, 2024

To add another thought to the one above: below is an excerpt from the HAProxy log for IP 140.234.252.9 on April 1. Notice the HTTP 500 error in the first entry. The second entry is a request with the same URI; it succeeded with an HTTP 200 but took 5191 ms. Does this indicate that the OAI client sent a request, waited 5000 ms, timed out, and sent a second request, which caused the HTTP 500 error? I'm assuming the log timestamps are recorded at the end of the HTTP session (not the start), so the second request in the log actually started before the first.

Testing response time:

Why do two requests to the same URI result in different HTTP status codes (the shorter one is HTTP 500 while the longer is HTTP 200)?

Apr  1 14:00:50 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:50.082] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 99/0/1/355/+455 500 +476 - - ---- 205/67/2/2/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:5065d96a-6fcd-4870-aa03-6b94b9464064&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:55 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:50.537] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 5124/0/1/66/+5191 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:5065d96a-6fcd-4870-aa03-6b94b9464064&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:55 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:55.729] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-2 126/0/1/68/+195 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=81420852-41c5-4624-8b60-4188d702ded6:099e7f31-a5c1-4005-bfa7-b6c7dd6349b7&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"
Apr  1 14:00:56 telford haproxy[1188]: 140.234.252.9:10400 [01/Apr/2024:14:00:55.924] jupiter-ssl-beta~ jupiter2-backend/era-app-prd-1 93/0/0/66/+159 200 +1077 - - ---- 204/67/1/1/0 0/0 "GET /oai?verb=ListRecords&metadataPrefix=oai_dc&set=2c5be215-7943-434e-b96e-4390bc16b00a:28738add-6881-447b-a655-ca9331e3cadc&from=2024-02-01T21:00:23Z&until=2024-04-01T20:00:24Z HTTP/1.1"

Description of the HAProxy HTTP log format: https://sematext.com/blog/haproxy-logs/#http-format
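
For what it's worth, response times can also be sampled from outside with a quick script (a sketch only; the harvester's actual client behaviour and timeout are unknown), to see whether this URI sometimes takes longer than ~5 s:

require 'benchmark'
require 'net/http'
require 'uri'

uri = URI('https://era.library.ualberta.ca/oai?verb=ListRecords&metadataPrefix=oai_dc' \
          '&set=2c5be215-7943-434e-b96e-4390bc16b00a:5065d96a-6fcd-4870-aa03-6b94b9464064')

5.times do
  status = nil
  elapsed = Benchmark.realtime { status = Net::HTTP.get_response(uri).code }
  puts format('HTTP %s in %.0f ms', status, elapsed * 1000)
end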

nmacgreg commented Apr 24, 2024

RE: Connor's speculation about this being related to the reboot of Redis:

  • Yes, redis is rebooted weekly, as are all VMs
  • Yes, there is some randomness built into the day and time of the reboot

I have two questions:

  • Is redis used in responding to an OAI query? (I thought redis was a queuing subsystem... emoji for head-scratch!)
  • Recent testing shows:
    • 100% of queries to era-app-prd-1 get "200" responses,
    • ...but only ~10% of queries to era-app-prd-2 get "200"; the rest get "500"s. (A repeated-request tally of this kind is sketched below.)
    • (You must mangle your /etc/hosts for this test; don't forget to undo this!)
    • One Prod server works, the other is intermittent. How is this related to redis?
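
A rough sketch of the kind of repeated-request tally behind those percentages (assuming /etc/hosts has already been pointed at the backend under test, as noted above):

require 'net/http'
require 'uri'

uri = URI('https://era.library.ualberta.ca/oai/?verb=ListRecords' \
          '&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867' \
          '&metadataPrefix=oai_dc')

tally = Hash.new(0)
20.times { tally[Net::HTTP.get_response(uri).code] += 1 }
puts tally.inspect   # counts per HTTP status code for whichever backend /etc/hosts points at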

nmacgreg commented Apr 24, 2024

Jeff prodded me about this on Slack and brought it to my attention.

Note: to perform this testing, you must mangle your /etc/hosts to make era.library.ualberta.ca "point to" the host address of either era-app-prd-1 or era-app-prd-2, and it's frightfully important that you undo this change after testing!

Use this command for testing: curl "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

The HTTPD logs on the Prod server era-app-prd-2 do record the 500 response, but don't give any useful details.

OTOH, the production.log is pretty chatty about the 500 response, and hands off to Rollbar. Maybe this will give you a place to start? Can you recreate this in Dev, for instance? Or has anybody tried in Staging? Is it just this one URL, or are other URLs failing? What is the source of the data used to inform the response to this query -- redis, postgresql, or maybe an external web resource -- and can you test that backend to find the source of the flakiness?

I was unable to immediately find an explanation for this query working 100% of the time on one Prod server but intermittently failing on another. Our business processes are really carefully designed to prevent differences between servers. I'm open to the idea of either building a new Prod VM, or performing the kind of rebuild operation we'd use to recover from a bad update (e.g. delete the jupiter package, clean primary directories, use the playbook to reinstall) on era-app-prd-2. Find me on Slack if you want to pursue this; I'm stepping away.

jefferya (Contributor) commented Apr 24, 2024

A test from yesterday:

I tried an OAI request on each of the 3 prod Jupiter servers. The request works on only 2 of the 3, with the third returning a 500 error.

curl "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

I'm changing my /etc/hosts file to isolate each prod server:

  • prod-1: works
  • prod-2: fails
  • prod-3: works

The following also succeeds on 2 of 3 prod servers.

https://era.library.ualberta.ca/oai?verb=ListRecords&metadataPrefix=oai_dc

Fast forward to today: I tried reproducing yesterday's test and received different results (all 3 prod servers worked for both test URIs). What can change day-to-day and server-to-server?

  • ERA resources: no — a problem with the resources themselves would appear across all servers
  • state/data that is time-dependent and server-dependent (e.g., like the sitemap cache problem)
  • One test returned a resumption token (pagination); the other didn't (because all items fit in a single request)

Todo:

FYI, I think curl --resolve might be able to replace the need to change /etc/hosts:

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"


Update 11:55 (two hours later):
  • Below fails on prod-1 & prod-2; succeeds on prod-3

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867&metadataPrefix=oai_dc"

curl --resolve "era.library.ualberta.ca:443:${IP}" "https://era.library.ualberta.ca/oai/?verb=ListRecords&metadataPrefix=oai_dc"
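
A sketch that runs both test URIs against each backend via curl --resolve (the era-app-prd-* addresses are placeholders; substitute the real backend IPs):

backends = { 'era-app-prd-1' => 'PROD1_IP', 'era-app-prd-2' => 'PROD2_IP', 'era-app-prd-3' => 'PROD3_IP' }

uris = [
  'https://era.library.ualberta.ca/oai/?verb=ListRecords' \
  '&set=90b3539f-2198-4058-9786-5e3ccd9e671a:7966f60a-4353-4286-941b-dbd33cd74867' \
  '&metadataPrefix=oai_dc',
  'https://era.library.ualberta.ca/oai/?verb=ListRecords&metadataPrefix=oai_dc'
]

backends.each do |name, ip|
  uris.each do |uri|
    # -w '%{http_code}' prints only the status code; --resolve pins the hostname to one backend
    code = `curl -s -o /dev/null -w '%{http_code}' --resolve era.library.ualberta.ca:443:#{ip} "#{uri}"`
    puts "#{name}: HTTP #{code}"
  end
end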

jefferya self-assigned this Apr 24, 2024
ConnorSheremeta (Contributor) commented Apr 24, 2024

I had a chance to do some digging into this, as I am fairly familiar with Oaisys. Redis is leveraged in Oaisys as a key-value store to handle resumption tokens (a defined OAI-PMH parameter). I believe Redis is on one server; with the discovery of the server discrepancy, the Redis reboots were likely a red herring.
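
Conceptually (and this is only illustrative — Oaisys' actual key names, TTLs and payloads will differ), a Redis-backed resumption-token store has roughly this shape, which is why a Redis restart could in principle break an in-flight harvest but would not explain one backend failing while another succeeds:

require 'json'
require 'redis'
require 'securerandom'

redis = Redis.new                      # assumes the default localhost:6379 instance

token = SecureRandom.hex(16)
redis.set("oai:resumption:#{token}",   # key naming invented for illustration only
          { set: 'some-set-spec', cursor: 100 }.to_json,
          ex: 3600)                    # tokens expire; a restart or expiry invalidates them

state = JSON.parse(redis.get("oai:resumption:#{token}") || '{}')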

I spent some time in the Jupiter/Oaisys/ActsAsRdfable code bases and have narrowed some things down. I believe this is the piece of code throwing the error (https://github.com/ualbertalib/acts_as_rdfable/blob/e7ae5f76157139bae3804230a0d789affee1a9af/lib/acts_as_rdfable/acts_as_rdfable_core.rb#L26). format is given by Oaisys through the metadataPrefix URL param (https://github.com/ualbertalib/oaisys/blob/master/app/controllers/oaisys/pmh_controller.rb#L102 --> https://github.com/ualbertalib/oaisys/blob/4d0f657480658343ba01cfe39f71ef0060ccdcab/app/views/oaisys/pmh/list_records.xml.builder#L33), which appears to be present in these requests. I wonder if it is related to this change (ualbertalib/acts_as_rdfable@e6f8604).

I did notice that the first occurrence we have of this error in Rollbar was Feb 9th, 2024 (https://app.rollbar.com/a/ualbertalib/fix/item/jupiter/1833), which more or less lines up with the release containing the change linked above (Jupiter version 2.7.0, released Jan 20, 2024) and adds to my suspicion; that release also included a major Rails upgrade, however. One thing that is likely unrelated but came to mind is an old ticket in Shortcut (previously Clubhouse, which was used for project management at the time), a screenshot of which is included below:

[screenshot: oaisys ticket from Shortcut]

Stepping away from this rabbit hole for now

jefferya (Contributor) commented Apr 26, 2024

I suspect Connor's message is on the right track. I've been unable to find another explanation, nor have I been able to reproduce the error reliably.

Expanding a bit on what I think Connor is saying:

  • format is correctly passed into the request (otherwise, validation earlier in the call stack would trigger a different error)
  • the singleton_class / singleton_method change is probably the best area for further investigation: ualbertalib/acts_as_rdfable@e6f8604
raise InvalidArgumentError unless self.class.supported_metadata_serialization_formats.include?(format)

https://github.com/ualbertalib/acts_as_rdfable/blob/e7ae5f76157139bae3804230a0d789affee1a9af/lib/acts_as_rdfable/acts_as_rdfable_core.rb#L26

Thoughts:

  • what does the change from self.singleton_class.supported_metadata_serialization_formats to self.class.supported_metadata_serialization_formats impact? (A toy sketch of the distinction follows below.) Is the inability to reproduce consistently over time related to object caching (https://shopify.engineering/caching-without-marshal-part-one)?
  • does the singleton pattern fail when transitioning from oai_dc to oai_etdms, or when transitioning from theses to items?
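
A toy sketch of the class vs. singleton_class distinction (this is not acts_as_rdfable's actual code, just the underlying Ruby mechanics): state attached to one object's singleton class is invisible to other instances, while class-level state is shared, so an include?(format) check can behave differently depending on where the formats were registered.

class Item; end

a = Item.new
b = Item.new

# State registered on one object's singleton class is per-object:
a.singleton_class.define_method(:formats) { [:oai_dc, :oai_etdms] }
a.formats                 # => [:oai_dc, :oai_etdms]
b.respond_to?(:formats)   # => false -- the other instance never sees it

# State registered at the class level (what a self.class lookup relies on) is shared:
Item.define_singleton_method(:supported_metadata_serialization_formats) { [:oai_dc] }
a.class.supported_metadata_serialization_formats   # => [:oai_dc]
b.class.supported_metadata_serialization_formats   # => [:oai_dc]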

jefferya removed their assignment May 3, 2024
jefferya (Contributor) commented May 3, 2024

I've updated the previous comments with my existing thoughts. My Library sprint ends this week, so I'm returning the issue to the queue in case another person has cycles to move this forward.

ConnorSheremeta self-assigned this May 24, 2024

No branches or pull requests

5 participants