Ed Summers edited this page Mar 6, 2023 · 5 revisions

In 2023, H2 added a Globus integration that allows users to deposit large amounts of data, both in terms of the number of files and their sizes. For instructions on how this is done, please see this demo video.

Constraints

Despite the ability to create deposits with Globus, there are (currently) practical constraints on moving data through the SDR, which were diagnosed during a production test of a 500 GB object containing 19,000 files (hj302gv2126). These are some notes on changes made in March 2023 to improve SDR processes so that this large deposit could complete. They are meant to inform future efforts to scale SDR's ability to process large deposits.

  • The Technical Metadata step in the Accessioning workflow was not able to find files. This was because the Accessioning workflow was being kicked off twice, and the amount of data triggered a race condition that ordinarily doesn't manifest for Globus deposits. https://github.com/sul-dlss/happy-heron/issues/3007
  • H2 was unable to fetch a list of all the files in a deposit from Globus when the deposit had a large number of files (in this case 19,000). The GlobusClient.list_files() method in our globus_client library needs to issue an API call for every directory contained in the user's Globus upload directory in order to get a complete list of files. This seemed to encounter intermittent connection terminations in production (on sul-h2-prod) but not in our staging or development environments. Since we were unable to determine why the production network behaved this way, our solution was to retry failed HTTP calls using faraday-retry. https://github.com/sul-dlss/happy-heron/issues/3008
  • H2's call to update a deposit using the resource update endpoint in the SDR API was timing out, and increasing the timeout to 30 minutes didn't help. We discovered that this API request was taking a long time because it was generating missing digests for Globus deposits, since digests are not available from the Globus API itself. The solution was to move fixity digest generation into the SDR API's background IngestJob and UpdateJob jobs instead of doing it as part of generating the HTTP response. See https://github.com/sul-dlss/happy-heron/issues/2995
  • The H2 application encountered a socket timeout when trying to update the database after waiting a long time for the Globus list-files operation to complete. Rails' ActiveRecord holds on to database connections, and the idle network connection between sul-h2-prod and sul-h2-db-prod was being severed by a firewall rule. The solution was to re-open all database connections after returning from the (potentially long) GlobusClient.list_files() operation. This is a pattern we've had to use elsewhere in the SDR. https://github.com/sul-dlss/happy-heron/issues/3019
  • Once deposited, it takes about 40 seconds for Argo to render the item view for druid:hj302gv2126. But at least it renders, eventually.
  • Once shelved, it takes about a minute for the PURL to completely render the file listing: https://purl.stanford.edu/hj302gv2126
  • The sdr-client encounters a timeout when retrieving metadata for this large object from outside the VPN (which some users of the SDR API might be doing?). It appears to get cut off after two minutes. For example: sdr get druid:hj302gv2126
  • The reset-workspace step in the Accessioning workflow encounters a network timeout. https://github.com/sul-dlss/common-accessioning/issues/1039
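
The list-files fix above relies on retrying failed HTTP calls with backoff, which H2 gets from the faraday-retry middleware. The sketch below is a minimal stand-alone version of the same idea (capped attempts with exponential backoff); the method name and parameters are illustrative, not the actual H2 code:

```ruby
# Minimal retry-with-backoff sketch, analogous to what faraday-retry does
# for GlobusClient's HTTP calls. All names here are illustrative.
def with_retries(max_attempts: 3, base_interval: 0.5, retry_on: [StandardError])
  attempt = 0
  begin
    attempt += 1
    yield
  rescue *retry_on
    raise if attempt >= max_attempts          # give up after the final attempt
    sleep(base_interval * (2**(attempt - 1))) # exponential backoff between tries
    retry
  end
end
```

Usage might look like `with_retries(retry_on: [Errno::ECONNRESET]) { client.list_files }`, so an intermittent connection termination only fails the request after several attempts.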
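
The database-timeout fix (#3019) follows a reusable pattern: do the long-running external work, then discard possibly-severed database connections before touching the database again. Here is a hedged sketch of that pattern using a duck-typed connection pool; in a Rails app the pool would be ActiveRecord::Base.connection_pool, which responds to release_connection, but the exact happy-heron change may differ in detail:

```ruby
# Sketch of the "refresh DB connections after a long external call" pattern.
# `pool` is anything responding to #release_connection, e.g.
# ActiveRecord::Base.connection_pool. (Illustrative, not the exact H2 code.)
def after_long_operation(pool)
  result = yield          # e.g. GlobusClient.list_files, which can take minutes
  pool.release_connection # drop the connection a firewall may have severed;
                          # a fresh one is checked out on the next DB access
  result
end
```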

Since network timeouts (for both HTTP API calls and database connections) were a common theme in these difficulties, we expect to remediate these problems through a combination of:

  1. Moving expensive work into background jobs (e.g. Sidekiq) which then issue a callback of some kind to indicate completion.
  2. Replacing HTTP API calls with RabbitMQ messages which can be picked up and responded to asynchronously.
  3. Partitioning Cocina responses into multiple HTTP resources using HATEOAS links and paging. Instead of getting all the metadata and files for an object in a single request, the API and the client should support paging through resources.
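
Remediation (1) is essentially what #2995 already did for fixity digests: compute MD5/SHA1 in a background job rather than while generating the HTTP response. Below is a self-contained sketch of the digest-generation part using Ruby's standard Digest library; in the SDR API this work lives inside IngestJob/UpdateJob, and the class and method names here are hypothetical:

```ruby
require 'digest'

# Hypothetical background-job body: compute missing fixity digests for a set
# of files, streaming each file in chunks so large files need not fit in
# memory. In the SDR API this runs inside IngestJob/UpdateJob, not in the
# request/response cycle.
class GenerateDigestsJob
  # Returns { path => { md5:, sha1: } } for each file path given.
  def perform(paths)
    paths.to_h do |path|
      md5 = Digest::MD5.new
      sha1 = Digest::SHA1.new
      File.open(path, 'rb') do |io|
        while (chunk = io.read(64 * 1024))
          md5 << chunk
          sha1 << chunk
        end
      end
      [path, { md5: md5.hexdigest, sha1: sha1.hexdigest }]
    end
  end
end
```

Because the job owns this loop, a 19,000-file deposit no longer has to finish digesting before the HTTP response can be returned.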
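
Remediation (3) could look like offset paging of an object's file list, where each page carries a HATEOAS-style link to the next page instead of returning all 19,000 entries at once. A minimal sketch, in which the endpoint shape and parameter names are assumptions rather than the current SDR API:

```ruby
# Hypothetical paging of a large file list: return one page of entries plus
# a link to the next page (nil when there are no more pages).
def page_of_files(files, base_url:, page: 1, per_page: 100)
  start = (page - 1) * per_page
  slice = files[start, per_page] || []
  more = start + per_page < files.size
  {
    files: slice,
    links: { next: more ? "#{base_url}?page=#{page + 1}&per_page=#{per_page}" : nil }
  }
end
```

A client then follows the next link until it is nil, keeping each individual request (and response) small enough to avoid the two-minute cutoffs seen above.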