
Investigating timeouts from large objects (many files)


Overview

It is possible for objects with many files, and therefore very large cocina representations, to time out during processing. This is because some steps are asynchronous and need to load/pass along very large data structures over API calls, and those calls can exceed the client timeout.

An investigation started when an object with >6k resources (and >24k files) was timing out during assemblyWF. The ticket is https://github.com/sul-dlss/common-accessioning/issues/1005, the object is https://argo.stanford.edu/view/druid:rm242zg2446, and the HB alert is https://app.honeybadger.io/projects/52894/faults/88727290

The fix was to extend the timeout for dor-services-app calls made via dor-services-client: https://github.com/sul-dlss/dor-services-client/pull/306
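
For context, dor-services-client makes its HTTP calls via Faraday, so the knob being turned in that PR is a standard Faraday request timeout. A minimal sketch of that mechanism (placeholder URL and values, not what the gem actually sets; see the PR for the real change):

require 'faraday'

# standard Faraday request options control how long the client waits
conn = Faraday.new(
  url: 'https://dor-services.example.com',    # placeholder URL, not the real DSA endpoint
  request: { timeout: 900, open_timeout: 60 } # seconds to wait for the full response / to open the connection
)
# any request made via this connection now uses the longer timeouts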

Investigating

Investigation into the issue involved looking at the HB alert, determining that the timeout occurred when posting data to dor-services-app, and then walking through the code on the common-accessioning console and then on dor-services-app to see what was taking a long time. This also involved saving large data structures to disk and reloading them on DSA.

Create the checksum data (this is what the checksum-compute robot does; see the code in common-accessioning):

ssh lyberadmin@common-accessioning-prod-b
cd common-accessioning/current
ROBOT_ENVIRONMENT=production bin/console
druid='druid:rm242zg2446'

robot = Robots::DorRepo::Assembly::ChecksumCompute.new
assembly_item = robot.send(:item,druid)
cocina_model = assembly_item.cocina_model;nil

# this took about 1 hour and produced data to save (which would have then been posted to DSA)
file_sets = robot.send(:compute_checksums, assembly_item, cocina_model);nil

# save data to a file
File.open("../../shared/tmp/#{druid}_file_sets.dump", 'wb') { |f| f.write(Marshal.dump(file_sets)) }

# if you want to keep working with this data later on common-accessioning without re-creating it, you can reload it
# file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump"));nil # load up the data

# this generates the updated cocina that gets posted to DSA, and is what was timing out
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));nil
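
To get a rough sense of how large the payload being posted to DSA is, you can check the size of the marshaled dump written above (only a proxy, since the actual POST body is JSON rather than a Marshal dump):

# rough proxy for the size of the structural data that will be posted to DSA
size_mb = File.size("../../shared/tmp/#{druid}_file_sets.dump") / (1024.0 * 1024)
puts "file_sets dump is #{size_mb.round(1)} MB"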

Now move data to DSA:

ssh dor_services@dor-services-app-prod-a.stanford.edu
cd dor_services/shared/tmp
sftp lyberadmin@common-accessioning-prod-b
cd common-accessioning/shared/tmp
get druid:rm242zg2446_file_sets.dump
exit
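
Alternatively, a single scp from the DSA box should work if you prefer not to use an interactive sftp session (a sketch using the same hosts and paths as above):

# run from dor_services/shared/tmp on dor-services-app-prod-a
scp lyberadmin@common-accessioning-prod-b:common-accessioning/shared/tmp/druid:rm242zg2446_file_sets.dump .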

Now load up the data in DSA and try the update:

cd ../../current   # back to the app root before starting the rails console
bundle exec rails c -e p

conf.echo = false # prevents echoing of large data structures to the console

druid='druid:rm242zg2446'

file_sets = Marshal.load(File.read("../../shared/tmp/druid:rm242zg2446_file_sets.dump")) # load up the data

# this is the first thing DSA does on the API call and can take a while, potentially causing a timeout
cocina_model = CocinaObjectStore.find(druid) 

# this is what the checksum-compute robot creates to post to DSA 
# (ie this bit happens in the robot, just redoing it on DSA so we can use the data):
# https://github.com/sul-dlss/common-accessioning/blob/main/lib/robots/dor_repo/assembly/checksum_compute.rb#L17
updated = cocina_model.new(structural: cocina_model.structural.new(contains: file_sets));

# this is done by dor-services-client to prep the data to post to DSA
params = Cocina::Models.without_metadata(updated).to_json

# this is what the update action in DSA does (when it gets called from dor-services-client):
# https://github.com/sul-dlss/dor-services-app/blob/main/app/controllers/objects_controller.rb#L33
# in the controller the line is `Cocina::Models.build(params.except(:action, :controller, :id).to_unsafe_h)`,
#  but on the console our `params` above is a JSON string, so parse it into a hash first
# this should build the updated object, but it just blows up with a ton of output to the console, which
#  we have not figured out how to suppress...suspect it is from a conversion from Struct to Hash under the hood somewhere
cocina_object = Cocina::Models.build(JSON.parse(params))

# save the object (we can never get here because of all the output above)
UpdateObjectService.update(cocina_object, skip_lock: true);
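
If you want to quantify which step is slow (rather than only seeing the client time out), the standard library Benchmark can be used on the DSA console; a sketch around the same find call used above:

require 'benchmark'

# elapsed seconds for the initial cocina load; compare against the dor-services-client timeout
elapsed = Benchmark.realtime { CocinaObjectStore.find(druid) }
puts "CocinaObjectStore.find took #{elapsed.round(1)}s"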

Testing Longer Timeouts

Want to test a long-running call with a longer timeout manually? For example, the #reset-workspace call?

Go to common-accessioning and paste in the Dor::Services::Client class (https://github.com/sul-dlss/dor-services-client/blob/main/lib/dor/services/client.rb) from the dor-services-client gem, increasing the timeout or adjusting other parameters as needed before pasting it in.

Then configure the client, and look at the class to see if it contains your update. You can then use the client on the console with the updated timeouts.

ssh lyberadmin@common-accessioning-prod-b
cd common-accessioning/current
ROBOT_ENVIRONMENT=production bin/console

## paste in Dor::Services::Client class after adjusting

# configure client
Dor::Services::Client.configure(url: Settings.dor_services.url, token: Settings.dor_services.token, enable_get_retries: true)

druid='druid:rm242zg2446'
client = Dor::Services::Client.object(druid)
# see if your options stuck by examining class here

# test the call
client.workspace.reset

In this particular case, you can even just see what dor-services-client is doing and perform the action directly on DSA, without being concerned with timeouts. Here, for example, it is just making a single reset call in DSA:

ssh dor_services@dor-services-app-prod-a.stanford.edu
cd dor_services/current
bin/rails c -e p

# verify the version 
ResetWorkspaceService.reset(druid: 'druid:rm242zg2446', version: 1)
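
If you are not sure which version to pass, the cocina object itself carries the current version; a quick sketch using the same CocinaObjectStore.find call shown earlier on this page:

# check the object's current version before resetting (run on the DSA console)
CocinaObjectStore.find('druid:rm242zg2446').version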