Ingest Queue Design Ideas #1062

terrywbrady · 2022-05-23T20:02:09Z

Design Doc

https://github.com/CDLUC3/mrt-doc/tree/main/design/queue-2023

Use Cases to Consider

Throughput management
- Limit content download rate to ensure that already downloaded content can be processed quickly (cost burden shifts to Merritt once it has been downloaded)
- Halt download of non-priority content once the overall downloaded content on EFS exceeds threshold (2T?, 4T?)
- Halt download of content from a single collection until already downloaded content has been processed (.5T)
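
The throughput rules above could be expressed as a small admission check. This is a minimal Python sketch, not Merritt code: the class name `DownloadGate`, its method, and the default limits are hypothetical (the notes only float 2T or 4T for the overall threshold). It assumes priority content bypasses the overall limit but not the per-collection one.

```python
from dataclasses import dataclass

TIB = 1024 ** 4  # bytes in one tebibyte


@dataclass
class DownloadGate:
    """Decides whether a new download may start, given bytes already on EFS."""

    overall_limit: int = 2 * TIB      # assumed default; the notes say "2T?, 4T?"
    collection_limit: int = TIB // 2  # the 0.5T per-collection figure

    def may_download(self, total_bytes: int, collection_bytes: int,
                     priority: bool = False) -> bool:
        # A single collection waits until its own downloaded backlog is processed.
        if collection_bytes > self.collection_limit:
            return False
        # Non-priority content halts once overall EFS usage exceeds the limit.
        if total_bytes > self.overall_limit and not priority:
            return False
        return True
```

A caller would pass the current EFS totals before each download, for example `DownloadGate().may_download(total_bytes, collection_bytes, priority=is_priority)`.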
Execution
- Any task should be runnable on any ingest server
  - Ensure notification is delivered once all jobs for a batch across all servers have completed
- Number of server instances can grow/shrink
- Limit number of threads that can be occupied by any one profile/batch to prevent starvation
Prioritization
- Ensure that once Dryad becomes a regular depositor, it will never face queue starvation
- Ensure that the demo collection has expedited priority when verifying software changes
- Allow nuxeo feed jobs to queue and be processed without impacting other workloads
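
One way to combine the per-profile thread limit with prioritization is to pick the most urgent pending job whose profile is still under its cap. This is a hedged sketch only: `pick_next_job`, the tuple layout, and the default cap of 2 are assumptions made for illustration, and a lower priority number is assumed to mean more urgent.

```python
from collections import Counter


def pick_next_job(pending, running_profiles, max_per_profile=2):
    """Return the id of the most urgent pending job whose profile is under its cap.

    pending: list of (priority, profile, job_id); lower priority = more urgent.
    running_profiles: profiles of the jobs currently occupying worker threads.
    """
    in_use = Counter(running_profiles)
    for priority, profile, job_id in sorted(pending):
        if in_use[profile] < max_per_profile:
            return job_id
    return None  # every pending profile is already at its cap
```

With a cap in place, a large nuxeo feed can fill the queue without occupying every thread, so a later Dryad or demo job still finds a free worker.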

Brainstorms

Current Process

| Phase | Prioritization | Actions | EFS | Queue | Memory |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Download | Number of jobs + profile properties | Download items | Create batch folder; create job folders; download | Save bid and jid to queue; set priority | Batch state |
| Phase 2: Ingest | Queue priority | Pull job based on priority; process jobs; notify storage and inventory | Build system files; delete after saved to storage | Update queue item state | |

Proposed Process

| Phase | Prioritization | Actions | EFS | Queue | Memory |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Estimate and Prioritize | Number of BYTES + profile properties (or by bytes and files for more accuracy) | Download manifests; perform HEAD requests on all downloads; estimate bytes per job. | Create batch folder; download manifests; create job folders? | Create batch state object. Save notification details to QUEUE. Write preliminary size estimate to queue. | None |
| Phase 2: Download | Limit downloads when calculated bytes in EFS exceeds threshold. Dryad is an exception. | Download content to EFS. Revise byte counts and overall priority. | Download content to job folders. | Update bytes and priority in queue. | |
| Phase 3: Ingest | Queue priority. | Pull job based on priority; process jobs; notify storage and inventory | Build system files; delete after saved to storage | Update queue item state. Update batch state. | |
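
The Phase 1 estimate could be sketched as follows, assuming each download URL answers a HEAD request with a `Content-Length` header. The function names and the `probe` parameter are hypothetical. The sketch returns the number of URLs with no usable size alongside the byte total, since that count is one possible signal of how reliable the estimate is.

```python
import urllib.request
from typing import Callable, Iterable, Optional, Tuple


def head_content_length(url: str, timeout: float = 10.0) -> Optional[int]:
    """Issue a HEAD request and return Content-Length, or None if it is absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def estimate_job_bytes(
    urls: Iterable[str],
    probe: Callable[[str], Optional[int]] = head_content_length,
) -> Tuple[int, int]:
    """Return (known_bytes, unknown_count) for one job's download URLs."""
    known, unknown = 0, 0
    for url in urls:
        try:
            size = probe(url)
        except (OSError, ValueError):
            # Network or HTTP errors and malformed headers count as unknown.
            size = None
        if size is None:
            unknown += 1
        else:
            known += size
    return known, unknown
```

Passing a different `probe` lets the estimate be exercised without network access.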

terrywbrady · 2022-05-27T18:41:48Z

Privileged daemon for Dryad and other priority collections
Separate queuing instances from worker instances to ensure response to queue requests

terrywbrady · 2023-02-27T17:35:20Z

Brainstorming idea..

| State | Current Action | Future Action |
| --- | --- | --- |
| Pending | Awaiting Queue based on Priority | Awaiting Estimation based on Priority |
| | Evaluate Held Collections | Evaluate Held Collections |
| Held | Awaiting Release back to Pending | No Change |
| Estimating | n/a | Evaluate if dynamic provisioning is needed |
| | | Evaluate if resources need to be released (i.e., download space) |
| | | Evaluate if hold is in place for primary storage node |
| Provisioning | n/a | Waiting for resources to be provisioned or freed |
| Downloading | n/a | Restart download |
| Ingesting (write to storage) | n/a | Re-process ingest |
| | | Re-queue (possibly to a different server) |
| Recording | n/a | Inventory processing |
| Reporting | n/a | Assemble report, especially for batches |
| | | Re-send report |
| | | Fire Callback |
| | | Callback retries |
| Consumed | Downloading + Ingesting + Reporting | n/a |

elopatin-uc3 · 2023-04-13T17:10:45Z

Break out this table into two parts -- Figure out which of these states are batch vs. job states
Manifest submission, zip submission, regular object submission:

imagine each of these scenarios going through these job states
e.g. zip file going through states: batch process will do downloading, then multiple jobs processed
data elements in batch queue record and the job queue record

terrywbrady · 2023-04-13T18:12:06Z

Batch State
- Pending
- Held
  - Collections
  - Primary storage node
- Estimating
  - HEAD requests to validate object size
  - How reliable will this be
  - Priority boost if reliable
- Provisioning (dynamic ZFS creation or space calculations)
- Active Jobs State
- Reporting
  - assemble email
  - delete jobs
- Failed
- Completed
- Cancelled
- Deleted
Job State
- Pending
- Downloading
- Ingesting
- Recording
- Download Failed
- Ingest Failed
- Inventory Failed
- Reporting (callback)
- Completed
- Cancelled
- Deletion of job records is exclusively handled by the batch
Job Completion Queue
- triggers updates in batch state
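
The job states above and the completion-queue trigger could be modeled roughly as below. This is an illustrative sketch, not the agreed design: `BatchTracker`, the `TERMINAL` grouping, and the rule that a batch moves from "Active Jobs" to "Reporting" once every job has finished are all assumptions.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    DOWNLOADING = "Downloading"
    INGESTING = "Ingesting"
    RECORDING = "Recording"
    DOWNLOAD_FAILED = "Download Failed"
    INGEST_FAILED = "Ingest Failed"
    INVENTORY_FAILED = "Inventory Failed"
    REPORTING = "Reporting"
    COMPLETED = "Completed"
    CANCELLED = "Cancelled"


# Assumed grouping: job states after which the job needs no further work.
FAILED = {JobState.DOWNLOAD_FAILED, JobState.INGEST_FAILED, JobState.INVENTORY_FAILED}
TERMINAL = FAILED | {JobState.COMPLETED, JobState.CANCELLED}


class BatchTracker:
    """Tracks the jobs of one batch and derives the batch state from them."""

    def __init__(self, job_ids):
        self.states = {job_id: JobState.PENDING for job_id in job_ids}

    def on_job_event(self, job_id, state):
        """Handle one message from the job completion queue."""
        self.states[job_id] = state
        return self.batch_state()

    def batch_state(self):
        # Assumption: the batch reports once no job has work left to do.
        if all(state in TERMINAL for state in self.states.values()):
            return "Reporting"
        return "Active Jobs"
```

Because only the batch reads the completion queue, workers on any server can report job outcomes without coordinating with each other.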
Use Cases
- single file batch (UI) - 1 batch, 1 job, 1 file
- object manifest (mult file) - 1 batch, 1 job, X files
- manifest of objects - 1 batch, N jobs, N files
- manifest of object manifests - 1 batch, N jobs, X files
- container of digital files - 1 batch, 1 job, N files
- container of manifest - not supported
- manifest of containers (exploded) - 1 batch, N jobs, X files
- manifest of binary zip files (not exploded) - 1 batch, N jobs, N files

terrywbrady · 2023-04-13T18:25:42Z

Batch Queue
- job count
  - job completed count
  - job failed count
  - job cancelled count
  - job started count
  - job deleted count
- profile
- submitter
- payload file name
- start time
- end time
- list of zk ids
- byte estimate
Job Queue Object
- current
  - status
  - zk id number (includes priority)
  - job id
  - batch id
  - created time
  - submitter
  - profile name
  - job type
  - primary id / ark (empty if add)
  - local id
  - payload file name
  - update vs create
- new fields needed
  - storage manifest url
  - byte estimate
  - ingest worker hostname
  - storage start time
  - error detail
  - download retry count
  - ezid retry count
  - callback retry count
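
As a rough illustration of the two record shapes, the fields listed above map onto a pair of Python dataclasses. This sketch covers only a subset of the fields, the class and attribute names are hypothetical, and it derives the batch counts from the job list rather than storing them, which is an assumption and not something the notes decide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class JobRecord:
    job_id: str
    batch_id: str
    status: str = "Pending"
    profile_name: str = ""
    submitter: str = ""
    payload_file_name: str = ""
    primary_id: Optional[str] = None  # ark; empty if the job is an add
    local_id: Optional[str] = None
    # Proposed new fields
    byte_estimate: Optional[int] = None
    ingest_worker_hostname: Optional[str] = None
    error_detail: Optional[str] = None
    download_retry_count: int = 0
    ezid_retry_count: int = 0
    callback_retry_count: int = 0


@dataclass
class BatchRecord:
    batch_id: str
    profile: str
    submitter: str
    payload_file_name: str
    jobs: List[JobRecord] = field(default_factory=list)

    def job_count(self) -> int:
        return len(self.jobs)

    def count(self, status: str) -> int:
        """Number of jobs currently in the given status."""
        return sum(1 for job in self.jobs if job.status == status)

    def byte_estimate(self) -> int:
        """Sum of the job estimates that are known so far."""
        return sum(job.byte_estimate or 0 for job in self.jobs)
```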

terrywbrady · 2023-04-13T18:45:15Z

Microservice Changes

All ingest workers use 1 queue
Inventory uses ingest queue
Decompose ingest to multiple microservices? Or rely on existing daemons.
- batch preparation - api handling
  - initiates queue entries
- batch reporting - daemon
  - cleans up queue entries
- job handling - normal queue - daemon
  - queue driven
- job handling - priority queue - daemon
  - queue driven
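
The split between a normal and a priority job-handling daemon might look like the sketch below, which uses in-process `queue.Queue` objects as stand-ins for the real queue service. `IngestQueues`, `run_daemon`, and the profile names used to select the priority queue are placeholders.

```python
import queue


class IngestQueues:
    """Batch preparation side: routes each new job to one of two queues."""

    def __init__(self, priority_profiles):
        self.priority_profiles = set(priority_profiles)
        self.normal = queue.Queue()
        self.priority = queue.Queue()

    def submit(self, job_id, profile):
        """Create the queue entry on the queue that matches the profile."""
        if profile in self.priority_profiles:
            target = self.priority
        else:
            target = self.normal
        target.put((job_id, profile))
        return target


def run_daemon(jobs, handle):
    """Job-handling daemon: drain one queue, calling handle(job_id, profile)."""
    processed = 0
    while True:
        try:
            job_id, profile = jobs.get_nowait()
        except queue.Empty:
            return processed
        handle(job_id, profile)
        processed += 1
```

Each daemon reads from one queue only, so a backlog on the normal queue cannot delay jobs on the priority queue.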

terrywbrady · 2023-04-18T22:59:14Z

Note from 3/18 IAS meeting:

will ZK evolve with our evolving configuration plans?
AWS SQS is very simple ... could our needs actually be simple
AWS also offers Amazon MQ

elopatin-uc3 · 2024-01-17T20:36:47Z

Latest notes: https://github.com/CDLUC3/mrt-doc/blob/main/design/queue-2023/states.md

terrywbrady · 2024-04-04T22:45:35Z

@mreyescdl and @elopatin-uc3 , are you ok if we mark this done on the assumption that we have/will have new tickets to cover the work?

elopatin-uc3 · 2024-04-04T22:57:58Z

That sounds fine to me @terrywbrady

terrywbrady mentioned this issue Jun 13, 2022

Queue throttling at the collection level #1088

Open

mreyescdl mentioned this issue Nov 22, 2022

[Ingest] High Priority Consumer Daemon #1289

Closed

elopatin-uc3 mentioned this issue Apr 4, 2023

Replace synchronous ingest --> storage calls with a storage --> ingest callback mechanism #1179

Closed

terrywbrady pinned this issue Jun 1, 2023

elopatin-uc3 assigned terrywbrady and mreyescdl Mar 27, 2024

elopatin-uc3 added the Sprint 101 label Apr 17, 2024

elopatin-uc3 closed this as completed Apr 19, 2024
