Ingest Queue Design Ideas #1062

Closed
terrywbrady opened this issue May 23, 2022 · 10 comments
@terrywbrady
Contributor

terrywbrady commented May 23, 2022

Design Doc

https://github.com/CDLUC3/mrt-doc/tree/main/design/queue-2023

Use Cases to Consider

  • Throughput management
    • Limit content download rate to ensure that already downloaded content can be processed quickly (cost burden shifts to Merritt once it has been downloaded)
    • Halt download of non-priority content once the overall downloaded content on EFS exceeds threshold (2T?, 4T?)
    • Halt download of content from a single collection until already downloaded content has been processed (.5T)
  • Execution
    • Any task should be runnable on any ingest server
      • Ensure notification is delivered once all jobs for a batch across all servers have completed
    • Number of server instances can grow/shrink
    • Limit number of threads that can be occupied by any one profile/batch to prevent starvation
  • Prioritization
    • Ensure that once Dryad becomes a regular depositor, it will never face queue starvation
    • Ensure that the demo collection has expedited priority when verifying software changes
    • Allow nuxeo feed jobs to queue and be processed without impacting other workloads
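
The throughput rules above could be expressed as a small admission check that runs before any download starts. This is only a sketch under stated assumptions: the names `DownloadRequest` and `may_download` are hypothetical, the 2T and .5T values are taken from the question marks in the list rather than from any decision, and treating the per-collection cap as applying to priority content too is a guess.

```python
from dataclasses import dataclass

TB = 1024 ** 4

# Assumed thresholds: the issue floats 2T or 4T overall and .5T per collection.
GLOBAL_LIMIT_BYTES = 2 * TB
COLLECTION_LIMIT_BYTES = TB // 2


@dataclass
class DownloadRequest:
    collection: str          # hypothetical collection/profile name
    priority: bool = False   # True for priority depositors (e.g. Dryad)


def may_download(req, efs_bytes_total, efs_bytes_by_collection):
    """Return True if the download may start now, False if it should wait."""
    # Per-collection cap: wait until already downloaded content is processed.
    if efs_bytes_by_collection.get(req.collection, 0) >= COLLECTION_LIMIT_BYTES:
        return False
    # Overall cap: only non-priority content is halted.
    if not req.priority and efs_bytes_total >= GLOBAL_LIMIT_BYTES:
        return False
    return True
```

With these assumed limits, a non-priority request waits once EFS holds 2T or more, while a priority request still proceeds unless its own collection already has .5T waiting.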

Brainstorms

Current Process

| Phase | Prioritization | Actions | EFS | Queue | Memory |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Download | Number of jobs + profile properties | Download items | Create batch folder; create job folders; download | Save bid and jid to queue; set priority | Batch state |
| Phase 2: Ingest | Queue priority | Pull job based on priority; process jobs; notify storage and inventory | Build system files; delete after saved to storage | Update queue item state | |

Proposed Process

| Phase | Prioritization | Actions | EFS | Queue | Memory |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Estimate and Prioritize | Number of BYTES + profile properties (or by bytes and files for more accuracy) | Download manifests; perform HEAD requests on all downloads; estimate bytes per job | Create batch folder; download manifests; create job folders? | Create batch state object; save notification details to queue; write preliminary size estimate to queue | None |
| Phase 2: Download | Limit downloads when calculated bytes in EFS exceeds threshold. Dryad is an exception. | Download content to EFS; revise byte counts and overall priority | Download content to job folders | Update bytes and priority in queue | |
| Phase 3: Ingest | Queue priority | Pull job based on priority; process jobs; notify storage and inventory | Build system files; delete after saved to storage | Update queue item state; update batch state | |
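
The Phase 1 estimation step (HEAD requests on all downloads, then a byte estimate per job) might look roughly like the following. It is a sketch, not the ingest service's code: `head_content_length` and `estimate_job_bytes` are hypothetical names, and it assumes the remote servers answer HEAD with a usable `Content-Length` header, which will not always be true.

```python
import urllib.request


def head_content_length(url, timeout=10):
    """Return the Content-Length reported for url, or None if unavailable."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            value = response.headers.get("Content-Length")
    except (OSError, ValueError):  # network/HTTP errors, or a malformed URL
        return None
    return int(value) if value is not None and value.isdigit() else None


def estimate_job_bytes(urls, head=head_content_length):
    """Sum the sizes that could be determined and count the unknown ones."""
    total = 0
    unknown = 0
    for url in urls:
        size = head(url)
        if size is None:
            unknown += 1
        else:
            total += size
    return total, unknown
```

The count of unknown sizes gives a rough signal of how far the estimate can be trusted before it is written to the queue as the preliminary size.
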
@terrywbrady
Contributor Author

  • Privileged daemon for Dryad and other priority collections
  • Separate queuing instances from worker instances to ensure response to queue requests

@terrywbrady
Contributor Author

terrywbrady commented Feb 27, 2023

Brainstorming idea:

| State | Current Action | Future Action |
| --- | --- | --- |
| Pending | Awaiting queue based on priority; evaluate held collections | Awaiting estimation based on priority; evaluate held collections |
| Held | Awaiting release back to Pending | No change |
| Estimating | n/a | Evaluate if dynamic provisioning is needed; evaluate if resources need to be released (i.e., download space); evaluate if hold is in place for primary storage node |
| Provisioning | n/a | Waiting for resources to be provisioned or freed |
| Downloading | n/a | Restart download |
| Ingesting (write to storage) | n/a | Re-process ingest; re-queue (possibly to a different server) |
| Recording | n/a | Inventory processing |
| Reporting | n/a | Assemble report, especially for batches; re-send report; fire callback; callback retries |
| Consumed | Downloading + Ingesting + Reporting | n/a |

@elopatin-uc3
Contributor

Break this table into two parts: figure out which of these states are batch states and which are job states.
Consider manifest submission, zip submission, and regular object submission:

  • imagine each of these scenarios going through these job states
  • e.g. a zip file going through the states: the batch process does the downloading, then multiple jobs are processed
  • identify the data elements in the batch queue record and the job queue record

@terrywbrady
Contributor Author

terrywbrady commented Apr 13, 2023

  • Batch State
    • Pending
    • Held
      • Collections
      • Primary storage node
    • Estimating
      • HEAD requests to validate object size
      • How reliable will this be
      • Priority boost if reliable
    • Provisioning (dynamic ZFS creation or space calculations)
    • Active Jobs State
    • Reporting
      • assemble email
      • delete jobs
    • Failed
    • Completed
    • Cancelled
    • Deleted
  • Job State
    • Pending
    • Downloading
    • Ingesting
    • Recording
    • Download Failed
    • Ingest Failed
    • Inventory Failed
    • Reporting (callback)
    • Completed
    • Cancelled
    • Deletion of job records is exclusively handled by the batch
  • Job Completion Queue
    • triggers updates in batch state
  • Use Cases
    • single file batch (UI) - 1 batch, 1 job, 1 file
    • object manifest (mult file) - 1 batch, 1 job, X files
    • manifest of objects - 1 batch, N jobs, N files
    • manifest of object manifests - 1 batch, N jobs, X files
    • container of digital files - 1 batch, 1 job, N files
    • container of manifest - not supported
    • manifest of containers (exploded) - 1 batch, N jobs, X files
    • manifest of binary zip files (not exploded) - 1 batch, N jobs, N files
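
One way to make the job states above concrete is an enum plus a table of allowed transitions. The state names come from the list; the edges between them are assumptions, since the issue names the states but does not say which moves are legal. `JobState`, `ALLOWED` and `advance` are hypothetical names.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    DOWNLOADING = "Downloading"
    INGESTING = "Ingesting"
    RECORDING = "Recording"
    DOWNLOAD_FAILED = "Download Failed"
    INGEST_FAILED = "Ingest Failed"
    INVENTORY_FAILED = "Inventory Failed"
    REPORTING = "Reporting"
    COMPLETED = "Completed"
    CANCELLED = "Cancelled"


S = JobState

# Assumed transitions: each failed state may retry its own step or be cancelled.
ALLOWED = {
    S.PENDING: {S.DOWNLOADING, S.CANCELLED},
    S.DOWNLOADING: {S.INGESTING, S.DOWNLOAD_FAILED},
    S.INGESTING: {S.RECORDING, S.INGEST_FAILED},
    S.RECORDING: {S.REPORTING, S.INVENTORY_FAILED},
    S.DOWNLOAD_FAILED: {S.DOWNLOADING, S.CANCELLED},
    S.INGEST_FAILED: {S.INGESTING, S.CANCELLED},
    S.INVENTORY_FAILED: {S.RECORDING, S.CANCELLED},
    S.REPORTING: {S.COMPLETED},
    S.COMPLETED: set(),
    S.CANCELLED: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the edges in one table makes it easy to argue about them in review, and any daemon that picks up a job can refuse a move the table does not list.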

@terrywbrady
Contributor Author

terrywbrady commented Apr 13, 2023

  • Batch Queue
    • job count
      • job completed count
      • job failed count
      • job cancelled count
      • job started count
      • job deleted count
    • profile
    • submitter
    • payload file name
    • start time
    • end time
    • list of zk ids
    • byte estimate
  • Job Queue Object
    • current
      • status
      • zk id number (includes priority)
      • job id
      • batch id
      • created time
      • submitter
      • profile name
      • job type
      • primary id / ark (empty if add)
      • local id
      • payload file name
      • update vs create
    • new fields needed
      • storage manifest url
      • byte estimate
      • ingest worker hostname
      • storage start time
      • error detail
      • download retry count
      • ezid retry count
      • callback retry count
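
As a rough illustration of the two records, the fields listed above map naturally onto a pair of dataclasses. This is a sketch only: the class names, the Python field names, the `batch_id` on the batch record, and the rule that a batch is finished once completed + failed + cancelled reaches the job count are all assumptions, and only a subset of the job fields is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BatchRecord:
    batch_id: str                      # assumed key; not in the list above
    profile: str
    submitter: str
    payload_file_name: str
    job_count: int = 0
    jobs_completed: int = 0
    jobs_failed: int = 0
    jobs_cancelled: int = 0
    zk_ids: List[str] = field(default_factory=list)
    byte_estimate: Optional[int] = None

    def record_job_result(self, status):
        """Update counters when a job reports a terminal status."""
        if status == "Completed":
            self.jobs_completed += 1
        elif status == "Failed":
            self.jobs_failed += 1
        elif status == "Cancelled":
            self.jobs_cancelled += 1
        else:
            raise ValueError(f"not a terminal status: {status}")

    def all_jobs_finished(self):
        """True once every job has reached a terminal status."""
        done = self.jobs_completed + self.jobs_failed + self.jobs_cancelled
        return done >= self.job_count


@dataclass
class JobRecord:
    job_id: str
    batch_id: str
    profile_name: str
    status: str = "Pending"
    # Proposed new fields from the list above.
    storage_manifest_url: Optional[str] = None
    byte_estimate: Optional[int] = None
    ingest_worker_hostname: Optional[str] = None
    error_detail: Optional[str] = None
    download_retry_count: int = 0
    ezid_retry_count: int = 0
    callback_retry_count: int = 0
```

A job completion queue consumer could call `record_job_result` for each finished job and move the batch to Reporting when `all_jobs_finished()` returns true.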

@terrywbrady
Contributor Author

terrywbrady commented Apr 13, 2023

Microservice Changes

  • All ingest workers use 1 queue
  • Inventory uses ingest queue
  • Decompose ingest to multiple microservices? Or rely on existing daemons.
    • batch preparation - api handling
      • initiates queue entries
    • batch reporting - daemon
      • cleans up queue entries
    • job handling - normal queue - daemon
      • queue driven
    • job handling - priority queue - daemon
      • queue driven
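
If all ingest workers share one queue, each queue-driven daemon needs a rule for which job to take next. The sketch below combines priority order with the per-profile thread cap from the use cases at the top of this issue. It uses an in-memory heap purely for illustration (not ZooKeeper), and `pick_next_job`, the tuple layout and the cap value are hypothetical.

```python
import heapq


def pick_next_job(queue, active_by_profile, max_per_profile):
    """Pop the highest-priority job whose profile is under its thread cap.

    queue is a heap of (priority, sequence, profile, job_id) tuples, where a
    lower priority number is served first. Entries skipped because their
    profile is at the cap are pushed back, so nothing is lost.
    Returns the chosen tuple, or None if no job is eligible.
    """
    skipped = []
    chosen = None
    while queue:
        entry = heapq.heappop(queue)
        profile = entry[2]
        if active_by_profile.get(profile, 0) < max_per_profile:
            chosen = entry
            break
        skipped.append(entry)
    for entry in skipped:
        heapq.heappush(queue, entry)
    return chosen
```

For example, with a cap of 2 and two nuxeo jobs already running, a worker skips the queued nuxeo entries and takes the next job from another profile, which is the starvation protection the use cases ask for.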

@terrywbrady
Contributor Author

Note from 3/18 IAS meeting:

  • will ZK evolve with our evolving configuration plans?
  • AWS SQS is very simple ... could our needs actually be simple?
  • AWS also offers Amazon MQ

@terrywbrady terrywbrady pinned this issue Jun 1, 2023
@elopatin-uc3
Contributor

@terrywbrady
Contributor Author

@mreyescdl and @elopatin-uc3, are you OK if we mark this done on the assumption that we have or will have new tickets to cover the work?

@elopatin-uc3
Contributor

That sounds fine to me @terrywbrady
