
Event Subsystem Architecture Review


This review of Events / notifications and the respective delayed jobs was conducted in July 2017 by @hennevogel, @mdeniz and @evanrolfe.

Proposals for improvement

  • Jobs of the same type do not run concurrently
  • Failed procedures notify Errbit and can be retried
  • Jobs distinguish between procedures that have not yet started, have completed, or have failed

Options for going forward

After analyzing the Event architecture we have come up with 3 possible options for an overhaul.

Overall options

Option 1 - Remove Event, single purpose jobs

No more Event classes, no events table in the database. We will store the data we need for processing a job inside the DelayedJob table (payload). Whenever the event happens (like a package fails to build), jobs are created accordingly.
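A minimal sketch of what such a single purpose job could look like (assuming an ActiveJob base class; SendBuildFailEmailJob and BuildFailMailer are made-up names, not existing OBS code):

```ruby
# The event data travels only in the job arguments, which DelayedJob
# serializes into its payload column - no Event row is created.
class SendBuildFailEmailJob < ApplicationJob
  queue_as :mails

  def perform(payload)
    BuildFailMailer.notify(payload).deliver_now
  end
end

# Wherever the event happens, e.g. when a package fails to build:
SendBuildFailEmailJob.perform_later('project' => 'home:foo', 'package' => 'bar')
```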

Option 2 - Keep Event, use single purpose jobs

Same as option 1, but the event related data is still stored in the Event model. Data gets duplicated into the DelayedJob payload.

Option 3 - Keep Event, use single/multi purpose jobs in queues

Basically keeping everything as it is. Events get stored in the Event model. Batches of events get processed by many multi purpose & single purpose jobs. Event data gets duplicated into the DelayedJob payload. The only change would be to run each type of job in an independent queue to avoid concurrency (see the sketch below).
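The queue separation could look roughly like this (a sketch with assumed queue and class names, not an existing change):

```ruby
# Each job type declares its own queue...
class SendEventEmailsJob < ApplicationJob
  queue_as :send_event_emails

  def perform
    # ... existing email/RSS logic ...
  end
end

class NotifyBackendsJob < ApplicationJob
  queue_as :notify_backends

  def perform
    # ... existing backend notification logic ...
  end
end

# ...and one dedicated worker per queue processes it serially, e.g.:
#   bundle exec rake jobs:work QUEUE=send_event_emails
#   bundle exec rake jobs:work QUEUE=notify_backends
```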

Job options

  • Multipurpose Jobs: One job that does every task (send mail, notify backend) related to the event that happened
  • Single purpose Jobs: One job per task (send mail, notify backend) related to the event that happened

Data storage options

  • Store data in the Event’s table: We need to keep the Event instances around as long as there are jobs to be processed (state & cleanup)
  • Store data in the DJ payload: Job related data gets duplicated into the DJ payload

Comparison

Summary per option:

| Option                         | 1          | 2          | 3                     |
|--------------------------------|------------|------------|-----------------------|
| Events processed per job       | One        | One        | Many                  |
| Tasks per job                  | One        | One        | One/Many              |
| Jobs per event                 | Many       | Many       | None/Many (CreateJob) |
| Table to store data in between | DJ         | DJ         | Event                 |
| Concurrency                    | Yes        | Yes        | No                    |
| Failure handler                | DJ         | DJ         | Events                |
| Individual queues              | No         | No         | Yes                   |
| Copies of event data           | Duplicated | Duplicated | Normalized            |
| Event representation           | No         | Yes        | Yes                   |
| Cleanup of Events              | No         | Yes        | Yes                   |

Jobs

An overview of the things we have noticed about the different jobs

General Notes

  • None of these jobs can track failures.
  • It is assumed that every job will succeed.
  • ActiveJob and DelayedJob use different default queues (see the illustration below).
  • Jobs shouldn't expose methods besides perform.
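The default queue mismatch, as an illustration (library behaviour, not OBS code):

```ruby
# ActiveJob enqueues into the "default" queue unless queue_as is used:
class SomeJob < ActiveJob::Base
  def perform; end
end
SomeJob.new.queue_name              # => "default"

# DelayedJob has no default queue name of its own, so jobs enqueued directly
# with Delayed::Job.enqueue end up with queue = NULL:
Delayed::Worker.default_queue_name  # => nil
```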

Event::NotifyBackends

Requirements:

  • Needs to process events continuously.

Target:

  • Posts the event payload to the backend for events that define the raw_type attribute.
  • Only needed for the hermes and rabbitmq backend notification plugins.

Job Creation:

  • Clock.rb creates and queues a delayed job every 30 seconds.
  • [PROBLEM] This is using DelayedJob directly, not ActiveJob (see the sketch below).
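A sketch of how the same schedule could go through ActiveJob instead, assuming clock.rb uses the clockwork gem inside the Rails environment and a hypothetical NotifyBackendsJob:

```ruby
require 'clockwork'

module Clockwork
  # Enqueue through ActiveJob (and whatever queue adapter is configured)
  # instead of creating a Delayed::Job directly.
  every(30.seconds, 'events.notify_backends') do
    NotifyBackendsJob.perform_later
  end
end
```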

Processing control:

  • Uses the boolean attribute events.queued to keep track of whether or not this has been processed.
  • [PROBLEM] queued is set to true before the payload is posted (see the sketch after this list).
  • [PROBLEM] Does not handle failures.
  • [PROBLEM] The notify_backend method is only defined on the Event::Base class.
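A minimal sketch of the safer ordering, assuming the Airbrake notifier that Errbit uses (not the current implementation):

```ruby
Event::Base.where(queued: false).find_each do |event|
  begin
    event.notify_backend          # POST the payload to the backend first
    event.update!(queued: true)   # only mark as processed when the POST worked
  rescue StandardError => e
    Airbrake.notify(e)            # report the failure instead of losing it
  end
end
```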

Concurrency control:

  • There is nothing to prevent this job from running simultaneously with itself, which is a problem because events can then be processed more than once and sent to the backend multiple times.

ProjectLogRotate

Target:

  • It saves ProjectLogEntry records to the database, which are used to create the RSS feed for the last commits in projects/packages.
  • Should be created ASAP.
  • Project log entries only need to exist in the database for 10 days (see the sketch below).
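The 10-day retention boils down to something like this (a sketch, not the exact OBS query):

```ruby
# Drop project log entries once the RSS feed no longer needs them:
ProjectLogEntry.where('created_at < ?', 10.days.ago).delete_all
```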

Job Creation

  • Clock.rb creates and enqueues a delayed job every 10 minutes.

Processing control:

  • Uses the project_logged column.
  • [PROBLEM] Continuously retries events which raise an error when creating the ProjectLogEntry, or when anything else goes wrong (e.g. the project was already deleted).
  • [PROBLEM] If we reach 10,000 unprocessable events, valid events are prevented from being processed for 10 days.
  • [PROBLEM] Events which don't descend from Event::Project or Event::Package hang around for 10 days before they get marked as logged, even though they are never used by ProjectLogRotate.

Concurrency control

  • Cannot run simultaneously with another instance of itself.
  • We prevent this by running all instances of this job in a single queue with a single worker.

CreateJob

Target:

  • CreateJob is a base class; the subclasses called are:
      ◦ UpdateBackendInfos - updates frontend data based on what comes from the backend
      ◦ UpdateReleasedBinaries - updates BinaryRelease data in the frontend based on what comes from the backend

Job Creation:

  • DelayedJobs are queued inside the perform_create_jobs callback in the Event::Base model
  • Each job queued increments the undone_jobs counter
  • [PROBLEM] This is using DelayedJob directly, not ActiveJob.

Processing control:

  • Uses the undone_jobs (integer) column to keep track of how many delayed jobs still need to be completed.
  • undone_jobs == 0 means that either there were no jobs to be processed, or they have already been processed.
  • When a job completes, it decrements the undone_jobs counter by 1 (see the sketch after this list).
  • [PROBLEM] Both jobs do not handle exceptions or failures.
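The counter bookkeeping roughly works like this (assumed code shape, not a verbatim copy of the OBS implementation):

```ruby
class CreateJob
  def initialize(event_id)
    @event_id = event_id
  end

  def perform
    # ... actual work of the subclass (UpdateBackendInfos / UpdateReleasedBinaries) ...
    event = Event::Base.find(@event_id)
    # Decrement under a row lock so two jobs for the same event cannot lose an
    # update; undone_jobs == 0 then means "nothing left to process".
    event.with_lock do
      event.update_columns(undone_jobs: event.undone_jobs - 1)
    end
  end
end
```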

Concurrency control:

  • CreateJob locks the event while updating undone_jobs after the job is completed
  • UpdateReleasedBinaries runs in 'releasetracking' queue so is not concurrent
  • UpdateBackendInfos runs in the 'quick' queue so is concurrent

SendEventEmails

Target:

  • Send emails ASAP for events to subscribers
  • Create RSS notifications ASAP for events

Job Creation:

  • Clock.rb creates and enqueues a delayed job every 30 seconds.

Processing control:

  • Uses the boolean attribute events.mails_sent to keep track of whether or not this has been processed.
  • [PROBLEM] create_rss_notifications fails silently.
  • [PROBLEM] It cannot distinguish between individual failures in email sending and/or RSS notification creation (see the sketch after this list).
  • If either email sending or RSS creation fails:
      ◦ Errbit is notified
      ◦ [PROBLEM] mails_sent is set to true so that event is not re-processed
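A sketch of how failures could be kept visible and retryable (hypothetical helper names, not the current code):

```ruby
Event::Base.where(mails_sent: false).find_each do |event|
  begin
    event.send_notification_emails   # hypothetical: deliver mails to subscribers
    event.create_rss_notifications   # currently fails silently, see above
    event.update!(mails_sent: true)  # only flag the event once both steps worked
  rescue StandardError => e
    Airbrake.notify(e)               # report to Errbit and leave mails_sent false
  end                                # so the event can be retried later
end
```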

Concurrency control:

  • Cannot run simultaneously with another instance of itself.
  • We prevent this by running all instances of this job in a single queue with a single worker.

UpdateNotificationEvents

Target:

  • It reads from the backend at /lastnotifications and creates events ASAP based on that response.

Job Creation:

  • Clock.rb runs this every 17 seconds inside a thread (because it needs to run asynchronously).
  • [PROBLEM] The use of threads complicates the processing; a Mutex is used to avoid running multiple threads at the same time.

Processing control:

  • Every run of this job stores the last notification id it looked at into the database (BackendInfo.lastnotification_nr)
  • Every run of this job fetches the notifications from BackendInfo.lastnotification_nr onwards
  • Every run of this job blocks on the backend call??? (Clarify with the backend people what /lastnotifications?block=1 means.)
  • Processing is based on the limit_reached and next attributes of the backend's /lastnotifications response (see the sketch after this list).
  • limit_reached set to 1 means that the backend has more events to notify about (> 1000) than can be served in one request, so we need to request more from the backend. That is done in another iteration of the loop.
  • sync=lost will be set if the notification id the job starts from is lower than the oldest number on record in the backend (probably not needed anymore as concurrent processes are no longer possible).
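The polling loop roughly looks like this (assumed helper names for the backend request and the event creation step):

```ruby
loop do
  start = BackendInfo.lastnotification_nr
  # fetch_last_notifications is a placeholder for the GET on /lastnotifications
  response = fetch_last_notifications(start: start)

  create_events_from(response)                      # placeholder: one Event per notification
  BackendInfo.lastnotification_nr = response[:next] # remember where to continue next time

  # limit_reached == 1: the backend had more than it could serve in one
  # request, so loop again immediately; otherwise we are done for this run.
  break unless response[:limit_reached].to_i == 1
end
```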

Concurrency control:

  • Cannot run simultaneously with another instance of itself.
  • [PROBLEM] We prevent this by using a semaphore/Mutex.

General Notes

  • The relationship between events and subscriptions is handled by a complex service class, and the logic only works one way: you can only find subscriptions for an event, not the other way round.
  • Event data is duplicated into Notification and ProjectLogEntry instances as the payload.