Collect job state at query time #41

Open · smashwilson opened this issue Jan 8, 2015 · 3 comments
@smashwilson (Member)

Currently, I launch a goroutine that attaches to each job's stdin, stdout, and stderr streams and updates Mongo in the background as data arrives from the Docker API. This puts an (admittedly pretty high) cap on the number of jobs we can manage, because each running user job requires a dedicated goroutine somewhere on an API node. Worse, when an API process goes down, we lose track of all of the jobs attached to that process's job runner routine: none of them will update stdout or stderr, populate their results, or transition to StatusError or StatusDone!

A better approach is to update each job's status at list time. When the Job: List call is invoked, update:

  • The process's stdout and stderr from Docker.
  • The job's status by inspecting the container's state.
  • If the process has completed: the job's result from stdout or from its filesystem, depending on the result source.

... and return the new states of each job normally.
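Here's a minimal sketch of what that query-time refresh could look like. The `Job` fields, the `DockerInspector` and `JobStore` interfaces, and the status constants are all stand-ins for illustration, not the actual types in this codebase:

```go
// Sketch of a query-time refresh. All types and names here are assumptions
// made for illustration; the real job document and Docker wrapper may differ.
package jobs

import "time"

// Status values assumed for the sketch.
const (
	StatusRunning = "running"
	StatusDone    = "done"
	StatusError   = "error"
)

// Job is a trimmed-down stand-in for the persisted job document.
type Job struct {
	ID          string
	ContainerID string
	Status      string
	Stdout      string
	Stderr      string
	Result      string
	LastUpdated time.Time
}

// ContainerState is the part of a container inspection we care about.
type ContainerState struct {
	Running  bool
	ExitCode int
}

// DockerInspector abstracts the Docker API calls the refresh relies on.
type DockerInspector interface {
	Inspect(containerID string) (ContainerState, error)
	Logs(containerID string) (stdout, stderr string, err error)
}

// JobStore abstracts the Mongo persistence layer.
type JobStore interface {
	Save(job *Job) error
}

// RefreshJob pulls the latest container state at list time and writes it back,
// instead of relying on a long-lived attach goroutine per job.
func RefreshJob(d DockerInspector, s JobStore, job *Job) error {
	state, err := d.Inspect(job.ContainerID)
	if err != nil {
		return err
	}

	stdout, stderr, err := d.Logs(job.ContainerID)
	if err != nil {
		return err
	}
	job.Stdout, job.Stderr = stdout, stderr

	if state.Running {
		job.Status = StatusRunning
	} else if state.ExitCode == 0 {
		job.Status = StatusDone
		// Result-source handling (stdout vs. filesystem) would go here.
		job.Result = job.Stdout
	} else {
		job.Status = StatusError
	}

	job.LastUpdated = time.Now()
	return s.Save(job)
}
```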

@smashwilson (Member Author)

/cc @rgbkrk

I think we still want the Runner to exist for job launching and feeding in stdin, though.

The other option is to make the JobSubmit call synchronous: don't return until the container is created and running. That would eliminate some of the race conditions related to killing jobs at weird times, but it also does away with the idea of having a queue of submitted jobs, which I don't like. Even if we have a huge cluster, there's always going to be the possibility that people are sending jobs faster than we can run them.
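For what the asynchronous path could look like, here's a minimal sketch of a Runner that owns launches while JobSubmit just enqueues and returns. The names (`Submission`, `Runner`, `launchContainer`) and the buffered-channel queue are assumptions for illustration, not the existing Runner API:

```go
// Sketch of keeping JobSubmit asynchronous: submissions are queued on a
// channel and a single Runner goroutine launches containers at its own pace.
package jobs

import "log"

// Submission is whatever JobSubmit would hand off to the Runner.
type Submission struct {
	JobID string
	Image string
	Stdin []byte
}

// Runner owns container launches and stdin feeding.
type Runner struct {
	queue chan Submission
}

func NewRunner(depth int) *Runner {
	return &Runner{queue: make(chan Submission, depth)}
}

// Submit returns immediately; the job stays queued until the Runner gets to it.
func (r *Runner) Submit(s Submission) {
	r.queue <- s
}

// Run drains the queue in order; launchContainer stands in for the Docker calls.
func (r *Runner) Run() {
	for s := range r.queue {
		if err := launchContainer(s); err != nil {
			log.Printf("failed to launch job %s: %v", s.JobID, err)
		}
	}
}

func launchContainer(s Submission) error {
	// Create the container, start it, and feed s.Stdin here.
	return nil
}
```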

@smashwilson (Member Author)

A downside of this is that it makes Job: List more expensive, and Job: List is the call that multyvac polls while waiting for a job to complete.

Mitigating idea: store a LastUpdated timestamp in Mongo, and only refresh a job from Docker if a certain amount of time has elapsed since then. We can also restrict refreshes to jobs in certain states: there's no reason to check for more stdout or stderr if a job is in StatusDone or StatusKilled, especially because the only way to enter those states is another Job: List call.
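A quick sketch of that staleness check, reusing the `Job` fields from the earlier sketch; the two-second interval is an arbitrary example value, not something decided in this issue:

```go
// Sketch of throttling query-time refreshes with LastUpdated and a set of
// terminal states. The interval and status strings are illustrative only.
package jobs

import "time"

const refreshInterval = 2 * time.Second // example value

// Terminal states never need another Docker round trip.
var terminalStatuses = map[string]bool{
	"done":   true,
	"killed": true,
	"error":  true,
}

// NeedsRefresh reports whether Job: List should go back to Docker for this job.
func NeedsRefresh(status string, lastUpdated time.Time) bool {
	if terminalStatuses[status] {
		return false
	}
	return time.Since(lastUpdated) >= refreshInterval
}
```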

@smashwilson (Member Author)

Also, there is still some weirdness related to simultaneous Job: List calls. If a new request arrives that touches the same job, the second one will stomp on the first one's modifications. If the second request is faster to complete than the first one, the first one will stomp on the second one's modifications, which is a bigger deal.

I think I can deal with this by making the job update a findAndModify call based on a predicate that checks LastUpdated: essentially, only (atomically) update the job if its stored state is at least as old as it was when we started. I'll probably need a Revision number in addition to the timestamp so we don't rely on timestamp consistency across a cluster.
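A sketch of the revision-guarded write, using mgo (gopkg.in/mgo.v2) since that was a common Mongo driver for Go at the time; the collection name, field names, and use of a plain selector-guarded Update (rather than findAndModify / Query.Apply) are assumptions, but they give the same atomic compare-and-swap:

```go
// Sketch of an optimistic-concurrency update: apply the new state only if the
// revision we read is still current, so a slower concurrent Job: List refresh
// cannot stomp on a newer one. Field and collection names are illustrative.
package jobs

import (
	"errors"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// ErrStaleJob means another Job: List call already updated this document.
var ErrStaleJob = errors.New("job was updated by a concurrent Job: List call")

// UpdateIfUnchanged writes fields only when the stored revision still matches
// the one we read, bumping the revision in the same atomic update.
func UpdateIfUnchanged(coll *mgo.Collection, jobID string, readRevision int, fields bson.M) error {
	selector := bson.M{"_id": jobID, "revision": readRevision}
	change := bson.M{
		"$set": fields,
		"$inc": bson.M{"revision": 1},
	}

	err := coll.Update(selector, change)
	if err == mgo.ErrNotFound {
		// Someone else already bumped the revision; drop our stale write.
		return ErrStaleJob
	}
	return err
}
```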

@smashwilson smashwilson added this to the v0.0.2: Infrastructure milestone Feb 2, 2015