Collect job state at query time #41

Open · smashwilson opened this issue Jan 8, 2015 · 3 comments
@smashwilson (Member)

Currently, I launch a goroutine that attaches to each job's stdin, stdout, and stderr streams and updates Mongo in the background as data arrives from the Docker API. This puts an (admittedly pretty high) cap on the number of jobs we can manage, because each running user job requires a dedicated goroutine somewhere on an API node. Worse, when an API process goes down, we lose track of all of the jobs attached to that process's job runner routine: none of them will update stdout or stderr, populate their results, or transition to StatusError or StatusDone!

A better approach is to update each job's status at list time. When the Job: List call is invoked, update:

  • The process's stdout and stderr from Docker.
  • The job's status by inspecting the container's state.
  • If the process has completed: the job's result from stdout or from its filesystem, depending on the result source.

... and return the new states of each job normally.
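Here's a minimal sketch of what that query-time refresh could look like. The `Job` fields, the `DockerInspector` and `JobStore` interfaces, and the status constants are all stand-ins for illustration, not the actual types in this codebase:

```go
// Sketch of a query-time refresh. All types and names here are assumptions
// made for illustration; the real job document and Docker wrapper may differ.
package jobs

import "time"

// Status values assumed for the sketch.
const (
	StatusRunning = "running"
	StatusDone    = "done"
	StatusError   = "error"
)

// Job is a trimmed-down stand-in for the persisted job document.
type Job struct {
	ID          string
	ContainerID string
	Status      string
	Stdout      string
	Stderr      string
	Result      string
	LastUpdated time.Time
}

// ContainerState is the part of a container inspection we care about.
type ContainerState struct {
	Running  bool
	ExitCode int
}

// DockerInspector abstracts the Docker API calls the refresh relies on.
type DockerInspector interface {
	Inspect(containerID string) (ContainerState, error)
	Logs(containerID string) (stdout, stderr string, err error)
}

// JobStore abstracts the Mongo persistence layer.
type JobStore interface {
	Save(job *Job) error
}

// RefreshJob pulls the latest container state at list time and writes it back,
// instead of relying on a long-lived attach goroutine per job.
func RefreshJob(d DockerInspector, s JobStore, job *Job) error {
	state, err := d.Inspect(job.ContainerID)
	if err != nil {
		return err
	}

	stdout, stderr, err := d.Logs(job.ContainerID)
	if err != nil {
		return err
	}
	job.Stdout, job.Stderr = stdout, stderr

	if state.Running {
		job.Status = StatusRunning
	} else if state.ExitCode == 0 {
		job.Status = StatusDone
		// Result-source handling (stdout vs. filesystem) would go here.
		job.Result = job.Stdout
	} else {
		job.Status = StatusError
	}

	job.LastUpdated = time.Now()
	return s.Save(job)
}
```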

@smashwilson (Member Author)

/cc @rgbkrk

I think we still want the Runner to exist for job launching and feeding in stdin, though.

The other option is to make the JobSubmit call synchronous: don't return until the container is created and running. That would eliminate some of the race conditions related to killing jobs at weird times, but it also does away with the idea of having a queue of submitted jobs, which I don't like. Even if we have a huge cluster, there's always going to be the possibility that people are sending jobs faster than we can run them.
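For what the asynchronous path could look like, here's a minimal sketch of a Runner that owns launches while JobSubmit just enqueues and returns. The names (`Submission`, `Runner`, `launchContainer`) and the buffered-channel queue are assumptions for illustration, not the existing Runner API:

```go
// Sketch of keeping JobSubmit asynchronous: submissions are queued on a
// channel and a single Runner goroutine launches containers at its own pace.
package jobs

import "log"

// Submission is whatever JobSubmit would hand off to the Runner.
type Submission struct {
	JobID string
	Image string
	Stdin []byte
}

// Runner owns container launches and stdin feeding.
type Runner struct {
	queue chan Submission
}

func NewRunner(depth int) *Runner {
	return &Runner{queue: make(chan Submission, depth)}
}

// Submit returns immediately; the job stays queued until the Runner gets to it.
func (r *Runner) Submit(s Submission) {
	r.queue <- s
}

// Run drains the queue in order; launchContainer stands in for the Docker calls.
func (r *Runner) Run() {
	for s := range r.queue {
		if err := launchContainer(s); err != nil {
			log.Printf("failed to launch job %s: %v", s.JobID, err)
		}
	}
}

func launchContainer(s Submission) error {
	// Create the container, start it, and feed s.Stdin here.
	return nil
}
```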

@smashwilson (Member Author)

A downside of this is that it makes Job: List more expensive, and Job: List is the call that multyvac polls while waiting for a job to complete.

Mitigating idea: store a LastUpdated timestamp in Mongo, and only refresh a job from Docker if a certain amount of time has elapsed since then. We can also restrict refreshes to jobs in certain states: there's no reason to check for more stdout or stderr if a job is in StatusDone or StatusKilled, especially because the only way to enter those states is another Job: List call.
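A quick sketch of that staleness check, reusing the `Job` fields from the earlier sketch; the two-second interval is an arbitrary example value, not something decided in this issue:

```go
// Sketch of throttling query-time refreshes with LastUpdated and a set of
// terminal states. The interval and status strings are illustrative only.
package jobs

import "time"

const refreshInterval = 2 * time.Second // example value

// Terminal states never need another Docker round trip.
var terminalStatuses = map[string]bool{
	"done":   true,
	"killed": true,
	"error":  true,
}

// NeedsRefresh reports whether Job: List should go back to Docker for this job.
func NeedsRefresh(status string, lastUpdated time.Time) bool {
	if terminalStatuses[status] {
		return false
	}
	return time.Since(lastUpdated) >= refreshInterval
}
```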

@smashwilson (Member Author)

Also, there is still some weirdness related to simultaneous Job: List calls. If a new request arrives that touches the same job, the second one will stomp on the first one's modifications. If the second request is faster to complete than the first one, the first one will stomp on the second one's modifications, which is a bigger deal.

I think I can deal with this by making the job update a findAndModify call based on a predicate that checks LastUpdated: essentially, only (atomically) update the job if its stored state is at least as old as it was when we started. I'll probably need a Revision number in addition to the timestamp so we don't rely on timestamp consistency across a cluster.
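A sketch of the revision-guarded write, using mgo (gopkg.in/mgo.v2) since that was a common Mongo driver for Go at the time; the collection name, field names, and use of a plain selector-guarded Update (rather than findAndModify / Query.Apply) are assumptions, but they give the same atomic compare-and-swap:

```go
// Sketch of an optimistic-concurrency update: apply the new state only if the
// revision we read is still current, so a slower concurrent Job: List refresh
// cannot stomp on a newer one. Field and collection names are illustrative.
package jobs

import (
	"errors"

	"gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

// ErrStaleJob means another Job: List call already updated this document.
var ErrStaleJob = errors.New("job was updated by a concurrent Job: List call")

// UpdateIfUnchanged writes fields only when the stored revision still matches
// the one we read, bumping the revision in the same atomic update.
func UpdateIfUnchanged(coll *mgo.Collection, jobID string, readRevision int, fields bson.M) error {
	selector := bson.M{"_id": jobID, "revision": readRevision}
	change := bson.M{
		"$set": fields,
		"$inc": bson.M{"revision": 1},
	}

	err := coll.Update(selector, change)
	if err == mgo.ErrNotFound {
		// Someone else already bumped the revision; drop our stale write.
		return ErrStaleJob
	}
	return err
}
```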

@smashwilson smashwilson added this to the v0.0.2: Infrastructure milestone Feb 2, 2015