Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fairer queueing for parallel workers in RBI generation #889

Merged
merged 2 commits into from Apr 8, 2022

Conversation

egiurleo
Copy link
Contributor

@egiurleo egiurleo commented Apr 7, 2022

Motivation

Closes #760

Currently, the queue for RBI generation is executed in the following manner:

  • The Executor determines how many worker forks to use for RBI generation (call this number N)
  • It then splits up the queue into N groups, and gives one group of jobs to each worker
  • This does not take into account the size of each job in the queue; if one worker happens to get a few larger jobs, it won't re-distribute them among other workers who finish early. This means that RBI generation takes longer than it needs to while the worker processes those large jobs.

The ideal behavior would be to give one job to each worker at a time, then give the next job to whichever worker frees up first. This would mean that the jobs are fairly split up between workers and on the whole, RBI generation should be more efficient.

Implementation

As @paracycle pointed out in #760, there is a gem that already implements this behavior: the parallel gem! I spent a long time reading the source code of the gem's map function, and I believe that it implements the exact algorithm we're looking for: rather than splitting up jobs and giving them to workers up front, it waits for workers to complete and then gives them the next job in the queue.

Tests

I kept most of the Executor tests and switched one that was no longer relevant.

Tophatting 🎩

I have tested my change on the Shopify codebase. Not only does it work, but anecdotally, it looks like it actually generates all the gem RBIs a few minutes faster! My approach definitely was not scientific (deleted gem RBIs and then regenerated them a few times), but with the currently published version of tapioca, the process takes about 1500 seconds, and with my change, it takes about 1300. (Ignore this last part -- this result is not consistent.)

@egiurleo egiurleo changed the title Fairer queueing Fairer queueing for parallel workers in RBI generation Apr 8, 2022
@egiurleo egiurleo force-pushed the fairer-queueing branch 2 times, most recently from 0c56b97 to 7084757 Compare April 8, 2022 15:22
@egiurleo egiurleo marked this pull request as ready for review April 8, 2022 15:27
@egiurleo egiurleo requested a review from a team April 8, 2022 15:27
@egiurleo egiurleo force-pushed the fairer-queueing branch 2 times, most recently from 6f4b10e to ab49250 Compare April 8, 2022 15:52
tapioca.gemspec Outdated
@@ -24,6 +24,7 @@ Gem::Specification.new do |spec|
spec.metadata["allowed_push_host"] = "https://rubygems.org"

spec.add_dependency("bundler", ">= 1.17.3")
spec.add_dependency("parallel")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I commented before requesting review but I never submitted them 🤦🏻‍♀️

@Morriar can you explain what the difference is between adding a dependency in the Gemfile vs in the gemspec? I assumed parallel went in the gemspec because it's now a required dependency to actually run the gem, but I'm not sure I really understand the distinction.

Also, do you think I should specify a version here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, what you did is correct. The top level idea is this:

  1. A gem's Gemfile is only for doing development on the gem source code itself. It only brings in things that will be needed as you are working on the gem. Part of the things it brings in are the items listed inside the .gemspec file, since (as I'll mention in the next section) those are also needed for the gem to run.
  2. The gem, when packaged and published, needs to declare what other gems it has a "runtime" dependency on, so that tools like Bundler, etc can make sure to install those as well. These are gems that, if they are not present in a user's gem folder, will prevent the main gem from working. Those are listed in the .gemspec file as add_dependency entries.

In your case, Tapioca would not work on a user's machine if Parallel gem wasn't installed, so that dependency needs to be declared in the gemspec, since that is the only thing that gets packages inside a released gem (along with the actual source code).

Copy link
Collaborator

@Morriar Morriar Apr 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding it that:

  • the .gemspec file is used to export the list of declared runtime dependencies. So for example the gems in this file using add_dependency will be displayed as runtime dependencies on RubyGems: https://rubygems.org/gems/tapioca. Furthermore, when a project will install tapioca as a gem, they will get the dependencies declared in the .gemspec as transitive dependencies.

  • the Gemfile is used for "internal" dependencies. The ones you do not need when executing the gem. This would be gems required to run the tests, to debug, to lint etc. All the need you would only need to contribute to the project. Those dependencies will not be installed as transitive ones when installing tapioca in a project.

In your case, since parallel is required to execute tapioca it is a runtime dependency and should be declared as such in the .gemspec as you did 👍

For the version, it seems you've been using 1.21.0 to run the tests (as it is the version we already had in the Gemfile.lock file). So it might be safer to add a restriction to require at least this version since we're not sure it works with an older version of the gem.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What complicates all of this is the notion of add_development_dependency that goes into a gemspec and declares optional dependencies that should be used for development. I choose to completely ignore that, since it is a hold over from before the time we had Gemfiles and that was the only way a gem could declare things it needed for development as opposed to runtime. I just always use a Gemfile for those now and never declare my dev dependencies in the gemspec.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a version constraint to the parallel gem!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you also need to do a bundle lock to update the Gemfile.lock after that, so that the lockfile also picks up that version bound.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦🏻‍♀️ Love when CI turns red immediately lol!

Copy link
Member

@paracycle paracycle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, thank you. 👏

Copy link
Collaborator

@Morriar Morriar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing, thanks! 👍

@egiurleo egiurleo merged commit 9a1cb24 into main Apr 8, 2022
@egiurleo egiurleo deleted the fairer-queueing branch April 8, 2022 17:08
@paracycle paracycle added the enhancement New feature or request label Apr 29, 2022
@shopify-shipit shopify-shipit bot temporarily deployed to production May 13, 2022 23:08 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fairer queueing for parallel workers in RBI generation
3 participants