What is the point? #1

Open
marshrossney opened this issue May 2, 2022 · 0 comments · Fixed by #3
marshrossney commented May 2, 2022

Summary

At one point I had intended to create a tool for generating reproducible containers in which arbitrary commands could be executed, with the knowledge that this tool was logging all of the information required to exactly reproduce the result.

This proved to (a) be too ambitious given my skill level, and (b) involve a considerable amount of wheel re-invention. Instead, I decided to build a much simpler tool that uses templates to semi-automate laborious copying and typing, and to offload the responsibility of making appropriate use of version control and dedicated package managers to the user.

The result is something that I think is 90% as useful as what I had originally intended, and about 2% as complicated to build and maintain (I think that's called being smart). The main use-case is someone like myself who wants to run experiments using code that's absolutely nowhere near a finished product. These experiments might amount to nothing, but I sure as hell would like to be able to go back and reproduce that 'really good result' I got months ago with some random hacked version of the code!

Background

For future reference.

My original intention was to create a tool that would act as a drop-in replacement for python when invoking scripts as in e.g. python script.py -c config.yaml, which would create and run the script inside an isolated environment, recording all the information required to exactly reproduce the result. Essentially I wanted to semi-automate the following steps:

  1. Create an isolated environment given some precise specification of files, installed packages, environment variables etc.
  2. Run a set of commands inside this environment.
  3. Log the output and generate a configuration file that allows the experiment to be exactly reproduced.
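The three steps above could be sketched roughly as follows. This is a hypothetical illustration, not code from the project: the function name, log format, and directory layout are all made up, and a real tool would also invoke the environment's own interpreter and pin its installed packages.

```python
import json
import subprocess
import sys
import venv
from pathlib import Path


def run_experiment(workdir, command):
    """Hypothetical sketch: create an isolated environment, run a
    command, and log enough information to reproduce the run."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)

    # 1. Create an isolated environment.
    venv.create(workdir / "env", with_pip=False)

    # 2. Run the command (a real tool would run it *via* the env's python).
    result = subprocess.run(command, cwd=workdir, capture_output=True, text=True)

    # 3. Log what is needed to reproduce the result.
    log = {
        "command": command,
        "python": sys.version,
        "returncode": result.returncode,
    }
    (workdir / "experiment.json").write_text(json.dumps(log, indent=2))
    return result.returncode
```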

I also wanted the tool to perform a similar function for my colleagues who use Jupyter notebooks (e.g. using nbconvert or papermill), Julia, R etc.

I found that tox (or nox) does the hard work of creating an isolated virtual environment and, what's more, allows you to execute arbitrary commands from inside it. So for a while the idea was basically to build a wrapper around tox -c experiment.ini that created a new directory and ran everything from there. This proved to be a bit fiddly, however, since tox cares a lot about the directory in which the .ini config file resides. That made it difficult to refer to files in a local repository (which would be under version control, so storing their commit hash would be sufficient, whereas copying them would be total overkill).
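An experiment.ini for that abandoned wrapper idea might have looked something like this (the file contents are illustrative, not taken from the project; only the section and key names are standard tox configuration):

```ini
[tox]
envlist = experiment
skipsdist = true

[testenv:experiment]
deps =
    numpy
commands =
    python script.py -c config.yaml
```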

For some time I played with using git -C /path/to/repo --work-tree . checkout <commit> -- <files> to 'checkout' a specific commit from a local repo into a different working directory (I even wanted to put this command into the tox config file). This was a bit annoying because (a) if your workspace is a subdirectory of the repo then you do not end up with just the workspace directory, but with all of its parents in the git working tree, and (b) it's actually kind of annoying to have multiple copies of the worktree checked out all over the place. In fact the behaviour was fairly unintuitive and, I think, too complicated for a tool that is meant to have a very low barrier to entry. I also tried symlinking the workspace into the experiment directory to avoid having to modify paths, but this was bad because (a) you have to symlink individual files, not directories, or else the experiment outputs get sent back to your main workspace, and (b) you end up with a directory cluttered with symlinks to files you don't need.

Ultimately it's pretty overkill to insist on complete environment isolation just to run an experiment. It would probably be sufficient to check that a script can run inside an isolated environment, to confirm that no unrecorded software is being used, and perhaps to test agreement between outputs produced inside versus outside the environment. Also, one of my aims was to build a tool that doesn't massively change someone's workflow, since I just don't think people would use it if it did. So basically I thought it best not to build tox/nox into the tool.

Anyway, after numerous from-scratch rewrites I ended up here with something fairly bloated that amounted to little more than an over-engineered copytree. This is roughly when I realised that by far the most useful bits of my code were the time-saving elements which copied files, logged commit hashes, created a .gitignore and a README etc. It occurred to me that I could make something incredibly simple which was nonetheless still useful, which seems kinda, idk, smart?

So I ended up building basically a wrapper around cookiecutter which also copies files and injects a bunch of useful parameters into the context that can be referred to in templates. Cookiecutter is really just a simple API that uses Jinja to render templated files and directories. This seems like a sweet spot where most people can just use basic templates out of the box, but others can create their own (at no maintenance cost to me :D ). See #3
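As a rough sketch of that "inject useful parameters into the context" step: the function name and context keys below are hypothetical, and only the cookiecutter.main.cookiecutter entry point with its extra_context argument is the real library API.

```python
import datetime
import subprocess


def build_context(repo_path="."):
    """Hypothetical sketch: collect reproducibility metadata to inject
    into the cookiecutter template context."""
    try:
        # Record the current commit of the user's repo (version control
        # does the heavy lifting; we only store the hash).
        commit = subprocess.run(
            ["git", "-C", repo_path, "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    return {
        "commit": commit,
        "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
    }


# The wrapper would then hand this to cookiecutter, roughly:
#     from cookiecutter.main import cookiecutter
#     cookiecutter("path/to/template", no_input=True,
#                  extra_context=build_context())
```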

Hooray.
