Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move default template to static pyproject.toml #2569

Closed
wants to merge 10 commits into from

Conversation

astrojuanlu
Copy link
Member

@astrojuanlu astrojuanlu commented May 8, 2023

Description

Beginning of gh-2280. Note that this should be complemented by a similar PR in https://github.com/kedro-org/kedro-starters/, and is conditioned by the outcomes of gh-2388. But at least it shows how it would be done. Happy to postpone this PR to a later stage if needed.

Missing: documentation updates and testing all the workflows mentioned in the README.md.

Development notes

  1. The first commit fixes a typo in the README.
  2. The second commit moves src/requirements.txt, src/setup.py and src/tests to the root of the project, essentially rendering the Kedro project into a normal Python library with src-layout, as mentioned in Make all starters use pyproject.toml #2280 (comment).
  3. The third commit fixes Make all starters use pyproject.toml #2280 by consolidating project metadata in pyproject.toml (decreasing the number of files by 1), while keeping requirements.txt for dev requirements (fixing Designate a better place for development requirements in Kedro projects #2519).
  4. The fourth commit removes a mention to environment.yml in the template README.md, and replaces that by a very short explanation of library requirements vs development requirements.
  5. The fifth commit (optional) renames setup.cfg, which only holds the flake8 configuration, to .flake8, to remove confusion with the legacy setup.py and discourage users from adding more stuff there (everything should go to pyproject.toml these days).
  6. The sixth commit (optional) changes a pip install .[pandas.CSVDataSet] that is going away soon with pip install kedro-datasets[pandas.CSVDataSet].
  7. The seventh commit introduces a whole lot of churn by replacing src/requirements.txt everywhere by requirements.txt (error messages and documentation mostly).

I recommend using the GitHub functionality to pick what commits to review, especially to remove the last one.

The net result is that this decreases the number of files in the template by 1.

In the end, the workflow for users looks like this:

  • Installing the project with "everything": pip install -r requirements.txt
  • Installing only the library dependencies: pip install -e/--editable .
  • Installing documentation dependencies: pip install -e .[docs]

The README.md file was updated accordingly.

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes

@antonymilne
Copy link
Contributor

There's potential to move dev-requirements.txt into the optional-dependencies table of pyproject.toml, hence replacing pip install -r dev-requirements.txt by pip install -e .[test], and removing 1 more file.

Is there any reason we wouldn't do this? Given that we have (a) docs requirements already as optional-dependencies in pyproject.toml and (b) kedro requirements in pyproject.toml rather than requirements.txt. It feels a bit weird to mix and match the two systems?

Relatedly, I think doing pip install -r src/requirements.txt is pretty hard-wired into people (myself included). Does this become pip install -e .? We shouldn't underestimate the shift for an inexperienced user who is very used to a requirements.txt file migrating to use the slightly "weird" looking pip install -e .. I think this will need some clear messaging and docs as lots of people will wonder where the requirements.txt file has gone and how they use pyproject.toml. Maybe in docs we should leave out the -e for those who are just getting started with kedro?

@astrojuanlu
Copy link
Member Author

astrojuanlu commented May 17, 2023

We can definitely keep requirements.txt around, I agree it's more popular than pip install [-e] . (the reason being, it's an easy mechanism that mostly works, unlike all the rest of Python packaging...)

What we can do is to keep the library code in requirements.txt, like this:

# requirements.txt
-e file:.

With this, when people do pip install -r requirements.txt, then pip install -e . happens automatically.

What I think we should not do is leave the runtime dependencies in requirements.txt, because the library code needs them, and as such they should be in the [project] table of pyproject.toml.

And if we're keeping the requirements.txt, then maybe we can keep using it for development dependencies. So, people that only want the library code deps and know what they're doing can do pip install -e ., whereas users that have pip install -r requirements.txt ingrained can do just that. Something like this:

$ cat requirements.txt
# Install library code
-e file:.

# Development dependencies
black~=22.0
flake8>=3.7.9, <5.0
ipython>=7.31.1, <8.0; python_version < '3.8'
ipython~=8.10; python_version >= '3.8'
isort~=5.0
jupyter~=1.0
jupyterlab_server>=2.11.1, <2.16.0
jupyterlab~=3.0, <3.6.0
nbstripout~=0.4
pytest-cov~=3.0
pytest-mock>=1.7.1, <2.0
pytest~=7.2

What do you think @antonymilne? (And everyone else)

@antonymilne
Copy link
Contributor

antonymilne commented May 17, 2023

What I think we should not do is leave the runtime dependencies in requirements.txt, because the library code needs them, and as such they should be in the [project] table of pyproject.toml.

Fully agree with this. Assuming that's what we do, personally I'm fine with any of these solutions:

  1. requirements.txt just does -e . + dev requirements in dev-requirements.txt + comment in requirements.txt explaining the dependencies are now in pyproject.toml. This is your first suggestion in above comment. User would do pip install -r requirements.txt and pip install -r dev-requirements.txt
  2. requirements.txt does -e . and dev requirements. This your second suggestion in above comment. User would do pip install -r requirements.txt
  3. no requirements.txt + dev requirements as optional-dependencies in pyproject.toml + need more re-education and docs so people know to do pip install .. User would do pip install .

As far as I can tell the advantage of 1 is it's familiar and splits off dev requirements. The advantage of 2 is it's one fewer file and it's even more familiar (as gives essentially same behaviour as now). The advantage of 3 is it feels cleaner and there's two fewer files. Did I miss any other factors here?

The solution I think I'd avoid is the one you have in the code now:

  • no requirements.txt + dev requirements in dev-requirements.txt

I think this is confusing because to a user there's two different looking commands: pip install . vs. pip install -r dev-requirements.txt.

I don't have a strong preference between 1, 2, 3 above but might be worth doing some quick user research to see what people make of it. e.g. maybe actually pip install . is familiar to more people than I guess so solution 3 wouldn't require much re-education.

@deepyaman
Copy link
Member

Fully agree with this. Assuming that's what we do, personally I'm fine with any of these solutions:

  1. requirements.txt just does -e . + dev requirements in dev-requirements.txt + comment in requirements.txt explaining the dependencies are now in pyproject.toml. This is your first suggestion in above comment. User would do pip install -r requirements.txt and pip install -r dev-requirements.txt
  2. requirements.txt does -e . and dev requirements. This your second suggestion in above comment. User would do pip install -r requirements.txt
  3. no requirements.txt + dev requirements as optional-dependencies in pyproject.toml + need more re-education and docs so people know to do pip install .. User would do pip install .

Personally prefer the third option; since we already have requirements in pyproject.toml, I'm not really sure why we should also have requirements.txt with no real additional benefit. People can be educated, if necessary.

I also think it's a bit hypocritical to be trying to simplify the project template, but unnecessarily keeping extra files for roundabout requirements management.

@astrojuanlu
Copy link
Member Author

Adopted option 3 in @antonymilne's comment: now everything lives in pyproject.toml, and the workflow will be pip install [-e/--editable] ..

Notice that folks still can still freeze their dependencies if they want, since pip-tools supports pyproject.toml (I contributed that 2 years ago!)

$ pip-compile --resolver=backtracking
$ grep -A3 kedro== requirements.txt
kedro==0.18.8
    # via
    #   kedro-telemetry
    #   test-kedro-template (pyproject.toml)

@astrojuanlu
Copy link
Member Author

Not the best PR to discuss this but I added a dependency on tomli (the predecessor of stdlib tomllib https://docs.python.org/3/library/tomllib.html) before realizing we were using another third-party toml. Going forward I think it will be better to settle on tomli because it will be included in Python 3.11 onwards, so for this PR I can

  • Keep these two dependencies (hopefully minor ugliness)
  • Convert all the tests to tomli (I tried, and it would be a bit of work - would prefer to not do it here)
  • Convert my TOML handling to old toml (would prefer not to do it but it wouldn't be the end of the world if reviewers think this is better)

@astrojuanlu astrojuanlu marked this pull request as ready for review May 23, 2023 09:18
@astrojuanlu astrojuanlu requested a review from idanov as a code owner May 23, 2023 09:18
@astrojuanlu
Copy link
Member Author

I think this is ready for at least a first review pass. Not so much on the diff (it will be difficult to read, although you're more than welcome to do so), but a bit of QA (kedro new, see the files, test a few workflows).

@merelcht
Copy link
Member

A while ago we decided to use dependabot to manage and bump dependencies when possible (#1810). But if dependency/requirements.txt is removed this will no longer work. Is there a way to automatically bump dependencies in pyproject.toml?

@astrojuanlu
Copy link
Member Author

astrojuanlu commented May 23, 2023

Apparently dependabot works with pyproject.toml! dependabot/dependabot-core#5661

However, notice that this PR does not remove dependency/requirements.txt for Kedro framework. That's the job of #2597

@merelcht
Copy link
Member

Apparently dependabot works with pyproject.toml! dependabot/dependabot-core#5661

However, notice that this PR does not remove dependency/requirements.txt for Kedro framework. That's the job of #2597

Can we then update the dependabot config as part of that PR?

@merelcht
Copy link
Member

Not the best PR to discuss this but I added a dependency on tomli (the predecessor of stdlib tomllib https://docs.python.org/3/library/tomllib.html) before realizing we were using another third-party toml. Going forward I think it will be better to settle on tomli because it will be included in Python 3.11 onwards, so for this PR I can

  • Keep these two dependencies (hopefully minor ugliness)
  • Convert all the tests to tomli (I tried, and it would be a bit of work - would prefer to not do it here)

I don't know the reason why we chose this toml third party library, but if tomli is going to be the standard for python then let's definitely move to that. I'm okay with updating the tests in a separate PR to make it easier to review 🙂

  • Convert my TOML handling to old toml (would prefer not to do it but it wouldn't be the end of the world if reviewers think this is better)

What exactly do you mean by this?

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from moving the ipython and jupyter requirements to dev requirements I'm happy with these changes. I would add this clearly to the release notes and perhaps add a small "migration" paragraph explaining these changes, because I feel like they're pretty unnoticeable to end-user but will cause confusion when they suddenly can't find their requirements.txt files anymore.

Copy link
Contributor

@antonymilne antonymilne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a huge fan of these changes, great work 👏

But I think we need to work out exactly what implications it has on various kedro commands and how we should handle that. I recommend searching for requirements.txt to see exactly where it appears in kedro source, but at a glance this change will affect:

  • several error messages/hints (including the kedro jupyter one I noted already, but there's many others)
  • kedro build-reqs (deprecated but still awkward)
  • kedro build-docs (similarly)
  • kedro micropkg (severely in the case that you have per-pipeline requirements.txt, which I think is very rare but possible)

The problem with altering error messages/hints is that the message would need to work for both the old requirements.txt template and the new pyproject.toml one. Or we could put in some logic that detects the version of your kedro template (it's in metadata) and displays the appropriate error messages.

The problem with commands themselves is more difficult because these actually rely on the structure of your project having a requirements.txt file. Leaving aside more fundamental questions about whether these command should really work like that or exist at all (when it comes to micropackaging, my opinion tends to be "no"), if we make this change to the template in kedro 0.18.x we'd need to update those commands to work for both old and new template styles, which is tricky.

Unless I'm missing something I think the options would be:

  • change these commands to handle both old and new project templates
  • postpone these changes to 0.19

I hope I am missing something because I really like these changes, but unfortunately they turn out to have quite wide-ranging implications which are not so simple to resolve in a non-breaking way.

features/environment.py Outdated Show resolved Hide resolved
features/environment.py Outdated Show resolved Hide resolved
@antonymilne
Copy link
Contributor

And I have no opinion on the toml library really - happy to go with whatever you think is best.

@SajidAlamQB
Copy link
Contributor

Not the best PR to discuss this but I added a dependency on tomli (the predecessor of stdlib tomllib https://docs.python.org/3/library/tomllib.html) before realizing we were using another third-party toml. Going forward I think it will be better to settle on tomli because it will be included in Python 3.11 onwards, so for this PR I can

  • Keep these two dependencies (hopefully minor ugliness)
  • Convert all the tests to tomli (I tried, and it would be a bit of work - would prefer to not do it here)
  • Convert my TOML handling to old toml (would prefer not to do it but it wouldn't be the end of the world if reviewers think this is better)

If we keep both dependencies I think it could lead to complications and confusion. I am ok with the third option it would keep the codebase consistent for now, in follow-up PRs we can switch our tests to tomli.

@astrojuanlu
Copy link
Member Author

Thanks all for the comments!

Apparently dependabot works with pyproject.toml! dependabot/dependabot-core#5661

However, notice that this PR does not remove dependency/requirements.txt for Kedro framework. That's the job of #2597

Can we then update the dependabot config as part of that PR?

Absolutely! Just left a note there so I don't forget.

  • Convert my TOML handling to old toml (would prefer not to do it but it wouldn't be the end of the world if reviewers think this is better)

What exactly do you mean by this?

Instead of using tomli in this PR, use toml instead. It's only one file, so I'll do as @SajidAlamQB says:

I am ok with the third option it would keep the codebase consistent for now, in follow-up PRs we can switch our tests to tomli.

And on @antonymilne's comment:

Unless I'm missing something I think the options would be:

  • change these commands to handle both old and new project templates
  • postpone these changes to 0.19

Yeah I think this is a really good point. If we merge this too early as it stands now, there's probably going to be conflicts of this sort. It also resonates with @merelcht other point:

will cause confusion when they suddenly can't find their requirements.txt files anymore.

Maybe we should circle back to option (2) in @antonymilne's earlier comment, hence to keep requirements.txt around, to make these changes less disruptive. I agree with @deepyaman that users can be educated but with the amount of shuffling going on in this PR maybe it's wiser to avoid breaking everyone's workflows at once, and start by just moving the requirements.txt to the root of the project without actually removing it. Unlike setup.py, which is actively deprecated, requirements.txt files are not going anywhere soon, so we shouldn't be in a rush to remove those.


So, summary of changes needed here, if folks agree:

  • Keep requirements.txt, and add -e file:. there so that library code installation happens automatically as a result.
  • Keep ipython in requirements.txt for now (hence not as development dependency), defer to post-0.19 any further changes to that.
  • Remove dependency on tomli for now and use toml instead, defer migration to another PR.
  • Try to remove the horrible pyproject.toml parsing hack, if possible.

And then give a second pass to the documentation and release notes.

@merelcht
Copy link
Member

That sounds like a good plan, thanks @astrojuanlu !

@antonymilne
Copy link
Contributor

Yep, sounds like a good resolution to me also. I'd just add that you should do a search in the source code for references to src/requirements.txt since I think there will be quite a few.

One very minor thing out of curiosity: why the need for file in -e file:.?

@astrojuanlu
Copy link
Member Author

Yep, sounds like a good resolution to me also. I'd just add that you should do a search in the source code for references to src/requirements.txt since I think there will be quite a few.

Good point, thanks!

One very minor thing out of curiosity: why the need for file in -e file:.?

To workaround this long standing pip-tools bug 😢 jazzband/pip-tools#204

@astrojuanlu
Copy link
Member Author

Okay, it's almost there but I have to halt this now. The remaining unit and e2e test failures have to do with kedro micropkg workflows I don't understand and some kedro package shenanigans related to the configuration creation.

@astrojuanlu

This comment was marked as duplicate.

@astrojuanlu

This comment was marked as resolved.

@astrojuanlu
Copy link
Member Author

After gh-2414 is merged, I'm going to try to rework this as follows:

  1. Cosmetic changes: environment.yml mention in the template README.md, rename setup.cfg to .flake8, replacement of pip install .[pandas.CSVDataSet] to pip install kedro-datasets[pandas.CSVDataSet] (mostly harmless)
  2. Just replace setup.py with pyproject.toml (should be almost transparent, but at this point there would be 2 pyproject.toml files)
  3. Attempt the requirements.txt changes as we decided in Move default template to static pyproject.toml #2569 (comment) (maybe tricky because of how the tests parse the requirements files - if it's too tricky, I'll abort and just leave them as they are)
  4. Merge the two pyproject.toml (essentially move one of them one directory up) to finally turn the templates into Pythonic src-layout packages)

In the interest of not having to wait until pre-0.19, I'll try to go as far as possible while not breaking backwards compatibility, adding fallbacks where needed, and then have the actual breaking changes in 0.19. I prefer not to send a PR to develop because this is going to shuffle lots of files around and create a lot of conflicts.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Uses `requirements.txt` for dev requirements in project template.

Fix gh-2280.
Fix gh-2519.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
flake8 is the only linter that does not allow configuration in pyproject.toml,
so we rename the file to to avoid confusion with the legacy setup.py
and make it clear that it only holds the configuration of a single tool.

Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
@astrojuanlu

This comment was marked as resolved.

@astrojuanlu
Copy link
Member Author

Conversation here is too long because we bounced several ideas and also there are too many conflicts. Closing in favor of gh-2853.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make all starters use pyproject.toml
5 participants