Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How should package collection work? #7777

Closed
bluetech opened this issue Sep 20, 2020 · 23 comments
Closed

How should package collection work? #7777

bluetech opened this issue Sep 20, 2020 · 23 comments
Labels
topic: collection related to the collection phase

Comments

@bluetech
Copy link
Member

I am working on cleaning up our collection code, but the current behavior seems odd and incidental, therefore I'd first like to discuss how it should behave.

pytest and operating system versions: pytest 6.0.2/master on Linux.

Current behavior

Consider the following filesystem tree:

a
├── b
│   ├── c
│   │   ├── d
│   │   │   ├── __init__.py
│   │   │   └── test_d.py
│   │   └── test_c.py
│   ├── __init__.py
│   └── test_b.py
├── b2
│   ├── __init__.py
│   └── test_b2.py
├── __init__.py
└── test_a.py

This has several nested packages, but note that the c level doesn't have an __init__.py (just to make it more interesting).

This results in the following collection tree:

$ pytest --co a/
collected 5 items

<Package a>
  <Module test_a.py>
    <Function test_a1>
<Package b>
  <Module test_b.py>
    <Function test_b1>
<Module a/b/c/test_c.py>
  <Function test_c1>
<Package d>
  <Module test_d.py>
    <Function test_d1>
<Package b2>
  <Module test_b2.py>
    <Function test_b21>

The Packages are all flat, not nested. Namespace packages are not considered Packages. Files not inside packages are collected as standalone Modules.

This however does not reveal the real story. See what happens when --keep-duplicates is used:

$ pytest --co --keep-duplicates a/
collected 11 items

<Package a>
  <Module test_a.py>
    <Function test_a1>
  <Package b>
    <Module test_b.py>
      <Function test_b1>
    <Module test_c.py>
      <Function test_c1>
    <Package d>
      <Module test_d.py>
        <Function test_d1>
  <Package b2>
    <Module test_b2.py>
      <Function test_b21>
<Package b>
  <Module test_b.py>
    <Function test_b1>
  <Module test_c.py>
    <Function test_c1>
  <Package d>
    <Module test_d.py>
      <Function test_d1>
<Module a/b/c/test_c.py>
  <Function test_c1>
<Package d>
  <Module test_d.py>
    <Function test_d1>
<Package b2>
  <Module test_b2.py>
    <Function test_b21>

This has all of the previous collectors, but also duplicates which are nested under each Package.

Technical details

For this interested, the code details are as follows:

pytest has two recursive filesystem collectors, Session and Package.

Session.collect() walks the entire trees (of the given command line arguments) recursively in BFS order and creates collectors for Packages and Modules. It has various obscure code to special-case Packages and exclude files belonging directly to the package. The Collectors it creates always have the Session as parent (i.e. flat).

Note that the Collectors themselves are only recursively expanded to Items after the above is finished. (This step is done by Session.genitems() which calls each Collector's own collect() method).

Package.collect() also walks the package's directory recursively. It has some code to try to avoid collecting Modules belonging to sub-packages, but otherwise creates Modules and Packages with itself as parent.

Since the stuff collected by Packages was already collected by the Session (with the exception of a package's own direct files) they are mostly discarded as duplicates unless --keep-duplicates is used.

Question

My question is, what do we want to happen?

Evidently the nesting is not super important, otherwise we would have heard loud complaints by now (though there are some issues about this). But the nesting does have an effect on how package-scope fixtures are applied - reuse from super-package or not?

The duplication seems definitely broken.

And there is a question on how PEP 382 namespace packages fit into this (if it all).

Would be happy for any thoughts!

@bluetech bluetech added the topic: collection related to the collection phase label Sep 20, 2020
@RonnyPfannschmidt
Copy link
Member

I'd like to have a more detailed discussion about structuring the collection tree, we accrued a mess that requires some decomposition to begin with

@nicoddemus
Copy link
Member

Thanks @bluetech for putting up this detailed description.

Indeed the Package node was not thought through before being included, and I'm at fault for not giving the attention it deserved at the time.

Evidently the nesting is not super important, otherwise we would have heard loud complaints by now (though there are some issues about this). But the nesting does have an effect on how package-scope fixtures are applied - reuse from super-package or not?

I think packages should be nested, just like the other nodes. I'm surprised that no one noticed this inconsistency before (including myself). Then package-scoped fixtures follow the normal rules that other fixtures follow.

The question about namespace packages unfortunately is not that simple. Besides being a pain to detect at runtime (at least that I recall the discussions around that subject), we need to wonder how they would be represented given that in the same namespace package, sub-packages might be located in completely different directories, and the collection tree is directory-based. In other words, in your example if we collect a/b/c as a namespace package, what happens if some other c/foo package exists in a directory outside of the collection root a?

I think however we can postpone what to do with namespace packages to some other time, and focus on natural packages (those with __init__.py) instead for now.

I'd like to have a more detailed discussion about structuring the collection tree, we accrued a mess that requires some decomposition to begin with

Good idea @RonnyPfannschmidt. You want to do that here or somewhere else? I think here is fine IMHO.

@MarximusMaximus
Copy link

Found this issue while trying to figure out why my nested test packages all report as top level packages. I, for one, if I went through the trouble to nest them, I want them to be nested when reported. Is there a way to force that to happen right now for my situation as a workaround? I've tried so many things that I'm at a loss.

@MarximusMaximus
Copy link

Found this issue while trying to figure out why my nested test packages all report as top level packages. I, for one, if I went through the trouble to nest them, I want them to be nested when reported. Is there a way to force that to happen right now for my situation as a workaround? I've tried so many things that I'm at a loss.

Ok, so I wrote a plugin as a workaround for what I needed since I couldn't find anything else. Code is MIT license.
PyPI: https://pypi.org/project/pytest-prefer-nested-dup-tests/
github: https://github.com/MarximusMaximus/pytest-prefer-nested-dup-tests

Hopefully it's useful to other people!

@bluetech
Copy link
Member Author

Possible Plan

Looked at this again.

The main trouble here stems from the fact that we have two recursive collectors -- Session and Package -- which step on each other toes in unexpected ways, and it's very hard to fix.

I'm increasingly convinced that we shouldn't have any recursive collector. Huh? Well, if you think about it, we already have a concept to handle the recursion for us -- the collection tree. The collectors are already recursively expanded on their own, that is, a Collector can collect other Collectors, which are then collected themselves, etc. until we get only Items (for reference, see Session.genitems, which is literally implemented with simple recursion).

What does it mean in practice? At a high level:

  • Introduce a new Directory collector node.
  • Remove recursion (filesystem walking) from Session and Package.
  • A directory is collected like this: if the directory contains an __init__.py file, create a Package, otherwise a Directory.
  • Session collects directories from the collection arguments (paths given on the command line/testpaths).
  • Package and Directory collect like this: scan their directory: if the entry is a file, collect it like today; if it's a directory, collect it like as above.

This looks like this:

<Session>
    <Package a>
        <Package b>
            <Directory c>
                <Package d>
                    <Module test_d.py>
                        <Function test_d>
                <Module test_c.py>
                    <Function test_c>
            <Module test_b.py>
                <Function test_b>
        <Package b2>
            <Module test_b2.py>
                <Function test_b2>
        <Module test_a.py>
            <Function test_a>

Some points:

  • <Directory c> is what I call a "graft" - a non-package directory inside of a package directory. There are two options: nest it under <Package b>, or nest it directly under Session as an independent sub-tree. This mostly has implications for package scope fixtures.

  • Seems like in very old versions of pytest there existed a Directory node, but it was removed in 6dac774. It's too hard to try and figure out why unfortunately.

  • One thing I need to think about is file collection arguments (i.e. pytest foo/bar/test_it.py). We need to have collection nodes for foo and foo/bar in this case (be they Directorys or Packages), but ignore irrelevant paths (e.g. foo/test_baz.py). This is currently handled by Session, but can't be under the new scheme (I think. Maybe the "matchnodes" mechanism is good enough for this).

  • This will have to go in a major release as a breaking change.

I have a very simple POC implementation, but there are a bunch of details to figure out. Interested to hear opinions. I vaguely remember @RonnyPfannschmidt mentioned a Directory node, but I can't find it and might be misremembering.

@RonnyPfannschmidt
Copy link
Member

It's been more than a decade, that directory node existed

It was undone for good reasons, but I think it is reasonable to reintroduce it

@bluetech
Copy link
Member Author

It was undone for good reasons

Do you remember what they were? I do suspect there is some issue here; the Directory solution is pretty natural and harmonious with the entire collection tree concept, that I'm sure it's been tried. If there's something fundamental I'm sure I'll run into it eventually but sooner is better than later :)

@RonnyPfannschmidt
Copy link
Member

Pytest prior to the refactoring back then had no concept of separation of test running and test execution

Tests where executed as found

While changing the details, it was simpler to do collect towards if there where only files and session

@MarximusMaximus
Copy link

Suggestion regarding the design above: Directory nodes wrap Packages that are not inside other Packages, such as:

<Session>
    <Directory a>
        <Package a>
            <Package b>
                <Directory c>
                    <Package d>
                        <Module test_d.py>
                            <Function test_d>
                    <Module test_c.py>
                        <Function test_c>
                <Module test_b.py>
                    <Function test_b>
            <Package b2>
                <Module test_b2.py>
                    <Function test_b2>
            <Module test_a.py>
                <Function test_a>
    <Directory z>
        <Package z>
            ...
        <Module z>
            ...
     <Directory y>
        <Directory y2>
            <Module y3> (single explicit file path case: ./y/y2/y3.py)

I feel like this (1) makes it clear that a session can be multiple directories (since it can already be multiple "top level" packages, this is roughly equivalent), and (2) Should(tm) make it easier to track file system scoping b/c it Should(tm) only need to find the closest parent Directory node.

NOTE: The suggestion implies that we would SKIP Directory Nodes for sub-packages and modules BUT include directory nodes for "grafted sub-packages". In other words, (I think) only add a Directory Node if the directory lacks an init.py (grafted case) OR does not have a parent package (top level packages).

@bluetech
Copy link
Member Author

@MarximusMaximus right. I'm not entirely sure how it would look like exactly, but that's the basic idea - Directory is a non-package directory, and Package is a yes-package directory (really the __init__.py file); and we mirror the filesystem hierarchy with either Directory or Package nodes.

@bluetech
Copy link
Member Author

In case I get hit by a bus, my WIP branch is here: https://github.com/bluetech/pytest/commits/pkg-collect

It seems viable for sure, but very delicate.

@bluetech
Copy link
Member Author

bluetech commented Jun 4, 2023

Package should not be a Module

This comment was extracted to #11137

@bluetech
Copy link
Member Author

bluetech commented Jun 5, 2023

pytest_collect_directory hook

After the changes proposed in this issue, all of Session, Package and Directory would need to collect directories. It basically looks like this:

if (path / "__init__.py").is_file:
    return Package.from_parent(parent, path=path)
else:
    return Directory.from_parent(parent, path=path)

I don't like that this logic is hardcoded, and repeated in 3 places.

I propose to add a new hook, pytest_collect_directory, similar to the existing pytest_collect_file, but for directories. There will be 2 implementations:

  1. An implementation in main.py will create Directory nodes.
  2. An implementation in pytest.py will create Package nodes, if the directory has an __init__.py file.

This has the following benefits:

  1. Avoids the repetition.
  2. Decouples the python plugin - now main doesn't need to know about Package anymore, the python plugin plugs Package itself.
  3. Plugins can invent new features involving directories. For example, a YamlDirectory which collects files based on some manifest, or a virtual filesystem, etc.

One question is whether the hook should be firstresult or not. The pytest_collect_file hook is not, which means multiple plugins can generate multiple nodes for a single file. But firstresult=False creates an issue for pytest_collect_directory -- main doesn't want to create a Directory if a Package is created. So I'm going to define pytest_collect_directory as firstresult=True, which means only one plugin creates a node for a directory.

POC

This is currently implemented as part of the bigger change here:
https://github.com/bluetech/pytest/commits/pkg-collect
This branch doesn't pass all tests yet, but I don't foresee issues with pytest_collect_directory in particular.

@bluetech
Copy link
Member Author

bluetech commented Jun 5, 2023

Amusingly I just realized that pytest_collect_directory already existed up to pytest 6.0, but was removed because it was broken: deprecation doc, deprecation issue, deprecation pr, removal pr. So I propose to bring it back, with a different definition (py.path -> pathlib, returns the node instead of just being called??), but this time it will actually work and be useful :)

@bluetech
Copy link
Member Author

After a lot of tinkering with obscure pytest behaviors (keepduplicates, conftest exceptions, eek), I managed to make the branch pass all tests. I'll now be working on preparing it for submission (split commits, update docs, add tests, add changelogs, finding relevant issues etc.). I'll be proposing it for pytest 8.0.

@bluetech
Copy link
Member Author

Abstract Directory node, parent of Dir and Package

One change I've made from the initial proposal is to add an abstract nodes.Directory node, which is analogous to nodes.File for files, and renamed the concrete Directory node to Dir to avoid the naming conflict.

Reminder: nodes.File is the parent of python.Module, but also used for non-Python tests, like YamlFile etc.

Since now we have a proper collector hierarchy mirroring the filesystem hierarchy, I've decided to add an abstract nodes.Directory, which is a parent of the concrete python.Package (directories with __init__.py) and main.Dir (directories without __init__.py).

I think it's nice because:

  1. Plugins can invent their own Directory nodes, like they can for Files.

  2. It further decouples the python plugin, now can replace explicit checks for python.Package (where it stands for "a directory") to nodes.Directory.

  3. We can replace the package scope, which is tied to python.Package, with a directory scope which works for any Directory. Why not allow non-Package directories to have directory-level fixtures? (This will be follow up work if we think it's a good idea).

  4. I think it would be good to tighten pytest_collect_directory return type to nodes.Directory, to make clear the expectations from this hook. (Side note: would also be good to make pytest_collect_file need to return nodes.File instead of an arbitrary Collector...).

@bluetech
Copy link
Member Author

Order of files vs. directories in collection

Previously, files in a directory were collected before sub-directories in the directory. That is, given a filesystem tree

top/
├── aaa
│   └── test_aaa.py
├── test_a.py
├── test_b
│   └── test_b.py
├── test_c.py
└── zzz
    └── test_zzz.py

would collect as

<Module top/test_a.py>
  <Function test_it>
<Module top/test_c.py>
  <Function test_it>
<Module top/aaa/test_aaa.py>
  <Function test_it>
<Module top/test_b/test_b.py>
  <Function test_it>
<Module top/zzz/test_zzz.py>
  <Function test_it>

After my changes, the order that naturally flows from the code is that files and sub-directories are orderer jointly, that is the tree is

<Dir pytest>
  <Dir top>
    <Dir aaa>
      <Module test_aaa.py>
        <Function test_it>
    <Module test_a.py>
      <Function test_it>
    <Dir test_b>
      <Module test_b.py>
        <Function test_it>
    <Module test_c.py>
      <Function test_it>
    <Dir zzz>
      <Module test_zzz.py>
        <Function test_it>

For now I am keeping the joint ordering, but let me know if you think ordering files before subdirs is better.

@RonnyPfannschmidt
Copy link
Member

How about leaving the recursive FS Walk to a single utility that's collects directory/package/file nodes and provides the correct parents

The collect function of each node could then returns the items/definitions within

@bluetech
Copy link
Member Author

This would restrict Directory collectors to only handle their files. We not give them control over the sub-directories? What's the advantage?

@RonnyPfannschmidt
Copy link
Member

No id actually like to go further and leave the scanning of the files tree in the hand off pytest

I'd strictly want to avoid a situation where multiple collectors recursively have to cooperate to scan all the files

Walking the file tree and mapping directories and files to nodes ought to be handled in a single place

@bluetech
Copy link
Member Author

I understand, but why do you want it to be handled in one place? My idea is to give control to plugins (i.e. custom nodes, including our own, e.g. python, which currently requires hardcoding in Session, i.e. is not independent). That's more pytest-y IMO.

@RonnyPfannschmidt
Copy link
Member

Any extension that implements file collection Will have to share implementations of selection with pytest, which implies exposing the apis for correctly handling it,and thrusting plugin to correctly use it

Based on how people implemented things that doesn't seem safe

@bluetech
Copy link
Member Author

Any extension that implements file collection Will have to share implementations of selection with pytest, which implies exposing the apis for correctly handling it,and thrusting plugin to correctly use it

Based on how people implemented things that doesn't seem safe

I agree with this concern. I somewhat procrastinated on this comment, but I tried various things, and in the end I don't have a solution that alleviates this concern without making the hook/custom directory collector mostly useless. So instead of holding up pytest 8 for even longer, I decided to submit what I have (PR #11646), which I think is pretty good, and hope that plugins do the right thing. I tried to document the expectations in the Directory documentation.

bluetech added a commit to bluetech/pytest that referenced this issue Nov 27, 2023
bluetech added a commit to bluetech/pytest that referenced this issue Nov 27, 2023
bluetech added a commit to bluetech/pytest that referenced this issue Nov 27, 2023
bluetech added a commit to bluetech/pytest that referenced this issue Nov 28, 2023
bluetech added a commit to bluetech/pytest that referenced this issue Nov 29, 2023
bluetech added a commit to bluetech/pytest that referenced this issue Nov 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: collection related to the collection phase
Projects
None yet
Development

No branches or pull requests

4 participants