How should our test matrix look like? #6085

Open · fjetter opened this issue Apr 7, 2022 · 15 comments
Labels: discussion (Discussing a topic with no specific actions yet)

@fjetter (Member) commented Apr 7, 2022

There have been a few conversations about how our test matrix should look. There are clearly costs and benefits to every configuration, and I think we should settle on a strategy so we don't have to rehash this in individual tickets.

Related tickets


One proposal

Currently we have:

Python version | OS                     | Which deps                  | Deps version
3.8            | Linux, MacOSX, Windows | mandatory, optional, Cython | latest
3.9            | Linux, MacOSX, Windows | mandatory, optional, CUDA   | latest
3.10           | Linux, MacOSX, Windows | mandatory, optional         | latest + git tip

I think we should change it to:

Python version | OS                     | Which deps                  | Deps version
3.8            | Linux                  | mandatory, optional, CUDA   | pinned to minimum
3.8            | Linux                  | mandatory only (no numpy!)  | latest
3.8            | Linux, MacOSX, Windows | mandatory, optional, CUDA   | latest
3.9            | Linux                  | mandatory, optional, CUDA   | latest
3.10           | Linux, MacOSX, Windows | mandatory, optional, CUDA   | latest + git tip

That's 9 workflows before and after, but with one less Mac workflow and a lot of added value.
I expect the first two workflows to require a lot of effort before they become green.

Originally posted by @crusaderky in #6073 (comment)

@fjetter added the discussion label on Apr 7, 2022
@fjetter (Member, Author) commented Apr 7, 2022

repost: #6073 (comment)

Do we need CUDA/pynvml everywhere? Isn't this what gpuCI is for?

@fjetter (Member, Author) commented Apr 7, 2022

Another thing we might want to consider is marking the few test modules that are likely to be OS-sensitive and running only those on Windows and macOS, which would cut the runtime significantly (see the sketch at the end of this comment).

Relevant modules would be

  • Everything around comms
  • deploy (process heavy)
  • CLI
  • Wherever we fiddle with low-level asyncio code (I think asyncio_comm is the only module that currently qualifies, from my POV)
  • Maybe protocol (memory allocation tricks)
  • Maybe there are a few special cases?

I think most of the logic we care about in our tests is entirely unrelated to the OS.

@fjetter (Member, Author) commented Apr 7, 2022

I would actually like to see us drop macOS coverage to either a subset of the test suite or to one Python version only. The org-level constraint is at most 5 concurrent macOS builds, and particularly during US hours we have PRs waiting for hours before they can even pick up a macOS job.

@crusaderky (Collaborator) commented:
Everything around spilling is potentially impacted by whether or not lz4 is installed.

@mrocklin (Member) commented:
> Do we need CUDA/pynvml everywhere? Isn't this what gpuCI is for?

cc @jakirkham @quasiben @charlesbluca

@mrocklin (Member) commented:
> Another thing we might want to consider is marking the few test modules that are likely to be OS-sensitive and running only those on Windows and macOS, which would cut the runtime significantly.

Windows is pickier when it comes to network and disk. I've found this to be valuable in surprising situations. OSX and Linux are correlated enough that I agree we don't necessarily need to duplicate across them.

I'd love to see us run the entire test suite on every OS for at least one configuration (but not all).

@jakirkham (Member) commented:
cc @dask/gpu

@jakirkham (Member) commented:
> Do we need CUDA/pynvml everywhere? Isn't this what gpuCI is for?

I should add that I don't really understand the question. I do see pynvml in the CI requirements, which is pretty lightweight. Nothing else stuck out to me as "CUDA". What is meant by that? When asking for help, it would be useful to formulate the question clearly and explain the underlying problem; neither is clear to me.

@quasiben (Member) commented:
I don't think we need CUDA/pynvml everywhere. I believe we are only running on Linux/Python 3.9/CUDA 11.5:
https://github.com/dask/distributed/blob/main/continuous_integration/gpuci/axis.yaml . @charlesbluca has been a great steward of updating gpuCI for CUDA/RAPIDS updates as new releases come in.

@crusaderky (Collaborator) commented:
> > Do we need CUDA/pynvml everywhere? Isn't this what gpuCI is for?
>
> I should add that I don't really understand the question. I do see pynvml in the CI requirements, which is pretty lightweight. Nothing else stuck out to me as "CUDA". What is meant by that? When asking for help, it would be useful to formulate the question clearly and explain the underlying problem; neither is clear to me.

By CUDA-related packages I mean pynvml, pytorch, and torchvision.
They are currently installed in selected GitHub CI environments, even though those hosts don't have NVIDIA GPUs.

@jakirkham (Member) commented:
So pynvml is definitely CUDA-related. My guess is it is not needed or used on CPU-only systems, but I could be wrong (we could be testing that pynvml support fails gracefully on CPU-only machines, for example). I would ask @jacobtomlinson and @rjzamora to confirm.

This may come as a surprise, but pytorch and torchvision are not currently being used on GPU. They are only used on CPU to test this serialization logic (also see this torchvision test). These libraries can indeed be used on CPU only, and that seems to be what is happening here.

I should add that this is exactly why I asked the question: I was concerned there may be misconceptions about what is being used where.

On the second question: is there an issue with having these tested? What brought this up?

@charlesbluca (Member) commented:
> Do we need CUDA/pynvml everywhere? Isn't this what gpuCI is for?

I would prefer keeping pynvml in all testing environments. Like @jakirkham said, it is relatively lightweight, and I think it's important to cover the (rare, but plausible) case where a user has pynvml installed but no GPUs on their system, to make sure that monitoring still works properly.

My interpretation of the proposed testing matrix is that we would like to expand the current gpuCI matrix in Distributed to include additional Python versions / OSes - some comments on this:

  • Python 3.10 testing support would be limited, if possible at all, as RAPIDS currently only supports 3.8 / 3.9
  • in terms of OS, gpuCI is limited to Linux, though I think Windows support through WSL2 is being explored; I don't see much value in testing on pure Windows / macOS as RAPIDS currently doesn't support them

@fjetter (Member, Author) commented Apr 14, 2022

I just want to clarify why I opened this ticket, why I'm looking for a "minimal dependency" build, and why I'm suggesting to drop some packages from certain builds. I have the feeling that people feel the need to "defend" pynvml; that was not my intention.

High level motivation

  • CI runtimes are quite long and this lengthens iteration cycles. That's not the fault of the dependencies themselves, but I'm wondering if there is anything we could "not test" and still be confident we're covered for what matters.
  • CI is often bottlenecked by macOS jobs because the entire dask GitHub organization shares five concurrent runners (Drop OSX builds for python 3.9 #6073)
  • When dropping py3.7 and introducing py3.10 we had discussions in both cases about what the test matrix should look like. I noticed that nobody knew exactly why it is the way it is, or where and why we made a decision about it. I am looking for clarification.

OS jobs

  • macOS jobs are limited to five concurrent runners for the entire dask organization. Nowadays, many more people are working on the project and this limitation is starting to hurt.
  • Windows jobs are known to be slower than Ubuntu jobs.
  • Most of our tests are not particularly sensitive to OS specifics, with the exception of networking, processes, etc. Can we leverage this knowledge somehow?

If we could reduce what we test on these OSs we'd get faster iteration times for everyone.

Optional dependencies

  • Installing optional dependencies has a cost: we need to solve the environment, download and install the packages, we run additional tests, etc. It's not a huge thing, but it adds up.
  • For example, do we need scikit-learn and scipy in all environments (that's already 100 MB)? Do we really want to install s3fs in every environment?
  • I like the idea of testing a library with its bare minimal runtime dependencies. I consider this a best practice to ensure that we're not accidentally hitting an import error or something like that. I don't think we have this problem right now, but it's a nice-to-have nonetheless (see the sketch after this list).
  • I'm sorry if I offended anyone by suggesting to drop CUDA/pynvml. I barely know what pynvml does. Thanks @charlesbluca for pointing out that there is even some non-GPU monitoring going on and that we should ensure it doesn't crash when GPUs are not around. This is the feedback I was looking for. If you tell me "we need this everywhere", I'll ask why because I want to understand the motivation, but I will obviously trust the experts on this.
  • It appears "CUDA" was the wrong label and we're really talking about "ML stuff"? I see a few different categories, e.g. ml (pytorch, scikit-learn, scipy, keras, ...), gpu (maybe only pynvml?), io (s3fs, fsspec, h5py, netcdf4), ux/misc (prometheus, bokeh, compression libraries, SSH, crick).

@jacobtomlinson (Member) commented May 6, 2022

> I think it's important to cover the (rare, but plausible) case where a user has pynvml installed but no GPUs on their system

In my experience it's not that rare. Particularly on HPC environments where folks have one conda environment and select different hardware nodes for different tasks.

@jakirkham (Member) commented:
Thanks Florian! 🙏

So originally pytorch and torchvision were in a single CI job. In fact, that may be the case for several of these packages. My guess is that as we dropped and added Python versions in the test matrix, things got unintentionally copied over, which has resulted in these showing up in several jobs. It probably makes sense to limit them back to one environment.

That all being said, these may wind up getting copied over to new environments again unintentionally. So it might be worth thinking of a system for managing optional-dependency testing that is less prone to this kind of issue than what we have today. For example, we could keep a separate requirements list of optional dependencies that is installed after creating the conda environment, and only when some CI environment variable is set (CI_OPTIONAL_REQS or similar). We could also break it into smaller groups if that is helpful.
