Make torch dependency more flexible #355

Closed · jluethi opened this issue Mar 27, 2023 · 16 comments · Fixed by #406
Labels: High Priority (Current Priorities & Blocking Issues), package

Comments

jluethi (Collaborator) commented Mar 27, 2023

Currently, we hardcode torch version 1.12 in the fractal-tasks-core dependencies so that it works well on older UZH GPUs. The tasks themselves don't depend on that specific torch version, though, and run fine with other torch versions (e.g. 1.13 or even the new 2.0.0).

The 1.12 dependency caused some issues in @gusqgm's Windows Subsystem for Linux test. On the FMI cluster, it works on some GPU nodes but runs into the error below on others. I have now tested with torch 2.0.0 and everything works.

Thus, we should make the torch version more flexible. The correct torch version to install depends on the infrastructure, not the task package.

Until we have that flexibility, a workaround is to manually install the desired torch version into the task venv:

source /path/to/task-envs/.fractal/fractal-tasks-core0.9.0/venv/bin/activate
pip uninstall torch
pip install torch==2.0.0
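
To confirm the swap worked on a given node, here is a quick sanity check (assuming the venv activated in the workaround above is still active):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If the installed torch build supports the node's GPU, the second value should be True.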

In case someone is searching for it, this is the error message I hit when the torch version doesn't match the GPU:

Traceback (most recent call last):
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 693, in <module>
    run_fractal_task(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/_utils.py", line 91, in run_fractal_task
    metadata_update = task_function(**task_args.dict(exclude_unset=True))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 542, in cellpose_segmentation
    new_label_img = masked_loading_wrapper(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/lib_masked_loading.py", line 240, in masked_loading_wrapper
    new_label_img = function(image_array, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/fractal_tasks_core/cellpose_segmentation.py", line 110, in segment_ROI
    mask, _, _ = model.eval(
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/models.py", line 552, in eval
    masks, styles, dP, cellprob, p = self._run_cp(x,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/models.py", line 616, in _run_cp
    yf, style = self._run_nets(img, net_avg=net_avg,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 363, in _run_nets
    y, style = self._run_net(img, augment=augment, tile=tile, tile_overlap=tile_overlap,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 442, in _run_net
    y, style = self._run_tiled(imgs, augment=augment, bsize=bsize,
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 543, in _run_tiled
    y0, style = self.network(IMG[irange], return_conv=return_conv)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/core.py", line 315, in network
    y, style = self.net(X)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 202, in forward
    T0    = self.downsample(data)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 84, in forward
    xd.append(self.down[n](y))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/cellpose/resnet_torch.py", line 47, in forward
    x = self.proj(x) + self.conv[1](self.conv[0](x))
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/path/to/task/envs/.fractal/fractal-tasks-core0.9.0/venv/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
tcompa transferred this issue from fractal-analytics-platform/fractal-server on Mar 28, 2023
tcompa (Collaborator) commented Mar 28, 2023

I guess this was meant to be a fractal-tasks-core issue (unless its goal is to provide a way to install different packages on different clusters).

Relevant refs on CUDA/pytorch versions and compatibility:

tcompa (Collaborator) commented Mar 28, 2023

(also: ref #220)

jluethi (Collaborator, Author) commented Mar 28, 2023

My bad. Yes, it should be a tasks issue :)

And the goal would be to allow an admin setting things up, or a user installing the core tasks, to get more control over which torch version is used. The effect would be that different torch versions are installed on different clusters. I'm not sure what the best way to make this happen will be, but it shouldn't be a server concern if at all possible :)

tcompa (Collaborator) commented Mar 28, 2023

A possible way out would be to add package extras, so that one could install the package as

pip install fractal-tasks-core[pytorch112]
pip install fractal-tasks-core[pytorch113]

Let's rediscuss it.

jluethi (Collaborator, Author) commented Mar 29, 2023

  • Optional extras specify the pytorch version.
  • If nothing is specified, pip install cellpose will install something (likely the newest pytorch version).

tcompa added the package label on May 31, 2023
jluethi (Collaborator, Author) commented Jun 6, 2023

What is our plan regarding torch versions for the fractal-tasks extra? I'm not the biggest fan of multiple different extra editions, tbh, but it would be great to make the torch installation work better (i.e. also work "out of the box" on systems more modern than the UZH GPUs).

tcompa (Collaborator) commented Jun 7, 2023

Refs (to explore further):

tcompa (Collaborator) commented Jun 7, 2023

https://peps.python.org/pep-0508/#environment-markers would be a decent solution, if we can provide a bunch of markers that identify the UZH system, and if we can make it work in poetry. Big assumption: that markers also apply to versions, and not only to the actual presence of a dependency.

Maybe doable by combining

[tool.poetry.dependencies]
pathlib2 = { version = "^2.2", markers = "python_version <= '3.4' or sys_platform == 'win32'" }

with

[tool.poetry.dependencies]
foo = [
    {version = "<=1.9", python = ">=3.6,<3.8"},
    {version = "^2.0", python = ">=3.8"}
]
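
For instance, here is a hedged sketch of how the two snippets above might combine for torch (the sys_platform marker is purely illustrative; PEP 508 markers only expose things like the Python version and platform, so they cannot single out the UZH GPU nodes specifically):

[tool.poetry.dependencies]
# Sketch only: pin 1.12 under one marker, allow up to 2.0.0 otherwise.
torch = [
    {version = "1.12.*", markers = "sys_platform == 'linux'"},
    {version = "<=2.0.0", markers = "sys_platform != 'linux'"}
]

Since markers cannot distinguish two Linux clusters with different GPU generations, this approach only goes so far.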

tcompa (Collaborator) commented Jun 7, 2023

We explored multiple options with @mfranzon, and we don't see any that makes sense to us via conditional dependencies or anything similar. We therefore propose the following:

  • fractal-tasks-core depends on a more flexible torch version (e.g. <=2.0.0)
  • The sysadmin keeps installing the "correct" version (of torch, for instance) after the task collection is complete.

Since this is very tedious, we also propose the following workaround for doing it automatically (to be included in fractal-server; we can then open an issue over there).
The /api/v1/task/collect/pip/ currently takes this request body:

{
  "package": "string",
  "package_version": "string",
  "package_extras": "string",
  "python_version": "string"
}

We could add an additional attribute, like custom_package_versions. This would be empty by default, and only at UZH would we set it to custom_package_versions={"torch": "1.12.1"} (see the example request body after the list below). The behavior of the task collection would then be:

  1. Perform the whole installation in the standard way (NOTE: this must not fail!)
  2. After the installation is complete, run pip install torch==1.12.1 (where pip is replaced by the actual venv pip that is being used).
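
For illustration only, the extended request body could then look like this (custom_package_versions is the proposed new attribute, not an existing field, and the other values are placeholders):

{
  "package": "fractal-tasks-core",
  "package_version": "0.9.0",
  "package_extras": "",
  "python_version": "3.9",
  "custom_package_versions": {"torch": "1.12.1"}
}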

CAVEAT: this is messing with the package, and thus creates a not-so-clean installation log (although we would still also include the additional torch-installation logs). Such an operation is meant to be restricted to very specific cases, where there is an important dependency on hardware or system libraries; it is not something a regular user should be using.

IMPORTANT NOTE 1
This workaround cannot take us outside the versions supported by fractal-tasks-core (for instance). Say that we now require torch>=1.13.0, and then we set custom_package_versions={"torch": "1.12.1"}: this task-collection operation will fail, because the installation of the custom package conflicts with fractal-tasks-core.

IMPORTANT NOTE 2
We should never use this feature to install an additional package. For instance, if fractal-tasks-core does not depend on polars and we specify custom_package_versions={"polars": "1.0"}, then task collection will fail.

MINOR NOTE:
This also fits perfectly with fractal-analytics-platform/fractal-server#686, where we would only need to add the same pip install line in the script.
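
In script form, that post-collection step would essentially be a single extra line, e.g. (the path follows the venv layout shown earlier in this thread, and the pinned version is whatever custom_package_versions specifies):

/path/to/task-envs/.fractal/fractal-tasks-core0.9.0/venv/bin/pip install "torch==1.12.1"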

jluethi (Collaborator, Author) commented Jun 7, 2023

Thanks for digging into this! Sounds good to me.

I already tested it with torch 2.0.0 on the FMI side and that also works, so I don't see a strong reason for limiting the torch version at all for the time being.

Having the custom_package_versions sounds convenient for the Pelkmans lab setup. If that's not a major effort, I'd be in favor of having this.

tcompa (Collaborator) commented Jun 7, 2023

Server work is deferred to fractal-analytics-platform/fractal-server#740.
This issue remains open for:

  • Make torch dependency more flexible (no constraints at all? up to 2.something?)

tcompa added the High Priority (Current Priorities & Blocking Issues) label on Jun 7, 2023
jluethi (Collaborator, Author) commented Jun 7, 2023

I have seen no reason for constraints so far, given that 2.0.0 still worked well. We just need torch for cellpose, right? Do we still add it as an explicit dependency for the extras (to make the custom_package_versions workaround work) or is that not necessary?

jluethi (Collaborator, Author) commented Jun 7, 2023

Basically, our torch constraint is:

  1. Whatever cellpose needs => they define that
  2. Whatever local hardware requires (=> custom_package_versions)

tcompa added a commit that referenced this issue on Jun 7, 2023: "Note that when torch 2.0 is used, this change also introduces additional dependencies (e.g. sympy and mpmath)"
tcompa linked a pull request on Jun 7, 2023 that will close this issue
tcompa (Collaborator) commented Jun 7, 2023

"We just need torch for cellpose, right?"

Anndata also uses it, but they are not very strict about the dependency version: torch is not listed as a direct dependency in https://github.com/scverse/anndata/blob/main/pyproject.toml, and pip install anndata in a fresh environment does not install it. I think they just try to import it, and fall back gracefully if the import fails.
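
As a sketch of that optional-import pattern (illustrative only, not anndata's actual code):

try:
    import torch  # optional dependency: only used if it is installed
    HAS_TORCH = True
except ImportError:
    torch = None
    HAS_TORCH = False

def maybe_as_tensor(array):
    # Convert to a torch tensor when torch is available, otherwise return the input unchanged.
    if HAS_TORCH:
        return torch.as_tensor(array)
    return array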

To do:

  • Find out whether this information is available in some docs, otherwise open an anndata issue about torch supported versions (or at least version that are known to fail)
  • Find out whether the issues that we sometimes find in the CI are only happening when torch is an implicit dependency or also when we explicitly include it in pyproject.toml.

Note: the list below is a bunch of not-very-systematic tests. This is all preliminary, but it would be nice to understand things clearly while we are at it.

Here are some raw CI tests

tcompa (Collaborator) commented Jun 8, 2023

Finally found the issue: it is a torch 2.0.1 issue, which is exposed by the anndata imports but unrelated to anndata.

Current fix: we have to include the torch dependency explicitly and pin it to <=2.0.0.
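
A minimal sketch of what that explicit pin could look like in pyproject.toml (assuming poetry; the actual project may additionally mark torch as optional or tie it to an extras group):

[tool.poetry.dependencies]
# Sketch only: explicit torch pin as discussed above.
torch = "<=2.0.0"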

tcompa (Collaborator) commented Jun 22, 2023

For the record, the installed package is now quite a bit larger, and I think this is due to torch 2.0 pulling in the nvidia libraries:

$ pwd
/home/tommaso/Fractal/fractal-demos/examples/server/FRACTAL_TASKS_DIR/.fractal

$ du -hs fractal-tasks-core0.10.0a6/
5.4G	fractal-tasks-core0.10.0a6/

$ du -hs fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/* | sort -h | tail -n 5
86M	fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/scipy
99M	fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/llvmlite
185M	fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/triton
1.3G	fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/torch
2.6G	fractal-tasks-core0.10.0a6/venv/lib/python3.9/site-packages/nvidia
