
Test summary with previous PyTorch/TensorFlow versions #18181

Open
ydshieh opened this issue Jul 18, 2022 · 11 comments
Labels
Tests Related to tests WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress

Comments

@ydshieh
Collaborator

ydshieh commented Jul 18, 2022

At the initiative of @LysandreJik, we ran the tests with previous PyTorch/TensorFlow versions. The goal is to determine whether we should drop support for (some) earlier PyTorch/TensorFlow versions.

  • This is not exactly the same environment as the scheduled daily CI (torch-scatter and accelerate are not installed, etc.)
  • Currently we only have the global summary (i.e. there is no per-model count of test failures)

Here are the results (from a run around June 20, 2022):

  • PyTorch testing has ~27100 tests
  • TensorFlow testing has ~15700 tests
| Framework | No. of failures |
| --- | --- |
| PyTorch 1.10 | 50 |
| PyTorch 1.9 | 710 |
| PyTorch 1.8 | 1301 |
| PyTorch 1.7 | 1567 |
| PyTorch 1.6 | 2342 |
| PyTorch 1.5 | 3315 |
| PyTorch 1.4 | 3949 |
| TensorFlow 2.8 | 118 |
| TensorFlow 2.7 | 122 |
| TensorFlow 2.6 | 122 |
| TensorFlow 2.5 | 128 |
| TensorFlow 2.4 | 167 |

It looks like the number of failures in TensorFlow testing doesn't increase much across versions.

So far my thoughts:

  • All TF versions >= 2.4 should still be kept in the list of supported versions

Questions

  • What's your opinion on which versions to drop support for?
  • Would you like to see the number of test failures per model?
  • TensorFlow 2.3 needs CUDA 10.1 and requires building a special Docker image. Do you think we should make the effort to get results for TF 2.3?
@ydshieh ydshieh added bug Tests Related to tests and removed bug labels Jul 18, 2022
@ydshieh
Collaborator Author

ydshieh commented Jul 18, 2022

cc @LysandreJik @sgugger @patrickvonplaten @Rocketknight1 @gante @anton-l @NielsRogge @amyeroberts @alaradirik @stas00 @hollance to have your comments

@Rocketknight1
Member

TF 2.3 is quite old by now, and I wouldn't make a special effort to support it. Several nice TF features (like the NumPy-like API) only arrived in TF 2.4, and we're likely to use those a lot in the future.

@LysandreJik
Member

Hey @ydshieh, would you have a summary of the failing tests handy? I'm curious to see the reason why there are so many failures for PyTorch as soon as we leave the latest version. I'm quite confident that it's an issue in our tests rather than in our internal code, so seeing the failures would help. Thanks!

@ydshieh
Collaborator Author

ydshieh commented Jul 19, 2022

@LysandreJik I will re-run it. The previous run(s) produced huge tables in the reports, and sending them to Slack failed (3001 character limit). I finally got it through by disabling those blocks.

Before re-running it, I need an approval for #17921

@ydshieh
Collaborator Author

ydshieh commented Aug 1, 2022

I ran the past CI again, which now returns more information. Looking quickly at the report for PyTorch 1.4, here are some observations:

There is one error occurring in almost all models:

  • from_pretrained: OSError: Unable to load weights from pytorch checkpoint file for ...
    • torch.load: Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old.
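
For context: PyTorch 1.6 switched `torch.save` to a zip-based container format, which `torch.load` in older releases cannot read (hence the "version 3, but the maximum supported version for reading is 2" message). A framework-free way to check which format a checkpoint file uses (a sketch; `is_new_style_checkpoint` is a hypothetical helper, not a transformers API):

```python
import zipfile

def is_new_style_checkpoint(path):
    """Return True if `path` looks like a PyTorch >= 1.6 checkpoint.

    Checkpoints written by torch.save on PyTorch >= 1.6 are zip archives;
    the legacy format (the only one PyTorch < 1.6 can read) is not.
    """
    return zipfile.is_zipfile(path)
```

On PyTorch >= 1.6, `torch.save(obj, f, _use_new_zipfile_serialization=False)` writes the legacy format that older versions can still load.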

Another error also occurs a lot (in the torchscript tests):

  • (line 625) AttributeError: module 'torch.jit' has no attribute '_state'

One error occurs specifically for vision models (probably due to the convolution layers):

  • (line 97) RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
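
A non-contiguous input typically comes from a view such as a transpose or a slice; newer PyTorch versions appear to handle this before calling cuDNN, while 1.4 did not. The memory-layout issue can be illustrated with NumPy, which exposes the same contiguity flag (illustration only; in PyTorch the usual workaround is calling `.contiguous()` on the tensor before the convolution):

```python
import numpy as np

x = np.arange(12).reshape(3, 4)
t = x.T  # transposing creates a non-contiguous view; no data is copied
assert not t.flags["C_CONTIGUOUS"]

c = np.ascontiguousarray(t)  # copy into contiguous memory, like tensor.contiguous()
assert c.flags["C_CONTIGUOUS"]
```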

BART has 108/106 failures:

  • (line 240) RuntimeError: CUDA error: device-side assert triggered
    • Don't know what's wrong here yet

Others

  • Other AttributeError occurrences (not exhaustive):
    • AttributeError: module 'torch' has no attribute 'minimum'
    • AttributeError: 'builtin_function_or_method' object has no attribute 'fftn'
    • AttributeError: module 'torch' has no attribute 'square'
    • AttributeError: module 'torch.nn' has no attribute 'Hardswish'
    • AttributeError: module 'torch' has no attribute 'logical_and'
    • AttributeError: module 'torch' has no attribute 'pi'
    • AttributeError: module 'torch' has no attribute 'multiply'
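
These errors come from ops that simply don't exist in older torch releases (`torch.multiply`, `torch.square`, `torch.pi`, etc. were only added in later versions). A common compatibility pattern, shown here as a generic sketch rather than what transformers actually does, is to resolve each op once with a fallback:

```python
import operator

def resolve_op(module, name, fallback):
    # Use the native op when this framework version provides it,
    # otherwise fall back to an equivalent implementation.
    return getattr(module, name, fallback)

class _OldTorch:
    """Stand-in for an old torch module that lacks `multiply` (illustration only)."""

multiply = resolve_op(_OldTorch, "multiply", operator.mul)
```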

@LysandreJik
Member

Thanks for the report! Taking a look at the PyTorch versions, here are the dates at which they were released:

Most of the errors in from_pretrained seem to come from the zipfile format introduced by PyTorch 1.6. I think this is the most annoying one to patch by far.

From a first look, I'd offer to drop support for all PyTorch versions below 1.6, as these were released more than two years ago.

Do you have a link to a job containing all these failures? I'd be interested in seeing if the 2342 errors in PyTorch 1.6 are solvable simply or if they will require a significant refactor.

@ydshieh
Collaborator Author

ydshieh commented Aug 9, 2022

The link is here. But since it contains too many jobs (all models x all versions ~= 3200 jobs), it just shows [Unicorn!] This page is taking too long to load.

I can re-run specifically for PyTorch 1.6 only, and will post a link later.

@stas00
Contributor

stas00 commented Aug 9, 2022

> From a first look, I'd offer to drop support for all PyTorch versions below 1.6, as these were released more than two years ago.

I second that.

While we are at it, do we want to establish an official sliding window for how far back we support PyTorch versions? As in, at minimum we support at least 2 years of PyTorch? If it's easy to support longer we would, but it'd be easy to cut off if need be.

Users always have older transformers releases that they can pin to if they really need support for a very old PyTorch.
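
The sliding-window idea above could be expressed as a simple date check (an illustrative sketch; the helper name and the two-year default are assumptions from this thread, not project policy):

```python
from datetime import date, timedelta

def is_supported(release_date, today=None, window_years=2):
    # A framework release stays supported while it is younger than the window.
    today = today or date.today()
    return today - release_date <= timedelta(days=365 * window_years)
```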

@LysandreJik
Member

Yes, that would work fine with me. If I understand correctly, that's how libraries in the PyData ecosystem (scikit-learn, numpy) manage the support of Python versions: they drop support for versions older than 2 years (scikit-learn/scikit-learn#20965, scikit-learn/scikit-learn#20084, the scipy toolchain, scipy/scipy#14655).

Dropping support for PyTorch/Flax/TensorFlow versions released more than two years ago sounds good to me. That is somewhat already the case (see the failing tests), but we're just not aware of it.

@ydshieh
Collaborator Author

ydshieh commented Aug 10, 2022

Hi, I am wondering what it means for a PyTorch/TensorFlow/Flax version to be supported. I guess it doesn't imply that all models work under those framework versions, but I would like to know if there is a more explicit definition (for transformers or, more generally, in open source projects).

@sgugger
Collaborator

sgugger commented Aug 10, 2022

Ideally it should mean that all models work and all tests pass, apart from functionality with explicit version gates (like CUDA bfloat16 or torch FX, where we test against a specific PyTorch version).

@huggingface huggingface deleted a comment from github-actions bot Sep 5, 2022
@ydshieh ydshieh added the WIP Label your PR/Issue with WIP for some long outstanding Issues/PRs that are work in progress label Sep 5, 2022