Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VideoMAE #17821

Merged
merged 42 commits into from Aug 4, 2022
Merged

Add VideoMAE #17821

merged 42 commits into from Aug 4, 2022

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Jun 22, 2022

What does this PR do?

This PR adds VideoMAE, which extends ViTMAE to videos.

The only difference between VideoMAE and ViT is that you need to replace nn.Conv2d by nn.Conv3d in the patch embedding class. 😂

To do:

  • Decide on a name for VideoMAEFeatureExtractor (should we keep it, or rename to VideoMAEProcessor, VideoMAEPreprocessor?)
  • Decide on the input format for video models; currently I've chosen pixel_values of shape (batch_size, num_frames, num_channels, height, width). The original implementation uses (B, C, T, H, W)
  • Doc examples + tests
  • Incorporate changes of Improve vision models #17731
  • Make VideoMAEFeatureExtractor robust with return_tensors="np" by default, better tests

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this model, left a few comments.

README_ko.md Outdated Show resolved Hide resolved
src/transformers/__init__.py Show resolved Hide resolved
Comment on lines 86 to 93
def resize_video(self, video, size, resample="bilinear"):
return [self.resize(frame, size, resample) for frame in video]

def crop_video(self, video, size):
return [self.center_crop(frame, size) for frame in video]

def normalize_video(self, video, mean, std):
return [self.normalize(frame, mean, std) for frame in video]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tensors, we should implement something using PyTorch here, as iterating through the frames will be super slow.

Copy link
Contributor Author

@NielsRogge NielsRogge Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original implementation also iterates over the frames (they use cv2 instead of Pillow for each frame): https://github.com/MCG-NJU/VideoMAE/blob/bd18ef559b31bc69c6c2dc91e3fdd09343016f00/functional.py#L26

Each frame can either be a NumPy array or a PIL image.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or do you mean when you provide a single tensor of shape (B, T, C, H, W)? Cause currently the feature extractor only accepts lists of PIL images or tensors

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normalization would be way faster if done on a big tensor, if we have tensors here. Likewise it would be faster done once an a big NumPy array if we have a NumPy arrays.

If we have a list of PIL images, it's converted anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes sorry you probably only meant normalization.

Should we add a normalize_video method to image_utils.py, that accepts either a list of NumPy arrays or PIL images?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works too.

src/transformers/models/videomae/modeling_videomae.py Outdated Show resolved Hide resolved
src/transformers/models/videomae/modeling_videomae.py Outdated Show resolved Hide resolved
src/transformers/models/videomae/modeling_videomae.py Outdated Show resolved Hide resolved
@@ -0,0 +1,18 @@
# import torch
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be added to the PR.

@@ -0,0 +1,55 @@
import numpy as np
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one shouldn't as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still should not be there.

@fcakyon
Copy link
Contributor

fcakyon commented Jul 24, 2022

@NielsRogge do you have any ETA on this feature? I am developing a video classification fine-tuning framework, would love to use this model if it gets merged into main!

Currently only video model is PerceiverIO, right?

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 🎥 Just one small nit comment.

Only question I have is about the videomae/test.py and videomae/test_model.py files and if they're in the right place, as they look more like scripts

>>> model = VideoMAEForPreTraining.from_pretrained("nanjing/vit-mae-base")

>>> pixel_values = feature_extractor(video, return_tensors="pt").pixel_values
>>> bool_masked_pos = ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definition missing here

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure to remove all test scripts before merging (they really shouldn't be added in the first place, please take more care of the files you add with git).

src/transformers/models/videomae/modeling_videomae.py Outdated Show resolved Hide resolved
src/transformers/models/videomae/test.py Outdated Show resolved Hide resolved
@@ -0,0 +1,55 @@
import numpy as np
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still should not be there.

Comment on lines +26 to +31
from .utils.constants import ( # noqa: F401
IMAGENET_DEFAULT_MEAN,
IMAGENET_DEFAULT_STD,
IMAGENET_STANDARD_MEAN,
IMAGENET_STANDARD_STD,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgugger is this ok?

@LysandreJik
Copy link
Member

There seems to remain an issue with the docs:

Traceback (most recent call last):
  File "/usr/local/bin/doc-builder", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/doc_builder/commands/doc_builder_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/doc_builder/commands/build.py", line 96, in build_command
    build_doc(
  File "/usr/local/lib/python3.8/site-packages/doc_builder/build_doc.py", line 405, in build_doc
    sphinx_refs = check_toc_integrity(doc_folder, output_dir)
  File "/usr/local/lib/python3.8/site-packages/doc_builder/build_doc.py", line 460, in check_toc_integrity
    raise RuntimeError(
RuntimeError: The following files are not present in the table of contents:
- model_doc/videomae
Add them to ../transformers/docs/source/en/_toctree.yml.

@NielsRogge
Copy link
Contributor Author

@LysandreJik yes I was aware of that, should be fixed now.

Don't merge already please, I'm transferring checkpoints and updating the conversion script.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Aug 2, 2022

The documentation is not available anymore as the PR was closed or merged.

@NielsRogge NielsRogge merged commit f9a0008 into huggingface:main Aug 4, 2022
oneraghavan pushed a commit to oneraghavan/transformers that referenced this pull request Sep 26, 2022
* First draft

* Add VideoMAEForVideoClassification

* Improve conversion script

* Add VideoMAEForPreTraining

* Add VideoMAEFeatureExtractor

* Improve VideoMAEFeatureExtractor

* Improve docs

* Add first draft of model tests

* Improve VideoMAEForPreTraining

* Fix base_model_prefix

* Make model take pixel_values of shape (B, T, C, H, W)

* Add loss computation of VideoMAEForPreTraining

* Improve tests

* Improve model testsé

* Make all tests pass

* Add VideoMAE to main README

* Add tests for VideoMAEFeatureExtractor

* Add integration test

* Improve conversion script

* Rename patch embedding class

* Remove VideoMAELayer from init

* Update design of patch embeddings

* Improve comments

* Improve conversion script

* Improve conversion script

* Add conversion of pretrained model

* Add loss verification of pretrained model

* Add loss verification of unnormalized targets

* Add integration test for pretraining model

* Apply suggestions from code review

* Fix bug to make feature extractor resize only shorter edge

* Address more comments

* Improve normalization of videos

* Add doc examples

* Move constants to dedicated script

* Remove scripts

* Transfer checkpoints, fix docs

* Update script

* Update image mean and std

* Fix doc tests

* Set return_tensors to NumPy by default

* Revert the previous change

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants