Add VideoMAE #17821

NielsRogge · 2022-06-22T12:17:02Z

What does this PR do?

This PR adds VideoMAE, which extends ViTMAE to videos.

The only difference between VideoMAE and ViT is that you need to replace nn.Conv2d by nn.Conv3d in the patch embedding class. 😂
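To make the Conv2d-to-Conv3d swap concrete, here is a minimal NumPy sketch; it is not the code from this PR. It relies on the fact that a convolution whose kernel size equals its stride (with no padding) is a linear projection of non-overlapping blocks, so a 3D patch embedding simply projects small "tubelets" that span a few frames. The function name `tubelet_embed`, the toy sizes, and the random weights are illustrative assumptions.

```python
import numpy as np


def tubelet_embed(video, weight, bias, tubelet_size, patch_size):
    """Project non-overlapping (tubelet_size, patch_size, patch_size) blocks.

    video:  (channels, frames, height, width)
    weight: (hidden, channels, tubelet_size, patch_size, patch_size)
    bias:   (hidden,)
    Returns an array of shape (num_tokens, hidden).
    """
    channels, frames, height, width = video.shape
    nt = frames // tubelet_size
    nh = height // patch_size
    nw = width // patch_size
    # Split every spatio-temporal axis into (block index, offset inside block).
    blocks = video.reshape(channels, nt, tubelet_size, nh, patch_size, nw, patch_size)
    # Bring the block indices to the front: (nt, nh, nw, C, ts, p, p).
    blocks = blocks.transpose(1, 3, 5, 0, 2, 4, 6)
    # One flattened row per tubelet.
    tokens = blocks.reshape(nt * nh * nw, -1)
    # Flatten each filter the same way and apply it to every tubelet.
    kernel = weight.reshape(weight.shape[0], -1)
    return tokens @ kernel.T + bias


rng = np.random.default_rng(0)
video = rng.standard_normal((3, 8, 32, 32))      # toy clip: 8 RGB frames of 32x32
weight = rng.standard_normal((4, 3, 2, 16, 16))  # 4 filters over 2-frame, 16x16 tubelets
bias = np.zeros(4)

embeddings = tubelet_embed(video, weight, bias, tubelet_size=2, patch_size=16)
print(embeddings.shape)  # (8 // 2) * (32 // 16) ** 2 = 16 tokens, 4 features each
```

With a tubelet size of 1 this reduces to the per-frame 2D patch projection, which is why the change to the patch embedding class is so small.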

To do:

Decide on a name for VideoMAEFeatureExtractor (should we keep it, or rename to VideoMAEProcessor, VideoMAEPreprocessor?)
Decide on the input format for video models; currently I've chosen pixel_values of shape (batch_size, num_frames, num_channels, height, width). The original implementation uses (B, C, T, H, W)
Doc examples + tests
Incorporate changes of Improve vision models #17731
Make VideoMAEFeatureExtractor robust with return_tensors="np" by default, better tests
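
The two input layouts mentioned in the to-do list differ only in the order of the frame and channel axes, so converting between them is a single axis permutation. A small NumPy sketch with made-up sizes (the variable names are illustrative, not taken from the PR):

```python
import numpy as np

# Dummy batch in the layout chosen in this PR:
# (batch_size, num_frames, num_channels, height, width)
btchw = np.arange(2 * 4 * 3 * 8 * 8, dtype=np.float32).reshape(2, 4, 3, 8, 8)

# Swap the frame and channel axes to get the (B, C, T, H, W) layout
# used by the original implementation.
bcthw = np.transpose(btchw, (0, 2, 1, 3, 4))
print(bcthw.shape)  # (2, 3, 4, 8, 8)

# The same permutation converts back, since it only swaps two axes.
roundtrip = np.transpose(bcthw, (0, 2, 1, 3, 4))
```

Because the permutation is its own inverse, either layout can be accepted at the API boundary and converted once before the patch embedding.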

sgugger

Thanks for working on this model, left a few comments.

README_ko.md

src/transformers/__init__.py

src/transformers/models/videomae/convert_videomae_to_pytorch.py

sgugger · 2022-07-07T18:09:18Z

src/transformers/models/videomae/feature_extraction_videomae.py

+    def resize_video(self, video, size, resample="bilinear"):
+        return [self.resize(frame, size, resample) for frame in video]
+
+    def crop_video(self, video, size):
+        return [self.center_crop(frame, size) for frame in video]
+
+    def normalize_video(self, video, mean, std):
+        return [self.normalize(frame, mean, std) for frame in video]


For tensors, we should implement something using PyTorch here, as iterating through the frames will be super slow.

The original implementation also iterates over the frames (they use cv2 instead of Pillow for each frame): https://github.com/MCG-NJU/VideoMAE/blob/bd18ef559b31bc69c6c2dc91e3fdd09343016f00/functional.py#L26

Each frame can either be a NumPy array or a PIL image.

Or do you mean when you provide a single tensor of shape (B, T, C, H, W)? Because currently the feature extractor only accepts lists of PIL images or tensors.

Normalization would be way faster if done on one big tensor, if we have tensors here. Likewise, it would be faster done once on one big NumPy array if we have NumPy arrays.

If we have a list of PIL images, it's converted anyway.

Oh yes, sorry, you probably only meant normalization.

Should we add a normalize_video method to image_utils.py, that accepts either a list of NumPy arrays or PIL images?

That works too.
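
The speed-up discussed in this thread comes from broadcasting: once the frames are stacked into one array, the per-channel mean and std can be applied to every frame in a single operation instead of a Python loop. A minimal NumPy sketch, assuming channels-first frames of shape (C, H, W). The helper name `normalize_video` mirrors the suggestion above, but the body is illustrative and is not the implementation that was merged; the mean and std values are the commonly used ImageNet statistics.

```python
import numpy as np

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_video(video, mean, std):
    """Normalize a list of (C, H, W) frames in one vectorized step."""
    # Stack the frames once: (T, C, H, W).
    frames = np.stack([np.asarray(frame, dtype=np.float32) for frame in video])
    # Reshape the statistics so they broadcast over frames, height and width.
    mean = np.asarray(mean, dtype=np.float32).reshape(1, -1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(1, -1, 1, 1)
    return (frames - mean) / std


rng = np.random.default_rng(0)
video = [rng.random((3, 4, 4), dtype=np.float32) for _ in range(8)]  # 8 toy frames
normalized = normalize_video(video, IMAGENET_MEAN, IMAGENET_STD)
print(normalized.shape)  # (8, 3, 4, 4)
```

A list of PIL images would first need to be converted to arrays (and moved to channels-first order) before this step.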

src/transformers/models/videomae/modeling_videomae.py

sgugger · 2022-07-07T18:12:28Z

src/transformers/models/videomae/test.py

@@ -0,0 +1,18 @@
+# import torch


This file should not be added to the PR.

sgugger · 2022-07-07T18:12:47Z

src/transformers/models/videomae/test_model.py

@@ -0,0 +1,55 @@
+import numpy as np


This one shouldn't be added either.

Still should not be there.

fcakyon · 2022-07-24T10:37:21Z

@NielsRogge do you have any ETA on this feature? I am developing a video classification fine-tuning framework, would love to use this model if it gets merged into main!

Currently the only video model is PerceiverIO, right?

amyeroberts

LGTM! 🎥 Just one small nit comment.

The only question I have is whether the videomae/test.py and videomae/test_model.py files are in the right place, as they look more like scripts.

amyeroberts · 2022-07-26T14:30:11Z

src/transformers/models/videomae/modeling_videomae.py

+        >>> model = VideoMAEForPreTraining.from_pretrained("nanjing/vit-mae-base")
+
+        >>> pixel_values = feature_extractor(video, return_tensors="pt").pixel_values
+        >>> bool_masked_pos = ...


Definition missing here
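
One way the missing `bool_masked_pos` could be defined is as a boolean mask with one entry per token, where the token count follows from the number of frames, the image size, the patch size, and the tubelet size. The sketch below uses NumPy and assumed values (16 frames of 224x224 pixels, 16x16 patches, tubelets of 2 frames, a 90% mask ratio, batch size 1). In real code these would be read from the model configuration, and the mask would be converted to a tensor before being passed to the model.

```python
import numpy as np

# Assumed values for illustration; read them from the model config in practice.
num_frames, image_size, patch_size, tubelet_size = 16, 224, 16, 2
mask_ratio = 0.9
batch_size = 1

patches_per_frame = (image_size // patch_size) ** 2            # 14 * 14 = 196
seq_length = (num_frames // tubelet_size) * patches_per_frame  # 8 * 196 = 1568
num_masked = int(mask_ratio * seq_length)

rng = np.random.default_rng(0)
bool_masked_pos = np.zeros((batch_size, seq_length), dtype=bool)
for row in bool_masked_pos:
    # Mask the same number of randomly chosen tokens in every sample.
    row[rng.permutation(seq_length)[:num_masked]] = True

print(bool_masked_pos.shape)  # (1, 1568)
```

Masking a fixed count per sample (rather than flipping an independent coin per token) keeps the number of visible tokens identical across the batch, which makes the visible tokens easy to stack into one array.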

sgugger

Make sure to remove all test scripts before merging (they really shouldn't have been added in the first place; please take more care with the files you add with git).

src/transformers/models/videomae/modeling_videomae.py

src/transformers/models/videomae/test.py

sgugger · 2022-08-01T16:27:35Z

src/transformers/models/videomae/test_model.py

@@ -0,0 +1,55 @@
+import numpy as np


Still should not be there.

NielsRogge · 2022-08-02T09:36:30Z

src/transformers/image_utils.py

+from .utils.constants import (  # noqa: F401
+    IMAGENET_DEFAULT_MEAN,
+    IMAGENET_DEFAULT_STD,
+    IMAGENET_STANDARD_MEAN,
+    IMAGENET_STANDARD_STD,
+)


@sgugger is this ok?

LysandreJik · 2022-08-02T12:04:51Z

There seems to remain an issue with the docs:

Traceback (most recent call last):
  File "/usr/local/bin/doc-builder", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/doc_builder/commands/doc_builder_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.8/site-packages/doc_builder/commands/build.py", line 96, in build_command
    build_doc(
  File "/usr/local/lib/python3.8/site-packages/doc_builder/build_doc.py", line 405, in build_doc
    sphinx_refs = check_toc_integrity(doc_folder, output_dir)
  File "/usr/local/lib/python3.8/site-packages/doc_builder/build_doc.py", line 460, in check_toc_integrity
    raise RuntimeError(
RuntimeError: The following files are not present in the table of contents:
- model_doc/videomae
Add them to ../transformers/docs/source/en/_toctree.yml.

NielsRogge · 2022-08-02T13:11:29Z

@LysandreJik yes, I was aware of that; it should be fixed now.

Please don't merge yet, I'm transferring checkpoints and updating the conversion script.

HuggingFaceDocBuilderDev · 2022-08-02T13:31:44Z

The documentation is not available anymore as the PR was closed or merged.

* First draft
* Add VideoMAEForVideoClassification
* Improve conversion script
* Add VideoMAEForPreTraining
* Add VideoMAEFeatureExtractor
* Improve VideoMAEFeatureExtractor
* Improve docs
* Add first draft of model tests
* Improve VideoMAEForPreTraining
* Fix base_model_prefix
* Make model take pixel_values of shape (B, T, C, H, W)
* Add loss computation of VideoMAEForPreTraining
* Improve tests
* Improve model tests
* Make all tests pass
* Add VideoMAE to main README
* Add tests for VideoMAEFeatureExtractor
* Add integration test
* Improve conversion script
* Rename patch embedding class
* Remove VideoMAELayer from init
* Update design of patch embeddings
* Improve comments
* Improve conversion script
* Improve conversion script
* Add conversion of pretrained model
* Add loss verification of pretrained model
* Add loss verification of unnormalized targets
* Add integration test for pretraining model
* Apply suggestions from code review
* Fix bug to make feature extractor resize only shorter edge
* Address more comments
* Improve normalization of videos
* Add doc examples
* Move constants to dedicated script
* Remove scripts
* Transfer checkpoints, fix docs
* Update script
* Update image mean and std
* Fix doc tests
* Set return_tensors to NumPy by default
* Revert the previous change

Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>

NielsRogge mentioned this pull request Jun 22, 2022

Adding VideoMAE to HuggingFace Transformers MCG-NJU/VideoMAE#23

Closed

NielsRogge force-pushed the add_videomae branch from 36e2185 to b96b9a9 Compare June 22, 2022 12:29

NielsRogge force-pushed the add_videomae branch from 86078d9 to cb7d6f8 Compare July 7, 2022 08:35

NielsRogge requested review from LysandreJik and sgugger July 7, 2022 14:54

sgugger mentioned this pull request Jul 7, 2022

Update localized READMES when template is filled. #18062

Merged

sgugger reviewed Jul 7, 2022

NielsRogge force-pushed the add_videomae branch from b1b4557 to 66d949e Compare July 8, 2022 15:44

amyeroberts approved these changes Jul 26, 2022

NielsRogge force-pushed the add_videomae branch from bc4d8c3 to 078b6a8 Compare August 1, 2022 15:06

sgugger approved these changes Aug 1, 2022

NielsRogge commented Aug 2, 2022

NielsRogge and others added 15 commits August 2, 2022 15:15

First draft

418e6bb

Add VideoMAEForVideoClassification

e32d1a5

Improve conversion script

ea8a6e6

Add VideoMAEForPreTraining

e52c3db

Add VideoMAEFeatureExtractor

ff28d74

Improve VideoMAEFeatureExtractor

ef16451

Improve docs

27bfe2b

Add first draft of model tests

22e18ab

Improve VideoMAEForPreTraining

d8a1aa5

Fix base_model_prefix

2caaee9

Make model take pixel_values of shape (B, T, C, H, W)

e321b89

Add loss computation of VideoMAEForPreTraining

7c84302

Improve tests

ecdfe40

Improve model tests

971bf85

Make all tests pass

29160e9

Niels Rogge added 20 commits August 2, 2022 15:15

Add integration test

63b6e7c

Improve conversion script

2f4de0e

Rename patch embedding class

72c46d4

Remove VideoMAELayer from init

7b4d5d1

Update design of patch embeddings

f34e2bb

Improve comments

b052c1c

Improve conversion script

0e69ac3

Improve conversion script

1a14bea

Add conversion of pretrained model

b42d2ff

Add loss verification of pretrained model

099b3f3

Add loss verification of unnormalized targets

ab40b5f

Add integration test for pretraining model

3a523d8

Apply suggestions from code review

7e22b98

Fix bug to make feature extractor resize only shorter edge

0f0beb8

Address more comments

d899dbe

Improve normalization of videos

e1e658d

Add doc examples

98ee976

Move constants to dedicated script

ae2a3ee

Remove scripts

97229e2

Transfer checkpoints, fix docs

8ef0d69

NielsRogge force-pushed the add_videomae branch from f5a3d58 to 8ef0d69 Compare August 2, 2022 13:17

Niels Rogge added 5 commits August 2, 2022 19:13

Update script

43b90a7

Update image mean and std

cd3aa21

Fix doc tests

c510c3d

Set return_tensors to NumPy by default

256f2c8

Revert the previous change

aa48fee

NielsRogge merged commit f9a0008 into huggingface:main Aug 4, 2022

innat mentioned this pull request Aug 30, 2023

Reproducibility of VideoMAE on Kinetics-400 #25868

Closed
