Add LayoutLMv3 #17060
Conversation
The documentation is not available anymore as the PR was closed or merged.
Just to keep track of the current status: as discussed offline, I think the next step to "solve" the tokenization tests is to figure out how
As seen here, text is tokenized using RobertaTokenizer, where one provides
So this results in [0, 20760, 232, 2].
Thanks for the clarification! I've opened a PR on your branch (NielsRogge#38) which proposes several changes, including 1) changing the default behaviour so that a space prefix is added by default, along with all the changes needed to make that work, and 2) some small changes to resolve several of the failing tests. I wonder if we shouldn't just remove the option to set
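For illustration only, here is a toy sketch of why a leading space changes the resulting token ids. This is not the real byte-level BPE; the vocabulary and the `toy_tokenize` helper are made up:

```python
# Toy vocabulary: 'Ġ' marks a word-initial space in byte-level BPE vocabs.
vocab = {"hello": 7, "Ġhello": 42}

def toy_tokenize(text, add_prefix_space=False):
    # Hypothetical, minimal stand-in for the tokenizer's prefix-space handling.
    if add_prefix_space and not text.startswith(" "):
        text = " " + text
    key = ("Ġ" + text.strip()) if text.startswith(" ") else text.strip()
    return [vocab[key]]

print(toy_tokenize("hello"))                         # [7]
print(toy_tokenize("hello", add_prefix_space=True))  # [42]
```

The word-initial and word-internal variants of the same string map to different ids, which is why the default matters for test expectations.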
Nice work adding this new model!
There is a lot of commented-out code in the test_tokenization
file. Not sure if it's to fix or to clean up, but it should be removed before merging the PR. LGTM otherwise!
```python
if bidirectional:
    num_buckets //= 2
    ret += (relative_position > 0).long() * num_buckets
    n = torch.abs(relative_position)
```
Would love a better name for `n`.
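For context, the hunk under review follows the T5-style relative position bucketing. Below is a scalar pure-Python sketch of just the sign-handling step in the bidirectional branch; the helper name `bucket_sign` is made up:

```python
def bucket_sign(relative_position, num_buckets=32):
    # Hypothetical scalar version of the bidirectional branch of the tensor code.
    ret = 0
    num_buckets //= 2              # split buckets between negative and positive offsets
    if relative_position > 0:      # (relative_position > 0).long() * num_buckets
        ret += num_buckets         # positive offsets land in the upper half
    n = abs(relative_position)     # magnitude, bucketed further downstream
    return ret, n

print(bucket_sign(5))   # (16, 5)
print(bucket_sign(-5))  # (0, 5)
```

So `n` is the magnitude of the relative offset once its sign has been folded into the bucket index, which suggests a name like `distance`.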
src/transformers/models/layoutlmv3/tokenization_layoutlmv3_fast.py
Hi @NielsRogge, as in issue #13554 and PR #17092. How to reproduce:

```python
from transformers import AutoProcessor, AutoModelForTokenClassification
from PIL import Image

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-large")
processor.feature_extractor.apply_ocr = False
model = AutoModelForTokenClassification.from_pretrained("microsoft/layoutlmv3-large")

image = Image.new("RGB", (224, 224))  # placeholder; any document image reproduces the issue
words = ['hello' for i in range(1000)]
boxes = [[0, 1, 2, 3] for i in range(1000)]
encoding = processor(
    image,
    text=words,
    boxes=boxes,
    truncation=True,
    padding='max_length',
    return_overflowing_tokens=True,
    return_tensors="pt",
)
print(encoding['input_ids'].shape)     # torch.Size([2, 512])
print(encoding['pixel_values'].shape)  # torch.Size([1, 3, 224, 224])

overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping')
model(**encoding)
# ---> RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 4 but got size 1 for tensor number 1 in the list.
```
That's very impressive! Have the tokenization and test files been copied from other models? I only see the `# Copied from`
statements in the modeling file; it would greatly help reviewing if they were also in the other files that contain copied code.
```python
truncated_sequence = information_first_truncated["input_ids"][0]
overflowing_tokens = information_first_truncated["input_ids"][1]
bbox = information_first_truncated["bbox"][0]
overflowing_bbox = information_first_truncated["bbox"][0]
```
A small note, keeping in mind the ongoing discussion we had offline: I'm not sure I understand why the element in position 0 is taken and not the one in position 1. 🙂
Thank you so much for your fantastic work. I was wondering if you plan to include the object detection task in LayoutLMv3 as well. I noticed that the PubLayNet fine-tuned model weights have already been uploaded to HuggingFace, but I couldn't find any documentation on this capability in this repository.
Impressive effort, LGTM! And thanks for adding the `# Copied from` statements, it makes the review easier.
@NielsRogge Thanks for this contribution!
Output: this happens beyond the maximum sequence length as well, where the labels will have dimension seq_length + ~197.
Hi @dcyoung, thanks for taking a look. Actually, you make a great point; I implemented it as in the original implementation (where the authors label all visual tokens with -100 and just add a classifier on top of the entire sequence). Thanks a lot!
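As a minimal sketch of that labeling scheme (the token count of 197 and the label values are illustrative, not taken from the implementation): visual patch tokens get the ignore index -100 so the cross-entropy loss skips them:

```python
IGNORE_INDEX = -100          # PyTorch CrossEntropyLoss skips this label by default
NUM_VISUAL_TOKENS = 197      # illustrative: e.g. 196 image patches + 1 [CLS]-like token

text_labels = [0, 3, 3, 1]   # made-up word-level labels
labels = text_labels + [IGNORE_INDEX] * NUM_VISUAL_TOKENS
print(len(labels))                 # 201
print(labels.count(IGNORE_INDEX))  # 197
```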
And hi @sina-ehsani, unfortunately I'm not planning (for now) to add the object detection part, because the framework being used (Mask R-CNN) is a ridiculous amount of code and it's not straightforward, for now, to add it to the Transformers library (given the "one model, one file" philosophy). So I'd advise using the original repository for that. We may add this framework in the future, but I'm actually much more a fan of simpler frameworks like DETR and YOLOS. It would be great if someone fine-tuned a YOLOS model initialized with the weights of the Document Image Transformer (DiT); I feel like you would get the same performance.
* Make forward pass work
* More improvements
* Remove unused imports
* Remove timm dependency
* Improve loss calculation of token classifier
* Fix most tests
* Add docs
* Add model integration test
* Make all tests pass
* Add LayoutLMv3FeatureExtractor
* Improve integration test + make fixup
* Add example script
* Fix style
* Add LayoutLMv3Processor
* Fix style
* Add option to add visual labels
* Make more tokenizer tests pass
* Fix more tests
* Make more tests pass
* Fix bug and improve docs
* Fix import of processors
* Improve docstrings
* Fix toctree and improve docs
* Fix auto tokenizer
* Move tests to model folder
* Move tests to model folder
* change default behavior add_prefix_space
* add prefix space for fast
* add_prefix_spcae set to True for Fast
* no space before `unique_no_split` token
* add test to hightligh special treatment of added tokens
* fix `test_batch_encode_dynamic_overflowing` by building a long enough example
* fix `test_full_tokenizer` with add_prefix_token
* Fix tokenizer integration test
* Make the code more readable
* Add tests for LayoutLMv3Processor
* Fix style
* Add model to README and update init
* Apply suggestions from code review
* Replace asserts by value errors
* Add suggestion by @ducviet00
* Add model to doc tests
* Simplify script
* Improve README
* a step ahead to fix
* Update pair_input_test
* Make all tokenizer tests pass - phew
* Make style
* Add LayoutLMv3 to CI job
* Fix auto mapping
* Fix CI job name
* Make all processor tests pass
* Make tests of LayoutLMv2 and LayoutXLM consistent
* Add copied from statements to fast tokenizer
* Add copied from statements to slow tokenizer
* Remove add_visual_labels attribute
* Fix tests
* Add link to notebooks
* Improve docs of LayoutLMv3Processor
* Fix reference to section

Co-authored-by: SaulLu <lucilesaul.com@gmail.com>
Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
Thank you so much for adding the model! I had a question on segment position embeddings: how do you create segment position embeddings during inference, when the labels are unknown and you just have bounding boxes from an OCR? In this notebook the test set also contains segment-level bounding boxes. I have trained a model on segment-level embeddings for my use case, and it doesn't perform well on token-level 2D embeddings during inference.
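For what it's worth, one hedged sketch of producing segment-level boxes at inference time (the data layout here is made up): every word inherits the bounding box of the OCR line/segment it belongs to, instead of its own word-level box:

```python
# Made-up OCR output, grouped into segments (lines) with one box per segment.
segments = [
    {"words": ["Invoice", "number:"], "box": [10, 10, 120, 30]},
    {"words": ["12345"], "box": [130, 10, 180, 30]},
]

words, boxes = [], []
for seg in segments:
    for word in seg["words"]:
        words.append(word)
        boxes.append(seg["box"])  # the segment box is shared by all its words

print(boxes[0] == boxes[1])  # True: both words of the first segment share one box
```

Whether this matches training depends on the OCR engine's ability to group words into the same kind of segments used for fine-tuning.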
Thanks for the idea, I will have a go at this. My understanding is that the unilm repo uses Detectron2 (Mask R-CNN) as the backbone for object detection in LayoutLMv3, for benchmarking compatibility. Would it be possible to swap out the image backbone for a vision transformer in LayoutLMv3 training? I saw in the paper:
My understanding is that LayoutLMv3 is able to generalise better thanks to the unsupervised pre-training over the MIM+MLM+WPA objectives. It also learns correlations between the text and visual inputs that benefit it on downstream tasks. YOLOS wouldn't include this key text information in document layout analysis. Please correct me if I am wrong... I am learning here.
This thread has led me to hacking together a model that combines the YolosLoss and YolosObjectDetection head with the LayoutLMv3Model to build a LayoutLMv3ObjectDetection prediction head. Changes to the LayoutLMv3Config and LayoutLMv3FeatureExtractor had to be made to allow for this. This approach avoids the Mask R-CNN discussed above. Is this something you would be interested in reviewing and integrating if I open a PR? Or does it deviate too significantly from the research paper?
What does this PR do?
This PR implements LayoutLMv3. LayoutLMv3 doesn't require a Detectron2 backbone anymore (yay!).
The PR also includes an example script that can be used to reproduce results of the paper.
Fixes #16914
To do:
- `is_detection` logic
- PyTesseract pass
- `add_layoutlmv3_simplify` branch