Add LayoutLMv2 + LayoutXLM #12604
Conversation
Force-pushed from 4a87464 to 85e16bd
Thanks a lot for adding this model!
For LayoutXLM, I don't think we need a new page if we can use the same architecture and tokenizer without changes. Just mention on the doc page that the architecture covers both.
Don't forget to add the model to the main README!
Force-pushed from a8a8a3b to d655367
Great work implementing this @NielsRogge, and thank you for adding the integration tests.
The docs are very understandable, great work. If you have some notebooks available, it would be great to put them in the documentation as well.
Just want to point out that LayoutXLM's tokenizer subclasses a different base tokenizer than LayoutLMv2's. As far as I know, this is the only difference between the LayoutLMv2 and LayoutXLM implementations.
@jasonkit thanks for pointing that out, I will create a separate tokenizer for LayoutXLM.
Note that if the tokenizer is the same as an existing one, it can simply be reused.
Hmm ok, I see that this wasn't done for ...
Sure: there is ...
Can't wait to test this ;) Thanks for the community effort!
Force-pushed from 984a3e2 to f785031
@sgugger after internal discussion, I have created a new `LayoutLMv2Processor`. However, there's a difference between the processors defined for Wav2Vec2/CLIP and the one for LayoutLMv2. The former processors act as either a feature extractor or a tokenizer at one particular moment (they are just a wrapper around both). The processor for LayoutLMv2, on the other hand, applies both in sequence: it first uses the feature extractor to apply OCR on the document images to get words + bounding boxes, which are then provided to the tokenizer, which converts them to token-level inputs.

Also, an additional feature (which I think people will like) is that one can optionally provide word-level labels to the processor, and these will then automatically be converted to token-level labels.

Happy to get your review :) As you will see, ...
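To make the two-step pipeline concrete, here is a rough usage sketch of the processor as described above. This is illustrative only: the checkpoint name, the `apply_ocr` flag, and the exact output keys are assumptions about the final API, not something confirmed in this comment.

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2Tokenizer,
)

image = Image.open("document.png").convert("RGB")  # placeholder path

# Case 1: let the feature extractor run OCR (requires Tesseract), then the tokenizer
# turns the detected words + boxes into token-level inputs.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
encoding = processor(image, return_tensors="pt")

# Case 2: supply your own words, boxes and word-level labels; OCR is skipped and the
# word-level labels are expanded to token-level labels automatically.
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

words = ["hello", "world"]
boxes = [[1, 2, 3, 4], [5, 6, 7, 8]]  # boxes normalized to a 0-1000 scale
word_labels = [0, 1]
encoding = processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

print(encoding.keys())  # expected: input_ids, token_type_ids, attention_mask, bbox, labels, image
```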
The design with the feature extractor looks great to me!
From a quick look, the processor/feature extractor/tokenizer approach looks good to me. Let me know when you're happy with the final state and I'll play with it to test the API more deeply.
Impressive test suite!
Force-pushed from 381b235 to feab66b
@NielsRogge from what I can tell, the fast tokenizer is no longer supported in this PR. When using the existing implementation of LayoutLMv2Tokenizer in the context of token classification/sequence labeling, I've been following the original repo's arguments:

```python
padding="max_length",
pad_to_multiple_of=8,
max_length=512,
truncation=True,
return_overflowing_tokens=True,
is_split_into_words=True,
```

as a means of creating multiple sequences from longer input samples. I believe ...
Hi @dcyoung, I'm currently working on implementing a fast tokenizer, but the slow tokenizer does support the arguments you listed. The API of the tokenizer is a bit more extensive for LayoutLMv2: you can pass a list of words and corresponding (normalized) boxes, and the tokenizer will automatically turn everything into token-level inputs.

Can you try it out? It will also return overflowing token boxes if you want it to.
Yup. That works fine for me. Though, I'm wondering about trying to create batches of sequences from a single "long" input sample which overflows the 512-token limit. This is for SER tasks where I'd like to consider every token on a document, requiring splitting the original sequence into multiple 512-token sequences. Previously, the ...

```python
from transformers import LayoutLMv2Tokenizer

tokenizer = LayoutLMv2Tokenizer.from_pretrained(
    "microsoft/layoutlmv2-base-uncased",
)
n = 2000
words = n * ["hello"]
boxes = n * [[1, 2, 3, 4]]
encoded_inputs = tokenizer(
    words,
    boxes=boxes,
    padding="max_length",
    pad_to_multiple_of=8,
    max_length=512,
    truncation=True,
    return_overflowing_tokens=True,
    is_split_into_words=True,
    return_tensors="pt",
)
print(encoded_inputs.keys())
for k, v in encoded_inputs.items():
    print(k, v.size())
```

which prints:

```
dict_keys(['overflowing_tokens', 'overflowing_token_boxes', 'num_truncated_tokens', 'input_ids', 'bbox', 'token_type_ids', 'attention_mask'])
overflowing_tokens torch.Size([1, 1490])
overflowing_token_boxes torch.Size([1, 1490, 4])
num_truncated_tokens torch.Size([1])
input_ids torch.Size([1, 512])
bbox torch.Size([1, 512, 4])
token_type_ids torch.Size([1, 512])
attention_mask torch.Size([1, 512])
```

I see now from the outputs above that the tokenizer does return overflow tokens. However, I don't see them organized into additional 512-token sequences. Would this require splitting the overflowing tokens manually?
@NielsRogge I took a pass at batching the overflow tokens. In the Processor, i added some logic to modify the class LayoutLMv2Processor:
...
def prepare_overflow(self, encoded_inputs: BatchEncoding) -> List[BatchEncoding]:
num_truncated_tokens = max(
0, int(encoded_inputs.get("num_truncated_tokens", [0])[0])
)
max_source_tokens_per_sample = 510
num_extra_samples = ceil(num_truncated_tokens / max_source_tokens_per_sample)
extra_encoded_inputs = []
for i in range(num_extra_samples):
start_idx = i * max_source_tokens_per_sample
tokens = encoded_inputs["overflowing_tokens"][0][
start_idx : start_idx + max_source_tokens_per_sample
].tolist()
boxes = encoded_inputs["overflowing_token_boxes"][0][
start_idx : start_idx + max_source_tokens_per_sample
].tolist()
labels = encoded_inputs["overflowing_labels"][0][
start_idx : start_idx + max_source_tokens_per_sample
].tolist()
seq_len = len(tokens)
padded = self.tokenizer._pad(
encoded_inputs={
"input_ids": [101] + tokens + [102],
"bbox": [[0, 0, 0, 0]] + boxes + [[1000, 1000, 1000, 1000]],
"token_type_ids": (2 + seq_len) * [0],
"labels": [-100] + labels + [-100],
"attention_mask": (2 + seq_len) * [1],
},
max_length=512,
padding_strategy=PaddingStrategy.MAX_LENGTH,
pad_to_multiple_of=8,
return_attention_mask=True,
)
extra_encoded_inputs.append(
{
"image": torch.clone(encoded_inputs["image"]),
**{k: torch.tensor(v).unsqueeze(0) for k, v in padded.items()},
}
)
return extra_encoded_inputs However, this required adding an additional Using this processor, i am able to generate batches of sequences from a long input sequence. While I haven't had a chance to thoroughly test, I am able to run this batch through the model just fine to produce corresponding logits. Ex: encoded_inputs= processor(
img,
words,
boxes=bboxes,
word_labels=word_label_ids,
return_tensors="pt",
padding="max_length",
pad_to_multiple_of=8,
max_length=512,
truncation=True,
return_overflowing_tokens=True,
is_split_into_words=True,
batch_overflow=True,
)
extra_encoded_inputs = processor.prepare_overflow(encoded_inputs)
for model_inputs in [encoded_inputs] + extra_encoded_inputs:
outputs = model(**model_inputs)
print("Predicted Logits: ", outputs.logits.size()) Does this seem like a reasonable approach, and if so... would it be possible to add the |
Yes, but perhaps in a future PR, because it's not clear to me how they use the model at inference time. If you have other questions, could you please post them elsewhere instead of on this thread, just to keep this PR a bit clean? :) Perhaps we can set up a Slack channel to discuss this model. If you can give me your email address, I'll set it up. Thanks!
You're right about redirecting me to a dedicated channel. Here is my email: lacatusu.valeriu@gmail.com. Thank you!
Force-pushed from 3fa1768 to d0cd858
This reverts commit a9b46ce.
Force-pushed from 4fcfc77 to 2391ca5
What does this PR do?
This PR adds Microsoft's LayoutLMv2 and LayoutXLM models, in PyTorch. The latter is a multilingual version of LayoutLMv2. For now, I have not yet added any documentation related to LayoutXLM; I'm not sure whether we need a new model directory + documentation page for that one, since one can load a LayoutXLM model like so:

```python
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base")
```

LayoutLMv2 is an improvement of LayoutLM (it improves SOTA across several benchmarks, including new ones) by incorporating visual, text and layout information to understand scanned documents. Detectron2 is used for its visual backbone (which is a ResNeXt-FPN).
The original repo only has `LayoutLMv2Model` and `LayoutLMv2ForTokenClassification`. However, in the paper they also use the model to classify document images (on RVL-CDIP) and perform visual question answering (on DocVQA). Therefore, I've added `LayoutLMv2ForSequenceClassification` and `LayoutLMv2ForQuestionAnswering`.
I've modelled them as they were described in the paper, but there's no official implementation to be found (a rough usage sketch is included below).

Fixes #11932 #12194
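For reference, here is a minimal sketch of how the sequence classification head could be exercised end to end with the processor discussed in this thread. The checkpoint name, placeholder image path and `num_labels` value are assumptions for illustration, not part of this PR:

```python
import torch
from PIL import Image
from transformers import LayoutLMv2ForSequenceClassification, LayoutLMv2Processor

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=16  # e.g. the 16 RVL-CDIP classes
)

image = Image.open("document.png").convert("RGB")  # placeholder path
encoding = processor(image, return_tensors="pt")   # OCR + tokenization in one call

with torch.no_grad():
    outputs = model(**encoding)

predicted_class = outputs.logits.argmax(-1).item()
print(predicted_class)
```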
Who can review?
@LysandreJik @sgugger
To do:

- Fix the one remaining failing test (`test_initialization`) => Lysandre, it would be great if you could help me fix that one; it has to do with one of the layers of the backbone. An integration test is also added.
- Possibly update the docstrings of the `ModelOutputs`, as the length of the hidden states and attentions is actually `seq_length + config.image_feature_pool_shape[0] * config.image_feature_pool_shape[1]` instead of just `seq_length` -> update: will add a comment to the "Tips" section in the documentation instead (see the snippet after this list).
- Add `LayoutLMv2FeatureExtractor`, `LayoutLMv2Tokenizer` and `LayoutLMv2Processor`.
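As an illustrative aside (not part of the PR), here is the arithmetic behind that hidden-state length, assuming the default `image_feature_pool_shape` of `[7, 7, 256]`:

```python
from transformers import LayoutLMv2Config

config = LayoutLMv2Config()
# visual tokens appended to the text sequence: 7 * 7 = 49 with the default pool shape
num_visual_tokens = config.image_feature_pool_shape[0] * config.image_feature_pool_shape[1]

seq_length = 512
print(num_visual_tokens)               # 49
print(seq_length + num_visual_tokens)  # 561: length of the hidden states / attentions
```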
Notes:

- Some variable names could perhaps be improved (such as `rel_pos_bias` in the configuration). However, if we update the names, then people will no longer be able to easily convert models from the original repo to HuggingFace and vice versa. The authors did use HuggingFace for their entire codebase (they used Transformers, the Trainer, Datasets, ...). The model is already uploaded by the authors on the hub.