Add LayoutLMv2 + LayoutXLM #12604
Merged. NielsRogge merged 114 commits into huggingface:master from NielsRogge:modeling_layoutlmv2_v2 on Aug 30, 2021.
Commits
4744ab8 First commit (NielsRogge)
0a3a7e4 Make style (NielsRogge)
39da694 Fix dummy objects (NielsRogge)
5b18c28 Add Detectron2 config (NielsRogge)
5348460 Add LayoutLMv2 pooler (NielsRogge)
9be733e More improvements, add documentation (NielsRogge)
3814c54 More improvements (NielsRogge)
125ada5 Add model tests (NielsRogge)
76a5a0f Add clarification regarding image input (NielsRogge)
480ebe1 Improve integration test (NielsRogge)
5b2f585 Fix bug (NielsRogge)
060a684 Fix another bug (NielsRogge)
5e61df4 Fix another bug (NielsRogge)
e731f67 Fix another bug (NielsRogge)
d0ca865 More improvements (NielsRogge)
604dd9b Make more tests pass (NielsRogge)
b4c172e Make more tests pass (NielsRogge)
7fb70b5 Improve integration test (NielsRogge)
e6c1318 Remove gradient checkpointing and add head masking (NielsRogge)
aaef300 Add integration test (NielsRogge)
b470d03 Add LayoutLMv2ForSequenceClassification to the tests (NielsRogge)
dfe5ea7 Add LayoutLMv2ForQuestionAnswering (NielsRogge)
59c1cf6 More improvements (NielsRogge)
33ffd98 More improvements (NielsRogge)
aa15dbf Small improvements (NielsRogge)
28b576a Fix _LazyModule (NielsRogge)
6229e02 Fix fast tokenizer (NielsRogge)
d9ff738 Move sync_batch_norm to a separate method (NielsRogge)
c681657 Replace dummies by requires_backends (NielsRogge)
fa97538 Move calculation of visual bounding boxes to separate method + update… (NielsRogge)
ba0bc0e Add models to main init (NielsRogge)
cd67bfa First draft (NielsRogge)
287abfa More improvements (NielsRogge)
8c0948f More improvements (NielsRogge)
88be5de More improvements (NielsRogge)
373811f More improvements (NielsRogge)
48c53c0 More improvements (NielsRogge)
b92db14 Remove is_split_into_words (NielsRogge)
fcb505a More improvements (NielsRogge)
86bb3ab Simply tesseract - no use of pandas anymore (NielsRogge)
0ae53ff Add LayoutLMv2Processor (NielsRogge)
1adbaf8 Update is_pytesseract_available (NielsRogge)
d5cf7c2 Fix bugs (NielsRogge)
0382104 Improve feature extractor (NielsRogge)
d06248d Fix bug (NielsRogge)
075590b Add print statement (NielsRogge)
b6b277e Add truncation of bounding boxes (NielsRogge)
258060a Add tests for LayoutLMv2FeatureExtractor and LayoutLMv2Tokenizer (NielsRogge)
2a166ca Improve tokenizer tests (NielsRogge)
ab3b0ef Make more tokenizer tests pass (NielsRogge)
214b491 Make more tests pass, add integration tests (NielsRogge)
0ae6e3b Finish integration tests (NielsRogge)
ea84ad6 More improvements (NielsRogge)
bba6100 More improvements - update API of the tokenizer (NielsRogge)
ebc2541 More improvements (NielsRogge)
93d93b7 Remove support for VQA training (NielsRogge)
0b4c97b Remove some files (NielsRogge)
5a24365 Improve feature extractor (NielsRogge)
f04672c Improve documentation and one more tokenizer test (NielsRogge)
98ca2a2 Make quality and small docs improvements (NielsRogge)
7804d69 Add batched tests for LayoutLMv2Processor, remove fast tokenizer (NielsRogge)
0ea905b Add truncation of labels (NielsRogge)
8db4e13 Apply suggestions from code review (NielsRogge)
0e7d10e Improve processor tests (NielsRogge)
4bccc97 Fix failing tests and add suggestion from code review (NielsRogge)
fd12133 Fix tokenizer test (NielsRogge)
23d0570 Add detectron2 CI job (NielsRogge)
40c1b6d Simplify CI job (NielsRogge)
124dd86 Comment out non-detectron2 jobs and specify number of processes (NielsRogge)
c59bffe Add pip install torchvision (NielsRogge)
c299ff0 Add durations to see which tests are slow (NielsRogge)
f7ea2fe Fix tokenizer test and make model tests smaller (NielsRogge)
da85fbc Frist draft (NielsRogge)
0401e4d Use setattr (NielsRogge)
2e43af8 Possible fix (LysandreJik)
e6d6efc Proposal with configuration (LysandreJik)
546bfb9 First draft of fast tokenizer (NielsRogge)
507d724 More improvements (NielsRogge)
4101b29 Enable fast tokenizer tests (NielsRogge)
a582226 Make more tests pass (NielsRogge)
67cca2f Make more tests pass (NielsRogge)
2379176 More improvements (NielsRogge)
c8151e7 Addd padding to fast tokenizer (NielsRogge)
d6ea661 Mkae more tests pass (NielsRogge)
b0c7eca Make more tests pass (NielsRogge)
7613c27 Make all tests pass for fast tokenizer (NielsRogge)
38934e9 Make fast tokenizer support overflowing boxes and labels (NielsRogge)
066a9ec Add support for overflowing_labels to slow tokenizer (NielsRogge)
4446c8a Add support for fast tokenizer to the processor (NielsRogge)
42ebf01 Update processor tests for both slow and fast tokenizers (NielsRogge)
5082dbd Add head models to model mappings (NielsRogge)
b703ea2 Make style & quality (NielsRogge)
b22011d Remove Detectron2 config file (NielsRogge)
beb6f69 Add configurable option to label all subwords (NielsRogge)
be1eaa1 Fix test (LysandreJik)
659bd94 Skip visual segment embeddings in test (NielsRogge)
66b5cbe Use ResNet-18 backbone in tests instead of ResNet-101 (NielsRogge)
ba5a44f Proposal (LysandreJik)
a8e6997 Re-enable all jobs on CI (NielsRogge)
3ab8384 Fix installation of tesseract (NielsRogge)
c417b04 Fix failing test (NielsRogge)
84b33b8 Fix index table (NielsRogge)
6142f97 Add LayoutXLM doc page, first draft of code examples (NielsRogge)
2c5d412 Improve documentation a lot (NielsRogge)
e86b4cf Update expected boxes for Tesseract 4.0.0 beta (NielsRogge)
9eb372b Use offsets to create labels instead of checking if they start with ## (NielsRogge)
e5c5f61 Update expected boxes for Tesseract 4.1.1 (NielsRogge)
5114199 Fix conflict (NielsRogge)
d429492 Make variable names cleaner, add docstring, add link to notebooks (NielsRogge)
2e0a4f9 Revert "Fix conflict" (NielsRogge)
748308f Revert to make integration test pass (NielsRogge)
e273e77 Apply suggestions from @LysandreJik's review (NielsRogge)
a72f080 Address @patrickvonplaten's comments (NielsRogge)
2391ca5 Remove fixtures DocVQA in favor of dataset on the hub (NielsRogge)
Pinging @sgugger for this especially: the LayoutLMv2 tests require detectron2 to be installed, which takes quite a while, as well as pytesseract. Therefore, we've opted for a separate job so as not to weigh down the existing tests. Eventually, in the best of worlds, these tests would only trigger if changes have been detected in the relevant files. That is already somewhat the case, but the whole installation step still happens regardless. Fixing this would imply running the test fetcher one step above, so that it can decide which jobs to run. This would also help for troublesome models such as TAPAS, where installations are cumbersome.
Let's discuss this in the near future!
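To make the "decide which jobs to run" idea concrete, here is a minimal sketch of how a fetcher step could map changed files to CI jobs before any installation happens. It is not the repository's actual test fetcher: the job names, the file patterns, and the `select_jobs` helper are all hypothetical and only illustrate the approach.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from CI job name to the file patterns that should trigger it.
# Neither the job names nor the paths are taken from the real CI configuration.
JOB_TRIGGERS = {
    "run_tests_layoutlmv2": [
        "src/transformers/models/layoutlmv2/*",
        "tests/test_*layoutlmv2*.py",
    ],
    "run_tests_torch": ["src/transformers/*", "tests/*"],
}


def select_jobs(changed_files, job_triggers=JOB_TRIGGERS):
    """Return the sorted names of jobs with at least one pattern matching a changed file."""
    selected = set()
    for job, patterns in job_triggers.items():
        # fnmatchcase treats "*" as matching any characters, including "/".
        if any(fnmatchcase(path, pattern) for path in changed_files for pattern in patterns):
            selected.add(job)
    return sorted(selected)


if __name__ == "__main__":
    # A change outside the LayoutLMv2 files would not select the expensive job.
    print(select_jobs(["src/transformers/models/bert/modeling_bert.py"]))
```

With a step like this running first, the job that installs detectron2 and pytesseract could be skipped entirely whenever its name is absent from the returned list.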