Add TAPEX #16473

NielsRogge · 2022-03-29T10:06:02Z

What does this PR do?

Remember TAPAS, the table QA model by Google AI? Microsoft has now released TAPEX, a seq2seq model that outperforms TAPAS and is actually much simpler: table QA is just treated as a seq2seq problem.
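To make "table QA as a seq2seq problem" concrete, here is a minimal sketch of how a table and a question can be flattened into one input string for an encoder-decoder model. The `col :` / `row N :` layout is modeled on the flattening TAPEX uses, but this is a simplified illustration; the helper names `linearize_table` and `build_source_text` are hypothetical and are not the tokenizer's actual code.

```python
from typing import List


def linearize_table(headers: List[str], rows: List[List[str]]) -> str:
    """Flatten a table into one string: a header segment, then numbered rows."""
    parts = ["col : " + " | ".join(headers)]
    for index, row in enumerate(rows, start=1):
        parts.append(f"row {index} : " + " | ".join(row))
    return " ".join(parts)


def build_source_text(question: str, headers: List[str], rows: List[List[str]]) -> str:
    """Prepend the question so the encoder sees question and table as one sequence."""
    return f"{question} {linearize_table(headers, rows)}"


headers = ["city", "country"]
rows = [["Paris", "France"], ["Kyoto", "Japan"]]
source = build_source_text("which city is in japan?", headers, rows)
print(source)
```

The decoder then only has to generate the answer text, which is why no table-specific model head is needed.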

As the weights can be directly loaded into a BART model, this PR only implements TapexTokenizer, which can be used to prepare tables and corresponding texts for the model.
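A hedged usage sketch of that split (tokenizer from this PR, model class from BART) might look as follows. It assumes `transformers`, `torch`, and `pandas` are installed and that the checkpoint name `microsoft/tapex-base-finetuned-wtq` is available on the hub; the function name `answer_table_question` and the example table are made up for illustration.

```python
def answer_table_question(table_dict, question, checkpoint="microsoft/tapex-base-finetuned-wtq"):
    """Answer a question about a small table with a TAPEX checkpoint loaded as BART."""
    # Heavy dependencies are imported lazily so this sketch can be loaded without them.
    import pandas as pd
    from transformers import BartForConditionalGeneration, TapexTokenizer

    tokenizer = TapexTokenizer.from_pretrained(checkpoint)
    model = BartForConditionalGeneration.from_pretrained(checkpoint)

    # The tokenizer expects the table as a pandas DataFrame of strings.
    table = pd.DataFrame.from_dict(table_dict)
    encoding = tokenizer(table=table, query=question, return_tensors="pt")
    outputs = model.generate(**encoding)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Illustrative data only; every cell is a string.
EXAMPLE_TABLE = {
    "year": ["1896", "1900", "1904"],
    "city": ["athens", "paris", "st. louis"],
}
```

Calling `answer_table_question(EXAMPLE_TABLE, "in which year did paris host?")` downloads the checkpoint on first use, so it is left out of the block itself.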

This PR also adds 3 scripts that showcase how to fine-tune TAPEX on 3 important benchmarks: WikiSQL and WTQ for table question answering and TabFact for table fact verification.

Kudos to @SivilTaram (the original author) for improving my initial TapexTokenizer implementation, as well as adding the 3 fine-tuning scripts.

HuggingFaceDocBuilderDev · 2022-03-29T10:19:05Z

The documentation is not available anymore as the PR was closed or merged.

patil-suraj

Thanks a lot for adding this super cool model!
LGTM, just left a few nits.

docs/source/model_doc/tapex.mdx

src/transformers/models/tapex/__init__.py

src/transformers/models/tapex/tokenization_tapex.py

tests/tapex/test_tokenization_tapex.py

examples/research_projects/tapex/run_wikisql_with_tapex.py

examples/research_projects/tapex/run_wikitablequestions_with_tapex.py

sgugger

Thanks for the PR!

If the expected architecture should be BART by default, then this model should be added in the relevant auto mapping to work with AutoConfig and AutoModelForXxx. This is just a default value that can be changed in the config if there are checkpoints that rely on a different architecture.
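The idea of "a default that the config can override" can be sketched with a toy registry. This is not the code in `configuration_auto.py` or `modeling_auto.py`; the names `DEFAULT_ARCHITECTURES` and `resolve_architecture`, and the dict-shaped config, are assumptions made for illustration.

```python
from collections import OrderedDict

# Toy registry: model type -> architecture used when the config names none.
DEFAULT_ARCHITECTURES = OrderedDict(
    [
        ("tapex", "BartForConditionalGeneration"),
        ("bart", "BartForConditionalGeneration"),
        ("t5", "T5ForConditionalGeneration"),
    ]
)


def resolve_architecture(config: dict) -> str:
    """Return the architecture named in the config, else the registered default."""
    explicit = config.get("architectures")
    if explicit:
        return explicit[0]
    model_type = config["model_type"]
    if model_type not in DEFAULT_ARCHITECTURES:
        raise ValueError(f"Unknown model type: {model_type!r}")
    return DEFAULT_ARCHITECTURES[model_type]


print(resolve_architecture({"model_type": "tapex"}))
print(resolve_architecture({"model_type": "tapex", "architectures": ["SomeOtherModel"]}))
```

Without the `"tapex"` entry the first call would raise, which mirrors the concern that a model missing from the mapping cannot be loaded through the auto classes.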

README.md

docs/source/model_doc/tapex.mdx

tests/tapex/test_tokenization_tapex.py

NielsRogge · 2022-03-29T15:26:20Z

make fixup is complaining:

examples/research_projects/tapex/run_wikitablequestions_with_tapex.py:50:1: F401 'wikisql_utils._TYPE_CONVERTER' imported but unused
examples/research_projects/tapex/run_wikitablequestions_with_tapex.py:50:1: F401 'wikisql_utils.retrieve_wikisql_query_answer_tapas' imported but unused

However, these functions are used in the script, so I can't remove these imports.

SivilTaram · 2022-03-30T03:23:59Z

@NielsRogge Thanks for your huge effort! I think these two warnings are correct: the two imports are only used in run_wikisql_with_tapex.py, not in run_wikitablequestions_with_tapex.py (the file named in the warning). I think we can remove them.

examples/research_projects/tapex/run_wikitablequestions_with_tapex.py:50:1: F401 'wikisql_utils._TYPE_CONVERTER' imported but unused
examples/research_projects/tapex/run_wikitablequestions_with_tapex.py:50:1: F401 'wikisql_utils.retrieve_wikisql_query_answer_tapas' imported but unused
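F401 is reported per file, so the same import line can be fine in one script and unused in another. The sketch below shows the idea with the standard `ast` module. It is a deliberate simplification and not how flake8 or pyflakes is implemented; the two script bodies are made-up stand-ins that only reuse the import names from the warning.

```python
import ast
from typing import Set


def unused_imports(source: str) -> Set[str]:
    """Return imported names that are never referenced in the same source file."""
    imported, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            imported.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Import):
            imported.update((alias.asname or alias.name).split(".")[0] for alias in node.names)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return imported - used


IMPORT_LINE = "from wikisql_utils import _TYPE_CONVERTER, retrieve_wikisql_query_answer_tapas\n"

# Hypothetical script bodies: one never touches the imports, the other references both.
WTQ_SCRIPT = IMPORT_LINE + "def main():\n    return 0\n"
WIKISQL_SCRIPT = IMPORT_LINE + (
    "def main():\n    return _TYPE_CONVERTER, retrieve_wikisql_query_answer_tapas\n"
)

print(sorted(unused_imports(WTQ_SCRIPT)))
print(sorted(unused_imports(WIKISQL_SCRIPT)))
```

The source is only parsed, never executed, so `wikisql_utils` does not need to exist for the check to run.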

NielsRogge · 2022-04-04T09:25:40Z

@sgugger I've addressed all comments.

sgugger

The model is still missing from the auto configuration and auto model API to provide the BART default, which means that it won't work with AutoModelForSeq2SeqLM, for instance. This should be added before merging.

examples/research_projects/tapex/README.md

NielsRogge · 2022-04-04T14:58:08Z

All models on the hub do work with the Auto API, can you elaborate? TAPEX is also added to configuration_auto.py and modeling_auto.py.

NielsRogge requested review from sgugger and patil-suraj March 29, 2022 10:11

patil-suraj approved these changes Mar 29, 2022

sgugger reviewed Mar 29, 2022

sgugger approved these changes Apr 4, 2022

NielsRogge force-pushed the add_tapex_bis branch 2 times, most recently from b078744 to 5f86fa5 Compare April 7, 2022 08:36

Niels Rogge and others added 18 commits April 8, 2022 08:33

Add TapexTokenizer

eb9571e

Improve docstrings and provide option to provide answer

0fad12f

Remove option for pretokenized inputs

5bfd512

Add TAPEX to README

22aacce

Fix copies

5f03b5f

Remove option for pretokenized inputs

cd6a092

Initial commit: add tapex fine-tuning examples on both table-based question answering and table-based fact verification.

32cd9ea

- Draft a README file for running the script and introducing some background. - Remove unused code lines in tabfact script. - Disable the default `pad_to_max_length` option which is memory-consuming.

8eeb61c

* Support as_target_tokenizer function for TapexTokenizer. * Fix the do_lower_case behaviour of TapexTokenizer. * Add unit tests for target scenarios and cased/uncased scenarios for both source and target.

7345d90

* Replace the label BartTokenizer with TapexTokenizer's as_target_tokenizer function. * Fix typos in tapex example README.

ada18cb

* fix the evaluation script - remove the property task_name

df04eaa

* Make the label space more clear for tabfact tasks

c4de6c4

* Using a new fine-tuning script for tapex-base on tabfact.

00122ac

* Remove the lowercase code outside the tokenizer - we use the tokenizer to control whether do_lower_case * Guarantee the hyper-parameter can be run without out-of-memory on 16GB card and report the new reproduced number on wikisql

b95675c

* Remove the default tokenizer_name option. * Provide evaluation command.

37c021c

* Support for WikiTableQuestion dataset.

be6be26

Fix a typo in README.

b70a662

* Fix the datasets's key name in WikiTableQuestions

2ad4fce

Niels Rogge and others added 14 commits April 8, 2022 08:33

Run make fixup and move test to folder

c5fb28c

Fix quality

2ea26c5

Apply suggestions from code review

a2cc55c

Apply suggestions from code review

55f1025

Co-authored-by: Suraj Patil <surajp815@gmail.com>

Apply suggestions from code review

bf81ceb

Apply suggestions from code review

2b3df5e

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Apply some more suggestions from code review

004c357

Improve docstrings

6b657a6

Overwrite failing test

1fb0bff

Improve comment in example scripts

39a5f29

Fix rebase

68bc8e0

Add TAPEX to Auto mapping

7cadded

Add TAPEX to auto config mappings

050f803

Put TAPEX higher than BART in auto mapping

8d19141

NielsRogge force-pushed the add_tapex_bis branch from 6a2199a to 8d19141 Compare April 8, 2022 06:40

Add TAPEX to doc tests

4640550

NielsRogge merged commit 4ef0abb into huggingface:main Apr 8, 2022
