
Optimize the tokenization #143

Open
HillZhang1999 opened this issue Dec 9, 2021 · 9 comments

@HillZhang1999

First, thanks for your excellent work. Here is my question:

  • I used your code to reproduce the results in your paper, but found that CPU utilization was very high during training, especially in stage 1, while GPU utilization was not always at 100%; it sometimes dropped to 50~60% and fluctuated.
  • After debugging, I assume the cause is the dynamic word-piece tokenization performed in the indexer.
  • I also made minor changes to adapt your code for Chinese GEC, e.g., 1) upgraded allennlp to the latest version; 2) discarded the word-piece operations, since in Chinese we directly use characters as input units. In the Chinese experiments, I found that after these adaptations CPU usage was heavily reduced and training was greatly accelerated (e.g., 900M sentences take ~5 days for English, but only 1 day for Chinese).
  • So I wonder whether your implementation could be accelerated by upgrading allennlp to the latest version, or by preprocessing the data statically before training (i.e., doing the word-piece or BPE segmentation ahead of time; see the sketch after this list).
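A minimal sketch of the static pre-tokenization idea (a hypothetical standalone script, not part of the gector code base; the file names and the use of transformers.AutoTokenizer are my own assumptions):

```python
# Hypothetical offline pre-tokenization sketch, not part of the gector code base.
# The idea: run the word-piece tokenizer once over the training corpus and cache
# per-word sub-token IDs, so the training-time indexer only has to read integers.
import pickle
from transformers import AutoTokenizer

def pretokenize(input_path: str, output_path: str, model_name: str = "bert-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    cached = []
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            words = line.rstrip("\n").split()  # gector input is already split into words
            # Keep sub-token IDs grouped per word so edit labels can still be aligned.
            cached.append([tokenizer.encode(w, add_special_tokens=False) for w in words])
    with open(output_path, "wb") as f:
        pickle.dump(cached, f)

if __name__ == "__main__":
    pretokenize("train.txt", "train.wordpiece.pkl")
```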
@skurzhanskyi
Collaborator

skurzhanskyi commented Dec 9, 2021

That's a good suggestion. Indeed, tokenization may require heavy CPU usage.
I don't see how updating AllenNLP would help, though. Maybe you have a suggestion on how we can optimize the code?

@HillZhang1999
Author

Hi, regarding upgrading AllenNLP: training can be accelerated because we can simply set the parameter use_amp to true in the GradientDescentTrainer and thus train gector with Automatic Mixed Precision (of course, some extra adaptations are needed to support gector-specific features like cold_steps).
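For reference, a rough sketch of what that could look like (assuming AllenNLP >= 1.0, where GradientDescentTrainer exposes a use_amp flag; building the model, optimizer, and data loader is unchanged and left out here):

```python
# Sketch only: enabling Automatic Mixed Precision in an upgraded AllenNLP setup.
# The model, optimizer, and data loader are built elsewhere, exactly as in the
# regular gector training script; only the trainer construction changes.
from allennlp.training import GradientDescentTrainer

def build_trainer(model, optimizer, data_loader, num_epochs=20, cuda_device=0):
    return GradientDescentTrainer(
        model=model,
        optimizer=optimizer,
        data_loader=data_loader,
        num_epochs=num_epochs,
        cuda_device=cuda_device,
        use_amp=True,  # the trainer then runs forward/backward under torch.cuda.amp
    )
```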
As for the tokenization problem, perhaps we could tokenize the training data before training starts and just load the results during training, to avoid the heavy CPU usage and redundant tokenization?
Thank you for your kind reply!

@Jason3900

Hey! I'm wondering how to modify the code to support cold_steps with AMP. I tried, but if I freeze the encoder for the first few epochs, it seems impossible to unfreeze it later with "params.requires_grad = True": the loss and accuracy do not improve afterwards. Have you figured out a possible solution? I need some help.

@HillZhang1999
Author

One simple solution: save the model parameters to disk after finishing the cold steps, then start a new training run, reload the parameters, and unfreeze the BERT encoder.
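A sketch of that two-stage workaround (a hypothetical helper; it assumes a make_trainer(model, num_epochs) factory like the one sketched earlier, and that the BERT encoder sits under model.text_field_embedder as in gector's Seq2Labels model):

```python
import torch

def two_stage_train(model, make_trainer, cold_epochs, total_epochs,
                    ckpt_path="after_cold_steps.th"):
    """Hypothetical sketch of the save-and-restart workaround described above."""
    def set_encoder_trainable(trainable: bool):
        # In gector's Seq2Labels model the BERT encoder is model.text_field_embedder.
        for p in model.text_field_embedder.parameters():
            p.requires_grad = trainable

    # Stage 1: cold steps with the encoder frozen.
    set_encoder_trainable(False)
    make_trainer(model, num_epochs=cold_epochs).train()
    torch.save(model.state_dict(), ckpt_path)

    # Stage 2: start a fresh run, reload the weights, and unfreeze the encoder.
    model.load_state_dict(torch.load(ckpt_path))
    set_encoder_trainable(True)
    make_trainer(model, num_epochs=total_epochs - cold_epochs).train()
```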

@Jason3900

Jason3900 commented Feb 13, 2022

Okay, I found that it works if the requires_grad option is set inside the forward method. Thank you, by the way~
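For reference, that fix might look roughly like this (a sketch; the helper name is hypothetical, and the epoch bookkeeping on the model is assumed):

```python
# Sketch of toggling the encoder's requires_grad inside forward(), as described above.
# Calling this at the top of the model's forward() re-applies the flag on every step,
# which is what makes the later unfreezing actually take effect.
def apply_cold_step_freeze(encoder, current_epoch: int, cold_epochs: int) -> None:
    freeze = current_epoch < cold_epochs
    for p in encoder.parameters():  # e.g., self.text_field_embedder in Seq2Labels
        p.requires_grad = not freeze
```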

@damien2012eng

Hi @HillZhang1999, could you please share what changes you made to use the latest AllenNLP? Thanks!

@HillZhang1999
Author

Maybe you can refer to this repo: https://github.com/HillZhang1999/MuCGEC/tree/main/models/seq2edit-based-CGEC

@damien2012eng

Thanks for replying so quickly!
It looks like you did not use the tokenization file in your code base? I tried to replace the existing ones with pretrainedIndexer and pretrainedEmbedder directly; however, the predicted results are different.

@Jason3900

Jason3900 commented Aug 31, 2022

Also, if you would like to train seq2edit GEC without the AllenNLP bundle but with faster speed, I made a deepspeed + pytorch + transformers implementation; you can refer to this repo:
https://github.com/blcuicall/CCL2022-CLTC/tree/main/baselines/track3/seq2edit
