
Training performance drop after augmentation refactoring #3092

Open · DanBmh opened this issue Jun 19, 2020 · 9 comments
DanBmh commented Jun 19, 2020

I have the feeling that training performance, both duration and accuracy, got worse after the augmentation refactoring commits.

Before, training on my dataset took about 2:10 h; today it took 3:20 h, with about the same number of epochs. At the beginning one epoch takes about 5 min, but after epoch 18 epochs suddenly need 8 min on average. I didn't see this behaviour in the trainings from two days ago.

The accuracy also got a bit worse:

| Dataset | Additional Infos | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge | | Test: 32.844025, Validation: 36.912005 | 14 | WER: 0.240091, CER: 0.087971 |
| Voxforge | without freq_and_time_masking augmentation | Test: 33.698494, Validation: 38.071722 | 10 | WER: 0.244600, CER: 0.094577 |
| Voxforge | using new audio augmentation options (AUG_AUDIO code1) | Test: 29.280865, Validation: 33.294815 | 21 | WER: 0.220538, CER: 0.079463 |
| Voxforge | after refactoring | Test: 33.317413, Validation: 38.678969 | 20 | WER: 0.243480, CER: 0.088640 |

These were the options I set before:

```sh
AUG_PITCH_TEMPO="--augmentation_pitch_and_tempo_scaling \
  --augmentation_pitch_and_tempo_scaling_min_pitch 0.98 \
  --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
  --augmentation_pitch_and_tempo_scaling_max_tempo 1.2"
AUG_ADD_DROP="--data_aug_features_additive 0.2 \
  --augmentation_spec_dropout_keeprate 0.95"
AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
AUG_AUDIO="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  --augment gaps[p=0.05,n=1:3~2,size=10:100] \
  --augment resample[p=0.1,rate=12000:8000~4000] \
  --augment codec[p=0.1,bitrate=48000:16000] \
  --augment volume[p=0.1,dbfs=-10:-40]"
```
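For reference, this is how I read the bracketed value syntax in the `--augment` flags (my interpretation from the flag descriptions, not the project's actual parser): `start:stop` ranges ramp linearly over the training run, and `~radius` adds a uniform random offset per sample around the current center. A minimal sketch:

```python
import random

def sample_value(spec, progress):
    """Sample one augmentation parameter from a spec string.

    Assumed semantics (a sketch, not DeepSpeech's real parser):
    "start:stop" ramps linearly with training progress in [0, 1],
    and "~radius" adds a uniform random offset around the center.
    """
    radius = 0.0
    if "~" in spec:
        spec, radius_str = spec.split("~")
        radius = float(radius_str)
    if ":" in spec:
        start, stop = (float(v) for v in spec.split(":"))
    else:
        start = stop = float(spec)
    center = start + (stop - start) * progress
    return random.uniform(center - radius, center + radius)

# e.g. rate=12000:8000~4000 halfway through training:
# the center is 10000, sampled uniformly from [6000, 14000]
print(sample_value("12000:8000~4000", 0.5))
```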

And these are the ones I used for today's run:

```sh
AUG_AUDIO="--augment volume[p=0.1,dbfs=-10:-40] \
  --augment pitch[p=0.1,pitch=1.1~0.95] \
  --augment tempo[p=0.1,factor=1.25~0.75]"
AUG_ADD_DROP="--augment dropout[p=0.1,rate=0.05] \
  --augment add[p=0.1,domain=signal,stddev=0~0.5]"
AUG_FREQ_TIME="--augment frequency_mask[p=0.1,n=1:3,size=1:5] \
  --augment time_mask[p=0.1,domain=signal,n=3:10~2,size=50:100~40]"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  --augment resample[p=0.1,rate=12000:8000~4000] \
  --augment codec[p=0.1,bitrate=48000:16000]"
```

I also had to reduce the batch size from 30 to 24 because I got error #3088. About two months ago I could use 36 without any problems.

I know there is a bit of randomness in the accuracy, and I did change some of the augmentation params slightly, but the change in results is bigger than expected.

@tilmankamp do you have an idea about this?

DanBmh commented Jun 19, 2020

I ran another test with the code from directly before the refactoring (commit 188a6f2c1ee53dc79acf8abceaf729b5f9a05e7a).

This time one epoch took 4 min on average and the whole training took 1:45 h.

| Dataset | Additional Infos | Losses | Training epochs of best model | Result |
|---|---|---|---|---|
| Voxforge | | Test: 28.846869, Validation: 32.680268 | 16 | WER: 0.225360, CER: 0.083504 |

I now used a batch size of 24 and updated the params again to better match the params above:

```sh
AUG_AUDIO="--augmentation_pitch_and_tempo_scaling \
  --augmentation_pitch_and_tempo_scaling_min_pitch 0.95 \
  --augmentation_pitch_and_tempo_scaling_max_pitch 1.1 \
  --augmentation_pitch_and_tempo_scaling_max_tempo 1.25"
AUG_ADD_DROP="--data_aug_features_additive 0.25 \
  --augmentation_spec_dropout_keeprate 0.95"
AUG_FREQ_TIME="--augmentation_freq_and_time_masking True"
AUG_EXTRA="--augment reverb[p=0.1,delay=50.0~30.0,decay=10.0:2.0~1.0] \
  --augment gaps[p=0.05,n=1:3~2,size=10:100] \
  --augment resample[p=0.1,rate=12000:8000~4000] \
  --augment codec[p=0.1,bitrate=48000:16000] \
  --augment volume[p=0.1,dbfs=-10:-40]"
```

tilmankamp commented Jun 22, 2020

@DanBmh The augmentations `volume`, `gaps`, `reverb`, `codec`, `resample` and `overlay` are most likely not responsible for this discrepancy, as their implementations were not changed during the refactoring. For the others, it'd be helpful to compare them one by one with their former implementations to get a better understanding of the problem. I'll do some performance tests here.

tilmankamp commented:
My observations so far:

  • Regarding batch size: I got the biggest difference from the old implementation when switching from the former combined --augmentation_pitch_and_tempo_scaling to --augment pitch plus --augment tempo. The additional memory requirement comes from the doubling of certain allocations, as the involved ops are no longer part of one augmentation sub-graph. With the refactored code I had to decrease the batch size from 38 to 35 to get it working.
  • The new internal clock tensor has some very small overhead that should in most cases not require a batch-size adjustment.
  • The way dropout is implemented now seems to require slightly more memory, as a tensor of random values is allocated that is the same size as the augmentation target.
  • At least when comparing dropout (more tests needed), there was no difference in runtime.
  • Still to do: a reliable comparison of accuracy and dev-loss development per augmentation.

DanBmh commented Jul 10, 2020

I had some time to run some more tests today (with master from about two days ago).

This time an epoch took about 4:30 min on average. I also tried different dropout values:

  • With --augment dropout[p=1,rate=0.05], which I thought should match --augmentation_spec_dropout_keeprate 0.95 (did this change?), the network only learned for two epochs, so it trained almost nothing.
  • --augment dropout[p=0.5,rate=0.05] also produced really poor results (test loss: 43.882633).
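To check my assumption that the old keep-rate and the new drop rate describe the same per-sample operation, here is a tiny pure-Python sketch (the function names and list-based "features" are mine, not the project's code). With `p=1` and `rate = 1 - keeprate` the expected fraction of surviving values is identical; with `p<1`, a whole sample escapes dropout entirely with probability `1 - p`, which the old flag had no equivalent for:

```python
import random

def old_dropout(features, keeprate, rng):
    # Old-style flag: keep each feature value with probability
    # `keeprate`, applied to every sample (a sketch, not the real code).
    return [v if rng.random() < keeprate else 0.0 for v in features]

def new_dropout(features, p, rate, rng):
    # New-style augment: with probability `p` per sample, zero each
    # value with probability `rate` (again only a sketch).
    if rng.random() < p:
        return [0.0 if rng.random() < rate else v for v in features]
    return list(features)

rng = random.Random(0)
x = [1.0] * 10_000
mean = lambda vals: sum(vals) / len(vals)
print(mean(old_dropout(x, 0.95, rng)))       # about 0.95 of values survive
print(mean(new_dropout(x, 1.0, 0.05, rng)))  # same expectation with p=1
print(mean(new_dropout(x, 0.5, 0.05, rng)))  # whole sample untouched with prob 0.5
```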

DanBmh commented Oct 1, 2020

@tilmankamp Any updates on the accuracy problem?

JRMeyer commented Nov 12, 2020

@DanBmh -- did you ever reach a conclusion on this? Have you been running augmentation with newer releases?

DanBmh commented Nov 17, 2020

Were there important changes to the augmentations in between? I didn't check.

I didn't run further tests, just the ones above. For my own trainings I still use the old version.

DanBmh commented Dec 8, 2020

I might have found a reason for the accuracy problem. First, I misunderstood the augmentation flag description, and the pitch and tempo flags are not converted correctly. Second, the new start:stop logic could be another reason. I normally use a high training epoch number like 1000, because the training is stopped with early stopping. But I assume that the :stop value is tied to the epochs flag, and I'm therefore using only the start values of the augmentations instead of the full range.

I will try to run a test soon, but I don't believe this will also solve the slower training.

For the second problem, maybe a new flag like `augment_growth_epochs` could be helpful for better combination with early stopping.
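To illustrate the suspected interaction: if the augmentation progress is computed as current_epoch / planned_epochs (my assumption about the ramp, not confirmed from the code), then with --epochs 1000 and early stopping around epoch 20, a start:stop range barely moves off its start value:

```python
def ramp_value(start, stop, current_epoch, planned_epochs):
    # Hypothetical linear ramp over the *planned* training length.
    progress = min(current_epoch / planned_epochs, 1.0)
    return start + (stop - start) * progress

# rate=12000:8000 with --epochs 1000, early-stopped at epoch 20:
print(ramp_value(12000, 8000, 20, 1000))  # 11920.0, still close to the start
# versus a planned length that matches the actual stopping point:
print(ramp_value(12000, 8000, 20, 20))    # 8000.0, the full range was used
```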

reuben commented Dec 8, 2020

> For the second problem maybe a new flag like augment_growth_epochs could be helpful for better combination with early-stopping.

Yeah, that could be useful. Usually for hyperparameter schedules there's a separate start/ramp-up/ramp-down/stop range, distinct from the number of steps/epochs of the whole training run.
