Nadam optimizer differences #1440

Open

albertz opened this issue Oct 18, 2023 · 0 comments

albertz (Member) commented Oct 18, 2023
Our TF-layers Nadam optimizer is basically the same as Adam, except that we pass use_nesterov=True to training_ops.apply_adam. It is based on TF 1.15 tensorflow/contrib/opt/python/training/nadam_optimizer.py (see the sketch below the list). So it also has the same options as normal Adam:

  • learning_rate=0.001
  • beta1=0.9
  • beta2=0.999
  • epsilon=1e-8

I noticed that tf.keras.optimizers.experimental.Nadam has some different options:

  • epsilon=1e-07
  • weight_decay=None
  • clipnorm=None
  • clipvalue=None
  • global_clipnorm=None
  • use_ema=False
  • ema_momentum=0.99
  • ema_overwrite_frequency=None
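Just for comparison, a sketch of constructing it with the defaults listed above (purely illustrative, not code we use; beta_1/beta_2 are the Keras argument spellings):

```python
import tensorflow as tf

opt = tf.keras.optimizers.experimental.Nadam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7,        # different default than our 1e-8
    weight_decay=None,   # decoupled weight decay, off by default
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,       # EMA of the weights, off by default
    ema_momentum=0.99,
    ema_overwrite_frequency=None,
)
```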

Ok, I did not look into this further. The clipping and weight decay are probably added there in a decoupled way, on top of the base optimizer update. use_ema is disabled by default, so the ema_... options are not used. So it is probably mostly the same, except for a different epsilon default.

See also:
#766 (comment)
keras-team/keras#15710

Now I noticed that in PyTorch, torch.optim.NAdam again has different options:

  • lr=0.002
  • eps=1e-08
  • weight_decay=0
  • momentum_decay=0.004
  • decoupled_weight_decay=False

I specifically wonder about the momentum_decay. What is this?
