Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agent: restart template runner on retry for unlimited retries #11775

Merged
merged 26 commits into from Jun 21, 2021

Conversation

calvn
Copy link
Member

@calvn calvn commented Jun 4, 2021

This PR takes an alternate approach compared to the one from #11711 on re-introducing unlimited template retry behavior that existed on Vault 1.5.x and earlier. A separate flag controls whether agent restarts the templating server on error, which is how it used to work up until 1.5.x.

This approach maintains backward compatibility on how retry works with Agent 1.6.x and onward while optionally allowing users to revert to the old (1.5.x and earlier) behavior if desired. I've chosen the value to reside at vault.retry.template_unlimited_retries = <bool> since the template stanza is an actual CT config and scoped on a per-template basis, but I'm also open to suggestions where else it could go or what else it could be named.

Update: See #11775 (comment) for the latest update on the config value and format.

Supersedes #11711

TODO:

  • Update/add tests
  • Update docs
  • Include changelog

@calvn calvn added this to the 1.8 milestone Jun 4, 2021
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 4, 2021 21:00 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 4, 2021 21:00 Inactive
@ncabatoff
Copy link
Contributor

I think I'm okay with the idea of having a new key governing the unlimited-retry behaviour. I'm not sure I like where it is or the fact that the default would be limited-retries - though I get that agent has had limited retries since 1.5 and that going back and forth isn't great. Since traditionally Agent was meant to be a long-running process, exiting due to some errors talking to Vault is unusual/surprising behaviour, which is why I'm advocating for making indefinite retries the default.

The vault section is supposed to be about Agent's interface to Vault, so it doesn't seem ideal to have a key in there that's template-specific. I see the new key being added as being similar to the top-level exit_after_auth key: something that governs the agent process's lifecycle.

Proposal 1: add a template_default section. The new key is similar/related to the template-specific error_on_missing_key at least the way that consul_template uses it, i.e. to exit on a missing key rather than log a message like Agent apparently does? (#11349).

This new template_default section would allow users to reduce duplication by specifying values that should apply to all template sections unless overridden. There would be two new keys in addition to those that are already defined:

  • exit_on_missing_key: this would replace error_on_missing_key, and when set true would make Agent exit immediately if a key is missing from a successful response
  • exit_on_retry_failure: this would be the inverse of the template_unlimited_retries key in this PR

Proposal 2:
An alternate approach would be add a single top-level key, e.g. template_server_lifecycle or a key within a new section, e.g. template_server.lifecycle, which would have values like:

  • long-lived: run indefinitely, log errors
  • one-shot: exit after rendering all templates successfully
  • until-failure: exit if a template isn't rendering successfully after exhausting configured retries

@calvn
Copy link
Member Author

calvn commented Jun 7, 2021

The reason that template_unlimited_retries is within vault.retry is because we intend to push out this change on a patch release of Vault and so it'd be ideal to avoid introducing a new stanza and move things around or doing things like reverting the default behavior if possible. I'd be a proponent of making the changes from proposal 1 as a separate task that's slated for a major Vault release.

I'm open to moving template_unlimited_retries to be a top-level parameter similar to how error_on_missing_key is, but I'm a bit hesitant to flip the default to be unlimited retries. This re-introduces issues such as containers/pods being stuck until its eviction timeout because agent doesn't signal that templates can't be rendered (see VAULT-134).

Thanks for the review and feedback!

@calvn calvn modified the milestones: 1.8, 1.7.3 Jun 7, 2021
@calvn
Copy link
Member Author

calvn commented Jun 7, 2021

@ncabatoff chatted with Product and it seems that we're okay with releasing this for the next major release. I'm open to the idea of having a new template_default stanza in this case.

I'm still unsure if we should change the default back to be unlimited though. Thoughts on this @jasonodonnell and @tvoran, especially around the k8s use case?

@calvn calvn modified the milestones: 1.7.3, 1.8 Jun 7, 2021
@vercel vercel bot temporarily deployed to Preview – vault June 9, 2021 23:27 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 9, 2021 23:27 Inactive
@calvn
Copy link
Member Author

calvn commented Jun 10, 2021

Providing a public update on where we stand on the default behavior after a few rounds of internal discussions:

We're going with Nick's proposal 1, with some slight modifications. We will be adding a new stanza template_config for template-related behaviors. For the moment there's going to be just one new parameter introduced - exit_on_retry_failure which controls whether the template runner should exit on failure. In the future, we may use this stanza to tune other template-related behaviors that doesn't quite fit in Agent's existing template stanza, since those are declared on a per-template basis.

template_config {
  exit_on_retry_failure = <bool>
}

As for the omission of exit_on_missing_key, there's already an existing parameter within template.error_on_missing_key which conjunction with exit_on_retry_failure=false should restore the old behavior of not exiting Agent if there are errors including, but not limited to, missing key. Past behavior which used to restart the template runner once its internal retry were exhausted will be restored in this PR. This should resolve #11349.

It's worth noting that the default value for exit_on_retry_failure will be false, which means that we will be re-introducing the behavior of unlimited retries that existed on Vault Agent 1.5.x and earlier.

Our rationale for reverting back to unlimited template retries by default is that there's a valid use case brought to us by our users and customers for Agent to continue running when other subsystems such as caching are defined, regardless of whether the templates rendered successfully or not. We also understand that there are other use cases that want the exact opposite. For instance, in the kubernetes use case having unlimited retries can lead to containers and pods being forever stuck during deployments. However, this time around the unlimited template retries behavior can be turned off since this is not always desirable. Separately, our vault-helm chart will make this behavior configurable.

@vercel vercel bot temporarily deployed to Preview – vault-storybook June 10, 2021 21:35 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 10, 2021 21:35 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 10, 2021 21:59 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 10, 2021 21:59 Inactive
Copy link
Member

@tvoran tvoran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small suggestions.

website/content/docs/agent/template-config.mdx Outdated Show resolved Hide resolved
website/content/docs/agent/template-config.mdx Outdated Show resolved Hide resolved
website/content/docs/agent/template-config.mdx Outdated Show resolved Hide resolved
Co-authored-by: Jim Kalafut <jkalafut@hashicorp.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>
changelog/11775.txt Outdated Show resolved Hide resolved
command/agent_test.go Outdated Show resolved Hide resolved
command/agent_test.go Outdated Show resolved Hide resolved
command/agent_test.go Show resolved Hide resolved
Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 18, 2021 23:29 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 18, 2021 23:29 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 18, 2021 23:34 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 18, 2021 23:34 Inactive
@vercel vercel bot temporarily deployed to Preview – vault June 19, 2021 00:02 Inactive
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 19, 2021 00:02 Inactive
Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>
@vercel vercel bot temporarily deployed to Preview – vault-storybook June 21, 2021 17:49 Inactive
@calvn calvn merged commit 523a9c9 into main Jun 21, 2021
@calvn calvn deleted the agent-template-retry-unlimited branch June 21, 2021 23:10
jartek pushed a commit to jartek/vault that referenced this pull request Sep 11, 2021
…orp#11775)

* agent: restart template runner on retry for unlimited retries

* template: log error message early

* template: delegate retries back to template if param is set to true

* agent: add and use the new template config stanza

* agent: fix panic, fix existing tests

* changelog: add changelog entry

* agent: add tests for exit_on_retry_failure

* agent: properly check on agent exit cases, add separate tests for missing key vs missing secrets

* agent: add note on difference between missing key vs missing secret

* docs: add docs for template_config

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>

* docs: fix exit_on_retry_failure, fix Functionality section

* docs: update interaction title

* template: add internal note on behavior for persist case

* docs: update agent, template, and template-config docs

* docs: update agent docs on retry stanza

* Apply suggestions from code review

Co-authored-by: Jim Kalafut <jkalafut@hashicorp.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>

* Update changelog/11775.txt

Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>

* agent/test: rename expectExit to expectExitFromError

* agent/test: add check on early exits on the happy path

* Update website/content/docs/agent/template-config.mdx

Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>

Co-authored-by: Jason O'Donnell <2160810+jasonodonnell@users.noreply.github.com>
Co-authored-by: Tom Proctor <tomhjp@users.noreply.github.com>
Co-authored-by: Jim Kalafut <jkalafut@hashicorp.com>
Co-authored-by: Theron Voran <tvoran@users.noreply.github.com>
Co-authored-by: Brian Kassouf <briankassouf@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants