Cannot retry downloading S3 backup data when Agent-NGT data load times out #581

Open
rinx opened this issue Jul 15, 2020 · 1 comment
rinx commented Jul 15, 2020

Related to #503, #556.

Describe the issue:

Currently, vald-agent-ngt pods have these containers:

  • initContainers
    • agent-sidecar (initContainer mode: download S3 backup data to volume)
  • containers
    • agent-ngt
    • agent-sidecar (sidecar mode: upload S3 backup data)

agent-sidecar in initContainer mode may fail to finish downloading the backup data and still return status code 0 (an RST stream from the remote host causes this case). In this case, fragments of backup data may remain in the volume, and they block NGT startup (#503).
The ideal behavior of the pods in this state is to retry downloading the backup data. However, a failing status of a container doesn't trigger pod restarts.

If there is a liveness probe server in the pods, it can trigger pod restarts.
However, agent-NGT has a postStop phase (executed after the liveness probe kills the container) to save the index, and agent-sidecar has a postStop phase to upload the index.
So internal/servers/server needs to be improved to handle these problems.

rinx added the type/bug, team/core, and priority/medium labels Jul 15, 2020
@issue-label-bot

Issue-Label Bot is automatically applying the label type/bug to this issue, with a confidence of 0.88. Please mark this comment with 👍 or 👎 to give our bot feedback!

