Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Healthcheck Output field delay #9113

Open
ndhanushkodi opened this issue Nov 5, 2020 · 3 comments
Open

API: Healthcheck Output field delay #9113

ndhanushkodi opened this issue Nov 5, 2020 · 3 comments
Labels
theme/performance Performance benchmarking or potential improvement type/enhancement Proposed improvement or new feature

Comments

@ndhanushkodi
Copy link
Contributor

Overview of the Issue

There's a ~7 min delay in showing the service health check output in the UI. When registering a service with a failing healthcheck, when I look at the Consul UI, I see the Output field as blank for 7 min before it gets populated with the failure. When I curl localhost:8500/v1/agent/checks, I see the Output error message without delay, so I'd expect to see it sooner in the UI.

May be related issue: #8225

Reproduction Steps

Steps to reproduce this issue, eg:

  1. Clone this repo and make dev to put binary in GOPATH
  2. consul agent -dev
  3. Open the UI in a browser tab http://localhost:8500/ui/dc1/services
  4. Create payload.json with the service and healthcheck to register. (The healthcheck will fail since nothing is listening on port 8888):
payload.json
{
  "ID": "redis1",
  "Name": "redis",
  "Tags": [
    "primary",
    "v1"
  ],
  "Address": "127.0.0.1",
  "TaggedAddresses": {
    "lan": {
      "address": "127.0.0.1",
      "port": 8000
    }
  },
  "Port": 8000,
  "Check": {
    "CheckID": "service:redis1",
    "Name": "Redis health check",
    "Notes": "TCP health check",
    "ServiceID": "redis1",
    "TCP": "localhost:8888",
    "Interval": "5s",
    "Timeout": "1s",
    "DeregisterCriticalServiceAfter": "600s"
  }
}
  1. Run curl --request PUT --data @payload.json http://localhost:8500/v1/agent/service/register
  2. Look at the UI, and click on the redis1 service
  3. You will notice that the Output field is empty
  4. If you leave everything as is, in about 7 min, the Output field will be populated with dial tcp [::1]:8888: connect: connection refused
  5. If you curl localhost:8500/v1/agent/checks you will see the healthcheck Output field has been set, but it is not showing up in the UI until 7 min later.
  6. One other note, if you re run curl --request PUT --data @payload.json http://localhost:8500/v1/agent/service/register within the 7 min, the Output field gets populated

Consul info for both Client and Server

Only ran one consul agent, using consul agent -dev locally.

Starting Consul agent ``` Version: '1.9.0-dev' Node ID: 'a14f00c2-c789-a951-31f7-bf9c3003940f' Node name: 'dhanushkodi.local' Datacenter: 'dc1' (Segment: '') Server: true (Bootstrap: false) Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600) Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302) Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false ```
Server info (`consul info`)
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = dev
	revision = 52f3714c
	version = 1.9.0
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 127.0.0.1:8300
	server = true
raft:
	applied_index = 27
	commit_index = 27
	fsm_pending = 0
	last_contact = 0
	last_log_index = 27
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:a14f00c2-c789-a951-31f7-bf9c3003940f Address:127.0.0.1:8300}]
	latest_configuration_index = 0
	num_peers = 0
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 105
	max_procs = 16
	os = darwin
	version = go1.15.3
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 1
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1
	members = 1
	query_queue = 0
	query_time = 1

Operating system and Environment details

OS, Architecture, and any other information you can provide about the environment.
Running locally on a macbook pro.

os = darwin
version = go1.15.3

Log Fragments

Include appropriate Client or Server log fragments. If the log is longer than a few dozen lines, please include the URL to the gist of the log instead of posting it in the issue. Use -log-level=TRACE on the client and server to capture the maximum log detail.

The agent logs show:
2020-11-05T12:16:42.412-0800 [WARN] agent: Check socket connection failed: check=service:redis1 error="dial tcp [::1]:8888: connect: connection refused"
far before the UI shows the Output error message.

@jsosulska jsosulska added theme/ui Anything related to the UI theme/performance Performance benchmarking or potential improvement type/enhancement Proposed improvement or new feature labels Nov 24, 2020
@johncowen
Copy link
Contributor

Hi @ndhanushkodi 👋

The UI uses different catalog endpoints to get healthcheck updates, not the agent endpoint:

  1. For Nodes: /v1/internal/ui/node/node-name
  2. For ServiceInstances: /v1/health/service/:service-name

I ran through your walkthrough and there is indeed a difference between the agent endpoint and the health endpoint (which is health according to the catalog), but this is a difference in the API, so I'm guessing its an issue with Consul itself not necessarily the UI, ~7mins does seem like an awfully long time!

Maybe @freddygv or @rboyer can help with some info here? But for now I'll remove the UI label.

Thanks for the issue report!

@johncowen johncowen removed the theme/ui Anything related to the UI label Dec 11, 2020
@johncowen johncowen changed the title UI: Healthcheck Output field delay API: Healthcheck Output field delay Dec 11, 2020
@freddygv
Copy link
Contributor

Hi @ndhanushkodi ,

This is due to an optimization around syncing check updates to servers where only the output has changed. There have been issues in the past where a check's output would churn while the check status remained the same. This would trigger frequent syncs to the servers, and as many writes to RAFT only because of a changing output.

The way we get around this is that whenever there's an output change but no status change, we set a 5 minute timer with some jitter to defer syncing the latest check output.

The reason why you're running into this is that checks are initially registered as critical. So when the check runs and gets the output, the status hasn't changed, so we defer syncing for a few minutes (while storing the output locally).

I think to make that update faster we would need to somehow distinguish the first update that comes from actually running the check. I will leave that decision up to @hashicorp/consul-foundations.

@ndhanushkodi
Copy link
Contributor Author

@johncowen, Thank you for taking a look!

@freddygv Ahh that makes sense thanks for the info on how it works! Yeah it's not super high priority just something me and a few others on the team noticed when running some manual tests for healthchecks on consul-k8s.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
theme/performance Performance benchmarking or potential improvement type/enhancement Proposed improvement or new feature
Projects
None yet
Development

No branches or pull requests

4 participants