
"failed to release lock" #27

Open
rstupek opened this issue Sep 22, 2022 · 6 comments
rstupek commented Sep 22, 2022

Is this the expected behavior for this plugin?

pteich (Owner) commented Sep 25, 2022

Do you have more context on what you did and when this error message occurred? How many concurrent Caddy servers are running, how much traffic do they handle, and how many certificates are involved?

That said, it can happen that a Consul distributed lock fails to release without causing any further issues.

rstupek (Author) commented Sep 26, 2022

I have a 3-node Consul cluster set up, and the Consul plugin does appear to be storing data. However, every new certificate that gets requested results in the above failure. Below is an example:
```
2022/09/26 16:25:46.080 INFO tls.on_demand obtaining new certificate {"remote_ip": "99.110.94.85", "remote_port": "50979", "server_name": "test.madeupcompany.com"}
2022/09/26 16:25:46.112 INFO caddy.storage.consul deleting key rw_test_82319838895603462 from Consul
2022/09/26 16:25:46.136 INFO tls.obtain acquiring lock {"identifier": "test.madeupcompany.com"}
2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com
2022/09/26 16:25:46.191 INFO tls.obtain lock acquired {"identifier": "test.madeupcompany.com"}
2022/09/26 16:25:46.196 INFO tls.obtain obtaining certificate {"identifier": "test.madeupcompany.com"}
2022/09/26 16:25:46.207 INFO http waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"}
2022/09/26 16:25:46.207 INFO http done waiting on internal rate limiter {"identifiers": ["test.madeupcompany.com"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": "support@ezot.com"}
2022/09/26 16:25:46.658 INFO http.acme_client trying to solve challenge {"identifier": "test.madeupcompany.com", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2022/09/26 16:25:46.870 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "18.118.122.75:56800", "distributed": false}
2022/09/26 16:25:46.919 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "23.178.112.208:35094", "distributed": false}
2022/09/26 16:25:47.061 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "34.219.117.149:61766", "distributed": false}
2022/09/26 16:25:47.085 INFO http served key authentication {"identifier": "test.madeupcompany.com", "challenge": "http-01", "remote": "3.120.149.9:43226", "distributed": false}
2022/09/26 16:25:47.346 INFO caddy.storage.consul deleting key acme/acme-v02.api.letsencrypt.org-directory/challenge_tokens/test.madeupcompany.com.json from Consul
2022/09/26 16:25:47.358 INFO http.acme_client authorization finalized {"identifier": "test.madeupcompany.com", "authz_status": "valid"}
2022/09/26 16:25:47.358 INFO http.acme_client validations succeeded; finalizing order {"order": "https://acme-v02.api.letsencrypt.org/acme/order/743838387/129091539597"}
2022/09/26 16:25:48.531 INFO http.acme_client successfully downloaded available certificate chains {"count": 2, "first_url": "https://acme-v02.api.letsencrypt.org/acme/cert/04b35be8294d5d7e3156cbdbbc116bf6fda3"}
2022/09/26 16:25:48.563 INFO tls.obtain certificate obtained successfully {"identifier": "test.madeupcompany.com"}
2022/09/26 16:25:48.563 INFO tls.obtain releasing lock {"identifier": "test.madeupcompany.com"}
2022/09/26 16:25:48.573 ERROR caddy.storage.consul failed to release lock: lock caddytls/issue_cert_test.madeupcompany.com not found
```

It seems that it created the lock without the prefix but attempts to release it with the prefix?

Consul KV does appear to have the correct entry set up:

    {
            "key": "caddytls/issue_cert_test.madeupcompany.com",
            "flags": 3304740253564472344,
            "value": ""
    },

kula commented Oct 2, 2022

I came across this while working on #28; you can see it if you run `go test --tags=consul -v`. It doesn't have anything to do with the prefix: the message `2022/09/26 16:25:46.136 INFO caddy.storage.consul creating Consul lock for issue_cert_test.madeupcompany.com` comes from

```go
cs.logger.Infof("creating Consul lock for %s", key)
```

which doesn't log the prefix, while the error you are seeing comes from

```go
go func() {
	<-lockActive
	err := cs.Unlock(ctx, key)
	if err != nil {
		cs.logger.Errorf("failed to release lock: %s", err)
	}
}()
```

which logs the error returned by Consul, and that error happens to include the full path the plugin asked for.

Rather, I think that error is coming from the logic in that goroutine that cleans up in the event a lock is lost. lockActive comes from

```go
lockActive, err := lock.Lock(ctx.Done())
```

Per the Consul SDK documentation (https://pkg.go.dev/github.com/hashicorp/consul/api#Lock), this returns "... a channel that is closed if our lock is lost or an error." It also seems that channel is closed if the original locker itself unlocked the lock.

So I think what is happening is:

  1. `cs.Unlock` is called, which makes the Consul SDK `Unlock` call and releases the lock
  2. That triggers the closing of `lockActive`
  3. The goroutine then calls `cs.Unlock` again, but of course we've just unlocked the lock, so Consul returns an error to us

I think that bit of logic needs to be rethought and a couple things done:

  1. Something like having a second channel and a `select` in that goroutine: if we call `cs.Unlock` ourselves, close that second channel and have the goroutine simply exit
  2. The goroutine shouldn't call `cs.Unlock` if we've lost the lock: we no longer hold it, so telling Consul to release it will just return an error. All we really need to do is remove it from the `cs.locks` map

kula added a commit to kula/caddy-tlsconsul that referenced this issue Oct 2, 2022:

> Work in progress, I think this addresses pteich#27 but it needs a sanity check and there's one edge case in there that needs some more thinking.
rstupek (Author) commented Oct 4, 2022

Do you know if the issue would cause a problem with renewing the certificate within the 90-day renewal period if Consul KV still has the entry?

kula commented Oct 5, 2022

I don't think this will cause a direct problem with renewing a certificate. When Caddy calls `Unlock` to actually unlock, if there had been a problem the error you would have gotten would have been

```go
return errors.Wrapf(err, "unable to unlock %s", cs.prefixKey(key))
```

You don't see that, so Caddy successfully asked Consul to release the lock, and Consul did. That release, however, also triggers the goroutine

```go
// auto-unlock and clean list of locks in case of lost
go func() {
	<-lockActive
	err := cs.Unlock(ctx, key)
	if err != nil {
		cs.logger.Errorf("failed to release lock: %s", err)
	}
}()
```

where it tries to `Unlock` again. This time the lock isn't found, because the first call to `Unlock`, after asking Consul to release the lock, removed the lock info from the internal `cs.locks` list, and that's the error we get back:

caddy-tlsconsul/storage.go, lines 115 to 118 at 3811ba6:

```go
lock, exists := cs.GetLock(key)
if !exists {
	return errors.Errorf("lock %s not found", cs.prefixKey(key))
}
```

which, when returned to

```go
err := cs.Unlock(ctx, key)
if err != nil {
	cs.logger.Errorf("failed to release lock: %s", err)
}
```

gives you `failed to release lock: lock <lockname> not found`, and because `<lockname>` is logged as `cs.prefixKey(key)` you get the full K/V path.

So in fact, I don't think we're actually asking Consul to unlock a second time: by the time `Unlock` is called in that goroutine, Caddy already thinks we don't hold that lock, and that's the error you see above.

As further proof, when you look up that K/V entry you may still see it, but since there's no session field, no current session actually holds a lock on it. Looking at the code, we don't delete the K/V entry; we simply release the lock, and all you'll see on the K/V side is the session field disappearing.

pteich (Owner) commented Oct 5, 2022

As @kula points out, it is more or less an unnecessary double unlock. It should be, and will be, no problem for renewals. We have used this for several years with many hundreds of domains without any issue regarding renewal.
