[Consul Enterprise] ACL replication fails when non-default namespace exists and default namespace has policy default #9271

Closed
lkysow opened this issue Nov 24, 2020 · 2 comments

lkysow (Member) commented Nov 24, 2020

Overview of the Issue

ACL replication to a secondary datacenter fails when a non-default namespace exists and the default namespace has a default policy configured.

Namespaces in the primary datacenter (dc1):

default:
   Description:
      Builtin Default Namespace
   ACLs:
      Default Policies:
         2a54cc38-538a-b212-0723-9a9700f792f4 / cross-ns-policy
test-ns:

Reproduction Steps

Agent config for dc1 (dc1.hcl):

acl {
  enabled = true
  default_policy = "deny"
  down_policy = "extend-cache"
  enable_token_persistence = true
  tokens {
    master = "286a7cec-bb5e-4373-fbf9-00526c5aa559"
    agent = "286a7cec-bb5e-4373-fbf9-00526c5aa559"
  }
}
connect {
  enabled = true
}
primary_datacenter = "dc1"
ui = true
server = true
data_dir = "dc1-datadir"
bootstrap_expect = 1
ports {
  dns = 8600
  http = 8500
  serf_lan = 8301
  serf_wan = 8302
  server = 8300
}

Agent config for dc2 (dc2.hcl):

acl {
  enabled = true
  default_policy = "deny"
  down_policy = "extend-cache"
  enable_token_persistence = true
  enable_token_replication = true
  tokens {
    agent = "286a7cec-bb5e-4373-fbf9-00526c5aa559"
    replication = "286a7cec-bb5e-4373-fbf9-00526c5aa559"
  }
}
connect {
  enabled = true
}
datacenter = "dc2"
primary_datacenter = "dc1"
ui = true
server = true
data_dir = "dc2-datadir"
bootstrap_expect = 1
retry_join_wan = ["127.0.0.1:8302"]
ports {
  dns = 9600
  http = 9500
  serf_lan = 9301
  serf_wan = 9302
  server = 9300
}

Create data dirs

mkdir -p dc1-datadir
mkdir -p dc2-datadir

Start agent 1

consul agent -config-file dc1.hcl

Create policies and tokens

export CONSUL_HTTP_TOKEN=286a7cec-bb5e-4373-fbf9-00526c5aa559
cat <<EOF | consul acl policy create -name "cross-ns-policy" -rules -
namespace_prefix "" {
  service_prefix "" {
    policy = "read"
  }
  node_prefix "" {
    policy = "read"
  }
}
EOF
consul namespace update -name default -default-policy-name=cross-ns-policy
consul namespace create -name test-ns

Start agent 2

consul agent -config-file dc2.hcl

The dc2 logs then repeat these errors indefinitely:

2020-11-24T11:44:02.622-0800 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: Namespace "test-ns" does not exist"
2020-11-24T11:44:11.812-0800 [WARN]  agent.server.replication.namespace: replication error (will retry if still leader): error="failed to apply local Namespace updates: failed validating namespace: No such policy with ID: 2a54cc38-538a-b212-0723-9a9700f792f4"
2020-11-24T11:45:06.640-0800 [WARN]  agent.server.replication.acl.policy: ACL replication error (will retry if still leader): error="failed to update local ACL policies: Failed to apply policy upserts: Namespace "test-ns" does not exist"

The policy ID in the namespace replication error belongs to cross-ns-policy, the default policy of the default namespace.

Consul Versions

  • 1.8.6-enterprise, 1.9.0-enterprise

Permutations

The bug does not occur if a) test-ns does not exist, b) the default namespace has no default policy, or c) test-ns is created after the initial federation is complete.

mkeeler (Member) commented Nov 30, 2020

There is a conflict between the namespace and ACL policy replicators that can cause both to stall indefinitely.

The namespace replicator will fail to replicate a namespace if any namespace in the batch of updates fails validation.

The policy replicator will fail to replicate a policy if any policy in the batch of updates fails validation.
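
The all-or-nothing behaviour described in the two statements above can be sketched with a toy model. This is not Consul's code: `apply_batch`, `namespace_must_exist` and the dict-based store are hypothetical names invented for illustration, and the only assumption is that a batch is committed as a whole or not at all.

```python
class ValidationError(Exception):
    """Raised when one item in a batch fails validation."""


def apply_batch(store, batch, validate):
    """Apply every item in `batch` to `store`, or none of them.

    `store` is a dict keyed by item name and `validate(store, item)` returns
    an error string or None. Both are hypothetical stand-ins, not Consul APIs.
    """
    staged = dict(store)  # work on a copy so a failure leaves `store` untouched
    for item in batch:
        error = validate(staged, item)
        if error is not None:
            raise ValidationError(error)
        staged[item["name"]] = item
    # Every item validated: commit the whole batch at once.
    store.clear()
    store.update(staged)


def namespace_must_exist(known_namespaces):
    """Build a policy validator that requires the policy's namespace to exist."""
    def validate(_store, policy):
        if policy["namespace"] not in known_namespaces:
            return 'Namespace "%s" does not exist' % policy["namespace"]
        return None
    return validate


# The secondary only knows the default namespace, as in the report above.
secondary_policies = {}
batch = [
    {"name": "cross-ns-policy", "namespace": "default"},
    {"name": "namespace-management", "namespace": "test-ns"},
]
try:
    apply_batch(secondary_policies, batch, namespace_must_exist({"default"}))
except ValidationError as err:
    print("batch rejected:", err)
print(secondary_policies)  # still empty: the valid policy was not applied either
```

In this model the valid cross-ns-policy is held back by the one policy whose namespace is missing, which matches the policy replication error in the logs above.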

There are several ways to make modifications in the primary DC that lead to conflicts between the replicators. In the example above, the policy replicator fails to upsert the test-ns namespace's namespace-management policy because that namespace doesn't yet exist in the secondary DC. Within the same batch it also attempts to upsert the newly created cross-ns-policy. Because the one policy insertion fails, both get rolled back.

The namespace replicator, in turn, gets stuck updating the default namespace configuration because it links to a policy that doesn't exist in the secondary DC yet. Because each replicator's error depends on the other making progress, neither can ever proceed.

The only mitigation in the example above would be to roll back the default namespace change that added cross-ns-policy, let replication catch up, and then reapply that change.

The actual code fix would be to relax the validations performed when a secondary DC inserts a namespace due to replication. We should not validate that the linked policies/roles already exist. That will allow the namespace to be inserted, which in turn will allow the policy replication to finish.
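
To make the stall and the proposed fix concrete, here is a minimal simulation of the two replicators. It is only a sketch under stated assumptions: the function names, the dict layout, the `validate_links` switch and the placeholder ID for the namespace-management policy are all hypothetical, and each replicator is modelled as one all-or-nothing batch per round.

```python
CROSS_NS_POLICY_ID = "2a54cc38-538a-b212-0723-9a9700f792f4"  # ID from the report
NS_MGMT_POLICY_ID = "hypothetical-namespace-management-id"   # placeholder, not a real ID


def replicate_policies(primary, secondary):
    """Copy all policies as one batch; fail if any policy's namespace is missing."""
    for policy in primary["policies"].values():
        if policy["namespace"] not in secondary["namespaces"]:
            return 'Namespace "%s" does not exist' % policy["namespace"]
    secondary["policies"] = dict(primary["policies"])
    return None


def replicate_namespaces(primary, secondary, validate_links):
    """Copy all namespaces as one batch; optionally require linked policies to exist."""
    if validate_links:
        for ns in primary["namespaces"].values():
            for policy_id in ns["default_policies"]:
                if policy_id not in secondary["policies"]:
                    return "No such policy with ID: " + policy_id
    secondary["namespaces"] = {name: dict(ns) for name, ns in primary["namespaces"].items()}
    return None


def run_replication(primary, validate_links, max_rounds=5):
    """Return True if both replicators succeed in the same round, else False."""
    secondary = {"namespaces": {"default": {"default_policies": []}}, "policies": {}}
    for _ in range(max_rounds):
        errors = [
            replicate_policies(primary, secondary),
            replicate_namespaces(primary, secondary, validate_links),
        ]
        if errors == [None, None]:
            return True
    return False


primary = {
    "namespaces": {
        "default": {"default_policies": [CROSS_NS_POLICY_ID]},
        "test-ns": {"default_policies": []},
    },
    "policies": {
        CROSS_NS_POLICY_ID: {"name": "cross-ns-policy", "namespace": "default"},
        NS_MGMT_POLICY_ID: {"name": "namespace-management", "namespace": "test-ns"},
    },
}

print(run_replication(primary, validate_links=True))   # False: both replicators stall
print(run_replication(primary, validate_links=False))  # True: namespaces land, then policies
```

With strict link validation neither batch can ever be applied. With the check relaxed, the namespace batch goes through in the first round and the policy batch follows in the second.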

Longer term, it would be nice to move all replication over to a single ordered event stream, so that we have some guarantee of seeing these updates in the order they were performed in the primary, thus eliminating any possible validation issues.
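
As a rough illustration of why ordering would help, the sketch below replays the changes from the reproduction steps as a single stream with strict validation left on. The event shape, `apply_event` and `replay` are hypothetical, as is the placeholder ID, and the position of the namespace-management policy right after the creation of test-ns is an assumption about the primary's order of operations.

```python
CROSS_NS_POLICY_ID = "2a54cc38-538a-b212-0723-9a9700f792f4"  # ID from the report


def apply_event(state, event):
    """Apply one event with strict validation; raise ValueError on a missing dependency."""
    kind = event["kind"]
    if kind == "policy":
        if event["namespace"] not in state["namespaces"]:
            raise ValueError('Namespace "%s" does not exist' % event["namespace"])
        state["policies"][event["id"]] = event["name"]
    elif kind == "namespace":
        for policy_id in event["default_policies"]:
            if policy_id not in state["policies"]:
                raise ValueError("No such policy with ID: " + policy_id)
        state["namespaces"][event["name"]] = list(event["default_policies"])
    else:
        raise ValueError("unknown event kind: " + kind)


def replay(events):
    """Replay events in order onto a fresh secondary that only has `default`."""
    state = {"namespaces": {"default": []}, "policies": {}}
    for event in events:
        apply_event(state, event)
    return state


# Changes in the order they were made in the primary (ordering partly assumed).
events = [
    {"kind": "policy", "id": CROSS_NS_POLICY_ID, "name": "cross-ns-policy",
     "namespace": "default"},
    {"kind": "namespace", "name": "default", "default_policies": [CROSS_NS_POLICY_ID]},
    {"kind": "namespace", "name": "test-ns", "default_policies": []},
    {"kind": "policy", "id": "hypothetical-namespace-management-id",
     "name": "namespace-management", "namespace": "test-ns"},
]

state = replay(events)
print(sorted(state["namespaces"]))  # both namespaces exist
print(len(state["policies"]))       # both policies were applied
```

Because every event's dependencies were created by an earlier event, the strict checks never fail in this model. Replaying the same events in reverse order raises on the first one.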

mkeeler (Member) commented Dec 8, 2020

This will be fixed in 1.9.1, 1.8.8, and 1.7.11.

mkeeler closed this as completed Dec 8, 2020