[BUG]: sonic_vxlan flaps all VLAN/L2VNI mappings if vrf_map is specified and state: overridden is used #376

Open
toreanderson opened this issue May 5, 2024 · 2 comments

toreanderson commented May 5, 2024

Bug Description

When the sonic_vxlans module is used with state: overridden and a config dict containing both vlan_map and vrf_map, it flaps (deletes and re-creates) all VLAN to L2VNI mappings, even though these were correctly configured to begin with. This causes a severe service disruption.

The flapping goes away if state: replaced or state: merged is used instead; however, in these cases the module still falsely reports that changes are required. I suspect that the two issues have identical (or at least related) root causes, so I describe both in the same bug report.

This does not happen if the config dict does not contain vrf_map.

Product Name

SONiC-OS-4.2.0-Enterprise_Base

Component or Module Name

sonic_vxlans

DellEMC Enterprise SONiC Ansible Collection Version

dellemc.enterprise_sonic 2.4.0

SONiC Software Version

4.2.0-Enterprise_Base

Configuration

CONFIG_FILE() = /home/debian/ansible/ansible.cfg
DEFAULT_HASH_BEHAVIOUR(/home/debian/ansible/ansible.cfg) = merge
DEFAULT_HOST_LIST(/home/debian/ansible/ansible.cfg) = ['/home/debian/ansible/hosts.yml']
DEFAULT_JINJA2_EXTENSIONS(/home/debian/ansible/ansible.cfg) = jinja2.ext.do
HOST_KEY_CHECKING(/home/debian/ansible/ansible.cfg) = False
INTERPRETER_PYTHON(/home/debian/ansible/ansible.cfg) = auto_silent
MAX_FILE_SIZE_FOR_DIFF(/home/debian/ansible/ansible.cfg) = 1048576
PERSISTENT_COMMAND_TIMEOUT(/home/debian/ansible/ansible.cfg) = 3000

Steps to Reproduce

  1. Start out with a playbook containing the following test case:
---
- hosts: sonic2
  gather_facts: false
  tasks:
    - name: Create VLANs 10 and 20
      dellemc.enterprise_sonic.sonic_vlans:
        config:
          - vlan_id: 10
          - vlan_id: 20
    - name: Create VRF twenty
      dellemc.enterprise_sonic.sonic_vrfs:
        config:
          - name: Vrf_twenty
            members:
              interfaces:
                - name: Vlan20
    - name: Map VLAN 10 to L2VNI 10
      loop: [0,1,2]
      dellemc.enterprise_sonic.sonic_vxlans:
        state: overridden
        config:
          - name: vtep1
            evpn_nvo: nvo1
            source_ip: 192.0.2.1
            vlan_map:
              - vni: 10
                vlan: 10
    - name: Additionally map Vrf_twenty to L3VNI 2020
      loop: [0,1,2]
      dellemc.enterprise_sonic.sonic_vxlans:
        state: overridden
        config:
          - name: vtep1
            evpn_nvo: nvo1
            source_ip: 192.0.2.1
            vlan_map:
              - vni: 10
                vlan: 10
            vrf_map:
              - vni: 2020
                vrf: Vrf_twenty
  2. Run the playbook against an unconfigured switch.

Expected Behavior

  1. Only the first iteration of the Map VLAN 10 to L2VNI 10 task should return changed:; the subsequent ones should be idempotent and return ok:.
  2. Only the first iteration of the Additionally map Vrf_twenty to L3VNI 2020 task should return changed:; the subsequent ones should be idempotent and return ok:.
  3. No iteration of the Additionally map Vrf_twenty to L3VNI 2020 task should cause any change in state to the VLAN 10/L2VNI 10 mapping, as this part of the config: dict is unchanged from the preceding Map VLAN 10 to L2VNI 10 task.

Actual Behavior

  1. Only the first iteration of the Map VLAN 10 to L2VNI 10 task returns changed:, and the subsequent ones are idempotent and return ok:. This is as expected, and shows that the bug depends on the presence of vrf_map.
  2. All three iterations of the Additionally map Vrf_twenty to L3VNI 2020 task report changed:. This is unexpected, as the config: dict used does not change between the iterations. This also happens if the task is changed to use state: replaced or state: merged.
  3. All three iterations of the Additionally map Vrf_twenty to L3VNI 2020 task result in the deletion and re-addition of the VLAN 10 to L2VNI 10 mapping. This is unexpected, as this part of the config: dict does not change from the Map VLAN 10 to L2VNI 10 task (or between individual iterations of the Additionally map Vrf_twenty to L3VNI 2020 task, for that matter). This caused a critical outage in our production network.

For what it is worth, the resulting configuration at the end of the playbook run appears to be correct:

$ sonic-cli -c 'show running-configuration interface vxlan'
!
interface vxlan vtep1
 source-ip 192.0.2.1
 qos-mode pipe dscp 0
 map vni 10 vlan 10
 map vni 2020 vrf Vrf_twenty

Logs

This is the console log from running the playbook:

debian@debian:~/ansible$ ansible-playbook -vD vxlan.yml
Using /home/debian/ansible/ansible.cfg as config file

PLAY [sonic2] ******************************************************************************************************

TASK [Create VLANs 10 and 20] **************************************************************************************
*** before
--- after
***************
*** 1 ****
! []
--- 1,10 ----
! [
!     {
!         'description': null,
!         'vlan_id': 10
!     },
!     {
!         'description': null,
!         'vlan_id': 20
!     }
! ]
changed: [sonic2] => {"after": [{"description": null, "vlan_id": 10}, {"description": null, "vlan_id": 20}], "before": [], "changed": true, "commands": [{"state": "merged", "vlan_id": 10}, {"state": "merged", "vlan_id": 20}]}

TASK [Create VRF twenty] *******************************************************************************************
*** before
--- after
***************
*** 1,4 ****
--- 1,14 ----
  [
+     {
+         'members': {
+             'interfaces': [
+                 {
+                     'name': 'Vlan20'
+                 }
+             ]
+         },
+         'name': 'Vrf_twenty'
+     },
      {
          'members': null,
          'name': 'mgmt'
changed: [sonic2] => {"after": [{"members": {"interfaces": [{"name": "Vlan20"}]}, "name": "Vrf_twenty"}, {"members": null, "name": "mgmt"}], "before": [{"members": null, "name": "mgmt"}], "changed": true, "commands": [{"members": {"interfaces": [{"name": "Vlan20"}]}, "name": "Vrf_twenty", "state": "merged"}]}

TASK [Map VLAN 10 to L2VNI 10] *************************************************************************************
changed: [sonic2] => (item=0) => {"after": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "ansible_loop_var": "item", "before": [], "changed": true, "commands": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "state": "overridden", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "item": 0}
ok: [sonic2] => (item=1) => {"ansible_loop_var": "item", "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": false, "commands": [], "item": 1}
ok: [sonic2] => (item=2) => {"ansible_loop_var": "item", "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": false, "commands": [], "item": 2}

TASK [Additionally map Vrf_twenty to L3VNI 2020] *******************************************************************
changed: [sonic2] => (item=0) => {"after": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "ansible_loop_var": "item", "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": true, "commands": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "state": "overridden", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": [{"vni": 2020, "vrf": "Vrf_twenty"}]}], "item": 0}
changed: [sonic2] => (item=1) => {"after": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "ansible_loop_var": "item", "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": true, "commands": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "state": "overridden", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": [{"vni": 2020, "vrf": "Vrf_twenty"}]}], "item": 1}
changed: [sonic2] => (item=2) => {"after": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "ansible_loop_var": "item", "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": true, "commands": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "state": "overridden", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": [{"vni": 2020, "vrf": "Vrf_twenty"}]}], "item": 2}

PLAY RECAP *********************************************************************************************************
sonic2                     : ok=4    changed=4    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
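
The log above also shows that, in every iteration of the Additionally map Vrf_twenty to L3VNI 2020 task, the before and after facts are identical and both contain "vrf_map": null, even though changed is true and the running configuration does contain the VRF mapping. This might indicate that the gathered facts do not include vrf_map, but that is an inference from the output, not a confirmed root cause. A minimal Python sketch of the check, using one result from the log trimmed to the relevant keys (the helper name is hypothetical):

```python
import json

# One iteration's result from the console log, trimmed to the relevant keys.
RESULT_JSON = '''
{"after":  [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null,
             "source_ip": "192.0.2.1",
             "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}],
 "before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null,
             "source_ip": "192.0.2.1",
             "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}],
 "changed": true}
'''


def changed_without_fact_diff(result):
    """Return True if the task reports changed although before == after."""
    return bool(result.get("changed")) and result.get("before") == result.get("after")


result = json.loads(RESULT_JSON)
print(changed_without_fact_diff(result))  # prints True
```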

A single run of the Additionally map Vrf_twenty to L3VNI 2020 task yields the following relevant output in /var/log/ramfs/in-memory-syslog-info.log. Of particular interest are the DELETE calls:

INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-72] User "admin@10.10.10.1:45032" request "GET /restconf/data/sonic-vxlan:sonic-vxlan" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-73] User "admin@10.10.10.1:45046" request "GET /restconf/data/sonic-vxlan:sonic-vxlan/EVPN_NVO/EVPN_NVO_LIST" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-74] User "admin@10.10.10.1:45062" request "GET /restconf/data/sonic-vrf:sonic-vrf/VRF/VRF_LIST" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-75] User "admin@10.10.10.1:45068" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL_MAP/VXLAN_TUNNEL_MAP_LIST=vtep1,map_10_Vlan10" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-76] User "admin@10.10.10.1:45070" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL/VXLAN_TUNNEL_LIST=vtep1/src_ip" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-77] User "admin@10.10.10.1:45076" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/EVPN_NVO/EVPN_NVO_LIST=nvo1" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-78] User "admin@10.10.10.1:38586" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL/VXLAN_TUNNEL_LIST=vtep1" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-79] User "admin@10.10.10.1:38590" request "PATCH /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-80] User "admin@10.10.10.1:38600" request "PATCH /restconf/data/sonic-vxlan:sonic-vxlan/EVPN_NVO/EVPN_NVO_LIST" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-81] User "admin@10.10.10.1:38606" request "PATCH /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL_MAP" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-82] User "admin@10.10.10.1:38618" request "PATCH /restconf/data/sonic-vrf:sonic-vrf/VRF/VRF_LIST=Vrf_twenty/vni" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-83] User "admin@10.10.10.1:38630" request "GET /restconf/data/sonic-vxlan:sonic-vxlan" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-84] User "admin@10.10.10.1:38646" request "GET /restconf/data/sonic-vxlan:sonic-vxlan/EVPN_NVO/EVPN_NVO_LIST" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-85] User "admin@10.10.10.1:38654" request "GET /restconf/data/sonic-vrf:sonic-vrf/VRF/VRF_LIST" status - 200
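
To pull the destructive calls out of a longer log, the request lines can be filtered by HTTP method. The following is a minimal Python sketch that assumes the line format shown above; the regular expression and the helper name are illustrative and not part of any SONiC tooling.

```python
import re

# A subset of the REST server log lines above, copied verbatim.
LOG = '''\
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-74] User "admin@10.10.10.1:45062" request "GET /restconf/data/sonic-vrf:sonic-vrf/VRF/VRF_LIST" status - 200
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-75] User "admin@10.10.10.1:45068" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL_MAP/VXLAN_TUNNEL_MAP_LIST=vtep1,map_10_Vlan10" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-76] User "admin@10.10.10.1:45070" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL/VXLAN_TUNNEL_LIST=vtep1/src_ip" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-77] User "admin@10.10.10.1:45076" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/EVPN_NVO/EVPN_NVO_LIST=nvo1" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-78] User "admin@10.10.10.1:38586" request "DELETE /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL/VXLAN_TUNNEL_LIST=vtep1" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-79] User "admin@10.10.10.1:38590" request "PATCH /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL" status - 204
INFO mgmt-framework#/usr/sbin/rest_server[34]: [REST-81] User "admin@10.10.10.1:38606" request "PATCH /restconf/data/sonic-vxlan:sonic-vxlan/VXLAN_TUNNEL_MAP" status - 204
'''

# Matches: request "<METHOD> <path>" status - <code>
REQUEST_RE = re.compile(
    r'request "(?P<method>[A-Z]+) (?P<path>[^"\s]+)" status - (?P<status>\d+)'
)


def paths_for_method(log_text, method):
    """Return the request paths of all log entries using the given HTTP method."""
    return [
        m.group("path")
        for m in REQUEST_RE.finditer(log_text)
        if m.group("method") == method
    ]


deleted = paths_for_method(LOG, "DELETE")
for path in deleted:
    print(path)
```

On this excerpt the sketch lists four DELETE paths: the VLAN 10 tunnel map entry, the tunnel source IP, the EVPN NVO, and the tunnel itself.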

During the above run, the following was logged by a running ip monitor link session:

149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 qdisc noqueue master Bridge state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
Deleted 149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
76: Bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue state UP group default event FEATURE CHANGE 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
Deleted 149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 qdisc noop state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master Bridge state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master Bridge state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
76: Bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue state UP group default event FEATURE CHANGE 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 master Bridge state DOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 qdisc noop master Bridge state DOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue master Bridge state UNKNOWN group default 
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 master Bridge state UNKNOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 master Bridge state UNKNOWN 
    link/ether 0c:eb:33:95:00:49
150: vtep1-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 master Bridge state UNKNOWN 
    link/ether 0c:eb:33:95:00:49
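
In the output above, vtep1-10 starts with interface index 149, is reported as Deleted, and reappears with index 150. Since the kernel assigns an interface index when a device is created, this shows the device was actually removed and re-created. A minimal Python sketch that extracts this from such output, assuming the line format shown above (the regular expression and function name are illustrative):

```python
import re

# A few lines from the "ip monitor link" output above.
MONITOR_OUTPUT = '''\
149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 qdisc noqueue master Bridge state DOWN group default
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
Deleted 149: vtep1-10: <BROADCAST,MULTICAST> mtu 9100 master Bridge state DOWN
    link/ether 0c:eb:33:95:00:49
76: Bridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue state UP group default event FEATURE CHANGE
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
150: vtep1-10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9100 qdisc noqueue master Bridge state UNKNOWN group default
    link/ether 0c:eb:33:95:00:49 brd ff:ff:ff:ff:ff:ff
'''

# Matches "[Deleted ]<ifindex>: <name>:" at the start of a line.
LINK_RE = re.compile(r"^(?P<deleted>Deleted )?(?P<ifindex>\d+): (?P<name>[^:]+):")


def ifindex_history(text, name):
    """Return the distinct interface indexes seen for `name`, in order of appearance."""
    history = []
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m and m.group("name") == name:
            ifindex = int(m.group("ifindex"))
            if not history or history[-1] != ifindex:
                history.append(ifindex)
    return history


print(ifindex_history(MONITOR_OUTPUT, "vtep1-10"))  # prints [149, 150]
```

More than one index in the result means the interface was re-created during the monitored period.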

Screenshots

No response

Additional Information

Identical behaviour is observed with dellemc.enterprise_sonic 2.2.0.

@toreanderson
Author

I just noticed another bug which seems part of this overall issue.

If I start out with the configuration created by the test playbook described above, that is:

interface vxlan vtep1
 source-ip 192.0.2.1
 qos-mode pipe dscp 0
 map vni 10 vlan 10
 map vni 2020 vrf Vrf_twenty

And then apply only the Map VLAN 10 to L2VNI 10 task with either state: overridden or state: replaced, the result is:

TASK [Map VLAN 10 to L2VNI 10] *************************************************************************************
ok: [sonic2] => {"before": [{"evpn_nvo": "nvo1", "name": "vtep1", "primary_ip": null, "source_ip": "192.0.2.1", "vlan_map": [{"vlan": 10, "vni": 10}], "vrf_map": null}], "changed": false, "commands": []}

The expected behaviour here is the removal of the map vni 2020 vrf Vrf_twenty mapping (as this mapping does not appear in the config: dict passed to this task), but this does not happen at all: it is left intact. Instead, the task behaves how I would have expected it to behave had state: merged been specified.
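
The expectation described here can be stated as a small set difference: with state: overridden or state: replaced, any mapping present on the device but absent from the config: dict should be removed, and any mapping present in both should be left untouched. The following Python sketch shows only that expected semantics. It is not the module's implementation, and the function name is hypothetical.

```python
def mappings_to_remove(have, want):
    """Return the mappings configured on the device but absent from the task config."""
    return [mapping for mapping in have if mapping not in want]


# Device state from the running configuration shown above.
have_vlan_map = [{"vni": 10, "vlan": 10}]
have_vrf_map = [{"vni": 2020, "vrf": "Vrf_twenty"}]

# The "Map VLAN 10 to L2VNI 10" task requests one VLAN mapping and no vrf_map.
want_vlan_map = [{"vni": 10, "vlan": 10}]
want_vrf_map = []

# The VLAN mapping is in both, so nothing should be deleted (no flap).
print(mappings_to_remove(have_vlan_map, want_vlan_map))  # prints []

# The VRF mapping is only on the device, so it should be removed.
print(mappings_to_remove(have_vrf_map, want_vrf_map))  # prints [{'vni': 2020, 'vrf': 'Vrf_twenty'}]
```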

@kerry-meyer
Collaborator

Thank you for letting us know about this problem.

We are looking into it and will get back to you with the result of our analysis and our plan for addressing the problem.
