authors	state	discussion
Robert Mustacchi <rm@joyent.com>	draft	https://github.com/joyent/rfd/issues?q="RFD+97"

RFD 97 Project Hookshot - Improved VLAN Handling

VLANs are a predominant and commonly used aspect of networking. However, the illumos networking stack often does not behave as well when VLANs are in use. Many of the behaviors and enhancements in the networking stack do not apply to VLAN tagged traffic. Yet we find that most of our customers' deployments, and even our own, are almost exclusively filled with networks that leverage VLANs. In part, this is due to the ease of using VLANs in Triton through NAPI and the simplicity of dladm create-vnic -v <vlan>.

VLAN Refresher

Before we delve too deeply, it's worth going over a brief refresher of what a VLAN is and why it's used. An Ethernet packet has a field called an Ethertype. This Ethertype is a 16-bit identifier which indicates the type of payload that the packet carries. For example, 0x0800 indicates an IPv4 packet, 0x0806 indicates an ARP packet, and 0x86dd indicates an IPv6 packet. The following is the breakdown of a typical packet.

+-----------------+-----------+------------+------+
| Ethernet Header | IP Header | TCP Header | Data |
+-----------------+-----------+------------+------+

If we break down the Ethernet header, it then looks like:

+-------------+---------------+-----------+------+
| 6-byte Dest | 6-byte Source | 2-byte    | Data |
| MAC Address | MAC Address   | Ethertype |      |
+-------------+---------------+-----------+------+

VLAN tagged packets are indicated with an Ethertype of 0x8100. When a VLAN tagged packet is encountered, it indicates that the packet uses a modified Ethernet header. Following the Ethertype indicating a VLAN, there are 16 bits used to encode a VLAN ID and priority, followed by 16 bits used to encode the Ethertype of the encapsulated packet, which can be any normal Ethertype. The VLAN Ethernet header instead looks like:

+-------------+---------------+-----------+--------+-----------+------+
| 6-byte Dest | 6-byte Source | 2-byte    | 2-byte | 2-byte    | Data |
| MAC Address | MAC Address   | Ethertype | VLAN   | Ethertype |      |
|             |               | 0x8100    | id/pri |           |      |
+-------------+---------------+-----------+--------+-----------+------+

Packets that have this VLAN identifier are called tagged packets. This tagging is standardized in IEEE 802.1Q.
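To make the header layout above concrete, here is a minimal sketch of decoding an Ethernet header that may carry a single 802.1Q tag. The names parse_ether, parsed_ether_t, and VLAN_TPID are hypothetical and are not taken from illumos. The split of the 16-bit tag field into a 3-bit priority (the top bits) and a 12-bit VLAN ID (the bottom bits) follows 802.1Q.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VLAN_TPID 0x8100 /* Ethertype that announces a VLAN tag */

/* Hypothetical result structure for this sketch. */
typedef struct parsed_ether {
    bool     pe_tagged;    /* true if an 802.1Q tag was present */
    uint16_t pe_vid;       /* 12-bit VLAN ID, 0 if untagged */
    uint8_t  pe_pri;       /* 3-bit priority, 0 if untagged */
    uint16_t pe_ethertype; /* Ethertype of the encapsulated packet */
    size_t   pe_hdrlen;    /* 14 bytes untagged, 18 bytes tagged */
} parsed_ether_t;

/* Read a 16-bit big-endian (network order) value. */
static uint16_t
read_be16(const uint8_t *p)
{
    return ((uint16_t)((p[0] << 8) | p[1]));
}

/* Returns false if the frame is too short to hold its header. */
bool
parse_ether(const uint8_t *frame, size_t len, parsed_ether_t *out)
{
    uint16_t type, tci;

    assert(frame != NULL && out != NULL);
    if (len < 14)
        return (false);

    /* The Ethertype follows the two 6-byte MAC addresses. */
    type = read_be16(frame + 12);
    if (type != VLAN_TPID) {
        out->pe_tagged = false;
        out->pe_vid = 0;
        out->pe_pri = 0;
        out->pe_ethertype = type;
        out->pe_hdrlen = 14;
        return (true);
    }

    if (len < 18)
        return (false);
    tci = read_be16(frame + 14);
    out->pe_tagged = true;
    out->pe_vid = tci & 0x0fff;
    out->pe_pri = (uint8_t)(tci >> 13);
    out->pe_ethertype = read_be16(frame + 16);
    out->pe_hdrlen = 18;
    return (true);
}
```

A caller would hand this the start of a received frame and then use pe_hdrlen to find the start of the IP header.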

Each VLAN identifier represents a unique segment. These can be used to create independent, virtual networks that all share the same physical network. Broadcast traffic on one VLAN does not reach another VLAN. This also means that IP addresses can be reused in different VLANs (though care still needs to be taken to ensure that there isn't overlap if Layer 3 routing is involved).

VLANs can interact with the OS (or switch or other network device) in two different ways:

The host OS can be aware of VLANs and explicitly add VLAN tags to outgoing packets and take them into account when receiving packets.
The host OS can be unaware of VLANs and a switch can add VLAN tags to packets sent by the OS and remove them from packets being delivered to the OS.

Many switches have rules that say only specific VLANs are allowed to go in and out of specific ports. If the switch encounters a packet with a tag that is not in the list, then it will drop it. A switch may have additional rules that state a packet that it encounters coming in on a given port should be rewritten to include a VLAN tag before sending it out or that when a specific VLAN is encountered, it should remove the tag before delivering it to the port.

In Triton, both modes are often used. The admin network in Triton is traditionally an untagged or access-mode network. Here the OS sends and receives packets without a VLAN tag; however, the switch is transparently adding and removing the tags as packets traverse the network.

Pretty much every other network in Triton is generally a VLAN tagged network. This means that all packets generated by the network must have the correct VLAN tag. The VLAN tag is specified in a NAPI network and is stored in the VM payload. When a VNIC is created, the -v option is used to specify a VLAN tag that indicates what VLAN packets should be tagged with and received on.

KVM

All of our KVM guests today operate in a mode similar to the access mode described above. While the underlying VNIC that a KVM guest can use may be tagged with a VLAN and thus its packets rewritten by the illumos networking stack, the guest cannot see any VLANs. In addition, if it tries to send traffic tagged with a VLAN, it'll end up being dropped.

This facade is maintained by the vnd driver. A KVM zone uses the same VNIC as any other zone, which may be optionally tagged with a VLAN tag.

Missing and Desired Functionality

This section goes over the functionality that exists in the stack but cannot target VLANs, as well as some of the additional follow-on functionality that we would like to implement, but have not.

Hardware Ring Polling on VLANs

The MAC driver (aka GLDv3) allows polling for incoming packets on hardware rings that are fully classified. Polling on a ring is important because of the current design of MAC. MAC has an inherent high watermark for incoming packets. If the number of outstanding packets on a ring exceeds that watermark, MAC will transition the ring to polling mode. However, if it cannot transition to polling mode, then it will drop packets.

This high watermark is intended to make sure that the system does not end up running out of memory due to filling it with packets faster than the system can process them. Although TCP tolerates occasional packet loss or reordering, these events often induce retransmits and associated latency, so it's preferable to avoid unnecessary drops or reordering.

This ends up creating an artificial limit on the performance of the system, as we end up dealing with dropped packets, which leads to retransmits, etc. While devices that have VLANs associated with them may have a group assigned to them (granting one or more rings), which helps by providing dedicated resources, MAC does not enable polling on them.

MAC refuses to enable polling on rings with VLANs associated with them because it cannot guarantee that the ring is fully filtered. However, hardware does have the ability to filter on VLANs and filter them to specific rings. We should explore this functionality and add support for it to a number of drivers in the GLDv3.
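The interaction between the high watermark, polling, and classification described in this section can be modeled roughly as follows. This is a deliberately simplified sketch and not MAC's actual implementation; rx_ring_t, rx_ring_deliver, and every field name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum rx_result {
    RX_QUEUED,  /* packet accepted onto the software queue */
    RX_POLLING, /* ring switched to polling; packet waits in hardware */
    RX_DROPPED  /* over the watermark and unable to poll */
} rx_result_t;

/* Hypothetical per-ring state for this sketch. */
typedef struct rx_ring {
    size_t rr_outstanding;      /* packets queued but not yet processed */
    size_t rr_hiwat;            /* high watermark for this ring */
    bool   rr_fully_classified; /* hardware filters cover all its traffic */
    bool   rr_polling;          /* ring is currently in polling mode */
} rx_ring_t;

/* Decide what happens to one packet arriving on the ring. */
rx_result_t
rx_ring_deliver(rx_ring_t *r)
{
    assert(r != NULL);

    /* Below the watermark, the packet is simply queued. */
    if (r->rr_outstanding < r->rr_hiwat) {
        r->rr_outstanding++;
        return (RX_QUEUED);
    }

    /*
     * At the watermark, only a fully classified ring may move to
     * polling mode. A ring with VLANs is not treated as fully
     * classified today, so it falls through to the drop case.
     */
    if (r->rr_fully_classified) {
        r->rr_polling = true;
        return (RX_POLLING);
    }
    return (RX_DROPPED);
}
```

In this model, teaching the hardware to filter on VLANs amounts to letting rr_fully_classified be true for a VLAN-tagged ring, which converts the drop case into the polling case.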

Many of the devices supported by the GLDv3 support this functionality. A summary of their capabilities is provided later in this document in the section Modern Hardware Capabilities.

MAC DLS Bypass and Software Rings

Another useful feature of MAC is called DLS Bypass. DLS bypass basically allows all of the processing that takes place in the DLS kernel module to be skipped and have frames sent directly to ip_input(). This is generally only enabled for IPv4 TCP and UDP rings when VLANs are not on the scene. This just adds another unnecessary cost for using VLANs.

DLS bypass is enabled as part of IP enabling the poll capability on the data link that it has. This poll capability is only noted by the kernel for a device if it has a fully classified ring. When the ip kernel module enables polling, DLS bypass is enabled through the following stack:

mac`mac_soft_ring_dls_bypass()
mac`mac_client_poll_enable+0x56(ffffff03d380f038)
dld`dld_capab_poll_enable+0x9b(ffffff03d95f4e48, ffffff000f352a00)
dld`dld_capab_poll+0x45(ffffff03d95f4e48, ffffff000f352a00, 1)
dld`dld_capab+0xa0(ffffff03d95f4e48, 2, ffffff000f352a00, 1)
ip`ill_capability_poll_enable+0x75(ffffff03d6613aa8)
ip`ill_capability_dld_enable+0x68(ffffff03d6613aa8)
ip`ill_capability_dld_ack+0xed(ffffff03d6613aa8, ffffff03d0bf0280, ffffff03d0c42260)
ip`ill_capability_dispatch+0xb5(ffffff03d6613aa8, ffffff03d0bf0280, ffffff03d0c42260)
ip`ill_capability_ack_thr+0xda(ffffff03d0bf0280)
taskq_d_thread+0xb7(ffffff03cff629d8)
thread_start+8()

One complication with this is the fact that we may need to be able to listen on all VLANs on a given MAC. The problem is that someone can bind to the VLAN Ethertype, at which point it needs to be able to manage and handle receiving everything, unlike the filtering.

However, this is a rather uncommon operation. Therefore, it's not unreasonable to try and do something such that if this happens we end up going to a slow path. It may also be possible that we can rig this up with most drivers such that it'll still end up being accepted in the general ring rather than the specific one. So, we'll need to figure that out, but it shouldn't be too bad. One way that we may end up being able to do this is to basically force the underlying data link to be promiscuous so that way we don't have to interfere with the normal data path and treat this as a variant to someone running snoop or other operations.

Software Rings

Software rings in MAC are a means of fanning out the different physical rings into different buckets and queues that can be processed in parallel in the system. Today we bucket things into three categories:

TCP / IPv4
UDP / IPv4
Everything else

In general, we should look, especially in the context of the above logic, at whether we can always decode the first level of VLAN tag and share the existing TCP/UDP buckets. These buckets are tied directly into what the clients are polling on in the above section. This ring classification is also a part of the DLS bypass operation.
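As a rough illustration of decoding the first level of VLAN tag before bucketing, consider the sketch below. It is not MAC's classifier; sring_classify and the bucket names are hypothetical. With decode_vlan set to false it mimics the current behavior, where a tagged TCP packet falls into the "everything else" bucket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three categories listed above. */
typedef enum sring_bucket {
    SRING_TCP4,
    SRING_UDP4,
    SRING_OTHER
} sring_bucket_t;

/* Read a 16-bit big-endian (network order) value. */
static uint16_t
be16_at(const uint8_t *p)
{
    return ((uint16_t)((p[0] << 8) | p[1]));
}

sring_bucket_t
sring_classify(const uint8_t *frame, size_t len, bool decode_vlan)
{
    size_t l3off = 14; /* untagged Ethernet header length */
    uint16_t type;

    assert(frame != NULL);
    if (len < l3off)
        return (SRING_OTHER);

    type = be16_at(frame + 12);
    if (type == 0x8100 && decode_vlan) {
        /* Skip the 4-byte tag and use the inner Ethertype. */
        l3off = 18;
        if (len < l3off)
            return (SRING_OTHER);
        type = be16_at(frame + 16);
    }

    /* Only IPv4 with at least a minimal 20-byte header qualifies. */
    if (type != 0x0800 || len < l3off + 20)
        return (SRING_OTHER);

    /* The IPv4 protocol field is at byte 9 of the IP header. */
    switch (frame[l3off + 9]) {
    case 6:
        return (SRING_TCP4);
    case 17:
        return (SRING_UDP4);
    default:
        return (SRING_OTHER);
    }
}
```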

All in all, both of these should be important and useful. This really ties into GLDv3 polling on rings and how we end up really driving packets through the system.

Explicit VLAN anti-spoofing rules

One thing that might be good to have on the sending side is an explicit set of VLAN anti-spoofing rules and the ability to potentially allow additional VLANs that one can send from. This may want to be phrased as specific VLANs or as MAC, VLAN tuples. While there is some amount of logic for checking this in MAC today, it isn't quite as strong as the other anti-spoofing options that we have today.

This implicitly exists today. Focusing on it may end up making it a bit better for cases where we're trying to virtualize systems and we want to minimize the number of devices we're virtualizing, while maximizing the different traffic they can get on a single device. In essence, this could be similar to the secondary-macs property that we have or look at even combining the two.

Modern Hardware Capabilities

There are a couple of different capabilities that we care about regarding VLANs from hardware. Today most hardware offers:

Filtering traffic to a specific ring
Stripping the VLAN tag out of received packets
Inserting a VLAN tag into transmitted packets

Based on a brief survey, we know that newer Intel NICs support all of these behaviors in the 10+ Gb form factors. Among the devices handled by the igb and e1000g drivers, only the I350 supports the VLAN filters in this mode.

These filters often operate in one of two different ways to determine whether or not a given packet should be sent to that specific group:

They pass one filter and then another filter
They match a specific tuple

Consider hardware like that supported by the ixgbe driver. It has a separate set of filters for MAC addresses and then a separate set of filters for VLANs. For a packet to enter a given ring, it must match any of the MAC filters and then must match any of the VLAN filters. This could be described roughly in the following pseudocode:

for each packet p:
    for each MAC filter m:
        if m matches p's mac:
            for each VLAN filter v:
                if v matches p's vlan:
                    accept p for group

Basically each different kind of filter is treated as an AND and then it's an OR inside of a filter.

Importantly this is distinct from the tuple-matching kinds of filters. With these, your first filter is basically matching a specific tuple. So say the filter provided a tuple based on (mac, vlan), this would look like:

for each packet p:
    for each filter f:
        if f.mac matches p's mac and f.vlan matches p's vlan:
            accept p for group

This has ramifications for when we design kernel APIs. We'll need to be very clear when we're ORing together and when we're ANDing things together. It's important to be aware of these diverse types as we evaluate hardware.
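To see why the distinction matters, the two pieces of pseudocode above can be rendered in C as follows. This is only a sketch; the type and function names are hypothetical. With MAC filters {A, B} and VLAN filters {10, 20}, the separate-filter model accepts a packet for (A, 20), whereas tuple filters {(A, 10), (B, 20)} do not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for this sketch. */
typedef struct ether_addr6 {
    uint8_t ea[6];
} ether_addr6_t;

typedef struct mv_pair {
    ether_addr6_t mv_mac;
    uint16_t      mv_vlan;
} mv_pair_t;

/*
 * Separate filters: the packet must match any MAC filter AND any
 * VLAN filter. Within each kind of filter the entries are ORed.
 */
bool
separate_filters_accept(const ether_addr6_t *macs, size_t nmacs,
    const uint16_t *vlans, size_t nvlans, const mv_pair_t *pkt)
{
    bool mac_ok = false, vlan_ok = false;

    assert(pkt != NULL);
    for (size_t i = 0; i < nmacs; i++) {
        if (memcmp(macs[i].ea, pkt->mv_mac.ea, 6) == 0)
            mac_ok = true;
    }
    for (size_t i = 0; i < nvlans; i++) {
        if (vlans[i] == pkt->mv_vlan)
            vlan_ok = true;
    }
    return (mac_ok && vlan_ok);
}

/* Tuple filters: one entry must match both the MAC and the VLAN. */
bool
tuple_filters_accept(const mv_pair_t *filters, size_t nfilters,
    const mv_pair_t *pkt)
{
    assert(pkt != NULL);
    for (size_t i = 0; i < nfilters; i++) {
        if (memcmp(filters[i].mv_mac.ea, pkt->mv_mac.ea, 6) == 0 &&
            filters[i].mv_vlan == pkt->mv_vlan)
            return (true);
    }
    return (false);
}
```

The separate-filter model can therefore admit combinations that were never explicitly requested, which is exactly the kind of behavior a kernel API needs to spell out.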

The following table summarizes the hardware strategies that we need to consider:

Driver	Ring Filtering	Filter Type	Tagging	Stripping	Notes
Broadcom NetXtreme-C/-E	yes+	Tuple+	yes+	yes+	bnxt driver on Linux and FreeBSD
bnxe	yes	yes*	yes*	yes*	-
cxgbe	yes	Unknown	yes*	yes*	-
ixgbe	yes	Separate filters	yes, per packet	yes, per queue	also has global filters
igb/e1000g	no	N/A	yes, per packet	yes, per port	Filtering support is available on the I350, but not other models
i40e	yes	Both	yes, per packet	yes, per group (VSI)	-
QLogic 45000 series	yes	Both	yes, per packet	yes, per group	qede driver on Linux
Mellanox Connect-X4	yes	Unknown	yes*	yes*	Likely applies to other gen products

A '*' character indicates that we do not know for certain from the documentation what the granularity of support is. A '+' character indicates that we believe there is support based on information from other device drivers; however, we do not have documentation to confirm that.

While we're thinking about this on a purely per-VLAN basis, we also want to think about this in terms of tuples that we may want to apply for a given filter.

New Interfaces

GLDv3 Interface Changes

I'd like to propose a series of new additions to the GLDv3 interfaces for rings and groups. These proposals are variants of what exist today, designed to take into account some of the aspects of what we've talked about already. To facilitate this, manual pages that describe these new interfaces have been written up and existing manual pages have been modified.

In the next subsection we'll go through all of the new and modified manual pages. Afterwards, we'll go through and highlight key differences between what exists in the code base today for rings and groups and what we're proposing here.

The intent of this phase isn't to immediately jump to stabilization, but to get something that makes it easier for us to start working with IHVs and updating drivers ourselves to take care of these new interfaces, with the understanding that as this evolves, things may change.

Note, a number of things will be called out as existing but not documented. This is because they may not provide value at this time or we may want to consider how we define them more.

Manual Pages

The core of the documentation can be found in the mac_capab_rings(9E) manual page. However, all of the new and modified manual pages are important.

New Manual pages:

mac_capab_rings(9E)
mac_filter(9E) - This defines all of the filter entry points.
mgi_start(9E) and mgi_stop(9E)
mi_enable(9E) and mi_disable(9E)
mr_gget(9E)
mr_rget(9E)
mri_poll(9E)
mri_start(9E)
mri_stat(9E)
mac_group_info(9S)
mac_intr(9S)
mac_ring_info(9S)

Manual pages with new functions added to them:

mri_tx was added to mc_tx(9E)
mac_rx_ring was added to mac_rx(9F)
mac_tx_ring_update was added to mac_tx_update(9F)

Modified existing manual pages:

mac(9E)
mc_unicst(9E)
mac_callbacks(9S)

All manual pages in one PDF are available here.

Changes

This section represents proposed concrete changes to the existing structures.

Structure Extensibility

We'd like to make sure that the structures that we're passing in as capabilities have some amount of extensibility. Rather than using the strict version numbering that is used elsewhere, we'd like to propose an extensions member which is the first member of the structure.

The idea here is that the OS would set the bits that it supports and then the driver would AND that with the bits that it supports. This is a variant of the strategy used by the mc_callbacks member of the mac_callbacks_t structure. By treating it as a feature negotiation, this allows us to even change the structure entirely with additional bits in the future. This is based on some of the work that was done in RFD 89 Project Tiresias.

In addition, we've gone ahead and reserved a uint_t of flags as the second member of each structure to allow us to have a more capabilities-like set of flags if we want drivers to indicate that they support various features along the way. One example of this would be having a driver declare that it supports VLAN tagging and stripping.

This impacts the MAC_CAPAB_RINGS capability structure, the mac_group_info_t structure and the mac_ring_info_t structure. These are discussed in mac_capab_rings(9E), mac_group_info(9S), and mac_ring_info(9S) respectively.
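The negotiation described in this section might look something like the following sketch. The structure, the bit names, and both functions are hypothetical; the RFD does not define specific extension bits here, and the real layout lives in the manual pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extension bits; none of these are named by the RFD. */
#define RING_EXT_BASE   (1u << 0)
#define RING_EXT_VLAN   (1u << 1)
#define RING_EXT_FUTURE (1u << 2)

/* Hypothetical capability structure following the proposed layout. */
typedef struct ring_capab {
    unsigned int rc_extensions; /* first member: negotiated extensions */
    unsigned int rc_flags;      /* second member: reserved feature flags */
    /* ... the remaining members would depend on the extensions ... */
} ring_capab_t;

/* OS side: advertise everything this version of the OS understands. */
void
os_offer_extensions(ring_capab_t *rc, unsigned int os_supported)
{
    assert(rc != NULL);
    rc->rc_extensions = os_supported;
    rc->rc_flags = 0;
}

/* Driver side: keep only the bits that the driver also understands. */
void
driver_accept_extensions(ring_capab_t *rc, unsigned int drv_supported)
{
    assert(rc != NULL);
    rc->rc_extensions &= drv_supported;
}
```

After both sides have run, rc_extensions holds the intersection, so an older driver on a newer OS (or the reverse) only ever sees features that both understand.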

MAC and VLAN Filtering

We'd like to make sure that we have some kind of VLAN filtering that we can add to the group API. Today, the current functions for MAC filters look like:

typedef int (*mac_add_mac_addr_t)(void *driver, const uint8_t *mac);
typedef int (*mac_rem_mac_addr_t)(void *driver, const uint8_t *mac);

Earlier we talked about the different ways of specifying filters. Over time, I suspect we're going to want more and more advanced filters; however, it's important to note the tie between MAC and VLAN pairs. To that end, I think the following function signatures should be added:

typedef int (*mac_add_mac_addr_t)(mac_group_driver_t driver, const uint8_t *mac,
    uint_t flags);
typedef int (*mac_rem_mac_addr_t)(mac_group_driver_t driver, const uint8_t *mac,
    uint_t flags);
typedef int (*mac_add_vlan_t)(mac_group_driver_t driver, uint16_t vlan, uint_t flags);
typedef int (*mac_rem_vlan_t)(mac_group_driver_t driver, uint16_t vlan, uint_t flags);
typedef int (*mac_add_mv_filter_t)(mac_group_driver_t driver, const uint8_t *mac,
    uint16_t vlan, uint_t flags);
typedef int (*mac_rem_mv_filter_t)(mac_group_driver_t driver, const uint8_t *mac,
    uint16_t vlan, uint_t flags);

The idea with these functions is that they allow us to add one of the following three things:

A MAC-only filter that should be logically ORed with all the other MAC filters.
A VLAN-only filter that should be logically ORed with all other VLAN filters.
A MAC, VLAN match filter. This should be logically ORed with all other MAC, VLAN match filters.

The driver will be able to indicate whether it supports MAC, VLAN, or MAC/VLAN tuple filters. If the driver supports more than one of the filter options, then the documentation will indicate that the driver should logically AND between the filters.

A driver should either support the separate filters or the tuple. We'll make it an error when registering the capability if it supports both.

One other thing to point out is that we added a flags argument that will likely be 0 by default. This is really to allow us to extend things in the future if there are arguments we want to add, for example, like tagging and stripping VLAN information.

This can be found in greater detail in the Filters section in mac_capab_rings(9E). In addition, the changes to the mac_group_t structure can be found in mac_group_info(9S).
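The rule that a driver may offer either the separate filters or the tuple filter, but not both, could be checked at registration time along the following lines. This is a sketch under assumptions: group_filter_ops_t, its members, the example_* entry points, and the choice of EINVAL are all hypothetical, and mac_group_driver_t is treated here as an opaque pointer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned int uint_t;
typedef void *mac_group_driver_t; /* assumption: opaque driver handle */

typedef int (*mac_add_mac_addr_t)(mac_group_driver_t, const uint8_t *,
    uint_t);
typedef int (*mac_add_vlan_t)(mac_group_driver_t, uint16_t, uint_t);
typedef int (*mac_add_mv_filter_t)(mac_group_driver_t, const uint8_t *,
    uint16_t, uint_t);

/* Hypothetical container for the add-side entry points of one group. */
typedef struct group_filter_ops {
    mac_add_mac_addr_t  gfo_add_mac;  /* NULL if unsupported */
    mac_add_vlan_t      gfo_add_vlan; /* NULL if unsupported */
    mac_add_mv_filter_t gfo_add_mv;   /* NULL if unsupported */
} group_filter_ops_t;

/* Returns 0 if the combination is allowed, EINVAL otherwise. */
int
group_filter_ops_validate(const group_filter_ops_t *ops)
{
    int separate, tuple;

    assert(ops != NULL);
    separate = (ops->gfo_add_mac != NULL || ops->gfo_add_vlan != NULL);
    tuple = (ops->gfo_add_mv != NULL);

    /* Supporting both styles at once is treated as an error. */
    return ((separate && tuple) ? EINVAL : 0);
}

/* Toy driver entry points that accept every filter. */
int
example_add_mac(mac_group_driver_t drv, const uint8_t *mac, uint_t flags)
{
    (void) drv; (void) mac; (void) flags;
    return (0);
}

int
example_add_vlan(mac_group_driver_t drv, uint16_t vlan, uint_t flags)
{
    (void) drv; (void) vlan; (void) flags;
    return (0);
}

int
example_add_mv(mac_group_driver_t drv, const uint8_t *mac, uint16_t vlan,
    uint_t flags)
{
    (void) drv; (void) mac; (void) vlan; (void) flags;
    return (0);
}
```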

Ring Polling

Today, ring polling provides a single limit, which is the number of bytes that should be polled by the driver. Currently this is a signed value; however, negative values have no meaning and in fact, some drivers ASSERT that it is not negative. We should likely change this to a size_t and then also consider adding a third argument which indicates the total number of bytes that should be polled.

It is still an open question as to whether or not we should introduce the third argument; however, changing the function signature seems like an important change. While there's no intention at this time of supporting a larger than INT32_MAX value, it might eliminate a class of things that driver writers may worry about in order to be truly defensive.
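As a sketch of why an unsigned limit simplifies the driver's job, consider a poll loop that honors a byte budget expressed as a size_t. The function below is hypothetical and models the ring as an array of pending packet lengths. It assumes that a driver stops before the packet that would push it over the limit; the RFD does not specify the exact stopping rule.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Take packets from the front of the ring while they fit within
 * max_bytes. Returns the number of packets taken and, if bytes_out
 * is not NULL, stores the total number of bytes taken.
 */
size_t
ring_poll_budget(const size_t *pkt_lens, size_t npkts, size_t max_bytes,
    size_t *bytes_out)
{
    size_t taken = 0, bytes = 0;

    assert(pkt_lens != NULL || npkts == 0);

    /*
     * bytes never exceeds max_bytes, so the subtraction cannot wrap,
     * and no check for a negative limit is needed.
     */
    while (taken < npkts && pkt_lens[taken] <= max_bytes - bytes) {
        bytes += pkt_lens[taken];
        taken++;
    }

    if (bytes_out != NULL)
        *bytes_out = bytes;
    return (taken);
}
```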

mac_register_t changes

Today, the mac_register_t structure has a member m_v12n which is used to try and determine whether or not it should even ask the driver about caps and shares. This member doesn't seem to provide any value as something that a driver has to specify. I propose that drivers should not have to specify it and it should be ignored. Drivers will simply be asked about both MAC_CAPAB_RINGS and MAC_CAPAB_SHARES, which was the other thing that this was intended to support.

Dynamic MAC Groups

The existing MAC framework has a notion of both static and dynamic groups. Static groups have a fixed mapping between rings and groups. Most drivers that support rings use the static mapping. The only exception is the nxge driver. Dynamic groups basically allow a ring to be placed in any group and in fact require that every ring be able to be placed in every group.

mac_group_t mgi_intr member

The mac_group_t structure has a mac_intr_t member embedded as its mgi_intr member. At this time, this member has not been documented as it's not clear that most drivers have a use for it. From my rough understanding, it allows some group-level interrupt management and synchronization; however, it doesn't really seem to be used by drivers in general.

At this time, I wouldn't remove it, but I would not endorse its use.

Open Questions

This section represents open questions that we still have on the current proposed design. Some of these, like the mac_intr_t question, we should answer sooner rather than later.

mac_intr_t

The mac_intr_t structure presents an interesting conundrum as it is a fixed size structure that is embedded inside of the mac_ring_info_t and mac_group_info_t structures. There are a few different options here.

Transform the mac_ring_info_t and mac_group_info_t structures into having pointers to the mac_intr_t structures.
Deal with them in the same extensibility format as we described above as part of the other structures.

I believe option one will be better in the long-term. It will cause more churn in existing drivers in the gate; however, now seems the time to pay this cost.

MAC ring driver argument normalization

Today, there exists a type called the mac_ring_driver_t. This value is set based on filling in rings. While some of the function callbacks use and leverage this, not all of them do. Importantly, neither the mri_send nor the mri_poll entry point does. I propose that while we're here, we normalize this such that they all either use the same void * or use the mac_ring_driver_t. I do not have a strong preference towards one or the other and would welcome feedback as to what approach we should take.

VLAN Tagging and Stripping

It's worth starting to think about tagging and stripping in the context of this and how we would evolve this. I think that we would start by introducing two flags that we could assign on the group information structure:

MAC_RING_STRIP_VLAN
MAC_RING_INSERT_VLAN

The MAC_RING_STRIP_VLAN flag indicates that the hardware has the capability to strip the VLAN tags when receiving frames on a given group. We might leverage this by then passing a flag requesting it on the different VLAN filter entry points.

The MAC_RING_INSERT_VLAN flag indicates that the hardware has the capability to insert a VLAN tag when transmitting frames on the given group. It's less clear how this would fit into the broader transmit framework. For example, we may want to treat this like the checksum information and store it on a per-mblk basis or we may want to just say that we'll use this on a fully classified ring.

While neither of these is required at this time, it's useful to think through what features like these may look like.
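For reference, the software equivalent of what stripping and insertion do to a frame is sketched below. Hardware that offers these features would perform the same transformation on receive and transmit respectively. The function names are hypothetical and this is not code from MAC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Remove the 4-byte VLAN tag that follows the two MAC addresses.
 * On success, stores the 16-bit VLAN ID/priority field in *tcip,
 * shrinks *lenp by 4, and returns true. Returns false and leaves the
 * frame untouched if it is not tagged.
 */
bool
vlan_strip(uint8_t *frame, size_t *lenp, uint16_t *tcip)
{
    assert(frame != NULL && lenp != NULL && tcip != NULL);
    if (*lenp < 18 || frame[12] != 0x81 || frame[13] != 0x00)
        return (false);

    *tcip = (uint16_t)((frame[14] << 8) | frame[15]);
    /* Slide the inner Ethertype and payload over the tag. */
    memmove(frame + 12, frame + 16, *lenp - 16);
    *lenp -= 4;
    return (true);
}

/*
 * Insert a VLAN tag carrying tci after the two MAC addresses. cap is
 * the size of the buffer that holds the frame. Returns false if the
 * frame is too short or the buffer has no room for 4 more bytes.
 */
bool
vlan_insert(uint8_t *frame, size_t *lenp, size_t cap, uint16_t tci)
{
    assert(frame != NULL && lenp != NULL);
    if (*lenp < 14 || cap < *lenp || cap - *lenp < 4)
        return (false);

    /* Open a 4-byte gap in front of the original Ethertype. */
    memmove(frame + 16, frame + 12, *lenp - 12);
    frame[12] = 0x81;
    frame[13] = 0x00;
    frame[14] = (uint8_t)(tci >> 8);
    frame[15] = (uint8_t)(tci & 0xff);
    *lenp += 4;
    return (true);
}
```

Inserting a tag and then stripping it returns the frame to its original bytes, which is the property a transmit-insert and receive-strip pair should preserve end to end.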
