"Past results offer no guarantees for the future"

These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.

Bouncing packets: Kernel bridge bug or corner case?

While setting up Tika, I stumbled upon a fairly unlikely corner case in the Linux kernel networking code that prevented some of my packets from being delivered to the right place. After quite some digging through debug logs and kernel source code, I found the cause of this problem in the way the bridge module handles netfilter and iptables.

Just in case someone else ends up in this situation and actually manages to find this blogpost, I'll detail my setup, the problem and its solution here.

# Tika's network setup

Tika runs Debian wheezy, with a single network interface to the internet (which is not involved in this problem). Furthermore, Tika runs a number of lxc containers, which are isolated systems sharing the same kernel, but running a complete userspace of their own. Using kernel namespaces and cgroups, these containers obtain a fair degree of separation: each of them has its own root filesystem, a private set of mounted filesystems, separate user ids, a separate network stack, etc.

Each of these containers then connects to the outside world using a virtual ethernet (veth) device. This is sort of a named pipe, but then for ethernet. Each veth device has two ends, one inside the container and one outside, which are connected. On the inside, it just looks like each container has a single ethernet device, which is configured normally. On the outside, all of these veth interfaces are grouped together into a bridge device, br-lxc, which allows the containers to talk amongst themselves (just as if they were connected to the same ethernet switch). The bridge device in the host is configured with an IP address as well, to allow communication between the host and the containers.

Now, I have a few port forwarding rules: when traffic comes in on my public IP address on specific ports, it gets forwarded to a specific container. There is nothing special about this, this is just like forwarding ports to LAN hosts on a NAT router.

A problem with port forwarding like this is that by default, packets coming in from the internal side cannot be properly handled. As an example, one of the containers is running a webserver, which serves a custom Debian repository on the apt.stderr.nl domain. When another container tries to connect to that, DNS resolution will give it the external IP of Tika, but connecting to that IP fails.

Usually, the DNAT rule used for port forwarding is configured to only process packets from the external network. But even if it processed internal packets, it would not work. The DNAT rule changes the destination address of these packets to point to my web container, so they get sent to the web container. However, the source address is unchanged. Since the containers have a direct connection (through the network bridge), reply packets get sent directly to the originating container - the host does not have a chance to "undo" the DNAT on the reply packets. For external connections, this is not a problem, because the host is the default gateway for the containers and the replies need to go through the host to reach the external IP.

The most common solution to this is split-horizon DNS - make sure that all these domains resolve to the internal address of the web container, so no port forwarding is needed. For various practical reasons, this didn't work for me, so I settled for the other solution: Apply SNAT in addition to DNAT, which causes the source address of the forwarded packets to be changed to the host's address, forcing replies to pass through the host. The Vuurmuur firewall I was using even had a special "bounce" rule for exactly this purpose (setting up a DNAT and SNAT iptables rule).
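
To make the reasoning above concrete, here is a minimal Python sketch that models the address rewriting only. It is not how netfilter is implemented, and all addresses and names (`PUBLIC_IP`, `HOST_IP`, `WEB_IP`, `CLIENT_IP`, `forward`, `reply_passes_host`) are made up for illustration. It shows that with DNAT alone the reply is addressed straight to the other container, while adding SNAT makes the reply come back to the host.

```python
from dataclasses import dataclass, replace

# All addresses below are hypothetical, chosen only for illustration.
PUBLIC_IP = "203.0.113.10"  # external address of the host
HOST_IP = "10.0.0.1"        # host address on the container bridge
WEB_IP = "10.0.0.2"         # web container
CLIENT_IP = "10.0.0.3"      # some other container


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


def forward(packet, snat):
    """Rewrite a request roughly the way the host's NAT rules would."""
    if packet.dst == PUBLIC_IP:
        packet = replace(packet, dst=WEB_IP)       # DNAT: the port forward
        if snat:
            packet = replace(packet, src=HOST_IP)  # SNAT: the "bounce" part
    return packet


def reply_passes_host(request, snat):
    """Return True if the web container addresses its reply to the host."""
    seen_by_web = forward(request, snat)
    reply = Packet(src=seen_by_web.dst, dst=seen_by_web.src)
    # A reply addressed to another container crosses the bridge directly,
    # so the host never gets the chance to undo the DNAT on it.
    return reply.dst == HOST_IP


request = Packet(src=CLIENT_IP, dst=PUBLIC_IP)
print(reply_passes_host(request, snat=False))  # DNAT only
print(reply_passes_host(request, snat=True))   # DNAT + SNAT
```

With DNAT only, the client would receive a reply from `WEB_IP` for a connection it opened to `PUBLIC_IP`, and discard it. With SNAT added, the reply returns to the host, which can translate both addresses back.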

This setup worked perfectly - when connecting to the web container from other containers. However, when the web container tried to connect to itself (through the public IP address), the packets got lost. I initially thought the packets were dropped - they went through the PREROUTING chain as normal, but never showed up in the FORWARD chain. I also thought the problem was caused by the packet having the same source and destination addresses, since packets coming from other containers worked as normal. Neither of these turned out to be true, as I'll show below.

# Simplifying the setup

Since reproducing the problem on a different and/or simpler setup is always a good approach in debugging, I tried to reproduce the problem on my laptop, using a (single) regular ethernet device and applying DNAT and SNAT rules. This worked as expected, but when I added a bridge interface, containing just the ethernet interface, it broke again. Adding a second (vlan) interface to the bridge uncovered that the problem was not traffic DNATed back to its source, but rather traffic DNATed back to the same bridge port it originated from - traffic from one bridge port DNATed to the other worked normally.

Digging down into the kernel sources for the bridge module, I uncovered this piece of code, which applies some special handling for exactly this case: DNATed packets on a bridge. It seems this is either a performance optimization, or a way to allow DNATing packets inside a bridge without having to enable full routing, though I find the exact effects of this code rather confusing.

I also found that setting the bridge device to promiscuous mode (e.g. by running tcpdump) makes everything work. Setting /proc/sys/net/bridge/bridge-nf-call-iptables to 0 also makes this work. That setting is meant to prevent bridged packets from passing through iptables, but since this packet wasn't actually a bridged packet before PREROUTING, the effect here is that the packet gets processed by the normal routing code and progresses through all the regular chains normally.

Here's what I think happens:

• The packet comes in through br_handle_frame.
• The frame gets dumped into the NF_BR_PRE_ROUTING netfilter chain (i.e. the bridge / ebtables version, not the ip / iptables one).
• The ebtables rules get called
• The br_nf_pre_routing hook for NF_BR_PRE_ROUTING gets called. This interrupts (returns NF_STOLEN) the handling of the NF_BR_PRE_ROUTING chain, and calls the NF_INET_PRE_ROUTING chain.
• The br_nf_pre_routing_finish finish handler gets called after completing the NF_INET_PRE_ROUTING chain.
• This handler resumes the handling of the interrupted NF_BR_PRE_ROUTING chain. However, because it detects that DNAT has happened, it sets the finish handler to br_nf_pre_routing_finish_bridge instead of the regular br_handle_frame_finish finish handler.
• br_nf_pre_routing_finish_bridge runs. This sets skb->dev to the parent bridge, sets the BRNF_BRIDGED_DNAT flag and then calls neigh->output(neigh, skb), which presumably resolves to one of the neigh_*output functions, each of which again calls dev_queue_xmit, which should (eventually) call br_dev_xmit.
• br_dev_xmit sees the BRNF_BRIDGED_DNAT flag and calls br_nf_pre_routing_finish_bridge_slow instead of actually delivering the packet.
• br_nf_pre_routing_finish_bridge_slow sets up the destination MAC address, sets skb->dev back to skb->physindev and calls br_handle_frame_finish.
• br_handle_frame_finish calls br_forward. If the bridge device is set to promiscuous mode, this also delivers the packet up through br_pass_frame_up. Since enabling promiscuous mode fixes my problem, it seems likely that the packet manages to get all the way to here.
• br_forward calls should_deliver, which returns false when skb->dev == p->dev (and "hairpin mode" is not enabled), causing the packet to be dropped.
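
The final check in the list above can be sketched in Python. This is a simplified model, not the kernel code: the real should_deliver may check more than this, and the `Port` class, the `forward_dnated_frame` helper and the device names are hypothetical. The assumption modelled here is that a port accepts a frame only if the port is in forwarding state and either hairpin mode is on or the frame arrived on a different device.

```python
from dataclasses import dataclass


@dataclass
class Port:
    dev: str                 # name of the underlying network device
    hairpin: bool = False    # stands in for the BR_HAIRPIN_MODE flag
    forwarding: bool = True  # stands in for the port's forwarding state


def should_deliver(port, ingress_dev):
    """Simplified model of the check done before forwarding to a port."""
    return (port.hairpin or ingress_dev != port.dev) and port.forwarding


def forward_dnated_frame(ingress_dev, egress_port):
    """Outcome for a DNATed frame that was handed back to the bridge.

    At this point skb->dev has been set back to the original ingress
    device, so that device is what the egress port is compared against.
    """
    if should_deliver(egress_port, ingress_dev):
        return "delivered"
    return "dropped"


# Hypothetical device names: the web container connecting to itself,
# another container connecting to it, and the hairpin workaround.
print(forward_dnated_frame("veth-web", Port("veth-web")))
print(forward_dnated_frame("veth-other", Port("veth-web")))
print(forward_dnated_frame("veth-web", Port("veth-web", hairpin=True)))
```

Only the first case, where the frame has to leave through the port it came in on, ends up dropped. That matches the observation that traffic DNATed from one bridge port to another worked normally.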

This seems like a bug, or at least an unfortunate side effect. It seems there are currently two ways to work around this problem:

• Setting /proc/sys/net/bridge/bridge-nf-call-iptables to 0, so there is no need for this DNAT + bridge stuff. The side effect of this solution is that bridge packets don't go through iptables, but that's really what I'd have expected in the first place, so this is not a problem for me.
• Setting the bridge port to "hairpin" mode, which allows sending packets back out of the port they came in on. The side effect here is, AFAICS, that broadcast packets are sent back into the bridge port as well, which isn't really needed (but shouldn't really hurt either).
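
Both workarounds come down to writing a 0 or 1 to a kernel control file. The sketch below shows one way to do that from Python. The helper names (`hairpin_path`, `write_flag`, `read_flag`) and the port name `veth-web` are hypothetical, and the sysfs location used for hairpin mode (`brport/hairpin_mode` under the port's device directory) is an assumption that should be checked against the running kernel. Writing to these files requires root, and the settings do not survive a reboot.

```python
from pathlib import Path

# The sysctl named in the text above, as a file under /proc.
BRIDGE_NF_CALL_IPTABLES = Path("/proc/sys/net/bridge/bridge-nf-call-iptables")


def hairpin_path(port, sysfs_root="/sys"):
    """Assumed sysfs location of the hairpin flag for a bridge port."""
    return Path(sysfs_root) / "class" / "net" / port / "brport" / "hairpin_mode"


def write_flag(path, enabled):
    """Write 1 or 0 to a kernel control file."""
    Path(path).write_text("1\n" if enabled else "0\n")


def read_flag(path):
    """Return True if the control file currently contains 1."""
    return Path(path).read_text().strip() == "1"


# Example usage (as root), left commented out so that importing or
# running this file does not change any system settings:
#
#   write_flag(BRIDGE_NF_CALL_IPTABLES, False)   # first workaround
#   write_flag(hairpin_path("veth-web"), True)   # second workaround
```

Only one of the two is needed. The first changes behaviour for every bridge on the host, while the second can be limited to the ports that actually need it.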

Next up is reporting this to a kernel mailing list to confirm if there is an actual kernel bug, or just a bug in my expectations :-)

Update: Turns out this behaviour was previously spotted, but no consensus about a fix was reached.
