Bridging LXC containers to host eth0 so they can have a public IP

UPDATE:

I found the solution there: http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge#No_traffic_gets_trough_.28except_ARP_and_STP.29

 # cd /proc/sys/net/bridge
 # ls
 bridge-nf-call-arptables  bridge-nf-call-iptables
 bridge-nf-call-ip6tables  bridge-nf-filter-vlan-tagged
 # for f in bridge-nf-*; do echo 0 > $f; done

But I'd like to have expert opinions on this: is it safe to disable all bridge-nf-*? What are they here for?

END OF UPDATE

I need to bridge LXC containers to the physical interface (eth0) of my host, and I've read numerous tutorials, documents and blog posts on the subject.

I need the containers to have their own public IP (which I've previously done with KVM/libvirt).

After two days of searching and trying, I still can't make it work with LXC containers.

The host runs a freshly installed Ubuntu Server Quantal (12.10) with only libvirt (which I'm not using here) and lxc installed.

I created the containers with:

lxc-create -t ubuntu -n mycontainer

So they also run Ubuntu 12.10.

Content of /var/lib/lxc/mycontainer/config is:


lxc.utsname = mycontainer
lxc.mount = /var/lib/lxc/test/fstab
lxc.rootfs = /var/lib/lxc/test/rootfs


lxc.network.type = veth
lxc.network.flags = up
lxc.network.link = br0
lxc.network.name = eth0
lxc.network.veth.pair = vethmycontainer
lxc.network.ipv4 = 179.43.46.233
lxc.network.hwaddr = 02:00:00:86:5b:11

lxc.devttydir = lxc
lxc.tty = 4
lxc.pts = 1024
lxc.arch = amd64
lxc.cap.drop = sys_module mac_admin mac_override
lxc.pivotdir = lxc_putold

# uncomment the next line to run the container unconfined:
#lxc.aa_profile = unconfined

lxc.cgroup.devices.deny = a
# Allow any mknod (but not using the node)
lxc.cgroup.devices.allow = c *:* m
lxc.cgroup.devices.allow = b *:* m
# /dev/null and zero
lxc.cgroup.devices.allow = c 1:3 rwm
lxc.cgroup.devices.allow = c 1:5 rwm
# consoles
lxc.cgroup.devices.allow = c 5:1 rwm
lxc.cgroup.devices.allow = c 5:0 rwm
#lxc.cgroup.devices.allow = c 4:0 rwm
#lxc.cgroup.devices.allow = c 4:1 rwm
# /dev/{,u}random
lxc.cgroup.devices.allow = c 1:9 rwm
lxc.cgroup.devices.allow = c 1:8 rwm
lxc.cgroup.devices.allow = c 136:* rwm
lxc.cgroup.devices.allow = c 5:2 rwm
# rtc
lxc.cgroup.devices.allow = c 254:0 rwm
#fuse
lxc.cgroup.devices.allow = c 10:229 rwm
#tun
lxc.cgroup.devices.allow = c 10:200 rwm
#full
lxc.cgroup.devices.allow = c 1:7 rwm
#hpet
lxc.cgroup.devices.allow = c 10:228 rwm
#kvm
lxc.cgroup.devices.allow = c 10:232 rwm

Then I changed my host's /etc/network/interfaces to:


auto lo
iface lo inet loopback

auto br0
iface br0 inet static
        bridge_ports eth0
        bridge_fd 0
        address 92.281.86.226
        netmask 255.255.255.0
        network 92.281.86.0
        broadcast 92.281.86.255
        gateway 92.281.86.254
        dns-nameservers 213.186.33.99
        dns-search ovh.net

When I try command-line configuration ("brctl addif", "ifconfig eth0", etc.), my remote host becomes inaccessible and I have to hard-reboot it.
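
One way to avoid locking yourself out when doing this by hand (a sketch only, not something I've verified on this exact setup, and assuming iproute2 is available plus out-of-band console access as a fallback) is to chain all the steps in a single command line, so the address ends up on the bridge before the shell returns:

 # move the host IP from eth0 to br0 in one shot; connectivity drops
 # mid-sequence and should come back once br0 has the address and route
 brctl addbr br0 && \
 brctl addif br0 eth0 && \
 ip addr flush dev eth0 && \
 ip addr add 92.281.86.226/24 dev br0 && \
 ip link set br0 up && \
 ip route add default via 92.281.86.254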

I changed the content of /var/lib/lxc/mycontainer/rootfs/etc/network/interfaces to:


auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
        address 179.43.46.233
        netmask 255.255.255.255
        broadcast 178.33.40.233
        gateway 92.281.86.254

It takes several minutes for mycontainer to start (lxc-start -n mycontainer).

I tried replacing

        gateway 92.281.86.254
with:
        post-up route add 92.281.86.254 dev eth0
        post-up route add default gw 92.281.86.254
        post-down route del 92.281.86.254 dev eth0
        post-down route del default gw 92.281.86.254

My container then starts instantly.
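
That behavior makes sense given the /32 netmask: with a host-only mask, the gateway is not on any directly connected network, so a plain gateway statement can't be installed until a device route to the gateway exists; the post-up lines create exactly that route first and then point the default route through it. The iproute2 equivalent (a sketch, using the same addresses as above) would be:

 # first an on-link route to the gateway, then the default route via it
 ip route add 92.281.86.254 dev eth0
 ip route add default via 92.281.86.254 dev eth0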

But whatever configuration I set in /var/lib/lxc/mycontainer/rootfs/etc/network/interfaces, I cannot ping any IP from mycontainer (including the host's):


ubuntu@mycontainer:~$ ping 92.281.86.226 
PING 92.281.86.226 (92.281.86.226) 56(84) bytes of data.
^C
--- 92.281.86.226 ping statistics ---
6 packets transmitted, 0 received, 100% packet loss, time 5031ms

And my host cannot ping the container:


root@host:~# ping 179.43.46.233
PING 179.43.46.233 (179.43.46.233) 56(84) bytes of data.
^C
--- 179.43.46.233 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 4000ms

My container's ifconfig:


ubuntu@mycontainer:~$ ifconfig
eth0      Link encap:Ethernet  HWaddr 02:00:00:86:5b:11  
          inet addr:179.43.46.233  Bcast:255.255.255.255  Mask:0.0.0.0
          inet6 addr: fe80::ff:fe79:5a31/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:64 errors:0 dropped:6 overruns:0 frame:0
          TX packets:54 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4070 (4.0 KB)  TX bytes:4168 (4.1 KB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:32 errors:0 dropped:0 overruns:0 frame:0
          TX packets:32 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2496 (2.4 KB)  TX bytes:2496 (2.4 KB)

My host's ifconfig:


root@host:~# ifconfig
br0       Link encap:Ethernet  HWaddr 4c:72:b9:43:65:2b  
          inet addr:92.281.86.226  Bcast:91.121.67.255  Mask:255.255.255.0
          inet6 addr: fe80::4e72:b9ff:fe43:652b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1453 errors:0 dropped:18 overruns:0 frame:0
          TX packets:1630 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:145125 (145.1 KB)  TX bytes:299943 (299.9 KB)

eth0      Link encap:Ethernet  HWaddr 4c:72:b9:43:65:2b  
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3178 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1637 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:298263 (298.2 KB)  TX bytes:309167 (309.1 KB)
          Interrupt:20 Memory:fe500000-fe520000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:6 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:300 (300.0 B)  TX bytes:300 (300.0 B)

vethtest  Link encap:Ethernet  HWaddr fe:0d:7f:3e:70:88  
          inet6 addr: fe80::fc0d:7fff:fe3e:7088/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:54 errors:0 dropped:0 overruns:0 frame:0
          TX packets:67 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:4168 (4.1 KB)  TX bytes:4250 (4.2 KB)

virbr0    Link encap:Ethernet  HWaddr de:49:c5:66:cf:84  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

I have disabled lxcbr0 (USE_LXC_BRIDGE="false" in /etc/default/lxc).


root@host:~# brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.4c72b943652b       no              eth0
                                                        vethtest

I have configured the IP 179.43.46.233 to point to 02:00:00:86:5b:11 in my hosting provider (OVH) config panel.
(The IPs in this post are not the real ones.)
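
For debugging this kind of setup, it can help to watch whether ARP requests for the failover IP actually reach the physical interface (a diagnostic sketch; tcpdump may need to be installed first):

 # show ARP traffic for the container's IP as seen on the wire
 tcpdump -n -i eth0 arp and host 179.43.46.233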

Thanks for reading this long question! :-)

Vianney


A better way to make your change permanent is to use sysctl instead of writing to /proc directly, since sysctl is the standard way to configure kernel parameters at runtime, and settings stored under /etc/sysctl.d/ are applied again at the next boot:

# cat >> /etc/sysctl.d/99-bridge-nf-dont-pass.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-filter-vlan-tagged = 0
EOF
# service procps start
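
If you'd rather not poke the procps service, the same file can also be loaded directly (this assumes a sysctl that supports -p with an explicit file, which any remotely recent procps does):

 # sysctl -p /etc/sysctl.d/99-bridge-nf-dont-pass.conf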

As for the answer to the question in your update...

bridge-netfilter (or bridge-nf) provides a very simple stateful transparent firewall for IPv4/IPv6/ARP packets crossing a bridge (even inside 802.1Q VLAN or PPPoE headers). More advanced functionality, such as transparent IP NAT, is achieved by passing those packets to arptables/iptables for further processing. However, even if those more advanced features are not needed, passing packets to those programs is still turned on by default in the kernel module and must be turned off explicitly using sysctl.
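
As an aside, if you want to keep bridge-nf enabled but stop a restrictive iptables FORWARD policy from eating bridged frames, the usual trick (a sketch, using the physdev match that bridge-nf makes available) is to accept anything that is being bridged rather than routed:

 # accept frames traversing the bridge so the FORWARD policy no longer drops them
 iptables -I FORWARD -m physdev --physdev-is-bridged -j ACCEPT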

What are they here for? These kernel options exist to either pass (1) or not pass (0) packets to arptables/iptables, as described in the bridge-nf FAQ:

As of kernel version 2.6.1, there are four sysctl entries for bridge-nf behavioral control (they can be found under /proc/sys/net/bridge/):
bridge-nf-call-arptables - pass (1) or don't pass (0) bridged ARP traffic to arptables' FORWARD chain.
bridge-nf-call-iptables - pass (1) or don't pass (0) bridged IPv4 traffic to iptables' chains.
bridge-nf-call-ip6tables - pass (1) or don't pass (0) bridged IPv6 traffic to ip6tables' chains.
bridge-nf-filter-vlan-tagged - pass (1) or don't pass (0) bridged vlan-tagged ARP/IP traffic to arptables/iptables.
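
You can check how a given host is currently configured with sysctl (a quick read-only check, assuming the bridge module is loaded so the keys exist):

 sysctl net.bridge.bridge-nf-call-arptables \
        net.bridge.bridge-nf-call-iptables \
        net.bridge.bridge-nf-call-ip6tables \
        net.bridge.bridge-nf-filter-vlan-tagged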

Is it safe to disable all bridge-nf-*? Yes, it is not only safe to do so, but there is also a recommendation for distributions to turn it off by default, to help people avoid the kind of confusion you encountered:

In practice, this can lead to serious confusion where someone creates a bridge and finds that some traffic isn't being forwarded across the bridge. Because it's so unexpected that IP firewall rules apply to frames on a bridge, it can take quite some time to figure out what's going on.

and to increase security:

I still think the risk with bridging is higher, especially in the presence of virtualisation. Consider the scenario where you have two VMs on the one host, each with a dedicated bridge with the intention that neither should know anything about the other's traffic.

With conntrack running as part of bridging, the traffic can now cross over which is a serious security hole.

UPDATE: May 2015

If you are running a kernel older than 3.18, you may be subject to the old behavior of bridge filtering being enabled by default; on 3.18 and newer, you can still be bitten by this if the filtering module has been loaded and the filtering hasn't been disabled. See:

https://bugzilla.redhat.com/show_bug.cgi?id=634736#c44

After all these years of asking for the default of bridge filtering to be "disabled" and the change being refused by the kernel maintainers, now the filtering has been moved into a separate module that isn't loaded (by default) when the bridge module is loaded, effectively making the default "disabled". Yay!

I think this is in the kernel as of 3.17 (it definitely is in kernel 3.18.7-200.fc21, and appears to be in git prior to the tag "v3.17-rc4").
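
To see whether that separate module (it is called br_netfilter on these kernels) is loaded on a 3.18+ system, a quick check:

 # if this prints nothing, the filtering module isn't loaded and
 # bridged traffic is not being passed to iptables
 lsmod | grep br_netfilter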