HP Procurve MAC cache timeouts on long TCP sessions

We have a network of HP 2510G switches connected back to HP2912al for aggregation. We've noticed that long-running connections, such as a MySQL DB dump, start flooding out of all switch ports once the mac-cache-timeout expires. Running "arping" against the destination IP stops the flooding (traffic goes back to port-to-port forwarding) until the cache timeout expires again.

I can understand why this would happen for unidirectional UDP traffic, but I'm at a loss as to why it is happening for TCP. I would think the ACKs from the receiving machine would cause the Procurves to refresh the MAC address in their cache. Instead it seems like they only learn from ARPs.

Any ideas?


Solution 1:

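The fundamental problem here is that the MAC entry times out and is not renewed in time, which causes unicast flooding. A switch only refreshes an entry when it sees a frame with that MAC as the source address, so something is keeping the receiver's frames away from the switch. There are a few things to suspect in this situation: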

  1. MAC table churn. If you have too many hosts in the broadcast domain of the switch, you can get MAC table churn, where entries that are still in use get pushed out of the table. This typically happens when you are trunking a lot of VLANs through a switch to connect two major networks.

  2. STP changes tend to cause flooding. Misconfiguration in STP (switches with identical IDs...) and unstable links can cause clearing of caches and unexpected flooding.

  3. If you run 802.1Q and do not have a symmetric setup, the switch can learn the destination on the wrong VLAN. The entry on the correct VLAN then eventually ages out and the switch starts flooding. Because the replies come in on a different VLAN, the entry is never relearned and the switch keeps flooding.

  4. You have an asymmetric routing situation. If your routing is asymmetric and no traffic comes back through the same switch, the entries in its MAC table can easily time out. For example, in the following picture, traffic from Router1 to Router2 goes over Switch1 and traffic from Router2 to Router1 goes over Switch2. In this case you risk getting Host3 flooded.

       host1
        |
       Router1
      |      |
    Switch1  Switch2 - Host3
      |      |
       Router2
        |
       host2
    
  5. Purely unidirectional traffic. In this case you need to increase the MAC table TTL enough that the gratuitous ARPs from the OS (if it is configured to send any) keep the table fresh, or even hard-configure the forwarding entry. Note that purely unidirectional traffic is very rare. A MySQL dump should not be unidirectional. I have only seen this in cases of asymmetric routing.
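
To make the mechanism behind items 4 and 5 concrete, here is a minimal sketch of a learning switch in Python. It is a toy model, not ProCurve behavior: the `Switch` class, the `simulate` helper, the host names "A" and "B", and the 300-second age limit are all assumptions chosen for illustration. The point it shows is that the table is refreshed only by frames whose source is the receiver, so if the ACKs take another path the entry ages out and the data frames get flooded.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy learning switch: learns source MACs and ages them out."""
    age_limit: float = 300.0                   # illustrative value, in seconds
    table: dict = field(default_factory=dict)  # mac -> (port, last_seen)

    def receive(self, src, dst, in_port, now):
        # Learning (and refreshing) happens only on the SOURCE address.
        self.table[src] = (in_port, now)
        entry = self.table.get(dst)
        if entry is None or now - entry[1] > self.age_limit:
            self.table.pop(dst, None)          # unknown or aged out
            return "flood"
        return f"port {entry[0]}"


def simulate(symmetric):
    """Host A (port 1) streams to host B (port 2), one frame every 60 s."""
    sw = Switch(age_limit=300.0)
    sw.receive("B", "A", in_port=2, now=0)     # B is learned once, e.g. via ARP
    results = []
    for t in range(60, 601, 60):
        results.append(sw.receive("A", "B", in_port=1, now=t))
        if symmetric:
            # B's ACK crosses this switch and refreshes B's entry.
            sw.receive("B", "A", in_port=2, now=t)
    return results


if __name__ == "__main__":
    print("ACKs via this switch: ", simulate(symmetric=True))
    print("ACKs via another path:", simulate(symmetric=False))
```

In the asymmetric run the first frames are forwarded to port 2, and once the entry is older than the age limit every later frame is flooded, which matches the symptom described in the question.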

As a stopgap, I recommend deploying arpd (or similar) to send gratuitous ARPs and stop the flooding. It should have the same effect as arping (which you have found solves the problem temporarily). But you really should debug this.
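
If no suitable daemon is at hand, the same stopgap can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `build_gratuitous_arp` and `send_frame` are hypothetical helper names, the MAC, IP and interface values are placeholders, and sending relies on Linux `AF_PACKET` raw sockets, which need root. It has to run on the host that is the destination of the flooded traffic, because the switches learn from the source MAC of the frame.

```python
import socket
import struct


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Build an Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    # Ethernet header: broadcast destination, our MAC as source, EtherType ARP.
    eth = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, hlen 6, plen 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + addr             # sender MAC and IP
    arp += b"\x00" * 6 + addr    # target MAC unset, target IP equals sender IP
    return eth + arp


def send_frame(interface: str, frame: bytes) -> None:
    """Send a raw frame (Linux only, requires root or CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as s:
        s.bind((interface, 0))
        s.send(frame)


if __name__ == "__main__":
    # Placeholder values: substitute the receiving host's own MAC and IP.
    frame = build_gratuitous_arp("00:11:22:33:44:55", "192.0.2.10")
    print(len(frame), "byte frame built")
    # send_frame("eth0", frame)  # e.g. from cron, more often than the MAC age time
```

Sending the frame at an interval shorter than the switches' MAC age time keeps the entry alive, but like arping it only masks the underlying problem.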

My first stop would be to verify that the routing is indeed symmetric all the way, as an asymmetric routing problem seems the most likely cause.

Also, have a look at the Cisco documentation on unicast flooding in campus networks, which is pretty good.