Backups take so long that the firewall closes the connection

A bit of a mashup of systems here, so bear with me. Essentially, I'm having some trouble using the Backup Exec agent for Oracle while trying to back up a remote Linux server. The BE agent appears to use RMAN to back up the databases.

The backup server is on one VLAN and the target server on another, with a Cisco ASA firewall providing the only link between them. This is by design, as the backup server is to support numerous clients and each client must be on its own VLAN to prevent them from accessing each other. I have added the recommended ports to the firewall to at least allow the agent to talk to the media server.

The backup starts well enough (indeed, a smaller Oracle database on the same server completes without issue), but the backup of a 200GB database, which would clearly take a few hours, never finishes.

I believe the problem to be related to http://www.symantec.com/business/support/index?page=content&id=TECH59632, which says that a CORBA session is established on port 5633 at the start of the backup and used before each RMAN operation but, while data is being transferred, the CORBA session's socket receives no packets. Since the connection timeout on the firewall is 60 mins, the CORBA session is dropped and, when the RMAN agent tries to perform its next action, the whole process bombs. Symantec say this problem was fixed in an earlier version of Backup Exec, but do not detail any additional settings to enforce it.

Setting the connection timeout on the firewall to something high enough to cover the backup window (e.g. 12 hours) seems like the wrong thing to do, as it is an estate-wide change that would also affect the connection lifetime of (for example) web requests to another client's web server.

Moving the Linux server into the same LAN as the backup server is out of the question.

I'm not a Linux guru, but I roughly know my way around. So far, I have tried using libkeepalive (http://libkeepalive.sourceforge.net/) to force the beremote process's sockets to be created with the SO_KEEPALIVE option set, but a quick netstat -top indicates that it is not taking effect. Either I'm using libkeepalive incorrectly, or it doesn't work for the beremote binary.
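For reference, this is roughly the socket-level change a keepalive shim is meant to make on a process's behalf. It is a minimal Python sketch, assuming Linux (the TCP_KEEP* options are not available everywhere, hence the guards). The function name and the timer values are illustrative assumptions, and it says nothing about how beremote actually creates its sockets.

```python
import socket

def enable_keepalive(sock, idle_s=600, interval_s=60, probes=5):
    """Turn on TCP keepalive for one socket and return the settings read back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = {
        "SO_KEEPALIVE": sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    }
    # Per-socket overrides of the net.ipv4.tcp_keepalive_* sysctls (Linux).
    # Skipped quietly on platforms that do not expose these constants.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = sock.getsockopt(socket.IPPROTO_TCP, opt)
    return applied

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(enable_keepalive(s))
    finally:
        s.close()
```

Two things matter for the firewall case. The option has to be set on the socket itself, and the idle value has to be shorter than the firewall's 60 minute idle timeout. The Linux default for net.ipv4.tcp_keepalive_time is 7200 seconds, so SO_KEEPALIVE with default timers would still probe too late.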

I guess I am looking for an option that fits with the environment I am in. I figure I'm looking for one or more of the following:

  • a way to configure the BE agent to keep the connection alive?
  • a way to inject the keepalive flag to the existing TCP connection (e.g. via a cronjob)?
  • a way to tell the Cisco device to increase the connection timeout for a specific source/target (maybe a policy-map)?

Any/all (other) ideas welcome...

J.


RE: Comment by @Weaver

As requested, class-map, policy-map and service-policy entries...

class-map CLS_INSPECTION_TRAFFIC
 match default-inspection-traffic
class-map CLS_ALL_TRAFFIC
 match any
class-map CLS_BACKUPEXEC_CORBA
 description Oracle/DB2 CORBA port for BackupExec traffic
 match port tcp eq 5633
!
!
policy-map type inspect dns PMAP_DNS_INSPECT_SETTINGS
 parameters
  message-length maximum client auto
  message-length maximum 1280
policy-map PMAP_GLOBAL_SERVICE
 class CLS_INSPECTION_TRAFFIC
  inspect dns PMAP_DNS_INSPECT_SETTINGS 
  inspect ftp 
  inspect h323 h225 
  inspect h323 ras 
  inspect rsh 
  inspect rtsp 
  inspect esmtp 
  inspect sqlnet 
  inspect skinny  
  inspect sunrpc 
  inspect xdmcp 
  inspect sip  
  inspect netbios 
  inspect tftp 
  inspect ipsec-pass-thru 
  inspect icmp 
  inspect snmp 
 class CLS_BACKUPEXEC_CORBA
  set connection timeout idle 1:00:00 dcd 
 class CLS_ALL_TRAFFIC
  set connection decrement-ttl
!

Solution 1:

Background on ASA Timeout/Timers:

The global timeout conn is the idle timer for TCP virtual circuits (sessions) and defaults to 60 minutes. The global timeout udp is for UDP holes and defaults to 2 minutes. The global timeout xlate is for clearing up translations that linger around after a conn has timed out. The conn (TCP) timeout takes precedence over the xlate timeout. The next paragraph further explains the relationship between conn and xlate timers.

If a conn is successfully torn down via TCP teardown, the conn and its xlate go with it (dynamic xlates only; static NAT and static PAT xlates are never removed). If a conn times out, then the xlate timer is taken into account. If the xlate times out first (because you set it real low), it will not take down the connection; the connection stays until the conn timer expires.

The ASA has several methods for dealing with the varying timeouts. Conn is one where the global setting can be overridden based on class-map -- this should be preferred over increasing the global setting if possible.

The other interesting feature the ASA possesses is dead connection detection (DCD). DCD allows you to keep your [global] conn timeout at 60 minutes (the default). When a connection has been idle for 60 minutes, the ASA acts as a man in the middle and spoofs null-data ACKs to each endpoint, posing as the other endpoint. Using null data keeps the sequence numbers from incrementing. If both sides respond, the connection's idle timer resets to 0 and begins again. If either side does not respond after a set number of attempts (configurable) in a given period, the conn is removed and the xlate timer gains relevance as described above.

I'd recommend configuring a class-map and adding it to your policy-map with DCD enabled. You can match on an ACL or a port (other match types are available as well). Using the port is quick, easy, and will work well if you are certain that TCP/5633 is where the problem sits.
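The DCD behaviour described above can be written out as a toy model. This is only a sketch of the logic as stated in this answer, not the ASA's implementation. The Conn class, the on_idle_timeout function and the max_retries value are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Conn:
    idle_s: int = 0     # seconds since the last packet seen on the connection
    alive: bool = True  # False once the firewall has removed the conn entry

def on_idle_timeout(conn, probe_results, max_retries=5):
    """Model what happens when the idle timer expires with DCD enabled.

    probe_results is an iterable of (client_answered, server_answered)
    booleans, one tuple per probe attempt.
    """
    for attempt, (client_ok, server_ok) in enumerate(probe_results, start=1):
        if client_ok and server_ok:
            conn.idle_s = 0  # both ends answered: the idle timer starts over
            return conn
        if attempt >= max_retries:
            break            # retry budget used up
    conn.alive = False       # an endpoint stayed silent: the conn is removed
    return conn

if __name__ == "__main__":
    # A long-running but healthy session: both ends answer the first probe.
    print(on_idle_timeout(Conn(idle_s=3600), [(True, True)]))
    # The server has gone away: every probe to it goes unanswered.
    print(on_idle_timeout(Conn(idle_s=3600), [(True, False)] * 5))
```

The point for the backup case is the first branch: an idle but healthy CORBA session would answer the probes, so its idle timer would be reset instead of the session being dropped at the 60 minute mark.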

I have used the global_policy below but feel free to adjust as necessary.

class-map BE-CORBA_class
 description Backup Exec CORBA Traffic Class
 match port tcp eq 5633

policy-map global_policy
 class BE-CORBA_class
  ! Choose ONE of the two lines below.
  ! For 8.2(2) and up:
  set connection timeout idle 1:00:00 dcd
  ! For releases prior to 8.2(2):
  set connection timeout tcp 1:00:00 dcd

service-policy global_policy global

@Comment

According to the reference guide -- "A packet can match only one class map in the policy map for each feature type."

The key phrase is "for each feature type." A packet crossing an interface can match multiple classes inside of a policy-map, but only if those classes use different "features." If you scroll up just a tad in the reference guide you will see the various features listed. That whole page is a goldmine for MPF tidbits.

You mentioned that you have a match any class-map defined and referenced as a class inside the policy-map. If that class performs any other TCP and UDP connection limit or timeout changes, then subsequent classes in the policy-map that match the same traffic will not apply their TCP and UDP connection limit and timeout changes to that packet.

If you post all the ACLs, class-maps, policy-maps, and service-policies, we can determine for certain.
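The "one class per feature type" rule can be illustrated with a toy model. This is a sketch only: the class names, the feature labels ("connection-settings", "inspection") and the action strings are hypothetical and are not ASA syntax. It is not a statement about which ASA commands share a feature type.

```python
def applied_actions(policy, packet):
    """Return {feature_type: (class_name, action)} for one packet.

    policy is an ordered list of (class_name, match_fn, actions), where
    actions maps a feature type to the action configured under that class.
    """
    chosen = {}
    for class_name, matches, actions in policy:
        if not matches(packet):
            continue
        for feature, action in actions.items():
            # The first matching class for a feature type wins; later
            # classes offering the same feature type are skipped.
            chosen.setdefault(feature, (class_name, action))
    return chosen

def is_corba(pkt):
    return pkt["proto"] == "tcp" and pkt["dport"] == 5633

def is_anything(pkt):
    return True

if __name__ == "__main__":
    pkt = {"proto": "tcp", "dport": 5633}
    catch_all_first = [
        ("MATCH_ANY", is_anything, {"connection-settings": "some other timeout"}),
        ("BE_CORBA", is_corba, {"connection-settings": "idle 1:00:00 dcd"}),
    ]
    specific_first = list(reversed(catch_all_first))
    print(applied_actions(catch_all_first, pkt))  # MATCH_ANY claims the feature
    print(applied_actions(specific_first, pkt))   # BE_CORBA claims the feature
```

In this model the order of classes in the policy-map decides the outcome whenever two matching classes configure the same feature type, which is why a specific class is normally placed ahead of a catch-all one.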

Solution 2:

As much as I'm not a fan of applications taking their toys and going home (and failing the backup) when one single TCP session gets killed, in this case I'd say just up the ASA's TCP session timeout.

Putting a hard limit on session length at all is really just a product of the ASA's need to track all connections to maintain state (and usually, NAT) - if you're running against your device's connection limit, then it may be an issue, but otherwise, just crank it up to 6 hours or something.

Unless both nodes at the ends of a TCP session go dark, the ASA will bear witness to one end or the other ending the connection when it ends naturally, and tear down the connection then (or trigger the shorter half-closed connection timeout), so you're unlikely to end up with a ton of dead connections clogging things up. The endpoint devices have an interest in tearing down useless connections, too - web servers are a good example, as they'll usually have much shorter connection timeouts than your ASA.