F5 LTM and TCP Timeouts

One of the dangers of coming from a pure Cisco background is assumption: you treat every device as if it has the same defaults as a ‘normal’ Cisco device. I think I’m pretty good at avoiding this, but it gets us all sometimes.

As we all know, when you run long-lived TCP connections through application-aware devices, you need to make sure each connection actually carries traffic. The classic problem is Oracle SQL*Net through firewalls, where Oracle sets up pools of TCP connections for later use, so that when it gets a burst of traffic it has the connections set up and doesn’t need to waste precious milliseconds on TCP handshakes.

The problem comes when those connections remain unused for longer than the firewall’s TCP session timeout (typically 60 minutes). The firewall silently drops them as dead, but the client and server think they’re still up. The next time one side or the other decides to send a little traffic on one, you get a ‘broken socket’ error.
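To make the failure mode concrete, here is a minimal Python sketch of an idle timer. It is a toy model, not how any particular firewall is implemented: the function name `survives` is made up for illustration, and the assumption is that the device resets its timer on every packet and drops the flow once the gap exceeds the timeout.

```python
def survives(send_times, idle_timeout):
    """Return True if no gap between consecutive packets exceeds idle_timeout.

    send_times: ascending list of seconds at which any packet crosses the device.
    Assumption: the device resets its idle timer on every packet it sees.
    """
    last = send_times[0]
    for t in send_times[1:]:
        if t - last > idle_timeout:
            # The state entry expired before this packet arrived,
            # so the packet lands on a connection the device has forgotten.
            return False
        last = t
    return True


# A pooled connection that sits idle for 4000 seconds behind a 60 minute timeout
print(survives([0, 4000], 3600))

# The same connection with a packet every 60 seconds
print(survives(list(range(0, 4001, 60)), 3600))
```

The first call models the pooled connection that nobody touched; the second models the same connection with a packet every minute.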

This is normal behaviour, and needs to be taken into consideration whenever you’re building systems/applications which connect through firewalls or load balancers.

Now, let’s be clear. The answer to this issue is always to set a TCP keepalive. Send a packet every minute, and you will never have a problem. Ever. Really, do this. In fact, given you work with such a sensible bunch who will immediately implement this, there’s no need to read on.
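For the application side, a keepalive can be as small as the Python sketch below. The helper name `enable_keepalive` and the timer values are my own assumptions, not anything from a specific app. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are the Linux names and are only set when the platform exposes them; elsewhere the OS defaults apply.

```python
import socket


def enable_keepalive(sock, idle=60, interval=60, probes=5):
    """Turn on TCP keepalive and, where supported, tune its timers.

    idle:     seconds of silence before the first probe (assumed value)
    interval: seconds between probes (assumed value)
    probes:   unanswered probes before the OS gives up (assumed value)
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket timer options are not available on every platform,
    # so only set the ones this Python build knows about.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock


s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
s.close()
```

Call it on the socket before connecting, and keep the idle value comfortably below the shortest timeout on the path.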

Back to F5. Interesting fact of the day: when you use the F5 LTM to load balance TCP connections, the default idle timeout is only 5 minutes, i.e. a TCP connection which does not send a packet for 301 seconds gets dropped. That’s not long compared with the 60 minutes (3600 seconds) I have in my head from Cisco land. So the whole ‘use a TCP keepalive’ advice becomes even more important. Really, you are a crazy person if you don’t use a keepalive in this situation. There’s really no point reading on. You’re not a crazy person, nor is your co-worker who’s setting up the app, right?

Still here? OK. There are two ways to know you have this issue. First, when clients send traffic on a timed-out connection, you’ll see RST packets sent back to them from the F5 (not from the ‘real’ server). Second, you can see the current connection timers using the ‘b conn client’ command:

[dan@ltm01:Active] ~ # b conn client | grep tcp

CONN client 10.1.4.90:1873 server 10.31.8.10:https any protocol tcp age 216 – client: 10.1.4.90:1873
CONN client 10.1.4.90:1876 server 10.31.8.10:https any protocol tcp age 217 – client: 10.1.4.90:1876
CONN client 10.1.4.90:1877 server 10.31.8.10:https any protocol tcp age 238 – client: 10.1.4.90:1877

You can see even more detail by adding the client address and ‘show all’ to the same command:

[dan@ltm01:Active] ~ #  b conn client 10.1.4.90 show all
VIRTUAL 10.31.8.10:https <-> NODE any6:any   TYPE any
    CLIENTSIDE 10.1.4.90:1927 <-> 10.31.8.10:https
        (pkts,bits) in = (71, 16867)   out = (122, 139083)
    SERVERSIDE any6:any <-> any6:any
        (pkts,bits) in = (0, 0)   out = (0, 0)
    PROTOCOL tcp   UNIT 1   IDLE 83 (300)   LASTHOP 8 00:19:a9:f7:c0:00

So you can now see what the actual timeout value is for this connection (83 seconds used from a 300 second timer in this case). This is particularly handy as it shows you whether your ‘fix’ has actually taken.
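If you want to check a lot of connections at once, the IDLE field is easy to pull out with a script. This is a rough Python sketch that assumes the output format is exactly as shown above, i.e. ‘IDLE <seconds> (<timeout>)’. The function name `idle_timers` is made up, and the sample line is copied from the output above.

```python
import re

# Matches e.g. "IDLE 83 (300)" and captures both numbers
IDLE_RE = re.compile(r"IDLE\s+(\d+)\s+\((\d+)\)")


def idle_timers(output):
    """Return a list of (idle_seconds, timeout_seconds) found in the text."""
    return [(int(idle), int(timeout)) for idle, timeout in IDLE_RE.findall(output)]


sample = "    PROTOCOL tcp   UNIT 1   IDLE 83 (300)   LASTHOP 8 00:19:a9:f7:c0:00"

for idle, timeout in idle_timers(sample):
    print(f"idle {idle}s of {timeout}s, {timeout - idle}s left")
```

Feed it the saved output of the command and compare the timeout column before and after a change.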

If you do have a crazy person setting up your application, here’s how you can be a network hero and ‘fix the F5’. Write an iRule with the following content:

when SERVER_CONNECTED {
    # Raise the idle timeout for this connection to 3600 seconds (1 hour)
    IP::idle_timeout 3600
}

and apply it to the virtual server. This simply ups the idle timeout to 1 hour (obviously you can adjust the time to suit your environment). You can actually be quite granular and set different values for different protocols; check the always useful http://devcentral.f5.com site for more detail, or see this excellent post.

To see the change :

[dan@ltm01:Active] ~ #  b conn client 10.1.4.90 show all
VIRTUAL 10.31.8.10:https <-> NODE any6:any   TYPE any
    CLIENTSIDE 10.1.4.90:1943 <-> 10.31.8.10:https
        (pkts,bits) in = (83, 18157)   out = (165, 188758)
    SERVERSIDE any6:any <-> any6:any
        (pkts,bits) in = (0, 0)   out = (0, 0)
    PROTOCOL tcp   UNIT 1   IDLE 4 (3600)   LASTHOP 8 00:19:a9:f7:c0:00

One last point on this: as with most iRules, simply applying it to the virtual server doesn’t immediately affect current connections. Because the rule starts with ‘when SERVER_CONNECTED’, it is only invoked when a new TCP connection is set up and the F5 makes the backend connection to the server. You could probably fiddle with this to find other ways to tune when it’s started.

  • http://ertw.com/blog/ Sean

    You can change the TCP timeout in the TCP profile of the VS instead of having to write an iRule (Local Traffic -> Virtual Servers -> Profiles and then pick the appropriate one under “Protocols”). Or crack out a new profile based on TCP if you want to only use it for some VSes.

    Should be a bit more efficient than applying an iRule, and it’s also more in line with the way the configuration is designed.

    Sean

  • Nicollet

    Originally the network was designed to be stateless (a dumb network). You can’t lose TCP sessions if you route level 3 datagrams.

    I am sure we could find better answers to this problem.