TCP Is Not Reliable

Been too long between blogs…

“TCP Is Not Reliable” – what's THAT mean?

Means: I can cause TCP to reliably fail in under 5 mins, on at least 2 different modern Linux variants and on modern hardware, both in our datacenter (no hypervisor) and on EC2.

What does “fail” mean?  Means the client will open a socket to the server, write a bunch of stuff and close the socket – with no errors of any sort.  All standard blocking calls.  The server will get no information of any sort that a connection was attempted.  Let me repeat that: neither client nor server get ANY errors of any kind, the client gets told he opened/wrote/closed a connection, and the server gets no connection attempt, nor any data, nor any errors.  It's exactly “as if” the client's open/write/close was thrown in the bit-bucket.
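For concreteness, here is a minimal sketch of that open/write/close pattern in Java.  The class and method names are made up for illustration (this is not the H2O code), but the calls are the standard blocking ones.  The point is that a normal return from every call only tells you the bytes were handed to the local kernel.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical illustration of the "open, write, close" send style.
final class FireAndForgetSend {
    // Every call below can return normally even if the receiving application
    // never sees the bytes: write() only hands data to the local kernel, and
    // close() does not wait for the remote application to read anything.
    static void send(String host, int port, byte[] payload) throws IOException {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port)); // completes at the OS level
            OutputStream out = s.getOutputStream();
            out.write(payload);
            out.flush();
        } // close() happens here; no exception does not mean "delivered to the app"
    }
}
```

Pointing this at a listening socket whose owner never calls accept() is enough to see it: on a typical setup the small send completes without error, because the kernel finishes the handshake and buffers the data on the application's behalf.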

We'd been having these rare failures under heavy load where it was looking like a dropped RPC call.  H2O has its own RPC mechanism, built over the RUDP layer (see all the task-tracking code in the H2ONode class).  Integrating the two layers gives a lot of savings in network traffic: most small-data remote calls (e.g. nearly all the control logic) require exactly 1 UDP packet to start the call, and 1 UDP packet with the response.  For large-data calls (e.g., moving a 4Meg “chunk” of data between nodes) we use TCP – mostly for its flow-control & congestion-control.  Since TCP is also reliable, we bypassed the Reliability part of the RUDP.  If you look in the code, the AutoBuffer class lazily decides between UDP or TCP send styles, based on the amount of data to send.  The TCP stuff used to just open a socket, send the data & close.
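A rough sketch of that kind of lazy decision, to make the idea concrete.  This is NOT the real AutoBuffer: the class name, the method names and the 1400-byte cutoff are all assumptions for illustration.  The buffer starts out assuming a single UDP packet will do, and only flips to TCP once the accumulated data no longer fits.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical stand-in for a buffer that picks its transport lazily.
final class LazySendBuffer {
    enum Transport { UDP, TCP }

    // Assumed cutoff: roughly what fits in one datagram under a typical
    // 1500-byte Ethernet MTU. The threshold H2O actually uses may differ.
    static final int MAX_UDP_PAYLOAD = 1400;

    private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
    private Transport transport = Transport.UDP; // optimistic default

    // Appends data; switches to TCP the first time the total outgrows one packet.
    LazySendBuffer put(byte[] data) {
        buf.write(data, 0, data.length);
        if (buf.size() > MAX_UDP_PAYLOAD) transport = Transport.TCP;
        return this;
    }

    Transport transport() { return transport; }

    int size() { return buf.size(); }
}
```

With these assumed numbers, a few hundred bytes of control traffic stays on UDP, while a 4Meg chunk flips the buffer to TCP as soon as it is written.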

So as I was saying, we'd have these rare failures under heavy load that looked like a dropped TCP connection (it was hitting the same asserts as dropping a UDP packet, except we've had dropped-UDP-packet recovery code in there and working forever).  Finally Kevin, our systems hacker, got a reliable setup (reliably failing?) – it was an H2O parse of a large CSV dataset into a 5-node cluster… then a 4-node cluster, then a 3-node cluster.  I kept adding asserts, and he kept shrinking the test setup, but still nothing seemed obvious – except that obviously during the parse we'd inhale a lot of data, ship it around our 3-node cluster with lots of TCP connections, and then *bang*, an assert would trip about missing some data.

Occam's Razor dictated we look at the layers below the Java code – the JVM, the native, the OS layers – but these are typically very opaque.  The network packets, however, are easily visible with wireshark tools.  So we logged every packet.  It took another few days of hard work, but Kevin triumphantly presented me with a wireshark log bracketing the Java failure… and there it was in the log: a broken TCP connection.  We stared harder.

In all these failures the common theme is that the receiver is very heavily loaded, with many hundreds of short-lived TCP connections being opened/read/closed every second from many other machines.  The sender sends a 'SYN' packet, requesting a connection.  The sender (optimistically) sends 1 data packet; optimistic because the receiver has yet to acknowledge the SYN packet.  The receiver, being much overloaded, is very slow.  Eventually the receiver returns a 'SYN-ACK' packet, acknowledging both the open and the data packet.  At this point the receiver's JVM has not been told about the open connection; this work is all happening at the OS layer alone.  The sender, being done, sends a 'FIN' and does NOT wait for it to be acknowledged (all data has already been acknowledged).  The receiver, being heavily overloaded, eventually times out internally (probably waiting for the JVM to accept the open-call, and the JVM being overloaded is too slow to get around to it) – and sends a RST (reset) packet back, wiping out the connection and the data.  The sender, however, has moved on – it already sent a FIN & closed the socket, so the RST is for a closed connection.  Net result: sender sent, but the receiver reset the connection without informing either the JVM process or the sender.

Kevin crawled the Linux kernel code, looking at places where connections get reset.  There are too many to tell which exact path we triggered, but it is *possible* (not confirmed) that Linux decided it was the subject of a DDOS attack and started closing open-but-not-accepted TCP connections.  There are knobs in Linux you can tweak here, and we did – and could make the problem go away, or be much harder to reproduce.

With the bug root-caused in the OS, we started looking at our options for fixing the situation.  Asking our clients to either upgrade their kernels, or use kernel-level network tweaks was not in the cards.  We ended up implementing two fixes: (1) we moved the TCP connection parts into the existing Reliability layer built over UDP.  Basically, we have an application-level timeout and acknowledgement for TCP connections, and will retry TCP connections as needed.  With this in place, the H2O crash goes away (although if the code triggers, we log it and use app-level congestion delay logic).  And (2) we multiplex our TCP connections, so the rate of “open TCPs/sec” has dropped to 1 or 2 – and with this 2nd fix in place we never see the first issue.
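The shape of fix (1), sketched from the sender's side.  To be clear, this is a standalone illustration and not the H2O Reliability layer: the class name, the one-byte ACK value, the length-prefix framing and the retry policy are all assumptions.  The idea is just that the sender does not believe the data arrived until the receiving application says so, and tries again on a fresh connection if it hears nothing in time.

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical sender with an application-level ACK, timeout and retry.
final class AckedTcpSender {
    static final int ACK = 0x06; // assumed one-byte acknowledgement value

    // Sends a length-prefixed payload and waits for the receiver's ACK byte.
    // Returns the number of attempts used; throws if every attempt fails.
    static int sendWithAck(String host, int port, byte[] payload,
                           int timeoutMillis, int maxAttempts) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try (Socket s = new Socket()) {
                s.connect(new InetSocketAddress(host, port), timeoutMillis);
                s.setSoTimeout(timeoutMillis); // bound the wait for the ACK
                DataOutputStream out = new DataOutputStream(s.getOutputStream());
                out.writeInt(payload.length);
                out.write(payload);
                out.flush();
                // read() returns -1 on EOF and throws on timeout or reset
                int reply = s.getInputStream().read();
                if (reply == ACK) return attempt;
                last = new IOException("no ACK from receiver, read returned " + reply);
            } catch (IOException e) {
                last = e; // timeout, reset, refused: fall through and retry
            }
        }
        throw last != null ? last : new IOException("maxAttempts must be >= 1");
    }
}
```

One consequence of retrying: if the ACK is the thing that gets lost, the receiver sees the same payload twice, so whatever sits on the other end has to tolerate duplicates.  A real version would also want some delay between attempts rather than hammering an already overloaded receiver.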

At this point H2O's RPC calls are rock-solid, even under extreme loads.

UPDATE:

Found this decent article: http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable
Basically:

  • It’s a well known problem, in that many people trip over it, and get confused by it
  • The recommended solution is app-level protocol changes (send expected length with data, receiver sends back app-level ACK after reading all expected data). This is frequently not possible (e.g., legacy receiver).
  • Note that setting flags like SO_LINGER is not sufficient
  • There is a Linux-specific workaround (SIOCOUTQ)
  • The “Principle of Least Surprise” is violated: I, at least, am surprised when ‘write / close’ does not block on the ‘close’ until the kernel at the other end promises it can deliver the data to the app.  Probably the remote kernel would need to block the ‘close’ on this side until all the data has been moved into the user-space on that side – which might in turn be blocked by the receiver app’s slow read rate.
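The receiver's half of that recommended protocol change looks roughly like this.  It is a minimal sketch under assumptions: the class name, the 4-byte length prefix, the one-byte ACK value and the size cap are all invented for illustration.  What matters is the ordering: the ACK goes out only after the application itself holds every expected byte.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical receiver that acknowledges at the application level.
final class AckingReceiver {
    static final int ACK = 0x06;                 // assumed one-byte acknowledgement
    static final int MAX_LEN = 64 * 1024 * 1024; // assumed sanity cap on message size

    // Reads one length-prefixed message, then ACKs it.
    static byte[] receiveAndAck(Socket conn) throws IOException {
        DataInputStream in = new DataInputStream(conn.getInputStream());
        int len = in.readInt(); // sender states how much data to expect
        if (len < 0 || len > MAX_LEN) throw new IOException("bad length " + len);
        byte[] data = new byte[len];
        in.readFully(data);     // throws EOFException if the sender stopped early
        OutputStream out = conn.getOutputStream();
        out.write(ACK);         // only now does the sender learn the data arrived
        out.flush();
        return data;
    }
}
```

A sender that writes the length and the data and then blocks reading that one byte finds out whether the remote app got the data, which is the thing ‘write / close’ alone never tells it.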

Cliff

 
