Wednesday, May 01, 2002
TCP Half-close mode and how it affects webserving
Over the past few days, Mark and I have been going over partial closures of a TCP connection, since under certain circumstances, you have to do that if you are writing a webserver, such as the one Mark is writing.
When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.
§ 8.1.4 of RFC-2616
So far so good. But …
/*
 * More machine-dependent networking gooo... on some systems,
 * you've got to be *really* sure that all the packets are acknowledged
 * before closing the connection, since the client will not be able
 * to see the last response if their TCP buffer is flushed by a RST
 * packet from us, which is what the server's TCP stack will send
 * if it receives any request data after closing the connection.
 *
 * In an ideal world, this function would be accomplished by simply
 * setting the socket option SO_LINGER and handling it within the
 * server's TCP stack while the process continues on to the next request.
 * Unfortunately, it seems that most (if not all) operating systems
 * block the server process on close() when SO_LINGER is used.
 * For those that don't, see USE_SO_LINGER below.  For the rest,
 * we have created a home-brew lingering_close.
 *
 * Many operating systems tend to block, puke, or otherwise mishandle
 * calls to shutdown only half of the connection.  You should define
 * NO_LINGCLOSE in ap_config.h if such is the case for your system.
 */
Comment from http_main.c in the Apache source code.
And then …
Some users have observed no FIN_WAIT_2 problems with Apache 1.1.x, but with 1.2b enough connections build up in the FIN_WAIT_2 state to crash their server. The most likely source for additional FIN_WAIT_2 states is a function called lingering_close() which was added between 1.1 and 1.2. This function is necessary for the proper handling of persistent connections and any request which includes content in the message body (e.g., PUTs and POSTs). What it does is read any data sent by the client for a certain time after the server closes the connection. The exact reasons for doing this are somewhat complicated, but involve what happens if the client is making a request at the same time the server sends a response and closes the connection. Without lingering, the client might be forced to reset its TCP input buffer before it has a chance to read the server's response, and thus understand why the connection has closed. See the appendix for more details.

The code in lingering_close() appears to cause problems for a number of factors, including the change in traffic patterns that it causes. The code has been thoroughly reviewed and we are not aware of any bugs in it. It is possible that there is some problem in the BSD TCP stack, aside from the lack of a timeout for the FIN_WAIT_2 state, exposed by the lingering_close code that causes the observed problems.

Connections in FIN_WAIT_2 and Apache
And the whole purpose of lingering_close() is to handle TCP half-closes when you can't use the SO_LINGER option when creating the socket!
So Mark and I go back and forth a few times and I finally send Mark the following:
Okay, looking over Stevens (UNIX Network Programming [1990], TCP/IP Illustrated Volume 1 [1994], TCP/IP Illustrated Volume 2 [1995]) and the Apache source code, here's what is going on.
The TCP/IP stack itself (under UNIX, this happens in the kernel) is responsible for sending out the various packet types of SYN, ACK, FIN, RST, etc. in response to what is done in user code. Ideally, for the server code, you would do (using the Berkeley sockets API since that's all the reference I have right now, and ignoring errors, which would only cloud the issue at hand):

memset(&sin,0,sizeof(sin));
sin.sin_family      = AF_INET;
sin.sin_addr.s_addr = INADDR_ANY;
sin.sin_port        = htons(port);	/* usually 80 for HTTP */

mastersock = socket(AF_INET,SOCK_STREAM,0);
one        = 1;
setsockopt(mastersock,SOL_SOCKET,SO_REUSEADDR,&one,sizeof(one));
bind(mastersock,(struct sockaddr *)&sin,sizeof(sin));
listen(mastersock,5);

while(...)
{
  struct linger lingeropt;
  socklen_t     length;
  int           sock;
  int           opt;

  length = sizeof(sin);
  sock   = accept(mastersock,(struct sockaddr *)&sin,&length);

  opt                = 1;
  lingeropt.l_onoff  = 1;
  lingeropt.l_linger = SOME_TIME_OUT_IN_SECS;

  setsockopt(sock,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt));
  setsockopt(sock,SOL_SOCKET,SO_LINGER,&lingeropt,sizeof(struct linger));

  /*---------------------------------------------------
  ; assuming HTTP/1.1, keep handling requests until
  ; a non-200 response is required, or the client
  ; sends a Connection: close or closes its side of the
  ; connection.  When that happens, we can just close
  ; our side and everything is taken care of.
  ;----------------------------------------------------*/

  close(sock);
}

There are two problems with this, though, that Apache attempts to deal with: 1) close() blocks if SO_LINGER is specified (not all TCP/IP stacks do this, just most it seems) and 2) TCP/IP stacks that have no timeout value in the FIN_WAIT_2 state (which means sockets may be consumed if the FIN_WAIT_2 states don't clear).

Apache handles #2 by:
if (TCP/IP stack has no timeout in FIN_WAIT_2 state)
   && (client is a known client that can't handle persistent connections properly)
then
  downgrade to HTTP/1.0.
end

(Apache will also downgrade to HTTP/1.0 for other browsers because they can't handle persistent connections properly anyway, and Apache will prevent them from crashing themselves, but I'm digressing here … )
Now, Apache handles #1 by rolling its own lingering close in userspace by writing any data it needs to the client, calling shutdown(sock,SHUT_WR), setting timeouts (alarm(), timeout struct in select(), etc.) and reading any pending data from the client before issuing the close() (and it never calls setsockopt(SO_LINGER) in this case). The reason Apache does this is because it needs to continue processing after the close(), and having close() block will affect the response time of Apache; that, and it seems some TCP/IP stacks can't handle SO_LINGER anyway and may crash (or seriously affect the throughput).

So, if you don't mind close() blocking (on a socket with SO_LINGER) and the TCP/IP stack won't puke or mishandle the socket, then the best bet would be to use SO_LINGER. Otherwise, you will have to do what Apache does and do something like:

write(sock,pendingdata,sizeof(pendingdata));
shutdown(sock,SHUT_WR);
alarm(SOME_TIME_OUT_IN_SECS);
FD_ZERO(&fdlist);
do
{
  FD_SET(sock,&fdlist);
  tv.tv_sec  = SOME_SMALLER_TIME_OUT_IN_SECS;
  tv.tv_usec = 0;
  rc = select(FD_SETSIZE,&fdlist,NULL,NULL,&tv);
} while((rc > 0) && (read(sock,dummybuf,sizeof(dummybuf)) > 0));
close(sock);
alarm(0);

(Apache has SOME_TIME_OUT_IN_SECS equal to 30 and SOME_SMALLER_TIME_OUT_IN_SECS as 2.)

And in going over the Apache code more carefully, it does seem that Apache will use its own version of a lingering close for Linux. Heck, I can't find an OS that Apache supports on which it actively uses SO_LINGER (and I'm checking the latest version of 1.3).

I'm not sure how you want to handle this, since the shutdown() call can close down either the read half, the write half (which is what the webserver needs to do in the case above) or both halves. The code you have for HttpdSocket::Shutdown() should probably do something close to what I have above if you aren't using SO_LINGER, and if you are using SO_LINGER, then all it has to do is call close().
That seems to have cleared up most of the misunderstandings we've been having, and now we're down to figuring out some minor details, as the architecture Mark has chosen for his webserver makes the possible blocking on close() not that much of an issue, and more modern TCP/IP stacks probably implement SO_LINGER correctly (or at least to the degree that they don't puke or mishandle the option).