Wednesday, May 01, 2002
TCP Half-close mode and how it affects webserving
Over the past few days, Mark and I have been going over partial closures of a TCP connection, since under certain circumstances, you have to do that if you are writing a webserver, such as the one Mark is writing.
When a client or server wishes to time-out it SHOULD issue a graceful close on the transport connection. Clients and servers SHOULD both constantly watch for the other side of the transport close, and respond to it as appropriate. If a client or server does not detect the other side's close promptly it could cause unnecessary resource drain on the network.
§ 8.1.4 of RFC-2616
So far so good. But …
/*
 * More machine-dependent networking gooo... on some systems,
 * you've got to be *really* sure that all the packets are acknowledged
 * before closing the connection, since the client will not be able
 * to see the last response if their TCP buffer is flushed by a RST
 * packet from us, which is what the server's TCP stack will send
 * if it receives any request data after closing the connection.
 *
 * In an ideal world, this function would be accomplished by simply
 * setting the socket option SO_LINGER and handling it within the
 * server's TCP stack while the process continues on to the next request.
 * Unfortunately, it seems that most (if not all) operating systems
 * block the server process on close() when SO_LINGER is used.
 * For those that don't, see USE_SO_LINGER below.  For the rest,
 * we have created a home-brew lingering_close.
 *
 * Many operating systems tend to block, puke, or otherwise mishandle
 * calls to shutdown only half of the connection.  You should define
 * NO_LINGCLOSE in ap_config.h if such is the case for your system.
 */
Comment from http_main.c in the Apache source code.
And then …
Some users have observed no FIN_WAIT_2 problems with Apache 1.1.x, but with 1.2b enough connections build up in the FIN_WAIT_2 state to crash their server. The most likely source for additional FIN_WAIT_2 states is a function called lingering_close() which was added between 1.1 and 1.2. This function is necessary for the proper handling of persistent connections and any request which includes content in the message body (e.g., PUTs and POSTs). What it does is read any data sent by the client for a certain time after the server closes the connection. The exact reasons for doing this are somewhat complicated, but involve what happens if the client is making a request at the same time the server sends a response and closes the connection. Without lingering, the client might be forced to reset its TCP input buffer before it has a chance to read the server's response, and thus understand why the connection has closed. See the appendix for more details.

The code in lingering_close() appears to cause problems for a number of factors, including the change in traffic patterns that it causes. The code has been thoroughly reviewed and we are not aware of any bugs in it. It is possible that there is some problem in the BSD TCP stack, aside from the lack of a timeout for the FIN_WAIT_2 state, exposed by the lingering_close code that causes the observed problems.

Connections in FIN_WAIT_2 and Apache
And the whole purpose of lingering_close() is to handle TCP half-closes when you can't use the SO_LINGER option when creating the socket!
So Mark and I go back and forth a few times and I finally send Mark the following:
Okay, looking over Stevens (UNIX Network Programming [1990], TCP/IP Illustrated Volume 1 [1994], TCP/IP Illustrated Volume 2 [1995]) and the Apache source code, here's what is going on.
The TCP/IP stack itself (under UNIX, this happens in the kernel) is responsible for sending out the various packet types of SYN, ACK, FIN, RST, etc. in response to what is done in user code. Ideally, for the server code, you would do (using the Berkeley sockets API since that's all the reference I have right now, and ignoring errors, which would only cloud the issue at hand):

memset(&sin,0,sizeof(sin));
sin.sin_family      = AF_INET;
sin.sin_addr.s_addr = INADDR_ANY;
sin.sin_port        = htons(port);	/* usually 80 for HTTP */

mastersock = socket(AF_INET,SOCK_STREAM,0);
one        = 1;
setsockopt(mastersock,SOL_SOCKET,SO_REUSEADDR,&one,sizeof(one));
bind(mastersock,(struct sockaddr *)&sin,sizeof(sin));
listen(mastersock,5);

while(...)
{
  struct linger lingeropt;
  socklen_t     length;
  int           sock;
  int           opt;

  length = sizeof(sin);
  sock   = accept(mastersock,(struct sockaddr *)&sin,&length);

  opt                = 1;
  lingeropt.l_onoff  = 1;
  lingeropt.l_linger = SOME_TIME_OUT_IN_SECS;

  setsockopt(sock,IPPROTO_TCP,TCP_NODELAY,&opt,sizeof(opt));
  setsockopt(sock,SOL_SOCKET,SO_LINGER,&lingeropt,sizeof(struct linger));

  /*---------------------------------------------------
  ; assuming HTTP/1.1, keep handling requests until
  ; a non-200 response is required, or the client
  ; sends a Connection: close or closes its side of the
  ; connection.  When that happens, we can just close
  ; our side and everything is taken care of.
  ;----------------------------------------------------*/

  close(sock);
}

There are two problems with this, though, that Apache attempts to deal with: 1) close() blocks if SO_LINGER is specified (not all TCP/IP stacks do this, just most it seems) and 2) TCP/IP stacks that have no timeout value in the FIN_WAIT_2 state (which means sockets may be consumed if the FIN_WAIT_2 states don't clear).

Apache handles #2 by:
if (TCP/IP stack has no timeout in FIN_WAIT_2 state)
   && (client is a known client that can't handle persistent connections properly)
then
  downgrade to HTTP/1.0.
end

(Apache will also downgrade to HTTP/1.0 for other browsers because they can't handle persistent connections properly anyway, and Apache will prevent them from crashing themselves, but I'm digressing here … )
Now, Apache handles #1 by rolling its own lingering close in userspace by writing any data it needs to the client, calling shutdown(sock,SHUT_WR), setting timeouts (alarm(), timeout struct in select(), etc.) and reading any pending data from the client before issuing the close() (and it never calls setsockopt(SO_LINGER) in this case). The reason Apache does this is because it needs to continue processing after the close(), and having close() block will affect the response time of Apache; that, and it seems some TCP/IP stacks can't handle SO_LINGER anyway and may crash (or seriously affect the throughput).

So, if you don't mind close() blocking (on a socket with SO_LINGER) and the TCP/IP stack won't puke or mishandle the socket, then the best bet would be to use SO_LINGER. Otherwise, you will have to do what Apache does and do something like:

write(sock,pendingdata,sizeof(pendingdata));
shutdown(sock,SHUT_WR);
alarm(SOME_TIME_OUT_IN_SECS);
FD_ZERO(&fdlist);
do
{
  FD_SET(sock,&fdlist);
  tv.tv_sec  = SOME_SMALLER_TIME_OUT_IN_SECS;
  tv.tv_usec = 0;
  rc = select(FD_SETSIZE,&fdlist,NULL,NULL,&tv);
} while((rc > 0) && (read(sock,dummybuf,sizeof(dummybuf)) > 0));
close(sock);
alarm(0);

(Apache has SOME_TIME_OUT_IN_SECS equal to 30 and SOME_SMALLER_TIME_OUT_IN_SECS as 2.)

And in going over the Apache code more carefully, it does seem that Apache will use its own version of a lingering close for Linux. Heck, I can't find an OS that Apache supports on which it actively uses SO_LINGER (and I'm checking the latest version of 1.3).

I'm not sure how you want to handle this, since the shutdown() call can close down either the read half, the write half (which is what the webserver needs to do in the case above) or both halves. The code you have for HttpdSocket::Shutdown() should probably do something close to what I have above if you aren't using SO_LINGER, and if you are using SO_LINGER, then all it has to do is call close().
That seems to have cleared up most of the misunderstandings we've been having, and now we're down to figuring out some minor details, as the architecture Mark has chosen for his webserver makes the possible blocking on close() not that much of an issue, and more modern TCP/IP stacks probably implement SO_LINGER correctly (or at least to the degree that they don't puke or mishandle the option).