The Boston Diaries

Sunday, November 25, 2001

Twelve Hours

Twelve hours.

Twelve hours and I still didn't find what was wrong.

I spent a good portion of last night and, well, this morning (didn't get to bed until 10:00 am) working on a project for a client. When you freelance … okay, when I freelance, I can lose track of time, and that's why I found myself working on a project on a Saturday night/Sunday morning.

The project itself isn't that hard. Data mining. Okay, nothing sexy like hacking a government site in sixty seconds with a gun to your head and getting a blowjob but hey, it's a living. And since it's pulling down pages from a webserver (it's public information by the way) it can't be that hard, right?

Right?

Twelve hours.

First off, the server I'm pulling from is a Microsoft IIS server and well … you have to be delusional if you think Microsoft follows standards to the letter. I already have to work around a few IIS bugs.


14.30 Location

   The Location response-header field is used to redirect the recipient
   to a location other than the Request-URI for completion of the
   request or identification of a new resource. For 201 (Created)
   responses, the Location is that of the new resource which was created
   by the request. For 3xx responses, the location SHOULD indicate the
   server's preferred URI for automatic redirection to the resource. The
   field value consists of a single absolute URI.

       Location       = "Location" ":" absoluteURI

   An example is:

       Location: http://www.w3.org/pub/WWW/People.html

§14.30 of RFC-2616

Right there. Location: contains an absolute URI. But Microsoft? Nah, that would be like … following a standard or something, so when an IIS server sends out a Location: header, it's relative to the base URI the webserver was given. Well, I've worked around that bug long ago, as well as the bug that IIS servers sometimes hand out two sets of headers.
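A workaround for the relative Location: header can be sketched in a few lines of Python. This is only an illustration, not the code from the homegrown library mentioned in this entry; the function name resolve_location and the example.com URLs are made up for the example.

```python
from urllib.parse import urljoin


def resolve_location(request_uri: str, location: str) -> str:
    """Turn a Location: value into an absolute URI.

    An absolute value (what RFC-2616 asks for) comes back unchanged;
    a relative one is resolved against the URI that was requested.
    """
    return urljoin(request_uri, location.strip())


# Hypothetical request URI, used only for illustration.
base = "http://www.example.com/app/login.asp"

print(resolve_location(base, "main.asp"))
print(resolve_location(base, "/other/page.asp"))
print(resolve_location(base, "http://www.w3.org/pub/WWW/People.html"))
```

The three calls cover a path relative to the current document, a path relative to the server root, and a value that was already absolute.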

So that's a known quantity. This should be easy enough.

Twelve hours. It's become a mantra.

Now, even though the information is public (mandated by law no less) the owners of the site aren't going to make it easy to actually get to the information. Oh no. The whole site is framed in frames. Hit the wrong URL or neglect to send the correct Referer: header and you get bumped back to a frame.

Annoying, but having to deal with session tracking cookies is even worse. Attempt to avoid using cookies, and “Sorry, the site requires cookies.”

And you can't even get into the site until you click through their licence agreement.

Oh, did I mention this is public information I am pulling out?

I've never dealt with cookies before and well, there's a reason why I never bothered before. Simple in theory but the devil is in the details.

I've been picking through the site with Lynx to figure out which URLs I need to grab, which URLs I need as referring pages, and the minimum cookie support I need (since my own homegrown library doesn't exactly support cookies), and my code isn't working.

I find more places where Microsoft's IIS is breaking the standard:
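For a sense of what "minimum cookie support" might look like, here is a rough Python sketch. It assumes the site only needs a session cookie echoed back plus the right Referer: header; it ignores path, domain and expiry entirely. The class MiniCookieJar, the function build_headers, and the cookie name and URL in the usage lines are all hypothetical.

```python
from http.cookies import SimpleCookie


class MiniCookieJar:
    """Bare-minimum cookie store: remembers name=value pairs only."""

    def __init__(self):
        self.cookies = {}

    def remember(self, set_cookie_value: str) -> None:
        """Record the cookies found in one Set-Cookie: header value."""
        parsed = SimpleCookie()
        parsed.load(set_cookie_value)
        for name, morsel in parsed.items():
            self.cookies[name] = morsel.value

    def header(self) -> str:
        """Return the value to send back in a Cookie: header."""
        return "; ".join(f"{n}={v}" for n, v in self.cookies.items())


def build_headers(jar: MiniCookieJar, referer: str) -> dict:
    """Assemble the extra request headers the site insists on."""
    headers = {"Referer": referer}
    if jar.cookies:
        headers["Cookie"] = jar.header()
    return headers


# Hypothetical values, for illustration only.
jar = MiniCookieJar()
jar.remember("ASPSESSIONIDQQGGGXYZ=ABCDEFGH; path=/")
print(build_headers(jar, "http://www.example.com/frame.asp"))
```

A real client would also have to honor the path, domain and expiry attributes, which is where the devil in the details lives.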


   The action performed by the POST method might not result in a
   resource that can be identified by a URI. In this case, either 200
   (OK) or 204 (No Content) is the appropriate response status,
   depending on whether or not the response includes an entity that
   describes the result.

   If a resource has been created on the origin server, the response
   SHOULD be 201 (Created) and contain an entity which describes the
   status of the request and refers to the new resource, and a Location
   header (see section 14.30).

   Responses to this method are not cacheable, unless the response
   includes appropriate Cache-Control or Expires header fields. However,
   the 303 (See Other) response can be used to direct the user agent to
   retrieve a cacheable resource.

§9.5 of RFC-2616

Okay, so I guess Microsoft weasels out with the should clause there, because what it does is send out a 302 (moved temporarily), and I immediately POST to the new location, where:


   If the 302 status code is received in response to a request other
   than GET or HEAD, the user agent MUST NOT automatically redirect the
   request unless it can be confirmed by the user, since this might
   change the conditions under which the request was issued.

      Note: RFC 1945 and RFC 2068 specify that the client is not allowed
      to change the method on the redirected request.  However, most
      existing user agent implementations treat 302 as if it were a 303
      response, performing a GET on the Location field-value regardless
      of the original request method. The status codes 303 and 307 have
      been added for servers that wish to make unambiguously clear which
      kind of reaction is expected of the client.

§10.3.3 of RFC-2616

You can't win coming or going. So in this case, not only is Microsoft IIS possibly in the wrong, but nearly every browser is too! Including the aforementioned Lynx. Although in my case, I don't change the method (frankly, it never occurred to me to do such a thing).
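The three behaviors in play for a 302 can be laid side by side in a small Python sketch. The function method_after_302 and its policy names are invented for this illustration: "strict" follows the quoted text of §10.3.3, "browser" is what the RFC's note says most user agents do, and "repost" keeps the original method, as described above.

```python
from typing import Optional


def method_after_302(method: str, policy: str = "strict") -> Optional[str]:
    """Pick the method for the request that follows a 302 response.

    Returns None when the redirect should not be followed without
    asking the user first.
    """
    method = method.upper()
    if method in ("GET", "HEAD"):
        return method      # safe to follow automatically
    if policy == "strict":
        return None        # RFC-2616: needs confirmation by the user
    if policy == "browser":
        return "GET"       # treat the 302 as if it were a 303
    if policy == "repost":
        return method      # resend with the original method
    raise ValueError(f"unknown policy: {policy}")


for policy in ("strict", "browser", "repost"):
    print(policy, method_after_302("POST", policy))
```

Only the "strict" branch matches the MUST NOT in the quoted section; the other two are there for comparison.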

Twelve hours.

So I'm spending my time trying to figure out why my code isn't working and yet Lynx is. I enable tracing in Lynx. It doesn't tell me anything that I don't already know. I'm adding headers. I'm mimicking headers.

Twelve hours.

At 10:00 am I give up and head to bed.

I get up and decide to record the actual traffic between my workstation and the server in question, to see exactly what is going on. So I record a session with Lynx and one with my software, then look at the raw packets to see what is different between the two.

And that's when I want to slap myself up the head with a large and rather heavy blunt object.

Because it's a problem with my code. In fact, it was a feature of my code that I completely forgot about, seeing how I wrote the code in question back in 1997 (and the last server bug workaround code was added in 1999).

You see, when I was setting the headers to be sent with the request, I was including the characters CR and LF at the end (since that's part of the spec: header lines are separated by those characters), when the code I wrote already added the same characters to each header line as it was being sent out. So each of those headers went out followed by an empty line, and an empty line is what ends the headers of a request.
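The effect of the doubled CR LF is easy to reproduce. The Python sketch below is a reconstruction under assumptions, not the 1997 code: build_request stands in for a sender that appends CR LF to every header line, headers_seen_by_server stands in for a server that stops reading headers at the first empty line, and the header values are made up.

```python
CRLF = "\r\n"


def build_request(method, path, headers):
    """Serialize a request, appending CRLF to every line as it goes out."""
    lines = [f"{method} {path} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers]
    return "".join(line + CRLF for line in lines) + CRLF


def headers_seen_by_server(raw):
    """Return the header lines that appear before the first empty line."""
    head, _, _ = raw.partition(CRLF + CRLF)
    return head.split(CRLF)[1:]


def clean(value):
    """Strip any trailing CR or LF the caller tacked on by mistake."""
    return value.rstrip("\r\n")


# Hypothetical header values that already end in CR LF (the mistake).
headers = [
    ("Referer", "http://www.example.com/frame.asp\r\n"),
    ("Cookie", "session=abc\r\n"),
]

bad = build_request("GET", "/data.asp", headers)
good = build_request("GET", "/data.asp", [(n, clean(v)) for n, v in headers])

print(headers_seen_by_server(bad))
print(headers_seen_by_server(good))
```

In the broken request only the first header arrives as a header; everything after the stray empty line looks like a body to the server. Stripping trailing CR and LF before serializing is one way to make a sender robust against this.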

So no wonder it wasn't working.

Twelve hours.

You can smack me now.
