Twelve hours and I still didn't find what was wrong.
I spent a good portion of last night and, well, this morning (didn't get to bed until 10:00 am) working on a project for a client. When you freelance … okay, when I freelance, I can lose track of time, and that's why I found myself working on a project on a Saturday night/Sunday morning.
The project itself isn't that hard. Data mining. Okay, nothing sexy like hacking a government site in sixty seconds with a gun to your head and getting a blowjob but hey, it's a living. And since it's pulling down pages from a webserver (it's public information by the way) it can't be that hard, right?
First off, the server I'm pulling from is a Microsoft IIS server and well … you have to be delusional if you think Microsoft follows standards to the letter. I already have to work around a few IIS bugs.
14.30 Location

The Location response-header field is used to redirect the recipient to a location other than the Request-URI for completion of the request or identification of a new resource. For 201 (Created) responses, the Location is that of the new resource which was created by the request. For 3xx responses, the location SHOULD indicate the server's preferred URI for automatic redirection to the resource. The field value consists of a single absolute URI.

Location = "Location" ":" absoluteURI

An example is:

Location: http://www.w3.org/pub/WWW/People.html
§14.30 of RFC-2616
Location: contains an absolute URI. But Microsoft? Nah, that
would be like … following a standard or something, so when an IIS server
sends out a
Location: header, it's relative to the base URI the webserver was given.
Well, I've worked around that bug long ago, as well as the bug that IIS
servers sometimes hand out two sets of headers.
So that's a known quantity. This should be easy enough.
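For the curious, the relative Location: workaround boils down to resolving whatever the server hands back against the URL that was just requested. My own library is old homegrown code, so what follows is only a rough Python sketch of the idea. The function name `resolve_location` and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_location(request_url: str, location: str) -> str:
    """Return an absolute URI for a Location value that may be relative.

    request_url is the URL that produced the redirect; location is the raw
    value of the Location: header. An already-absolute value passes through
    unchanged.
    """
    return urljoin(request_url, location)


if __name__ == "__main__":
    base = "http://example.com/app/search.asp"  # hypothetical URL
    print(resolve_location(base, "results.asp?id=1"))
    print(resolve_location(base, "/login.asp"))
    print(resolve_location(base, "http://other.example.com/x"))
```

A client that does this copes with both the standard absolute form and the relative form IIS sends.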
Twelve hours. It's become a mantra.
Now, even though the information is public (mandated by law no
less) the owners of the site aren't going to make it easy to actually
get to the information. Oh no. The whole site is framed in
frames. Hit the wrong URL or neglect to send the correct
header and you get bumped back to a frame.
Annoying, but having to deal with session tracking cookies is even worse. Attempt to avoid using cookies, and “Sorry, the site requires cookies.”
And you can't even get into the site until you click through their licence agreement.
Oh, did I mention this is public information I am pulling out?
I've never dealt with cookies before and well, there's a reason why I never bothered before. Simple in theory but the devil is in the details.
I've been picking through the site with Lynx to figure out which URLs I need to grab, which URLs I need as referring pages, and the minimum cookie support I need (since my own homegrown library doesn't exactly support cookies), and my code isn't working.
I find more places where Microsoft's IIS is breaking the standard:
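By "minimum cookie support" I mean roughly this: remember the name=value part of every Set-Cookie header the server sends, and hand them all back in a Cookie header on the next request. Here is a rough Python sketch of that idea (not my actual code). It deliberately ignores path, domain, and expiry attributes, which a real client would have to honour, and the function name `cookie_header` and the cookie values are invented for the example.

```python
def cookie_header(set_cookie_values):
    """Build a Cookie: header value from raw Set-Cookie: header values.

    Only the leading name=value pair of each Set-Cookie value is kept;
    attributes such as path or expires are dropped (an assumption that is
    only good enough for a single-site, single-session scraper).
    """
    jar = {}
    for raw in set_cookie_values:
        pair = raw.split(";", 1)[0]          # text before the first attribute
        name, sep, value = pair.partition("=")
        if sep:                               # skip anything without an "="
            jar[name.strip()] = value.strip()
    return "; ".join(f"{name}={value}" for name, value in jar.items())


if __name__ == "__main__":
    received = ["SESSIONID=abc123; path=/", "lang=en; path=/app"]
    print("Cookie: " + cookie_header(received))
```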
The action performed by the POST method might not result in a resource that can be identified by a URI. In this case, either 200 (OK) or 204 (No Content) is the appropriate response status, depending on whether or not the response includes an entity that describes the result. If a resource has been created on the origin server, the response SHOULD be 201 (Created) and contain an entity which describes the status of the request and refers to the new resource, and a Location header (see section 14.30). Responses to this method are not cacheable, unless the response includes appropriate Cache-Control or Expires header fields. However, the 303 (See Other) response can be used to direct the user agent to retrieve a cacheable resource.
§9.5 of RFC-2616
Okay, so I guess Microsoft weasels out with the SHOULD clause there, because what it does is send out a 302 (Moved Temporarily), which I immediately follow with a POST to the new location, even though:
If the 302 status code is received in response to a request other than GET or HEAD, the user agent MUST NOT automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued. Note: RFC 1945 and RFC 2068 specify that the client is not allowed to change the method on the redirected request. However, most existing user agent implementations treat 302 as if it were a 303 response, performing a GET on the Location field-value regardless of the original request method. The status codes 303 and 307 have been added for servers that wish to make unambiguously clear which kind of reaction is expected of the client.
§10.3.3 of RFC-2616
You can't win coming or going. So in this case, not only is Microsoft IIS possibly in the wrong, but nearly every browser is too! Including the aforementioned Lynx. Although in my case, I don't change the method (frankly, it never occurred to me to do such a thing).
So I'm spending my time trying to figure out why my code isn't working and yet Lynx is. I enable tracing in Lynx. It doesn't tell me anything that I don't already know. I'm adding headers. I'm mimicking headers.
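To make the difference concrete, here is a small Python sketch of the two behaviours the RFC text describes for a 302: the by-the-book one (don't automatically redirect a non-GET/HEAD request) and the one most browsers actually implement (treat it like a 303 and switch to GET). The function name `method_after_302` and the `browser_like` flag are my own inventions for illustration, and other status codes are left out on purpose.

```python
from typing import Optional


def method_after_302(method: str, browser_like: bool = True) -> Optional[str]:
    """Pick the method for the follow-up request after a 302 response.

    Returns the method to use, or None when the redirect should not be
    followed automatically (the user would have to confirm it).
    """
    method = method.upper()
    if method in ("GET", "HEAD"):
        return method          # safe to follow with the same method
    if browser_like:
        return "GET"           # what most user agents do: act as if it were a 303
    return None                # the letter of the spec: no automatic redirect


if __name__ == "__main__":
    print(method_after_302("POST"))                      # browser behaviour
    print(method_after_302("POST", browser_like=False))  # spec behaviour
```

What my code did matches neither branch: it re-sent the POST to the new location without asking anybody.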
At 10:00 am I give up and head to bed.
I get up and decide to record the actual traffic between my workstation and the server in question, to see exactly what is going on. So I record a session with Lynx and one with my software, then look at the raw packets to see what is different between the two.
And that's when I want to slap myself up the head with a large and rather heavy blunt object.
Because it's a problem with my code. In fact, it was a feature of my code that I completely forgot about, seeing how I wrote the code in question back in 1997 (and the last server bug workaround code was added in 1999).
You see, when I was setting the headers to be sent with the request, I was including the characters CR and LF at the end of each one (since that's part of the spec: header lines are separated by those characters), but the code I wrote back then already added the same characters to each header line as it was being sent out.
So no wonder it wasn't working.
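Why is that fatal? In HTTP an empty line marks the end of the headers, so a doubled CR LF after a header cuts the header block short and everything after it lands in the body. Here is a small Python reconstruction of the mistake. It is not the 1997 code, and the names `build_request` and `headers_seen_by_server` and the example headers are made up for the demonstration.

```python
CRLF = "\r\n"


def build_request(request_line, headers):
    """Serialize a request head the way the old library did: append CR LF
    to every line, then one more CR LF for the blank line that ends the
    header block."""
    lines = [request_line] + list(headers)
    return "".join(line + CRLF for line in lines) + CRLF


def headers_seen_by_server(raw_request):
    """Return the header lines that appear before the first blank line,
    which is where a server stops reading headers."""
    head = raw_request.split(CRLF + CRLF, 1)[0]
    return head.split(CRLF)[1:]  # drop the request line


if __name__ == "__main__":
    good = build_request("GET / HTTP/1.1", ["Host: example.com", "Cookie: a=1"])
    # The bug: the caller also terminates each header with CR LF.
    bad = build_request("GET / HTTP/1.1", ["Host: example.com\r\n", "Cookie: a=1\r\n"])
    print(headers_seen_by_server(good))
    print(headers_seen_by_server(bad))
```

In the broken case only the first header survives. The cookie (and anything else I was so carefully mimicking) never reaches the server as a header at all.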
You can smack me now.