“I think Martha's spent too much time hanging with Snoop Dogg.”
“What makes you say that?”
“Look at her! Her dress, her hair, the 50-yard stare into nothing.”
“Maybe you're not used to seeing her at home.”
“Maybe … ”
“Besides, maybe she learned that while in prison.”
“Oh yeah! She did do time in the pokey, didn't she?”
I've fallen down a rabbit hole of URI encoding and decoding, so why not publish my results here? At least then I'll have a place where I know I can look them up again. And who knows? Maybe someone else will find this useful.
Anyway, there are two standards that define URIs:
The first is from the IETF and is what most non-browser software that deals with URIs uses. The second is from the WHATWG (and while WHATWG stands for “Web Hypertext Application Technology Working Group,” I always read that as “What Working Group?” which gives away my opinions on this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).
RFC-3986 is quite clear on when to encode and decode characters:
Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
RFC-3986, section 2.4: When to Encode or Decode
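The warning about encoding or decoding the same string twice is easy to see in practice. Here's a minimal sketch using Python's standard `urllib.parse` (my choice for illustration; nothing in the RFC depends on Python), with made-up sample strings:

```python
from urllib.parse import quote, unquote

# Raw data that happens to contain a literal percent sign.
data = "100%"

once = quote(data, safe="")    # '%' becomes '%25'
twice = quote(once, safe="")   # the '%' of '%25' gets encoded again

print(once)    # 100%25
print(twice)   # 100%2525

# Decoding once recovers the data; decoding the doubly-encoded
# form once only gets back to the singly-encoded form.
assert unquote(once) == data
assert unquote(twice) == once

# Decoding twice misreads data as an escape: the literal text "%7E"
# (encoded as "%257E") turns into "~" on the second pass.
literal = "%7E"
encoded = quote(literal, safe="")
assert unquote(encoded) == literal
assert unquote(unquote(encoded)) == "~"
```

Passing `safe=""` tells `quote()` to encode "/" as well, which is what you'd want for data inside a single component.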
But you do have to read the ABNF carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow as it describes in all-too-verbose English the algorithm of how to encode and decode a URI, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I've created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):
- WHATWG: “\” is treated as a “/” in path segment
- WHATWG: character not encoded in path
- WHATWG: character not encoded in query
- WHATWG: character not encoded in fragment
|m||only encode when not used for their defined purpose (URI scheme dependent)|
|-||not allowed, even escaped|
|unreserved||characters that never need to be encoded|
|gen-delim||characters defined as general use delimiters|
|sub-delim||characters defined as a potential delimiter for subcomponents in a URI|
|escape||character defined to escape other characters|
|characters not otherwise defined, and thus must be escaped.|
Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
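For data that goes inside a single component, the most conservative reading of the rules above is to leave only the unreserved characters alone and escape everything else. Here's a rough sketch of that reading; `encode_component` is a name I made up for illustration, and it assumes the text is encoded as UTF-8 before escaping:

```python
import string

# The RFC-3986 "unreserved" set: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")

def encode_component(text: str) -> str:
    """Percent-encode every UTF-8 octet of `text` that is not unreserved."""
    out = []
    for octet in text.encode("utf-8"):
        ch = chr(octet)
        if ch in UNRESERVED:
            out.append(ch)                # never needs encoding
        else:
            out.append(f"%{octet:02X}")   # reserved, control, or non-ASCII
    return "".join(out)

print(encode_component("a b/c~d"))   # a%20b%2Fc~d
```

This over-encodes reserved characters that a given scheme might let through as data, but it never produces an ambiguous URI.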
- Uniform Resource Identifier Schemes
- URL Interop
- URL Specification
- A practical guide to URI encoding and URI decoding
- (Please) Stop Using Unsafe Characters in URLs
- Exploiting URL Parsers: The Good, Bad, And Inconsistent
Our customer, The Oligarchic Cell Phone Company, wants us to do a demo of a new feature for a certain class of clients. “Project: Lumbergh” will receive a URL along with the name and reputation of a phone number it gets from elsewhere. “Project: Lumbergh” will then pass this along to “Project: Sippy-Cup.” We already have to deal with URLs from elsewhere. The only change we have to make is allowing URLs to be passed along to the certain class of clients, which formerly did not get URLs. So far, so good.
But then I saw code being added to “Project: Lumbergh” to check the URLs to see if the path portion ended in
I enquired about this,
because to me,
that makes no sense—we're just a conduit for data;
the source of the URL should already know what it can and can't send to the client.
I was told that the certain class of clients only support BMP files while other clients that can receive URLs can't support BMP files,
so we have to ensure that BMPs only go to the subset of clients that can support them.
I countered with the fact that we include information about the client to the data source when we query them,
and they should have the logic to handle this on their end. Why are we suddenly responsible for this?
I was told that the LOF for the data source would be too large to handle by the demo deadline, that we had to handle it,
that the code that just looks anywhere in the URL for a literal “.bmp” is Good Enough™,
and to stop with the questions.
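To show why “anywhere in the URL” and “the path ends in” are not the same check, here is a small sketch. This is not the code from “Project: Lumbergh”; the function names and sample URLs are invented, and the case-insensitive comparison is my own assumption:

```python
from urllib.parse import urlsplit

def mentions_bmp(url: str) -> bool:
    """The Good Enough check: a literal ".bmp" anywhere in the URL."""
    return ".bmp" in url

def path_ends_in_bmp(url: str) -> bool:
    """Look only at the path component, ignoring query and fragment."""
    return urlsplit(url).path.lower().endswith(".bmp")

# A PNG whose query string happens to mention a BMP.
png = "https://example.com/logo.png?old=logo.bmp"
print(mentions_bmp(png), path_ends_in_bmp(png))    # True False

# A real BMP with an upper-case extension.
bmp = "https://example.com/img/LOGO.BMP"
print(mentions_bmp(bmp), path_ends_in_bmp(bmp))    # False True
```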
Now the URL we're given is “percent-encoded”—we get something like:
Never mind the fact that that is an invalid URL to begin with
(you aren't supposed to encode characters that are defined as delimiters in URLs if they are being used as delimiters);
that's what we get and pass along.
(a few years after we started passing URLs along like this)
the clients can't properly decode them
so of course we have to do that.
I asked why we even had to do that and was told that the LOF for the data source would be too large to handle by the demo deadline, we had to handle it,
and to stop with the questions.
I then complained that the code doing that was doing too much,
as it would decode the so-called “unsafe characters” from RFC-3986
(which aren't defined in the RFC,
but can be derived by a careful reading between the lines),
like the dreaded space character.
There was then much back and forth between me and my manager
(it's not who I thought it was but that's another rant for another time)
about what should and shouldn't be decoded.
I kept saying that if we have to embrace the stupid,
we might as well do it right,
but my manager argued against doing that, saying we should just decode
%2F since that's all that's being asked of us today.
I countered with “What about tomorrow,
when we're asked to decode
%3F (‘?’) and
(which are delimiter characters per RFC-3986)
I was told to stop with the questions.
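For what it's worth, here is roughly what I mean by “doing it right”: decode the escapes for the delimiter characters and nothing else, so an encoded space stays encoded. This is a sketch and not the production code; `decode_reserved_only` and the sample URL are made up for illustration, and it only makes sense when the sender has (wrongly) encoded delimiters that really are delimiters:

```python
import re

# gen-delims and sub-delims from RFC-3986, section 2.2
GEN_DELIMS = ":/?#[]@"
SUB_DELIMS = "!$&'()*+,;="
RESERVED = frozenset(GEN_DELIMS + SUB_DELIMS)

ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_reserved_only(url: str) -> str:
    """Decode escapes of reserved characters; leave every other escape alone."""
    def replace(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in RESERVED else match.group(0)
    return ESCAPE.sub(replace, url)

received = "https%3A%2F%2Fexample.com%2Fa%20b%3Fx%3D1"
print(decode_reserved_only(received))   # https://example.com/a%20b?x=1
```

Note that `%25` is left alone as well, since "%" is not a delimiter, which keeps the result from being decoded twice by accident.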
And then all hell breaks loose when we get
People would come to us with a problem, and we would figure out a solution. We couldn't just search the web because the web was still being written. And you couldn't just punt a hard question to the engineer in the desk next to you. Why? Because you were sitting alone in a utility closet packed with floppy disks and old tape drives.
Ah, this takes me back. I got my first computer back in 1984, and if I wanted to know anything about it I was on my own. Google didn't exist (the public Internet didn't exist at the time). I didn't have anyone I could ask about computer related things. I did have books and magazines. So between experimentation and learning to read between the lines, I picked up programming.
So when it came time to write a metasearch engine, there were no tutorials. There were no open source metasearch engines to download and use. There was only the problem of writing a metasearch engine, in a language I didn't even know (and which itself was less than a year old at the time).
So I always found it odd when people would go online asking for tutorials, especially for writing metasearch engines (and yes, that did happen back then). So when something like testing a negative comes up, and I can't convince the Powers That Be that it's never a good idea to prove a negative, I can't just look up some tutorial on proving negatives—I just have to figure it out on my own.
And, in fact, anyone with any proximity to software development has likely heard rumblings about Agile. For all the promise of the manifesto, one starts to get the sense when talking to people who work in technology that laboring under Agile may not be the liberatory experience it’s billed as. Indeed, software development is in crisis again—but, this time, it’s an Agile crisis. On the web, everyone from regular developers to some of the original manifesto authors is raising concerns about Agile practices. They talk about the “Agile-industrial complex,” the network of consultants, speakers, and coaches who charge large fees to fine-tune Agile processes. And almost everyone complains that Agile has taken a wrong turn: somewhere in the last two decades, Agile has veered from the original manifesto’s vision, becoming something more restrictive, taxing, and stressful than it was meant to be.
Part of the issue is Agile’s flexibility. Jan Wischweh, a freelance developer, calls this the “no true Scotsman” problem. Any Agile practice someone doesn’t like is not Agile at all, it inevitably turns out. The construction of the manifesto makes this almost inescapable: because the manifesto doesn’t prescribe any specific activities, one must gauge the spirit of the methods in place, which all depends on the person experiencing them. Because it insists on its status as a “mindset,” not a methodology, Agile seems destined to take on some of the characteristics of any organization that adopts it. And it is remarkably immune to criticism, since it can’t be reduced to a specific set of methods. “If you do one thing wrong and it’s not working for you, people will assume it’s because you’re doing it wrong,” one product manager told me. “Not because there’s anything wrong with the framework.”
That last line,
“it's not working for you,
people will assume it's because you're doing it wrong,”
rings really true to me.
I no longer work for The Corporation;
I now work for The Enterprise, now that the Corporate Overlords have finally taken over.
At The Enterprise,
I've been informing them pretty much all this year that this “Agile” development system they're forcing on us isn't working.
Before they finally took over,
the team I was on was always on time,
with few bad deployments (only two in ten years)
and no show-stopping bugs found in production.
As I told upper management,
given our prior track record,
why change how we do development?
Why fix what isn't broken?
And while upper management never said this directly,
through their actions they answered: this is our process,
and we're sticking to it,
slipped schedules and disastrous deployments be damned!
As to why I haven't left yet? Because it seems this “Agile” movement has invaded everywhere and things would be “more of the same” elsewhere. At least here, I'm not forced to use Windows.
OK, why do the Simpsons live in a town called Springfield? Isn't that a little generic?
Springfield was named after Springfield, Oregon. The only reason is that when I was a kid, the TV show “Father Knows Best” took place in the town of Springfield, and I was thrilled because I imagined that it was the town next to Portland, my hometown. When I grew up, I realized it was just a fictitious name. I also figured out that Springfield was one of the most common names for a city in the U.S. In anticipation of the success of the show, I thought, “This will be cool; everyone will think it's their Springfield.” And they do.
So I got to wondering, is Springfield the most popular city name in the US? I know, weird question, but I'm curious. So some quick searching led me to the United States Geological Survey Geographical Names Database. With some massaging of the data, I was able to determine that there are 34 States with a “Springfield,” but it's not alone. There are eight other cities that are also in 34 States: Arlington, Chester, Clinton, Farmington, Florence, Greenville, Milton, and Newport. Okay, maybe not the same 34 states across all those cities, but you get the idea.
But those cities aren't the most popular names. No, all of them are tied for ninth place! The city name that appears in the most states is “Riverside” at 46 States (plus Puerto Rico). The States that don't have a “Riverside” are Alaska, Hawaii, Oklahoma, and Louisiana (really? Louisiana? One of the world's largest rivers runs straight through that state, and no one bothered to name a town in Louisiana “Riverside”?).
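The “massaging” boils down to counting, for each name, how many distinct states it shows up in. A sketch of that counting step, assuming the data has already been reduced to (name, state) pairs; the function names are made up, the handful of rows below are a toy sample I picked by hand, and none of this reflects the real layout of the USGS files:

```python
from collections import defaultdict

def states_per_name(places):
    """Map each place name to the set of states it appears in."""
    states = defaultdict(set)
    for name, state in places:
        states[name].add(state)   # a set ignores repeats within a state
    return states

def rank_names(places):
    """Names sorted by number of distinct states, most states first."""
    counts = {name: len(found) for name, found in states_per_name(places).items()}
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy sample only; NOT the actual dataset.
sample = [
    ("Springfield", "IL"), ("Springfield", "OR"), ("Springfield", "MO"),
    ("Springfield", "IL"), ("Riverside", "CA"), ("Riverside", "IL"),
]
print(rank_names(sample))   # [('Springfield', 3), ('Riverside', 2)]
```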
And just to satisfy the curious:
It's not Star Wars Day—it's Dave Brubeck Day! (and give yourself 10 cool points if you get the reference) Of course, it's only Dave Brubeck day in the US. Elsewhere in the world, Dave Brubeck Day is April 5th for some odd reason (give yourself a geek point for getting this reference).
[And of course Sean didn't tell you he pulled this meme from FaceMeLinkedInstaMySpaceBookWeInGram. He's not that cool to think of this. —Editor]
Martin Chang replied to my musings on processing malformed Gemini requests, saying that double slashes in URIs are illegal, and pointed out the ABNF grammar from the URI specification to back up his claim:
path          = path-absolute   ; begins with "/" but not "//"
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
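But he didn't quote the definition of segment:
segment = *pchar
which translated says,
“0 or more pchar characters.”
So the ABNF he quoted does indeed rule out a path that begins with a double slash.
It doesn't rule out a double slash later in the path,
since by the time we hit the double slash,
we're in the
*( "/" segment ) part of the path-absolute rule, and a
segment can have 0 characters.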
But what he quoted only applies to relative links,
and what I receive is an absolute link.
If you follow the ABNF from that perspective:
URI-reference = URI / relative-ref
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part     = "//" authority path-abempty
              / path-absolute
              / path-rootless
              / path-empty
path-abempty  = *( "/" segment )
; other rules omitted
not only does this allow
I can understand why this was done: to simplify the grammar, as the various
path- rules generally end with
*( "/" segment ), which allows one to end a URI with a trailing slash or not.
I don't think the intent was to allow long strings of slashes,
but that's the end result of a lax grammar.
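One way to check this reading of the grammar is to transcribe the rules into regular expressions and try a few paths. This is a sketch that covers only the two path rules discussed here; the test paths are my own examples, and the regexes are my transcription of the ABNF, not anything official:

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
SEGMENT = rf"{PCHAR}*"       # segment    = *pchar  (may be empty)
SEGMENT_NZ = rf"{PCHAR}+"    # segment-nz = 1*pchar (at least one)

# path-absolute = "/" [ segment-nz *( "/" segment ) ]
PATH_ABSOLUTE = re.compile(rf"/(?:{SEGMENT_NZ}(?:/{SEGMENT})*)?")
# path-abempty  = *( "/" segment )
PATH_ABEMPTY = re.compile(rf"(?:/{SEGMENT})*")

def is_path_absolute(path: str) -> bool:
    return PATH_ABSOLUTE.fullmatch(path) is not None

def is_path_abempty(path: str) -> bool:
    return PATH_ABEMPTY.fullmatch(path) is not None

print(is_path_absolute("//foo"))      # False: can't begin with "//"
print(is_path_absolute("/foo//bar"))  # True:  empty segment in the middle
print(is_path_abempty("//foo"))       # True:  empty first segment
print(is_path_abempty("///"))         # True:  nothing but empty segments
```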
Martin is also correct that multiple slashes are treated as a single slash on POSIX systems
(that is, any Unix system),
but that's not the case across all operating systems.
One exception I can think of is AmigaOS,
where each slash represents a parent directory.
cd /// on AmigaOS is the same as
cd ../../.. on a POSIX system.
And maybe not even relevant these days,
but I thought I should mention it.
I had finished my lunch of a sub sandwich when I noticed a message printed on the wrapper in not-so-small print:
I have no words.
“You know, you forgot to remind me to make your tea.”
“Oh. I need to remind you to make tea.”
“So thank you for reminding me to remind you to make tea.”
“Um, doesn't hitting your head against the wall hurt?”