The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, May 25, 2022

URI encoding

I've fallen into a rabbit hole of URI encoding and decoding, so why not publish my results here, where I at least know I can look them up again? And who knows? Maybe someone else will find this useful.

Anyway, there are two standards that define URIs:

  1. RFC-3986: Uniform Resource Identifier (URI): Generic Syntax
  2. URL: Living Standard

The first is from the IETF and what most non-browsers that deal with URIs use. The second is from the WHATWG (and while WHATWG stands for “Web Hypertext Application Technology Working Group,” I always read that as “What Working Group?” which gives away my opinions on this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).

RFC-3986 is quite clear on when to encode and decode characters:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

RFC-3986, section 2.4: When to Encode or Decode
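As a rough sketch of what that section is getting at, here is the encode-once, split-then-decode rule using Python's standard urllib.parse. The file name and the example.com host are made up for illustration, and the splitting is deliberately naive; it is not a full URI parser.

```python
from urllib.parse import quote, unquote

# Hypothetical data: a file name holding a literal "%" and a literal "/".
name = "100%/report"

# Encode exactly once, while producing the URI from its component parts.
segment = quote(name, safe="")            # '100%25%2Freport'
uri = "https://example.com/files/" + segment

# Right: separate the path segments first, then decode each one once.
path_segments = uri.split("://", 1)[1].split("/")[1:]
right = [unquote(s) for s in path_segments]

# Wrong: decoding first turns the data "/" into a path delimiter.
wrong = unquote(uri).split("://", 1)[1].split("/")[1:]

# Wrong: encoding an already encoded string mangles "%25" into "%2525".
twice = quote(segment, safe="")

print(right)   # ['files', '100%/report']
print(wrong)   # ['files', '100%', 'report']
print(twice)   # 100%2525%252Freport
```

The only difference between the right and wrong results is the order of splitting and decoding, which is the whole point of the quoted paragraphs.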

But you do have to read the ABNF carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow as it describes in all-too-verbose English the algorithm of how to encode and decode a URI, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I've created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):

URL percent-encoding chart (per RFC-3986)
char   class       scheme  auth  path  query  fragment  note
SPACE              -       Y     Y     Y      Y
!      sub-delim   -       m     m     m      m
"                  -       Y     Y     Y      Y
#      gen-delim   -       m     m     m      m         4
$      sub-delim   -       m     m     m      m
%      escape      -       Y     Y     Y      Y
&      sub-delim   -       m     m     m      m
'      sub-delim   -       m     m     m      m
(      sub-delim   -       m     m     m      m
)      sub-delim   -       m     m     m      m
*      sub-delim   -       m     m     m      m
+      sub-delim   N       m     m     m      m
,      sub-delim   -       m     m     m      m
-      unreserved  N       N     N     N      N
.      unreserved  N       N     N     N      N
/      gen-delim   -       m     m     N      N
0      unreserved  N       N     N     N      N
1      unreserved  N       N     N     N      N
2      unreserved  N       N     N     N      N
3      unreserved  N       N     N     N      N
4      unreserved  N       N     N     N      N
5      unreserved  N       N     N     N      N
6      unreserved  N       N     N     N      N
7      unreserved  N       N     N     N      N
8      unreserved  N       N     N     N      N
9      unreserved  N       N     N     N      N
:      gen-delim   -       m     N     N      N         2
;      sub-delim   -       m     m     m      m
<                  -       Y     Y     Y      Y
=      sub-delim   -       m     m     m      m
>                  -       Y     Y     Y      Y
?      gen-delim   -       m     m     N      N
@      gen-delim   -       m     N     N      N
A      unreserved  N       N     N     N      N
B      unreserved  N       N     N     N      N
C      unreserved  N       N     N     N      N
D      unreserved  N       N     N     N      N
E      unreserved  N       N     N     N      N
F      unreserved  N       N     N     N      N
G      unreserved  N       N     N     N      N
H      unreserved  N       N     N     N      N
I      unreserved  N       N     N     N      N
J      unreserved  N       N     N     N      N
K      unreserved  N       N     N     N      N
L      unreserved  N       N     N     N      N
M      unreserved  N       N     N     N      N
N      unreserved  N       N     N     N      N
O      unreserved  N       N     N     N      N
P      unreserved  N       N     N     N      N
Q      unreserved  N       N     N     N      N
R      unreserved  N       N     N     N      N
S      unreserved  N       N     N     N      N
T      unreserved  N       N     N     N      N
U      unreserved  N       N     N     N      N
V      unreserved  N       N     N     N      N
W      unreserved  N       N     N     N      N
X      unreserved  N       N     N     N      N
Y      unreserved  N       N     N     N      N
Z      unreserved  N       N     N     N      N
[      gen-delim   -       m     m     m      m         2,3,4
\                  -       Y     Y     Y      Y         1
]      gen-delim   -       m     m     m      m         2,3,4
^                  -       Y     Y     Y      Y         2,3,4
_      unreserved  -       N     N     N      N
`                  -       Y     Y     Y      Y         3
a      unreserved  N       N     N     N      N
b      unreserved  N       N     N     N      N
c      unreserved  N       N     N     N      N
d      unreserved  N       N     N     N      N
e      unreserved  N       N     N     N      N
f      unreserved  N       N     N     N      N
g      unreserved  N       N     N     N      N
h      unreserved  N       N     N     N      N
i      unreserved  N       N     N     N      N
j      unreserved  N       N     N     N      N
k      unreserved  N       N     N     N      N
l      unreserved  N       N     N     N      N
m      unreserved  N       N     N     N      N
n      unreserved  N       N     N     N      N
o      unreserved  N       N     N     N      N
p      unreserved  N       N     N     N      N
q      unreserved  N       N     N     N      N
r      unreserved  N       N     N     N      N
s      unreserved  N       N     N     N      N
t      unreserved  N       N     N     N      N
u      unreserved  N       N     N     N      N
v      unreserved  N       N     N     N      N
w      unreserved  N       N     N     N      N
x      unreserved  N       N     N     N      N
y      unreserved  N       N     N     N      N
z      unreserved  N       N     N     N      N
{                  -       Y     Y     Y      Y         3,4
|                  -       Y     Y     Y      Y         2
}                  -       Y     Y     Y      Y         3,4
~      unreserved  -       N     N     N      N
  1. WHATWG: “\” is treated as a “/” in path segment
  2. WHATWG: character not encoded in path
  3. WHATWG: character not encoded in query
  4. WHATWG: character not encoded in fragment
Encoding Key
Y  always encode
N  never encode
m  encode only when not used for its defined purpose (URI scheme dependent)
-  not allowed, even escaped
Character classes as defined by RFC-3986
unreserved  characters that never need to be encoded
gen-delim   characters defined as general use delimiters
sub-delim   characters defined as a potential delimiter for subcomponents in a URI
escape      character defined to escape other characters
(blank)     characters not otherwise defined, and thus must be escaped

Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
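To make the chart a bit more concrete, here is a minimal sketch of an encoder that follows it. The function name encode_component and its keep parameter are my own invention for illustration, not anything from either standard. It works on UTF-8 bytes (an assumption on my part), covers only the RFC-3986 columns, ignores the WHATWG notes, and leaves it to the caller to decide which “m” characters are acting as delimiters in a given component.

```python
import string

# Rows marked "unreserved" in the chart: never encoded.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Rows marked "gen-delim" or "sub-delim": the "m" entries.
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")


def encode_component(text, keep=""):
    """Percent-encode text for a single URI component.

    keep (hypothetical parameter) lists the delimiter characters the
    caller is using for their defined purpose; they pass through as is.
    Anything in keep that is not a gen-delim or sub-delim is ignored,
    so the "Y" rows and the escape character are always encoded.
    """
    allowed = UNRESERVED | (set(keep) & (GEN_DELIMS | SUB_DELIMS))
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if byte < 128 and ch in allowed:
            out.append(ch)
        else:
            # Covers the "Y" rows, "%", control codes 0 to 31,
            # and everything from 127 up.
            out.append(f"%{byte:02X}")
    return "".join(out)


print(encode_component("50% off{x}"))          # 50%25%20off%7Bx%7D
print(encode_component("k=v&x", keep="=&"))    # k=v&x
print(encode_component("k=v&x"))               # k%3Dv%26x
```

The second and third calls show the “m” case: the same “=” and “&” pass through when the caller says they are delimiters and get encoded when they are plain data.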

References


Notes on an overheard conversation about The Great American Tag Sale with Martha Stewart

“I think Martha's spent too much time hanging with Snoop Dogg.”

“What makes you say that?”

“Look at her! Her dress, her hair, the 50-yard stare into nothing.”

“Maybe you're not used to seeing her at home.”

“Maybe … ”

“Besides, maybe she learned that while in prison.”

“Oh yeah! She did do time in the pokey, didn't she?”

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.