Wednesday, May 25, 2022
URI encoding
I've fallen into a rabbit hole of URI encoding and decoding, so why not publish my results here? At least then I'll have a place where I know I can look them up again. And who knows? Maybe someone else will find this useful.
Anyway, there are two standards that define URIs: RFC-3986 and the WHATWG URL Standard.
The first is from the IETF, and it is what most non-browser software that deals with URIs uses. The second is from the WHATWG, and it is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari). (WHATWG stands for “Web Hypertext Application Technology Working Group,” but I always read it as “What Working Group?”, which gives away my opinion of this group, truth be told.)
RFC-3986 is quite clear on when to encode and decode characters:
Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
RFC-3986, section 2.4: When to Encode or Decode
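To make those rules concrete, here is a small Python sketch using the standard library's `urllib.parse`. The sample strings are made up for illustration. It shows two of the failure modes the section warns about: decoding the same string twice, and decoding before the components have been separated.

```python
from urllib.parse import quote, unquote

# Made-up data: the three literal characters "%", "4", "1".
data = "%41"

encoded = quote(data, safe="")   # "%" becomes "%25"
once = unquote(encoded)          # decoded exactly once: the original data
twice = unquote(once)            # decoded again: "%41" is misread as "A"

# Made-up path whose middle segment contains an encoded slash.
path = "/files/a%2Fb/c"

# Split on the delimiter first, then decode each segment.
right = [unquote(segment) for segment in path.split("/")]

# Decode first, then split: the data slash is mistaken for a delimiter.
wrong = unquote(path).split("/")

print(encoded, once, twice)   # %2541 %41 A
print(right)                  # ['', 'files', 'a/b', 'c']
print(wrong)                  # ['', 'files', 'a', 'b', 'c']
```

The second decode silently turns data into a different string, and the early decode invents an extra path segment.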
But you do have to read the ABNF carefully to find the 10 characters it never mentions, which therefore must be encoded. The WHATWG standard isn't easy to follow, as it describes the algorithm for encoding and decoding a URI in all-too-verbose English, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):
char | type | scheme | auth | path | query | fragment | note |
---|---|---|---|---|---|---|---|
SPACE | | - | Y | Y | Y | Y | |
! | sub-delim | - | m | m | m | m | |
" | | - | Y | Y | Y | Y | |
# | gen-delim | - | m | m | m | m | 4 |
$ | sub-delim | - | m | m | m | m | |
% | escape | - | Y | Y | Y | Y | |
& | sub-delim | - | m | m | m | m | |
' | sub-delim | - | m | m | m | m | |
( | sub-delim | - | m | m | m | m | |
) | sub-delim | - | m | m | m | m | |
* | sub-delim | - | m | m | m | m | |
+ | sub-delim | N | m | m | m | m | |
, | sub-delim | - | m | m | m | m | |
- | unreserved | N | N | N | N | N | |
. | unreserved | N | N | N | N | N | |
/ | gen-delim | - | m | m | N | N | |
0 | unreserved | N | N | N | N | N | |
1 | unreserved | N | N | N | N | N | |
2 | unreserved | N | N | N | N | N | |
3 | unreserved | N | N | N | N | N | |
4 | unreserved | N | N | N | N | N | |
5 | unreserved | N | N | N | N | N | |
6 | unreserved | N | N | N | N | N | |
7 | unreserved | N | N | N | N | N | |
8 | unreserved | N | N | N | N | N | |
9 | unreserved | N | N | N | N | N | |
: | gen-delim | - | m | N | N | N | 2 |
; | sub-delim | - | m | m | m | m | |
< | | - | Y | Y | Y | Y | |
= | sub-delim | - | m | m | m | m | |
> | | - | Y | Y | Y | Y | |
? | gen-delim | - | m | m | N | N | |
@ | gen-delim | - | m | N | N | N | |
A | unreserved | N | N | N | N | N | |
B | unreserved | N | N | N | N | N | |
C | unreserved | N | N | N | N | N | |
D | unreserved | N | N | N | N | N | |
E | unreserved | N | N | N | N | N | |
F | unreserved | N | N | N | N | N | |
G | unreserved | N | N | N | N | N | |
H | unreserved | N | N | N | N | N | |
I | unreserved | N | N | N | N | N | |
J | unreserved | N | N | N | N | N | |
K | unreserved | N | N | N | N | N | |
L | unreserved | N | N | N | N | N | |
M | unreserved | N | N | N | N | N | |
N | unreserved | N | N | N | N | N | |
O | unreserved | N | N | N | N | N | |
P | unreserved | N | N | N | N | N | |
Q | unreserved | N | N | N | N | N | |
R | unreserved | N | N | N | N | N | |
S | unreserved | N | N | N | N | N | |
T | unreserved | N | N | N | N | N | |
U | unreserved | N | N | N | N | N | |
V | unreserved | N | N | N | N | N | |
W | unreserved | N | N | N | N | N | |
X | unreserved | N | N | N | N | N | |
Y | unreserved | N | N | N | N | N | |
Z | unreserved | N | N | N | N | N | |
[ | gen-delim | - | m | m | m | m | 2,3,4 |
\ | | - | Y | Y | Y | Y | 1 |
] | gen-delim | - | m | m | m | m | 2,3,4 |
^ | | - | Y | Y | Y | Y | 2,3,4 |
_ | unreserved | - | N | N | N | N | |
` | | - | Y | Y | Y | Y | 3 |
a | unreserved | N | N | N | N | N | |
b | unreserved | N | N | N | N | N | |
c | unreserved | N | N | N | N | N | |
d | unreserved | N | N | N | N | N | |
e | unreserved | N | N | N | N | N | |
f | unreserved | N | N | N | N | N | |
g | unreserved | N | N | N | N | N | |
h | unreserved | N | N | N | N | N | |
i | unreserved | N | N | N | N | N | |
j | unreserved | N | N | N | N | N | |
k | unreserved | N | N | N | N | N | |
l | unreserved | N | N | N | N | N | |
m | unreserved | N | N | N | N | N | |
n | unreserved | N | N | N | N | N | |
o | unreserved | N | N | N | N | N | |
p | unreserved | N | N | N | N | N | |
q | unreserved | N | N | N | N | N | |
r | unreserved | N | N | N | N | N | |
s | unreserved | N | N | N | N | N | |
t | unreserved | N | N | N | N | N | |
u | unreserved | N | N | N | N | N | |
v | unreserved | N | N | N | N | N | |
w | unreserved | N | N | N | N | N | |
x | unreserved | N | N | N | N | N | |
y | unreserved | N | N | N | N | N | |
z | unreserved | N | N | N | N | N | |
{ | | - | Y | Y | Y | Y | 3,4 |
\| | | - | Y | Y | Y | Y | 2 |
} | | - | Y | Y | Y | Y | 3,4 |
~ | unreserved | - | N | N | N | N | |
1. WHATWG: “\” is treated as a “/” in path segments
2. WHATWG: character not encoded in path
3. WHATWG: character not encoded in query
4. WHATWG: character not encoded in fragment
Y | always encode |
N | never encode |
m | only encode when not used for their defined purpose (URI scheme dependent) |
- | not allowed, even escaped |
unreserved | characters that never need to be encoded |
gen-delim | characters defined as general use delimiters |
sub-delim | characters defined as a potential delimiter for subcomponents in a URI |
escape | character defined to escape other characters |
(blank) | characters not otherwise defined, and thus must be escaped |
Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
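The table can be approximated in code. The following Python sketch is one possible reading of it, not a reference implementation. It rebuilds the RFC-3986 character classes, checks that exactly ten printable ASCII characters fall outside them, and percent-encodes a single component. The names `encode_component` and `keep` are hypothetical; `keep` is where the caller lists the delimiter characters that should stay literal in that component (the "m" and "N" cells of the table). Encoding non-ASCII characters as UTF-8 octets is an assumption of this sketch.

```python
import string

# Character classes from the RFC-3986 ABNF.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
GEN_DELIMS = set(":/?#[]@")
SUB_DELIMS = set("!$&'()*+,;=")

# Printable ASCII (space through tilde) that the ABNF never names.
PRINTABLE = {chr(code) for code in range(0x20, 0x7F)}
UNNAMED = PRINTABLE - UNRESERVED - GEN_DELIMS - SUB_DELIMS - {"%"}


def encode_component(text: str, keep: str = "") -> str:
    """Percent-encode one URI component (hypothetical helper).

    Unreserved characters pass through. Characters in `keep` also pass
    through, but only gen-delims and sub-delims may be listed there.
    Everything else, including "%", control characters, and non-ASCII
    characters, is encoded octet by octet (UTF-8 is assumed).
    """
    not_delims = set(keep) - GEN_DELIMS - SUB_DELIMS
    if not_delims:
        raise ValueError(f"not delimiters: {sorted(not_delims)}")
    out = []
    for ch in text:
        if ch in UNRESERVED or ch in keep:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


print(sorted(UNNAMED))
print(encode_component("a b/c?d"))             # a%20b%2Fc%3Fd
print(encode_component("50% off"))             # 50%25%20off
print(encode_component("k=a&b", keep="="))     # k=a%26b
```

For a path segment, a caller following the table might pass `keep=":@"`, since those two characters are marked "N" in the path column.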
References
- Uniform Resource Identifier Schemes
- URL Interop
- URL Specification
- A practical guide to URI encoding and URI decoding
- (Please) Stop Using Unsafe Characters in URLs
- Exploiting URL Parsers: The Good, Bad, And Inconsistent
Notes on an overheard conversation about The Great American Tag Sale with Martha Stewart
“I think Martha's spent too much time hanging with Snoop Dogg.”
“What makes you say that?”
“Look at her! Her dress, her hair, the 50-yard stare into nothing.”
“Maybe you're not used to seeing her at home.”
“Maybe … ”
“Besides, maybe she learned that while in prison.”
“Oh yeah! She did do time in the pokey, didn't she?”