URI encoding

Wednesday, May 25, 2022

I've fallen into a rabbit hole of URI encoding and decoding, so why not publish my results here? At least then I'll have a place where I know I can look them up again. And who knows? Maybe someone else will find this useful.

Anyway, there are two standards that define URIs:

The first, RFC-3986, is from the IETF and is what most non-browser software that deals with URIs uses. The second, the URL Standard, is from the WHATWG (and while WHATWG stands for “Web Hypertext Application Technology Working Group,” I always read that as “What Working Group?” which gives away my opinions on this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).

RFC-3986 is quite clear on when to encode and decode characters:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

RFC-3986, section 2.4: When to Encode or Decode
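To make the two warnings in that section concrete, here is a minimal sketch in Python using only the standard `urllib.parse` module. The example URI and the variable names are made up for illustration; the point is only to show what goes wrong when a string is encoded or decoded twice, or decoded before it is split into components.

```python
from urllib.parse import quote, unquote, urlsplit

# Decoding twice corrupts data: "%2541" carries the data "%41", not "A".
once = unquote("%2541")      # one decode yields the intended data
twice = unquote(once)        # a second decode misreads "%41" as an escape

# Encoding twice is just as wrong: the "%" from the first pass is escaped again.
enc_once = quote("a b", safe="")
enc_twice = quote(enc_once, safe="")

# Split the URI into components first, then decode each piece.
uri = "http://example.com/a%2Fb/c?x=1%262&y=3"   # hypothetical URI
parts = urlsplit(uri)
segments = [unquote(s) for s in parts.path.split("/")[1:]]
pairs = [
    tuple(unquote(v) for v in item.split("=", 1))
    for item in parts.query.split("&")
]

# Decoding the whole URI first turns the escaped "/" and "&" into delimiters.
wrong = urlsplit(unquote(uri))
wrong_segments = wrong.path.split("/")[1:]
```

Split first and the path has two segments, the first of which contains a literal slash; decode first and it appears to have three.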

But you do have to read the ABNF carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow, as it describes in all-too-verbose English the algorithm for encoding and decoding a URI, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I've created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):

URL percent-encoding chart (per RFC-3986)
char	class	scheme	auth	path	query	fragment	note
SPACE		-	Y	Y	Y	Y
!	sub-delim	-	m	m	m	m
"		-	Y	Y	Y	Y
#	gen-delim	-	m	m	m	m	4
$	sub-delim	-	m	m	m	m
%	escape	-	Y	Y	Y	Y
&	sub-delim	-	m	m	m	m
'	sub-delim	-	m	m	m	m
(	sub-delim	-	m	m	m	m
)	sub-delim	-	m	m	m	m
*	sub-delim	-	m	m	m	m
+	sub-delim	N	m	m	m	m
,	sub-delim	-	m	m	m	m
-	unreserved	N	N	N	N	N
.	unreserved	N	N	N	N	N
/	gen-delim	-	m	m	N	N
0	unreserved	N	N	N	N	N
1	unreserved	N	N	N	N	N
2	unreserved	N	N	N	N	N
3	unreserved	N	N	N	N	N
4	unreserved	N	N	N	N	N
5	unreserved	N	N	N	N	N
6	unreserved	N	N	N	N	N
7	unreserved	N	N	N	N	N
8	unreserved	N	N	N	N	N
9	unreserved	N	N	N	N	N
:	gen-delim	-	m	N	N	N	2
;	sub-delim	-	m	m	m	m
<		-	Y	Y	Y	Y
=	sub-delim	-	m	m	m	m
>		-	Y	Y	Y	Y
?	gen-delim	-	m	m	N	N
@	gen-delim	-	m	N	N	N
A	unreserved	N	N	N	N	N
B	unreserved	N	N	N	N	N
C	unreserved	N	N	N	N	N
D	unreserved	N	N	N	N	N
E	unreserved	N	N	N	N	N
F	unreserved	N	N	N	N	N
G	unreserved	N	N	N	N	N
H	unreserved	N	N	N	N	N
I	unreserved	N	N	N	N	N
J	unreserved	N	N	N	N	N
K	unreserved	N	N	N	N	N
L	unreserved	N	N	N	N	N
M	unreserved	N	N	N	N	N
N	unreserved	N	N	N	N	N
O	unreserved	N	N	N	N	N
P	unreserved	N	N	N	N	N
Q	unreserved	N	N	N	N	N
R	unreserved	N	N	N	N	N
S	unreserved	N	N	N	N	N
T	unreserved	N	N	N	N	N
U	unreserved	N	N	N	N	N
V	unreserved	N	N	N	N	N
W	unreserved	N	N	N	N	N
X	unreserved	N	N	N	N	N
Y	unreserved	N	N	N	N	N
Z	unreserved	N	N	N	N	N
[	gen-delim	-	m	m	m	m	2,3,4
\		-	Y	Y	Y	Y	1
]	gen-delim	-	m	m	m	m	2,3,4
^		-	Y	Y	Y	Y	2,3,4
_	unreserved	-	N	N	N	N
`		-	Y	Y	Y	Y	3
a	unreserved	N	N	N	N	N
b	unreserved	N	N	N	N	N
c	unreserved	N	N	N	N	N
d	unreserved	N	N	N	N	N
e	unreserved	N	N	N	N	N
f	unreserved	N	N	N	N	N
g	unreserved	N	N	N	N	N
h	unreserved	N	N	N	N	N
i	unreserved	N	N	N	N	N
j	unreserved	N	N	N	N	N
k	unreserved	N	N	N	N	N
l	unreserved	N	N	N	N	N
m	unreserved	N	N	N	N	N
n	unreserved	N	N	N	N	N
o	unreserved	N	N	N	N	N
p	unreserved	N	N	N	N	N
q	unreserved	N	N	N	N	N
r	unreserved	N	N	N	N	N
s	unreserved	N	N	N	N	N
t	unreserved	N	N	N	N	N
u	unreserved	N	N	N	N	N
v	unreserved	N	N	N	N	N
w	unreserved	N	N	N	N	N
x	unreserved	N	N	N	N	N
y	unreserved	N	N	N	N	N
z	unreserved	N	N	N	N	N
{		-	Y	Y	Y	Y	3,4
|		-	Y	Y	Y	Y	2
}		-	Y	Y	Y	Y	3,4
~	unreserved	-	N	N	N	N

1. WHATWG: “\” is treated as a “/” in path segment
2. WHATWG: character not encoded in path
3. WHATWG: character not encoded in query
4. WHATWG: character not encoded in fragment

Encoding Key
Y	always encode
N	never encode
m	only encode when not used for their defined purpose (URI scheme dependent)
-	not allowed, even escaped

Character classes as defined by RFC-3986
unreserved	characters that never need to be encoded
gen-delim	characters defined as general use delimiters
sub-delim	characters defined as a potential delimiter for subcomponents in a URI
escape	character defined to escape other characters
	characters not otherwise defined, and thus must be escaped.

Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
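As a rough illustration of how the chart could drive an encoder, here is a minimal Python sketch. The names `encode`, `NEVER_ENCODE`, `MAYBE_ENCODE` and the `delimiters` parameter are hypothetical, the sets are transcribed from the RFC-3986 columns of the table above (the WHATWG notes are ignored), and encoding non-ASCII characters as UTF-8 before escaping is an assumption on my part. The scheme column is left out because a scheme cannot contain escapes at all.

```python
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)

# "N" entries in the chart: never encoded in the given component.
NEVER_ENCODE = {
    "auth": UNRESERVED,
    "path": UNRESERVED | set(":@"),
    "query": UNRESERVED | set("/?:@"),
    "fragment": UNRESERVED | set("/?:@"),
}

# "m" entries in the chart: left alone only when used as a delimiter.
MAYBE_ENCODE = {
    "auth": frozenset("!#$&'()*+,/:;=?@[]"),
    "path": frozenset("!#$&'()*+,/;=?[]"),
    "query": frozenset("!#$&'()*+,;=[]"),
    "fragment": frozenset("!#$&'()*+,;=[]"),
}


def encode(text, component, delimiters=""):
    """Percent-encode text for one URI component.

    Characters listed in `delimiters` are kept literal, but only if the
    chart marks them "m" for that component. Everything else outside the
    "N" set is escaped, including "%" itself, control characters, and
    non-ASCII characters (as UTF-8 octets, which is an assumption).
    """
    literal = NEVER_ENCODE[component] | (MAYBE_ENCODE[component] & set(delimiters))
    out = []
    for ch in text:
        if ch in literal:
            out.append(ch)
        else:
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)
```

With this sketch, `encode("a b/c", "path")` treats the slash as data and escapes it, while `encode("a b/c", "path", delimiters="/")` keeps it as a segment separator. Which of the "m" characters count as delimiters depends on the URI scheme, which is why the caller has to say so.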

The Boston Diaries

References
