The Boston Diaries

Sunday, May 01, 2022

A zombie site from May Days past

Given that today is May Day, I was curious as to what I wrote on past May Days. And lo, sixteen years ago I wrote about OsiXs.org and their attempt to “change the world!” Amazingly, the website is still around, although with even less content than there was sixteen years ago. I guess I was right when I wrote back then, “I personally don't see this going anywhere fast.”

It was a simple bug, but …

I was right about the double slash bug—it was a simple bug after all. The authors of two Gemini crawlers wrote in about the double slash bug, and from them, I was able to get the root cause of the problem—my blog on Gemini. Good thing I hedged my statement about not being the cause yesterday. Sigh.

Back in Debtember, I added support for displaying multiple posts. It's not an easy feature to describe, but basically, it allows one to (by hacking the URL, but who hacks URLs these days?) specify posts via a range of dates. And it's on these pages that the double slashed URLs appear. Why that happens is easy—I was generating the links directly from strings:

local function geminilink(entry)
  return string.format("gemini://%s%s/%s%04d/%02d/%02d.%d",
            config.url.host,
            port, -- generated elsewhere
            config.url.path,
            entry.when.year,
            entry.when.month,
            entry.when.day,
            entry.when.part
    )
end

instead of from a URL type. I think when I wrote the above code, I wasn't thinking in terms of a URL type, but of constructing a URL from data I already had. The bug itself is due to config.url.path ending in a slash, so the third slash in the string literal wasn't needed. The correct way isn't that hard:

local function geminilink(entry)
  return uurl.toa(uurl.merge(config.url,
		{
		  path = string.format("%04d/%02d/%02d.%d",
				entry.when.year,
				entry.when.month,
				entry.when.day,
				entry.when.part)
		}))
end

and it wouldn't have exhibited the issue.
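The difference is easy to see without the URL library. What follows is only a minimal sketch, not the blog's actual code: base, naive_link and merged_link are made-up names, the host and path are assumed from the URLs mentioned in these posts, and the merge is a simplified take on RFC-3986 section 5.2.3 rather than whatever uurl.merge really does.

```lua
-- Hypothetical stand-in for config.url. The host and path are assumptions;
-- note that the path both starts and ends with a slash.
local base = { host = "gemini.conman.org", path = "/boston/" }

-- The buggy approach: paste the pieces together around a literal "/".
local function naive_link(rel)
  return string.format("gemini://%s/%s%s", base.host, base.path, rel)
end

-- Simplified merge: keep the base path up to and including its last "/",
-- then append the relative path. No extra slash is ever added.
local function merged_link(rel)
  local dir = base.path:match("^(.*/)") or "/"
  return string.format("gemini://%s%s%s", base.host, dir, rel)
end

print(naive_link("2018/07/04.2"))
--> gemini://gemini.conman.org//boston/2018/07/04.2
print(merged_link("2018/07/04.2"))
--> gemini://gemini.conman.org/boston/2018/07/04.2
```

The point of going through a merge step is that the code never has to know whether the configured path carries its own slashes.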

With this fix in place, I think I will continue to reject requests with the double slash, as it is catching bugs, which is a Good Thing™.

Monday, May 02, 2022

Notes on an overheard conversation about tea

“You know, you forgot to remind me to make your tea.”

“Oh. I need to remind you to make tea.”

“Sigh.”

“So thank you for reminding me to remind you to make tea.”

“…”

“Um, doesn't hitting your head against the wall hurt?”

Tuesday, May 03, 2022

I'm hoping this is a joke, because if it's not, I'm not sure what that says about our society

I had just finished my lunch of a sub sandwich when I noticed a message printed on the wrapper in not-so-small print:

[A sub sandwich wrapper with “DO NOT EAT THIS WRAPPER” printed on it.] I'll admit, the sub was good, but not so good as to keep eating everything in sight.

I have no words.

The legality of double slashes in URIs

Martin Chang replied to my musings on processing malformed Gemini requests, saying that double slashes in URIs are illegal, and pointed out the ABNF grammar from the URI specification to back up his claim:

path          = path-absolute   ; begins with "/" but not "//"
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

But he didn't quote the segment rule:

segment       = *pchar

which translated says, “0 or more pchar rules.”

So the ABNF he quoted does indeed rule out //boston/2018/07/04.2. It doesn't rule out /boston//2018/07/04.2, since by the time we hit the double slash, we're in the *( "/" segment ) part of the path-absolute rule, and segment can have 0 characters. But what he quoted only applies to relative links; what I receive is an absolute link. If you follow the ABNF from that perspective:

URI-reference = URI / relative-ref
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

path-abempty  = *( "/" segment )

; other rules omitted

not only does this allow gemini://gemini.conman.org//boston/2018/07/04.2 but gemini://gemini.conman.org///////////boston/2018/07/04.2.
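The two rules can be checked mechanically. This is a rough sketch under a few assumptions: the function names are invented for illustration, only the path portion of the URI is examined, and Lua patterns stand in for the ABNF (percent-encoded triplets are stripped first, then the rest is checked against the remaining pchar characters).

```lua
-- pchar without pct-encoded: unreserved / sub-delims / ":" / "@"
local PCHARS = "^[A-Za-z0-9%-%._~!%$&'%(%)%*%+,;=:@]*$"

-- segment = *pchar (so an empty segment is fine)
local function is_segment(s)
  -- Remove well-formed %XX triplets; a stray "%" left over fails the match.
  local rest = s:gsub("%%%x%x", "")
  return rest:match(PCHARS) ~= nil
end

-- path-abempty = *( "/" segment )
local function is_path_abempty(path)
  if path == "" then return true end
  if path:sub(1, 1) ~= "/" then return false end
  -- Append a "/" so every segment, including empty ones, is terminated.
  for seg in (path:sub(2) .. "/"):gmatch("([^/]*)/") do
    if not is_segment(seg) then return false end
  end
  return true
end

-- path-absolute = "/" [ segment-nz *( "/" segment ) ]
-- Same shape as path-abempty, but the first segment may not be empty.
local function is_path_absolute(path)
  return path:sub(1, 1) == "/"
     and path:sub(1, 2) ~= "//"
     and is_path_abempty(path)
end

print(is_path_absolute("//boston/2018/07/04.2"))          --> false
print(is_path_absolute("/boston//2018/07/04.2"))          --> true
print(is_path_abempty("//boston/2018/07/04.2"))           --> true
print(is_path_abempty("///////////boston/2018/07/04.2"))  --> true
```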

I can understand why this was done: to simplify the grammar, as the various path- rules generally end with *( "/" segment ), which allows one to end a URI with a trailing slash or not. I don't think the intent was to allow long strings of slashes, but that's the end result of a lax grammar. And while Martin is also correct that multiple slashes are treated as a single slash on POSIX (basically, any Unix system), that's not the case across all operating systems. One exception I can think of is AmigaOS, where each slash represents a parent directory. This command, cd /// on AmigaOS is the same as cd ../../.. on a POSIX system. Crazy, I know. And maybe not even relevant these days, but I thought I should mention it.

Wednesday, May 04, 2022

Star Wars Day?

It's not Star Wars Day; it's Dave Brubeck Day! (And give yourself 10 cool points if you get the reference.) Of course, it's only Dave Brubeck Day in the US. Elsewhere in the world, Dave Brubeck Day is April 5th for some odd reason (give yourself a geek point for getting this reference).

[And of course Sean didn't tell you he pulled this meme from FaceMeLinkedInstaMySpaceBookWeInGram. He's not that cool to think of this. —Editor]

Tuesday, May 10, 2022

Springfield isn't the most popular city name in the US

OK, why do the Simpsons live in a town called Springfield? Isn't that a little generic?

Springfield was named after Springfield, Oregon. The only reason is that when I was a kid, the TV show “Father Knows Best” took place in the town of Springfield, and I was thrilled because I imagined that it was the town next to Portland, my hometown. When I grew up, I realized it was just a fictitious name. I also figured out that Springfield was one of the most common names for a city in the U.S. In anticipation of the success of the show, I thought, “This will be cool; everyone will think it's their Springfield.” And they do.

Matt Groening Reveals the Location of the Real Springfield | Arts & Culture | Smithsonian Magazine

So I got to wondering, is Springfield the most popular city name in the US? I know, weird question, but I'm curious. So some quick searching led me to the United States Geological Survey Geographical Names Database. With some massaging of the data, I was able to determine that there are 34 States with a “Springfield,” but it's not alone. There are eight other cities that are also in 34 States: Arlington, Chester, Clinton, Farmington, Florence, Greenville, Milton, and Newport. Okay, maybe not the same 34 states across all those cities, but you get the idea.

But those cities aren't the most popular names. No, all of them are tied for ninth place! The city name that appears in the most states is “Riverside” at 46 States (plus Puerto Rico). The States that don't have a “Riverside” are Alaska, Hawaii, Oklahoma, and Louisiana (really? Louisiana? One of the world's largest rivers runs straight through that state, and no one bothered to name a town in Louisiana “Riverside”?).

And just to satisfy the curious:

Top 10 city names in the US (including territories)
Place	Name	# States
1	Riverside	47
2	Centerville	43
3	Fairview	41
4	Franklin	40
5	Midway	39
6	Georgetown	37
	Glendale	37
	Greenwood	37
7	Lincoln	36
	Marion	36
	Oakland	36
	Pleasant Valley	36
	Salem	36
	Union	36
8	Fairfield	35
	Lakeview	35
	Liberty	35
9	Arlington	34
	Chester	34
	Clinton	34
	Farmington	34
	Florence	34
	Greenville	34
	Milton	34
	Newport	34
	Springfield	34
10	Bethel	33
	Clifton	33
	Eden	33
	Glenwood	33
	Hamilton	33
	Kingston	33
	Lakeside	33
	Mount Pleasant	33
	Summit	33
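For the curious, the “massaging” boils down to counting distinct states per name and then ranking with ties sharing a place, as in the table above. A minimal sketch, assuming the data has already been reduced to (name, state) pairs; the rows here are toy data, not the USGS database, and rank_names is a name made up for this example.

```lua
-- Toy input: one { name, state } pair per populated place.
local rows = {
  { "Springfield", "OR" }, { "Springfield", "IL" }, { "Springfield", "MA" },
  { "Springfield", "IL" },  -- same name twice in one state counts once
  { "Riverside", "CA" }, { "Riverside", "IL" },
  { "Portland", "OR" }, { "Portland", "ME" },
}

local function rank_names(rows)
  -- Count each (name, state) combination only once.
  local seen, count = {}, {}
  for _, row in ipairs(rows) do
    local name, state = row[1], row[2]
    seen[name] = seen[name] or {}
    if not seen[name][state] then
      seen[name][state] = true
      count[name] = (count[name] or 0) + 1
    end
  end

  -- Sort by number of states (descending), then by name.
  local list = {}
  for name, n in pairs(count) do
    list[#list + 1] = { name = name, states = n }
  end
  table.sort(list, function(a, b)
    if a.states ~= b.states then return a.states > b.states end
    return a.name < b.name
  end)

  -- Names with the same count share a place; the next count gets place + 1.
  local place, prev = 0, nil
  for _, item in ipairs(list) do
    if item.states ~= prev then
      place = place + 1
      prev = item.states
    end
    item.place = place
  end
  return list
end

for _, item in ipairs(rank_names(rows)) do
  print(item.place, item.name, item.states)
end
```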

Thursday, May 12, 2022

“This is how we do things around here.”

And, in fact, anyone with any proximity to software development has likely heard rumblings about Agile. For all the promise of the manifesto, one starts to get the sense when talking to people who work in technology that laboring under Agile may not be the liberatory experience it’s billed as. Indeed, software development is in crisis again—but, this time, it’s an Agile crisis. On the web, everyone from regular developers to some of the original manifesto authors is raising concerns about Agile practices. They talk about the “Agile-industrial complex,” the network of consultants, speakers, and coaches who charge large fees to fine-tune Agile processes. And almost everyone complains that Agile has taken a wrong turn: somewhere in the last two decades, Agile has veered from the original manifesto’s vision, becoming something more restrictive, taxing, and stressful than it was meant to be.

Part of the issue is Agile’s flexibility. Jan Wischweh, a freelance developer, calls this the “no true Scotsman” problem. Any Agile practice someone doesn’t like is not Agile at all, it inevitably turns out. The construction of the manifesto makes this almost inescapable: because the manifesto doesn’t prescribe any specific activities, one must gauge the spirit of the methods in place, which all depends on the person experiencing them. Because it insists on its status as a “mindset,” not a methodology, Agile seems destined to take on some of the characteristics of any organization that adopts it. And it is remarkably immune to criticism, since it can’t be reduced to a specific set of methods. “If you do one thing wrong and it’s not working for you, people will assume it’s because you’re doing it wrong,” one product manager told me. “Not because there’s anything wrong with the framework.”

Via Hacker News, Agile and the Long Crisis of Software

That last line, “it's not working for you, people will assume it's because you're doing it wrong,” rings really true to me. At ~~The Corporation~~ … no, I no longer work for The Corporation, I now work for The Enterprise now that the Corporate Overlords have finally taken over. So, at The Enterprise, I've been informing them pretty much all this year that this “Agile” development system they're forcing on us isn't working. Before they finally took over, the team I was on was always on time and on budget, with smooth deployments (only two bad deployments in ten years) and no show-stopping bugs found in production. As I told upper management, given our prior track record, why change how we do development? Why fix what isn't broken? And while upper management never said this directly, through their actions they answered: this is our process, and we're sticking to it, slipped schedules and disastrous deployments be damned!

As to why I haven't left yet? Because it seems this “Agile” movement has invaded everywhere and things would be “more of the same” elsewhere. At least here, I'm not forced to use Windows.

Programming, up hill, both ways

People would come to us with a problem, and we would figure out a solution. We couldn't just search the web because the web was still being written. And you couldn't just punt a hard question to the engineer in the desk next to you. Why? Because you were sitting alone in a utility closet packed with floppy disks and old tape drives.

I'm a XXXXX XX webmaster

Ah, this takes me back. I got my first computer back in 1984, and if I wanted to know anything about it I was on my own. Google didn't exist (the public Internet didn't exist at the time). I didn't have anyone I could ask about computer related things. I did have books and magazines. So between experimentation and learning to read between the lines, I picked up programming.

So when it came time to write a metasearch engine, there were no tutorials. There were no open source metasearch engines to download and use. There was only the problem of writing a metasearch engine, in a language I didn't even know (and which itself was less than a year old at the time).

Fun times.

So I always found it odd when people would go online asking for tutorials, especially for writing metasearch engines (and yes, that did happen back then). So when something like testing a negative comes up, and I can't convince the Powers That Be that it's never a good idea to prove a negative, I can't just look up some tutorial on proving negatives—I just have to figure it out on my own.

Friday, May 20, 2022

If you have to embrace the stupid, you might as well do it well

Our customer, The Oligarchic Cell Phone Company, wants us to do a demo of a new feature for a certain class of clients. “Project: Lumbergh” will receive a URL along with the name and reputation of a phone number it gets from elsewhere. “Project: Lumbergh” will then pass this along to “Project: Sippy-Cup.” We already have to deal with URLs from elsewhere. The only change we have to make is allowing URLs to be passed along to the certain class of clients, which formerly did not get URLs. So far, so good.

But then I saw code being added to “Project: Lumbergh” to check the URLs to see if the path portion ended in .bmp. I enquired about this, because to me, that makes no sense: we're just a conduit for data; the source of the URL should already know what it can and can't send to the client. I was told that the certain class of clients only support BMP files while other clients that can receive URLs can't support BMP files, so we have to ensure that BMPs only go to the subset of clients that can support them. I countered with the fact that we include information about the client to the data source when we query them, and they should have the logic to handle this on their end. Why are we suddenly responsible for this? I was told that the LOF for the data source would be too large to handle by the demo deadline, that we had to handle it, that the code that just looks anywhere in the URL for a literal “.bmp” is Good Enough™, and to stop with the questions.
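To make the gap between the two checks concrete, here is a rough sketch. It is not the code from “Project: Lumbergh”: both function names are invented, and it assumes the URL has already been decoded into its normal form.

```lua
-- The "Good Enough" check: a literal ".bmp" anywhere in the URL.
local function bmp_anywhere(url)
  return url:find(".bmp", 1, true) ~= nil   -- plain find, no patterns
end

-- Check only the end of the path: skip the scheme and authority,
-- and stop at the first "?" or "#".
local function bmp_path(url)
  local path = url:match("^%a[%w+.-]*://[^/?#]*([^?#]*)")
  return path ~= nil and path:sub(-4) == ".bmp"
end

local samples = {
  "https://example.com/picture.bmp",
  "https://example.com/picture.bmp.html",
  "https://example.com/view?file=a.bmp",
  "https://example.com/picture.png",
}
for _, url in ipairs(samples) do
  print(url, bmp_anywhere(url), bmp_path(url))
end
```

The two checks agree on the first and last samples and disagree on the middle two.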

Now the URL we're given is “percent-encoded,” so we get something like: https%3A%2F%2Fexample.com%2Fpicture.bmp. Never mind the fact that that is an invalid URL to begin with (you aren't supposed to encode characters that are defined as delimiters in URLs if they are, in fact, delimiting fields), that's what we get and pass along. Only now (a few years after we started passing URLs along like this) the clients can't properly decode them (surprise!), so of course we have to do that. I asked why we even had to do that and was told that the LOF for the data source would be too large to handle by the demo deadline, we had to handle it, and to stop with the questions. I then complained that the code doing that was doing too much, as it would decode the so-called “unsafe characters” from RFC-3986 (which aren't defined in the RFC, but can be derived by a careful reading between the lines), like the dreaded space character.

There was then much back and forth between me and my manager (it's not who I thought it was but that's another rant for another time) about what should and shouldn't be decoded. I kept saying that if we have to embrace the stupid, we might as well do it right, but my manager argued against that, saying we should just decode %3A and %2F since that's all that's being asked of us today. I countered with “What about tomorrow, when we're asked to decode %3F (‘?’) and %40 (‘@’)?” (which are delimiter characters per RFC-3986).

I was told to stop with the questions.

And then all hell breaks loose when we get https%3A%2F%2Fexample.com%2FThings%2520Go%2520Boom%2521.

Sigh.
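That last URL is percent-encoded twice (%2520 is an encoded %20), so how many times it gets decoded, and which characters get decoded, changes what comes out. As a rough sketch (decode and decode_only are hypothetical helpers, not the actual code in “Project: Lumbergh”):

```lua
-- Decode every %XX triplet exactly once (a single left-to-right pass).
local function decode(s)
  return (s:gsub("%%(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end))
end

-- Decode only the triplets listed in set; leave everything else alone.
local function decode_only(s, set)
  return (s:gsub("%%(%x%x)", function(hex)
    if set[hex:upper()] then
      return string.char(tonumber(hex, 16))
    end
    return nil  -- nil tells gsub to keep the original text
  end))
end

local url = "https%3A%2F%2Fexample.com%2FThings%2520Go%2520Boom%2521"

print(decode_only(url, { ["3A"] = true, ["2F"] = true }))
--> https://example.com/Things%2520Go%2520Boom%2521
print(decode(url))
--> https://example.com/Things%20Go%20Boom%21
print(decode(decode(url)))
--> https://example.com/Things Go Boom!
```

Three different results from one input, and which one is “right” depends entirely on what the receiving end expects.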

Wednesday, May 25, 2022

URI encoding

I've fallen into a rabbit hole of URI encoding and decoding, so why not publish my results here, so I at least have a place where I know I can look them up again? And who knows? Maybe someone else will find this useful.

Anyway, there are two standards that define URIs: RFC-3986 and the WHATWG URL Standard.

The first is from the IETF and is what most non-browsers that deal with URIs use. The second is from the WHATWG (and while WHATWG stands for “Web Hypertext Application Technology Working Group,” I always read that as “What Working Group?” which gives away my opinions on this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).

RFC-3986 is quite clear on when to encode and decode characters:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

RFC-3986, section 2.4: When to Encode or Decode
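The one exception in that passage, decoding percent-encoded unreserved characters at any time, is simple enough to sketch. This is an illustration only; decode_unreserved is a made-up name, and everything outside the unreserved set is deliberately left encoded.

```lua
-- Decode %XX only when it stands for an unreserved character
-- (ALPHA / DIGIT / "-" / "." / "_" / "~"); leave all other triplets as-is.
local function decode_unreserved(s)
  return (s:gsub("%%(%x%x)", function(hex)
    local c = string.char(tonumber(hex, 16))
    if c:match("^[A-Za-z0-9%-%._~]$") then
      return c
    end
    return nil  -- keep the original %XX text
  end))
end

print(decode_unreserved("/%7Esean/a%2Fb%20c"))
--> /~sean/a%2Fb%20c
```

The %7E becomes a tilde, while the encoded slash and space stay encoded, since decoding those before the URI is split into components could change its meaning.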

But you do have to read the ABNF carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow, as it describes in all-too-verbose English the algorithm of how to encode and decode a URI, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):

URL percent-encoding chart (per RFC-3986)
		scheme	auth	path	query	fragment	note
SPACE		-	Y	Y	Y	Y
!	sub-delim	-	m	m	m	m
"		-	Y	Y	Y	Y
#	gen-delim	-	m	m	m	m	4
$	sub-delim	-	m	m	m	m
%	escape	-	Y	Y	Y	Y
&	sub-delim	-	m	m	m	m
'	sub-delim	-	m	m	m	m
(	sub-delim	-	m	m	m	m
)	sub-delim	-	m	m	m	m
*	sub-delim	-	m	m	m	m
+	sub-delim	N	m	m	m	m
,	sub-delim	-	m	m	m	m
-	unreserved	N	N	N	N	N
.	unreserved	N	N	N	N	N
/	gen-delim	-	m	m	N	N
0	unreserved	N	N	N	N	N
1	unreserved	N	N	N	N	N
2	unreserved	N	N	N	N	N
3	unreserved	N	N	N	N	N
4	unreserved	N	N	N	N	N
5	unreserved	N	N	N	N	N
6	unreserved	N	N	N	N	N
7	unreserved	N	N	N	N	N
8	unreserved	N	N	N	N	N
9	unreserved	N	N	N	N	N
:	gen-delim	-	m	N	N	N	2
;	sub-delim	-	m	m	m	m
<		-	Y	Y	Y	Y
=	sub-delim	-	m	m	m	m
>		-	Y	Y	Y	Y
?	gen-delim	-	m	m	N	N
@	gen-delim	-	m	N	N	N
A	unreserved	N	N	N	N	N
B	unreserved	N	N	N	N	N
C	unreserved	N	N	N	N	N
D	unreserved	N	N	N	N	N
E	unreserved	N	N	N	N	N
F	unreserved	N	N	N	N	N
G	unreserved	N	N	N	N	N
H	unreserved	N	N	N	N	N
I	unreserved	N	N	N	N	N
J	unreserved	N	N	N	N	N
K	unreserved	N	N	N	N	N
L	unreserved	N	N	N	N	N
M	unreserved	N	N	N	N	N
N	unreserved	N	N	N	N	N
O	unreserved	N	N	N	N	N
P	unreserved	N	N	N	N	N
Q	unreserved	N	N	N	N	N
R	unreserved	N	N	N	N	N
S	unreserved	N	N	N	N	N
T	unreserved	N	N	N	N	N
U	unreserved	N	N	N	N	N
V	unreserved	N	N	N	N	N
W	unreserved	N	N	N	N	N
X	unreserved	N	N	N	N	N
Y	unreserved	N	N	N	N	N
Z	unreserved	N	N	N	N	N
[	gen-delim	-	m	m	m	m	2,3,4
\		-	Y	Y	Y	Y	1
]	gen-delim	-	m	m	m	m	2,3,4
^		-	Y	Y	Y	Y	2,3,4
_	unreserved	-	N	N	N	N
`		-	Y	Y	Y	Y	3
a	unreserved	N	N	N	N	N
b	unreserved	N	N	N	N	N
c	unreserved	N	N	N	N	N
d	unreserved	N	N	N	N	N
e	unreserved	N	N	N	N	N
f	unreserved	N	N	N	N	N
g	unreserved	N	N	N	N	N
h	unreserved	N	N	N	N	N
i	unreserved	N	N	N	N	N
j	unreserved	N	N	N	N	N
k	unreserved	N	N	N	N	N
l	unreserved	N	N	N	N	N
m	unreserved	N	N	N	N	N
n	unreserved	N	N	N	N	N
o	unreserved	N	N	N	N	N
p	unreserved	N	N	N	N	N
q	unreserved	N	N	N	N	N
r	unreserved	N	N	N	N	N
s	unreserved	N	N	N	N	N
t	unreserved	N	N	N	N	N
u	unreserved	N	N	N	N	N
v	unreserved	N	N	N	N	N
w	unreserved	N	N	N	N	N
x	unreserved	N	N	N	N	N
y	unreserved	N	N	N	N	N
z	unreserved	N	N	N	N	N
{		-	Y	Y	Y	Y	3,4
|		-	Y	Y	Y	Y	2
}		-	Y	Y	Y	Y	3,4
~	unreserved	-	N	N	N	N

1	WHATWG: “\” is treated as a “/” in path segment
2	WHATWG: character not encoded in path
3	WHATWG: character not encoded in query
4	WHATWG: character not encoded in fragment

Encoding Key
Y	always encode
N	never encode
m	only encode when not used for their defined purpose (URI scheme dependent)
-	not allowed, even escaped

Character classes as defined by RFC-3986
unreserved	characters that never need to be encoded
gen-delim	characters defined as general use delimiters
sub-delim	characters defined as a potential delimiter for subcomponents in a URI
escape	character defined to escape other characters
	characters not otherwise defined, and thus must be escaped.

Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
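One way to put the chart to work is a per-component encoder for data. This is a minimal sketch following the RFC-3986 columns only (the WHATWG notes are ignored), with SAFE and encode as names made up for this example. It treats every “m” character as data, so it is meant for one piece at a time (a single path segment, a query value), not for a whole URI.

```lua
-- Characters the chart marks "N" (never encode) per component, written as
-- the inside of a Lua character class. Everything else gets encoded.
local UNRESERVED = "A-Za-z0-9%-%._~"
local SAFE = {
  auth     = UNRESERVED,
  path     = UNRESERVED .. ":@",
  query    = UNRESERVED .. ":@/%?",
  fragment = UNRESERVED .. ":@/%?",
}

-- Percent-encode data for the given component, one octet at a time.
local function encode(component, data)
  local class = assert(SAFE[component], "unknown component")
  return (data:gsub("[^" .. class .. "]", function(c)
    return string.format("%%%02X", c:byte())
  end))
end

print(encode("path", "Things Go Boom!"))  --> Things%20Go%20Boom%21
print(encode("path", "a/b"))              --> a%2Fb
print(encode("query", "a/b?c=d"))         --> a/b?c%3Dd
print(encode("auth", "user@host"))        --> user%40host
```

Because it works on octets, control characters and anything at 127 or above get encoded as well, which matches the rule just stated.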

References

Notes on an overheard conversation about The Great American Tag Sale with Martha Stewart

“I think Martha's spent too much time hanging with Snoop Dogg.”

“What makes you say that?”

“Look at her! Her dress, her hair, the 50-yard stare into nothing.”

“Maybe you're not used to seeing her at home.”

“Maybe … ”

“Besides, maybe she learned that while in prison.”

“Oh yeah! She did do time in the pokey, didn't she?”