The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Sunday, May 01, 2022

A zombie site from May Days past

Given that today is May Day, I was curious as to what I wrote on past May Days. And lo, sixteen years ago I wrote about OsiXs.org and their attempt to “change the world!” Amazingly, the website is still around, although with even less content than there was sixteen years ago. I guess I was right when I wrote back then, “I personally don't see this going anywhere fast.”


It was a simple bug, but …

I was right about the double slash bug: it was a simple bug after all. The authors of two Gemini crawlers wrote in about the double slash bug, and from them, I was able to get the root cause of the problem: my blog on Gemini. Good thing I hedged my statement about not being the cause yesterday. Sigh.

Back in Debtember, I added support for displaying multiple posts. It's not an easy feature to describe, but basically, it allows one to (by hacking the URL, but who hacks URLs these days?) specify posts via a range of dates. And it's on these pages that the double slashed URLs appear. Why that happens is easy—I was generating the links directly from strings:

local function geminilink(entry)
  return string.format("gemini://%s%s/%s%04d/%02d/%02d.%d",
            config.url.host,
            port, -- generated elsewhere
            config.url.path,
            entry.when.year,
            entry.when.month,
            entry.when.day,
            entry.when.part
    )
end

instead of from a URL type. I think when I wrote the above code, I wasn't thinking in terms of a URL type, but of constructing a URL from data I already had. The bug itself is due to config.url.path ending in a slash, so the third slash in the string literal wasn't needed. The correct way isn't that hard:

local function geminilink(entry)
  return uurl.toa(uurl.merge(config.url, {
    path = string.format("%04d/%02d/%02d.%d",
      entry.when.year,
      entry.when.month,
      entry.when.day,
      entry.when.part)
  }))
end

and it wouldn't have exhibited the issue.
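For the curious, here is a minimal sketch of the same class of bug in plain Lua, with no URL library involved. The function names `naive_join` and `join_path` and the base path "/boston/" are made up for illustration; this is not how uurl.merge works internally, just one way to see why gluing strings together goes wrong and why normalizing the seam avoids it.

```lua
-- Naive join: blindly puts a "/" between the two pieces.
local function naive_join(base, rel)
  return string.format("%s/%s", base, rel)
end

-- Safer join: trim slashes at the seam so exactly one remains.
local function join_path(base, rel)
  base = base:gsub("/+$", "") -- drop trailing slashes from the base
  rel  = rel:gsub("^/+", "")  -- drop leading slashes from the tail
  return base .. "/" .. rel
end

-- "/boston/" is an assumed example value, not read from a real config.
print(naive_join("/boston/", "2018/07/04.2")) -- /boston//2018/07/04.2
print(join_path("/boston/", "2018/07/04.2"))  -- /boston/2018/07/04.2
```

A real URL type is still the better answer, since it also knows about hosts, ports and queries; the sketch only covers the path seam.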

With this fix in place, I think I will continue to reject requests with the double slash, as it is catching bugs, which is a Good Thing™.

Monday, May 02, 2022

Notes on an overheard conversation about tea

“You know, you forgot to remind me to make your tea.”

“Oh. I need to remind you to make tea.”

“Sigh.”

“So thank you for reminding me to remind you to make tea.”

“…”

“Um, doesn't hitting your head against the wall hurt?”

Tuesday, May 03, 2022

I'm hoping this is a joke, because if it's not, I'm not sure what that says about our society

I had just finished my lunch of a sub sandwich when I noticed a message printed on the wrapper in not-so-small print:

[A sub sandwich wrapper with “DO NOT EAT THIS WRAPPER” printed on it.] I'll admit, the sub was good, but not so good as to keep eating everything in sight.

I have no words.


The legality of double slashes in URIs

Martin Chang replied to my musings on processing malformed Gemini requests, saying that double slashes in URIs are illegal, and pointed out the ABNF grammar from the URI specification to back up his claim:

path          = path-absolute   ; begins with "/" but not "//"
path-absolute = "/" [ segment-nz *( "/" segment ) ]
segment-nz    = 1*pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

But he didn't quote the segment rule:

segment       = *pchar

which translated says, “0 or more pchar rules.”

So the ABNF he quoted does indeed rule out //boston/2018/07/04.2. It doesn't rule out /boston//2018/07/04.2, since by the time we hit the double slash, we're in the *( "/" segment ) part of the path-absolute rule, and segment can have 0 characters. But what he quoted only applies to relative links, and what I receive is an absolute link. If you follow the ABNF from that perspective:

URI-reference = URI / relative-ref
URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

path-abempty  = *( "/" segment )

; other rules omitted

not only does this allow gemini://gemini.conman.org//boston/2018/07/04.2 but gemini://gemini.conman.org///////////boston/2018/07/04.2.

I can understand why this was done: it simplifies the grammar, as having the various path- rules generally end with *( "/" segment ) allows one to end a URI with a trailing slash or not. I don't think the intent was to allow long strings of slashes, but that's the end result of a lax grammar. And while Martin is also correct that multiple slashes are treated as a single slash on POSIX (basically, any Unix system), that's not the case across all operating systems. One exception I can think of is AmigaOS, where each slash represents a parent directory. The command cd /// on AmigaOS is the same as cd ../../.. on a POSIX system. Crazy, I know. And maybe not even relevant these days, but I thought I should mention it.
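To make the difference between the two rules concrete, here is a rough sketch in plain Lua of the path-abempty and path-absolute rules quoted above. The function names are my own invention, and this is a hand-rolled check for illustration, not a full RFC-3986 parser.

```lua
-- pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
-- segment = *pchar, so an empty segment is valid.
local function is_segment(seg)
  -- Remove well-formed percent-encoded triplets, then check what is left.
  local rest = seg:gsub("%%%x%x", "")
  return rest:find("^[A-Za-z0-9%-%._~!%$&'%(%)%*%+,;=:@]*$") ~= nil
end

-- path-abempty = *( "/" segment )
local function is_path_abempty(path)
  local pos = 1
  while pos <= #path do
    -- Match one "/" plus one (possibly empty) segment at the current position.
    local seg, next_pos = path:match("^/([^/]*)()", pos)
    if not seg or not is_segment(seg) then return false end
    pos = next_pos
  end
  return true
end

-- path-absolute = "/" [ segment-nz *( "/" segment ) ]
-- i.e. it begins with "/" but not "//".
local function is_path_absolute(path)
  if path:sub(1, 1) ~= "/" or path:sub(1, 2) == "//" then return false end
  return is_path_abempty(path)
end

print(is_path_absolute("//boston/2018/07/04.2"))  -- false
print(is_path_absolute("/boston//2018/07/04.2"))  -- true
print(is_path_abempty("//boston/2018/07/04.2"))   -- true
```

The only place the two rules disagree is at the very start of the path, which is exactly where the request I was rejecting had its double slash.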

Wednesday, May 04, 2022

Star Wars Day?

It's not Star Wars Day—it's Dave Brubeck Day! (and give yourself 10 cool points if you get the reference) Of course, it's only Dave Brubeck day in the US. Elsewhere in the world, Dave Brubeck Day is April 5th for some odd reason (give yourself a geek point for getting this reference).

[And of course Sean didn't tell you he pulled this meme from Face­Me­Linked­Insta­My­Space­Book­We­In­Gram. He's not that cool to think of this. —Editor]

Tuesday, May 10, 2022

Springfield isn't the most popular city name in the US

OK, why do the Simpsons live in a town called Springfield? Isn't that a little generic?

Springfield was named after Springfield, Oregon. The only reason is that when I was a kid, the TV show “Father Knows Best” took place in the town of Springfield, and I was thrilled because I imagined that it was the town next to Portland, my hometown. When I grew up, I realized it was just a fictitious name. I also figured out that Springfield was one of the most common names for a city in the U.S. In anticipation of the success of the show, I thought, “This will be cool; everyone will think it's their Springfield.” And they do.

Matt Groening Reveals the Location of the Real Springfield | Arts & Culture | Smithsonian Magazine

So I got to wondering, is Springfield the most popular city name in the US? I know, weird question, but I'm curious. So some quick searching led me to the United States Geological Survey Geographical Names Database. With some massaging of the data, I was able to determine that there are 34 States with a “Springfield,” but it's not alone. There are eight other cities that are also in 34 States: Arlington, Chester, Clinton, Farmington, Florence, Greenville, Milton, and Newport. Okay, maybe not the same 34 states across all those cities, but you get the idea.

But those cities aren't the most popular names. No, all of them are tied for ninth place! The city name that appears in the most states is “Riverside” at 46 States (plus Puerto Rico). The States that don't have a “Riverside” are Alaska, Hawaii, Oklahoma, and Louisiana (really? Louisiana? One of the world's largest rivers runs straight through that state, and no one bothered to name a town in Louisiana “Riverside”?).

And just to satisfy the curious:

Top 10 city names in the US (including territories)
Place  Name             # States
1      Riverside        47
2      Centerville      43
3      Fairview         41
4      Franklin         40
5      Midway           39
6      Georgetown       37
       Glendale         37
       Greenwood        37
7      Lincoln          36
       Marion           36
       Oakland          36
       Pleasant Valley  36
       Salem            36
       Union            36
8      Fairfield        35
       Lakeview         35
       Liberty          35
9      Arlington        34
       Chester          34
       Clinton          34
       Farmington       34
       Florence         34
       Greenville       34
       Milton           34
       Newport          34
       Springfield      34
10     Bethel           33
       Clifton          33
       Eden             33
       Glenwood         33
       Hamilton         33
       Kingston         33
       Lakeside         33
       Mount Pleasant   33
       Summit           33
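
The “massaging” amounts to counting the distinct states each name appears in and then sorting. A minimal sketch of that idea in Lua follows; the record layout (`name`, `state`) and the handful of sample records are assumptions for illustration only, not the actual USGS file format or data.

```lua
-- Count how many distinct states each place name appears in.
local function states_per_name(records)
  local seen, count = {}, {}
  for _, r in ipairs(records) do
    seen[r.name] = seen[r.name] or {}
    if not seen[r.name][r.state] then -- count each (name, state) pair once
      seen[r.name][r.state] = true
      count[r.name] = (count[r.name] or 0) + 1
    end
  end
  return count
end

-- Turn the counts into a list sorted by state count, then by name.
local function ranked(count)
  local list = {}
  for name, n in pairs(count) do
    list[#list + 1] = { name = name, states = n }
  end
  table.sort(list, function(a, b)
    if a.states ~= b.states then return a.states > b.states end
    return a.name < b.name
  end)
  return list
end

-- Toy input, hypothetical; the duplicate IL entry is deliberate.
local sample = {
  { name = "Springfield", state = "OR" },
  { name = "Springfield", state = "IL" },
  { name = "Springfield", state = "MA" },
  { name = "Springfield", state = "IL" },
  { name = "Riverside",   state = "CA" },
}

for _, entry in ipairs(ranked(states_per_name(sample))) do
  print(entry.name, entry.states)
end
```

The de-duplication step matters because a state can have more than one place with the same name, and those should only count once.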

Thursday, May 12, 2022

“This is how we do things around here.”

And, in fact, anyone with any proximity to software development has likely heard rumblings about Agile. For all the promise of the manifesto, one starts to get the sense when talking to people who work in technology that laboring under Agile may not be the liberatory experience it’s billed as. Indeed, software development is in crisis again—but, this time, it’s an Agile crisis. On the web, everyone from regular developers to some of the original manifesto authors is raising concerns about Agile practices. They talk about the “Agile-industrial complex,” the network of consultants, speakers, and coaches who charge large fees to fine-tune Agile processes. And almost everyone complains that Agile has taken a wrong turn: somewhere in the last two decades, Agile has veered from the original manifesto’s vision, becoming something more restrictive, taxing, and stressful than it was meant to be.

Part of the issue is Agile’s flexibility. Jan Wischweh, a freelance developer, calls this the “no true Scotsman” problem. Any Agile practice someone doesn’t like is not Agile at all, it inevitably turns out. The construction of the manifesto makes this almost inescapable: because the manifesto doesn’t prescribe any specific activities, one must gauge the spirit of the methods in place, which all depends on the person experiencing them. Because it insists on its status as a “mindset,” not a methodology, Agile seems destined to take on some of the characteristics of any organization that adopts it. And it is remarkably immune to criticism, since it can’t be reduced to a specific set of methods. “If you do one thing wrong and it’s not working for you, people will assume it’s because you’re doing it wrong,” one product manager told me. “Not because there’s anything wrong with the framework.”

Via Hacker News, Agile and the Long Crisis of Software

That last line, “it's not working for you, people will assume it's because you're doing it wrong,” rings really true to me. At The Corporation … no, I no longer work for The Corporation; I now work for The Enterprise, now that the Corporate Overlords have finally taken over. So, at The Enterprise, I've been informing them pretty much all this year that this “Agile” development system they're forcing on us isn't working. Before they finally took over, the team I was on was always on time and on budget, with smooth deployments (only two bad deployments in ten years) and no show-stopping bugs found in production. As I told upper management, given our prior track record, why change how we do development? Why fix what isn't broken? And while upper management never said this directly, through their actions they answered: this is our process, and we're sticking to it, slipped schedules and disastrous deployments be damned!

As to why I haven't left yet? Because it seems this “Agile” movement has invaded everywhere and things would be “more of the same” elsewhere. At least here, I'm not forced to use Windows.


Programming, up hill, both ways

People would come to us with a problem, and we would figure out a solution. We couldn't just search the web because the web was still being written. And you couldn't just punt a hard question to the engineer in the desk next to you. Why? Because you were sitting alone in a utility closet packed with floppy disks and old tape drives.

I'm a XXXXX­ XX webmaster

Ah, this takes me back. I got my first computer back in 1984, and if I wanted to know anything about it I was on my own. Google didn't exist (the public Internet didn't exist at the time). I didn't have anyone I could ask about computer related things. I did have books and magazines. So between experimentation and learning to read between the lines, I picked up programming.

So when it came time to write a metasearch engine, there were no tutorials. There were no open source metasearch engines to download and use. There was only the problem of writing a metasearch engine, in a language I didn't even know (and which itself was less than a year old at the time).

Fun times.

So I always found it odd when people would go online asking for tutorials, especially for writing metasearch engines (and yes, that did happen back then). So when something like testing a negative comes up, and I can't convince the Powers That Be that it's never a good idea to prove a negative, I can't just look up some tutorial on proving negatives—I just have to figure it out on my own.

Friday, May 20, 2022

If you have to embrace the stupid, you might as well do it well

Our customer, The Oligarchic Cell Phone Company, wants us to do a demo of a new feature for a certain class of clients. “Project: Lumbergh” will receive a URL along with the name and reputation of a phone number it gets from elsewhere. “Project: Lumbergh” will then pass this along to “Project: Sippy-Cup.” We already have to deal with URLs from elsewhere. The only change we have to make is allowing URLs to be passed along to the certain class of clients, which formerly did not get URLs. So far, so good.

But then I saw code being added to “Project: Lumbergh” to check the URLs to see if the path portion ended in .bmp. I enquired about this, because to me, that makes no sense: we're just a conduit for data, and the source of the URL should already know what it can and can't send to the client. I was told that the certain class of clients only supports BMP files while other clients that can receive URLs can't support BMP files, so we have to ensure that BMPs only go to the subset of clients that can support them. I countered with the fact that we include information about the client to the data source when we query them, and they should have the logic to handle this on their end, so why are we suddenly responsible for this? I was told that the LOF for the data source would be too large to handle by the demo deadline, that we had to handle it, that the code that just looks anywhere in the URL for a literal “.bmp” is Good Enough™, and to stop with the questions.

Now the URL we're given is “percent-encoded”; we get something like: https%3A%2F%2Fexample.com%2Fpicture.bmp. Never mind the fact that that is an invalid URL to begin with (you aren't supposed to encode characters that are defined as delimiters in URLs if they are, in fact, delimiting fields), that's what we get and pass along. Only now (a few years after we started passing URLs along like this) the clients can't properly decode them (surprise!), so of course we have to do that. I asked why we even had to do that and was told that the LOF for the data source would be too large to handle by the demo deadline, we had to handle it, and to stop with the questions. I then complained that the code doing that was doing too much, as it would decode the so-called “unsafe characters” from RFC-3986 (which aren't defined in the RFC, but can be derived by a careful reading between the lines), like the dreaded space character.

There was then much back and forth between me and my manager (it's not who I thought it was but that's another rant for another time) about what should and shouldn't be decoded. I kept saying that if we have to embrace the stupid, we might as well do it right, but my manager argued against doing that, saying we should just decode %3A and %2F since that's all that's being asked of us today. I countered with “What about tomorrow, when we're asked to decode %3F (‘?’) and %40 (‘@’)?” (which are delimiter characters per RFC-3986).

I was told to stop with the questions.

And then all hell breaks loose when we get https%3A%2F%2Fexample.com%2FThings%2520Go%2520Boom%2521.
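
For anyone wondering why that last URL is trouble, here is a small sketch in plain Lua. The function names are mine and this is not the code from “Project: Lumbergh”; it only illustrates the difference between decoding a fixed pair of escapes, decoding everything once, and (accidentally) decoding twice.

```lua
-- Decode every %XX escape exactly once.
local function pct_decode(s)
  return (s:gsub("%%(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end))
end

-- Decode only the escapes whose decoded character is listed in `allowed`.
local function decode_only(s, allowed)
  return (s:gsub("%%(%x%x)", function(hex)
    local ch = string.char(tonumber(hex, 16))
    if allowed[ch] then return ch end
    -- returning nothing makes gsub keep the original "%XX" text
  end))
end

local url = "https%3A%2F%2Fexample.com%2FThings%2520Go%2520Boom%2521"

print(decode_only(url, { [":"] = true, ["/"] = true }))
-- https://example.com/Things%2520Go%2520Boom%2521
print(pct_decode(url))
-- https://example.com/Things%20Go%20Boom%21
print(pct_decode(pct_decode(url)))
-- https://example.com/Things Go Boom!
```

The middle result is the only one that is still a usable URL: the first leaves the doubled encoding in place, and the last has a raw space in it, which is exactly what RFC-3986 warns about when it says not to decode the same string more than once.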

Sigh.

Wednesday, May 25, 2022

URI encoding

I've fallen into a rabbit hole of URI encoding and decoding, so why not publish my results here, where I at least know I can look them up again? And who knows? Maybe someone else will find this useful.

Anyway, there are two standards that define URIs:

  1. RFC-3986: Uniform Resource Identifier (URI): Generic Syntax
  2. URL: Living Standard

The first is from the IETF and is what most non-browsers that deal with URIs use. The second is from the WHATWG (and while WHATWG stands for “Web Hypertext Application Technology Working Group,” I always read that as “What Working Group?” which gives away my opinions on this group, truth be told) and is the standard being pushed by the three major browsers left (Chrome, Firefox and Safari).

RFC-3986 is quite clear on when to encode and decode characters:

Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.

RFC-3986, section 2.4: When to Encode or Decode

But you do have to read the ABNF carefully to find the 10 characters not mentioned that must be encoded. The WHATWG standard isn't easy to follow as it describes in all-too-verbose English the algorithm of how to encode and decode a URI, but it does cover what to encode and what not to encode. As I went through both standards and several other sources (links below), I've created the following table of what characters to encode (current as of this date), with a preference for RFC-3986 (but with notes where WHATWG diverges from RFC-3986):

URL percent-encoding chart (per RFC-3986)
char class scheme auth path query fragment note
SPACE - Y Y Y Y
! sub-delim - m m m m
" - Y Y Y Y
# gen-delim - m m m m 4
$ sub-delim - m m m m
% escape - Y Y Y Y
& sub-delim - m m m m
' sub-delim - m m m m
( sub-delim - m m m m
) sub-delim - m m m m
* sub-delim - m m m m
+ sub-delim N m m m m
, sub-delim - m m m m
- unreserved N N N N N
. unreserved N N N N N
/ gen-delim - m m N N
0 unreserved N N N N N
1 unreserved N N N N N
2 unreserved N N N N N
3 unreserved N N N N N
4 unreserved N N N N N
5 unreserved N N N N N
6 unreserved N N N N N
7 unreserved N N N N N
8 unreserved N N N N N
9 unreserved N N N N N
: gen-delim - m N N N 2
; sub-delim - m m m m
< - Y Y Y Y
= sub-delim - m m m m
> - Y Y Y Y
? gen-delim - m m N N
@ gen-delim - m N N N
A unreserved N N N N N
B unreserved N N N N N
C unreserved N N N N N
D unreserved N N N N N
E unreserved N N N N N
F unreserved N N N N N
G unreserved N N N N N
H unreserved N N N N N
I unreserved N N N N N
J unreserved N N N N N
K unreserved N N N N N
L unreserved N N N N N
M unreserved N N N N N
N unreserved N N N N N
O unreserved N N N N N
P unreserved N N N N N
Q unreserved N N N N N
R unreserved N N N N N
S unreserved N N N N N
T unreserved N N N N N
U unreserved N N N N N
V unreserved N N N N N
W unreserved N N N N N
X unreserved N N N N N
Y unreserved N N N N N
Z unreserved N N N N N
[ gen-delim - m m m m 2,3,4
\ - Y Y Y Y 1
] gen-delim - m m m m 2,3,4
^ - Y Y Y Y 2,3,4
_ unreserved - N N N N
` - Y Y Y Y 3
a unreserved N N N N N
b unreserved N N N N N
c unreserved N N N N N
d unreserved N N N N N
e unreserved N N N N N
f unreserved N N N N N
g unreserved N N N N N
h unreserved N N N N N
i unreserved N N N N N
j unreserved N N N N N
k unreserved N N N N N
l unreserved N N N N N
m unreserved N N N N N
n unreserved N N N N N
o unreserved N N N N N
p unreserved N N N N N
q unreserved N N N N N
r unreserved N N N N N
s unreserved N N N N N
t unreserved N N N N N
u unreserved N N N N N
v unreserved N N N N N
w unreserved N N N N N
x unreserved N N N N N
y unreserved N N N N N
z unreserved N N N N N
{ - Y Y Y Y 3,4
| - Y Y Y Y 2
} - Y Y Y Y 3,4
~ unreserved - N N N N
  1. WHATWG: “\” is treated as a “/” in path segment
  2. WHATWG: character not encoded in path
  3. WHATWG: character not encoded in query
  4. WHATWG: character not encoded in fragment
Encoding Key
Y always encode
N never encode
m only encode when not used for their defined purpose (URI scheme dependent)
- not allowed, even escaped
Character classes as defined by RFC-3986
unreserved characters that never need to be encoded
gen-delim characters defined as general use delimiters
sub-delim characters defined as a potential delimiter for subcomponents in a URI
escape character defined to escape other characters
(blank) characters not otherwise defined, and thus must be escaped.

Furthermore, any character not defined in the above table (character codes 0 to 31 and 127 or higher) must also be escaped.
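
As a rough illustration of how the chart can be applied, here is a conservative encoder sketch in plain Lua. The function name `pct_encode` and its `safe` parameter are my own invention; it encodes every byte except the unreserved characters, plus whatever extra characters the caller explicitly says are safe for the component in question. It deliberately errs on the side of encoding the “m” entries, on the assumption that the string being encoded is data and not a delimiter.

```lua
-- Percent-encode `s`, leaving alone only the RFC-3986 unreserved characters
-- and any extra characters listed in `safe` (a plain string of characters).
local function pct_encode(s, safe)
  safe = safe or ""
  return (s:gsub("[^A-Za-z0-9%-%._~]", function(ch)
    if safe:find(ch, 1, true) then return ch end -- plain find, no patterns
    return string.format("%%%02X", ch:byte())
  end))
end

print(pct_encode("Things Go Boom!"))      -- Things%20Go%20Boom%21
print(pct_encode("100%"))                 -- 100%25
print(pct_encode("a b/c", "/"))           -- a%20b/c
print(pct_encode("user@host:1", ":@"))    -- user@host:1
```

Per the chart, something like ":@" would be a reasonable `safe` set for data inside a path segment and ":@/?" for a query, but that choice is scheme dependent, so treat those as starting points and not as gospel.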

References


Notes on an overheard conversation about The Great American Tag Sale with Martha Stewart

“I think Martha's spent too much time hanging with Snoop Dogg.”

“What makes you say that?”

“Look at her! Her dress, her hair, the 50-yard stare into nothing.”

“Maybe you're not used to seeing her at home.”

“Maybe … ”

“Besides, maybe she learned that while in prison.”

“Oh yeah! She did do time in the pokey, didn't she?”

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.