The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Monday, August 02, 2021

I never said testing was bad, just that it's tedious to automate

I might come across as someone who hates testing, if the past two months have been anything to go by, but I've really been complaining about automating the tests, including some rather difficult ones to automate. But honesty compels me to state this: the new regression test has found another potential bug.

Because I added code to delay (or entirely block) responses from the various database sources, a few test cases were added to a problematic feature to ensure it's been fixed. The Happy Path™ has been fixed, but there is a Sad Path™ that's been missed. We query two sources, A and B. In the scenario we are testing, the data we want is from B—any data from A is ignored (but we have to query anyway due to “reasons”). So the case where A has no data and B has data is fine. But when A doesn't return (or times out), the response from B is ignored when it probably shouldn't be (since that data does get back to us). And it would not surprise me if there are more cases like this.
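The Sad Path can be written down as a tiny decision function. This is only a sketch in C: the `src_status` enum and both function names are made up for illustration and are not from the actual code base. It contrasts the behavior described above with one possible alternative.

```c
#include <stdbool.h>

/* Hypothetical outcome of querying one data source. */
typedef enum { SRC_DATA, SRC_NODATA, SRC_TIMEOUT } src_status;

/* Behavior as described above: if A times out, B's answer is dropped too. */
bool use_b_current(src_status a, src_status b)
{
  if (a == SRC_TIMEOUT)
    return false;
  return b == SRC_DATA;
}

/* One possible alternative: A's data is ignored in this scenario anyway,
   so only B's status decides whether B's data is used. */
bool use_b_alternative(src_status a, src_status b)
{
  (void)a;  /* A is still queried, but its outcome is not consulted here */
  return b == SRC_DATA;
}
```

The two functions only disagree when A times out and B has data, which is exactly the case the new test tripped over.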

Normally, I wouldn't expect this to happen all that much [It doesn't. We have a KPI for that, and I don't think it's worth worrying about; the largest spike I've seen over the past month is easily three orders of magnitude lower than our volume, and the rest barely show up on the graph. —Sean], and the re-engineering required to handle these cases might be significant since it would require adding more states to the processing state machine. But that's not my call to make.

Tuesday, August 03, 2021

Yak shaving the garbage disposal

The garbage disposal went kaput last night. Bunny just got home with the new unit. How hard can it be to replace it?

[A shot of a kitchen ceiling, which includes a pan rack, and some decorative copper molds flanking a ceiling fan.] Everything is looking up.  Of course, that's easy to say when one is flat on their back on the kitchen floor.

Next time I'm flat on my back on the kitchen floor, I'll make sure not to ask that question.

Removing the old unit was easy enough—pop the circuit breaker to it (just in case), disconnect the drain hose and the hose to the dishwashing machine, twist hard and let it fall. Then remove a small panel on the bottom of the unit, and unwire the power. The only hard bit was pulling the wires out as they were this very thick solid copper wiring.

Installing the new unit? Well, wiring in the power was a bit tough, what with having to manhandle the thick ground wire (and honestly, I thought that was going to be the hardest bit of the job). Then it was just a simple matter of remounting the disposal to the bottom of the sink. The mounting bracket is standard, so I didn't have to install that. It was a matter of lifting the unit, catching some tabs onto a rail and twist!

Half an hour later and I'm flat on my back, garbage disposal still not mounted, and out of ideas. It's that the unit is just heavy enough to be unwieldy, and it was proving to be very difficult to line the tabs up and ensure they engaged with the rails.

Bunny suggested we set the unit on something to help hold it up. What finally worked was a small step stool, several planks of wood and a few shims. Took just a few seconds.

[Picture of a garbage disposal, sitting on some wedges, several pieces of wood, and a small foot stool I couldn't fit into the shot.  Also, there's a drain pipe hanging off the left side of the unit.] See that black thing hanging off the left side?  Yes, that's the drain pipe.  That's going to prove to be problematic in a few moments.

Then the next problem cropped up—the drain pipe out of the garbage disposal was too long, and in the wrong location, to hook back up to the normal plumbing. Inspecting the old unit showed the drain pipe had been cut, and given the new unit was pretty much the same as the old one, I figured I might as well use the drain pipe from the old unit. Only issue here was getting the drain pipe off the new unit.

It turned out I had to unmount the unit.

Sigh.

A few moments to switch out the drain pipe, and then another twenty minutes trying to remount the unit back to the underside of the sink. I think I got very lucky on the first attempt, because the second attempt was not going well.

But eventually I got it.

Then all that was left was to reattach the dishwasher hose, then the drain pipe to the rest of the plumbing and … oh look! That part came off in my hand.

Sigh.

So tomorrow it's off to the hardware store for some new plumbing hose, and we should be all good to go.

“Oh no!”

I was afraid to ask, but did anyway. “What's up?”

“The toilet … ”

Friday, August 06, 2021

I'm not sure what I hate more, control panels or unit tests über alles

I am experiencing culture shock with the new testing regime.

So I had this exchange with the new manager, AG, over email. AG: “In which case or scenario does ‘Project: Sippy-Cup’ send a 480 response code? Please see the attached packet capture.”

I answered, “‘Project: Sippy-Cup’ does not send a 480 response. I checked the packet capture, and the IP address does not appear to be ours.”

AG replied, “Thanks for the information. Can you also confirm that there is no code in ‘Project Sippy-Cup’ to send a 480 response code?”

When I read that, I admit I got a bit upset—I did check the code the first time around! Does he not trust me? But upon reflection, I could see how AG might have thought I didn't check the code to “Project: Sippy-Cup” at all because the packet capture had the incorrect IP addresses. I see I may have to be more explicit in my future responses.

Then during the regularly scheduled meeting AG asked if we had any tests for 480 responses. … What? I replied, “No, because ‘Project: Sippy-Cup’ does NOT send a 480 response.”

AG then asked if there was a list of responses “Project: Sippy-Cup” replies with. Again, the answer is no, not explicitly written down. I then read through the code saying out loud each response code it does return. Then AG asked if there were any tests for any of those responses, like the “version not supported” response?

I'm having a hard time wrapping my brain around a test for

if message.request.version ~= "2.0" then
  info.socket:send(remote,pack_error(message,response.VERSION_NOT_SUPPORTED))
  return
end

Really? Testing the convoluted business logic I can see, but this? What? I got the logic wrong? There's a bug in two lines of code?

So now I have to find a way to inject invalid SIP messages into the regression test.

I swear, mocking system calls is coming soon …

Wednesday, August 11, 2021

A most persistent spam, part II

Shortly after I wrote about the Russian spam from Aleksandr the nature of the emails changed. They are now four attached Microsoft Word files. I'm not sure if they're infected or not, but it matters to me not, because I can't read the darned things since I am Microsoft free. Except for a small handful of emails where they appear to be missing the four Microsoft Word files and it seems they were never included in the email. It was weird enough to do a bit more investigation.

I have a stockpile of these emails now, and I've noticed an interesting thing—all of them are addressed to one of just two addresses. The first one is info@conman.org, a catch-all address that is mentioned in RFC-2142, and the second one is rparker@numbersstation.info, which I've only mentioned once on this blog and one would have to actively search for to even find it elsewhere.

So I have three options before me:

  1. nuke both the info@conman.org and rparker@numbersstation.info addresses as they're not really used;
  2. set up a custom email filter rule that will tell my greylist daemon to reject emails from the IP address (which I did manually, but it changes quite often);
  3. set up a custom email filter rule that will tell the firewall to block the IP address from even connecting.

The first is easy, but I wonder how long until Aleksandr finds another address to spam. The other two are a bit more involved. I think I'll try the first one and see how long that lasts, and only if the spam returns will I mull over the other two options.

Sunday, August 15, 2021

The Dungeons and Dragons 5th Edition Challenge Rating system is deranged

I've been running a D&D5 game for two years and I've yet to come to grips with the “Challenge Rating” system for monsters. There have been times when what I think is a balanced encounter is one sided with the players going “Okay, that must have been the minion—where's the actual boss monster?” and other times when I throw something that should end with a total party kill, but instead is just a “challenging” encounter the party survives.

I think the theory is that a group of four players of level N should be equally matched with one monster of a challenge rating of N. And I read “equally matched” as “a 50/50 chance for either side to be killed.” Whether that's a good reading of the term “equally matched” is debatable, but that's how I'm initially reading it. It makes sense to me—if you have two combatants (and let's face it—D&D involves a lot of combat) of equal skills and equipment, then it should be a 50/50 toss up for the winner. And in plenty of spots in the various D&D tomes I read that four level 1 characters should be an equal match to one CR1 monster.

So I wrote a simulation to test that out. I have four level 1 characters that are pitted against one CR1 monster for 10,000 simulated fights. For the four players, I used the four archetypal types—a cleric, fighter, thief and wizard. For the sake of simplicity, all four will always “run up” to the monster and fight—no ranged attacks are done. I know this isn't completely realistic, but for a “proof-of-concept” it should be good enough.
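The dice rolling behind a simulation like this can be sketched in a few lines of C. To be clear, this is not the simulation's actual code (which isn't shown here): the xorshift generator and the `roll` and `attack_hits` helpers are assumptions made for illustration. The attack rule itself is the usual one: d20 plus attack bonus has to meet or beat the target's AC, a natural 1 always misses and a natural 20 always hits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Small xorshift32 generator, so a run can be repeated from a seed.
   The seed must be nonzero. */
uint32_t rng_next(uint32_t *state)
{
  uint32_t x = *state;
  assert(x != 0);
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return *state = x;
}

/* Roll <count>d<sides>+<bonus>; roll(&s,1,8,2) would be 1d8+2.
   The modulo introduces a tiny bias, which is fine for a sketch. */
int roll(uint32_t *state, int count, int sides, int bonus)
{
  int total = bonus;
  assert(count >= 0 && sides >= 1);
  for (int i = 0; i < count; i++)
    total += (int)(rng_next(state) % (uint32_t)sides) + 1;
  return total;
}

/* Resolve one melee attack against a target's armor class. */
bool attack_hits(uint32_t *state, int attack_bonus, int target_ac)
{
  int d20 = roll(state, 1, 20, 0);
  if (d20 == 1)  return false;  /* natural 1 always misses */
  if (d20 == 20) return true;   /* natural 20 always hits  */
  return d20 + attack_bonus >= target_ac;
}
```

A full battle would then loop over rounds, order the combatants by initiative, and call `attack_hits` and `roll` for each attack until one side has no hit points left.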

Anyway, our poor unfortunate victims:

The Cleric
Has an AC of 14, 1d8+2 HP, no initiative bonus, +1 to attacks, and will do 1d8+1 worth of damage if the attack hits. The Cleric will not attempt to heal any damaged comrades as I don't expect the encounters to last long enough for that to matter.
The Fighter
Has an AC of 16, 1d10+2 HP, +2 to initiative, +2 to attacks, and will do 1d8+2 worth of damage.
The Thief
Has an AC of 13, 1d8+1 HP, +2 to initiative, +2 to attacks, and because of the strategy of the other players (“run up and hit”) the Thief can use their “Sneak Attack” feature to do an additional 1d6 worth of damage, on top of the 1d4+2 of damage normally done (here, the Thief will always use a dagger).
The Wizard
Has an AC of 10, 1d6 HP, +2 to initiative and no bonuses to attacks. Normally, wizards would be in the back of the party doing ranged attacks, but because I didn't want the complication of ranged attacks, the Wizard is right up front and close. But I do allow the Wizard the use of two Magic Missiles (the maximum allowed for that spell, which always hit) for the first two rounds (each doing 3d4+3 damage), then Fire Bolt (which does 1d10, but doesn't always hit).

These are pretty bog-standard 1st-level characters. I then went through picking out a few CR1 monsters to throw against the party. The unfortunate enemy victims include:

Animated Armor
Has an AC of 18, 6d8+6 HP, no initiative bonus, +4 to attacks, and makes two attacks, each doing 1d6+2 damage.
Bugbear
Has an AC of 16, 5d8+5 HP, +2 to initiative, +4 to hit, doing 2d8+2 worth of damage.
Goblin Boss
Has an AC of 17, 6d6 HP, +2 to initiative, +4 to attack, and gets two attacks, the second of which is at disadvantage, both of which do 1d6+2 worth of damage.
Half-Ogre
Has an AC of 12, 4d10+8 HP, no initiative bonus, +5 to attack, which does 2d10+3 damage.
Lion
Has an AC of 12, +2 to initiative, +5 to attack, and does 1d8+3 worth of damage.

And because it wouldn't be D&D without the dragons, there's one dragon at CR1—the Brass Dragon Wyrmling (a baby brass dragon):

Brass Dragon Wyrmling
Has an AC of 16, 3d8+3 HP, no initiative bonus, a +4 to attacks which does 1d10+2 damage, or a breath weapon which can do 4d6 worth of damage if the victim can't jump out of the way (but it only has a 1 in 3 chance of recharging each round of combat).

For the Brass Dragon Wyrmling, I decided to have it lead with its breath weapon always, and prefer the use of its breath weapon when it recharges, but if not, do a normal attack.

It's not every CR1 monster, but I think it's a decent list to start with, and it should give me a feeling for how this “Challenge Rating” system is supposed to work.
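Some back-of-the-envelope arithmetic gives a rough feel for how these stat blocks compare. The sketch below (in C, with helper names of my own choosing) counts how many faces of a d20 produce a hit for a given attack bonus and armor class, and computes the average of a dice expression. Averages are doubled so everything stays in integers, since a single die averages (sides + 1) / 2.

```c
#include <assert.h>

/* Number of d20 faces (out of 20) on which an attack hits.
   A roll r hits when r + attack_bonus >= target_ac. */
int hit_faces(int attack_bonus, int target_ac)
{
  int faces = 21 - (target_ac - attack_bonus);
  if (faces < 1)  faces = 1;   /* a natural 20 always hits  */
  if (faces > 19) faces = 19;  /* a natural 1 always misses */
  return faces;
}

/* Twice the average of <count>d<sides>+<bonus>. */
int twice_avg(int count, int sides, int bonus)
{
  assert(count >= 0 && sides >= 1);
  return count * (sides + 1) + 2 * bonus;
}
```

With the numbers listed above, `hit_faces(2,18)` is 5, so the Fighter hits the Animated Armor a quarter of the time, while `hit_faces(2,12)` is 11 against the Half-Ogre or the Lion. Likewise `twice_avg(6,8,6)` is 66, an average of 33 HP for the Animated Armor, against `twice_avg(3,8,3)` of 33, or 16.5 HP, for the Brass Dragon Wyrmling.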

On with the simulation!

10,000 battles between a 1st-level party and some CR1 monsters
monster                  TPK%   #rounds  #lived
Animated Armor           28.23  4        2
Brass Dragon Wyrmling     0.36  3        3
Bugbear                   4.81  3        3
Goblin Boss               1.18  3        3
Half-Ogre                 2.56  3        3
Lion                      0.72  3        3

Well … [Deep subject. —Editor] I'm not sure what to make of the results. Either the simulation is crap, or the whole “Challenge Rating” system is crap. I've heard that it tends to break down at the higher levels, but even here, it doesn't seem to work. A baby brass dragon is less dangerous than a lion? An empty shell of armor is the most dangerous monster? What?

Okay, about those columns—the TPK% column is the percentage of battles in which the entire party was wiped out by the encounter. The last column is, for the battles where the party killed the monster, how many party members were left standing on average (rounded to the nearest whole value).

So then I decided to throw some other monsters into the mix. And in keeping with the “dragon” portion of D&D, I found a dragon of CR2, CR3 and CR4 (and used the same strategy I used for the Brass Dragon Wyrmling—leading with the breath weapon). And because I'm just mean, I decided to throw in one of the weakest creatures of D&D into the mix—a single kobold. Behold:

10,000 battles between a 1st-level party and some non-CR1 monsters
monster                  TPK%   #rounds  #lived  CR
Kobold                    0.00  2        4
Black Dragon Wyrmling    27.83  4        2       2
Blue Dragon Wyrmling     71.86  6        1       3
Red Dragon Wyrmling      98.36  6        0       4

To be fair, there is a chance for one party member to be killed by the kobold, as the actual average for the number of party members surviving is 3.93—a single kobold is not a complete cakewalk. And there's a slightly better than 1% chance of killing a baby red dragon as a 1st-level party (better than I expected). But somehow, it still feels wrong that some animated armor is still a bit more dangerous than a black baby dragon.

I guess I'll just have to throw my group (a mix of 5th- to 7th-level characters) in front of an ancient red dragon and see what happens … oh wait, did I mention that out loud?

Wednesday, August 18, 2021

A serious lack of power

I walked into the Computer Room at Chez Boca to a squealing UPS, but unlike the last time this happened, the computers were still powered up. I shut down both computers and decided to deal with them after lunch. There wasn't much else I could do at that point, seeing how the power was out for our entire neighborhood.

After lunch, the power was restored, so I went to power up the Linux box and … it didn't come on.

XXXXXXXXXXXXXXXXXXXXXXXXXXXX!

I checked the internal cables to make sure none had come loose, evicted multiple dust bunnies, and even changed the CMOS battery just to be sure. But nothing I did made the Linux box power up.

I called my friend Tom Singkornrat (not only is he wise in the way of hardware, but he lives less than a mile away) and he said that it could be one of two possible reasons—it could be a bad power supply, or it could be a bad motherboard. He had a spare power supply I could try. Why not?

The only difficult part of the whole procedure was unscrewing the old power supply from the case. Other than that, the new power supply fit right in and it powered right up. Great! But for some reason the keyboard wasn't working …

Update later today …

I found the issue with the keyboard.


The Keyboard situation

Okay, about that keyboard

The setup I have is … not straightforward. I have two computers, a tower running Linux and a Mac mini. I have a single keyboard and mouse plugged into a KVM switch (an industrial 8-port unit I acquired mumblely-mumble years ago), but each computer has its own monitor. I also run Synergy on both computers to allow sharing of the keyboard and mouse across both computers. It will also allow you to move the mouse from one computer to the other (and supports “cut-n-paste” operations between the two computers). It's a neat program and really, the KVM is there when I need to shut down in the event of a power outage (and when I boot the systems back up, before I get Synergy running).

The KVM expects PS/2 style connectors, but both computers need a USB converter. This all just works and I rarely have issues.

Until I do.

This time, it turned out to be the cheap USB converter on the Linux system that just … I don't know, gave up the ghost or something (thank God! I was fearful the USB ports might have been blown out). I can work around it with Synergy, but it's a bit of a pain initially (boot one computer, switch keyboard/mouse to other computer, boot that one, switch keyboard/mouse back to get Synergy running). So now that I'm back up and running, I just need to get a new USB converter and maybe this time it'll last longer than a year.


The Lord of the Rings as a D&D game

Ever wonder what The Lord of the Rings would be like as a D&D game? Well, neither did I until I saw this video series:

It starts out with a large group, then due to interpersonal conflicts, the game splits into the role players and the roll players, then they all get back together for the final boss fight.

And Frodo's player never did learn the D&D combat system.

Thursday, August 19, 2021

The case of the regression test regression

So the SVN-git conversion is back on the table. When I worked on this last year, I stopped when it became clear that a critical build server didn't support git. I duly reported the issue, and it's only been fixed in the past few months. Last year, it was delegated solely to me. This year, it's more of a team effort.

For our team, we currently have two main SVN repositories with some duplicate third party packages checked in. We went through the lists to remove the redundancies (a Good Thing™) and I felt that it might be a good time to ensure they're up to date. I spent a week doing so and ran into an odd issue—the old regression test now runs slower! Not thirteen (or nineteen) hours slow but slow enough to be seriously annoying. I was able to isolate the slowdown to some Lua modules (that I wrote cough cough—I didn't think the changes were that bad).

We do development primarily on Macs. My first thought was to either confirm or deny that the Mac was the issue. I ran tests on a Linux laptop I have for work and nope—the slowdown exists on that. I then tried to profile the code (to no conclusive result). I compared the code between the two versions (only five lines changed in the regression test, and all of those were small API changes like fsys.redirect(stdin,fsys.STDIN) becoming fsys.redirect(stdin,io.stdin)). I just could not figure out why the new code was slower.

So then I decided to just time things. I added code to time how long it took to run each test (and to limit the number of tests run—enough to show the issue, but not the entire 15,852 tests) and plot the results.
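The general shape of “time each test” looks something like the sketch below. It is written in C rather than the Lua the regression test actually uses, and `now_us`, `time_test` and `sample_test` are illustrative names, not anything from the real test harness. It uses C11's `timespec_get`, which reads the wall clock; where one is available, a monotonic clock would be a better choice for measuring intervals.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current wall-clock time in microseconds. */
int64_t now_us(void)
{
  struct timespec ts;
  int rc = timespec_get(&ts, TIME_UTC);
  assert(rc == TIME_UTC);
  (void)rc;  /* silence "unused" when asserts are compiled out */
  return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Run one test case and return how long it took, in microseconds. */
int64_t time_test(void (*test)(void *), void *arg)
{
  int64_t start = now_us();
  test(arg);
  return now_us() - start;
}

/* Stand-in for a real test case: it just counts how often it was run. */
void sample_test(void *arg)
{
  int *runs = arg;
  (*runs)++;
}
```

Calling `time_test(sample_test,&runs)` once per test case and writing out the test number with the returned value gives one data point per test, which is all a plot needs.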

[Graph of the time to run each test from the old regression test and the new regression test.  The old one is faster.]

The X-axis is the test case number, the Y-axis is time in microseconds (and plotted on a logarithmic scale, otherwise, you wouldn't be able to make out any detail along the lower portion). The new regression test is in green, the old one in red. It's clear that there are a few dozen tests that take a full second to run in the new program, whereas the average (of both, aside from the outliers) seems to be around 5ms. So good, the problem clearly shows up.

Then I timed each section of running a test to see which section is sucking up the time, and eventually, I get this graph:

[Graph of the small segment of the test that is taking all the time.]

[The sharper eyed of you might have noticed the X-axis change—each test is run twice (for reasons). This means the second graph has twice as many entries. Yes, that means the full 15,852 tests are each run twice, for 31,704 requests. —Sean]

This represents the timing of this bit of code:

-- --------------------------------
-- Receive and process NOTIFY
-- --------------------------------

local remote,data = sipsock:recv(10)
if not data then
  return false,"no NOTIFY"
end

The only program that's different is the regression test, not the programs being tested. And it's not like I updated the version of Lua either—that didn't change. The third party code I'm using, yes, to some degree. But the underlying code to sipsock:recv() did not change. Yet it's this bit of code that is consistently slow for a consistent set of tests.

To say I'm a bit perplexed is a bit of an understatement.

Monday, August 23, 2021

I'm a computer illiterate dinosaur. Thanks, Microsoft!

All I needed to do was transfer a file to a fellow cow-orker. That's it. We were in a meeting, via Microsoft Teams. I was using the dreaded Corporate Overlords' mandated managed Microsoft Windows laptop. The file in question was on my work related Linux laptop (it was the first laptop I received when I started working at The Corporation 11 years earlier). So on the Windows laptop, I open up CMD.EXE:

Microsoft Windows [Version 10.0.18363.1621]
(c) 2019 Microsoft Corporation. All rights reserved.

C:\Users\sconner>cd Desktop

C:\Users\sconner\Desktop>scp spc@linux-laptop:work/file-to-trans .
spc@linux-laptop's password:
file-to-trans                        100% 6246KB  10.2MB/s   00:00

C:\Users\sconner\Desktop>dir
 Volume in drive C is Windows
 Volume Serial Number is 267D-3086

 Directory of C:\Users\sconner\Desktop

08/23/2021  06:55 PM    <DIR>          .
08/23/2021  06:55 PM    <DIR>          ..
08/23/2021  07:18 PM         6,395,862 file-to-trans
               1 File(s)      6,395,862 bytes
               2 Dir(s)  436,294,770,688 bytes free

C:\Users\sconner\Desktop>

I then open up File Explorer (sigh … that took entirely too long to find the name of that program), select “Desktop” and … the file is not there.

What?

Is the “Desktop” in the File Explorer not the same as “C:\Users\sconner\Desktop”? When did that happen?

My fellow cow-orker had to talk me through finding the file in question in the File Explorer, because “Desktop” there isn't the same one as “Desktop” in CMD.EXE.

What?

My brain broke.

Anyway, the file found, I then had to be talked through how to initiate a file upload through Microsoft Teams, and on this point, said fellow cow-orker and another fellow cow-orker had to work through how to send the file. Yes, apparently it takes three software developers to initiate a file upload through Microsoft Teams.

And then I had to describe to fellow cow-orkers that yes, this will take some time because I'm behind an honest-to-God DSL connection and NOT a multi-gigabit ethernet connection, because Chez Boca exists outside the service area for anything faster. It took several minutes of convincing them of this.

And the upload failed, probably because I'm behind an honest-to-God DSL connection and NOT a multi-gigabit ethernet connection.

Seriously, how hard is it to transfer SOME BITS?

So, not only did I have trouble with sending a fellow cow-orker a file, I somehow destroyed CMD.EXE's ability to display text as I was writing this entry. WTF? All I was trying to do was find a way to cut-n-paste the text from CMD.EXE and somehow, I managed to XXXX CMD.EXE so bad, it wouldn't show text any more. Several minutes later, I got the text back up (how? XXXX if I know) but now the cursor is missing.

XXXX it! I can deal with that.

<slow-clap>

Way to go, Microsoft!

But when did I become so computer illiterate?

Tuesday, August 24, 2021

All your CPUs belong to us

As if writing software without exploits weren't hard enough, now we have the most popular computer architecture, the Intel x86 line of CPUs, with a potential hole large enough to drive the NSA through. In a DEF CON talk, Christopher Domas shows how he found an exploit on a particular version of the x86 CPU that allowed him to gain total control over the computer without the operating system even knowing about it. All it involves is one undocumented instruction that enables access to a hidden CPU inside the x86 CPU (or rather, perhaps allows direct access to the underlying core that is simply interpreting the x86 ISA), followed by multiple copies of an x86 instruction that feeds instructions directly to this inner CPU. Those instructions bypass all system checks because this inner CPU has access to everything (from user mode, and if you understand that statement, you know how bad it is).

As mentioned, this is only for a particular x86 implementation, but who knows what evils lurk in the heart of CPUs?

Probably the NSA.


The case of the regression test regression, part II

When you have eliminated the impossible, whatever remains, however improbable, must be the truth.

Sherlock Holmes

When last I left off I identified the slow code in the regression test and it left me puzzled—it was a single function call that did not change between versions. Now, a bit of background: eight years ago [Eight years⁈ Where did the time go? —Sean] [A world wide pandemic. —Editor] [Gee, thanks. —Sean] I wrote a custom Lua interpreter that contains all possible Lua modules we could possibly use at work in order to avoid having a bunch of code to install, which I call kslua (which stands for “Kitchen Sink Lua”). And so far, that's what I've been using to run the regression test.

Faced with the fact that the sipsock:recv() call was taking upwards of a second, I decided to update just that module to the latest in the fast version of the regression test as a sanity check. Well, it failed as a sanity check, because the latest version of that module that contains that function ran fast, so my sanity wasn't saved one bit. The only conclusion I can come to is that something else has changed!

Fortunately, something else has changed. A bit more background: the regression test is used to test “Project: Sippy-Cup,” “Project: Lumbergh” and “Project: Cleese.” And to run those programs, I need a few more programs that those programs communicate with, and oh hey! There's a program that the regression program runs that also runs via kslua! And through a tedious process of elimination, I finally found a module that causes the slowdown—the network event driver module I wrote. I then went through a tedious process of elimination to find the exact change that causes the slowdown. The “fast” version of the function in question, which is written in C, is:

static int polllua_insert(lua_State *L)
{
  pollset__t *set = luaL_checkudata(L,1,TYPE_POLL);
  int         fh  = luaL_checkinteger(L,2);
  
  lua_settop(L,4);
  
  if (set->idx == set->max)
    /* ... */

and the slow version, which is the next literal version of the code:

static int polllua_insert(lua_State *L)
{
  pollset__t *set = luaL_checkudata(L,1,TYPE_POLL);
  int         fh;

  lua_settop(L,4);
  
  if (!luaL_callmeta(L,2,"_tofd"))
  {
    lua_pushinteger(L,EINVAL);
    return 1;
  }
   
  fh = luaL_checkinteger(L,-1);

  if (set->idx == set->max)
    /* ... */

I got tired of having to write (in Lua):

SOCKETS:insert(sock:_tofd(),'r',handler)

so I changed the code to call _tofd() directly:

SOCKETS:insert(sock,'r',handler) -- the system will know to call _tofd()

The only thing is—the program that calls this only calls this once in the program.

At startup.

Desk, meet head.

So I'm again failing to see how this causes the slowdown. I use the “fast” version and the regression runs fast. I click the version of that module one step forward and it's slow.

It's maddening!

Wednesday, August 25, 2021

The case of the regression test regression, part III

When I last left off I identified the slow code in the regression test and it left me puzzled—it was a single function call that was called during startup and not when the tests were running. Faced with the fact that I had identified (for a second time) that code that did not change was causing a second of delay, I decided to change my approach. Instead of starting with the working, fast version of the code and going forward, I would start with the current, slow version of the code and work backwards. The first point of business—see if I could reproduce the fast code/slow code change with the code I identified as a sanity check. And of course that sanity check failed as both code versions were now slow.

What was that Sherlock Holmes quote? Ah yes, “When you have eliminated the impossible, whatever remains, must be the truth.” Must be the computer hates me.

Okay, so now it seemed the slowdown was not in the regression test script, but in one of the mocked services. So I did to that script what I did to the regression test script—record how long it takes to handle a query. And of course it always responded in a timely fashion. No second long delay found at all!

So the delay is not in the regression test. The delay is not in the mocked service. I am slowly running out of ideas.

Okay, time to take stock: what do I know? Some queries are taking up to a second to run. They appear to be the same tests from run to run. Okay, let's run with that idea.

I modify the regression test to kick out which test cases are taking over 900,000 microseconds. I then just run those tests and yes, it's consistent. I remove those slow tests, the regression test just flows. If I just run the “slow” tests, each one takes about a second to run.
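Picking out the slow cases is just a threshold filter over the recorded timings. A sketch in C, where the function name and array layout are assumptions and only the 900,000 microsecond cutoff comes from the text above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOW_US 900000  /* cutoff from the text: 900,000 microseconds */

/* Store the indices of tests that took longer than SLOW_US into <slow>
   (at most <max> of them) and return how many were stored. */
size_t find_slow(const int64_t *elapsed_us, size_t count, size_t *slow, size_t max)
{
  size_t found = 0;
  assert(elapsed_us != NULL && slow != NULL);
  for (size_t i = 0; i < count && found < max; i++)
    if (elapsed_us[i] > SLOW_US)
      slow[found++] = i;
  return found;
}
```

The resulting list of indices can then drive both follow-up runs: one with only those tests, and one with everything except them.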

So now I know it's consistent. What's consistent about them?

Now I start digging into the logs and … yeah … something odd is going on with those tests. It's a path that “Project: Lumbergh” can take, and it's hitting a mocked service that I keep forgetting about. I mean, the regression test starts it up, but otherwise, it's pretty silent (and pretty easy to forget about). I add some logging to this particular mocked service, run the regression test and see that it responds once, then nothing ever again.

I add logging to the “fast” version and yes, it gets hit, and keeps responding.

I check the differences between the two versions:

[spc]saltmine-2:~>diff path/to/slow/version path/to/fast/version
96c93
< nfl.SOCKETS:insert(sock,'r',function()
---
> nfl.SOCKETS:insert(sock:fd(),'r',function()
106c103
< nfl.server_eventloop()
---
> nfl.eventloop()
[spc]saltmine-2:~>

Yeah … just some API updates, like yesterday. But if I run the new regression test while running the older version of this one mock, the new regression test runs as I expected. The delay is actually coming from “Project: Lumbergh” because the new version of this mocked service doesn't respond properly, causing “Project: Lumbergh” to time out, thus the delay I'm seeing.

I think I'm finally getting somewhere, but why is it so hot? And what am I doing in this basket?

Thursday, August 26, 2021

The case of the regression test regression, part IV

Like all bugs, it was a simple fix once identified—it was just identifying the bug that was difficult. The bug was due to a semantic change in an API. Lua has a file interface that includes a read() function, and the parameter it's given dictates how much to read. One of the parameters is “a” (or “*a” if you are using Lua 5.1) that indicates you want to read all possible data until the end-of-file is reached.

I ended up writing my own read() function to deal with network activity. I couldn't just open up a normal Lua file object with a network connection because that doesn't work nicely with event driven programs. An earlier version of my read() function had a different semantic meaning with “a”—it just returned any buffered data. It was in a later change where I aligned the semantics to match Lua's semantics, but forgot to change that bit of code in the one component. And those semantic changes explain the behavior I was seeing yesterday. It only took what? A week? To find the culprit.
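The difference between the two meanings of “a” can be shown with a toy model. This is a sketch in C and not the actual read() function: the `stream` struct and both function names are made up for illustration. It tracks only how many bytes are buffered and whether end-of-file has been seen.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffered network stream. */
typedef struct
{
  size_t buffered;  /* bytes received so far and not yet consumed  */
  bool   eof;       /* true once the peer has closed the connection */
} stream;

/* Older meaning of "a": hand back whatever is buffered right now. */
size_t read_all_old(stream *s)
{
  assert(s != NULL);
  size_t n = s->buffered;
  s->buffered = 0;
  return n;
}

/* Lua's meaning of "a": read everything up to end-of-file, so nothing
   is handed back while the connection is still open.  In this model a
   return of 0 before EOF means "keep waiting". */
size_t read_all_lua(stream *s)
{
  assert(s != NULL);
  if (!s->eof)
    return 0;
  return read_all_old(s);
}
```

In this model, code written against the older meaning gets its data as soon as it arrives, while the same call under Lua's meaning keeps waiting for as long as the other side leaves the connection open.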

Now the regression test with the updated code runs as fast as the previous version.

Saturday, August 28, 2021

A most persistent spam, part III

Earlier this month I nuked two email addresses being spammed by Aleksandr and wanted to wait to see how things went. Two weeks later, Aleksandr's emails (all “his” emails came from the same domain but from different subdomains) are still being sent, still being delayed by my greylist daemon only to be rejected because the destination email address no longer exists. I'm surprised that they are still attempting to deliver the emails. Is it that Aleksandr doesn't care if the emails make it or not? I noticed that the IP addresses are changing more often now. Is it a test to see if the IP address being used is blocked? Is it a smear campaign against Aleksandr? What is the end game here?


The script kiddies have come to Gemini

Logging of just about anything on a Gemini server is, within the Gemini community, a contentious issue. Most people using Gemini dislike the web and the intensive logging of everything done with it, so of course they go overboard in the other direction. I've never agreed with that viewpoint and I do log information, even potentially personally identifiable information like IP addresses, because of crap like what 5.252.227.126 is doing: repeatedly (and quite maliciously) trying to crawl my Gemini server for exploits.

And it would be one thing if it were well written, did a single scan, found nothing and moved on. But it's not well written (hell, even commercial bots aren't well written and well behaved) and it repeatedly requests the same page over and over again. Until I blocked it, it had requested /index.php over 700 times. A sample:

Requests from a badly written Gemini bot that started just today

count  request
  722  gemini://gemini.conman.org/index.php
  721  gemini://gemini.conman.org/test/torture/index.php
   86  gemini://gemini.conman.org/login.php
   86  gemini://gemini.conman.org/test/torture/login.php

Among other requests.
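
A tally like the one above is easy to produce. This is a minimal sketch, assuming the requested URLs have already been pulled out of the log into a list; the tally function and the three sample entries are made up for illustration and are not the real log data.

```lua
-- Count how often each URL was requested and list the busiest first.
local function tally(requests)
  local counts = {}
  for _, url in ipairs(requests) do
    counts[url] = (counts[url] or 0) + 1
  end

  local rows = {}
  for url, n in pairs(counts) do
    rows[#rows + 1] = { url = url, count = n }
  end

  -- Highest count first; ties are broken alphabetically by URL.
  table.sort(rows, function(a, b)
    if a.count ~= b.count then
      return a.count > b.count
    end
    return a.url < b.url
  end)
  return rows
end

-- Made-up sample input (an assumption; not taken from the real log).
local sample = {
  "gemini://gemini.conman.org/index.php",
  "gemini://gemini.conman.org/login.php",
  "gemini://gemini.conman.org/index.php",
}

local rows = tally(sample)
for _, row in ipairs(rows) do
  print(string.format("%5d  %s", row.count, row.url))
end
```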

And even now, several hours after I blocked it, it's still trying to make requests even though it's now getting the “no such port” error from the server. I just have to wonder if it's so cheap to run these bots that it doesn't matter if they don't work all that well. Just well enough to keep going and finding exploits. And hey, just because we can't get that page now doesn't mean it won't exist in 20 minutes, so keep making those requests.

Sigh. This is why we can't have nice things on the Internet.

More broadly though, I'm not sure what this heralds for Gemini. On the one hand, it's popular enough to attract the script kiddies to check for exploits. On the other hand, there's a large number of different servers so exploits for one server won't necessarily imply a globally workable exploit. And on the gripping hand, the fact that it's easy to write a server means the likelihood of an exploitable server is high.

But hey! Gemini hit a milestone! Script kiddies have hit the scene and now we have to contend with their crap! Woot!

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2021 by Sean Conner. All Rights Reserved.