Wednesday, August 25, 2021
The case of the regression test regression, part III
When I last left off, I had identified the slow code in the regression test, and it left me puzzled—it was a single function call made during startup, not while the tests were running. Faced with the fact that, for the second time, I had identified code that did not change as the source of a second-long delay, I decided to change my approach. Instead of starting with the working, fast version of the code and going forward, I would start with the current, slow version and work backwards. The first order of business: as a sanity check, see if I could reproduce the fast-code/slow-code difference with the code I had identified. And of course that sanity check failed, as both versions were now slow.
What was that Sherlock Holmes quote? Ah yes, “When you have eliminated the impossible, whatever remains, must be the truth.” Must be the computer hates me.
Okay, so now the suspect was no longer the regression test script, but one of the mocked services. So I did to that script what I did to the regression test script—record how long it takes to handle a query. And of course it always responded in a timely fashion. No second-long delay found at all!
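Out of curiosity, here's roughly what that per-query timing looked like. This is a sketch, not the mock's actual code; the wrapper and its logging are mine, and it assumes LuaSocket's socket.gettime() for sub-second timestamps:

local socket = require "socket" -- LuaSocket, for socket.gettime()

-- A sketch: wrap a query handler so each call logs how long it took.
-- timed() and the logging destination are stand-ins, not the real code.
local function timed(handler)
  return function(...)
    local start   = socket.gettime()
    local results = table.pack(handler(...)) -- Lua 5.2+
    io.stderr:write(string.format("query handled in %.6f seconds\n",
                                  socket.gettime() - start))
    return table.unpack(results,1,results.n)
  end
end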
So the delay is not in the regression test. The delay is not in the mocked service. I am slowly running out of ideas.
Okay, time to take stock: what do I know? Some queries are taking up to a second to run. They appear to be the same tests from run to run. Okay, let's run with that idea.
I modify the regression test to report which test cases are taking over 900,000 microseconds. I then run just those tests and yes, it's consistent. If I remove those slow tests, the regression test just flows. If I run only the “slow” tests, each one takes about a second to run.
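The check itself is trivial. Something along these lines (a sketch with made-up names; the 900,000 microsecond threshold is the only detail from the actual script):

local socket = require "socket" -- LuaSocket, for socket.gettime()

local THRESHOLD = 0.9 -- 900,000 microseconds, expressed in seconds

-- A sketch: run one test case and report it if it blows past the
-- threshold.  case.run() and case.name are hypothetical fields.
local function check_case(case)
  local start   = socket.gettime()
  local okay    = case.run()
  local elapsed = socket.gettime() - start
  if elapsed > THRESHOLD then
    print(string.format("SLOW: %s took %.6f seconds",case.name,elapsed))
  end
  return okay
end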
So now I know it's consistent. What's consistent about them?
Now I start digging into the logs and … yeah … something odd is going on with those tests. It's a path that “Project: Lumbergh” can take, and it's hitting a mocked service that I keep forgetting about. I mean, the regression test starts it up, but otherwise, it's pretty silent (and pretty easy to forget about). I add some logging to this particular mocked service, run the regression test and see that it responds once, then nothing ever again.
I add logging to the “fast” version and yes, it gets hit, and keeps responding.
I check the differences between the two versions:
[spc]saltmine-2:~>diff path/to/slow/version path/to/fast/version
96c93
< nfl.SOCKETS:insert(sock,'r',function()
---
> nfl.SOCKETS:insert(sock:fd(),'r',function()
106c103
< nfl.server_eventloop()
---
> nfl.eventloop()
[spc]saltmine-2:~>
Yeah … just some API updates, like yesterday. But if I run the new regression test while running the older version of this one mock, the new regression test runs as I expect. The delay actually comes from “Project: Lumbergh” itself: the new version of this mocked service doesn't respond properly, causing “Project: Lumbergh” to time out, and that timeout is the delay I'm seeing.
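In other words, every query down that path pays the full timeout before “Project: Lumbergh” gives up on the mock. A sketch of the failure mode, assuming a UDP transport and a one-second timeout (both are my assumptions; neither detail is pinned down above):

local socket = require "socket" -- LuaSocket

local HOST,PORT = "127.0.0.1",9999 -- hypothetical address for the mock
local udp       = socket.udp()
udp:settimeout(1)                  -- assumed one-second timeout

udp:sendto("example query",HOST,PORT)

-- If the mock answered once and then went silent, every call after the
-- first blocks here for the full second before returning nil,"timeout".
local reply,err = udp:receivefrom()
if not reply then
  print("no reply: " .. tostring(err))
end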
I think I'm finally getting somewhere, but why is it so hot? And what am I doing in this basket?