Thursday, August 19, 2021
The case of the regression test regression
So the SVN-to-git conversion is back on the table. When I worked on this last year, I stopped when it became clear that a critical build server didn't support git. I duly reported the issue, and it's only been fixed in the past few months. Last year, it was delegated solely to me. This year, it's more of a team effort.
For our team, we currently have two main SVN repositories with some duplicate third-party packages checked in. We went through the lists to remove the redundancies (a Good Thing™) and I felt that it might be a good time to ensure they're up to date. I spent a week doing so and ran into an odd issue—the old regression test now runs slower! Not thirteen (or nineteen) hours slow but slow enough to be seriously annoying. I was able to isolate the slowdown to some Lua modules (that I wrote cough cough—I didn't think the changes were that bad).
We do development primarily on Macs. My first thought was to either confirm or deny that the Mac was the issue. I ran tests on a Linux laptop I have for work and nope—the slowdown exists on that. I then tried to profile the code (to no conclusive result). I compared the code between the two versions (only five lines changed in the regression test, and all of those were small API changes like fsys.redirect(stdin,fsys.STDIN) becoming fsys.redirect(stdin,io.stdin)). I just could not figure out why the new code was slower.
So then I decided to just time things. I added code to time how long it took to run each test (and to limit the number of tests run—enough to show the issue, but not the entire 15,852 tests) and plot the results.
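The timing code amounts to something along these lines (a minimal sketch, not the actual harness; run_test() here is a hypothetical stand-in, and socket.gettime() from LuaSocket is assumed for sub-millisecond resolution):

local socket = require "socket" -- assumed, for socket.gettime()

local function run_test(n)
  -- stand-in for driving a single test case
end

local out = io.open("timings.dat","w")
for i = 1,100 do -- a subset, enough to show the issue
  local start   = socket.gettime()
  run_test(i)
  local elapsed = (socket.gettime() - start) * 1e6 -- microseconds
  out:write(string.format("%d %f\n",i,elapsed))
end
out:close()

Feed the resulting file to gnuplot (or whatever plotting tool you like) and the outliers jump right out.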
The X-axis is the test case number, the Y-axis is time in microseconds (plotted on a logarithmic scale; otherwise you wouldn't be able to make out any detail along the lower portion). The new regression test is in green, the old one in red. It's clear that there are a few dozen tests that take a full second to run in the new program, whereas the average (of both, aside from the outliers) seems to be around 5ms. So, good: the problem clearly shows up.
Then I timed each section of running a test to see which section is sucking up the time, and eventually, I get this graph:
[The sharper eyed of you might have noticed the X-axis change—each test is run twice (for reasons). This means the second graph has twice as many entries. Yes, that means the full 15,852 tests are each run twice, for 31,704 requests. —Sean]
This represents the timing of this bit of code:
-- --------------------------------
-- Receive and process NOTIFY
-- --------------------------------

local remote,data = sipsock:recv(10)
if not data then
  return false,"no NOTIFY"
end
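The per-section timing mentioned above amounts to wrapping steps like that receive in the same clock calls. A rough sketch of the idea, again assuming socket.gettime() from LuaSocket and reusing sipsock from the code above:

local socket = require "socket" -- assumed, for socket.gettime()

local start       = socket.gettime()
local remote,data = sipsock:recv(10)
local elapsed     = (socket.gettime() - start) * 1e6 -- microseconds

io.stderr:write(string.format("NOTIFY recv %f\n",elapsed))
if not data then
  return false,"no NOTIFY"
end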
The only program that's different is the regression test, not the programs being tested. And it's not like I updated the version of Lua either—that didn't change. The third-party code I'm using? Yes, to some degree. But the underlying code to sipsock:recv() did not change. Yet it's this bit of code that is consistently slow for a consistent set of tests. To say I'm perplexed is a bit of an understatement.