Wednesday, June 09, 2021
You know, walking into Mordor might be simple after all
Yesterday, I said the full regression test might take over 13 hours. Going by the results of running just a partial test, the full regression test will take over 19 hours! The joke's on me though: when I said it would "be fun reporting at the next meeting" I wasn't expecting my new manager to double down on the regression test. Seriously, he asked “Can you run it in parallel?”
…
Words fail.
“The job of a programmer is to produce test cases.”
So last night I ran a subset of the regression test in 4½ hours and got a few errors where something that shouldn't happen, happened (and it's this “checking that an event didn't happen” that takes the time). Well, it wasn't a bug in the code being tested, but a bug in the regression test (Surprise! Surprise! Surprise! Only not really). I think that says more about our business logic than it does about CZ or me; both of us attempted to validate this part of the business logic in the regression test, and we both got it wrong.
And about parallelizing the regression test—yes, it's possible. But doing so on the spot isn't. The easy solution is to run the regression test on multiple machines—nice if you have them. The other option is to parallelize the run on a single machine and the code just isn't set up to do that. I'm not saying it's impossible, but it will take engineering effort, and more importantly, testing! Funny how testing your test cases isn't talked about that much.
The slowdown of the regression test is due to “proving a negative” (that is, checking that something that's not supposed to happen did not happen). And in a distributed system like ours, that's not easy to test. A check could run before the event arrives for any number of reasons, and how long do you wait to ensure that what shouldn't happen didn't happen?
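To make the waiting problem concrete, here's a minimal sketch in Python (not our actual test harness). The function name `no_event_within` and the use of a `queue.Queue` as the place where events show up are assumptions for illustration only. The point is that a passing negative check always costs the full wait, while a failing one returns as soon as the unwanted event arrives.

```python
import queue
import time

def no_event_within(events: queue.Queue, wait_seconds: float) -> bool:
    """Return True if nothing arrives on `events` within `wait_seconds`.

    Hypothetical helper: `events` stands in for wherever the system
    under test would report the thing that should not happen.
    """
    try:
        events.get(timeout=wait_seconds)
    except queue.Empty:
        return True   # waited the whole window and saw nothing
    return False      # something showed up, so the negative check fails

if __name__ == "__main__":
    q = queue.Queue()
    start = time.monotonic()
    ok = no_event_within(q, 0.5)
    elapsed = time.monotonic() - start
    # A passing check can never finish sooner than the window it waits for.
    print(f"passed={ok} after {elapsed:.2f}s")
```

Pick the window too short and a late event slips past the check; pick it too long and thousands of passing checks each pay for it in full.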
The other reason it will take so long to run is just the sheer number of tests that are run. My “retiring any day now” manager has never been happy with the “shotgun” approach I took to generating the tests: I basically generate thousands of combinations of conditions, most of which “should” never appear in production. But one of those “should never happen” things did happen about seven years ago and well, the less said about that the better. So at least my “shotgun” approach does have the effect of testing for a lot of “I don't know” conditions (most of which are misconfigurations of data from provisioning). And each test we add could potentially double the number of test cases. I'm sure there's a way to reduce the number of test cases, but to the TDD acolytes out there (and the new management team does appear to follow TDD tenets), “one does not simply reduce the number of test cases.”
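The doubling is easy to see in a small sketch. This is Python for illustration, not the real generator, and the condition names are made up; the only assumption is that each condition is an on/off setting and every combination gets its own test case.

```python
from itertools import product

def generate_cases(conditions):
    """Yield one test case (a dict) per combination of condition values."""
    names = list(conditions)
    for values in product(*(conditions[name] for name in names)):
        yield dict(zip(names, values))

# Hypothetical on/off conditions, standing in for provisioning data.
conditions = {
    "feature_a_provisioned": [False, True],
    "feature_b_provisioned": [False, True],
    "record_missing":        [False, True],
}

print(len(list(generate_cases(conditions))))   # 8 cases from 3 conditions

# Adding one more on/off condition doubles the count.
conditions["record_malformed"] = [False, True]
print(len(list(generate_cases(conditions))))   # 16 cases from 4 conditions
```

Most of those combinations are the “should never happen” kind, which is exactly why they're in there.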
Sigh.
And the regression test rolls on …