Thursday, July 29, 2021
“Would love to hear about your prior development method. Did adopting the new practices have any upsides?”
[The following is a comment I made on Lobsters when asked about our development methods. I think it's good enough to save, and what better place to save it than this here blog. So here it is.]
First off, our stuff is a collection of components that work together. There are two front-end pieces (one for SS7 traffic, one for SIP traffic) that talk to the back-end (which implements the business logic). The back-end makes parallel DNS queries [1] to get the required information, mucks with the data according to the business logic, then returns data to the front-ends, which ultimately return the information to the Oligarchic Cell Phone Companies. Since this process happens as a call is being placed, we are on the Oligarchic Cell Phone Companies' network, and we have some pretty short time constraints. Because of this, not only do we have some pretty severe SLAs, but any updates have to be approved 10 business days before deployment by said Oligarchic Cell Phone Companies. As a result, we might get four deployments per year [2].
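As a rough illustration of what "parallel queries under a short time constraint" can look like, here is a minimal Python sketch. It is not the production code: `fake_lookup`, `query_all`, the 100 ms budget, and the simulated delays are all hypothetical stand-ins, and no real DNS traffic is sent.

```python
import asyncio

# Hypothetical stand-in for one DNS lookup. A real back-end would send
# an actual query; this just sleeps to simulate network latency.
async def fake_lookup(kind, number, delay):
    await asyncio.sleep(delay)
    return f"{kind}:{number}"

async def query_all(number, delays, budget=0.1):
    # Start every lookup at once instead of one after the other.
    tasks = {
        kind: asyncio.create_task(fake_lookup(kind, number, delay))
        for kind, delay in delays.items()
    }
    # Wait for at most `budget` seconds, then give up on the stragglers.
    done, pending = await asyncio.wait(tasks.values(), timeout=budget)
    for task in pending:
        task.cancel()
    # Lookups that missed the deadline are reported as None.
    return {
        kind: (task.result() if task in done else None)
        for kind, task in tasks.items()
    }

def run_lookups(number, delays):
    return asyncio.run(query_all(number, delays))
```

Calling `run_lookups("5551234567", {"name": 0.0, "reputation": 1.0})` gives back the name result and `None` for the reputation lookup, since the slow query misses the budget. The point is only that the total wait is bounded by the budget, not by the sum of the individual queries.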
And the components are written in a combination of C89, C++98 [3], C99, and Lua [4].
So, now that you have some background, our development process. We do trunk based development (all work done on one branch, for the most part). We do NOT have continuous deployment (as noted above). When working, we developers (which never numbered more than three) would do local testing, either with the regression test, or another tool that allows us to target a particular data configuration (based off the regression test, which starts eight programs, five of which are just needed for the components being tested). Why not test just the business logic? Said logic is spread throughout the back-end process, intermixed with all the I/O it does (it needs data from multiple sources, queried at the same time).
Anyway, code is written, committed (main line), tested, fixed, committed (main line), repeat, until we feel it's good. And the “tested” part not only includes us developers, but also QA at the same time. Once it's deemed working (using both regression testing and manual testing), we then officially pass it over to QA, who walks it down the line from the QA servers to the staging servers and finally (once we get permission from the Oligarchic Cell Phone Companies) into production, where not only devops is involved, but also QA and the developer whose code is being installed (at 2:00 am Eastern, Tuesday, Wednesday or Thursday, never Monday or Friday).
Due to the nature of what we are dealing with, testing at all is damn near impossible (or rather, hideously expensive, because getting actual cell phone traffic through the lab environment involves, well, being a phone company (which we aren't), very expensive and hard to get equipment, and a very expensive and hard to get laboratory setup (that will meet FCC regulations, blah blah yada yada)) so we do the best we can. We can inject messages as if they were coming from cell phones, but it's still not a real cell phone, so there is testing done during deployment into production.
It's been a 10 year process, and it has gotten better until this past December.
Now it's all Agile, scrum, stories, milestones, sprints, and unit testing über alles! As I told my new manager, why bother with a two week sprint when the Oligarchic Cell Phone Companies have a two year sprint? It's not like we ever did continuous deployment. Could more testing be done automatically? I'm sure, but there are aspects that are very difficult to test automatically [5]. Also, more branch development. I wouldn't mind this so much, except we're using SVN (for reasons that are mostly historical at this point) and branching is … um … not as easy as in git. [6] And the new developer sent me diffs to ensure his work passes the tests. When I asked him why he didn't check the new code in, he said he was told by the new manager not to, as it could “break the build.” But we've broken the build before this; all we do is fix the code and check it in [8]. But no, no “breaking the build,” even though we don't do continuous integration, nor continuous deployment, and what deployment process we do have locks the build number from Jenkins of what does get pushed (or considered “gold”).
Is there any upside to the new regime? Well, I have rewritten the regression test (for the third time now) to include such features as “delay this response” and “did we not send a notification to this process.” I should note that this is code for us, not for our customer, which, need I remind people, is the Oligarchic Cell Phone Companies. If anyone is interested, I have spent June and July blogging about this (among other things).
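The “did we not send a notification to this process” check is awkward because a test can only watch for a while and report that nothing showed up. Here is a minimal Python sketch of that idea, assuming a UDP listener standing in for the component that should stay quiet. The function names and the window lengths are hypothetical, and this is not the actual regression test.

```python
import socket

def open_listener():
    # Bind to an ephemeral port on localhost; this plays the role of
    # the component that may (or may not) get a notification.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 0))
    return sock

def received_within(sock, window):
    # Return the first datagram seen within `window` seconds,
    # or None if the socket stayed quiet the whole time.
    sock.settimeout(window)
    try:
        data, _ = sock.recvfrom(2048)
        return data
    except socket.timeout:
        return None

def notify(addr, payload):
    # Fire and forget: send one datagram and do not wait for a reply.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, addr)
```

A test would call `received_within(listener, window)` after driving the scenario and assert that the result is `None`. That only shows nothing arrived during the window, so the window has to be longer than the slowest legitimate notification, which makes every such check slow the test run by that much.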
- Looking up NAPTR records to convert phone numbers to names, and another set to return the “reputation” of the phone number.
- It took us five years to get one SIP header changed slightly by the Oligarchic Cell Phone Companies to add a bit more context to the call. Five years. Continuous deployment? What's that?
- The original development happened in 2010, and the only developer at the time a) was very conservative and b) didn't believe in unit tests. The code is not written in a way that makes it easy to unit test, at least as I understand unit testing.
- A prototype I wrote to get my head around parsing SIP messages that got deployed to production without my knowing it by a previous manager who was convinced the company would go out of business if it wasn't. This was six years ago. We're still in business, and I don't think we're going out of business any time soon.
- As I mentioned, we have multiple outstanding requests to various data sources, and other components that are notified on a “fire and forget” mechanism (UDP, but it's all on the same segment) that the new regime wants to ensure get notified correctly. Think about that for a second: how do you prove a negative? That is, how do you show that something that wasn't supposed to happen (like a component not getting notified) didn't happen?
- I think we're the only department left using SVN; the rest of the company has switched to git. Why are we still on SVN? 1) Because the Solaris [7] build servers aren't configured to pull from git yet and 2) the only redeeming feature of SVN is the ability to checkout a subdirectory, which given the layout of our repository, and how devops want the build servers configured, is used extensively. I did look into using git submodules, but man, what a mess. It totally doesn't work for us.
- Oh, did I neglect to mention we're still using Solaris because of SLAs? Because we are.
- Usually, it's Jenkins that breaks the build, not the code we checked in. Sometimes, the Jenkins checkout fails. Devops has to fix the build server [7] and try the call again.