Testing

Testing is almost universal in software development, yet it’s common to treat it like a chore, or an afterthought. Sure, test-driven development (TDD) is a means to influence what the public interface should look like by having the developer pretend the class already exists, and then implementing it to make the tests pass… but how often is TDD applied? And how often is it applied consistently? What’s the magic test-coverage percentage that you (or your lead) are satisfied with, and why that number, in particular? What the heck is “shifting left”, and how religiously should you stick to the test pyramid? Can automated testing happen in production?

Testing is a well-studied subject, so in this post I’m going to take a two-pronged approach to discussing these questions: I’ll use my crossword builder as a case study to explain these concepts, and I’ll sprinkle in some of the things I’ve learned to be true after a few decades in the industry. This is all, of course, just my opinion, but it’s what I’ve generally found to hold. I’ll assume that the reader is broadly familiar with testing and software development.

Testing is core to software development, and with good reason - tests validate that the underlying system does what it ought to, and defend the intended behavior as things change. Refactoring a well-tested codebase is much less daunting than one that has few or no tests, because you can trust your tests to catch issues where your refactoring was flawed. Tests can serve as documentation, influence design, catch integration issues, and so on… but tests can be a chore to write.

Let’s blast through the kinds of tests commonly found in software:

  • Unit tests are the bread and butter of testing. Unit tests validate the behavior of a single unit of code, usually a single class or package. They are the fastest to run, and easiest to write. Most or all dependencies are mocked or faked out (see the sketch after this list).
  • Integration tests are the next step up, and span multiple units. These kinds of tests are meant to test the interactions/integrations between those units, hence the name. Some dependencies may still be mocked out (e.g., RPC calls to external systems), but the goal is to get a bit closer to testing the real thing, while typically still sticking within the bounds of the binary that the units in question belong to. These tests are still generally fast to run, but may take a bit more effort to create.
  • E2E tests can take many forms, but the general notion is that now you are testing across systems / binaries, not just units. These are the most realistic, and while a few interactions can still be faked out, most are close to what happens in production. These are the slowest to run, as you may end up having to spin up a sandbox to run all these binaries in, wire them up, and so on. This type of setup can get fairly complicated, making these kinds of tests the most difficult to create.
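
To make the unit-test end of the spectrum concrete, here’s a minimal sketch, assuming a Jest-style test runner; the `Solver` and `Dictionary` names are made up for illustration, and the dictionary dependency is faked out so only the unit under test is exercised.

```typescript
import { describe, it, expect } from '@jest/globals';

// Hypothetical types, for illustration only.
interface Dictionary {
  wordsOfLength(n: number): string[];
}

class Solver {
  constructor(private readonly dict: Dictionary) {}

  // Returns candidate words matching a pattern like "c?t".
  candidates(pattern: string): string[] {
    return this.dict
      .wordsOfLength(pattern.length)
      .filter((w) => [...pattern].every((ch, i) => ch === '?' || w[i] === ch));
  }
}

describe('Solver (unit)', () => {
  it('filters candidates using the faked dictionary', () => {
    // The real dictionary is replaced by an in-memory fake.
    const fakeDict: Dictionary = {
      wordsOfLength: () => ['cat', 'cot', 'dog'],
    };
    const solver = new Solver(fakeDict);
    expect(solver.candidates('c?t')).toEqual(['cat', 'cot']);
  });
});
```

An integration test would instead wire `Solver` up to the real dictionary implementation, and an E2E test would drive the whole system from the outside.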

The idea behind TDD is that by writing the tests first, rather than the code, the developer puts themselves in the shoes of the user of their code. The developer is thus forced to think about what the interface looks like. Such tests can also become a form of documentation of the underlying code. The newly created tests should fail at first; the developer then writes the code that makes them pass. And that’s TDD in a nutshell.
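
As a toy illustration of “test first”, here’s a sketch of a test written against a `GridCodec` class that doesn’t exist yet (the class and the Jest-style setup are hypothetical); the test pins down the interface and the expected behavior before any of the implementation is written.

```typescript
import { describe, it, expect } from '@jest/globals';
// Written before GridCodec exists: this import won't even compile until
// the class is created - which is the point. The test drives the interface.
import { GridCodec } from './grid_codec';

describe('GridCodec', () => {
  it('round-trips a grid through its URL-safe encoding', () => {
    const grid = ['CAT', '.O.', 'DOG'];
    const codec = new GridCodec();
    expect(codec.decode(codec.encode(grid))).toEqual(grid);
  });
});
```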

Looking at the three categories above, the test pyramid tells us that we should have many unit tests, some integration tests, and few E2E tests. The justification is generally along the lines of the run time of these tests - unit tests are almost always lightning fast and so are run frequently; integration tests are a bit more expensive and might be run only occasionally during development; and E2E tests are often viewed as slow and painful, and so are run either before submitting changes or, less ideally, at some later point in the continuous integration pipeline.

The general idea is that the earlier a bug is caught, the better: catching a bug in staging/CI is better than catching it in production, catching it before submitting the change to the codebase is better than catching it in CI, and catching it before even sending your code for review is better still. And so, the idea behind “shifting left” is to design systems so that more and more behavior is verified in the earlier phases (unit > integration > E2E).

As with many things in software, one should view TDD as one of the tools in their toolbox, and apply it accordingly. For example, if you’re starting on a fresh codebase and aren’t sure what the interface should look like, TDD is a good approach to get the interface and tests bootstrapped, and possibly to drive the majority of development.

Personally, I find that by the time I get to coding, I have a pretty good idea of what it is I want to build, and, to be perfectly honest, I usually add tests after the fact, which is of course a big “no” for TDD. Where I do find TDD brings huge value is in fixing bugs. Suppose you find a bug in your code, and you’ve narrowed it down to a subset of the system. Now is a good opportunity to apply a form of TDD (I suppose it’s more like “test-driven debugging”):

  • Replicate the failing behavior by writing a test. The test should describe the preconditions for the bug and the expected result. The test may be a unit test, but could be an integration test, or even an E2E test. The latter can become a necessity as the bugs get more and more subtle, because you don’t know where exactly the bug may be.
  • Run the test. If you got the preconditions right, your test should reproduce the bug and fail, because the test expects the correct result, which your system under test isn’t producing (yet).
  • Fix it. Root cause the issue, fix it, run the test, and get it to pass. Oh yeah, and don’t forget to submit that fix.

Tests added to help fix bugs following the pattern above serve multiple purposes: they help confirm that the bug is fixed (and possibly help root cause by being easily repeatable during this phase), and they serve as a lasting protection against the same bug resurfacing in the future. They also serve as a form of documentation for a potentially subtle use-case you hadn’t considered before. I found this pattern particularly useful for patching up subtle issues with the underlying CSP solver that powers the crossword builder and the sudoku solver.
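
As a hypothetical sketch of the pattern (the solver API, the names, and the bug itself are made up for illustration), such a regression test might look like this:

```typescript
import { describe, it, expect } from '@jest/globals';
// Hypothetical import - a stand-in for the real solver's entry point.
import { solve } from './csp_solver';

describe('regression tests', () => {
  // Hypothetical bug: the solver reported "no solution" when the only
  // remaining slot had exactly one candidate left.
  it('fills a slot that has a single remaining candidate', () => {
    const puzzle = {
      slots: [{ id: 'a1', length: 3 }],
      candidates: { a1: ['cat'] },
    };
    // Written while the bug was live, this expectation fails; once the
    // root cause is fixed it passes, and it keeps guarding against the
    // same bug resurfacing later.
    expect(solve(puzzle)).toEqual({ a1: 'cat' });
  });
});
```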

Even if you find TDD isn’t something you use every day to design your code, you should consider it as a useful tool under other circumstances, like fixing and preventing recurrence of bugs.

Oftentimes teams will set high code-coverage targets, and these are generally a good rule of thumb to ensure that tests exist, but a single number, like 85% coverage, only tells you that tests exist and exercise at least that fraction of the lines of code. Coverage won’t tell you whether your tests are of high quality, or whether there are areas where a higher degree of coverage would be beneficial (and conversely, it can be a not-so-useful nag for trivial code, like a slew of getters and setters).
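
For what it’s worth, most coverage tools can enforce such a target mechanically. As a minimal sketch, assuming a Jest-based setup (the numbers are arbitrary, not a recommendation):

```typescript
// jest.config.ts - a minimal sketch; the thresholds are illustrative only.
import type { Config } from 'jest';

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    // The test run fails if overall coverage drops below these percentages.
    global: {
      lines: 85,
      branches: 75,
    },
  },
};

export default config;
```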

So how do you know when a certain piece of code deserves to be held to a higher bar? My general rule of thumb is a combination of the impact a flaw could have and the complexity of the logic in question. For example, if the logic is mission-critical, you may want to be extra thorough in the kinds of scenarios you cover with tests. The same approach applies to subtle logic, like a complex algorithm - even 100% coverage may not truly catch all the quirks and edge cases, and you’ll want to run many different kinds of inputs against such code.

In the case of my crossword builder, the core solver logic has near-100% code coverage and is exercised by N-queens, sudoku, and a few crossword puzzles, to ensure that refactorings or changes in general don’t introduce regressions. I even have some rudimentary load tests to see if things slow down too much as the code evolves.
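
A sketch of what that kind of suite might look like, with a hypothetical solver API and puzzle builders standing in for the real ones:

```typescript
import { describe, it, expect } from '@jest/globals';
// Hypothetical imports - stand-ins for the real solver and puzzle definitions.
import { solve } from './csp_solver';
import { nQueens, sudoku, crossword } from './test_puzzles';

const puzzles = [
  { name: '8-queens', puzzle: nQueens(8) },
  { name: 'easy sudoku', puzzle: sudoku('easy') },
  { name: 'small crossword', puzzle: crossword('3x3') },
];

describe('solver over known puzzles', () => {
  // The same assertions run over several puzzle kinds, so a refactoring of
  // the solver is checked against all of them at once.
  it.each(puzzles)('solves $name', ({ puzzle }) => {
    const solution = solve(puzzle);
    expect(solution).not.toBeNull();
    expect(puzzle.isValid(solution)).toBe(true);
  });

  // A very rudimentary "load test": fail loudly if the solver gets
  // dramatically slower, rather than tracking precise timings.
  it('solves 8-queens within a generous time budget', () => {
    const start = Date.now();
    solve(nQueens(8));
    expect(Date.now() - start).toBeLessThan(2000);
  });
});
```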

As I alluded to above, there may be cases where thorough code coverage is not worth the cost of introducing it. In my case, much of the HTTP server is boilerplate that changes rarely and is easy to fully exercise by just bringing up the website. Of course, if this were a critical website I would introduce coverage to avoid outages, but for my hobby site with low traffic that felt like overkill.

I find that the same general pattern applies to all kinds of tests: you weigh the benefit of having the test against the cost of writing it. I’d go even further and say that this is often the real reason the test pyramid is unit-test heavy and E2E-test light - not just because the former are blazing fast and the latter can be very slow to run, but because the cost/benefit trade-off doesn’t justify investing that much time into the latter. And, of course, this all depends on how critical and subtle the E2E behavior is. Sometimes you may still wish to spend a significant amount of effort to put together an E2E test, if the functionality in question is that critical to validate.

While you may want to proactively cover subtle or critical bits of code with tests to prevent equally subtle or critical regressions, there are other cases where a lack of tests can become a significant speed bump, or outright blocker, for the overall development process. In this section I’ll focus more on the maintainability and extensibility of the code that is protected by tests.

The core solver is my fourth rewrite, but this time around I really invested the time to cover the logic with tests, and doing so helped me ensure correctness as I introduced many new features, like conflict-directed backjumping, which is fairly complex and required significant rewiring. While this may not be my final rewrite, it’s certainly the easiest to maintain and extend so far, and test coverage is one of the main reasons why.

Another example where tests became necessary is the crossword builder frontend - its logic was originally in a monolithic JS file, without tests. While this approach worked fine for the sudoku solvers (both the classic and NxN), the complexity of allowing the user to build the grid, encoding/decoding all kinds of state, filling the grid, and supporting all the other bells and whistles blew up quickly. As a result, I ported the JS to TypeScript, where I was able to cleanly factor the code into classes, with proper unit and integration tests. And while the features themselves are largely simpler than the core solver, the same argument applies here - I am less worried about making larger structural changes because I know that my code coverage will catch obvious flaws. And similarly to the HTTP boilerplate of the backend, I didn’t invest time in adding tests that validate rendering or automate interactions. That would definitely add some value by removing the need for manual testing, but to me the cost wasn’t worth the benefit, at least for now.

Coming back to shifting left - I could say that the TypeScript rewrite essentially accomplished that, removing the need for most of the manual tests and shifting testing left, towards unit and integration tests. A similar argument can be made about incrementally adding tests to capture bugs (as part of fixing them) - those bugs are now covered by the unit/integration test suite, which is basically free to run, whereas previously reproducing them would have required significant manual effort.

Tests take time to write, and generally the more E2E they are, the longer they take. I’ve found the impact/complexity rule of thumb to be a solid guideline for knowing when light coverage is good enough, and when you want to go the extra mile and save yourself the debugging headache (or the inevitable production outage) down the line. In the same vein, you can use tests as a means to make your development easier, by shifting left (and thus making your future testing more efficient), covering regressions as you fix them, and so on.