You’ll hate me for this. I’m in the TDD camp. I’m very passionate about TDD because it has produced a spike in productivity no other single tool could offer. It is a universal hammer. Hey, fitting for our logo!
When I code— which I have to admit is less often than I’d like— I don’t always use TDD. Some simple things I consciously don’t cover with tests. But when I aim to iteratively grow and evolve a piece of software, I start with TDD. Even on katas or offline work.
I’ve highlighted this opinion in the article below which you may know very well. It’s the most popular newsletter issue to date:
Treat your Testing Code Seriously
The feedback I received on the above suggestion is generally "What difference does it make?". And that’s the nice stuff.
Since I expected flaming and trolls, I went all-in and re-posted on reddit. Oh boy! I’d show you but r/programming is on the protest blackout at the moment. To summarise, there were a lot of heated disagreements about this kind of advice as I’m sure I’ll receive again for writing this issue.
The responses highlight a key symptom, one I see a lot when coaching teams: they don’t take their test code seriously.
I am expanding my company-focused engineering leadership coaching to the MentorCruise platform. I am a Top Mentor on MentorCruise, which puts me in the top 2% and lets me rub shoulders with engineering leaders from the largest tech corporations whose names appear on the index funds you own.
What do I mean by seriously?
Let’s see how you may reason about a payment processor:
Does it process payments?
Does it emit logs so you can debug it?
Does it fail without disrupting operations?
Can you recover from failure?
Are the data mutations correct and reliable?
Now let’s look at its test suite, asking the equivalent questions, which we’ll revisit one by one below:
Does it cover payment processing behavior?
Are the failure messages clear so we can debug easily?
Does the test fail without breaking the suite?
Does the test fail when it should?
Are data side effects intentional or coincidental?
If I gave each line above 2 points as a score, then most test suites I see in the wild for real production services get a score of 1/10. That single point is for covering some part of the behavior.
To get 10/10 you need to make sure your tests are working correctly. Not that the behavior works— rather that the test complements the behavior with added quality.
Writing the test before the implementation isn't important in itself.
The quality the test provides comes from seeing it in a failing state. That's what builds the quality.
Are data side effects intentional or coincidental?
If you’ve inherited a badly tested codebase, you’ve seen this:
Large database fixtures
So much detail in the data setup you can’t tell behavior from implementation
The smallest change in the data or schema breaks an entire range of test suites
Everything slowly becomes an integration test with I/O, pulling in non-optional mechanical parts like your database and cache layers
The test suite requires an entire Kubernetes cluster to run… and breaks at the smallest schema change somewhere upstream
Point 1: The local test code only contains data inputs that affect the wanted behavior. Other derivative inputs are hidden.
To make data setup and output intentional, apply encapsulation to the testing suite. Hide dependencies and proxies inside factories or helpers so that the test communicates only what is relevant to the behavior being tested.
Often this can be achieved by organising the test suite into a tree structure and pushing certain composites of setup dependencies deeper into it. But be wary— make sure the testing tree does not clone the abstraction and inheritance tree of your implementation. That backfires quickly into the pain-point list above.
Point 2: The output is tested against a behavior-facing interface. If your database is not public, then don’t test output against it.
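To make both points concrete, here is a minimal sketch in a TypeScript, Jest-style suite. The Order shape, the buildOrder factory and processPayment are hypothetical stand-ins for your own domain, not a prescribed API:

```typescript
import { processPayment } from "./payments"; // hypothetical module under test

type Order = {
  id: string;
  customerId: string;
  amountCents: number;
  currency: string;
  card: string;
};

// The factory hides the fields a real order needs but the behavior doesn't
// care about, so the test spells out only the inputs that drive the outcome.
function buildOrder(overrides: Partial<Order> = {}): Order {
  return {
    id: "order-1",
    customerId: "customer-1",
    amountCents: 1000,
    currency: "EUR",
    card: "valid",
    ...overrides,
  };
}

it("charges the order amount in the order currency", async () => {
  const order = buildOrder({ amountCents: 2500, currency: "USD" });

  const receipt = await processPayment(order);

  // Assert against the behavior-facing output (the receipt),
  // not against rows written to a private database table.
  expect(receipt).toMatchObject({ chargedCents: 2500, currency: "USD" });
});
```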
Does the test fail when it should?
When you write the test after the implementation, you also need to undo the implementation to see the test fail -- this is what mutation testing reveals. Then you have to redo it once you have confirmed your assumption was correct. This is a ton of rework.
Point 1: The test should communicate what is wrong when the implementation fails to achieve the expected behavior.
And most notably, the test should focus on testing behavior primarily, not implementation. This is difficult for many as it requires you to think about your code outside-in, from a product perspective or from a user’s perspective.
This can be a challenge when the design is still young and your team cannot agree on what outside is.
Hint: Looking for I/O is a good start.
Point 2: The test should not fail when behavior is achieved but something else got borked (implementation details).
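Here is a sketch of the contrast, reusing the hypothetical buildOrder factory and processPayment from the earlier example. The gateway client and the receipt fields are again assumptions for illustration, not your API:

```typescript
import { processPayment } from "./payments"; // hypothetical module under test

declare const gatewayClient: { charge: (order: unknown) => Promise<unknown> }; // hypothetical injected dependency

// Implementation-coupled (anti-pattern): this fails when the gateway client
// is swapped or refactored, even though payments still behave correctly.
it("calls the gateway's charge() exactly once", async () => {
  const chargeSpy = jest.spyOn(gatewayClient, "charge");

  await processPayment(buildOrder());

  expect(chargeSpy).toHaveBeenCalledTimes(1);
});

// Behavior-facing: this fails only when the declined-card behavior breaks,
// and the failure names the behavior that broke.
it("does not charge the customer when the card is declined", async () => {
  const receipt = await processPayment(buildOrder({ card: "declined" }));

  expect(receipt.status).toBe("rejected");
  expect(receipt.chargedCents).toBe(0);
});
```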
Does the test fail without breaking the suite?
Tests can also cause what one would consider integration problems. This is most noticeable when many languages are involved, for example a backend and a frontend test suite that a full-stack engineer wants to leverage together.
Point 1: Write tests first to see how many of your APIs and UIs need to be inside the suite. Failures should communicate in the language used by that part.
Rich-API-plus-dummy-UI services often suffer from this: a failing test in the API causes downstream problems in the UI. It is therefore important not to rely on local interfaces or DTO wrappers for inputs and outputs, so that boundaries are testable in the form the behavior-facing UI sees them.
Most of my audience works on the Web or Mobile. HTTP and cloud-based messaging are strong abstraction layers, as are JSON and binary/ASCII data table dumps. Focus on the schema, not the language constructs.
Point 2: Write output testing using plain structures, rather than implementation-specific language wrappers or DTOs. Failure to synchronise these will cause tests to break.
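For example, a frontend-facing check can assert against the raw HTTP boundary instead of a shared DTO class. The endpoint and field names below are hypothetical; the point is that the assertion is plain JSON, the same form the UI receives:

```typescript
// Schema-focused output testing: fetch the boundary the UI actually sees
// and assert on a plain structure, not on an instance of a backend DTO.
it("returns the receipt the UI will render", async () => {
  const response = await fetch("https://api.example.test/payments/order-1");
  const body = await response.json();

  expect(response.status).toBe(200);
  expect(body).toMatchObject({
    orderId: "order-1",
    status: "settled",
    chargedCents: 2500,
    currency: "USD",
  });
});
```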
Are the failure messages clear so we can debug easily?
When a test fails you naturally ask two valid questions:
Is the test suite broken?
Is the implementation broken?
Point 1: The test should tell you what behavior broke in domain terms. Not just “failed to assert true is false.”
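A small illustration with the hypothetical receipt from the earlier sketches. Both assertions check the same outcome, but only the second one fails in domain terms:

```typescript
const receipt = await processPayment(buildOrder({ amountCents: 2500, currency: "USD" }));

// Low-signal failure: the report only says "expected true, received false".
expect(receipt.settled).toBe(true);

// Domain-level failure: the diff shows which part of the receipt is wrong,
// in the terms the business uses (status, amount, currency).
expect(receipt).toMatchObject({
  status: "settled",
  chargedCents: 2500,
  currency: "USD",
});
```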
The other question, whether the test suite itself is broken, is a bit more subtle. It has to do with the overall quality and health of the test suite. Sometimes tests fail because the orchestrated setup is not deterministic, or due to coupling to system infrastructure that is outside the suite.
This may be the case when onboarding someone new to the team and you haven’t witnessed the suite being used on an Arm laptop or containerised server without CPU core virtualisation.
Point 2: The test suite should provide a reasonable best effort to determine the health of the suite.
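On the determinism side, one of the most common culprits is real time leaking into the setup. A small sketch of pinning the clock, using Jest-style fake timers (Vitest has equivalents):

```typescript
beforeAll(() => {
  // Pin the clock so the suite behaves the same on a laptop, in CI,
  // or inside a throttled container.
  jest.useFakeTimers();
  jest.setSystemTime(new Date("2024-01-01T00:00:00Z"));
});

afterAll(() => {
  jest.useRealTimers();
});
```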
Does it cover behavior?
And of course, the icing on the cake.
We’ve seen many iterations of this. London vs. Chicago. Inside-out, outside-in. SUT vs. Integration. I/O vs Pure.
The terminology started to break down and become nuanced over the decades. This is a good thing. However, it doesn’t make writing a good test any easier. Especially if you or your team are just starting out.
It’s especially difficult when your only reference point in your software career is badly written tests.
I decided to make this point the last one because it will require further expansion. I don’t think it’s possible to cover all nuances so you’ll have to forgive me for oversimplifying.
Point 1: The test should only check and communicate the root cause problem of the behavior being tested.
Here’s a short video on the 5 Why’s.
The level of introspection is important. Levels closer to the top are about implementation details and UI noise. The root cause at the bottom of the list contains the core behavior.
We want to be able to explain what the test is covering using the deepest level of understanding. This can be extremely difficult.
In a way— this is what the BDD movement set out to do initially. Sadly, I often see BDD test suites use natural language to explain UI elements and implementation details. This is a symptom of not going to the root cause of the behavior.
Point 2: The test name should state *why* the test was added; the body should describe *how* the behavior is defined, using business terminology.
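A sketch of what that can look like. The refund policy, cancelOrder and the field names are hypothetical; the point is that the name carries the why and the body carries the how, in business terms:

```typescript
import { cancelOrder } from "./orders"; // hypothetical module under test

describe("Refunds", () => {
  it("refunds in full when the order is cancelled before dispatch, because the goods never left the warehouse", async () => {
    const order = { id: "order-1", amountCents: 2500, dispatched: false };

    const refund = await cancelOrder(order);

    expect(refund).toMatchObject({
      amountCents: 2500,
      reason: "cancelled-before-dispatch",
    });
  });
});
```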
Naming things is hard. If you’ve been doing DDD or Event Modeling for a long time this won’t be a surprise to you. Having business understanding in an engineer’s mind is the highest predictor of productivity. It’s how the 10x dev teams are shaped.
Get as much business language, jargon and terminology into the test as you can. Test-first approaches will drive your design, including how you name things. This is where it all starts and a lot of care should be given to naming the test suite, the subject under test and the behaviors.
You don’t have to weasel this out of the compiler. Ask your business.