How much flakiness do you tolerate in end to end tests?

@kersplort · edit-2 10 months ago

How much flakiness do you tolerate in end to end tests?

@[email protected] · 10 months ago

I think the reality is that there are lots of different levels of tests, we just don’t have names for all of them.

Even unit tests have levels. You have unit tests for a single function or method in isolation, then you have unit tests for a whole class that might set up quite a bit more mocks and test the class’s contract with the rest of the system.

Then there are tests for a whole module, that might test multiple classes working together, while mocking out the rest of the system.

A step up from that might be unit tests that use fakes instead of mocks. You might have a fake in-memory database, for example. That enables you to test a class or module at a higher level and ensure it can solve more complex problems and leave the database in the state you expect it in the end.

A step up from that might be integration tests between modules, but all things you control.

Up from that might be integration tests or end-to-end tests that include third-party components like databases, libraries, etc. or tests that bring up a real GUI on the desktop - but where you still try to eliminate variables that are out of your control like sending requests to the external network, testing top-level window focus, etc.

Then at the opposite extreme you have end-to-end tests that really do interact with components you don’t have 100% control over. That might mean calling a third-party API, so the test fails if the third-party has downtime. It might mean opening a GUI on the desktop and automating it with the mouse, which might fail if the desktop OS pops up a dialog over top of your app. Those last types of tests can still be very important and useful, but they’re never going to be 100% reliable.

I think the solution is to have a smaller number of those tests with external dependencies, don’t block the build on them, and look at statistics. Sound an alarm when a test fails multiple times in a row, but not for every failure.

Most of the other types of tests can be written in a way to drive flakiness down to almost zero. It’s not easy, but it can be doable. It requires a heavy investment in test infrastructure.