Engineering Fitness

A machine learning solution for detecting and mitigating flaky tests

If you are a developer, you’ve probably experienced flaky tests, or “flakes”, and the first thing that comes to your mind is either frustration or annoyance. In this post we are going to discuss how we deal with flaky tests here at Fitbit and what our plans are for solving this problem.

Before we go into any details, let’s level-set on what flaky tests are and why they are bad, and go through some real-life examples of flaky tests to set up the context. Let’s dig in!

What is a Flaky Test? 

A test which passes or fails in a nondeterministic way is referred to as flaky. Such behavior is harmful to developers because test failures do not always indicate bugs in the code. Broadly, there are two main types of flaky tests.

The first type is flaky due to external conditions, such as network issues, machine crashes, power outages, etc. Say a test ran on a Jenkins executor that crashed with a kernel panic, forcing you to rebuild; on the second run everything was fine and the test was green. So although the test passed and failed on the same Git revision, it’s not really the test’s fault, and there is not much we can do about a crashed executor in this context. Network issues in integration tests likewise cause a lot of flakiness that we cannot control.

The second type of flakiness is due to defects in the test case’s code or in the CUT (code under test), such as asynchronous waits, concurrency issues like race conditions or priority inversion, or incorrect assumptions about timezones or database ordering. This type of flakiness is not trivial to cope with, because we are dealing with subtle “bugs” in the test cases themselves. In general, flaky tests caused by test programming errors are much harder to catch or identify than bugs in production code. We’ll see why in a second.

Show me a real life example

According to an analysis published by the University of Illinois in this paper, almost 45% of flakes are due to asynchronous waits. This number turns out to hold for our monorepo as well: most of the flakes we see are due to this kind of condition.

Usually it goes like this: before the test runs, an embedded Cassandra or HTTP server is created. In the test we make an async call to the server, then wait in a Thread.sleep() call for a while, hoping the server has had enough time to process our request. After that we make an assertion of some kind. This generally works, except when communication lag or other network issues make the Thread.sleep(period) too short. Then what? If there is an assert after that sleep call, it will fail. Let’s look at a real-life flaky test to get a feel for what we are up against.

   final KafkaProducer<String> kafkaProducer = … // Initialize and configure a Kafka producer
   final KafkaConsumerNew<String> kafkaConsumer = … // Initialize and configure a Kafka consumer
   // Anonymous classes can only capture (effectively) final locals, so use an AtomicInteger
   final AtomicInteger batchSize = new AtomicInteger(0);
   kafkaConsumer.setMessageProcessor(new KafkaMessageBatchProcessor(...) {
       // This callback will process messages from the specific topic
       protected void process(Collection messages, KafkaOffsetsCommitter kafkaOffsetsCommitter) {
           batchSize.set(messages.size());
       }
   });

   for (int i = 0; i < NUM_OF_MESSAGES; i++) {
       kafkaProducer.postMessage("test");           // <----- CAUSE OF FLAKINESS
   }

   assertEquals(NUM_OF_MESSAGES, batchSize.get());

In this test we create a Kafka producer that posts 5 messages to a topic. We also create a consumer that should consume those messages. The catch here is that the postMessage() method is asynchronous and the Kafka consumer waits 10 seconds for the messages. In most cases this code and the final assert will work. But what if something goes wrong and 10 seconds are not enough? The assert will fail because the number of messages produced and the number consumed (batchSize) are not equal. This is a clear example of a test that can be flaky because of asynchronous calls to external services.
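A common remedy for this class of flake is to poll for the expected state with a deadline instead of sleeping for a fixed period. A minimal sketch of the pattern, in Python rather than the Java of the example above:

```python
import time

def await_condition(predicate, timeout_s=10.0, poll_interval_s=0.05):
    """Poll `predicate` until it returns True or the deadline passes.

    Unlike a fixed sleep, this returns as soon as the condition holds,
    and only fails after the full timeout has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_interval_s)
    return predicate()  # one last check at the deadline

# Example: a "consumer" whose messages trickle in over ~0.2 seconds.
received = []
start = time.monotonic()

def consumer_done():
    if time.monotonic() - start > 0.2:
        received.extend(["m1", "m2", "m3"])
    return len(received) >= 3

assert await_condition(consumer_done, timeout_s=5.0)
```

In Java, libraries such as Awaitility provide the same pattern: the test tolerates slow runs up to the timeout but finishes fast on a normal run, instead of betting on one hard-coded sleep being long enough.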

What other types of flakes do we see? 

Incorrect assumptions about database table ordering

Let’s suppose you have a MySQL integration test where you create mock data:

CREATE TABLE table_test (field INT);
INSERT INTO table_test VALUES (1);
INSERT INTO table_test VALUES (2);

Now, simple JDBC code like this:

String query = "SELECT field FROM table_test LIMIT 1";
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery(query);
if (rs.next()) {
    int field = rs.getInt("field");
    assertEquals(field, 1);   // ← POSSIBLE CAUSE OF FLAKINESS
}

can make your test flaky. You have incorrectly assumed that selecting the first row is guaranteed to yield 1.

Without an ORDER BY clause, the SQL above can return either 1 or 2, depending on a bunch of factors that are internal to the database and not the code under test.
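The deterministic fix is to state the ordering explicitly. A quick demonstration using Python’s built-in sqlite3 module (standing in here for MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_test (field INT)")
conn.executemany("INSERT INTO table_test VALUES (?)", [(1,), (2,)])

# Without ORDER BY, the row returned by LIMIT 1 is not guaranteed;
# adding ORDER BY pins down what "first row" means.
row = conn.execute(
    "SELECT field FROM table_test ORDER BY field LIMIT 1"
).fetchone()
assert row[0] == 1
```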

Resource leaks

Another frustrating, flake-causing issue is resource leaks. They can happen when an integration test suite starts an embedded HTTP server, runs a bunch of tests, but leaves the server listening on port 8080. If another test suite tries to use port 8080 on the same machine, it will fail because the port is already allocated.
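One common mitigation (a general pattern, not necessarily what we do at Fitbit) is to never hard-code the port: ask the OS for a free ephemeral port by binding to port 0, then hand that port to the embedded server. A minimal Python sketch:

```python
import socket

# Bind to port 0 and let the OS pick any free ephemeral port.
# Two test suites on the same machine no longer collide on 8080.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("127.0.0.1", 0))
    host, port = s.getsockname()

assert port > 0  # a real, OS-assigned port number
```

There is a small race between releasing the probe socket and the server binding the port, so ideally the server itself accepts port 0 and reports back what it got.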

Of course, the list could go on and on. For a more comprehensive list of types of flaky tests, check out Martin Fowler’s excellent article.

As you can see, finding flaky tests in a huge repository with more than 100 projects is not an easy task. And the most damaging problem with flaky tests is not that you have to rebuild when they fail, although the wasted time is certainly an issue. More damaging is the lost confidence in tests that fail.

Our current infrastructure

We have a large monorepo with more than 100 projects and a total of 28,572 test cases, with more than a dozen teams managing all these projects. Given this setting, just eyeballing the whole codebase and trying to catch flakes is certainly not a smart strategy. It was obvious that we needed a tool to track all of these flakes, sort them by team, and make the teams aware of those tests in the hope that they will get fixed.

One obvious way to track these is to record all the builds and all the test executions in a large database. We record everything from the git revision, to the duration of each test, to the build id on which that test ran, and so on. We then developed a service that looks into this database and tries to detect which test was flaky and on which build. We’ve also built a dashboard that displays all of that:

And in order to drive these flakes out in the open, so developers are aware of them and start fixing them, we send a weekly update with whatever flakes we’ve discovered.

We quickly realized that the simple rule of classifying a test as flaky based on whether it both fails and passes on the same revision is not enough; we need a deeper analysis of a test’s execution history. The reason is that we’ve seen both false positives (tests wrongly classified as flakes) and false negatives (flakes not caught by the flaky-test detector). So we developed a machine learning proof of concept that can track flaky tests better, “look” at them before execution in Jenkins, and tell whether their failures are flaky or real. If they are flaky we can choose not to run them, so they won’t block anyone from merging their code changes.

A machine learning based solution

There has been some research on how to distinguish between flaky tests and actual test failures; by far the most comprehensive work was done at Google. After some research of our own, we developed some intuitions about flaky tests and how their execution history looks compared to that of failing or healthy tests. We know that a test is flaky if it passes and fails on the same git revision, but we also looked for tests that pass and fail across multiple git revisions over a certain period of time. Our flaky-test detector looks for tests that pass and fail on the same revision. That is fine, since it follows the actual definition of a flaky test, and we do get pretty good results. But it turns out not to be enough, because tests that are truly flaky have certain properties attached to them.

Below are the properties we have found common across flaky tests.

Flaky tests pass and fail on successive git revisions over a long period of time

This property catches flakiness better than just looking at individual revisions (as we do now).

In order to capture a test passing and failing across several successive revisions, we’ve come up with the concept of an edge. An edge for a given test case is a transition from pass to fail, or fail to pass, between two different git revisions. Say you push a code change and test_case() passes; someone else then pushes another change that triggers the same test to run, and this time it fails. That is an edge: test_case() passed on one revision and failed on the next.
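The edge idea is easy to make concrete. A small sketch that counts edges in a chronological pass/fail history:

```python
def count_edges(history):
    """Count pass/fail transitions ("edges") in a test's execution history.

    `history` is a chronological list of booleans, one entry per git
    revision the test ran on: True = pass, False = fail.
    """
    return sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)

# Test case 1: one brief failure, then stable green -> two edges.
stable = [True, False, True, True, True, True]
# Test case 2: constantly flipping -> many edges, a flakiness signal.
unstable = [True, False, True, False, False, True, False, True]

assert count_edges(stable) == 2
assert count_edges(unstable) == 6
```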

Looking at the image above, you can see the execution history of two hypothetical test cases over a period of 12 hours, during which several Jenkins builds ran with different revisions. The first test had only two transitions: one from pass to fail, then a second from fail to pass, after which it stayed green for the rest of the time. We cannot say the same about the second test, which exhibits a much more unstable behaviour pattern with many more transitions. From this image alone we can say there is a greater likelihood that test case 2 is flaky whereas test case 1 is not.

Failed test cases are highly likely to be flaky if they are “far” in the monorepo from the actual code change

We use a monorepo containing all our microservices and core libraries, with dependencies defined between projects. A project might depend on other projects, and those projects in turn might depend on others, and so on. Say we change a Java file in a core library called A. Projects B and C depend on library A; in turn, project D depends on project B and project E on project C. That change to project A will trigger not only the unit/integration tests of project A, but also the tests of everything that transitively depends on it, including B, C, D, and E. See the image below for the dependency tree in this example:

In this contrived example, tests from project A are at distance 0 from the actual code change, tests in projects B and C are at distance 1, and tests in projects D and E are at distance 2 from the code change in project A. The “farther” a failing test is from the actual code change on that revision, the higher the probability that it’s flaky.
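Computing this distance is just a breadth-first search over the reverse dependency graph. A sketch, using the A–E example above:

```python
from collections import deque

def dependency_distance(graph, changed, target):
    """BFS distance from the changed project to the project owning the test.

    `graph` maps each project to the projects that depend on it directly.
    Returns None if `target` does not depend on `changed` at all.
    """
    if changed == target:
        return 0
    seen = {changed}
    queue = deque([(changed, 0)])
    while queue:
        project, dist = queue.popleft()
        for dependant in graph.get(project, []):
            if dependant == target:
                return dist + 1
            if dependant not in seen:
                seen.add(dependant)
                queue.append((dependant, dist + 1))
    return None

# The example from the text: B and C depend on A; D on B; E on C.
deps = {"A": ["B", "C"], "B": ["D"], "C": ["E"]}
assert dependency_distance(deps, "A", "A") == 0
assert dependency_distance(deps, "A", "B") == 1
assert dependency_distance(deps, "A", "E") == 2
```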

Test cases that fail due to changes in config files are very likely to be flaky

Although it is quite possible for a change to an XML file or some other config file to cause a real test failure (maybe a URL was changed that made the test fail), in general a test that failed on a git revision where only config files were changed is highly likely to be flaky.

A test case that has failed on a git revision that changed a file that was previously changed a lot is highly likely to be a real failure and not a flake

Based on the idea that changing a file that has already been changed many times can introduce bugs, we also believe that tests failing on revisions that touch heavily modified files are more likely to be real test failures rather than flakes.

A test case that has failed on a git revision that changed a file which was previously changed by more than two authors recently is highly likely to be a real failure 

Just like the previous condition, a file that has been modified by several people is generally prone to introducing new bugs into the system. So a test that failed on a git revision that changed a file previously touched by more than two developers is highly likely to be a real test failure rather than a flake.

A test case that has failed on a git revision where many source code files were changed, is highly likely to be a real failure

This condition is closely related to the previous two. Pull requests that change many files across many projects in the monorepo have the potential to introduce bugs, so tests that fail on such a revision are highly likely to be real test failures rather than flakes.

Now what? 

Given that we have all this information in our test execution database, we now have to gather it and use it as training data. For each test failure recorded in the database, we look at that revision and get the list of changed files, and for each file we collect git-related info such as how many commits touched it, by how many authors, and so on. We also get the edge (pass/fail) history up to that revision. All this information about each test failure goes into a feature vector. These feature vectors are then used to distinguish between real test failures and flaky candidates using a simple unsupervised machine learning model called an autoencoder.
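As a sketch of what one feature vector might look like, with illustrative feature names (the exact schema we use is not spelled out here, so treat these fields as stand-ins for the heuristics described above):

```python
def build_feature_vector(failure):
    """Turn one recorded test failure into a numeric feature vector.

    The fields below are hypothetical stand-ins for the signals
    discussed in the text, not the production schema.
    """
    files = failure["changed_files"]
    return [
        failure["edge_count"],            # pass/fail transitions in history
        failure["dependency_distance"],   # hops from changed project to test
        sum(f["is_config"] for f in files) / len(files),  # share of config files
        max(f["commit_count"] for f in files),   # churn of the hottest file
        max(f["author_count"] for f in files),   # most authors on any one file
        len(files),                              # size of the revision
    ]

failure = {
    "edge_count": 7,
    "dependency_distance": 2,
    "changed_files": [
        {"is_config": True,  "commit_count": 3,  "author_count": 1},
        {"is_config": False, "commit_count": 40, "author_count": 5},
    ],
}
vec = build_feature_vector(failure)
assert vec == [7, 2, 0.5, 40, 5, 2]
```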

Since we do not have an annotated training set, the first step is to see what the data looks like. We first train a separate autoencoder on some of the data gathered from the database, just to get a glimpse of how the data is spatially distributed.

Here we could use any unsupervised clustering method; technically a simple k-means algorithm would work as well. The feature vectors we talked about earlier are multidimensional, so we cannot visualize them directly. We therefore apply PCA for dimensionality reduction, mapping them from their initial high-dimensional space down to two dimensions. We can see that the data is cleanly scattered into two main clusters:
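For illustration, here is plain k-means with k=2 in pure Python on 2-D points, standing in for the PCA-reduced feature vectors (the actual pipeline used an autoencoder plus PCA; this just shows how two well-separated clusters fall out):

```python
import random

def kmeans(points, k=2, iters=20, seed=0):
    """Minimal k-means on 2-D points: assign to nearest center, recompute."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            groups[i].append(p)
        centers = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
            if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Two well-separated blobs, standing in for "flaky" vs "real failure".
blob_a = [(0.1 * i, 0.1 * i) for i in range(10)]
blob_b = [(5 + 0.1 * i, 5 + 0.1 * i) for i in range(10)]
centers, groups = kmeans(blob_a + blob_b)
assert sorted(len(g) for g in groups) == [10, 10]
```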

Looking into some individual data points from both clusters, we realized that the red one corresponds to flaky tests and the blue one to real test failures. Just as we discussed earlier, flakes generally have a higher number of pass/fail edges in their execution history, they tend to correspond to git revisions where mostly config files were changed, and the “dependency distance” between the changed code and the test is higher on average. These conditions are enough to separate the flakes from the actual test failures, which have a lower average number of pass/fail edges and happen on git revisions where either many files were changed, or the changed files are mostly source code files that had been modified many times by many authors.

That’s it? 

Well, not really. Getting a feel for what the data looks like is not enough to classify a test as either flake or failure. What we want in the end is a method that runs directly on the Jenkins machines and can “look” at a test case and, when it fails, tell you whether that is a flake or a real test failure. If we’re dealing with a flake, we’ll ignore the failure so it does not become a nuisance to developers. What we need is the ability to classify between those two classes. So we train another multilayer autoencoder, using only the failures we inferred from the clustering. After that we have a trained autoencoder that can reconstruct failure feature vectors.

Autoencoders are a specific type of feedforward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.

An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces the code, the decoder then reconstructs the input only using this code.

A simple autoencoder trained to reconstruct simple digit images.
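To make the encoder/code/decoder idea concrete, here is a deliberately tiny linear autoencoder in pure Python (a 2-D input compressed to a 1-D code, no nonlinearities; a real model would be deeper and built with a deep-learning framework):

```python
import random

def train_autoencoder(data, code_dim=1, lr=0.05, epochs=500):
    """Tiny linear autoencoder: x -> W1 x (code) -> W2 (W1 x) (reconstruction)."""
    random.seed(0)
    in_dim = len(data[0])
    w1 = [[random.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(code_dim)]
    w2 = [[random.uniform(-0.5, 0.5) for _ in range(code_dim)] for _ in range(in_dim)]
    for _ in range(epochs):
        for x in data:
            code = [sum(w1[k][j] * x[j] for j in range(in_dim)) for k in range(code_dim)]
            recon = [sum(w2[i][k] * code[k] for k in range(code_dim)) for i in range(in_dim)]
            err = [x[i] - recon[i] for i in range(in_dim)]
            # Backpropagate the squared reconstruction error through both layers.
            back = [sum(err[i] * w2[i][k] for i in range(in_dim)) for k in range(code_dim)]
            for i in range(in_dim):
                for k in range(code_dim):
                    w2[i][k] += lr * err[i] * code[k]
            for k in range(code_dim):
                for j in range(in_dim):
                    w1[k][j] += lr * back[k] * x[j]
    return w1, w2

def reconstruction_error(x, w1, w2):
    code = [sum(w1[k][j] * x[j] for j in range(len(x))) for k in range(len(w1))]
    recon = [sum(w2[i][k] * code[k] for k in range(len(code))) for i in range(len(x))]
    return sum((a - b) ** 2 for a, b in zip(x, recon))

# Points on the line y = 2x compress losslessly into a 1-D code, so
# on-line points ("normal" data) reconstruct far better than off-line ones.
data = [[t / 10, 2 * t / 10] for t in range(1, 11)]
w1, w2 = train_autoencoder(data)
assert reconstruction_error([0.5, 1.0], w1, w2) < reconstruction_error([1.0, -2.0], w1, w2)
```

The point of the toy: after training only on “normal” data, inputs that resemble the training set reconstruct well, while inputs that do not incur a large reconstruction error. That asymmetry is exactly what the anomaly-detection step below exploits.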

We’ve trained an autoencoder to learn the internal representations of real test failures. Failures will therefore have a small reconstruction cost in this model, whereas flakes will have a large reconstruction cost. For more info on neural networks and their cost functions, check out this awesome mini book by Michael Nielsen.

In order to predict whether a new, unseen test failure is real or flaky, we calculate the reconstruction error from the failure’s feature vector. If the error is larger than a predefined threshold, we mark it as a flake (since our model should have a low error on real test failures).
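The decision rule itself is then a single comparison. A sketch, with the threshold as a parameter (the post settles on a value of roughly 40 from the reconstruction-cost plot):

```python
def classify(reconstruction_error, threshold=40.0):
    """Anomaly-detection rule: the autoencoder was trained only on real
    failures, so a large reconstruction error means "does not look like
    a real failure", i.e. a flake.
    """
    return "flake" if reconstruction_error > threshold else "real failure"

assert classify(12.5) == "real failure"  # reconstructs well: looks like training data
assert classify(97.0) == "flake"         # poor reconstruction: anomaly
```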

Let’s now plot each failure’s reconstruction cost to see what it looks like:

It’s clear from this picture that the maximum reconstruction cost for a failure is somewhere around 40, so let’s pick this value as the threshold.

Now let’s feed the feature vectors for the flaky samples forward through the network and plot them alongside the real failures to see how they differ:

The orange points represent flakes and the blue ones are the failures. The red line is our chosen threshold. We can see a clear separation between the two sets!

Why not a supervised binary classifier like SVM instead? 

Because in order to use a support vector machine or a feedforward neural net we would need an annotated training set, and as you might have guessed already, that would take a huge amount of time and manual labor, which is something we cannot afford. So we wanted a method that is unsupervised, or to be more precise semi-supervised: we first get a geometric understanding of the data, sample a few hundred examples from the failure cluster, and train an autoencoder to learn their internal representation. Then we use anomaly detection, where flakes are treated as anomalies in this model.


In the simple example above, samples with a reconstruction error higher than 40 are very likely to be flakes. To check this further, let’s look at the confusion matrix of this model:

So out of 802 flaky tests in the test set, all of them were detected as flaky: a recall of 100% with zero false negatives.

Out of the 19,271 real failures in the test set, all of them were classified as failures by our model and none of them as flakes, so zero false positives as well. We could say that our model performed flawlessly. This is the type of model used in malware detection or credit card fraud detection, where you must not misclassify a normal transaction as fraudulent, or even worse, a fraudulent one as normal. That would be very, very bad! That’s why we chose an anomaly detection algorithm based on an autoencoder, and not a classification algorithm such as a support vector machine or logistic regression.
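The precision/recall arithmetic from this confusion matrix, treating “flake” as the positive class:

```python
def metrics(tp, fp, fn, tn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Numbers from the post: all 802 flakes caught (no false negatives),
# all 19,271 real failures kept (no false positives).
precision, recall = metrics(tp=802, fp=0, fn=0, tn=19271)
assert precision == 1.0
assert recall == 1.0
```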

In Conclusion

We hope this has been an informative look into how we handle flaky tests and how we’ll go about detecting them.

If you love finding bugs and flaky tests as much as I do, and love writing great software that inspires a healthier, more active lifestyle, come join us!

About the Author

Liviu Serban – Senior Software Engineer

Liviu has been a software engineer at Fitbit for almost 2 years, working in the devprod team, where we try to make life easier for the rest of the developers by making sure that core services such as Jenkins are in good shape. When he is not investigating failures in Jenkins builds, Liviu likes to learn about machine learning and enjoys going out with friends.
