Engineering Fitness

The Tower of Terror: A Bug Mystery

Here at Fitbit, we are no strangers to shipping hardware products. Making a new product is not an easy task, especially as we approach its ship date. During this time, our employees (both inside and outside of engineering) are a key part of testing the product and logging as many testing hours as possible. We encourage our employees to wear the devices during their everyday lives, stress test them, and put them through any scenarios they can think of, no matter how weird.

For those of us working on the firmware (our general term for software that runs on the device), this time is filled with fixing bugs and tweaking the system to make it as polished and stable as possible. Like any deadline, a lot of hard work goes in near the end.

Pair programming gone wrong: Two people on one keyboard usually ends poorly (source: CBS)

This can certainly make for a stressful time, but sometimes there are little bugs that catch your attention and make you laugh. One such bug came across my desk during the final push before we launched the Fitbit Ionic. We thought this would make for a great example of how we test devices internally before they launch, and how our firmware team investigates crashes and bugs on our devices. Buckle up and strap in: it’s going to be a fun roller coaster ride.

Bugs and Bug Reports

Starship Troopers is even relevant to programming (Source: Tristar Pictures)

Bugs that reach the firmware team can take many forms including, but not limited to, the device crashing and rebooting. Sometimes these crashes occur during normal usage and most users don’t notice. Other times they are more obvious, such as when the user is actively interacting with the device. Our devices report all crash information back to our servers so that we know our top crashes regardless of whether users report them. However, crash reports can’t always give us all the information we need to fix a bug. Oftentimes, there are multiple paths to a certain line of code, and sometimes there are multiple triggers for a certain code path. Without data about what was happening at the time, it can be difficult to reproduce the bug and be sure that we have fixed it.

This is where bug reports from our testers are indispensable, as they tell us what the testers were doing with the device at a certain time of day (for example, exercising, sleeping, using an on-device app, or hiking in a certain location). Like any good mystery, every clue is helpful in hunting for bugs. Bug reports usually take the following format:

  • Summary of the issue
  • Steps to Reproduce
  • Reproducibility Rate
  • Firmware Version
  • Smartphone App Version
  • Any other info the tester can provide (date/time of occurrence, etc.)

Any/all of these fields could be filled for any given bug, but this represents a good set of information for a firmware developer to start bug hunting.

Enter Mission: Breakout!

If you’ve ever been to Disneyland, you’ve probably heard of the Guardians of the Galaxy – Mission: Breakout! ride or its predecessor, The Twilight Zone Tower of Terror. If you need clarification, here’s a YouTube video that gives an idea of the ride. The experience boils down to lots of very fast ups and downs.

Grand, bright, and terrifying (Source: Disney)

When our internal testers report a bug, the bug finds its way through our Field Testing team, and ends up on the (figurative) desk of engineers like myself. The following report showed up on my desk one day:

  • Summary of the issue
    Went on Disneyland Tower of Terror (Guardians of the Galaxy). During the course of the ride, my Ionic rebooted 3 times. I went on it twice and it was repeatable each time.
  • Steps to Reproduce
    Ride the Tower of Terror
  • Reproducibility Rate
    100%
  • Any other info
    Date/Times of Occurrence:
    Last Saturday between 5:00 and 5:30 PM, and between 6:30 and 6:40 PM

Another tester saw this report, and added on a comment:

I had this happen a few times. I was at Six Flags Magic Mountain and rode a ton of roller coasters.

Needless to say, my team and I got a good laugh out of this one. After the laughter died down, I buckled down to hunt for the bug and fix it. But before I dive into that process, let’s take a small detour into how Fitbit devices work.

Accelerometers

Accelerometers are in virtually every device that tracks motion these days, and they do exactly what you’d expect based on their name: they measure acceleration. These are in your smartphones, drones, and of course, the Fitbit devices you know and love. We use 3-axis accelerometers to get the data we need for many of our stats. The directions of the 3 axes (X, Y, and Z) can vary across products, but here’s an example for the Fitbit Blaze:

Accelerometer axes on a Fitbit Blaze (Source: Fitbit R&D)

Our accelerometer is active and taking measurements as long as the device is powered on, and it constantly delivers 3-axis data to algorithms such as step-counting, automatic exercise detection, screen wake, and many others. While you’re walking, for example, the accelerometer data looks something like this:

Acceleration vs. time, Axes separated (Source: Fitbit R&D)

This graph may look weird, because two of the axes are close to each other, but the third is far below the others. Why is that? If you guessed “gravity,” good job! By virtue of being on the planet Earth, an accelerometer at rest always reports 1g of acceleration, as that’s what it takes to resist the force of gravity. If you’d like more information as to why, this article can explain the concepts much better than I can.

In the case of this graph, you can fairly easily guess where the steps are happening by looking at the peaks on each line. Let’s make it even easier though. In 3 dimensions, the X, Y, and Z components create a 3D vector that can give us the overall magnitude and direction of our acceleration for each point in time. If we took this same data and graphed the magnitude of the vector through time, it looks something like this:

Acceleration magnitude vs time (Source: Fitbit R&D)

Even easier to find the steps now, right? Also as expected, our magnitude is centered around the 1g of force that we expect, and the peaks/troughs on the graph can tell us where the steps are. Our step-counting algorithm is far more complex than this, but this is just a simple example of how one could use an accelerometer to count steps.
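To make the magnitude idea concrete, here is a minimal sketch in C. It is not our step-counting algorithm: the accel_sample_t layout, the milli-g units, the function names, and the simple threshold-crossing rule are all assumptions made for illustration. It uses an integer square root so that no floating-point support is needed.

```c
#include <stddef.h>
#include <stdint.h>

/* One 3-axis reading in milli-g (hypothetical layout, for illustration only). */
typedef struct {
    int16_t x;
    int16_t y;
    int16_t z;
} accel_sample_t;

/* Integer square root (floor), using the classic bit-by-bit method. */
uint32_t isqrt32(uint32_t v)
{
    uint32_t result = 0;
    uint32_t bit = 1u << 30; /* highest power of four that fits in 32 bits */

    while (bit > v) {
        bit >>= 2;
    }
    while (bit != 0) {
        if (v >= result + bit) {
            v -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}

/* Magnitude of the acceleration vector: sqrt(x^2 + y^2 + z^2), in milli-g. */
uint32_t accel_magnitude_mg(accel_sample_t s)
{
    /* Each square fits in 32 bits; the sum of three int16 squares fits in uint32_t. */
    uint32_t xx = (uint32_t)((int32_t)s.x * s.x);
    uint32_t yy = (uint32_t)((int32_t)s.y * s.y);
    uint32_t zz = (uint32_t)((int32_t)s.z * s.z);
    return isqrt32(xx + yy + zz);
}

/*
 * Naive peak counter: counts each time the magnitude rises above a threshold.
 * The threshold is an illustrative parameter, not a tuned value.
 */
unsigned count_peaks(const accel_sample_t *samples, size_t n, uint32_t threshold_mg)
{
    unsigned peaks = 0;
    int above = 0; /* nonzero while the signal is above the threshold */

    for (size_t i = 0; i < n; i++) {
        uint32_t mag = accel_magnitude_mg(samples[i]);
        if (mag > threshold_mg) {
            if (!above) {
                peaks++; /* rising edge: a new peak begins */
            }
            above = 1;
        } else {
            above = 0; /* signal dropped back; re-arm for the next peak */
        }
    }
    return peaks;
}
```

A device lying still on a table would give a magnitude of about 1000 milli-g no matter how it is oriented, while a reading of all zeros on every axis gives a magnitude of zero.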

This entire digression about accelerometers is meant to show that they give us a lot of data, and even a little bit of processing on the data can help us understand what the user might be doing. Remember the concept of magnitude; it’ll come in handy later.

Following the Breadcrumbs

Okay, so back to figuring out what happened on the roller coaster. The first place we look when a crash is reported is the device crash records. If you’ve ever seen the Blue Screen of Death on Windows, you’ve probably seen a whole bunch of data that makes little obvious sense. The same would be true for our crash logs. While these logs can look intimidating, data like this can tell us a lot of what we need to find a bug. Here’s an example:

Fault Log:
Link Register 0x10018f31, Program Counter 0x10019064
Fault Register 0x00002000
Stack Trace: 0x10018f31   0x10019064   0x10018f31   0x1001bc17
Stack Trace: 0x10036401   0x00000000   0x00000000   0x00000000  
Stack Trace: 0x00000000   0x00000000   0x00000000   0x00000000  
Stack Trace: 0x00000000   0x00000000   0x00000000   0x00000000

PANIC!!!!!

After translating this log against our code base, I found the data I needed to start tracking this bug down:

Crash Cause: System tried to divide by zero
Location of offending code: AddNewSamples function at algorithm_x.c, line 500

Yay! We now know why our device crashed, and where in our codebase it crashed. As expected, our system panics and crashes when we try to divide by zero in our algorithm code. Our algorithms perform many mathematical operations, and any abnormalities can have unexpected effects on your stats, so we need to catch these errors and ensure we fix them.

The next step after translating the crash data is to look at the code where the divide-by-zero occurred, so let’s go find line 500 of the algorithm_x.c file. Here’s a snippet very much like that line of code.

500: scaled_accel = (SCALE_ACCEL * raw_accel_mag) / filtered_accel_mag;

Admittedly, this code is a little opaque without more context, but it boils down to scaling the ratio between a raw accelerometer magnitude and a filtered accelerometer magnitude. This is one mathematical step in the larger context of this algorithm. To our advantage, there is only one division on this line, so we know exactly which value must have been zero to cause this crash: filtered_accel_mag. The question now becomes, how did that value become zero for us to divide by it?

The answer is, of course, physics. As you might have gleaned from the video above, the Mission: Breakout! ride drops you into what is effectively free-fall for several seconds, during which your accelerometer sees a zero-g environment. If you’re curious about why, the article I linked above has a great explanation.

Once I realized this was caused by a zero-g environment, I tried to reproduce it myself by throwing the device in the air. This would be akin to creating a moment of zero-g, and I expected it to crash the device. To my surprise, the device did not crash. After more investigation, the key to understanding why this occurred was in the name of the value: “filtered.” In this context, “filtered” means we are smoothing the data and making it less susceptible to change in a short amount of time. This meant that my experiment would not read zero-g for long enough to make the filtered value reach zero. On a roller coaster, however, the accelerometer reports zero-g for long enough that the value reaches zero and causes the crash.
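The difference between a quick toss and a long drop can be sketched with a simple integer low-pass filter. This is only an illustration: the post does not describe the actual filter, so the exponential moving average, the FILTER_WEIGHT value, the milli-g units, and the function names below are all assumptions.

```c
#include <stdint.h>

/* Hypothetical smoothing factor: each new sample contributes 1/8 of the result. */
#define FILTER_WEIGHT 8u

/*
 * Integer exponential moving average.
 * Because integer division truncates, sustained zero input eventually
 * drives the filtered value to exactly zero rather than just near it.
 */
uint32_t filter_update(uint32_t filtered_mg, uint32_t raw_mg)
{
    return (filtered_mg * (FILTER_WEIGHT - 1u) + raw_mg) / FILTER_WEIGHT;
}

/* Start at rest (1000 milli-g) and feed `samples` consecutive zero-g readings. */
uint32_t filtered_after_free_fall(unsigned samples)
{
    uint32_t filtered_mg = 1000u;

    for (unsigned i = 0; i < samples; i++) {
        filtered_mg = filter_update(filtered_mg, 0u);
    }
    return filtered_mg;
}
```

With these assumed parameters, a handful of zero-g samples (a short toss) leaves the filtered value well above zero, while a long enough run of zero-g samples (a sustained drop) brings it all the way down to zero. How many samples that takes in practice depends on the real filter and sample rate, which are not given here.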

So now we know why riding Mission: Breakout! causes the device to crash and reboot: 

Being in free-fall for enough time causes the filtered accelerometer magnitude value to decay to zero, which makes this line of code divide by zero, thereby crashing the system.

Fixing the Bug and Releasing a New Firmware

Now that we know why the bug happens, it’s important to find a fix for it. In the case of this particular algorithm, the accelerometer signal was only used to check for movement. Device movement during free-fall is not a big issue for us, especially since this is a fairly rare use case. This meant that we could check for the value being zero and, if so, substitute a hard-coded non-zero value instead of performing the division.

if (filtered_accel_mag > 0) {
    scaled_accel = (SCALE_ACCEL * raw_accel_mag) / filtered_accel_mag;
} else {
    /* Filtered magnitude has reached zero (e.g., sustained free-fall): skip the division. */
    scaled_accel = DEFAULT_VALUE;
}
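One way to keep a fix like this from regressing is to wrap the guarded division in a small function and exercise the zero-divisor path directly. The sketch below is an assumption-laden illustration: the function name scale_accel and the values chosen for SCALE_ACCEL and DEFAULT_VALUE are hypothetical, since the real constants are not given in this post.

```c
#include <stdint.h>

/* Hypothetical constants; the real values are not given in this post. */
#define SCALE_ACCEL   100
#define DEFAULT_VALUE SCALE_ACCEL /* assumed fallback: treat raw and filtered as equal */

/*
 * Scale the ratio of raw to filtered magnitude.
 * Falls back to DEFAULT_VALUE when the divisor is zero, so the
 * division is never attempted with a zero filtered magnitude.
 */
int32_t scale_accel(int32_t raw_accel_mag, int32_t filtered_accel_mag)
{
    if (filtered_accel_mag > 0) {
        return (SCALE_ACCEL * raw_accel_mag) / filtered_accel_mag;
    }
    return DEFAULT_VALUE;
}
```

Calling scale_accel(0, 0), which mimics sustained free-fall, returns the fallback value instead of faulting, and normal inputs behave exactly as the original line did.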

Once we had implemented the fix, the only thing remaining to do was make sure our internal testers gave it a test run before releasing it to the public. Our internal testers (like many of our customers) faithfully update their firmware as soon as it becomes available, and it turns out there are many Fitbit employees willing to volunteer to test a bugfix on a roller coaster. In short order, we had our confirmation of a fix. One such comment:

I experienced no reboots on California Screamin’ yesterday with two rides.

Unfortunately, being so close to launch, we could not release the fix for this bug along with Ionic’s launch. We ultimately released the fix for this bug along with a larger firmware release to support our (open-source!) SDK, which enabled third-party developers to make apps for our smartwatches. With the release, the bug was officially dead and closed. Victory!

In Conclusion

We hope this has been an informative look into how we handle device crashes as a team, and how we go about tracking down and fixing a bug. This is by no means the weirdest bug we’ve ever had. A few examples of bugs we’ve faced over the years:

  • Tight pants causing excessive “floors climbed” to be measured in someone’s pocket
  • Connected GPS reporting less accurate distances in the Southern Hemisphere
  • Hardware light sensitivity throwing off power measurements

Bug fixes like this roller coaster case get carried forward to all of our future products, and as a result, our products get better and better over time. No product launch is without its issues, but we put a lot of love and care into making these devices as great as possible. Our new Versa 2 smartwatch is the perfect example of just how far our smartwatches have come.

If you love embedded software and firmware as much as I do, and love making devices that inspire a healthier, more active lifestyle, come join us at fitbit.com/careers!

About the Author

Shiva Rajagopal – Staff Firmware Engineer

Shiva has been a firmware engineer at Fitbit for 3 years, and has primarily worked on Fitbit’s smartwatches (Ionic, Versa, Versa Lite Edition, Versa 2). He served as the Firmware Development Lead for Versa Lite Edition, and loves hunting down bugs (if you couldn’t tell by this blog post). When not hunting bugs, Shiva enjoys biking around San Francisco, volunteering at the SF Exploratorium, and sampling different beers around the Bay Area.

Acknowledgements

Thanks to Alexandros Pantelopoulos for the data figures. Thanks to all who reviewed this post, and thanks to the Ionic team for a fun time!
