Ensuring that Fitbit’s site is always available is a top priority for our engineering organization. This is especially true during the annual holiday season that spans Black Friday through the end of the year. Christmas day brings extra pressure and visibility because it is the busiest day of the year for user account creation and device setups. More importantly, for many of our customers Christmas day is the first time they use one of our products, and we want their first impression of Fitbit to be the best experience possible.
Sometimes circumstances arise that pose unexpected challenges to keeping the site operating smoothly. Fitbit encountered one of these situations on Christmas day in 2014, which caused the site to be unavailable for part of that day. Members of our engineering, operations, and customer support teams stepped up to address the problem that day, limiting the customer impact. However, many customers were still affected, and it was not the kind of experience we strive to provide. We wanted to make sure we did not repeat the 2014 downtime in future holiday seasons.
Our overarching goal for the 2015 holiday season was to improve our risk management through better practices, increased visibility of the holiday preparation work, and clearer decisions on trade-offs. In the spring of 2015, the Fitbit leadership team assembled a group of senior engineers and managers to identify and prioritize the risk areas that could cause problems for the 2015 holiday season.
Items were sorted into priority buckets that generally corresponded to levels of risk, or more accurately, the risk of not doing them. There were a number of known “must do” software and infrastructure improvements. For example, for Fitbit to handle the planned capacity needed for Christmas 2015 and beyond, we knew we would need to expand to an additional data center. Other items were identified as important for reducing risk, such as addressing areas of our site that needed major optimization or would likely trigger downtime if certain load thresholds were exceeded. Finally, we identified items that carried some potential risk but that we felt reasonably confident would not cause problems during the holidays.
One of the challenges in 2015 was keeping the focus on holiday preparation while also rolling out new software features and delivering new devices. The leadership team helped address this by forming a new program management team that summer, which could focus on improving the planning processes around the development work. In the fall of 2015 we adopted a new process that gave the executives more visibility into the work needed to complete the preparations. This new process was an adjustment for the software teams, and some teams had to re-prioritize their planned Q4 work to make sure they had enough time to complete their holiday preparation tasks.
While this was happening, Fitbit was growing its Site Reliability Engineering team. One of the SRE team’s priorities was to work with other teams to make sure we had proper system “runbooks” and “game-day scenarios” documented, and appropriate monitoring and alerting in place, so that we would know what to look for and how to respond if things went wrong on Christmas day. While some of our 2015 holiday preparations were a little rough around the edges, we achieved the executive goal of having no downtime over both the Black Friday holiday weekend and Christmas day.
Because of all the effort spent to scale and harden Fitbit’s site in 2015, we were able to take a more holistic and proactive approach to addressing these challenges in 2016. We wanted to identify potential risk areas and plan the resulting work earlier so that it didn’t all fall on teams in Q4. To do this, the leadership team created an official “Scaling and Resilience” program within the company to drive the effort more formally. In addition, we added more members to our Performance and Capacity team, which is responsible for investigating performance bottlenecks and creating more fine-grained capacity planning models. This information would feed into the program’s decisions on how to prioritize our holiday preparation work.
The program kicked off in the spring of 2016 and its first major focus was engaging with other teams to survey how Fitbit’s systems and services contributed to our site’s ability to scale and withstand unplanned situations. Based on the survey results and other investigations, we identified 40 combined Priority 1 (P1) and Priority 2 (P2) development and operational tasks for teams to complete before the holidays. The main themes of the work were around system optimization, improving and expanding monitoring and alerting, infrastructure provisioning, and expanding our documentation. Our “must do” P1 list was much smaller in 2016 because of the foundation we had laid in 2015, and instead we could focus on further hardening our systems.
The program engaged teams within their existing planning processes with the goal of completing all the P1 and P2 tasks by the end of Q3. As the program’s holiday preparation task burn-up chart shows, we didn’t quite achieve that, but far more work was completed during Q3 in 2016 than in the previous year. In addition, most of the issues we needed to address were identified by the end of Q3, with only a few other tasks added in October. Most of the work itself was completed by November, and the three remaining tasks were completed in December. (Note that in the burn-up chart, the green “resolved” line crosses the red “created” line because three of the issues we were tracking had been created before the 240-day window that the graph represents.) We were able to head into Christmas knowing that we had completed everything on our list, and December 2016 was much less stressful for everyone than December 2015.
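The crossing lines in the parenthetical note follow from how a windowed chart counts issues. Below is a minimal sketch, assuming a chart that counts only the "created" and "resolved" events falling inside a fixed window. The function name `burn_up_counts` and every date are hypothetical and chosen only to illustrate the effect; they are not Fitbit's actual task data or tooling.

```python
from datetime import date, timedelta


def burn_up_counts(issues, window_start, window_end):
    """Count issues created and resolved inside [window_start, window_end].

    `issues` is a list of (created_date, resolved_date) pairs;
    resolved_date is None for an issue that is still open.
    """
    created = sum(1 for c, _ in issues if window_start <= c <= window_end)
    resolved = sum(
        1 for _, r in issues
        if r is not None and window_start <= r <= window_end
    )
    return created, resolved


# Hypothetical 240-day window ending on the last day of the year.
window_end = date(2016, 12, 31)
window_start = window_end - timedelta(days=240)

# Hypothetical issues: the first was created before the window opens,
# so it adds to the "resolved" total but not to the "created" total.
issues = [
    (date(2016, 3, 1), date(2016, 9, 15)),
    (date(2016, 6, 1), date(2016, 10, 1)),
    (date(2016, 7, 10), date(2016, 12, 5)),
]

created, resolved = burn_up_counts(issues, window_start, window_end)
print(f"created in window: {created}, resolved in window: {resolved}")
```

With these made-up dates the resolved count ends above the created count, the same effect the note describes for the three issues opened before the chart's window.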
Once again, our preparations paid off. Black Friday weekend and Christmas day had no disruptions, and the best part was that this was accomplished without any fire drills. Teams were able to keep working on their holiday preparation tasks alongside their feature development and other responsibilities.
As we look towards the 2017 holidays and beyond, it’s a good opportunity to reflect on several lessons we’ve learned over the last two years:
· Have centralized ownership and clear priorities around holiday preparation. While the resulting fire drill in Q4 2015 was not ideal, the clear priority from the leadership team around holiday preparation focused teams on delivering their work on time. Creating a dedicated program in 2016 to oversee the preparation efforts was a further improvement since it could drive the priorities, communication, and planning needed for the holidays while integrating into the organization’s existing processes.
· Plan early and avoid fire drills. The holidays are not moveable, so everyone wins when we plan holiday preparation work early and spread the execution over the course of the year.
· Make addressing scaling and resilience concerns a regular part of the design and development process. Using surveys to get teams to think about their existing systems and services was valuable. Half of our combined P1 and P2 holiday preparation task list directly resulted from the surveys. In addition, going forward we will cover scaling and resilience issues as part of our regular architectural reviews for new services at Fitbit.
· Remember Benjamin Franklin’s words of wisdom: “An ounce of prevention is worth a pound of cure.” Even though we didn’t have any downtime on Christmas in 2015 or 2016, some incidents came up on and around those days that the SRE team needed to investigate. The runbooks and game-day scenario plans were there for team members to reference if they needed them.
Given the complexity of running a modern production site like Fitbit’s, there is always a chance that something unexpected will crop up. No plan is perfect. Downtime doesn’t necessarily mean mistakes were made, and a lack of downtime doesn’t necessarily mean we did everything perfectly. However, the last two years of holiday preparation show how focused planning, strong organizational support, and learning from previous experience can substantially improve the odds of successfully managing peak loads on the site. We delivered on our goal of making sure our customers had a great experience using our products on Christmas day, even though it is the busiest day of the year.
About the Author
Jeff Beall is a member of the Interactive Program Management team at Fitbit and has been with the company since 2015. Jeff is the program manager for the Scaling and Resilience program and helps with other site engineering initiatives as well. Prior to Fitbit, he focused on development, architecture, engineering management, and program management in the digital media and content creation fields.