Between December 18th and 19th, 2022, Order Desk sustained a partial outage in which new orders could not be imported into the system. The order insertion downtime ran from Sunday afternoon to Monday afternoon, a total of 26 hours. Over the course of that downtime, inventory syncing, shipment notices, and order submission continued to process normally.
Order Desk keeps an internal ID number for each order in a field in our database. This field's type was “integer,” and one of the main downsides to using an integer field is that this type is limited to 32 bits, for a maximum of 2,147,483,647 (about 2.1 billion). We knew we would have to upgrade to a “biginteger” field, which supports up to 9,223,372,036,854,775,807 (about 9.2 quintillion), but our projections gave us roughly eight months to prepare, and we were planning to complete this database maintenance after the new year.
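For the curious, both limits fall straight out of signed integer widths. A quick Python check (this is just the arithmetic, not our actual schema):

```python
# A signed 32-bit "integer" column: 31 value bits plus a sign bit.
INT_MAX = 2**31 - 1        # 2,147,483,647 (~2.1 billion)

# A signed 64-bit "biginteger" column.
BIGINT_MAX = 2**63 - 1     # 9,223,372,036,854,775,807 (~9.2 quintillion)

print(INT_MAX)     # 2147483647
print(BIGINT_MAX)  # 9223372036854775807
```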
We had been seeing significantly elevated order import counts, but this wasn’t translating into higher order fees across our platform, which clued us in that something was amiss. We did have internal tools in place to monitor this, but the numbers began skyrocketing much more quickly than expected, and we hit our maximum more than half a year early on what would otherwise have been a quiet Sunday afternoon.
After the outage was resolved, we were able to locate a series of stores that had accidentally been importing and then immediately deleting orders that weren’t applicable to them. This had been happening in a tight loop, resulting in many millions of extra “orders” over the week leading up to the outage.
The order import failures began at 4:20pm (Eastern Time) on Sunday afternoon. This didn’t manifest as an obvious error, however: the database reported that any order attempting to be imported had already been imported, an odd side effect of the way the database handles out-of-range numbers. Because of the timing and the lack of error messages, we weren’t alerted to any issues, and a few hours passed before we received any notifications from customers.
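A minimal sketch of that failure mode, assuming the database clamps out-of-range values to the column maximum rather than raising an error (the helper names and in-memory “table” here are hypothetical, for illustration only): once IDs exceed the limit, every new insert collides with the same existing row and looks like a duplicate instead of a failure.

```python
INT_MAX = 2**31 - 1  # maximum for a signed 32-bit integer column

def clamp_to_int(value: int) -> int:
    """Simulate a database silently clamping an out-of-range
    value to the column maximum instead of raising an error."""
    return min(value, INT_MAX)

# Hypothetical in-memory "orders" table keyed by internal ID.
orders = {INT_MAX: "last order that fit in the column"}

def import_order(next_id: int, payload: str) -> str:
    stored_id = clamp_to_int(next_id)
    if stored_id in orders:
        # The clamped ID collides with an existing row, so the
        # import quietly reports a duplicate -- no error raised.
        return "already imported"
    orders[stored_id] = payload
    return "imported"
```

In this model, every order arriving after the limit is reached comes back as “already imported,” which matches the absence of error messages we saw.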
At 8pm on Sunday, the engineering team began assessing the problem and working on solutions. Updating a database column type on a large table with lots of data isn’t a simple process; it involves rebuilding the relevant data and indexes, which can be quite time-consuming. To add to the complexity, the order ID fields are also referenced in other large tables that needed to be rebuilt as well.
With fixes in hand, the team began the first update to the orders table at 11:45pm. The projected run time for the migration was just under six hours. Two hours in, the migration failed because the process was consuming too many database threads, and the team was forced to start over.
At 2:30am on Monday, the team began a second migration attempt with some new parameters in place to pause the process if the database got too busy. At the same time, the team was migrating a secondary table that needed to be updated as well. The secondary migration completed after only a half hour, which was encouraging.
At 3:10am, the migration began on the final secondary table, which completed successfully four hours later.
At this time, the team began setting up contingency plans in case the main migration failed again. These plans centered on migrating data manually to a new database with a new orders table, starting with the newest orders and, over time, moving all older orders to the new database.
This would, however, require all of Order Desk to be inaccessible for several hours to ensure data integrity, something we were hoping to avoid. Nevertheless, the team began preparing in case this was needed.
Around this time, order submission was turned back on, which enabled Order Desk to submit orders that had already been imported and to process inventory syncing and shipment notification. We did not turn on any scheduled appointments, however, as those were primarily involved with downloading orders.
As morning dawned and the app began to get busier, the migration process slowed to a crawl. By 10am, we began to see a lot of database deadlocks, which made us nervous about the migration process that had already been running for over seven hours. The migration continued to slow, but by noon it seemed we were finally nearing the end. At 12:45pm, the database usage numbers began to spike. We’d seen this happen at the end of a migration before, so we made the decision to put the app in maintenance mode, which makes Order Desk temporarily inaccessible to the public. At 1:02pm, we saw a huge spike of database processing (80x what the database was rated for), and the second migration failed at 99%.
We decided that we would upgrade the main database instance to a much larger server, 6x our current size, and then make one final migration attempt before trying the option that would involve shutting down the app completely. This began at 1:30pm and, because of the bigger database server, it was initially projected to finish in just three hours. We’d generally been seeing that migrations needed an extra hour, so our final projection for completion was 5:30pm. At 5:45pm, the migration told us it was at 99% with just 22 seconds left.
The third migration attempt eventually completed successfully at 6:08pm and, after running some confirmation tests, we began turning the system back on and re-enabling imports. The total downtime for order imports came to almost exactly 26 hours.
What We Learned
- We need better alerts tied to our order import counters going to zero. This would have alerted our engineering team to the issue sooner, likely before customers noticed.
- We should have stayed on top of the elevated order counts to find out what was really happening.
- We learned about some new, critically important load management parameters on our database migration tools.
- We discovered that aggressively upsizing our database before migration will radically improve migration time.
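The first lesson above can be sketched as a simple zero-count check. This is a hypothetical illustration of the idea, not our monitoring code; the metric window and function name are assumptions:

```python
def should_alert(import_counts, window=3):
    """Return True when the most recent `window` order-import
    counts are all zero -- the signature of imports silently
    failing rather than a normal quiet stretch of traffic."""
    recent = import_counts[-window:]
    return len(recent) == window and all(count == 0 for count in recent)
```

A check like this, run against per-interval import counts, would have fired within minutes of the 4:20pm failure instead of waiting on customer reports.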
We know that you rely on Order Desk for your business, and we know we let you down. We’ve already put significant resources in place to ensure that there will be better safeguards to protect against further partial outages like this. We’re resolved that this won’t happen again.
Our consistent uptime has always been something that we’ve been proud of, and we’ve (thankfully) never needed to mitigate an outage of this size in our 8 years in business.
We also know that this partial outage couldn’t have come at a worse time of year: we shared in your frustration knowing that it was happening on the final shipping day for Christmas. Holiday fulfillment times were top of mind during the downtime, and our entire engineering team was fully engaged for every minute of the outage, trying everything possible to get Order Desk back to 100% functionality.
Lastly, we’re deeply grateful for the outpouring of support we received from so many of you. The kind words and cat GIFs that you sent to us in the middle of a highly stressful time were more helpful than you can imagine.
Thank you for your trust in Order Desk. We are committed to taking continuing action to be worthy of it.