Update on the recent OnCheckin service interruption
On the evening of November 18th, Pacific Standard Time, we experienced an outage that interrupted access to our website, as well as the operation of our source control and build nodes, from 5pm until 11pm PST. The cause was a much larger outage at our main hosting provider, Microsoft Azure, which affected every Azure service that relies on storage (basically all services hosted on Azure). We sincerely apologize to our customers who were attempting to deploy their websites using our service during this time.
You can read more about the Azure outage and its root causes on the Azure website here.
But doesn’t the cloud mean you never go down?
OnCheckin uses Microsoft Azure for the majority of our hosted services. We run a redundant multi-region setup for our databases and website, and our build nodes are positioned across multiple data centres on the East and West coasts of the US and in Europe. This means that most outages only degrade our services: the main risk is that a customer’s build and deploy may be interrupted mid-build, with the following build continuing unhindered in a new region.
The service interruption suffered by our hosting partner Microsoft was multi-region and very broad, affecting all of our services.
Over the past year we have moved away from a hybrid cloud model, hosted across three providers (SoftSys Hosting, Amazon AWS and Microsoft Azure), to a single provider. As OnCheckin is a largely automated service, we saw large cost savings from standardizing our automation stack on a single provider. Because we run a largely Microsoft-based technology stack and are a Microsoft BizSpark-sponsored startup, Microsoft Azure was an easy choice for us. In hindsight this decision was a mistake, as it exposed our service and our customers to a greater risk of service interruption. If we had retained our hybrid model we might have been able to reduce the severity of the outage experienced on Tuesday night.
Ensuring we grow and learn from the experience
2014 has been a big year for us. We have seen some exciting customer growth over this time, forcing us to review the importance we place on uptime. Our customers are global, and we are building and deploying customers’ websites around the clock. With this growth in round-the-clock usage comes a need to ensure that we’re able to stick to our goal: making sure a worst-case, last-minute Friday night deployment by a customer always runs smoothly. We know first-hand, as developers ourselves, that this is when continuous deployment automation matters most.
We can and will work to avoid service interruptions like this in the future. Over the coming months we will be reviewing our hosting options for 2015 and moving back to a multi-provider hosting model, so that we do everything we can to avoid incidents like the one that occurred on Tuesday night.