On March 9th, 2022, over the course of two hours and thirty-six minutes between 12:45 and 15:21 UTC, some customers experienced issues with requests to our platform, including higher build times and API failures. Within that window, the impact to our service was intermittent and lasted two hours in total. During this time, customers may have seen their sites fail to serve correctly. When cached content was unavailable, our service had to reach back to our databases to update the cache and serve the new content. Because the root issue of the incident was database latency, in some cases we were unable to refresh the cache in a timely manner. For content that was not cached, such as dynamically generated content, content from new deploys, and password-protected content, these errors and latency were observed as failed requests affecting your visitors.
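The cache behavior described above can be sketched as follows. This is a simplified illustration in Python, not the platform's actual implementation: `ReadThroughCache`, `slow_database`, `OriginTimeout`, the paths, and the deadline and latency values are all hypothetical, and database latency is simulated with a number rather than a real delay.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class OriginTimeout(Exception):
    """Raised when the backing database answers more slowly than the deadline."""


@dataclass
class ReadThroughCache:
    # fetch(key) returns (content, simulated_latency_seconds); hypothetical signature.
    fetch: Callable[[str], Tuple[str, float]]
    deadline_s: float = 1.0
    store: Dict[str, str] = field(default_factory=dict)

    def get(self, key: str) -> str:
        # Cache hit: served without touching the database.
        if key in self.store:
            return self.store[key]
        # Cache miss: reach back to the database for the content.
        content, latency_s = self.fetch(key)
        if latency_s > self.deadline_s:
            # Too slow: the cache is not refreshed and the request fails.
            raise OriginTimeout(f"{key}: {latency_s:.1f}s > {self.deadline_s:.1f}s")
        self.store[key] = content
        return content


def slow_database(key: str) -> Tuple[str, float]:
    # Simulated degraded database: every lookup takes 5 seconds (assumed value).
    return f"content for {key}", 5.0


cache = ReadThroughCache(fetch=slow_database, deadline_s=1.0)
cache.store["/index.html"] = "cached home page"

# Already-cached content is still served during the degradation.
print(cache.get("/index.html"))

# Uncached content (for example, from a new deploy) fails instead.
try:
    cache.get("/new-deploy.html")
except OriginTimeout as err:
    print("failed request:", err)
```

In this sketch, a request only fails when it both misses the cache and hits the slow database, which matches the intermittent, content-dependent impact described above.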
Summary of impact:
Periods of failed or slow web requests, increased API request latency and errors, and unavailable password-protected sites.
We’re genuinely sorry for the impact on our customers and everyone who relies on them. We want to provide the best service possible and we take any service disruption seriously. Our vision is to build a better web, and we strive to provide world-class service at every tier. Below we will provide more insight into the specifics of what occurred, as well as the measures we’re already taking to mitigate the risk of future incidents. We understand the serious nature of this event and we are committed to sharing any new information we uncover as we learn more.
Impact and Resolution Steps
On March 9th, beginning at 12:45 UTC, we encountered an issue with our production databases following planned system maintenance that was not expected to impact performance. After the maintenance concluded, we encountered connectivity issues to our databases, which impacted requests served by the API as well as standard customer builds.
[2022-03-09 09:00 - 01:00 UTC]: Planned maintenance conducted
[2022-03-09 12:45 UTC]: Latency alerts first occur, indicating high latency on the API and database. The team begins investigating
[2022-03-09 12:51 UTC]: Latencies resolved
[2022-03-09 13:02 UTC]: Scheduled maintenance status updated to complete
[2022-03-09 13:10 UTC]: Latencies on database connections increased, again impacting the API, build systems, and the serving of uncached content
[2022-03-09 15:21 UTC]: Latencies resolved
[2022-03-09 15:43 UTC]: The source of the latency was identified and we began developing a mitigation
[2022-03-09 16:07 UTC]: Impact is mitigated following a code change to resolve database connection issues
[2022-03-09 16:47 UTC]: Additional improvements are made to our database to improve performance
[2022-03-09 16:54 UTC]: Status updated to monitoring after mitigations, with monitoring indicating that requests and builds are operational
[2022-03-09 17:34 UTC]: Incident declared resolved