Resolved
Updated

Post-Incident Review

At 14:57 UTC on April 3rd some customers began receiving TLS certificate warnings and "403 Forbidden" error messages when accessing their Status Pages and our Management UI.

Within a few minutes, the Sorry™ team had been paged, and were on scene to begin diagnosis.

Our initial investigation led us to believe this was a TLS provisioning issue on our CDN. However, after a deeper dive with help from our edge provider, Fastly, it was traced to one of our DNS hosts, which was intermittently serving an incorrect and very old record for our domain.

Because of these incorrect DNS records, affected traffic was no longer routed through our CDN and WAF but went straight to our origin servers, which for security purposes are not designed to receive direct traffic like this.
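For readers who want to run a similar check on their own domains, the failure mode described above (one DNS host answering differently from the rest) can be sketched as a comparison of per-name-server answers. This is a minimal illustration rather than the tooling we used: the function name, the host names and the record values are all hypothetical, and gathering the answers (for example by querying each authoritative name server directly) is assumed to happen elsewhere.

```python
from collections import Counter


def find_outlier_nameservers(answers):
    """Return the name servers whose answer differs from the majority answer.

    `answers` maps a name server host to the record values it returned
    for the same query. All names here are hypothetical examples.
    """
    # Normalise values so "CDN.example.net." and "cdn.example.net" compare equal.
    normalised = {
        ns: frozenset(value.strip().lower().rstrip(".") for value in values)
        for ns, values in answers.items()
    }
    if not normalised:
        return {}

    # Assumption: most name servers are healthy, so the most common
    # answer is treated as the correct one.
    majority, _count = Counter(normalised.values()).most_common(1)[0]

    return {
        ns: sorted(values)
        for ns, values in normalised.items()
        if values != majority
    }


# Hypothetical data: two hosts agree, one serves an old record.
answers = {
    "ns1.provider-a.example": ["cdn.example.net."],
    "ns2.provider-a.example": ["cdn.example.net"],
    "ns1.provider-b.example": ["203.0.113.10"],
}
outliers = find_outlier_nameservers(answers)
```

The majority rule is a simplification: if half or more of the hosts were serving the stale record, a comparison against a known-good value would be needed instead.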

At 16:27 UTC the offending DNS provider was identified and removed from our domain. We then began to see traffic return to normal, and this trend continued over the following hour as the name server change propagated.

At 17:57 UTC we received final confirmation from customers that they were no longer seeing any errors.

Whilst this only impacted a small number of customers, it was a substantial impact, with the application being intermittently unavailable to them for several hours.

For this, we're incredibly sorry.

Improvements Made

During our post-impact assessment, we found the biggest contributing factor to the incident duration was the time it took us to identify the root cause as a DNS failure.

This was partly because DNS is generally very stable and not often the "likely cause" of an incident, so it didn't garner our immediate attention. However, there are certainly lessons we can learn.

With that in mind, we took several steps to minimise the risk of it happening in future, and to help us respond more quickly should the worst occur.

Better DNS Monitoring

We have added OhDear as additional monitoring on all our critical endpoints. Their suite of tools includes DNS-specific checks, looking for changes in name servers and record types.

We have also added DNSChecker to our suite of diagnostic tools.

These new tools should make spotting and identifying DNS issues much faster and help narrow down which provider the failure stems from.
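The kind of check described above, looking for changes in name servers and record types, can be sketched as a comparison between a known-good baseline and what is currently observed. This is an illustrative sketch only and does not represent how OhDear or DNSChecker work internally; the function name, domain names and values are hypothetical.

```python
def detect_dns_drift(expected, observed):
    """Compare a known-good DNS baseline with observed records.

    Both arguments map (name, record_type) to a set of record values.
    Returns one human-readable line per record set that differs.
    """
    changes = []
    # Walk every record set seen in either the baseline or the observation,
    # so added, removed and altered record sets are all reported.
    for key in sorted(set(expected) | set(observed)):
        want = set(expected.get(key, ()))
        got = set(observed.get(key, ()))
        if want != got:
            name, record_type = key
            changes.append(
                f"{name} {record_type}: expected {sorted(want)}, observed {sorted(got)}"
            )
    return changes


# Hypothetical baseline: the domain should point at the CDN via a CNAME.
expected = {
    ("status.example.com", "NS"): {"ns1.provider-a.example", "ns1.provider-b.example"},
    ("status.example.com", "CNAME"): {"cdn.example.net"},
}
# Hypothetical observation: an old A record has replaced the CNAME.
observed = {
    ("status.example.com", "NS"): {"ns1.provider-a.example", "ns1.provider-b.example"},
    ("status.example.com", "A"): {"203.0.113.10"},
}
changes = detect_dns_drift(expected, observed)
```

Here both the unexpected A record and the missing CNAME are reported, which covers a change in record type as well as a change in value.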

New Primary/Primary DNS Redundancy

We have also replaced our old Primary/Secondary DNS configuration with a more resilient Primary/Primary setup using DNSimple and AWS Route53.

Running two separate and disconnected DNS providers means that DNS is much less likely to be a single point of failure.

We are also in a much better position to drop the offending provider from our traffic flow should we need to.
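As a rough illustration of what dropping a provider involves, the sketch below filters one provider's name servers out of a delegation and refuses to continue if that would leave none. It is a simplified, hypothetical example and not our actual procedure: the function name and host names are invented, and matching on a single domain suffix is an assumption that would not hold for a provider whose name servers sit under several domains.

```python
def drop_provider(nameservers, provider_suffix):
    """Return the name servers that remain after removing one provider.

    `provider_suffix` is the (hypothetical) domain the provider's name
    servers live under. Raises ValueError if nothing would remain.
    """
    suffix = provider_suffix.lower().strip(".")

    def belongs_to_provider(host):
        # Compare case-insensitively and ignore a trailing root dot.
        cleaned = host.lower().rstrip(".")
        return cleaned == suffix or cleaned.endswith("." + suffix)

    remaining = [ns for ns in nameservers if not belongs_to_provider(ns)]
    if not remaining:
        # Safety guard: never remove the last provider from the delegation.
        raise ValueError("dropping this provider would leave no name servers")
    return remaining


# Hypothetical delegation split across two independent providers.
delegation = [
    "ns1.provider-a.example",
    "ns2.provider-a.example",
    "ns1.provider-b.example.",
]
remaining = drop_provider(delegation, "provider-b.example")
```

The guard reflects why two independent providers matter: with only one provider, there would be nothing left to fall back to.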

Robert Rawlins
Resolved

All the signs show this issue has been resolved, and we no longer see intermittent privacy errors. We are still working on the post-incident report, which may take some back-and-forth with the network team before we post again.

We must work out the root cause and prevent a repeat of today. The fact that only a relatively small number of sessions were affected doesn't change how sorry we are to everyone who was impacted.

Robin Geall
Recovering

We have seen an increase in successful requests, and monitoring shows improvements. We will mark this notice as recovering while continuing to monitor and coordinate with our edge provider on further identifying the root cause.

Once again, thank you for your continued patience. If you or any subscribers experience further issues, please do not hesitate to contact support.

Nic Coates
Updated

We continue to investigate with our edge provider and have identified a potential DNS routing issue with specific DNS resolvers. This issue is resulting in a small number of requests intermittently bypassing the cache. However, a large percentage of requests are working and being routed via the edge correctly.

Once again, we thank you for your continued patience, and we will endeavour to update this notice when we have new information.

Nic Coates
Updated

Very sorry to say that we are not quite out of the woods with this. We are successfully serving hundreds of requests a second; however, a small number of requests are routing incorrectly. We are all hands on deck together with our edge provider. More updates to follow.

Robin Geall
Identified

We have identified a possible fix for what's causing the intermittent errors, and it's being deployed now. Thanks for waiting while we work on this one.

Robin Geall
Updated

We continue to investigate, but monitoring has shown no further intermittent errors for the past 20 minutes. Once again, we apologise for any inconvenience caused and appreciate your patience as we work to resolve this issue promptly. Stay tuned for further updates.

Nic Coates
Investigating

We have been experiencing intermittent SSL privacy errors on our status pages over the past few minutes. Our team is actively investigating the issue and has engaged our edge provider for assistance.

We apologise for any inconvenience caused and appreciate your patience as we work to resolve this issue promptly. Stay tuned for further updates.

Nic Coates
Began at:

Affected components
  • Status Pages
  • Management UI