Resolved

Post Incident Review

At 19:12 UTC on April 12th, our monitoring spotted several failures when loading our Management UI. The on-duty team member was alerted and on the scene within a few minutes.

The Management UI was intermittently unavailable, but Status Pages and the API remained unaffected.

After the initial assessment, the issue was identified as timeouts when connecting to our Postgres database, and the level-2 on-call team member was paged for assistance.

By 19:32 UTC the cavalry had arrived; however, the application was once again accessible, the initial issue having resolved itself.

Understanding What Happened

While monitoring to ensure the application remained stable, the team began their initial investigations and reached out to our database provider for additional help in diagnosing what had happened.

The response from our provider suggested that our Postgres instance was experiencing abnormal load, which was consuming all the CPU resources and causing connections to hang.
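
For illustration, connections hanging in this way are often visible in Postgres' pg_stat_activity view. The sketch below is a generic diagnostic rather than a record of the checks run during this incident, and the connection details are placeholders.

    import psycopg2

    # Generic diagnostic: list non-idle backends, longest-running first, so hung or
    # slow connections stand out. Connection details are placeholders.
    conn = psycopg2.connect("dbname=app host=db.example.internal user=readonly")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT pid,
                   state,
                   wait_event_type,
                   now() - query_start AS runtime,
                   left(query, 80)     AS query
            FROM pg_stat_activity
            WHERE state <> 'idle'
            ORDER BY runtime DESC NULLS LAST;
        """)
        for row in cur.fetchall():
            print(row)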

To properly understand what caused this, more detailed monitoring would be required on the database. This, and responding to any insights it offered, would be the path forward.

Improvements Made

Better Postgres Monitoring

We have added two layers of additional monitoring to our Postgres database servers: the first helps us track the CPU and query load being placed on the server, while the second helps us identify specific queries and database configuration settings that may be negatively impacting performance.

This new monitoring immediately led us to a number of queries related to sending notifications that were placing undue load on the server and impacting performance across the entire application.
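
To give a flavour of the kind of query-level insight this monitoring provides, the sketch below ranks statements by cumulative execution time using Postgres' standard pg_stat_statements extension. It is an illustration only: it assumes the extension is enabled and uses placeholder connection details, rather than describing our actual tooling.

    import psycopg2

    # Rank statements by cumulative execution time (pg_stat_statements must be enabled).
    # Column names are those used in PostgreSQL 13+; older versions use total_time/mean_time.
    conn = psycopg2.connect("dbname=app host=db.example.internal user=readonly")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT left(query, 80) AS query,
                   calls,
                   round(total_exec_time::numeric, 1) AS total_ms,
                   round(mean_exec_time::numeric, 2)  AS mean_ms
            FROM pg_stat_statements
            ORDER BY total_exec_time DESC
            LIMIT 10;
        """)
        for row in cur.fetchall():
            print(row)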

Improving Expensive Queries

The queries identified by the monitoring have been rigorously improved: some were removed altogether or consolidated, while others were rewritten to be more efficient.

We also improved the indexing on some of the key database tables to further optimize performance.
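
As an illustration of this kind of indexing change, the sketch below adds a composite index without blocking writes by using CREATE INDEX CONCURRENTLY. The notifications table and column names here are hypothetical stand-ins for the real schema.

    import psycopg2

    # Illustrative only: table and column names are hypothetical stand-ins.
    conn = psycopg2.connect("dbname=app host=db.example.internal user=migrator")
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block
    with conn.cursor() as cur:
        cur.execute("""
            CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_notifications_subscriber_created
            ON notifications (subscriber_id, created_at);
        """)
    conn.close()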

As a result of these changes, we are no longer seeing any spikes in load on the database servers, and we have also seen improved application response times across the board.

Robert Rawlins
Resolved

We've monitored the Management UI and are confident it is now stable. We are conducting a root cause analysis as part of our post-incident process. Once again, we appreciate your understanding during this incident.

Nic Coates
Recovering

Between 19:10 UTC and 19:23 UTC, our Sorry™ Management UI encountered a temporary outage, returning a 503 error due to a database connection issue. Our on-call engineer promptly addressed the issue by scaling down and back up our nodes, restoring full functionality.

We apologise for any inconvenience caused and appreciate your patience. Rest assured, we're closely monitoring the situation to ensure stability.

Nic Coates
Began at:

Affected components
  • Management UI