AWS outage in us-east-1

Incident Report for gaiia

Postmortem

  • Incident: The gaiia API was down due to a major AWS outage in the us-east-1 region. Most AWS services were affected, notably Lambda, API Gateway and CloudWatch.
  • Scope: The outage took down all web applications relying on the gaiia API: gaiia users were not able to use the gaiia web application, and end customers were not able to place orders or log in to the client portals.
  • Potential mitigation: No direct mitigation of the AWS issue was possible, but a partial disaster recovery process was initiated in case the outage lasted longer.
  • Resolution: AWS fixed their services.
  • Timeline:
    2023-06-13 14h55 EDT: Issue was first discovered
    2023-06-13 16h44 EDT: Issue was partially resolved
    2023-06-13 18h37 EDT: Issue was fully resolved
  • Time to discovery: ~5 minutes according to AWS timeline
  • Time to full resolution: 3h48
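
The timeline arithmetic above can be checked with a short sketch. It assumes the timeline times are US Eastern local time as observed in June (UTC-4), which lines up with the UTC post times on the updates below; the helper names `at` and `fmt` are illustrative only. Measured from discovery, full resolution comes out to 3h42, so the reported 3h48 would be consistent with counting from the start of the AWS incident a few minutes earlier, though the report does not say so explicitly.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timeline times are US Eastern local time in June (UTC-4).
EASTERN_JUNE = timezone(timedelta(hours=-4))


def at(hour, minute):
    """Build a timezone-aware timestamp on the day of the incident."""
    return datetime(2023, 6, 13, hour, minute, tzinfo=EASTERN_JUNE)


def fmt(delta):
    """Format a timedelta as e.g. '3h42'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}"


discovered = at(14, 55)
partially_resolved = at(16, 44)
fully_resolved = at(18, 37)

# Discovery time expressed in UTC, for comparison with the post timestamps.
print(discovered.astimezone(timezone.utc).strftime("%H:%M UTC"))

# Elapsed time from discovery to each milestone.
print("to partial resolution:", fmt(partially_resolved - discovered))
print("to full resolution:", fmt(fully_resolved - discovered))
```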
Posted Jun 14, 2023 - 19:09 UTC

Resolved

Gaiia has fully recovered, and all accumulated events that were pending during the incident have been processed.
Posted Jun 13, 2023 - 21:57 UTC

Monitoring

We have been able to log back into gaiia and are continuing to monitor the resolution.

Update from AWS:
"We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery."
Posted Jun 13, 2023 - 20:44 UTC

Update

Update from AWS:
"We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors."
Posted Jun 13, 2023 - 20:15 UTC

Update

Update from AWS:
"We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates."
Posted Jun 13, 2023 - 19:38 UTC

Update

The us-east-1 region is completely down. AWS has identified the root cause of the problem, but we are still preparing to launch the disaster recovery process in another region if need be.
Posted Jun 13, 2023 - 19:29 UTC

Identified

Both the AWS Control Plane and Data Plane are down.
Posted Jun 13, 2023 - 19:04 UTC

Update

We are continuing to investigate this issue.
Posted Jun 13, 2023 - 19:03 UTC

Investigating

We are currently investigating this issue.
Posted Jun 13, 2023 - 19:02 UTC
This incident affected: Public GraphQL API.