Incident: The gaiia API was down due to a major AWS outage in the us-east-1 region. Most AWS services were affected, notably Lambda, API Gateway, and CloudWatch.
Scope: The outage took down all web applications that rely on the gaiia API: gaiia users could not use the gaiia web application, and end customers could not place orders or log in to the client portals.
Potential mitigation: No direct mitigation of the AWS issue was possible, but a partial disaster recovery process was initiated in case the outage lasted longer.
Resolution: AWS fixed their services.
Timeline:
2023-06-13 14h55 EDT: Issue was first discovered
2023-06-13 16h44 EDT: Issue was partially resolved
2023-06-13 18h37 EDT: Issue was fully resolved
Time to discovery: ~5 minutes according to AWS timeline
Time to full resolution: 3h48 mins
Posted Jun 14, 2023 - 19:09 UTC
Resolved
Gaiia has fully recovered, and all events that accumulated during the incident have been processed.
Posted Jun 13, 2023 - 21:57 UTC
Monitoring
We have been able to log back into gaiia and are continuing to monitor the resolution.
Update from AWS: "We are beginning to see an improvement in the Lambda function error rates. We are continuing to work towards full recovery."
Posted Jun 13, 2023 - 20:44 UTC
Update
Update from AWS: "We are continuing to work to resolve the error rates invoking Lambda functions. We're also observing elevated errors obtaining temporary credentials from the AWS Security Token Service, and are working in parallel to resolve these errors."
Posted Jun 13, 2023 - 20:15 UTC
Update
Update from AWS: "We are continuing to experience increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. We have identified the root cause as an issue with AWS Lambda, and are actively working toward resolution. For customers attempting to access the AWS Management Console, we recommend using a region-specific endpoint (such as: https://us-west-2.console.aws.amazon.com). We are actively working on full mitigation and will continue to provide regular updates."
Posted Jun 13, 2023 - 19:38 UTC
Update
The us-east-1 region is completely down. AWS has identified the root cause of the problem, but we are still preparing to launch the disaster recovery process in another region if needed.
Posted Jun 13, 2023 - 19:29 UTC
Identified
Both the AWS Control Plane and Data Plane are down.