LEXI Disruption

Incident Report for Ai-Media

Postmortem

LEXI Disruption Postmortem: September 29, 2025, 21:56:04 - 22:00:23 UTC (4m 19s)

Summary: A 4-minute and 19-second service disruption affected LEXI and Translate jobs due to an automatic update of a critical backend service. New jobs failed to start, while running jobs experienced significant caption delays that self-resolved after the disruption ended.

Impact

  • New Jobs: Unable to connect to backend service during startup, resulting in startup failures. Jobs self-resolved when the disruption ended.
  • Running Jobs: Experienced significant caption delays during the disruption window. Delayed captions were output in a batch after service restoration, with read-speed functionality automatically increasing output speed to compensate before returning to normal operation.
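
As a rough illustration of the catch-up behaviour described for running jobs, the sketch below estimates how long a job would need to return to real time after a delay. It is a simplified model and not a description of how LEXI's read-speed functionality works internally: the function name `catch_up_seconds` and the speed-up factor of 1.5 are assumptions chosen for the example, and only the 259-second delay comes from this report.

```python
def catch_up_seconds(backlog_s: float, speedup: float) -> float:
    """Estimate the time needed to clear a caption backlog.

    backlog_s: seconds of caption content that were delayed.
    speedup:   output rate relative to real time (hypothetical value;
               the actual read-speed factor is not stated in the report).

    While catching up, new speech keeps arriving at 1x, so the backlog
    only shrinks at (speedup - 1) seconds of content per second.
    """
    if speedup <= 1.0:
        raise ValueError("speedup must exceed 1.0 for the backlog to shrink")
    return backlog_s / (speedup - 1.0)


# 4m 19s of delayed captions, drained at an assumed 1.5x output rate.
estimate = catch_up_seconds(259, 1.5)
```

Under these assumptions the estimate is 518 seconds, which shows why a short disruption can be followed by a noticeably longer period of accelerated output.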

Root Cause: An automatic update of a critical backend service during the maintenance window caused the temporary unavailability of a dependency required for LEXI/Translate job initialization and real-time caption processing.

Timeline

  • 21:56:04 UTC - Disruption begins with backend service update
  • 21:56:04 - 22:00:23 UTC - New jobs fail to start; running jobs experience delays
  • 22:00:23 UTC - Backend service update completes; service functionality restored
  • Post-incident - Investigation and remediation planning initiated
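
The stated duration can be checked directly from the two timestamps in the timeline. The short sketch below does that arithmetic; the helper name `format_duration` is illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Start and end of the disruption, taken from the timeline (UTC).
start = datetime(2025, 9, 29, 21, 56, 4, tzinfo=timezone.utc)
end = datetime(2025, 9, 29, 22, 0, 23, tzinfo=timezone.utc)


def format_duration(delta: timedelta) -> str:
    """Render a timedelta as whole minutes and seconds."""
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m {seconds}s"


duration = end - start
duration_text = format_duration(duration)  # "4m 19s"
```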

Resolution: Service was automatically restored once the backend service update was completed. No manual intervention was required.

Action Items Completed

  • Modified service update window to minimize user impact
  • Improved fallback handling mechanisms for backend service dependencies
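
This report does not say which fallback mechanism was implemented. One common pattern for tolerating a brief dependency outage at job startup is to retry the connection with exponential backoff, sketched below. Every name here (`connect_with_retry`, the `connect` callable, and the delay and deadline values) is hypothetical and chosen for illustration.

```python
import time


def connect_with_retry(connect, max_wait_s=600.0, base_delay_s=1.0,
                       max_delay_s=30.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_wait_s has elapsed.

    connect: hypothetical zero-argument callable that raises
             ConnectionError while the dependency is unavailable.
    sleep:   injectable so the wait can be replaced in tests.
    """
    deadline = time.monotonic() + max_wait_s
    delay = base_delay_s
    while True:
        try:
            return connect()
        except ConnectionError:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                # Out of time: surface the original failure to the caller.
                raise
            sleep(min(delay, remaining))
            # Double the wait after each failure, up to max_delay_s.
            delay = min(delay * 2, max_delay_s)
```

With a deadline longer than the dependency's update window, a job started during an outage like this one would wait and then connect rather than fail at startup. Production implementations usually also add random jitter to the delay so that many jobs do not retry at the same instant.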

Action Items In Progress

  • Implementing additional redundancy measures to strengthen service resilience during dependency updates
Posted Oct 07, 2025 - 23:23 UTC

Resolved

We investigated a brief service disruption that affected LEXI and Translate jobs on September 29th (21:56-22:00 UTC). The disruption lasted approximately 4 minutes and 19 seconds and was related to an automatic update of an AWS service performed during that time.
Users experienced different impacts depending on when their jobs were initiated:
  • New LEXI/Translate jobs started during this window were unable to connect to the AWS service during startup, resulting in startup failures. These jobs self-resolved once the disruption ended.
  • LEXI/Translate jobs already running before the disruption experienced significant caption delays. Captions produced during the disruption were output in a large batch after service restoration, with read-speed functionality automatically increasing output speed to compensate before returning to normal real-time operation.
Posted Sep 29, 2025 - 22:00 UTC