LEXI Disruption Postmortem: September 29, 2025, 21:56:04 - 22:00:23 UTC (4m 19s)
Summary: A 4-minute and 19-second service disruption affected LEXI and Translate jobs due to an automatic update of a critical backend service. New jobs failed to start, while running jobs experienced significant caption delays that self-resolved after the disruption ended.
Impact
- New Jobs: Unable to connect to backend service during startup, resulting in startup failures. Jobs self-resolved when the disruption ended.
- Running Jobs: Experienced significant caption delays during the disruption window. Delayed captions were output in batch after service restoration, with read-speed functionality automatically compensating before returning to normal operation.
Root Cause: An automatic update of a critical backend service during the maintenance window caused the temporary unavailability of a dependency required for LEXI/Translate job initialization and real-time caption processing.
Timeline
- 21:56:04 UTC - Disruption begins with backend service update
- 21:56:04 - 22:00:23 UTC - New jobs fail to start; running jobs experience delays
- 22:00:23 UTC - Backend service update completes; service functionality restored
- Post-incident - Investigation and remediation planning initiated
Resolution Service was automatically restored once the backend service update was completed. No manual intervention was required.
Action Items Completed
- Modified service update window to minimize user impact
- Improved fallback handling mechanisms for backend service dependencies
Action Items In Progress
- Implementing additional redundancy measures to strengthen service resilience during dependency updates