Date/Time: September 18th, 2025, 15:30-16:00 UTC (11:30-12:00 EDT)
Duration: ~30 minutes
Impact: Fewer than 10 LEXI jobs experienced dropped output or output delayed by 1-5 minutes
Root Cause: Missing CPU resource limits on infrastructure components caused node instability
Timeline
- 15:30 UTC: Service disruption begins affecting LEXI jobs
- 15:30-16:00 UTC: Users experience intermittent dropped/delayed output
- 15:45 UTC: Issue detected via monitoring alerts showing CPU spikes on compute nodes
- 16:00 UTC: Service automatically recovered
- 16:00-22:00 UTC: Investigation identified resource limit configuration issue on [specific components]
- 22:00 UTC (6:00 PM EDT): CPU resource limits applied to affected infrastructure components
Root Cause Analysis
Infrastructure components running without CPU resource limits consumed excessive CPU, causing:
- CPU spikes on compute nodes
- Node instability
- Disruption to LEXI job processing
This same root cause likely explains the similar incident on September 8th.
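To help catch the same gap before it recurs, the missing-limit condition can be checked with a short audit script. The following is a minimal sketch only: it assumes component specs are available as dictionaries with a Kubernetes-style `resources.limits.cpu` field, which this report does not confirm. The component names and the `find_missing_cpu_limits` helper are hypothetical.

```python
from typing import Any


def find_missing_cpu_limits(components: list[dict[str, Any]]) -> list[str]:
    """Return the names of components whose spec declares no CPU limit.

    Assumes each component is a dict with a "name" key and an optional
    Kubernetes-style "resources" -> "limits" -> "cpu" entry (an assumption,
    not the actual format used by the affected infrastructure).
    """
    missing = []
    for component in components:
        resources = component.get("resources") or {}
        limits = resources.get("limits") or {}
        # A CPU request alone does not cap usage; only a limit does.
        if not limits.get("cpu"):
            missing.append(component["name"])
    return missing


# Hypothetical inventory for illustration only.
components = [
    {"name": "log-forwarder", "resources": {"requests": {"cpu": "100m"}}},
    {"name": "metrics-agent", "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}}},
    {"name": "node-helper", "resources": {}},
]

print(find_missing_cpu_limits(components))
```

A check like this could run in CI or as a periodic job so that a component deployed without a CPU limit is flagged before it can starve a compute node.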