LEXI Disruption

Incident Report for Ai-Media

Postmortem

Date/Time: September 18th, 2025, 15:30-16:00 UTC (11:30-12:00 EDT)
Duration: ~30 minutes
Impact: Fewer than 10 LEXI jobs experienced dropped or delayed output, with each interruption lasting roughly 1-5 minutes
Root Cause: Missing CPU resource limits on infrastructure components caused node instability

Timeline

  • 15:30 UTC: Service disruption begins affecting LEXI jobs
  • 15:30-16:00 UTC: Users experience intermittent dropped/delayed output
  • 15:45 UTC: Issue detected via monitoring alerts showing CPU spikes on compute nodes
  • 16:00 UTC: Service automatically recovered
  • 16:00-22:00 UTC: Investigation identified missing CPU resource limits on infrastructure components
  • 22:00 UTC (6:00 PM EDT): CPU resource limits applied to affected infrastructure components

Root Cause Analysis

Infrastructure components running without proper CPU resource limits consumed excessive resources, causing:

  1. CPU spikes on compute nodes
  2. Node instability
  3. Disruption to LEXI job processing

This same root cause likely explains the similar incident on September 8th.
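
The kind of configuration gap described above can be caught with a simple audit. The following is a minimal sketch only, assuming a hypothetical inventory of components with an optional CPU limit in millicores; the component names, the `Component` type, and the `missing_cpu_limits` helper are illustrative and are not taken from the actual LEXI infrastructure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """Hypothetical record for one infrastructure component."""
    name: str
    # CPU limit in millicores; None means no limit was configured.
    cpu_limit_millicores: Optional[int]


def missing_cpu_limits(components: List[Component]) -> List[str]:
    """Return names of components with no usable CPU limit.

    A limit of None or a non-positive value is treated as missing,
    since neither bounds how much CPU the component can consume.
    """
    return [
        c.name
        for c in components
        if c.cpu_limit_millicores is None or c.cpu_limit_millicores <= 0
    ]


# Illustrative inventory (all names are assumptions).
inventory = [
    Component("log-shipper", None),
    Component("metrics-agent", 500),
    Component("ingest-proxy", 0),
]

unbounded = missing_cpu_limits(inventory)
print("Components without CPU limits:", unbounded)
```

Running a check like this before deployment would flag any component able to consume CPU without bound on a shared compute node.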

Posted Oct 06, 2025 - 21:01 UTC

Resolved

We investigated a brief service disruption that affected 10 or fewer LEXI jobs on September 18th (~15:30-16:00 UTC). Users experienced dropped or delayed output lasting between ~1 minute and ~5 minutes. Drops in output resolved without user intervention. We identified a potential root cause that may also explain the similar incident on September 8th. No further instances have been detected since then.
Posted Sep 18, 2025 - 17:30 UTC