On May 6, beginning around 17:01 UTC (13:01 EDT) and lasting approximately 2 hours, the Lexi service was unable to start any new sessions. This affected all users of the service, whether they were managing Lexi from an encoder or in EEGCloud via the Lexi UI or scheduler. Lexi sessions that were already running when the incident began were unaffected and continued to run normally.
Our engineering team traced the cause of this incident to the Lexi production message broker, which appears to have exceeded a memory threshold. This put the broker into an alarm state in which it blocked all clients attempting to send messages, indefinitely. As a result, EEGCloud was unable to communicate with the Lexi (and Translate) backends, and all new jobs were stuck in either the “CREATED” or “TERMINATING” state.
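As a rough illustration of the failure mode described above, the sketch below classifies a broker's memory use against its configured threshold so that a warning can be raised before the alarm state is reached. This report does not name the broker software or state its actual limits, so every name here (`MemorySample`, `classify`, `warn_ratio`) and the 80% warning level are hypothetical and not taken from the Lexi production setup.

```python
from dataclasses import dataclass
from enum import Enum


class BrokerState(Enum):
    OK = "ok"
    WARNING = "warning"  # approaching the threshold; a good time to alert
    ALARM = "alarm"      # at or above the threshold; publishers would be blocked


@dataclass(frozen=True)
class MemorySample:
    """One memory reading from the broker (hypothetical shape)."""
    used_bytes: int
    limit_bytes: int  # the broker's configured memory threshold


def classify(sample: MemorySample, warn_ratio: float = 0.8) -> BrokerState:
    """Classify a memory reading relative to the broker's threshold.

    warn_ratio is an assumed early-warning level, not a documented setting.
    """
    if sample.limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    ratio = sample.used_bytes / sample.limit_bytes
    if ratio >= 1.0:
        return BrokerState.ALARM
    if ratio >= warn_ratio:
        return BrokerState.WARNING
    return BrokerState.OK
```

A monitoring job could call `classify` on each reading and alert on `WARNING`, leaving time to intervene before clients are blocked.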
Once engineering was able to identify the cause, they restarted the message broker service, which resolved the memory issue. This message broker had been running for almost a full year prior to this restart with no previous issues reported.
We are exploring several immediate and longer-term steps to prevent an incident like this from disrupting Lexi in the future, including:
Ai-Media understands the importance of a reliable cloud service for closed captioning, and we apologize for the disruption this incident caused to EEGCloud users. If you have any follow-up questions about this incident, please submit a ticket to eeg.support@ai-media.tv with the subject line “May 6 Lexi Outage”.