Errors connecting to Lexi - May 6 2024
Incident Report for Ai-Media
Postmortem

What happened?

On May 6, beginning around 17:01 UTC (13:01 EDT) and lasting approximately 2 hours, the Lexi service was unable to start any new sessions. This affected all users of the service, whether they managed Lexi from an encoder or in EEGCloud via the Lexi UI or scheduler. Lexi sessions that were already running when the incident began were unaffected and continued to run normally.

Our engineering team traced the cause of this incident to the Lexi production message broker, which appears to have exceeded a memory threshold. This put the broker into an alarm state in which it blocked all clients attempting to send messages, indefinitely. EEGCloud was thus unable to communicate with the Lexi (and Translate) backends, and all new jobs were stuck in a state of either “CREATED” or “TERMINATING”.
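The symptom described above, jobs sitting in “CREATED” or “TERMINATING” without progressing, can be detected directly. The following is a minimal Python sketch, not the EEGCloud implementation: the Job record, its field names, the find_stuck_jobs helper, and the two-minute limit are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# States in which jobs were reported stuck during the incident.
TRANSITIONAL_STATES = {"CREATED", "TERMINATING"}


@dataclass
class Job:
    """Hypothetical job record; the field names are assumptions."""
    job_id: str
    state: str
    state_since: datetime  # when the job entered its current state


def find_stuck_jobs(jobs, now, max_age=timedelta(minutes=2)):
    """Return jobs that have sat in a transitional state longer than max_age."""
    return [
        job for job in jobs
        if job.state in TRANSITIONAL_STATES and now - job.state_since > max_age
    ]


if __name__ == "__main__":
    now = datetime(2024, 5, 6, 17, 10, tzinfo=timezone.utc)
    jobs = [
        Job("job-1", "CREATED", now - timedelta(minutes=9)),
        Job("job-2", "TERMINATING", now - timedelta(seconds=30)),
    ]
    for job in find_stuck_jobs(jobs, now):
        print(f"{job.job_id} stuck in {job.state}")
```

A sensible max_age depends on how long a healthy job normally spends in each state, which this report does not specify.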

Once engineering was able to identify the cause, they restarted the message broker service, which resolved the memory issue. This message broker had been running for almost a full year prior to this restart with no previous issues reported.

What are we doing about it?

We are exploring several steps, both immediate and longer term, to prevent an incident like this from disrupting Lexi in the future, including:

  • Better alarming on the message broker service
  • Generating an alarm if the number of Lexi jobs started successfully over a given period drops below a particular threshold
  • Updating the message broker software version & underlying OS
  • Periodic memory usage checks on the message broker
  • Implementing a backup to prevent a single point of failure
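Two of the items above, the job-start alarm and the periodic memory check, can be sketched briefly. This is an illustrative Python sketch under stated assumptions, not the planned implementation: JobStartAlarm, memory_status, the window length, the minimum start count, and the 80% warning ratio are all hypothetical, and how memory usage is actually read from the broker is left out because the report does not name the broker software.

```python
from collections import deque


class JobStartAlarm:
    """Fires when successful job starts in a sliding window fall below a floor."""

    def __init__(self, window_seconds, min_starts):
        self.window_seconds = window_seconds
        self.min_starts = min_starts
        self._starts = deque()  # timestamps (in seconds) of successful starts

    def record_start(self, timestamp):
        """Record one successful job start."""
        self._starts.append(timestamp)

    def is_firing(self, now):
        """Return True if fewer than min_starts occurred within the window."""
        cutoff = now - self.window_seconds
        # Discard starts that have aged out of the window.
        while self._starts and self._starts[0] <= cutoff:
            self._starts.popleft()
        return len(self._starts) < self.min_starts


def memory_status(used_bytes, limit_bytes, warn_ratio=0.8):
    """Classify broker memory use relative to its blocking threshold.

    warn_ratio is an assumed early-warning level below the threshold.
    """
    ratio = used_bytes / limit_bytes
    if ratio >= 1.0:
        return "blocked"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

A fixed floor on job starts would also fire during legitimately quiet periods, so in practice the threshold would likely need to vary with expected load.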

Ai-Media understands the importance of a reliable cloud service for closed captioning, and we apologize for the disruption this incident caused to EEGCloud users. If you have any follow-up questions about this incident, please submit a ticket to eeg.support@ai-media.tv with the subject line “May 6 Lexi Outage”.

Posted May 07, 2024 - 15:23 UTC

Resolved
Lexi operations have returned to normal and all metrics are looking healthy. We believe this to be resolved.

Any users experiencing further issues with Lexi should contact our Support team at eeg.support@ai-media.tv.
Posted May 06, 2024 - 20:09 UTC
Monitoring
We traced this incident to a memory issue with the Lexi message broker and have instituted a fix; Lexi sessions are starting normally again. We will continue to monitor to ensure this is fully resolved.
Posted May 06, 2024 - 19:25 UTC
Update
We are continuing to investigate this issue.
Posted May 06, 2024 - 18:57 UTC
Update
Our engineering team is actively engaged and is still investigating this issue. Nothing new to report as yet, but we will continue to provide regular updates here as we work towards a resolution.
Posted May 06, 2024 - 18:49 UTC
Investigating
We have received scattered reports of encoders being unable to connect to the Lexi service. We are investigating and will update here once more information is available.
Posted May 06, 2024 - 18:12 UTC
This incident affected: Lexi.