Degraded Performance of Phrase TMS (EU) Term base and CAT web editor Components between 10:02 AM and 05:14 PM CEST
Incident Report for Phrase
Postmortem

Introduction

We would like to share more details about the events that occurred with Phrase between 10:02 AM and 05:14 PM CEST on April 19, 2024, which led to degraded performance of the CAT Web Editor and Termbase services, and about what Phrase engineers are doing to prevent these issues from recurring.

Timeline

  • 10:02 AM: Engineers start investigating the performance check failures of the Termbase component.
  • 11:16 AM: Engineers identify an increased load on the datastore of the Termbase component and decide to scale it up by a factor of two.
  • 12:41 PM: Scaling of the datastore of the Termbase component is in progress.
  • 1:41 PM: The datastore is unable to serve requests as a result of the upgrade in progress. A key part of the datastore is low on memory.
  • 1:54 PM: Engineers get together on a call and decide to stop the datastore and perform a full restart with upscaled components.
  • 2:50 PM: The datastore is fully restarted and starts serving requests from customers. The tasks that were not processed during the downtime are now processed in bulk.
  • 3:50 PM: The datastore is again fully operational.
  • 4:05 PM: An inconsistency is noticed on some of the nodes forming the datastore, so it is decided that they should be restarted with updated settings.
  • 4:34 PM: The restarted node has issues, causing a smaller-scale recurrence of the earlier issue. It is decided to keep the state as is, allowing the datastore to settle down.
  • 5:14 PM: The datastore is fully operational again at this point.

Root Cause

The datastore of the Termbase component was hit by an increased load from a newly introduced check that is intended to alert engineers about suboptimal runtime characteristics. At the same time, the key components of the datastore did not have enough memory to start properly given the size of the environment.

Actions to Prevent Recurrence

The abovementioned check was rewritten to be much more lightweight, and the nodes running the datastore were provided with twice as much memory as before.

Conclusion

Firstly, we want to apologize. We know how critical our services are to your business. Phrase as a whole will do everything to learn from this incident and use it to drive improvements across our services. As with any significant operational issue, Phrase engineers will be working tirelessly over the coming days and weeks to improve their understanding of the incident and determine what changes will improve our services and processes.

Posted Apr 29, 2024 - 07:35 CEST

Resolved
This incident has been resolved.
Posted Apr 19, 2024 - 17:29 CEST
Monitoring
Our engineers were able to find the root cause and apply a fix. We're monitoring the results and things are getting back to normal.
Posted Apr 19, 2024 - 17:19 CEST
Update
Our engineers were able to find the root cause and apply a fix. We're monitoring the results and things are getting back to normal.
Posted Apr 19, 2024 - 17:18 CEST
Investigating
We are currently investigating an issue with degraded performance of the Term base component.
Posted Apr 19, 2024 - 16:43 CEST
Monitoring
A fix has been implemented. We're monitoring the results and things are getting back to normal.
Posted Apr 19, 2024 - 15:32 CEST
Update
The engineering team implemented a partial fix and the situation should be better. We are still working towards implementing a full fix.
Posted Apr 19, 2024 - 14:30 CEST
Update
We are continuing to work on a fix for this issue.
Posted Apr 19, 2024 - 14:06 CEST
Identified
The root cause of the issue has been identified and we are currently working on a fix.
Posted Apr 19, 2024 - 13:32 CEST
Investigating
Some of our users reported intermittent issues with QA checks. Our engineering team is trying to identify the root cause.
Posted Apr 19, 2024 - 13:24 CEST
This incident affected: Phrase TMS (EU) (CAT web editor, Term base).