Failure with Item service

Resolved 2024-01-11 09:46 EST

This issue is resolved.

A postmortem will be published and shared in the next several days.

Monitoring 2024-01-08 10:15 EST

Update: Error rates are now down to 0. Orders will continue to be processed, but events will be delayed because of the large backlog of workflows.

We believe this issue has been contained and will continue to observe the situation.

High CPU usage is still being observed, but it is not expected to impact workflow executions. We are monitoring this closely.

Identified 2024-01-08 09:24 EST

We are restarting Item Service. To handle the requests, a small number of healthy instances will be scaled up. This will cause a window of approximately 40 minutes of increased downtime.

Order injection from the websites to the MCP Order Workflow will remain unaffected. There will be some delays with orders reaching manufacturing.

Order processing errors experienced by fulfillment systems will be retried. All customers of MCP Shipping are experiencing errors when shipping orders.

Identified 2024-01-08 08:49 EST

We are currently working with AWS to get the instances healthy by manually terminating dead instances. We believe the situation should improve once their health is restored.
There is still no update on the ETA.
Impacted services: There is no major impact on new orders. Services dependent on Item Service will function in a degraded state.
Printdeal is receiving 500 errors for a few orders, and Venlo is experiencing issues with shipments.

Identified 2024-01-08 07:10 EST

Update: The team is attempting to downgrade the degraded instances and add new ones to the cluster.
There is no update on the ETA yet.
We will share the next update in 30 minutes or as soon as there is news, whichever is sooner.

Identified 2024-01-08 06:24 EST

Update: There is no new update on the incident. The pipeline deployment is still failing, and we are working with the AWS support team to fix this.

Identified 2024-01-08 06:22 EST

Update: There is no new update on the incident. The pipeline deployment is still failing, and we are working with the AWS support team to fix this.

Identified 2024-01-08 05:25 EST

Update: The manual deployment failed because of a degraded instance. We have escalated to AWS for help recovering the degraded instances, which should allow us to redeploy the pipeline.

Identified 2024-01-08 04:57 EST

Update: The pipeline running on production went into an error state, so the team will attempt to manually push the fix to the pipeline.

Resolution in progress 2024-01-08 04:37 EST

Update: The PRD pipeline will start in about 5 minutes. The PRD pipeline will be slow.

Identified 2024-01-08 04:11 EST

The team has observed a call failure rate of around 2% with the Item Service API, which might also impact subsidiary calls to this service.
Impacted URL: https://item-service.commerce.cimpress.io