Failure with Item service

Resolved Jan 11, 2024 9:46 AM EST

This issue is resolved.

A postmortem will be published and shared in the next several days.

Monitoring Jan 8, 2024 10:15 AM EST

Update: Error rates are now down to 0. Orders will continue to be processed, but with delays, because of a large backlog of workflows.

We believe this issue has been contained and will continue to observe the situation.

High CPU usage is still being observed, but it should not impact workflow executions. We are monitoring this closely.

Identified Jan 8, 2024 9:24 AM EST

We are restarting Item Service. A small number of healthy instances will be scaled up to handle requests. This will cause a window of approximately 40 minutes of increased downtime.

Order injection from the websites to the MCP Order Workflow will remain unaffected. There will be some delays with orders reaching manufacturing.

Order processing errors experienced by fulfillment systems will be retried. All customers of MCP Shipping are experiencing errors when shipping orders.

Identified Jan 8, 2024 8:49 AM EST

We are currently working with AWS to get the instances healthy by manually terminating dead instances. We believe that once instance health is restored, the situation should improve.
There is no ETA yet.
Impacted services: There is no major impact on new orders. Services that depend on Item Service will run in a degraded state.
Printdeal is receiving 500 errors for a few orders, and Venlo is facing issues with shipments.

Identified Jan 8, 2024 7:10 AM EST

Update: The team is attempting to downgrade the degraded instances and add new ones to the cluster.
There is no update on the ETA yet.
We will share the next update in 30 minutes or as soon as there is news, whichever is sooner.

Identified Jan 8, 2024 6:24 AM EST

Update: There is no new update on the incident. Pipeline deployment is still failing; we are working with the AWS support team to fix this.

Identified Jan 8, 2024 6:22 AM EST

Update: There is no new update on the incident. Pipeline deployment is still failing; we are working with the AWS support team to fix this.

Identified Jan 8, 2024 5:25 AM EST

Update: The current manual deployment failed because of a degraded instance, so we have escalated to AWS for help recovering the degraded instances, which should allow us to redeploy the pipeline.

Identified Jan 8, 2024 4:57 AM EST

Update: The pipeline running on production went into an error state, so the team will attempt to manually push the fix to the pipeline.

Investigating Jan 8, 2024 4:37 AM EST

Update: The PRD pipeline will start in about 5 minutes. The PRD pipeline will be slow.

Identified Jan 8, 2024 4:11 AM EST

The team has observed a failure rate of around 2% on calls to the Item service API, which might also impact dependent calls made to this service.
Impacted URL: https://item-service.commerce.cimpress.io