Vista

Orders stuck in Routing step

Resolved

Incident Resolved: CTIM-1117
All but 32 of the stuck orders have been processed successfully; the remaining 32 are in the auto-retry process. Product Operations will follow up to ensure the remaining orders pass through. As everything is stable, we are now marking this incident as contained.
Please contact CTIM or see the ticket for more details.

Updated

UPDATE:

The stuck order count has now dropped to 37, and these orders are currently in the auto-retry process. Next update in 30 minutes.

Updated

UPDATE:

Dispatch is rerunning the stuck items and expects the count to drop to 0 within a few minutes. The next update will be shared in 30 minutes, or sooner once all orders have passed through successfully.

Updated

UPDATE:

Order flow has stabilized, with 1,166 orders still stuck. Dispatch is preparing to rerun routing on them. IDP failures are still ongoing, and both the Dispatch and Firecamp teams are addressing them with high priority. The next update will be provided in 30 minutes or sooner.

Updated

UPDATE:

Dispatch shared that the pod logs show ongoing issues and highlighted that a New Relic alert monitoring ship date errors has been triggering consistently over the past 3 hours. The alert is caused by IDP pods being unable to make outbound requests, which affects the ship date process. Normally, rerunning the most recent deployment and restarting the pods resolves the issue, but this time it hasn’t worked and new pods are encountering the same problem. Both Dispatch and Firecamp are investigating. The next update will be provided in 30 minutes or sooner.

Updated

UPDATE:
Escalation sent to: PCD: Dispatch Squad, PCD: Firecamp Squad

The Dispatch squad has requested support from the Firecamp team, as the IDP pods are unable to connect to the network and rerunning deployments no longer resolves the issue as it previously did. The Firecamp team is now online and investigating the problem.

Updated

UPDATE:

Dispatch discovered that the underlying API is experiencing issues and that an IDP problem, which they believed had been resolved two hours ago, is causing orders to get stuck in routing. The Dispatch team is actively investigating the issue, which may affect all products.

Updated

UPDATE:

Dispatch discovered that the underlying API is experiencing issues and that an IDP problem, which they believed had been resolved two hours ago, is causing orders to get stuck in routing. The team is actively investigating the issue.

Updated

UPDATE:

The Dispatch squad is investigating the stuck orders because some of them are not visible to the Product Operations team for retrying. The next update will be shared when we hear from them.

Updated

UPDATE:

There are still around 900 orders suspended and 800 stuck in processing within routing. The Product Operations team is actively working on clearing them by retrying.

Updated

UPDATE:

Following the fix deployment, Viper order flow has resumed. The Access team will continue monitoring performance, and the incident will be considered contained once the backlog is fully cleared.

Updated

UPDATE:

The backlog of orders stuck in routing has dropped from 2.6K to 2.4K, and the team is continuing to monitor the situation. The numbers are expected to keep improving.

Updated

UPDATE:

The Access team has deployed a fix across all impacted regions and is currently monitoring performance. They will provide another update shortly.

Updated

UPDATE:

Order flow has resumed in MCP. Product Operations retried some orders, which successfully passed routing. However, there are still over 2.6K orders in routing. We are currently awaiting updates from the Access Domain regarding the deployment of the fix.

Updated

UPDATE:

The Access Domain has identified the root cause and is deploying a fix region by region. We’ll provide updates here once the fix is in place.

Updated

UPDATE:

The Access Domain joined the Slack channel and reported that the OCI (oauth.cimpress.io) service is currently down in the eu-west-1 and eu-central regions. The Access team is investigating the root cause and working to implement a fix as soon as possible. We’ll provide an update shortly.

Updated

UPDATE:

Product Operations has joined the Slack channel and is looking into the issue. We are awaiting a response from the Access Domain. Next update in 30 minutes or less.

Investigating

New Incident: CTIM-1117
Priority: Critical
Escalation sent to: PCD: Product Operations, Access Domain for review.
Orders are currently stuck in the Routing step, and the number is increasing.