Staged Update Run TSG
This guide provides troubleshooting steps for common issues related to both cluster-scoped and namespace-scoped Staged Update Runs.
Note: To get more information about failures, you can check the updateRun controller logs.
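For example, a sketch assuming the updateRun controller runs inside the hub-agent deployment in the fleet-system namespace (adjust both names to your installation):

# Stream hub agent logs and filter for updateRun reconciliations
# (deployment and namespace names are assumptions)
kubectl logs deployment/hub-agent -n fleet-system --tail=1000 | grep -i updaterun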
Cluster-Scoped Troubleshooting
CRP status without Staged Update Run
When a ClusterResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately.
A sample status of such ClusterResourcePlacement is as follows:
$ kubectl describe crp example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Selected Resources:
...
Events: <none>
The SchedulingPolicyFulfilled condition indicates the CRP has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started.
The Placement Statuses section displays the detailed status of each cluster. Both selected clusters are in the Scheduled state, but their RolloutStarted conditions are still Unknown because the rollout has not kicked off yet.
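To start the rollout, create a ClusterStagedUpdateRun that references the placement, a resource snapshot index, and a staged rollout strategy. A minimal sketch follows; the apiVersion and the example-strategy name are assumptions, so verify the field names against the CRDs installed in your fleet:

apiVersion: placement.kubernetes-fleet.io/v1beta1   # verify against your installed CRDs
kind: ClusterStagedUpdateRun
metadata:
  name: example-run
spec:
  placementName: example-placement
  resourceSnapshotIndex: "0"                   # must reference an existing snapshot index
  stagedRolloutStrategyName: example-strategy  # an existing strategy (assumed name)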
Investigate ClusterStagedUpdateRun initialization failure
An updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 1 0 False 2s
The INITIALIZED field is False, indicating the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe csur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates the initialization failed. The condition message gives more details about the failure.
In this case, a non-existent resource snapshot index 1 was used for the updateRun.
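To find the snapshot indexes that do exist, list the resource snapshots associated with the placement, for example:

# List resource snapshots and inspect their labels for the index
# (exact label keys may vary by release; --show-labels reveals them)
kubectl get clusterresourcesnapshots --show-labels | grep example-placement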
Investigate ClusterStagedUpdateRun execution failure
An updateRun execution failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 0 0 True False 24m
The SUCCEEDED field is False, indicating that the execution failed.
An updateRun execution failure is mainly caused by one of 2 scenarios:

Validation errors during reconciliation: when the updateRun controller is triggered to reconcile an in-progress updateRun, it starts with a series of validations, including retrieving the CRP and checking its rollout strategy, gathering all the bindings, and regenerating the execution plan. If any of these validations fails, the updateRun execution fails with the corresponding validation error:

status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:21Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T22:15:23Z"
    message: 'cannot continue the ClusterStagedUpdateRun: failed to execute the clusterStagedUpdateRun:
      failed to process the request due to a client error: parent clusterResourcePlacement not found'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded

In the above case, the CRP referenced by the updateRun was deleted during execution. The updateRun controller detected this and aborted the release.

Concurrent updateRun preemption: the updateRun controller triggers the update of a member cluster by updating the corresponding binding spec and setting its status to RolloutStarted. It then waits, by default 15 seconds, and checks whether the resources have been successfully applied by inspecting the binding again. If there are multiple concurrent updateRuns, and during the 15-second wait some other updateRun preempts and updates the binding with a new configuration, the current updateRun detects this and fails with a clear error message:

status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:10:58Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: 'cannot continue the ClusterStagedUpdateRun: unexpected behavior which cannot be handled by the controller:
      the clusterResourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status:
      binding spec diff: binding has different resourceSnapshotName, want: example-placement-0-snapshot,
      got: example-placement-1-snapshot; binding state (want Bound): Bound; binding RolloutStarted (want true): true,
      please check if there is concurrent clusterStagedUpdateRun'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded

The Succeeded condition is set to False with reason UpdateRunFailed. The message shows that the member1 cluster in the staging stage got preempted: the resourceSnapshotName field changed from example-placement-0-snapshot to example-placement-1-snapshot, which likely means some other updateRun is rolling out a newer resource version. The message also prints the current binding state and whether its RolloutStarted condition is set to true, hinting at a concurrent clusterStagedUpdateRun. Upon such a failure, the user can list updateRuns or check the binding state:

$ kubectl get clusterresourcebindings
NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
example-placement-member1-2afc7d7f   True               True               51m
example-placement-member2-fc081413                                         51m

The binding is named as <crp-name>-<cluster-name>-<suffix>. Since the error message says the member1 cluster failed the updateRun, we can check its binding:

$ kubectl get clusterresourcebindings example-placement-member1-2afc7d7f -o yaml
...
spec:
  ...
  resourceSnapshotName: example-placement-1-snapshot
  schedulingPolicySnapshotName: example-placement-0
  state: Bound
  targetCluster: member1
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: 'Detected the new changes on the resources and started the rollout process,
      resourceSnapshotIndex: 1, clusterStagedUpdateRun: example-run-1'
    observedGeneration: 3
    reason: RolloutStarted
    status: "True"
    type: RolloutStarted
...

As the binding's RolloutStarted condition shows, it was updated by another updateRun, example-run-1.
An updateRun aborted due to an execution failure is not recoverable at the moment. If the failure is caused by a validation error, one can fix the issue and create a new updateRun. If preemption happens, in most cases the user is releasing a new resource version, and they can simply let the new updateRun run to completion.
Investigate ClusterStagedUpdateRun rollout stuck
A ClusterStagedUpdateRun can get stuck when resource placement fails on some clusters. Getting the updateRun shows the name of the cluster and the stage that are stuck:
$ kubectl get csur example-run -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: ClusterStagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check crp status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
The message shows that the updateRun is stuck waiting for the cluster member1 in the stage staging to finish updating.
The updateRun controller rolls out resources to a member cluster by updating its corresponding binding. It then periodically checks whether the update has completed. If the binding is still not available after the current default of 5 minutes, the updateRun controller decides the rollout is stuck and reports the condition.
This usually indicates that something went wrong on the member cluster or that the resources themselves have an issue. To investigate further, check the ClusterResourcePlacement status:
$ kubectl describe crp example-placement
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-05-13T23:15:35Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-05-13T23:15:35Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object example-placement-work-configmap-c5971133-2779-4f6f-8681-3e05c4458c82 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-nginx-deploy
Namespace: test-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: nginx
Namespace: test-namespace
Version: v1
...
The Available condition is False, reporting that not all work objects are available yet. The Failed Placements section shows
that the nginx deployment, wrapped by the envelope-nginx-deploy configMap, is not ready. Checking the member1 cluster, we can see
there is an image pull failure:
kubectl config use-context member1
kubectl get deploy -n test-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 16m
kubectl get pods -n test-namespace
NAME READY STATUS RESTARTS AGE
nginx-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
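To see why the image pull fails, describe the failing pod (name taken from the output above) and check its Events section:

kubectl describe pod nginx-69b9cb5485-sw24b -n test-namespace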
For more debugging instructions, you can refer to the ClusterResourcePlacement TSG.
After resolving the issue, you can always create a new updateRun to restart the rollout. Stuck updateRuns can be safely deleted.
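For example, a minimal recovery sequence (resource names taken from this guide; the new manifest file is a placeholder):

# Delete the stuck run, then create a fresh one once the root cause is fixed
kubectl delete csur example-run
kubectl apply -f example-run-2.yaml   # a new ClusterStagedUpdateRun manifest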
Namespace-Scoped Troubleshooting
Namespace-scoped StagedUpdateRun troubleshooting mirrors cluster-scoped ClusterStagedUpdateRun troubleshooting. The concepts, failure patterns, and diagnostic approaches are exactly the same; only the resource names, scopes, and kubectl commands differ.
Both follow identical troubleshooting patterns:
- Initialization failures: missing resource snapshots, invalid configurations
- Execution failures: validation errors, concurrent updateRun conflicts
- Rollout stuck scenarios: resource placement failures, cluster connectivity issues
- Status investigation: using kubectl get, kubectl describe, and checking placement status
- Recovery approaches: creating new updateRuns, cleaning up stuck resources
The key differences are:
- Resource scope: namespace-scoped vs. cluster-scoped resources
- Commands: use sur (StagedUpdateRun) instead of csur (ClusterStagedUpdateRun), as shown below
- Target resources: ResourcePlacement instead of ClusterResourcePlacement
- Bindings: resourcebindings instead of clusterresourcebindings
- Approvals: approvalrequests instead of clusterapprovalrequests
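For example, the checks from the cluster-scoped sections translate directly (names taken from this guide's examples):

# Cluster-scoped
kubectl get csur example-run
kubectl get clusterresourcebindings
# Namespace-scoped equivalents
kubectl get sur web-app-rollout -n my-app-namespace
kubectl get resourcebindings -n my-app-namespace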
ResourcePlacement status without Staged Update Run
When a namespace-scoped ResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately.
A sample status of such ResourcePlacement is as follows:
$ kubectl describe rp web-app-placement -n my-app-namespace
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
...
Events: <none>
The SchedulingPolicyFulfilled condition indicates the ResourcePlacement has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started.
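As with the cluster-scoped case, create a StagedUpdateRun in the placement's namespace to kick off the rollout. A minimal sketch; the apiVersion and the web-app-strategy name are assumptions to adapt to your environment:

apiVersion: placement.kubernetes-fleet.io/v1beta1   # verify against your installed CRDs
kind: StagedUpdateRun
metadata:
  name: web-app-rollout
  namespace: my-app-namespace
spec:
  placementName: web-app-placement
  resourceSnapshotIndex: "0"                   # must reference an existing snapshot index
  stagedRolloutStrategyName: web-app-strategy  # an existing strategy (assumed name)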
Investigate StagedUpdateRun initialization failure
A namespace-scoped updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get sur web-app-rollout -n my-app-namespace
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
web-app-rollout web-app-placement 1 0 False 2s
The INITIALIZED field is False, indicating the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe sur web-app-rollout -n my-app-namespace
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the StagedUpdateRun: failed to initialize the stagedUpdateRun: failed to process the request due to a client error: no resourceSnapshots with index `1` found for resourcePlacement `web-app-placement` in namespace `my-app-namespace`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates the initialization failed. The condition message gives more details about the failure.
In this case, a non-existent resource snapshot index 1 was used for the updateRun.
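To find the snapshot indexes that do exist in the namespace, for example:

# List namespace-scoped resource snapshots and inspect their index labels
kubectl get resourcesnapshots -n my-app-namespace --show-labels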
Investigate StagedUpdateRun execution failure
A namespace-scoped updateRun execution failure can be easily detected by getting the resource:
$ kubectl get sur web-app-rollout -n my-app-namespace
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
web-app-rollout web-app-placement 0 0 True False 24m
The SUCCEEDED field is False, indicating that the execution failed.
The execution failure scenarios are similar to cluster-scoped updateRuns:
Validation errors during reconciliation: the updateRun controller validates the ResourcePlacement, gathers bindings, and regenerates the execution plan. If any failure occurs, the updateRun execution fails:

status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: StagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:21Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T22:15:23Z"
    message: 'cannot continue the StagedUpdateRun: failed to initialize the stagedUpdateRun:
      failed to process the request due to a client error: parent resourcePlacement not found'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded

Concurrent updateRun preemption: when multiple updateRuns target the same ResourcePlacement, they may conflict:

status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: 'cannot continue the StagedUpdateRun: unexpected behavior which cannot be handled by the controller:
      the resourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status:
      binding spec diff: binding has different resourceSnapshotName, want: web-app-placement-0-snapshot,
      got: web-app-placement-1-snapshot; please check if there is concurrent stagedUpdateRun'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded

To investigate concurrent updateRuns, check the namespace-scoped resource bindings:

$ kubectl get resourcebindings -n my-app-namespace
NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
web-app-placement-member1-2afc7d7f   True               True               51m
web-app-placement-member2-fc081413                                         51m
Investigate StagedUpdateRun rollout stuck
A StagedUpdateRun can get stuck when resource placement fails on some clusters:
$ kubectl get sur web-app-rollout -n my-app-namespace -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: StagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check resourceplacement status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
To investigate further, check the ResourcePlacement status:
$ kubectl describe rp web-app-placement -n my-app-namespace
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, stagedUpdateRun: web-app-rollout
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object web-app-placement-work-deployment-1234 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-web-app-deploy
Namespace: my-app-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: web-app
Namespace: my-app-namespace
Version: v1
...
Check the target cluster to diagnose the specific issue:
kubectl config use-context member1
kubectl get deploy web-app -n my-app-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
web-app 0/1 1 0 16m
kubectl get pods -n my-app-namespace
NAME READY STATUS RESTARTS AGE
web-app-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
Namespace-Scoped Approval Troubleshooting
For namespace-scoped staged updates with approval gates, check for ApprovalRequest objects:
# List approval requests in the namespace
$ kubectl get approvalrequests -n my-app-namespace
NAME UPDATE-RUN STAGE APPROVED APPROVALACCEPTED AGE
web-app-rollout-prod-clusters web-app-rollout prod False 2m
# Approve the request
$ kubectl patch approvalrequests web-app-rollout-prod-clusters -n my-app-namespace --type='merge' \
-p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"approved","message":"approved"}]}}' \
--subresource=status
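Re-list the approval requests to confirm the controller accepted the approval (the output below is illustrative):

# Verify the approval took effect
$ kubectl get approvalrequests -n my-app-namespace
NAME                            UPDATE-RUN        STAGE   APPROVED   APPROVALACCEPTED   AGE
web-app-rollout-prod-clusters   web-app-rollout   prod    True       True               3m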
After resolving issues, create a new updateRun to restart the rollout. Stuck updateRuns can be deleted safely.