ClusterStagedUpdateRun TSG
This guide provides troubleshooting steps for common issues related to Staged Update Run.
Note: To get more information about why an updateRun fails, you can check the updateRun controller logs.
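For example, assuming the default installation where the hub agent runs as the fleet-hub-agent deployment in the fleet-system namespace (names may differ in your setup):
$ kubectl logs deploy/fleet-hub-agent -n fleet-system | grep -i updaterun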
CRP status without Staged Update Run
When a ClusterResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately. A sample status of such a ClusterResourcePlacement is as follows:
$ kubectl describe crp example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Selected Resources:
...
Events: <none>
The SchedulingPolicyFulfilled condition indicates the CRP has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started. The Placement Statuses section displays the detailed status of each cluster. Both selected clusters are in the Scheduled state, but their RolloutStarted condition is still Unknown because the rollout has not kicked off yet.
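To start the rollout, create a ClusterStagedUpdateRun that references the placement, the resource snapshot index to release, and a staged update strategy. The manifest below is a minimal sketch: the API version and the strategy name example-strategy are assumptions, so check the schema shipped with your Fleet version:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
  name: example-run
spec:
  placementName: example-placement            # the CRP with spec.strategy.type External
  resourceSnapshotIndex: "0"                  # which resource snapshot to roll out
  stagedRolloutStrategyName: example-strategy # an existing ClusterStagedUpdateStrategy (assumed name)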
Investigate ClusterStagedUpdateRun initialization failure
An updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 1 0 False 2s
The INITIALIZED field is False, indicating that the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe csur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates that the initialization failed, and the condition message gives more details about the failure. In this case, a nonexistent resource snapshot index, 1, was used for the updateRun.
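To see which resource snapshot indexes actually exist for the placement, you can list the cluster resource snapshots. A sketch, assuming snapshots carry the kubernetes-fleet.io/parent-CRP label (verify the label used by your Fleet version):
$ kubectl get clusterresourcesnapshots -l kubernetes-fleet.io/parent-CRP=example-placement
Pick an index that appears in the listed snapshot names and create a new updateRun with it.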
Investigate ClusterStagedUpdateRun execution failure
An updateRun execution failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 0 0 True False 24m
The SUCCEEDED field is False, indicating that the execution failed.
An updateRun execution failure is mainly caused by two scenarios:
- When the updateRun controller is triggered to reconcile an in-progress updateRun, it starts with a series of validations: retrieving the CRP and checking its rollout strategy, gathering all the bindings, and regenerating the execution plan. If any failure happens during validation, the updateRun execution fails with the corresponding validation error:
  status:
    conditions:
    - lastTransitionTime: "2025-05-13T21:11:06Z"
      message: ClusterStagedUpdateRun initialized successfully
      observedGeneration: 1
      reason: UpdateRunInitializedSuccessfully
      status: "True"
      type: Initialized
    - lastTransitionTime: "2025-05-13T21:11:21Z"
      message: The stages are aborted due to a non-recoverable error
      observedGeneration: 1
      reason: UpdateRunFailed
      status: "False"
      type: Progressing
    - lastTransitionTime: "2025-05-13T22:15:23Z"
      message: 'cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: parent clusterResourcePlacement not found'
      observedGeneration: 1
      reason: UpdateRunFailed
      status: "False"
      type: Succeeded
  In the above case, the CRP referenced by the updateRun was deleted during execution. The updateRun controller detects this and aborts the release.
- The updateRun controller triggers an update on a member cluster by updating the corresponding binding spec and setting its status to RolloutStarted. It then waits for 15 seconds by default and checks whether the resources have been successfully applied by inspecting the binding again. If there are multiple concurrent updateRuns, and during this 15-second wait some other updateRun preempts the current one and updates the binding with a new configuration, the current updateRun detects this and fails with a clear error message:
  status:
    conditions:
    - lastTransitionTime: "2025-05-13T21:10:58Z"
      message: ClusterStagedUpdateRun initialized successfully
      observedGeneration: 1
      reason: UpdateRunInitializedSuccessfully
      status: "True"
      type: Initialized
    - lastTransitionTime: "2025-05-13T21:11:13Z"
      message: The stages are aborted due to a non-recoverable error
      observedGeneration: 1
      reason: UpdateRunFailed
      status: "False"
      type: Progressing
    - lastTransitionTime: "2025-05-13T21:11:13Z"
      message: 'cannot continue the ClusterStagedUpdateRun: unexpected behavior which cannot be handled by the controller: the clusterResourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status: binding spec diff: binding has different resourceSnapshotName, want: example-placement-0-snapshot, got: example-placement-1-snapshot; binding state (want Bound): Bound; binding RolloutStarted (want true): true, please check if there is concurrent clusterStagedUpdateRun'
      observedGeneration: 1
      reason: UpdateRunFailed
      status: "False"
      type: Succeeded
  The Succeeded condition is set to False with reason UpdateRunFailed. The message shows that the member1 cluster in the staging stage got preempted: the resourceSnapshotName field changed from example-placement-0-snapshot to example-placement-1-snapshot, which means some other updateRun is probably rolling out a newer resource version. The message also prints the current binding state and whether its RolloutStarted condition is set to true, giving a hint about whether there is a concurrent clusterStagedUpdateRun running. Upon such a failure, you can list the updateRuns or check the binding state:
  kubectl get clusterresourcebindings
  NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
  example-placement-member1-2afc7d7f   True               True               51m
  example-placement-member2-fc081413                                         51m
  The binding is named as <crp-name>-<cluster-name>-<suffix>. Since the error message says the member1 cluster failed the updateRun, we can check its binding:
  kubectl get clusterresourcebindings example-placement-member1-2afc7d7f -o yaml
  ...
  spec:
    ...
    resourceSnapshotName: example-placement-1-snapshot
    schedulingPolicySnapshotName: example-placement-0
    state: Bound
    targetCluster: member1
  status:
    conditions:
    - lastTransitionTime: "2025-05-13T21:11:06Z"
      message: 'Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 1, clusterStagedUpdateRun: example-run-1'
      observedGeneration: 3
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    ...
  As the binding's RolloutStarted condition shows, it was updated by another updateRun, example-run-1.
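Listing the updateRuns can likewise surface the concurrent run. The output below is illustrative, with example-run-1 as the preempting run rolling out snapshot index 1:
$ kubectl get csur
NAME            PLACEMENT           RESOURCE-SNAPSHOT-INDEX   POLICY-SNAPSHOT-INDEX   INITIALIZED   SUCCEEDED   AGE
example-run     example-placement   0                         0                       True          False       51m
example-run-1   example-placement   1                         0                       True                      2m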
An updateRun aborted due to an execution failure is not recoverable at the moment. If the failure was caused by a validation error, you can fix the issue and create a new updateRun. If preemption happened, in most cases the user is releasing a new resource version, and they can simply let the new updateRun run to completion.
Investigate ClusterStagedUpdateRun rollout stuck
A ClusterStagedUpdateRun can get stuck when resource placement fails on some clusters. Getting the updateRun shows the cluster name and the stage that is stuck:
$ kubectl get csur example-run -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: ClusterStagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check crp status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
The message shows that the updateRun is stuck waiting for the cluster member1 in stage staging to finish updating.
The updateRun controller rolls out resources to a member cluster by updating its corresponding binding. It then periodically checks whether the update has completed. If the binding is still not available after the current default of 5 minutes, the updateRun controller decides the rollout is stuck and reports the condition.
This usually indicates that something went wrong on the cluster or that the resources have some issue. To investigate further, you can check the ClusterResourcePlacement status:
$ kubectl describe crp example-placement
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-05-13T23:15:35Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-05-13T23:15:35Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object example-placement-work-configmap-c5971133-2779-4f6f-8681-3e05c4458c82 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-nginx-deploy
Namespace: test-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: nginx
Namespace: test-namespace
Version: v1
...
The Available condition is False and reports that not all work objects are available yet. The Failed Placements section shows that the nginx deployment, wrapped in the envelope-nginx-deploy configMap, is not ready. Checking the member1 cluster, we can see there is an image pull failure:
kubectl config use-context member1
kubectl get deploy -n test-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 16m
kubectl get pods -n test-namespace
NAME READY STATUS RESTARTS AGE
nginx-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
For more debugging instructions, you can refer to ClusterResourcePlacement TSG.
After resolving the issue, you can always create a new updateRun to restart the rollout. The stuck updateRun can be deleted.
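A sketch of that recovery flow (the manifest file name example-run-2.yaml is hypothetical):
$ kubectl delete csur example-run
$ kubectl apply -f example-run-2.yaml   # a new ClusterStagedUpdateRun targeting the same placement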