KubeFleet documentation features a number of troubleshooting guides to help you identify and fix KubeFleet issues you encounter. Pick one below to proceed.
Troubleshooting Guides
- 1: ClusterResourcePlacement TSG
- 2: ResourcePlacement TSG
- 3: Scheduling Failure TSG
- 4: Rollout Failure TSG
- 5: Override Failure TSG
- 6: Work Synchronization Failure TSG
- 7: Work-Application Failure TSG
- 8: Availability Failure TSG
- 9: CRP Drift Detection and Configuration Difference Check Unexpected Result TSG
- 10: Diff Reporting Failure TSG
- 11: Staged Update Run TSG
- 12: ClusterResourcePlacementEviction TSG
1 - ClusterResourcePlacement TSG
This TSG is meant to help you troubleshoot issues with the ClusterResourcePlacement API in Fleet.
Cluster Resource Placement
Internal Objects to keep in mind when troubleshooting CRP related errors on the hub cluster:
- ClusterResourceSnapshot
- ClusterSchedulingPolicySnapshot
- ClusterResourceBinding
- Work
Please read the Fleet API reference for more details about each object.
Complete Progress of the ClusterResourcePlacement
Understanding the progression and the status of the ClusterResourcePlacement custom resource is crucial for diagnosing and identifying failures.
You can view the status of the ClusterResourcePlacement custom resource by using the following command:
kubectl describe clusterresourceplacement <name>
The complete progression of ClusterResourcePlacement is as follows:
- ClusterResourcePlacementScheduled: Indicates a resource has been scheduled for placement.
  - If this condition is false, refer to the Scheduling Failure TSG.
- ClusterResourcePlacementRolloutStarted: Indicates the rollout process has begun.
  - If this condition is false, refer to the Rollout Failure TSG.
  - If you are triggering a rollout with a staged update run, refer to the ClusterStagedUpdateRun TSG.
- ClusterResourcePlacementOverridden: Indicates the resource has been overridden.
  - If this condition is false, refer to the Override Failure TSG.
- ClusterResourcePlacementWorkSynchronized: Indicates the work objects have been synchronized.
  - If this condition is false, refer to the Work Synchronization Failure TSG.
- ClusterResourcePlacementApplied: Indicates the resource has been applied. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  - If this condition is false, refer to the Work-Application Failure TSG.
- ClusterResourcePlacementAvailable: Indicates the resource is available. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  - If this condition is false, refer to the Availability Failure TSG.
- ClusterResourcePlacementDiffReported: Indicates whether diff reporting has completed on all resources. This condition is only populated if the apply strategy in use is of the type ReportDiff.
  - If this condition is false, refer to the Diff Reporting Failure TSG for more information.
How can I debug if some clusters are not selected as expected?
Check the status of the ClusterSchedulingPolicySnapshot to determine which clusters were selected along with the reason.
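For example, assuming a ClusterResourcePlacement named my-crp (a hypothetical name; the same label selector is described later in this guide), you can dump the latest snapshot, whose targetClusters list shows, per cluster, whether it was selected and why:

kubectl get clusterschedulingpolicysnapshot \
  -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=my-crp \
  -o yaml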
How can I debug if a selected cluster does not have the expected resources on it or if CRP doesn’t pick up the latest changes?
Please check the following cases:
- Check whether the ClusterResourcePlacementRolloutStarted condition in the ClusterResourcePlacement status is set to true or false.
  - If false, refer to the Rollout Failure TSG.
  - If true, check whether the ClusterResourcePlacementApplied condition is set to unknown, false, or true.
    - If unknown, wait for the process to finish, as the resources are still being applied to the member cluster. If the state remains unknown for a while, create an issue, as this is unusual behavior.
    - If false, refer to the Work-Application Failure TSG.
    - If true, verify that the resource exists on the hub cluster.
We can also take a look at the placementStatuses section in the ClusterResourcePlacement status for that particular cluster. In placementStatuses, the failedPlacements section lists the reasons why resources failed to apply.
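As a quick sketch (with the hypothetical names test-crp and kind-cluster-1), a JSONPath query can pull the failedPlacements entries for one cluster directly:

# Print the failedPlacements recorded for cluster "kind-cluster-1" (if any)
kubectl get clusterresourceplacement test-crp \
  -o jsonpath='{.status.placementStatuses[?(@.clusterName=="kind-cluster-1")].failedPlacements}'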
How can I debug if the drift detection result or the configuration difference check result are different from my expectations?
See the Drift Detection and Configuration Difference Check Unexpected Result TSG for more information.
How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?
To find the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement resource, run the following command:
kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
Then, compare the ClusterSchedulingPolicySnapshot with the ClusterResourcePlacement policy to make sure that they match, excluding the numberOfClusters field from the ClusterResourcePlacement spec.
If the placement type is PickN, check whether the number of clusters requested in the ClusterResourcePlacement policy matches the value of the number-of-clusters annotation on the snapshot.
How can I find the latest ClusterResourceBinding resource?
The following command lists all ClusterResourceBinding instances that are associated with the ClusterResourcePlacement:
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
Example
In this example, we have a ClusterResourcePlacement called test-crp.
- List the ClusterResourcePlacement to get the name of the CRP:
kubectl get crp test-crp
NAME GEN SCHEDULED SCHEDULEDGEN APPLIED APPLIEDGEN AGE
test-crp 1 True 1 True 1 15s
- Run the following command to view the status of the ClusterResourcePlacement:
kubectl describe clusterresourceplacement test-crp
- Here’s an example output. From the placementStatuses section of the test-crp status, notice that it has distributed resources to two member clusters and, therefore, has two ClusterResourceBinding instances:
status:
conditions:
- lastTransitionTime: "2023-11-23T00:49:29Z"
...
placementStatuses:
- clusterName: kind-cluster-1
conditions:
...
type: ResourceApplied
- clusterName: kind-cluster-2
conditions:
...
reason: ApplySucceeded
status: "True"
type: ResourceApplied
- To get the ClusterResourceBindings, run the following command:
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp
- The output lists all ClusterResourceBinding instances that are associated with test-crp.
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp
NAME WORKCREATED RESOURCESAPPLIED AGE
test-crp-kind-cluster-1-be990c3e True True 33s
test-crp-kind-cluster-2-ec4d953c True True 33s
The ClusterResourceBinding resource name uses the following format: {CRPName}-{clusterName}-{suffix}.
Find the ClusterResourceBinding for the target cluster you are looking for based on the clusterName.
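For instance, a minimal sketch that lists each binding together with its target cluster (assuming the test-crp placement from this example), so you can match on the cluster name without parsing the suffix:

kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp \
  -o custom-columns=NAME:.metadata.name,TARGET:.spec.targetCluster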
How can I find the latest ClusterResourceSnapshot resource?
To find the latest ClusterResourceSnapshot resource, run the following command:
kubectl get clusterresourcesnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
How can I find the correct work resource that’s associated with ClusterResourcePlacement?
To find the correct work resource, follow these steps:
- Identify the member cluster namespace and the ClusterResourcePlacement name. The format for the namespace is fleet-member-{clusterName}.
- To get the work resource, run the following command:
kubectl get work -n fleet-member-{clusterName} -l kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {clusterName} and {CRPName} with the names that you identified in the first step.
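For example, with the hypothetical names used elsewhere in this guide (member cluster kind-cluster-1 and placement test-crp), the command becomes:

kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=test-crp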
2 - ResourcePlacement TSG
This TSG is meant to help you troubleshoot issues with the ResourcePlacement API in Fleet.
Resource Placement
Internal Objects to keep in mind when troubleshooting RP related errors on the hub cluster:
- ResourceSnapshot
- SchedulingPolicySnapshot
- ResourceBinding
- Work
Please read the Fleet API reference for more details about each object.
Important Considerations for ResourcePlacement
Namespace Prerequisites
Important: ResourcePlacement can only place namespace-scoped resources to clusters that already have the target namespace. Before creating a ResourcePlacement:
Ensure the target namespace exists on the member clusters, either:
- Created by a ClusterResourcePlacement (CRP) using namespace-only mode
- Pre-existing on the member clusters
If the namespace doesn’t exist on a member cluster, the ResourcePlacement will fail to apply resources to that cluster.
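As a quick sanity check before creating the ResourcePlacement, you can confirm the namespace is present on each member cluster. This sketch assumes your kubeconfig has a context for the member cluster (here kind-cluster-1) and that the target namespace is test-ns:

# Returns an error if the namespace is missing on the member cluster
kubectl --context kind-cluster-1 get namespace test-ns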
Coordination with ClusterResourcePlacement
When using both ResourcePlacement (RP) and ClusterResourcePlacement (CRP) together:
- CRP in namespace-only mode: Use CRP to create and manage the namespace itself across clusters
- RP for resources: Use RP to manage specific resources within that namespace
- Avoid conflicts: Ensure that CRP and RP don’t select the same resources to prevent conflicts
Resource Scope Limitations
ResourcePlacement can only select and manage namespace-scoped resources within the same namespace where the RP object resides:
- ✅ Supported: ConfigMaps, Secrets, Services, Deployments, StatefulSets, Jobs, etc. within the RP’s namespace
- ❌ Not Supported: Cluster-scoped resources (use ClusterResourcePlacement instead)
- ❌ Not Supported: Resources in other namespaces
Complete Progress of the ResourcePlacement
Understanding the progression and the status of the ResourcePlacement custom resource is crucial for diagnosing and identifying failures.
You can view the status of the ResourcePlacement custom resource by using the following command:
kubectl describe resourceplacement <name> -n <namespace>
The complete progression of ResourcePlacement is as follows:
- ResourcePlacementScheduled: Indicates a resource has been scheduled for placement.
  - If this condition is false, refer to the Scheduling Failure TSG.
- ResourcePlacementRolloutStarted: Indicates the rollout process has begun.
  - If this condition is false, refer to the Rollout Failure TSG.
  - If you are triggering a rollout with a staged update run, refer to the ClusterStagedUpdateRun TSG.
- ResourcePlacementOverridden: Indicates the resource has been overridden.
  - If this condition is false, refer to the Override Failure TSG.
- ResourcePlacementWorkSynchronized: Indicates the work objects have been synchronized.
  - If this condition is false, refer to the Work Synchronization Failure TSG.
- ResourcePlacementApplied: Indicates the resource has been applied. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  - If this condition is false, refer to the Work-Application Failure TSG.
- ResourcePlacementAvailable: Indicates the resource is available. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  - If this condition is false, refer to the Availability Failure TSG.
- ResourcePlacementDiffReported: Indicates whether diff reporting has completed on all resources. This condition is only populated if the apply strategy in use is of the type ReportDiff.
  - If this condition is false, refer to the Diff Reporting Failure TSG for more information.
Note: ResourcePlacement and ClusterResourcePlacement share the same underlying architecture with a 1-to-1 mapping of condition types. The condition types follow a naming convention where RP conditions use the ResourcePlacement prefix while CRP conditions use the ClusterResourcePlacement prefix. For example:
- ResourcePlacementScheduled ↔ ClusterResourcePlacementScheduled
- ResourcePlacementApplied ↔ ClusterResourcePlacementApplied
- ResourcePlacementAvailable ↔ ClusterResourcePlacementAvailable

The troubleshooting approaches documented in the CRP TSG files are applicable to ResourcePlacement as well. The main difference is that ResourcePlacement is namespace-scoped and works with namespace-scoped resources, while ClusterResourcePlacement is cluster-scoped. When following CRP TSG guidance, substitute the appropriate RP condition names and commands (e.g., use kubectl get resourceplacement -n <namespace> instead of kubectl get clusterresourceplacement).
How can I debug if some clusters are not selected as expected?
Check the status of the SchedulingPolicySnapshot to determine which clusters were selected along with the reason.
To find the latest SchedulingPolicySnapshot for a ResourcePlacement resource, run the following command:
kubectl get schedulingpolicysnapshot -n <namespace> -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={RPName}
NOTE: In this command, replace {RPName} with your ResourcePlacement name and <namespace> with the namespace where the ResourcePlacement exists.
Then, compare the SchedulingPolicySnapshot with the ResourcePlacement policy to make sure that they match.
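A minimal sketch for that comparison, assuming a ResourcePlacement named test-rp in namespace test-ns: print the policy from both objects and diff them with your preferred tooling.

# Policy as declared on the ResourcePlacement
kubectl get resourceplacement test-rp -n test-ns -o jsonpath='{.spec.policy}'
# Policy captured in the latest SchedulingPolicySnapshot
kubectl get schedulingpolicysnapshot -n test-ns \
  -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=test-rp \
  -o jsonpath='{.items[0].spec.policy}'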
How can I debug if a selected cluster does not have the expected resources on it or if RP doesn’t pick up the latest changes?
Please check the following cases:
- Check whether the ResourcePlacementRolloutStarted condition in the ResourcePlacement status is set to true or false.
  - If false, the resource placement has not started rolling out yet.
  - If true, check whether the overall ResourcePlacementApplied condition is set to unknown, false, or true.
    - If unknown, wait for the process to finish, as the resources are still being applied to the member clusters. If the state remains unknown for a while, create an issue, as this is unusual behavior.
    - If false, the resources failed to apply on one or more clusters. Check the Placement Statuses section in the status for cluster-specific details.
    - If true, verify that the resource exists on the hub cluster in the same namespace as the ResourcePlacement.
To pinpoint issues on specific clusters, examine the Placement Statuses section in the ResourcePlacement status. For each cluster, you can find:
- The cluster name
- Conditions specific to that cluster (e.g., Applied, Available)
- A Failed Placements section, which lists the resources that failed to apply along with the reasons
How can I debug if the drift detection result or the configuration difference check result are different from my expectations?
See the Drift Detection and Configuration Difference Check Unexpected Result TSG for more information.
How can I find the latest ResourceBinding resource?
The following command lists all ResourceBinding instances that are associated with the ResourcePlacement:
kubectl get resourcebinding -n <namespace> -l kubernetes-fleet.io/parent-CRP={RPName}
NOTE: In this command, replace {RPName} with your ResourcePlacement name and <namespace> with the namespace where the ResourcePlacement exists.
Example
In this example, we have a ResourcePlacement called test-rp in namespace test-ns.
- Get the ResourcePlacement for a basic overview of the RP:
kubectl get rp test-rp -n test-ns
NAME GEN SCHEDULED SCHEDULEDGEN AVAILABLE AVAILABLE-GEN AGE
test-rp 1 True 1 True 1 15s
- Run the following command to view the status of the ResourcePlacement:
kubectl describe resourceplacement test-rp -n test-ns
Here’s an example output:
Status:
Conditions:
Last Transition Time: 2025-11-13T22:25:45Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 2
Reason: SchedulingPolicyFulfilled
Status: True
Type: ResourcePlacementScheduled
Last Transition Time: 2025-11-13T22:25:45Z
Message: All 2 cluster(s) start rolling out the latest resource
Observed Generation: 2
Reason: RolloutStarted
Status: True
Type: ResourcePlacementRolloutStarted
Last Transition Time: 2025-11-13T22:25:45Z
Message: No override rules are configured for the selected resources
Observed Generation: 2
Reason: NoOverrideSpecified
Status: True
Type: ResourcePlacementOverridden
Last Transition Time: 2025-11-13T22:25:45Z
Message: Works(s) are succcesfully created or updated in 2 target cluster(s)' namespaces
Observed Generation: 2
Reason: WorkSynchronized
Status: True
Type: ResourcePlacementWorkSynchronized
Last Transition Time: 2025-11-13T22:25:45Z
Message: The selected resources are successfully applied to 2 cluster(s)
Observed Generation: 2
Reason: ApplySucceeded
Status: True
Type: ResourcePlacementApplied
Last Transition Time: 2025-11-13T22:25:45Z
Message: The selected resources in 2 cluster(s) are available now
Observed Generation: 2
Reason: ResourceAvailable
Status: True
Type: ResourcePlacementAvailable
Observed Resource Index: 0
Placement Statuses:
Cluster Name: kind-cluster-1
Conditions:
Last Transition Time: 2025-11-13T22:25:45Z
Message: Successfully scheduled resources for placement in "kind-cluster-1": picked by scheduling policy
Observed Generation: 2
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-11-13T22:25:45Z
Message: Detected the new changes on the resources and started the rollout process
Observed Generation: 2
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-11-13T22:25:45Z
Message: No override rules are configured for the selected resources
Observed Generation: 2
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-11-13T22:25:45Z
Message: All of the works are synchronized to the latest
Observed Generation: 2
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-11-13T22:25:45Z
Message: All corresponding work objects are applied
Observed Generation: 2
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-11-13T22:25:45Z
Message: All corresponding work objects are available
Observed Generation: 2
Reason: AllWorkAreAvailable
Status: True
Type: Available
Observed Resource Index: 0
Cluster Name: kind-cluster-2
Conditions:
Last Transition Time: 2025-11-13T22:25:45Z
Message: Successfully scheduled resources for placement in "kind-cluster-2": picked by scheduling policy
Observed Generation: 2
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-11-13T22:25:45Z
Message: Detected the new changes on the resources and started the rollout process
Observed Generation: 2
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-11-13T22:25:45Z
Message: No override rules are configured for the selected resources
Observed Generation: 2
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-11-13T22:25:45Z
Message: All of the works are synchronized to the latest
Observed Generation: 2
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-11-13T22:25:45Z
Message: All corresponding work objects are applied
Observed Generation: 2
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-11-13T22:25:45Z
Message: All corresponding work objects are available
Observed Generation: 2
Reason: AllWorkAreAvailable
Status: True
Type: Available
Observed Resource Index: 0
Selected Resources:
Kind: ConfigMap
Name: app-config
Namespace: my-app
Version: v1
Kind: ConfigMap
Name: feature-flags
Namespace: my-app
Version: v1
From the status output, you can see:
- Overall Conditions: Show the aggregated state across all clusters (e.g., ResourcePlacementApplied, ResourcePlacementAvailable)
- Placement Statuses: Contains per-cluster details for kind-cluster-1 and kind-cluster-2, each with their own conditions (Scheduled, Applied, Available, etc.)
- Selected Resources: Lists the ConfigMaps (app-config and feature-flags) that were selected for placement
- To get the ResourceBindings, run the following command:
kubectl get resourcebinding -n test-ns -l kubernetes-fleet.io/parent-CRP=test-rp
This lists all ResourceBinding instances that are associated with test-rp.
kubectl get resourcebinding -n test-ns -l kubernetes-fleet.io/parent-CRP=test-rp
NAME WORKSYNCHRONIZED RESOURCESAPPLIED AGE
test-rp-kind-cluster-1-be990c3e True True 33s
test-rp-kind-cluster-2-ec4d953c True True 33s
The ResourceBinding resource name uses the following format: {RPName}-{clusterName}-{suffix}.
Find the ResourceBinding for the target cluster you are looking for based on the clusterName.
How can I find the latest ResourceSnapshot resource?
To find the latest ResourceSnapshot resource, run the following command:
kubectl get resourcesnapshot -n <namespace> -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={RPName}
NOTE: In this command, replace {RPName} with your ResourcePlacement name and <namespace> with the namespace where the ResourcePlacement exists.
How can I find the correct work resource that’s associated with ResourcePlacement?
To find the correct work resource, follow these steps:
- Identify the member cluster namespace and the ResourcePlacement name. The format for the namespace is fleet-member-{clusterName}.
- To get the work resource, run the following command:
kubectl get work -n fleet-member-{clusterName} -l kubernetes-fleet.io/parent-CRP={RPName}
NOTE: In this command, replace {clusterName} and {RPName} with the names that you identified in the first step.
3 - Scheduling Failure TSG
The ClusterResourcePlacementScheduled (for ClusterResourcePlacement) or ResourcePlacementScheduled (for ResourcePlacement) condition is set to false when the scheduler cannot find all the clusters needed as specified by the scheduling policy.
Note: To get more information about why the scheduling fails, you can check the scheduler logs.
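A sketch for pulling those logs, assuming a default installation where the scheduler runs inside the hub agent deployment (hub-agent) in the fleet-system namespace; adjust the deployment and namespace names to match your setup:

kubectl logs deployment/hub-agent -n fleet-system | grep -i scheduler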
Common scenarios
Instances where this condition may arise:
- When the placement policy is set to PickFixed, but the specified cluster names do not match any joined member cluster name in the fleet, or the specified cluster is no longer connected to the fleet.
- When the placement policy is set to PickN, and N clusters are specified, but fewer than N clusters have joined the fleet or satisfy the placement policy.
- When the placement resource selector selects a reserved namespace.
Note: When the placement policy is set to PickAll, the ClusterResourcePlacementScheduled or ResourcePlacementScheduled condition is always set to True.
Case Study
In the following example, a ClusterResourcePlacement with a PickN placement policy is trying to propagate resources to two clusters labeled env:prod. (The same scheduling logic applies to ResourcePlacement.)
The two clusters, named kind-cluster-1 and kind-cluster-2, have joined the fleet. However, only one member cluster, kind-cluster-1, has the label env:prod.
CRP spec:
spec:
policy:
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
numberOfClusters: 2
placementType: PickN
resourceSelectors:
...
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: Works(s) are succcesfully created or updated in the 1 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: The selected resources are successfully applied to 1 clusters
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: The selected resources in 1 cluster are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: 'kind-cluster-2 is not selected: ClusterUnschedulable, cluster does not
match with any of the required cluster affinity terms'
observedGeneration: 1
reason: ScheduleFailed
status: "False"
type: Scheduled
selectedResources:
...
The ClusterResourcePlacementScheduled condition is set to false because the goal is to select two clusters with the label env:prod, but only one member cluster has the label specified in clusterAffinity.
We can also take a look at the ClusterSchedulingPolicySnapshot status to figure out why the scheduler could not schedule the resource for the placement policy specified.
To learn how to get the latest ClusterSchedulingPolicySnapshot, see How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?
- For ResourcePlacement, use SchedulingPolicySnapshot instead.
The corresponding ClusterSchedulingPolicySnapshot spec and status gives us even more information on why scheduling failed.
Latest ClusterSchedulingPolicySnapshot
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterSchedulingPolicySnapshot
metadata:
annotations:
kubernetes-fleet.io/CRP-generation: "1"
kubernetes-fleet.io/number-of-clusters: "2"
creationTimestamp: "2024-05-07T22:36:33Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-2
kubernetes-fleet.io/policy-index: "0"
name: crp-2-0
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-2
uid: 48bc1e92-a8b9-4450-a2d5-c6905df2cbf0
resourceVersion: "10090"
uid: 2137887e-45fd-4f52-bbb7-b96f39854625
spec:
policy:
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
placementType: PickN
policyHash: ZjE0Yjk4YjYyMTVjY2U3NzQ1MTZkNWRhZjRiNjQ1NzQ4NjllNTUyMzZkODBkYzkyYmRkMGU3OTI3MWEwOTkyNQ==
status:
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: Scheduled
observedCRPGeneration: 1
targetClusters:
- clusterName: kind-cluster-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
- clusterName: kind-cluster-2
reason: ClusterUnschedulable, cluster does not match with any of the required
cluster affinity terms
selected: false
Resolution:
The solution here is to add the env:prod label to the member cluster resource for kind-cluster-2 as well, so that the scheduler can select the cluster to propagate resources.
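A minimal sketch of that fix, assuming the MemberCluster object on the hub is also named kind-cluster-2:

# Add the missing label so the cluster matches the required affinity term
kubectl label membercluster kind-cluster-2 env=prod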
General Notes
The scheduling failure investigation flow is identical for ClusterResourcePlacement and ResourcePlacement; only the snapshot object kind differs. Replace CRP-specific object kinds with their RP equivalents when working with namespace-scoped placements.
4 - Rollout Failure TSG
When using placement APIs (ClusterResourcePlacement or ResourcePlacement) to propagate resources, selected resources may not begin rolling out and the ClusterResourcePlacementRolloutStarted (for ClusterResourcePlacement) or ResourcePlacementRolloutStarted (for ResourcePlacement) condition shows False.
This TSG only applies to the RollingUpdate rollout strategy, which is the default strategy if you don’t specify one in the ClusterResourcePlacement.
If you specify the External strategy in the ClusterResourcePlacement and trigger rollouts with staged update runs, refer to the Staged Update Run Troubleshooting Guide instead.
Note: To get more information about why the rollout doesn’t start, you can check the rollout controller logs.
Common scenarios
Instances where this condition may arise:
- The rollout strategy is blocked because the RollingUpdate configuration is too strict.
Troubleshooting Steps
- In the ClusterResourcePlacement or ResourcePlacement status section, check the placementStatuses to identify clusters with the RolloutStarted status set to False.
- Locate the corresponding ClusterResourceBinding or ResourceBinding for the identified cluster. For more information, see How can I find the latest ClusterResourceBinding resource? or How can I find the latest ResourceBinding resource?. This resource indicates the status of the Work, whether it was created or updated.
- Verify the values of maxUnavailable and maxSurge to ensure they align with your expectations (see the sketch after this list).
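A quick sketch for that check (hypothetical placement name test-crp); an empty result means no explicit rollingUpdate settings, so the defaults described in the case study below apply:

kubectl get clusterresourceplacement test-crp -o jsonpath='{.spec.strategy.rollingUpdate}'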
Case Study
In the following example, the ClusterResourcePlacement is trying to propagate a namespace to three member clusters.
However, during the initial creation of the ClusterResourcePlacement, the namespace didn’t exist on the hub cluster,
and the fleet currently comprises two member clusters named kind-cluster-1 and kind-cluster-2.
ClusterResourcePlacement spec
spec:
policy:
numberOfClusters: 3
placementType: PickN
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Works(s) are succcesfully created or updated in the 2 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: The selected resources are successfully applied to 2 clusters
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: The selected resources in 2 cluster are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
The previous output indicates that the test-ns namespace does not exist on the hub cluster and shows the following ClusterResourcePlacement condition statuses:
- ClusterResourcePlacementScheduled is set to False, as the specified policy aims to pick three clusters, but the scheduler can only accommodate placement in the two currently available and joined clusters.
- ClusterResourcePlacementRolloutStarted is set to True, as the rollout process has commenced with 2 clusters being selected.
- ClusterResourcePlacementOverridden is set to True, as no override rules are configured for the selected resources.
- ClusterResourcePlacementWorkSynchronized is set to True.
- ClusterResourcePlacementApplied is set to True.
- ClusterResourcePlacementAvailable is set to True.
To ensure seamless propagation of the namespace across the relevant clusters, proceed to create the test-ns namespace on the hub cluster.
ClusterResourcePlacement status after namespace test-ns is created on the hub cluster
status:
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: ClusterResourcePlacementRolloutStarted
observedResourceIndex: "1"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
selectedResources:
- kind: Namespace
name: test-ns
version: v1
Upon examination, the ClusterResourcePlacementScheduled condition status is shown as False.
The ClusterResourcePlacementRolloutStarted status is also shown as False with the message The rollout is being blocked by the rollout strategy in 2 cluster(s).
Let’s check the latest ClusterResourceSnapshot by running the command in How can I find the latest ClusterResourceSnapshot resource?.
Latest ClusterResourceSnapshot
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
annotations:
kubernetes-fleet.io/number-of-enveloped-object: "0"
kubernetes-fleet.io/number-of-resource-snapshots: "1"
kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
creationTimestamp: "2024-05-07T23:13:51Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-3
kubernetes-fleet.io/resource-index: "1"
name: crp-3-1-snapshot
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-3
uid: b4f31b9a-971a-480d-93ac-93f093ee661f
resourceVersion: "14434"
uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
selectedResources:
- apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: test-ns
name: test-ns
spec:
finalizers:
- kubernetes
Upon inspecting ClusterResourceSnapshot spec, the selectedResources section now shows the namespace test-ns.
Let’s check the ClusterResourceBinding for kind-cluster-1 to see if it was updated after the namespace test-ns was created. To do so, run the command in How can I find the latest ClusterResourceBinding resource?.
ClusterResourceBinding for kind-cluster-1
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
creationTimestamp: "2024-05-07T23:08:53Z"
finalizers:
- kubernetes-fleet.io/work-cleanup
generation: 2
labels:
kubernetes-fleet.io/parent-CRP: crp-3
name: crp-3-kind-cluster-1-7114c253
resourceVersion: "14438"
uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
applyStrategy:
type: ClientSideApply
clusterDecision:
clusterName: kind-cluster-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
resourceSnapshotName: crp-3-0-snapshot
schedulingPolicySnapshotName: crp-3-0
state: Bound
targetCluster: kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The resources cannot be updated to the latest because of the rollout
strategy
observedGeneration: 2
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 2
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 2
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 2
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 2
reason: AllWorkAreAvailable
status: "True"
type: Available
Upon inspection, it is observed that the ClusterResourceBinding remains unchanged. Notably, in the spec, the resourceSnapshotName still references the old ClusterResourceSnapshot name.
This issue arises due to the absence of explicit rollingUpdate input from the user. Consequently, the default values are applied:
- The maxUnavailable value is configured to 25% x 3 (desired number), rounded to 1.
- The maxSurge value is configured to 25% x 3 (desired number), rounded to 1.
Why isn’t the ClusterResourceBinding updated?
Initially, when the ClusterResourcePlacement was created, two ClusterResourceBindings were generated. Because the rollout restrictions do not apply to the initial placement, the ClusterResourcePlacementRolloutStarted condition was set to True.
Upon creating the test-ns namespace on the hub cluster, the rollout controller attempted to update the two existing ClusterResourceBindings. However, maxUnavailable was set to 1 due to the lack of member clusters, which made the RollingUpdate configuration too strict.
NOTE: During the update, if one of the bindings fails to apply, it also violates the RollingUpdate configuration, because maxUnavailable is set to 1.
Resolution
To address this issue, consider manually setting maxUnavailable to a value greater than 1 to relax the RollingUpdate configuration, as shown in the sketch below.
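A minimal sketch of the relaxed strategy in the ClusterResourcePlacement spec (the exact value is up to you; anything that tolerates more unavailable clusters during the update will do):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 2   # default would be 25% of the desired number of clusters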
Alternatively, you can join a third member cluster.
General Notes
The rollout failure investigation flow is identical for ClusterResourcePlacement and ResourcePlacement; only the snapshot object kind differs. Replace CRP-specific object kinds with their RP equivalents when working with namespace-scoped placements.
5 - Override Failure TSG
The ClusterResourcePlacementOverridden (CRP) or ResourcePlacementOverridden (RP) condition is False when an override operation fails.
Note: To get more information, look into the logs for the overrider controller (includes controller for ClusterResourceOverride and ResourceOverride).
Common scenarios
Instances where this condition may arise:
- The ClusterResourceOverride or ResourceOverride is created with an invalid field path for the resource.
Case Study
In the following example, an attempt is made to override the cluster role secret-reader that is being propagated by the ClusterResourcePlacement to the selected clusters.
However, the ClusterResourceOverride is created with an invalid path for a field within the resource.
ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
creationTimestamp: "2024-05-14T15:36:48Z"
name: secret-reader
resourceVersion: "81334"
uid: 108e6312-3416-49be-aa3d-a665c5df58b4
rules:
- apiGroups:
- ""
resources:
- secrets
verbs:
- get
- watch
- list
The ClusterRole secret-reader that is being propagated to the member clusters by the ClusterResourcePlacement.
ClusterResourceOverride spec
spec:
clusterResourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: canary
jsonPatchOverrides:
- op: add
path: /metadata/labels/new-label
value: new-value
The ClusterResourceOverride is created to override the ClusterRole secret-reader by adding a new label (new-label)
that has the value new-value for the clusters with the label env: canary.
ClusterResourcePlacement Spec
spec:
resourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
placementType: PickN
numberOfClusters: 1
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: canary
strategy:
type: RollingUpdate
applyStrategy:
allowCoOwnership: true
ClusterResourcePlacement Status
status:
conditions:
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: Failed to override resources in 1 cluster(s)
observedGeneration: 1
reason: OverriddenFailed
status: "False"
type: ClusterResourcePlacementOverridden
observedResourceIndex: "0"
placementStatuses:
- applicableClusterResourceOverrides:
- cro-1-0
clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: 'Failed to apply the override rules on the resources: add operation
does not apply: doc is missing path: "/metadata/labels/new-label": missing
value'
observedGeneration: 1
reason: OverriddenFailed
status: "False"
type: Overridden
selectedResources:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
The CRP attempted to override a propagated resource utilizing an applicable ClusterResourceOverrideSnapshot.
However, the ClusterResourcePlacementOverridden condition remains false. Examining the placement status for the cluster where the ClusterResourcePlacementOverridden (for ClusterResourcePlacement) or ResourcePlacementOverridden (for ResourcePlacement) condition failed offers insight into the exact cause of the failure.
In this situation, the message indicates that the override failed because the path /metadata/labels/new-label and its corresponding value are missing.
Based on the previous example of the cluster role secret-reader, you can see that the path /metadata/labels/ doesn’t exist. This means that labels doesn’t exist.
Therefore, a new label can’t be added.
Resolution
To successfully override the cluster role secret-reader, correct the path and value in ClusterResourceOverride, as shown in the following code:
jsonPatchOverrides:
- op: add
path: /metadata/labels
value:
newlabel: new-value
This will successfully add the new label newlabel with the value new-value to the ClusterRole secret-reader, as we are creating the labels field and adding a new value newlabel: new-value to it.
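To confirm the corrected override took effect, a sketch (assuming a kubeconfig context for the member cluster, here kind-cluster-1) is to check the propagated ClusterRole’s labels:

kubectl --context kind-cluster-1 get clusterrole secret-reader --show-labels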
General Notes
For ResourcePlacement the override flow is identical except that all the resources reside in the same namespace; use ResourceOverride instead of ClusterResourceOverride and expect ResourcePlacementOverridden in conditions.
6 - Work Synchronization Failure TSG
The ClusterResourcePlacementWorkSynchronized (CRP) or ResourcePlacementWorkSynchronized (RP) condition is False when the placement has been recently updated but the associated Work objects have not yet been synchronized with the latest selected resources.
Note: In addition, it may be helpful to look into the logs for the work generator controller to get more information on why the work synchronization failed.
Common Scenarios
Instances where this condition may arise:
- The controller encounters an error while trying to generate the corresponding work object.
- The enveloped object is not well formatted.
Case Study
The CRP is attempting to propagate a resource to a selected cluster, but the work object has not been updated to reflect the latest changes because the selected cluster has been terminated.
ClusterResourcePlacement Spec
spec:
resourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
placementType: PickN
numberOfClusters: 1
strategy:
type: RollingUpdate
ClusterResourcePlacement Status
spec:
policy:
numberOfClusters: 1
placementType: PickN
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
status:
conditions:
- lastTransitionTime: "2024-05-14T18:05:04Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: There are 1 cluster(s) which have not finished creating or updating work(s)
yet
observedGeneration: 1
reason: WorkNotSynchronizedYet
status: "False"
type: ClusterResourcePlacementWorkSynchronized
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T18:05:04Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: 'Failed to synchronize the work to the latest: works.placement.kubernetes-fleet.io
"crp1-work" is forbidden: unable to create new content in namespace fleet-member-kind-cluster-1
because it is being terminated'
observedGeneration: 1
reason: SyncWorkFailed
status: "False"
type: WorkSynchronized
selectedResources:
- kind: Namespace
name: test-ns
version: v1
In the ClusterResourcePlacement status, the ClusterResourcePlacementWorkSynchronized condition status shows as False.
The message for it indicates that the work object crp1-work is prohibited from generating new content within the namespace fleet-member-kind-cluster-1 because it’s currently terminating.
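You can confirm this on the hub cluster by checking the phase of the reserved member namespace; a Terminating phase matches the error above:

kubectl get namespace fleet-member-kind-cluster-1 -o jsonpath='{.status.phase}'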
Resolution
To address the issue at hand, there are several potential solutions:
- Modify the ClusterResourcePlacement with a newly selected cluster.
- Delete the ClusterResourcePlacement to remove the work through garbage collection.
- Rejoin the member cluster. The namespace can only be regenerated after rejoining the cluster.
In other situations, you might opt to wait for the work to finish propagating.
General Notes
For ResourcePlacement the investigation is identical — inspect .status.placementStatuses[*].conditions for WorkSynchronized and check the associated Work in the fleet-member-{clusterName} namespace.
7 - Work-Application Failure TSG
The ClusterResourcePlacementApplied (for ClusterResourcePlacement) or ResourcePlacementApplied (for ResourcePlacement) condition is set to false when the deployment fails to apply.
Note: To get more information about why the resources are not applied, you can check the work applier logs.
Common scenarios
Instances where this condition may arise:
- The resource already exists on the cluster and isn’t managed by the fleet controller.
- Another placement (ClusterResourcePlacement or ResourcePlacement) is already managing the resource for the selected cluster by using a different apply strategy.
- The placement doesn’t apply the manifest because of syntax errors or invalid resource configurations. This might also occur if a resource is propagated through an envelope object.
Investigation steps
- Check placementStatuses: In the placement status section, inspect the placementStatuses to identify which clusters have the ClusterResourcePlacementApplied (for ClusterResourcePlacement) or ResourcePlacementApplied (for ResourcePlacement) condition set to false and note down their clusterName.
- Locate the Work object in the hub cluster: Use the identified clusterName to locate the Work object associated with the member cluster.
  - For ClusterResourcePlacement, refer to this section
  - For ResourcePlacement, refer to this section
- Check the Work object status: Inspect the status of the Work object to understand the specific issues preventing successful resource application (see the sketch after this list).
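A sketch of that inspection, using hypothetical names that match the case study below (member cluster kind-cluster-1 and placement crp-4); the manifestConditions entries in the output identify exactly which manifests failed and why:

kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=crp-4 -o yaml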
Case Study: ClusterResourcePlacement
In the following example, a ClusterResourcePlacement is trying to propagate a namespace that contains a deployment to two member clusters. However, the namespace already exists on one member cluster, specifically kind-cluster-1.
ClusterResourcePlacement spec
policy:
clusterNames:
- kind-cluster-1
- kind-cluster-2
placementType: PickFixed
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Works(s) are succcesfully created or updated in the 2 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Failed to apply resources to 1 clusters, please check the `failedPlacements`
status
observedGeneration: 1
reason: ApplyFailed
status: "False"
type: ClusterResourcePlacementApplied
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:49Z"
message: The availability of work object crp-4-work is not trackable
observedGeneration: 1
reason: WorkNotTrackable
status: "True"
type: Available
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Work object crp-4-work is not applied
observedGeneration: 1
reason: NotAllWorkHaveBeenApplied
status: "False"
type: Applied
failedPlacements:
- condition:
lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Failed to apply manifest: failed to process the request due to a
client error: resource exists and is not managed by the fleet controller
and co-ownernship is disallowed'
reason: ManifestsAlreadyOwnedByOthers
status: "False"
type: Applied
kind: Namespace
name: test-ns
version: v1
selectedResources:
- kind: Namespace
name: test-ns
version: v1
- group: apps
kind: Deployment
name: test-nginx
namespace: test-ns
version: v1
In the ClusterResourcePlacement status, within the failedPlacements section for kind-cluster-1, we get a clear message
as to why the resource failed to apply on the member cluster. In the preceding per-cluster conditions section,
the Applied condition for kind-cluster-1 is flagged as false and shows the NotAllWorkHaveBeenApplied reason.
This indicates that the Work object intended for the member cluster kind-cluster-1 has not been applied.
To inspect the Work object for more details, follow the steps in the Investigation steps section above.
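For example, the Work object for kind-cluster-1 can be located and inspected on the hub cluster using the parent-CRP label shown later in this guide; the CRP and work names below are placeholders to replace with your own:
kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=[YOUR-CRP-NAME]
kubectl describe work [WORK-NAME] -n fleet-member-kind-cluster-1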
Work status of kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Apply manifest {Ordinal:0 Group: Version:v1 Kind:Namespace Resource:namespaces
Namespace: Name:test-ns} failed'
observedGeneration: 1
reason: WorkAppliedFailed
status: "False"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: ""
observedGeneration: 1
reason: WorkAppliedFailed
status: Unknown
type: Available
manifestConditions:
- conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Failed to apply manifest: failed to process the request due to a client
error: resource exists and is not managed by the fleet controller and co-ownernship
is disallowed'
reason: ManifestsAlreadyOwnedByOthers
status: "False"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Manifest is not applied yet
reason: ManifestApplyFailed
status: Unknown
type: Available
identifier:
kind: Namespace
name: test-ns
ordinal: 0
resource: namespaces
version: v1
- conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Manifest is already up to date
observedGeneration: 1
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:51Z"
message: Manifest is trackable and available now
observedGeneration: 1
reason: ManifestAvailable
status: "True"
type: Available
identifier:
group: apps
kind: Deployment
name: test-nginx
namespace: test-ns
ordinal: 1
resource: deployments
version: v1
From looking at the Work status, specifically the manifestConditions section, you can see that the namespace could not be applied but the deployment within the namespace got propagated from the hub to the member cluster.
Resolution
In this situation, a potential solution is to set AllowCoOwnership to true in the ApplyStrategy policy. However, note that this decision should be made by the user, because the pre-existing resource might not be intended to be shared with another owner.
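A minimal sketch of what that could look like in the placement spec, assuming the field is spelled allowCoOwnership in the apply strategy of your Fleet version:
spec:
  strategy:
    applyStrategy:
      allowCoOwnership: true   # assumption: lets Fleet co-own the pre-existing resource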
General Troubleshooting Notes
The troubleshooting process and Work object inspection are identical for both ClusterResourcePlacement and ResourcePlacement:
- Both use the same underlying Work API to apply resources to member clusters
- The Work object status and manifestConditions have the same structure regardless of whether they were created by a ClusterResourcePlacement or ResourcePlacement
- The main difference is the scope: ClusterResourcePlacement is cluster-scoped and can select both cluster-scoped and namespace-scoped resources, while ResourcePlacement is namespace-scoped and can only select namespace-scoped resources within its own namespace
For ResourcePlacement-specific considerations:
- Ensure the target namespace exists on member clusters before the ResourcePlacement tries to apply resources to it
- ResourcePlacement can only select resources within the same namespace where the ResourcePlacement object itself resides
8 - Availability Failure TSG
The ClusterResourcePlacementAvailable (for ClusterResourcePlacement) or ResourcePlacementAvailable (for ResourcePlacement) condition is false when some of the resources are not available yet. Detailed failures are placed in the failedPlacements section of the placement status.
Note: To get more information about why resources are unavailable, check the work applier logs.
Common scenarios
Instances where this condition may arise:
- The member cluster doesn’t have enough resource availability.
- The deployment contains an invalid image name.
- Required resources (such as persistent volumes, config maps, or secrets) are missing.
- Resource quotas or limit ranges are preventing the resource from becoming available.
Case Study: ClusterResourcePlacement
The example output below demonstrates a scenario where a ClusterResourcePlacement is unable to propagate a deployment to a member cluster due to the deployment having a bad image name.
ClusterResourcePlacement spec
spec:
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
policy:
placementType: PickN
numberOfClusters: 1
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-14T18:52:30Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Works(s) are succcesfully created or updated in 1 target cluster(s)'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: The selected resources are successfully applied to 1 cluster(s)
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: The selected resources in 1 cluster(s) are still not available yet
observedGeneration: 1
reason: ResourceNotAvailableYet
status: "False"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T18:52:30Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Work object crp1-work is not available
observedGeneration: 1
reason: NotAllWorkAreAvailable
status: "False"
type: Available
failedPlacements:
- condition:
lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable but not available yet
observedGeneration: 1
reason: ManifestNotAvailableYet
status: "False"
type: Available
group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
version: v1
selectedResources:
- kind: Namespace
name: test-ns
version: v1
- group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
version: v1
In the ClusterResourcePlacement status, within the failedPlacements section for kind-cluster-1, we get a clear message
as to why the resource is not available on the member cluster. In the preceding per-cluster conditions section,
the Available condition for kind-cluster-1 is flagged as false and shows the NotAllWorkAreAvailable reason.
This signifies that the Work object intended for the member cluster kind-cluster-1 is not yet available.
For more information on finding the correct Work resource:
- For ClusterResourcePlacement, see this section
- For ResourcePlacement, see this section
Work status of kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Work is applied successfully
observedGeneration: 1
reason: WorkAppliedCompleted
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest {Ordinal:1 Group:apps Version:v1 Kind:Deployment Resource:deployments
Namespace:test-ns Name:my-deployment} is not available yet
observedGeneration: 1
reason: WorkNotAvailableYet
status: "False"
type: Available
manifestConditions:
- conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is already up to date
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable and available now
reason: ManifestAvailable
status: "True"
type: Available
identifier:
kind: Namespace
name: test-ns
ordinal: 0
resource: namespaces
version: v1
- conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is already up to date
observedGeneration: 1
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable but not available yet
observedGeneration: 1
reason: ManifestNotAvailableYet
status: "False"
type: Available
identifier:
group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
ordinal: 1
resource: deployments
version: v1
Check the Available status for kind-cluster-1. You can see that the my-deployment deployment isn’t yet available on the member cluster.
This suggests that an issue might be affecting the deployment manifest.
Resolution
In this situation, a potential solution is to inspect the deployment on the member cluster; in this example, doing so reveals that the root cause of the issue is a bad image name. After the bad image name is identified, you can correct the deployment manifest and update it. After you fix and update the resource manifest, the placement object (ClusterResourcePlacement or ResourcePlacement) automatically propagates the corrected resource to the member cluster.
For all other situations, make sure that the propagated resource is configured correctly. Additionally, verify that the selected cluster has sufficient available capacity to accommodate the new resources.
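When this happens, inspecting the resource directly on the member cluster usually reveals the root cause; the context and names below follow this example and should be replaced with your own:
kubectl config use-context kind-cluster-1
kubectl get deployment my-deployment -n test-ns
kubectl describe deployment my-deployment -n test-ns
kubectl get pods -n test-ns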
General Troubleshooting Notes
The troubleshooting process and Work object inspection are identical for both ClusterResourcePlacement and ResourcePlacement:
- Both use the same underlying Work API to apply resources to member clusters
- The Work object status and manifestConditions have the same structure regardless of whether they were created by a ClusterResourcePlacement or ResourcePlacement
- The Available condition in the Work status indicates whether the applied resources have become available on the member cluster
- The main difference is the scope: ClusterResourcePlacement is cluster-scoped and can select both cluster-scoped and namespace-scoped resources, while ResourcePlacement is namespace-scoped and can only select namespace-scoped resources within its own namespace
9 - CRP Drift Detection and Configuration Difference Check Unexpected Result TSG
This document helps you troubleshoot unexpected drift and configuration difference detection results when using the KubeFleet CRP API.
Note
If you are looking for troubleshooting steps on diff reporting failures, i.e., when the ClusterResourcePlacementDiffReported or ResourcePlacementDiffReported condition on your placement object is set to False, see the Diff Reporting Failure TSG instead.
Note
This document focuses on unexpected drift and configuration difference detection results. If you have encountered drift and configuration difference detection failures (e.g., no detection results at all with the ClusterResourcePlacementApplied condition being set to False with a detection-related error), see the Work-Application Failure TSG instead.
Common scenarios
A drift occurs when a non-KubeFleet agent modifies a KubeFleet-managed resource (i.e.,
a resource that has been applied by KubeFleet). Drift details are reported in the CRP status
on a per-cluster basis (.status.placementStatuses[*].driftedPlacements field).
Drift detection is always on when your CRP uses a ClientSideApply (default) or ServerSideApply typed apply strategy; however, note the following limitations:
- When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only detect drifts in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. A non-KubeFleet agent can then add a field (e.g., a label or an annotation) to the resource without KubeFleet complaining about it. To check for such changes (field additions), use the fullComparison option for the comparisonOption field (see the apply strategy sketch after this list).
- Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that process KubeFleet-managed resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as drifts. For any unexpected drift reportings, verify first whether you have installed a source that triggers the changes.
- When you set the whenToApply setting (.spec.strategy.applyStrategy.whenToApply field) to Always and the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, no drifts will ever be found, as apply ops from KubeFleet will overwrite any drift in managed fields, and drifts in unmanaged fields are always ignored.
- Drift detection does not apply to resources that are not yet managed by KubeFleet. If a resource has not been created on the hub cluster or has not been selected by the CRP API, there will not be any drift reportings about it, even if the resource lives within a KubeFleet-managed namespace. Similarly, if KubeFleet has been blocked from taking over a pre-existing resource due to your takeover setting (.spec.strategy.applyStrategy.whenToTakeOver field), no drift detection will run on the resource.
- Resource deletion is not considered a drift; if a KubeFleet-managed resource has been deleted by a non-KubeFleet agent, KubeFleet will attempt to re-create it as soon as it finds out about the deletion.
- Drift detection will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, drifts on the member cluster side might have been overwritten.
- When a rollout is in progress, drifts will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.
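For reference, a minimal apply strategy sketch built only from the fields referenced above, which keeps drift detection on and also reports additions in unmanaged fields:
spec:
  strategy:
    applyStrategy:
      type: ClientSideApply            # default apply strategy; drift detection is on
      comparisonOption: fullComparison # also compare fields not managed from the hub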
KubeFleet will check for configuration differences under the following two conditions:
- When KubeFleet encounters a pre-existing resource, and the whenToTakeOver setting (.spec.strategy.applyStrategy.whenToTakeOver field) is set to IfNoDiff.
- When the CRP uses an apply strategy of the ReportDiff type.
Configuration difference details are reported in the CRP status
on a per-cluster basis (.status.placementStatuses[*].diffedPlacements field). Note that the
following limitations apply:
- When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only check for configuration differences in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. Unmanaged fields, such as additional labels and annotations, will not be considered as configuration differences. To check for such changes (field additions), use the fullComparison option for the comparisonOption field.
- Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that process resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as configuration differences. For any unexpected configuration difference reportings, verify first whether you have installed a source that triggers the changes.
- KubeFleet checks for configuration differences regardless of resource ownerships; resources not managed by KubeFleet will also be checked.
- The absence of a resource will be considered as a configuration difference.
- Configuration differences will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, configuration difference check will be re-run based on the newer versions of resources.
- When a rollout is in progress, configuration differences will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.
Note also that drift detection and configuration difference check in KubeFleet run periodically. The reportings in the CRP status might not be up-to-date.
Investigation steps
If you find an unexpected drift detection or configuration difference check result on a member cluster, follow the steps below for investigation:
- Double-check the apply strategy of your CRP; confirm that your settings allow proper drift detection and/or configuration difference check reportings.
- Verify that rollout has completed on all member clusters; see the CRP Rollout Failure TSG for more information.
- Log onto your member cluster and retrieve the resources with unexpected reportings.
  - Check if its generation (.metadata.generation field) matches with the observedInMemberClusterGeneration value in the drift detection and/or configuration difference check reportings. A mismatch might signal that the reportings are not yet up-to-date; they should get refreshed soon.
  - The kubectl.kubernetes.io/last-applied-configuration annotation and/or the .metadata.managedFields field might have some relevant information on which agents have attempted to update/patch the resource. KubeFleet changes are executed under the name work-api-agent; if you see other manager names, check whether they come from a known source (e.g., a Kubernetes controller) in your cluster (example commands follow this list).
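For example, a couple of commands for this step; the resource kind, names, and context below are placeholders, only the kubectl flags are real:
kubectl --context [YOUR-MEMBER-CLUSTER-CONTEXT] get deployment [NAME] -n [NAMESPACE] -o jsonpath='{.metadata.generation}{"\n"}'
kubectl --context [YOUR-MEMBER-CLUSTER-CONTEXT] get deployment [NAME] -n [NAMESPACE] --show-managed-fields -o yaml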
File an issue to the KubeFleet team if you believe that the unexpected reportings come from a bug in KubeFleet.
10 - Diff Reporting Failure TSG
This document helps you troubleshoot diff reporting failures when using KubeFleet placement APIs (ClusterResourcePlacement or ResourcePlacement),
specifically when you find that the ClusterResourcePlacementDiffReported (for ClusterResourcePlacement) or ResourcePlacementDiffReported (for ResourcePlacement) status condition has been
set to False in the placement status.
Note
If you are looking for troubleshooting steps on unexpected drift detection and/or configuration difference detection results, see the Drift Detection and Configuration Difference Detection Failure TSG instead.
Note
The ClusterResourcePlacementDiffReported or ResourcePlacementDiffReported status condition will only be set if the placement has an apply strategy of the ReportDiff type. If your placement uses a ClientSideApply (default) or ServerSideApply typed apply strategy, it is perfectly normal if the diff reported status condition is absent in the placement status.
Common scenarios
The ClusterResourcePlacementDiffReported or ResourcePlacementDiffReported status condition will be set to False if KubeFleet cannot complete
the configuration difference checking process for one or more of the selected resources.
Depending on your placement configuration, KubeFleet might use one of the three approaches for configuration difference checking:
- If the resource cannot be found on a member cluster, KubeFleet will simply report a full object difference.
- If you ask KubeFleet to perform partial comparisons, i.e., the comparisonOption field in the placement apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to partialComparison, KubeFleet will perform a dry-run apply op (server-side apply with conflict overriding enabled) and compare the returned apply result against the current state of the resource on the member cluster side for configuration differences.
- If you ask KubeFleet to perform full comparisons, i.e., the comparisonOption field in the placement apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to fullComparison, KubeFleet will directly compare the given manifest (the resource created on the hub cluster side) against the current state of the resource on the member cluster side for configuration differences.
Failures might arise if:
- The dry-run apply op does not complete successfully; or
- An unexpected error occurs during the comparison process, such as a JSON path parsing/evaluation error.
- In this case, please consider filing a bug to the KubeFleet team.
Investigation steps
If you encounter such a failure, follow the steps below for investigation:
- Identify the specific resources that have failed in the diff reporting process first. In the placement status, find out the individual member clusters that have diff reporting failures: inspect the .status.placementStatuses field of the placement object; each entry corresponds to a member cluster, and for each entry, check if it has a status condition, ClusterResourcePlacementDiffReported (for ClusterResourcePlacement) or ResourcePlacementDiffReported (for ResourcePlacement), in the .status.placementStatuses[*].conditions field, which has been set to False. Write down the name of the member cluster.
- For each cluster name that has been written down, list all the work objects that have been created for the cluster in correspondence with the placement object:
# For ClusterResourcePlacement:
# Replace [YOUR-CLUSTER-NAME] and [YOUR-CRP-NAME] with values of your own.
kubectl get work -n fleet-member-[YOUR-CLUSTER-NAME] -l kubernetes-fleet.io/parent-CRP=[YOUR-CRP-NAME]

# For ResourcePlacement:
# Replace [YOUR-CLUSTER-NAME] and [YOUR-RP-NAME] with values of your own.
kubectl get work -n fleet-member-[YOUR-CLUSTER-NAME] -l kubernetes-fleet.io/parent-CRP=[YOUR-RP-NAME]
- For each found work object, inspect its status. The .status.manifestConditions field is an array in which each item explains the processing result of a resource on the given member cluster. Find all items with a ClusterResourcePlacementDiffReported or ResourcePlacementDiffReported condition in the .status.manifestConditions[*].conditions field that has been set to False; the .status.manifestConditions[*].identifier field tells the GVK, namespace, and name of the failing resource (a jsonpath sketch follows this list).
- Read the message field of the ClusterResourcePlacementDiffReported or ResourcePlacementDiffReported condition (.status.manifestConditions[*].conditions[*].message); KubeFleet will include the details about the diff reporting failures in the field.
- If you are familiar with the cause of the error (for example, a dry-run apply op failing due to API server traffic control measures), fixing the cause (tweaking traffic control limits) should resolve the failure. KubeFleet will periodically retry diff reporting in the face of failures. Otherwise, file an issue to the KubeFleet team.
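As a convenience, a jsonpath query that lists each manifest identifier together with its condition types for a given work object; the work and cluster names are placeholders:
kubectl get work [YOUR-WORK-NAME] -n fleet-member-[YOUR-CLUSTER-NAME] \
  -o jsonpath='{range .status.manifestConditions[*]}{.identifier.kind}{"/"}{.identifier.name}{": "}{.conditions[*].type}{"\n"}{end}'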
11 - Staged Update Run TSG
This guide provides troubleshooting steps for common issues related to both cluster-scoped and namespace-scoped Staged Update Runs.
Note: To get more information about failures, you can check the updateRun controller logs.
Cluster-Scoped Troubleshooting
CRP status without Staged Update Run
When a ClusterResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately.
A sample status of such ClusterResourcePlacement is as follows:
$ kubectl describe crp example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Selected Resources:
...
Events: <none>
The SchedulingPolicyFulfilled condition indicates the CRP has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started.
The Placement Statuses section displays the detailed status of each cluster. Both selected clusters are in the Scheduled state, but the RolloutStarted condition is still Unknown because the rollout has not kicked off yet.
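The rollout only starts once a ClusterStagedUpdateRun referencing this CRP is created. A rough sketch is shown below; the apiVersion and field names (placementName, resourceSnapshotIndex, stagedRolloutStrategyName) as well as the example-strategy ClusterStagedUpdateStrategy are assumptions, so check the staged update run how-to for the exact schema of your Fleet version:
apiVersion: placement.kubernetes-fleet.io/v1beta1   # assumption: API group/version
kind: ClusterStagedUpdateRun
metadata:
  name: example-run
spec:
  placementName: example-placement        # the CRP shown above
  resourceSnapshotIndex: "0"              # the resource snapshot to roll out
  stagedRolloutStrategyName: example-strategy   # hypothetical ClusterStagedUpdateStrategy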
Investigate ClusterStagedUpdateRun initialization failure
An updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 1 0 False 2s
The INITIALIZED field is False, indicating the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe csur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates the initialization failed. The condition message gives more details about the failure.
In this case, a non-existing resource snapshot index 1 was used for the updateRun.
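To fix this, list the resource snapshots that actually exist for the CRP and pick a valid index; the parent-CRP label below is assumed to be set on snapshot objects in the same way it is set on work objects:
kubectl get clusterresourcesnapshots -l kubernetes-fleet.io/parent-CRP=example-placement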
Investigate ClusterStagedUpdateRun execution failure
An updateRun execution failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 0 0 True False 24m
The SUCCEEDED field is False, indicating the execution failure.
An updateRun execution failure is mainly caused by two scenarios:
- When the updateRun controller is triggered to reconcile an in-progress updateRun, it starts with a series of validations, including retrieving the CRP and checking its rollout strategy, gathering all the bindings, and regenerating the execution plan. If any failure happens during validation, the updateRun execution fails with the corresponding validation error:
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:21Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T22:15:23Z"
    message: 'cannot continue the ClusterStagedUpdateRun: failed to execute the clusterStagedUpdateRun:
      failed to process the request due to a client error: parent clusterResourcePlacement not found'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded
In the above case, the CRP referenced by the updateRun is deleted during the execution. The updateRun controller detects this and aborts the release.
status to
RolloutStarted. It then waits for default 15 seconds and check whether the resources have been successfully applied by checking the binding again. In case that there are multiple concurrent updateRuns, and during the 15-second wait, some other updateRun preempts and updates the binding with new configuration, current updateRun detects and fails with clear error message.
Thestatus: conditions: - lastTransitionTime: "2025-05-13T21:10:58Z" message: ClusterStagedUpdateRun initialized successfully observedGeneration: 1 reason: UpdateRunInitializedSuccessfully status: "True" type: Initialized - lastTransitionTime: "2025-05-13T21:11:13Z" message: The stages are aborted due to a non-recoverable error observedGeneration: 1 reason: UpdateRunFailed status: "False" type: Progressing - lastTransitionTime: "2025-05-13T21:11:13Z" message: 'cannot continue the ClusterStagedUpdateRun: unexpected behavior which cannot be handled by the controller: the clusterResourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status: binding spec diff: binding has different resourceSnapshotName, want: example-placement-0-snapshot, got: example-placement-1-snapshot; binding state (want Bound): Bound; binding RolloutStarted (want true): true, please check if there is concurrent clusterStagedUpdateRun' observedGeneration: 1 reason: UpdateRunFailed status: "False" type: SucceededSucceededcondition is set toFalsewith reasonUpdateRunFailed. In themessage, we showmember1cluster instagingstage gets preempted, and theresourceSnapshotNamefield is changed fromexample-placement-0-snapshottoexample-placement-1-snapshotwhich means probably some other updateRun is rolling out a newer resource version. The message also prints current binding state and ifRolloutStartedcondition is set to true. The message gives a hint about whether these is a concurrent clusterStagedUpdateRun running. Upon such failure, the user can list updateRuns or check the binding state:The binding is named askubectl get clusterresourcebindings NAME WORKSYNCHRONIZED RESOURCESAPPLIED AGE example-placement-member1-2afc7d7f True True 51m example-placement-member2-fc081413 51m<crp-name>-<cluster-name>-<suffix>. Since the error message saysmember1cluster fails the updateRun, we can check its binding:As the bindingkubectl get clusterresourcebindings example-placement-member1-2afc7d7f -o yaml ... spec: ... resourceSnapshotName: example-placement-1-snapshot schedulingPolicySnapshotName: example-placement-0 state: Bound targetCluster: member1 status: conditions: - lastTransitionTime: "2025-05-13T21:11:06Z" message: 'Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 1, clusterStagedUpdateRun: example-run-1' observedGeneration: 3 reason: RolloutStarted status: "True" type: RolloutStarted ...RolloutStartedcondition shows, it’s updated by another updateRunexample-run-1.
An updateRun that is aborted due to an execution failure is not recoverable at the moment. If the failure happens due to a validation error, you can fix the issue and create a new updateRun. If preemption happens, in most cases the user is releasing a new resource version, and they can just let the new updateRun run to completion.
Investigate ClusterStagedUpdateRun rollout stuck
A ClusterStagedUpdateRun can get stuck when resource placement fails on some clusters. Getting the updateRun will show the cluster name and stage that is in stuck state:
$ kubectl get csur example-run -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: ClusterStagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check crp status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
The message shows that the updateRun is stuck waiting for the cluster member1 in stage staging to finish updating.
The updateRun controller rolls resources out to a member cluster by updating its corresponding binding. It then checks periodically
whether the update has completed or not. If the binding is still not available after the current default of 5 minutes, the updateRun
controller decides the rollout is stuck and reports the condition.
This usually indicates that something went wrong on the cluster or that the resources themselves have an issue. To investigate further, you can check the ClusterResourcePlacement status:
$ kubectl describe crp example-placement
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-05-13T23:15:35Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-05-13T23:15:35Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object example-placement-work-configmap-c5971133-2779-4f6f-8681-3e05c4458c82 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-nginx-deploy
Namespace: test-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: nginx
Namespace: test-namespace
Version: v1
...
The Available condition is False and reports that not all work objects are available yet. In the Failed Placements section, it shows that
the nginx deployment wrapped by the envelope-nginx-deploy configMap is not ready. Checking the member1 cluster, we can see
there's an image pull failure:
kubectl config use-context member1
kubectl get deploy -n test-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 16m
kubectl get pods -n test-namespace
NAME READY STATUS RESTARTS AGE
nginx-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
For more debugging instructions, you can refer to ClusterResourcePlacement TSG.
After resolving the issue, you can always create a new updateRun to restart the rollout. Stuck updateRuns can be deleted.
Namespace-Scoped Troubleshooting
Namespace-scoped StagedUpdateRun troubleshooting is a mirror image of cluster-scoped ClusterStagedUpdateRun troubleshooting. The concepts, failure patterns, and diagnostic approaches are exactly the same - only the resource names, scopes, and kubectl commands differ.
Both follow identical troubleshooting patterns:
- Initialization failures: Missing resource snapshots, invalid configurations
- Execution failures: Validation errors, concurrent updateRun conflicts
- Rollout stuck scenarios: Resource placement failures, cluster connectivity issues
- Status investigation: Using kubectl get, kubectl describe, and checking placement status
- Recovery approaches: Creating new updateRuns, cleaning up stuck resources
The key differences are:
- Resource scope: Namespace-scoped vs cluster-scoped resources
- Commands: Use sur (StagedUpdateRun) instead of csur (ClusterStagedUpdateRun)
- Target resources: ResourcePlacement instead of ClusterResourcePlacement
- Bindings: resourcebindings instead of clusterresourcebindings
- Approvals: approvalrequests instead of clusterapprovalrequests
ResourcePlacement status without Staged Update Run
When a namespace-scoped ResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately.
A sample status of such ResourcePlacement is as follows:
$ kubectl describe rp web-app-placement -n my-app-namespace
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
...
Events: <none>
The SchedulingPolicyFulfilled condition indicates the ResourcePlacement has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started.
Investigate StagedUpdateRun initialization failure
A namespace-scoped updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get sur web-app-rollout -n my-app-namespace
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
web-app-rollout web-app-placement 1 0 False 2s
The INITIALIZED field is False, indicating the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe sur web-app-rollout -n my-app-namespace
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the StagedUpdateRun: failed to initialize the stagedUpdateRun: failed to process the request due to a client error: no resourceSnapshots with index `1` found for resourcePlacement `web-app-placement` in namespace `my-app-namespace`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates the initialization failed. The condition message gives more details about the failure.
In this case, a non-existing resource snapshot index 1 was used for the updateRun.
Investigate StagedUpdateRun execution failure
A namespace-scoped updateRun execution failure can be easily detected by getting the resource:
$ kubectl get sur web-app-rollout -n my-app-namespace
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
web-app-rollout web-app-placement 0 0 True False 24m
The SUCCEEDED field is False, indicating the execution failure.
The execution failure scenarios are similar to cluster-scoped updateRuns:
Validation errors during reconciliation: The updateRun controller validates the ResourcePlacement, gathers bindings, and regenerates the execution plan. If any failure occurs, the updateRun execution fails:
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: StagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:21Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T22:15:23Z"
    message: 'cannot continue the StagedUpdateRun: failed to initialize the stagedUpdateRun:
      failed to process the request due to a client error: parent resourcePlacement not found'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded
Concurrent updateRun preemption: When multiple updateRuns target the same ResourcePlacement, they may conflict:
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: 'cannot continue the StagedUpdateRun: unexpected behavior which cannot be handled by the controller:
      the resourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status:
      binding spec diff: binding has different resourceSnapshotName, want: web-app-placement-0-snapshot, got: web-app-placement-1-snapshot;
      please check if there is concurrent stagedUpdateRun'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded
To investigate concurrent updateRuns, check the namespace-scoped resource bindings:
$ kubectl get resourcebindings -n my-app-namespace
NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
web-app-placement-member1-2afc7d7f   True               True               51m
web-app-placement-member2-fc081413                                         51m
Investigate StagedUpdateRun rollout stuck
A StagedUpdateRun can get stuck when resource placement fails on some clusters:
$ kubectl get sur web-app-rollout -n my-app-namespace -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: StagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check resourceplacement status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
To investigate further, check the ResourcePlacement status:
$ kubectl describe rp web-app-placement -n my-app-namespace
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, stagedUpdateRun: web-app-rollout
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object web-app-placement-work-deployment-1234 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-web-app-deploy
Namespace: my-app-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: web-app
Namespace: my-app-namespace
Version: v1
...
Check the target cluster to diagnose the specific issue:
kubectl config use-context member1
kubectl get deploy web-app -n my-app-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
web-app 0/1 1 0 16m
kubectl get pods -n my-app-namespace
NAME READY STATUS RESTARTS AGE
web-app-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
Namespace-Scoped Approval Troubleshooting
For namespace-scoped staged updates with approval gates, check for ApprovalRequest objects:
# List approval requests in the namespace
$ kubectl get approvalrequests -n my-app-namespace
NAME UPDATE-RUN STAGE APPROVED APPROVALACCEPTED AGE
web-app-rollout-prod-clusters web-app-rollout prod False 2m
# Approve the request
$ kubectl patch approvalrequests web-app-rollout-prod-clusters -n my-app-namespace --type='merge' \
-p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"approved","message":"approved"}]}}' \
--subresource=status
After resolving issues, create a new updateRun to restart the rollout. Stuck updateRuns can be deleted safely.
12 - ClusterResourcePlacementEviction TSG
This guide provides troubleshooting steps for issues related to placement eviction.
An eviction object, once created, is ideally reconciled only once and reaches a terminal state. The terminal states for an eviction are:
- Eviction is Invalid
- Eviction is Valid, Eviction failed to Execute
- Eviction is Valid, Eviction executed successfully
Note: If an eviction object doesn’t reach a terminal state, i.e. neither the Valid condition nor the Executed condition is set, it is likely due to a failure in the reconciliation process, such as the controller being unable to reach the API server.
The first step in troubleshooting is to check the status of the eviction object to understand if the eviction reached a terminal state or not.
Invalid eviction
Missing/Deleting CRP object
Example status with missing CRP object:
status:
conditions:
- lastTransitionTime: "2025-04-17T22:16:59Z"
message: Failed to find ClusterResourcePlacement targeted by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
Example status with deleting CRP object:
status:
conditions:
- lastTransitionTime: "2025-04-21T19:53:42Z"
message: Found deleting ClusterResourcePlacement targeted by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In both cases the Eviction object reached a terminal state: its status has the Valid condition set to False.
The user should verify whether the ClusterResourcePlacement object is missing or being deleted, recreate the
ClusterResourcePlacement object if needed, and retry eviction.
Missing CRB object
Example status with missing CRB object:
status:
conditions:
- lastTransitionTime: "2025-04-17T22:21:51Z"
message: Failed to find scheduler decision for placement in cluster targeted by
eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:
kubectl get clusterresourcebindings -l kubernetes-fleet.io/parent-CRP=<CRPName>
The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-<randomsuffix>.
In this case the Eviction object reached a terminal state, with its Valid condition set to False, because the
ClusterResourceBinding object or placement for the target cluster is not found. The user should verify whether the
ClusterResourcePlacement object is propagating resources to the target cluster:
- If yes, the next step is to check whether the ClusterResourceBinding object is present for the target cluster, or why it was not created, and create an eviction object once the ClusterResourceBinding exists.
- If no, the cluster is not picked by the scheduler and hence there is no need to retry eviction.
Multiple CRB objects are present
Example status with multiple CRB objects:
status:
conditions:
- lastTransitionTime: "2025-04-17T23:48:08Z"
message: Found more than one scheduler decision for placement in cluster targeted
by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In this case the Eviction object reached a terminal state, with its Valid condition set to False, because
there is more than one ClusterResourceBinding object or placement present for the ClusterResourcePlacement object
targeting the member cluster. This is a rare scenario: it's an in-between state where bindings are being re-created due
to the member cluster being selected again, and it will normally resolve quickly.
PickFixed CRP is targeted by CRP Eviction
Example status for ClusterResourcePlacementEviction object targeting a PickFixed ClusterResourcePlacement object:
status:
conditions:
- lastTransitionTime: "2025-04-21T23:19:06Z"
message: Found ClusterResourcePlacement with PickFixed placement type targeted
by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In this case the Eviction object reached a terminal state, its status has Valid condition set to False, because
the ClusterResourcePlacement object is of type PickFixed. Users cannot use ClusterResourcePlacementEviction
objects to evict resources propagated by ClusterResourcePlacement objects of type PickFixed. The user can instead
remove the member cluster name from the clusterNames field in the policy of the ClusterResourcePlacement object.
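For example, instead of creating an eviction, the PickFixed CRP from the earlier case study could be updated so that the cluster to drain is no longer listed:
spec:
  policy:
    placementType: PickFixed
    clusterNames:
    - kind-cluster-2   # kind-cluster-1 removed; its placed resources will be withdrawn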
Failed to execute eviction
Eviction blocked because placement is missing
status:
conditions:
- lastTransitionTime: "2025-04-23T23:54:03Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-23T23:54:03Z"
message: Eviction is blocked, placement has not propagated resources to target
cluster yet
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state, with its Executed condition set to False, because
for the targeted ClusterResourcePlacement the corresponding ClusterResourceBinding object's state is still
Scheduled, meaning the rollout of resources has not started yet.
Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:
kubectl get clusterresourcebindings -l kubernetes-fleet.io/parent-CRP=<CRPName>
The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-<randomsuffix>.
spec:
applyStrategy:
type: ClientSideApply
clusterDecision:
clusterName: kind-cluster-3
clusterScore:
affinityScore: 0
priorityScore: 0
reason: 'Successfully scheduled resources for placement in "kind-cluster-3" (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
selected: true
resourceSnapshotName: ""
schedulingPolicySnapshotName: test-crp-1
state: Scheduled
targetCluster: kind-cluster-3
Here the user can wait for the ClusterResourceBinding object to be updated to the Bound state, which means that
resources have been propagated to the target cluster, and then retry eviction. In some cases this can take a while or not
happen at all; in that case the user should verify whether the rollout is stuck for the ClusterResourcePlacement object.
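A quick way to poll the binding state while waiting (the binding name is a placeholder following the naming scheme above):
kubectl get clusterresourcebindings <CRPName>-<ClusterName>-<randomsuffix> -o jsonpath='{.spec.state}{"\n"}'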
Eviction blocked by Invalid CRPDB
Example status for ClusterResourcePlacementEviction object with invalid ClusterResourcePlacementDisruptionBudget,
status:
conditions:
- lastTransitionTime: "2025-04-21T23:39:42Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-21T23:39:42Z"
message: Eviction is blocked by misconfigured ClusterResourcePlacementDisruptionBudget,
either MaxUnavailable is specified or MinAvailable is specified as a percentage
for PickAll ClusterResourcePlacement
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state, with its Executed condition set to False, because
the ClusterResourcePlacementDisruptionBudget object is invalid. For ClusterResourcePlacement objects of type
PickAll, when specifying a ClusterResourcePlacementDisruptionBudget, the minAvailable field should be set to an
absolute number and not a percentage, and the maxUnavailable field should not be set, since the total number of
placements is non-deterministic.
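For reference, a minimal sketch of a budget that would be accepted for a PickAll placement; the budget is assumed to share its name with the target ClusterResourcePlacement, as in the example further below:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  name: pick-all-crp
spec:
  minAvailable: 1   # absolute number; maxUnavailable left unset for PickAll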
Eviction blocked by specified CRPDB
Example status for ClusterResourcePlacementEviction object blocked by a ClusterResourcePlacementDisruptionBudget
object,
status:
conditions:
- lastTransitionTime: "2025-04-24T18:54:30Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-24T18:54:30Z"
message: 'Eviction is blocked by specified ClusterResourcePlacementDisruptionBudget,
availablePlacements: 2, totalPlacements: 2'
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state, with its Executed condition set to False, because
the ClusterResourcePlacementDisruptionBudget object is blocking the eviction. The message from the Executed condition
reports 2 available placements out of 2 total placements, which means that the ClusterResourcePlacementDisruptionBudget
is protecting all placements propagated by the ClusterResourcePlacement object.
Taking a look at the ClusterResourcePlacementDisruptionBudget object,
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"placement.kubernetes-fleet.io/v1beta1","kind":"ClusterResourcePlacementDisruptionBudget","metadata":{"annotations":{},"name":"pick-all-crp"},"spec":{"minAvailable":2}}
creationTimestamp: "2025-04-24T18:47:22Z"
generation: 1
name: pick-all-crp
resourceVersion: "1749"
uid: 7d3a0ac5-0225-4fb6-b5e9-fc28d58cefdc
spec:
minAvailable: 2
We can see that the minAvailable is set to 2, which means that at least 2 placements should be available for the
ClusterResourcePlacement object.
Let’s take a look at the ClusterResourcePlacement object’s status to verify the list of available placements,
status:
conditions:
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: found all cluster needed as specified by the scheduling policy, found
2 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: Works(s) are succcesfully created or updated in 2 target cluster(s)'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: The selected resources are successfully applied to 2 cluster(s)
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: The selected resources in 2 cluster(s) are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: 'Successfully scheduled resources for placement in "kind-cluster-1"
(affinity score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: 'Successfully scheduled resources for placement in "kind-cluster-2"
(affinity score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
selectedResources:
- kind: Namespace
name: test-ns
version: v1
From the status we can see that the ClusterResourcePlacement object has 2 available placements, where resources have
been successfully applied and are available in kind-cluster-1 and kind-cluster-2. Users can check the individual
member clusters to verify that the resources are available, but it is recommended to check the ClusterResourcePlacement
object’s status instead, since the status is aggregated and kept up to date by the controller.
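For a quick spot check on a member cluster, you could, for example, look for the placed namespace directly
(a sketch; assumes your kubeconfig has a context pointing at the member cluster):
kubectl get namespace test-ns --context <member-cluster-context>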
Here the user can either remove the ClusterResourcePlacementDisruptionBudget object or lower minAvailable to
1 to allow the ClusterResourcePlacementEviction object to execute successfully. In general, the user should carefully
check the availability of placements and act accordingly before changing the ClusterResourcePlacementDisruptionBudget
object.
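For example, assuming the budget is named pick-all-crp, you can lower the budget:
kubectl patch clusterresourceplacementdisruptionbudget pick-all-crp --type merge -p '{"spec":{"minAvailable":1}}'
or remove it entirely:
kubectl delete clusterresourceplacementdisruptionbudget pick-all-crp
Note that an Eviction object that has already reached a terminal state is not re-executed, so create a new
ClusterResourcePlacementEviction object afterwards to retry the eviction.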