KubeFleet documentation features a number of troubleshooting guides to help you identify and fix KubeFleet issues you encounter. Pick one below to proceed.
Troubleshooting Guides
- 1: ClusterResourcePlacement TSG
- 2: CRP Schedule Failure TSG
- 3: CRP Rollout Failure TSG
- 4: CRP Override Failure TSG
- 5: CRP Work-Synchronization Failure TSG
- 6: CRP Work-Application Failure TSG
- 7: CRP Availability Failure TSG
- 8: CRP Drift Detection and Configuration Difference Check Unexpected Result TSG
- 9: CRP Diff Reporting Failure TSG
- 10: ClusterStagedUpdateRun TSG
- 11: ClusterResourcePlacementEviction TSG
1 - ClusterResourcePlacement TSG
This TSG is meant to help you troubleshoot issues with the ClusterResourcePlacement API in Fleet.
Cluster Resource Placement
Internal objects to keep in mind when troubleshooting CRP-related errors on the hub cluster:
ClusterResourceSnapshot
ClusterSchedulingPolicySnapshot
ClusterResourceBinding
Work
Please read the Fleet API reference for more details about each object.
Complete Progress of the ClusterResourcePlacement
Understanding the progression and the status of the ClusterResourcePlacement
custom resource is crucial for diagnosing and identifying failures.
You can view the status of the ClusterResourcePlacement
custom resource by using the following command:
kubectl describe clusterresourceplacement <name>
The complete progression of ClusterResourcePlacement
is as follows:
ClusterResourcePlacementScheduled: Indicates a resource has been scheduled for placement.
- If this condition is false, refer to the CRP Schedule Failure TSG.
ClusterResourcePlacementRolloutStarted: Indicates the rollout process has begun.
- If this condition is false, refer to the CRP Rollout Failure TSG.
- If you are triggering a rollout with a staged update run, refer to the ClusterStagedUpdateRun TSG.
ClusterResourcePlacementOverridden: Indicates the resource has been overridden.
- If this condition is false, refer to the CRP Override Failure TSG.
ClusterResourcePlacementWorkSynchronized: Indicates the work objects have been synchronized.
- If this condition is false, refer to the CRP Work-Synchronization Failure TSG.
ClusterResourcePlacementApplied: Indicates the resource has been applied. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
- If this condition is false, refer to the CRP Work-Application Failure TSG.
ClusterResourcePlacementAvailable: Indicates the resource is available. This condition is only populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
- If this condition is false, refer to the CRP Availability Failure TSG.
ClusterResourcePlacementDiffReported: Indicates whether diff reporting has completed on all resources. This condition is only populated if the apply strategy in use is of the type ReportDiff.
- If this condition is false, refer to the CRP Diff Reporting Failure TSG for more information.
How can I debug if some clusters are not selected as expected?
Check the status of the ClusterSchedulingPolicySnapshot
to determine which clusters were selected along with the reason.
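For example, one quick way to pull up the latest snapshot (a sketch using the same label selector shown later in this guide; replace {CRPName} with your ClusterResourcePlacement name):
kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName} -o yaml
The targetClusters list in its status shows which clusters were selected and the reason for each decision.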
How can I debug if a selected cluster does not have the expected resources on it or if CRP doesn’t pick up the latest changes?
Please check the following cases:
- Check whether the ClusterResourcePlacementRolloutStarted condition in the ClusterResourcePlacement status is set to true or false.
  - If false, see the CRP Rollout Failure TSG.
  - If true, check whether the ClusterResourcePlacementApplied condition is set to unknown, false, or true.
    - If unknown, wait for the process to finish, as the resources are still being applied to the member cluster. If the state remains unknown for a while, create an issue, as this is unusual behavior.
    - If false, refer to the CRP Work-Application Failure TSG.
    - If true, verify that the resource exists on the hub cluster.
We can also take a look at the placementStatuses section in the ClusterResourcePlacement status for that particular cluster. In placementStatuses, the failedPlacements section lists the reasons why resources failed to apply.
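For example, to pull just the per-cluster placement statuses (a sketch using standard kubectl JSONPath; replace <name> with your ClusterResourcePlacement name):
kubectl get clusterresourceplacement <name> -o jsonpath='{.status.placementStatuses}'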
How can I debug if the drift detection result or the configuration difference check result are different from my expectations?
See the Drift Detection and Configuration Difference Check Unexpected Result TSG for more information.
How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?
To find the latest ClusterSchedulingPolicySnapshot
for a ClusterResourcePlacement
resource, run the following command:
kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
Then, compare the ClusterSchedulingPolicySnapshot
with the ClusterResourcePlacement
policy to make sure that they match, excluding the numberOfClusters
field from the ClusterResourcePlacement
spec.
If the placement type is PickN, check whether the number of clusters that’s requested in the ClusterResourcePlacement policy matches the value of the kubernetes-fleet.io/number-of-clusters annotation on the snapshot.
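For example, one way to read that annotation off the latest snapshot (a sketch reusing the label selector from the previous command):
kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName} -o yaml | grep number-of-clusters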
How can I find the latest ClusterResourceBinding resource?
The following command lists all ClusterResourceBindings instances that are associated with the ClusterResourcePlacement:
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
Example
In this case, we have a ClusterResourcePlacement called test-crp.
- List the ClusterResourcePlacement to get the name of the CRP:
kubectl get crp test-crp
NAME GEN SCHEDULED SCHEDULEDGEN APPLIED APPLIEDGEN AGE
test-crp 1 True 1 True 1 15s
- Run the following command to view the status of the ClusterResourcePlacement:
kubectl describe clusterresourceplacement test-crp
- Here’s an example output. From the placementStatuses section of the test-crp status, notice that it has distributed resources to two member clusters and, therefore, has two ClusterResourceBindings instances:
status:
conditions:
- lastTransitionTime: "2023-11-23T00:49:29Z"
...
placementStatuses:
- clusterName: kind-cluster-1
conditions:
...
type: ResourceApplied
- clusterName: kind-cluster-2
conditions:
...
reason: ApplySucceeded
status: "True"
type: ResourceApplied
- To get the
ClusterResourceBindings
value, run the following command:
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp
- The output lists all ClusterResourceBindings instances that are associated with test-crp.
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp
NAME WORKCREATED RESOURCESAPPLIED AGE
test-crp-kind-cluster-1-be990c3e True True 33s
test-crp-kind-cluster-2-ec4d953c True True 33s
The ClusterResourceBinding resource name uses the following format: {CRPName}-{clusterName}-{suffix}.
Find the ClusterResourceBinding for the target cluster you are looking for based on the clusterName.
How can I find the latest ClusterResourceSnapshot resource?
To find the latest ClusterResourceSnapshot resource, run the following command:
kubectl get clusterresourcesnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.
How can I find the correct work resource that’s associated with ClusterResourcePlacement?
To find the correct work resource, follow these steps:
- Identify the member cluster namespace and the ClusterResourcePlacement name. The format for the namespace is fleet-member-{clusterName}.
- To get the work resource, run the following command:
kubectl get work -n fleet-member-{clusterName} -l kubernetes-fleet.io/parent-CRP={CRPName}
NOTE: In this command, replace {clusterName} and {CRPName} with the names that you identified in the first step.
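For example, using the test-crp placement and the kind-cluster-1 member cluster from the earlier example, the command looks like this:
kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=test-crp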
2 - CRP Schedule Failure TSG
The ClusterResourcePlacementScheduled
condition is set to false
when the scheduler cannot find all the clusters needed as specified by the scheduling policy.
Note: To get more information about why the scheduling fails, you can check the scheduler logs.
Common scenarios
Instances where this condition may arise:
- When the placement policy is set to PickFixed, but the specified cluster names do not match any joined member cluster name in the fleet, or the specified cluster is no longer connected to the fleet.
- When the placement policy is set to PickN, and N clusters are specified, but there are fewer than N clusters that have joined the fleet or satisfy the placement policy.
- When the ClusterResourcePlacement resource selector selects a reserved namespace.
Note: When the placement policy is set to PickAll, the ClusterResourcePlacementScheduled condition is always set to true.
Case Study
In the following example, the ClusterResourcePlacement with a PickN placement policy is trying to propagate resources to two clusters labeled env:prod.
The two clusters, named kind-cluster-1 and kind-cluster-2, have joined the fleet. However, only one member cluster, kind-cluster-1, has the label env:prod.
CRP spec:
spec:
policy:
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
numberOfClusters: 2
placementType: PickN
resourceSelectors:
...
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: Works(s) are succcesfully created or updated in the 1 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: The selected resources are successfully applied to 1 clusters
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: The selected resources in 1 cluster are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: 'kind-cluster-2 is not selected: ClusterUnschedulable, cluster does not
match with any of the required cluster affinity terms'
observedGeneration: 1
reason: ScheduleFailed
status: "False"
type: Scheduled
selectedResources:
...
The ClusterResourcePlacementScheduled condition is set to false because the goal is to select two clusters with the label env:prod, but only one member cluster possesses the correct label, as specified in clusterAffinity.
We can also take a look at the ClusterSchedulingPolicySnapshot status to figure out why the scheduler could not schedule the resource for the specified placement policy.
To learn how to get the latest ClusterSchedulingPolicySnapshot, see How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?.
The corresponding ClusterSchedulingPolicySnapshot
spec and status gives us even more information on why scheduling failed.
Latest ClusterSchedulingPolicySnapshot
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterSchedulingPolicySnapshot
metadata:
annotations:
kubernetes-fleet.io/CRP-generation: "1"
kubernetes-fleet.io/number-of-clusters: "2"
creationTimestamp: "2024-05-07T22:36:33Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-2
kubernetes-fleet.io/policy-index: "0"
name: crp-2-0
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-2
uid: 48bc1e92-a8b9-4450-a2d5-c6905df2cbf0
resourceVersion: "10090"
uid: 2137887e-45fd-4f52-bbb7-b96f39854625
spec:
policy:
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
placementType: PickN
policyHash: ZjE0Yjk4YjYyMTVjY2U3NzQ1MTZkNWRhZjRiNjQ1NzQ4NjllNTUyMzZkODBkYzkyYmRkMGU3OTI3MWEwOTkyNQ==
status:
conditions:
- lastTransitionTime: "2024-05-07T22:36:33Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: Scheduled
observedCRPGeneration: 1
targetClusters:
- clusterName: kind-cluster-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
- clusterName: kind-cluster-2
reason: ClusterUnschedulable, cluster does not match with any of the required
cluster affinity terms
selected: false
Resolution:
The solution here is to add the env:prod
label to the member cluster resource for kind-cluster-2
as well, so that the scheduler can select the cluster to propagate resources.
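For example, a minimal sketch of that fix, assuming the MemberCluster object on the hub cluster is named kind-cluster-2 and is reachable as membercluster in kubectl:
kubectl label membercluster kind-cluster-2 env=prod
Once the label is present, the scheduler can pick kind-cluster-2 and the ClusterResourcePlacementScheduled condition should flip to true.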
3 - CRP Rollout Failure TSG
When you use the ClusterResourcePlacement API object in KubeFleet to propagate resources, the selected resources aren’t rolled out in all scheduled clusters, and the ClusterResourcePlacementRolloutStarted condition status shows as False.
This TSG only applies to the RollingUpdate rollout strategy, which is the default strategy if you don’t specify one in the ClusterResourcePlacement.
If you specify the External strategy in the ClusterResourcePlacement to trigger a staged update run, refer to the ClusterStagedUpdateRun TSG instead.
Note: To get more information about why the rollout doesn’t start, you can check the rollout controller logs.
Common scenarios
Instances where this condition may arise:
- The Cluster Resource Placement rollout strategy is blocked because the
RollingUpdate
configuration is too strict.
Troubleshooting Steps
- In the ClusterResourcePlacement status section, check the placementStatuses to identify clusters with the RolloutStarted status set to False.
- Locate the corresponding ClusterResourceBinding for the identified cluster. For more information, see How can I find the latest ClusterResourceBinding resource?. This resource should indicate the status of the Work object, such as whether it was created or updated.
- Verify the values of maxUnavailable and maxSurge to ensure they align with your expectations (see the sketch after this list).
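For reference, these settings live under the rollingUpdate block of the CRP rollout strategy; a minimal sketch (the values here are illustrative, not recommendations):
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1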
Case Study
In the following example, the ClusterResourcePlacement
is trying to propagate a namespace to three member clusters.
However, during the initial creation of the ClusterResourcePlacement
, the namespace didn’t exist on the hub cluster,
and the fleet currently comprises two member clusters named kind-cluster-1
and kind-cluster-2
.
ClusterResourcePlacement spec
spec:
policy:
numberOfClusters: 3
placementType: PickN
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Works(s) are succcesfully created or updated in the 2 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: The selected resources are successfully applied to 2 clusters
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: The selected resources in 2 cluster are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
The previous output indicates that the test-ns namespace never existed on the hub cluster, and the ClusterResourcePlacement condition statuses show the following:
- ClusterResourcePlacementScheduled is set to False, as the specified policy aims to pick three clusters, but the scheduler can only accommodate placement in the two currently available and joined clusters.
- ClusterResourcePlacementRolloutStarted is set to True, as the rollout process has commenced with 2 clusters being selected.
- ClusterResourcePlacementOverridden is set to True, as no override rules are configured for the selected resources.
- ClusterResourcePlacementWorkSynchronized is set to True.
- ClusterResourcePlacementApplied is set to True.
- ClusterResourcePlacementAvailable is set to True.
To ensure seamless propagation of the namespace across the relevant clusters, proceed to create the test-ns
namespace on the hub cluster.
ClusterResourcePlacement status after namespace test-ns is created on the hub cluster
status:
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: ClusterResourcePlacementRolloutStarted
observedResourceIndex: "1"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The rollout is being blocked by the rollout strategy
observedGeneration: 1
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
selectedResources:
- kind: Namespace
name: test-ns
version: v1
Upon examination, the ClusterResourcePlacementScheduled
condition status is shown as False
.
The ClusterResourcePlacementRolloutStarted
status is also shown as False
with the message The rollout is being blocked by the rollout strategy in 2 cluster(s)
.
Next, check the latest ClusterResourceSnapshot by running the command in How can I find the latest ClusterResourceSnapshot resource?.
Latest ClusterResourceSnapshot
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
annotations:
kubernetes-fleet.io/number-of-enveloped-object: "0"
kubernetes-fleet.io/number-of-resource-snapshots: "1"
kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
creationTimestamp: "2024-05-07T23:13:51Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-3
kubernetes-fleet.io/resource-index: "1"
name: crp-3-1-snapshot
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-3
uid: b4f31b9a-971a-480d-93ac-93f093ee661f
resourceVersion: "14434"
uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
selectedResources:
- apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: test-ns
name: test-ns
spec:
finalizers:
- kubernetes
Upon inspecting the ClusterResourceSnapshot spec, the selectedResources section now shows the namespace test-ns.
Next, check whether the ClusterResourceBinding for kind-cluster-1 was updated after the namespace test-ns was created. Run the command in How can I find the latest ClusterResourceBinding resource?.
ClusterResourceBinding for kind-cluster-1
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
creationTimestamp: "2024-05-07T23:08:53Z"
finalizers:
- kubernetes-fleet.io/work-cleanup
generation: 2
labels:
kubernetes-fleet.io/parent-CRP: crp-3
name: crp-3-kind-cluster-1-7114c253
resourceVersion: "14438"
uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
applyStrategy:
type: ClientSideApply
clusterDecision:
clusterName: kind-cluster-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
resourceSnapshotName: crp-3-0-snapshot
schedulingPolicySnapshotName: crp-3-0
state: Bound
targetCluster: kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-07T23:13:51Z"
message: The resources cannot be updated to the latest because of the rollout
strategy
observedGeneration: 2
reason: RolloutNotStartedYet
status: "False"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: No override rules are configured for the selected resources
observedGeneration: 2
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All of the works are synchronized to the latest
observedGeneration: 2
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are applied
observedGeneration: 2
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:08:53Z"
message: All corresponding work objects are available
observedGeneration: 2
reason: AllWorkAreAvailable
status: "True"
type: Available
Upon inspection, the ClusterResourceBinding remains unchanged. Notably, in the spec, the resourceSnapshotName still references the old ClusterResourceSnapshot name.
This issue arises because no explicit rollingUpdate input was provided. Consequently, the default values are applied:
- The maxUnavailable value is configured to 25% x 3 (the desired number), rounded to 1.
- The maxSurge value is configured to 25% x 3 (the desired number), rounded to 1.
Why isn’t the ClusterResourceBinding updated?
Initially, when the ClusterResourcePlacement was created, two ClusterResourceBindings were generated.
However, since the rollout didn’t apply during the initial phase, the ClusterResourcePlacementRolloutStarted condition was set to True.
Upon creating the test-ns namespace on the hub cluster, the rollout controller attempted to update the two existing ClusterResourceBindings.
However, maxUnavailable was set to 1 due to the lack of member clusters, which made the RollingUpdate configuration too strict.
NOTE: During the update, if one of the bindings fails to apply, it also violates the RollingUpdate configuration, because maxUnavailable is set to 1.
Resolution
To address this issue, consider manually setting maxUnavailable to a value greater than 1 to relax the RollingUpdate configuration.
Alternatively, you can join a third member cluster.
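For example, a sketch of the first option using a merge patch against the crp-3 placement from this case study (adjust the name and value for your setup):
kubectl patch clusterresourceplacement crp-3 --type merge -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":2}}}}'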
4 - CRP Override Failure TSG
The status of the ClusterResourcePlacementOverridden
condition is set to false
when there is an Override API related issue.
Note: To get more information, look into the logs for the overrider controller (which includes the controllers for ClusterResourceOverride and ResourceOverride).
Common scenarios
Instances where this condition may arise:
- The
ClusterResourceOverride
orResourceOverride
is created with an invalid field path for the resource.
Case Study
In the following example, an attempt is made to override the cluster role secret-reader
that is being propagated by the ClusterResourcePlacement
to the selected clusters.
However, the ClusterResourceOverride
is created with an invalid path for the field within the resource.
ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
creationTimestamp: "2024-05-14T15:36:48Z"
name: secret-reader
resourceVersion: "81334"
uid: 108e6312-3416-49be-aa3d-a665c5df58b4
rules:
- apiGroups:
- ""
resources:
- secrets
verbs:
- get
- watch
- list
The ClusterRole
secret-reader
that is being propagated to the member clusters by the ClusterResourcePlacement
.
ClusterResourceOverride spec
spec:
clusterResourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: canary
jsonPatchOverrides:
- op: add
path: /metadata/labels/new-label
value: new-value
The ClusterResourceOverride
is created to override the ClusterRole
secret-reader
by adding a new label (new-label
)
that has the value new-value
for the clusters with the label env: canary
.
ClusterResourcePlacement Spec
spec:
resourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
placementType: PickN
numberOfClusters: 1
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: canary
strategy:
type: RollingUpdate
applyStrategy:
allowCoOwnership: true
ClusterResourcePlacement Status
status:
conditions:
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: Failed to override resources in 1 cluster(s)
observedGeneration: 1
reason: OverriddenFailed
status: "False"
type: ClusterResourcePlacementOverridden
observedResourceIndex: "0"
placementStatuses:
- applicableClusterResourceOverrides:
- cro-1-0
clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T16:16:18Z"
message: 'Failed to apply the override rules on the resources: add operation
does not apply: doc is missing path: "/metadata/labels/new-label": missing
value'
observedGeneration: 1
reason: OverriddenFailed
status: "False"
type: Overridden
selectedResources:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
The CRP attempted to override a propagated resource by using an applicable ClusterResourceOverrideSnapshot.
However, because the ClusterResourcePlacementOverridden condition remains false, looking at the placement status for the cluster where the Overridden condition failed offers insight into the exact cause of the failure.
In this situation, the message indicates that the override failed because the path /metadata/labels/new-label and its corresponding value are missing.
Based on the previous example of the cluster role secret-reader, you can see that the path /metadata/labels/ doesn’t exist, which means that the labels field doesn’t exist.
Therefore, a new label can’t be added.
Resolution
To successfully override the cluster role secret-reader
, correct the path and value in ClusterResourceOverride
, as shown in the following code:
jsonPatchOverrides:
- op: add
path: /metadata/labels
value:
newlabel: new-value
This will successfully add the new label newlabel
with the value new-value
to the ClusterRole
secret-reader
, as we are creating the labels
field and adding a new value newlabel: new-value
to it.
5 - CRP Work-Synchronization Failure TSG
The ClusterResourcePlacementWorkSynchronized
condition is false when the CRP has been recently updated but the associated work objects have not yet been synchronized with the changes.
Note: In addition, it may be helpful to look into the logs for the work generator controller to get more information on why the work synchronization failed.
Common Scenarios
Instances where this condition may arise:
- The controller encounters an error while trying to generate the corresponding work object.
- The enveloped object is not well formatted.
Case Study
The CRP is attempting to propagate a resource to a selected cluster, but the work object has not been updated to reflect the latest changes because the selected cluster has been terminated.
ClusterResourcePlacement Spec
spec:
resourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
name: secret-reader
version: v1
policy:
placementType: PickN
numberOfClusters: 1
strategy:
type: RollingUpdate
ClusterResourcePlacement Status
spec:
policy:
numberOfClusters: 1
placementType: PickN
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
status:
conditions:
- lastTransitionTime: "2024-05-14T18:05:04Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: There are 1 cluster(s) which have not finished creating or updating work(s)
yet
observedGeneration: 1
reason: WorkNotSynchronizedYet
status: "False"
type: ClusterResourcePlacementWorkSynchronized
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T18:05:04Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-14T18:05:05Z"
message: 'Failed to synchronize the work to the latest: works.placement.kubernetes-fleet.io
"crp1-work" is forbidden: unable to create new content in namespace fleet-member-kind-cluster-1
because it is being terminated'
observedGeneration: 1
reason: SyncWorkFailed
status: "False"
type: WorkSynchronized
selectedResources:
- kind: Namespace
name: test-ns
version: v1
In the ClusterResourcePlacement
status, the ClusterResourcePlacementWorkSynchronized
condition status shows as False
.
The message for it indicates that the work object crp1-work
is prohibited from generating new content within the namespace fleet-member-kind-cluster-1
because it’s currently terminating.
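To confirm this, you can check the state of the reserved member namespace directly on the hub cluster; its STATUS column should show Terminating:
kubectl get namespace fleet-member-kind-cluster-1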
Resolution
To address the issue at hand, there are several potential solutions:
- Modify the ClusterResourcePlacement so that a different cluster is selected.
- Delete the ClusterResourcePlacement to remove the work objects through garbage collection.
- Rejoin the member cluster. The namespace can only be regenerated after rejoining the cluster.
In other situations, you might opt to wait for the work to finish propagating.
6 - CRP Work-Application Failure TSG
The ClusterResourcePlacementApplied condition is set to false when applying the selected resources to a member cluster fails.
Note: To get more information about why the resources are not applied, you can check the work applier logs.
Common scenarios
Instances where this condition may arise:
- The resource already exists on the cluster and isn’t managed by the fleet controller.
- Another ClusterResourcePlacement deployment is already managing the resource for the selected cluster by using a different apply strategy.
- The ClusterResourcePlacement deployment doesn’t apply the manifest because of syntax errors or invalid resource configurations. This might also occur if a resource is propagated through an envelope object.
Investigation steps
- Check placementStatuses: In the ClusterResourcePlacement status section, inspect the placementStatuses to identify which clusters have the ResourceApplied condition set to false, and note down their clusterName.
- Locate the Work object in the hub cluster: Use the identified clusterName to locate the Work object associated with the member cluster. Please refer to this section to learn how to get the correct Work resource.
- Check the Work object status: Inspect the status of the Work object to understand the specific issues preventing successful resource application.
Case Study
In the following example, ClusterResourcePlacement
is trying to propagate a namespace that contains a deployment to two member clusters. However, the namespace already exists on one member cluster, specifically kind-cluster-1
.
ClusterResourcePlacement spec
policy:
clusterNames:
- kind-cluster-1
- kind-cluster-2
placementType: PickFixed
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
revisionHistoryLimit: 10
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: could not find all the clusters needed as specified by the scheduling
policy
observedGeneration: 1
reason: SchedulingPolicyUnfulfilled
status: "False"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Works(s) are succcesfully created or updated in the 2 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Failed to apply resources to 1 clusters, please check the `failedPlacements`
status
observedGeneration: 1
reason: ApplyFailed
status: "False"
type: ClusterResourcePlacementApplied
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:49Z"
message: The availability of work object crp-4-work is not trackable
observedGeneration: 1
reason: WorkNotTrackable
status: "True"
type: Available
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Work object crp-4-work is not applied
observedGeneration: 1
reason: NotAllWorkHaveBeenApplied
status: "False"
type: Applied
failedPlacements:
- condition:
lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Failed to apply manifest: failed to process the request due to a
client error: resource exists and is not managed by the fleet controller
and co-ownernship is disallowed'
reason: ManifestsAlreadyOwnedByOthers
status: "False"
type: Applied
kind: Namespace
name: test-ns
version: v1
selectedResources:
- kind: Namespace
name: test-ns
version: v1
- group: apps
kind: Deployment
name: test-nginx
namespace: test-ns
version: v1
In the ClusterResourcePlacement
status, within the failedPlacements
section for kind-cluster-1
, we get a clear message
as to why the resource failed to apply on the member cluster. In the preceding conditions
section,
the Applied
condition for kind-cluster-1
is flagged as false and shows the NotAllWorkHaveBeenApplied
reason.
This indicates that the Work object intended for the member cluster kind-cluster-1
has not been applied.
For more information, see this section.
Work status of kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Apply manifest {Ordinal:0 Group: Version:v1 Kind:Namespace Resource:namespaces
Namespace: Name:test-ns} failed'
observedGeneration: 1
reason: WorkAppliedFailed
status: "False"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: ""
observedGeneration: 1
reason: WorkAppliedFailed
status: Unknown
type: Available
manifestConditions:
- conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: 'Failed to apply manifest: failed to process the request due to a client
error: resource exists and is not managed by the fleet controller and co-ownernship
is disallowed'
reason: ManifestsAlreadyOwnedByOthers
status: "False"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Manifest is not applied yet
reason: ManifestApplyFailed
status: Unknown
type: Available
identifier:
kind: Namespace
name: test-ns
ordinal: 0
resource: namespaces
version: v1
- conditions:
- lastTransitionTime: "2024-05-07T23:32:40Z"
message: Manifest is already up to date
observedGeneration: 1
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T23:32:51Z"
message: Manifest is trackable and available now
observedGeneration: 1
reason: ManifestAvailable
status: "True"
type: Available
identifier:
group: apps
kind: Deployment
name: test-nginx
namespace: test-ns
ordinal: 1
resource: deployments
version: v1
Looking at the Work status, specifically the manifestConditions section, you can see that the namespace could not be applied, but the deployment within the namespace was propagated from the hub cluster to the member cluster.
Resolution
In this situation, a potential solution is to set allowCoOwnership to true in the apply strategy. However, note that this decision should be made by the user, because the resources might not be intended to be shared.
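A minimal sketch of that change in the CRP spec, based on the apply strategy fields shown earlier in this guide:
strategy:
  type: RollingUpdate
  applyStrategy:
    allowCoOwnership: true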
7 - CRP Availability Failure TSG
The ClusterResourcePlacementAvailable condition is false when some of the resources are not available yet. Some of the detailed failure information is placed in the failedPlacements array.
Note: To get more information about why resources are unavailable, check the work applier logs.
Common scenarios
Instances where this condition may arise:
- The member cluster doesn’t have enough resource availability.
- The deployment contains an invalid image name.
Case Study
The example output below demonstrates a scenario where the CRP propagates a deployment to a member cluster, but the deployment does not become available because it has a bad image name.
ClusterResourcePlacement spec
spec:
resourceSelectors:
- group: ""
kind: Namespace
name: test-ns
version: v1
policy:
placementType: PickN
numberOfClusters: 1
strategy:
type: RollingUpdate
ClusterResourcePlacement status
status:
conditions:
- lastTransitionTime: "2024-05-14T18:52:30Z"
message: found all cluster needed as specified by the scheduling policy, found
1 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All 1 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Works(s) are succcesfully created or updated in 1 target cluster(s)'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: The selected resources are successfully applied to 1 cluster(s)
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: The selected resources in 1 cluster(s) are still not available yet
observedGeneration: 1
reason: ResourceNotAvailableYet
status: "False"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-14T18:52:30Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Work object crp1-work is not available
observedGeneration: 1
reason: NotAllWorkAreAvailable
status: "False"
type: Available
failedPlacements:
- condition:
lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable but not available yet
observedGeneration: 1
reason: ManifestNotAvailableYet
status: "False"
type: Available
group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
version: v1
selectedResources:
- kind: Namespace
name: test-ns
version: v1
- group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
version: v1
In the ClusterResourcePlacement status, within the failedPlacements section for kind-cluster-1, we get a clear message as to why the resource is not yet available on the member cluster. In the preceding conditions section, the Available condition for kind-cluster-1 is flagged as false and shows the NotAllWorkAreAvailable reason.
This signifies that the Work object intended for the member cluster kind-cluster-1 is not yet available.
For more information, see this section.
Work status of kind-cluster-1
status:
conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Work is applied successfully
observedGeneration: 1
reason: WorkAppliedCompleted
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest {Ordinal:1 Group:apps Version:v1 Kind:Deployment Resource:deployments
Namespace:test-ns Name:my-deployment} is not available yet
observedGeneration: 1
reason: WorkNotAvailableYet
status: "False"
type: Available
manifestConditions:
- conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is already up to date
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable and available now
reason: ManifestAvailable
status: "True"
type: Available
identifier:
kind: Namespace
name: test-ns
ordinal: 0
resource: namespaces
version: v1
- conditions:
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is already up to date
observedGeneration: 1
reason: ManifestAlreadyUpToDate
status: "True"
type: Applied
- lastTransitionTime: "2024-05-14T18:52:31Z"
message: Manifest is trackable but not available yet
observedGeneration: 1
reason: ManifestNotAvailableYet
status: "False"
type: Available
identifier:
group: apps
kind: Deployment
name: my-deployment
namespace: test-ns
ordinal: 1
resource: deployments
version: v1
Check the Available
status for kind-cluster-1
. You can see that the my-deployment
deployment isn’t yet available on the member cluster.
This suggests that an issue might be affecting the deployment manifest.
Resolution
In this situation, a potential solution is to check the deployment in the member cluster because the message indicates that the root cause of the issue is a bad image name.
After this image name is identified, you can correct the deployment manifest and update it.
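For example, a sketch of how to inspect the failing workload, run against the member cluster (resource names taken from this case study):
kubectl describe deployment my-deployment -n test-ns
kubectl get pods -n test-ns
An image pull error in the pod events or statuses would confirm the bad image name.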
After you fix and update the resource manifest, the ClusterResourcePlacement
object API automatically propagates the corrected resource to the member cluster.
For all other situations, make sure that the propagated resource is configured correctly. Additionally, verify that the selected cluster has sufficient available capacity to accommodate the new resources.
8 - CRP Drift Detection and Configuration Difference Check Unexpected Result TSG
This document helps you troubleshoot unexpected drift and configuration difference detection results when using the KubeFleet CRP API.
Note
If you are looking for troubleshooting steps on diff reporting failures, i.e., when the ClusterResourcePlacementDiffReported condition on your CRP object is set to False, see the CRP Diff Reporting Failure TSG instead.
Note
This document focuses on unexpected drift and configuration difference detection results. If you have encountered drift and configuration difference detection failures (e.g., no detection results at all, with the ClusterResourcePlacementApplied condition being set to False with a detection-related error), see the CRP Work-Application Failure TSG instead.
Common scenarios
A drift occurs when a non-KubeFleet agent modifies a KubeFleet-managed resource (i.e.,
a resource that has been applied by KubeFleet). Drift details are reported in the CRP status
on a per-cluster basis (.status.placementStatuses[*].driftedPlacements
field).
Drift detection is always on when your CRP uses a ClientSideApply (default) or ServerSideApply typed apply strategy; however, note the following limitations:
- When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only detect drifts in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. A non-KubeFleet agent can then add a field (e.g., a label or an annotation) to the resource without KubeFleet complaining about it. To check for such changes (field additions), use the fullComparison option for the comparisonOption field.
- Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that will process KubeFleet-managed resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as drifts. For any unexpected drift reportings, verify first if you have installed a source that triggers the changes.
- When you set the whenToApply setting (.spec.strategy.applyStrategy.whenToApply field) to Always and the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, no drifts will ever be found, as apply ops from KubeFleet will overwrite any drift in managed fields, and drifts in unmanaged fields are always ignored.
- Drift detection does not apply to resources that are not yet managed by KubeFleet. If a resource has not been created on the hub cluster or has not been selected by the CRP API, there will not be any drift reportings about it, even if the resource lives within a KubeFleet-managed namespace. Similarly, if KubeFleet has been blocked from taking over a pre-existing resource due to your takeover setting (.spec.strategy.applyStrategy.whenToTakeOver field), no drift detection will run on the resource.
- Resource deletion is not considered as a drift; if a KubeFleet-managed resource has been deleted by a non-KubeFleet agent, KubeFleet will attempt to re-create it as soon as it finds out about the deletion.
- Drift detection will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, drifts on the member cluster side might have been overwritten.
- When a rollout is in progress, drifts will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.
KubeFleet will check for configuration differences under the following two conditions:
- When KubeFleet encounters a pre-existing resource, and the whenToTakeOver setting (.spec.strategy.applyStrategy.whenToTakeOver field) is set to IfNoDiff.
- When the CRP uses an apply strategy of the ReportDiff type.
Configuration difference details are reported in the CRP status
on a per-cluster basis (.status.placementStatuses[*].diffedPlacements
field). Note that the
following limitations apply:
- When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only check for configuration differences in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. Unmanaged fields, such as additional labels and annotations, will not be considered as configuration differences. To check for such changes (field additions), use the fullComparison option for the comparisonOption field.
- Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that will process resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as configuration differences. For any unexpected configuration difference reportings, verify first if you have installed a source that triggers the changes.
- KubeFleet checks for configuration differences regardless of resource ownerships; resources not managed by KubeFleet will also be checked.
- The absence of a resource will be considered as a configuration difference.
- Configuration differences will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, configuration difference check will be re-run based on the newer versions of resources.
- When a rollout is in progress, configuration differences will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.
Note also that drift detection and configuration difference check in KubeFleet run periodically. The reportings in the CRP status might not be up-to-date.
Investigation steps
If you find an unexpected drift detection or configuration difference check result on a member cluster, follow the steps below for investigation:
- Double-check the apply strategy of your CRP; confirm that your settings allow proper drift detection and/or configuration difference check reportings.
- Verify that rollout has completed on all member clusters; see the CRP Rollout Failure TSG for more information.
- Log onto your member cluster and retrieve the resources with unexpected reportings.
  - Check if its generation (.metadata.generation field) matches the observedInMemberClusterGeneration value in the drift detection and/or configuration difference check reportings. A mismatch might signal that the reportings are not yet up-to-date; they should get refreshed soon.
  - The kubectl.kubernetes.io/last-applied-configuration annotation and/or the .metadata.managedFields field might have some relevant information on which agents have attempted to update/patch the resource. KubeFleet changes are executed under the name work-api-agent; if you see other manager names, check if they come from a known source (e.g., a Kubernetes controller) in your cluster. See the example commands after this list.
File an issue to the KubeFleet team if you believe that the unexpected reportings come from a bug in KubeFleet.
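For example, assuming the affected resource is the nginx Deployment in the test-namespace namespace used elsewhere in this guide (substitute your own GVK, namespace, and name), you could run the following on the member cluster:
kubectl config use-context member1
# Compare this value against observedInMemberClusterGeneration in the CRP status reportings.
kubectl get deployment nginx -n test-namespace -o jsonpath='{.metadata.generation}'
# Review which field managers have touched the resource; KubeFleet changes appear under the work-api-agent manager name.
kubectl get deployment nginx -n test-namespace -o yaml --show-managed-fields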
9 - CRP Diff Reporting Failure TSG
This document helps you troubleshoot diff reporting failures when using the KubeFleet CRP API,
specifically when you find that the ClusterResourcePlacementDiffReported
status condition has been
set to False
in the CRP status.
Note
If you are looking for troubleshooting steps on unexpected drift detection and/or configuration difference check results, see the CRP Drift Detection and Configuration Difference Check Unexpected Result TSG instead.
Note
The ClusterResourcePlacementDiffReported status condition will only be set if the CRP has an apply strategy of the ReportDiff type. If your CRP uses ClientSideApply (default) or ServerSideApply typed apply strategies, it is perfectly normal if the ClusterResourcePlacementDiffReported status condition is absent in the CRP status.
Common scenarios
ClusterResourcePlacementDiffReported
status condition will be set to False
if KubeFleet cannot complete
the configuration difference checking process for one or more of the selected resources.
Depending on your CRP configuration, KubeFleet might use one of three approaches for configuration difference checking:
- If the resource cannot be found on a member cluster, KubeFleet will simply report a full object difference.
- If you ask KubeFleet to perform partial comparisons, i.e., the comparisonOption field in the CRP apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to partialComparison, KubeFleet will perform a dry-run apply op (server-side apply with conflict overriding enabled) and compare the returned apply result against the current state of the resource on the member cluster side for configuration differences (a rough manual analogy of this dry-run comparison is sketched after this list).
- If you ask KubeFleet to perform full comparisons, i.e., the comparisonOption field in the CRP apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to fullComparison, KubeFleet will directly compare the given manifest (the resource created on the hub cluster side) against the current state of the resource on the member cluster side for configuration differences.
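As a rough manual analogy of the partial-comparison path (this is not the controller's actual code path), you can reproduce a similar dry-run comparison yourself with a server-side dry-run apply against the member cluster, using a hypothetical manifest file; expect some noise from server-populated metadata.
# manifest.yaml is a hypothetical copy of the hub-side resource.
kubectl apply --server-side --dry-run=server --force-conflicts -f manifest.yaml -o yaml > dry-run-result.yaml
# Fetch the live object on the member cluster and compare.
kubectl get -f manifest.yaml -o yaml > live.yaml
diff dry-run-result.yaml live.yaml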
Failures might arise if:
- The dry-run apply op does not complete successfully; or
- An unexpected error occurs during the comparison process, such as a JSON path parsing/evaluation error.
- In this case, please consider filing a bug to the KubeFleet team.
Investigation steps
If you encounter such a failure, follow the steps below for investigation:
Identify the specific resources that have failed in the diff reporting process first. In the CRP status, find out the individual member clusters that have diff reporting failures: inspect the .status.placementStatuses field of the CRP object; each entry corresponds to a member cluster, and for each entry, check if it has a status condition, ClusterResourcePlacementDiffReported, in the .status.placementStatuses[*].conditions field, which has been set to False. Write down the name of the member cluster.
For each cluster name that has been written down, list all the work objects that have been created for the cluster in correspondence with the CRP object:
# Replace [YOUR-CLUSTER-NAME] and [YOUR-CRP-NAME] with values of your own.
kubectl get work -n fleet-member-[YOUR-CLUSTER-NAME] -l kubernetes-fleet.io/parent-CRP=[YOUR-CRP-NAME]
For each found work object, inspect its status. The .status.manifestConditions field features an array; each item explains the processing result of a resource on the given member cluster. Find out all items with a DiffReported condition in the .status.manifestConditions[*].conditions field that has been set to False. The .status.manifestConditions[*].identifier field tells the GVK, namespace, and name of the failing resource (an example command is shown after these steps).
Read the message field of the DiffReported condition (.status.manifestConditions[*].conditions[*].message); KubeFleet will include the details about the diff reporting failures in the field.
If you are familiar with the cause of the error (for example, a dry-run apply op fails due to API server traffic control measures), fixing the cause (tweaking traffic control limits) should resolve the failure. KubeFleet will periodically retry diff reporting in the face of failures. Otherwise, file an issue to the KubeFleet team.
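For example, assuming a work object named example-placement-work in the fleet-member-member1 namespace (both names hypothetical; use the names found in the previous steps), you can dump its per-resource conditions directly:
kubectl get work example-placement-work -n fleet-member-member1 -o yaml
# Or extract just the manifestConditions array:
kubectl get work example-placement-work -n fleet-member-member1 -o jsonpath='{.status.manifestConditions}'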
10 - ClusterStagedUpdateRun TSG
This guide provides troubleshooting steps for common issues related to Staged Update Run.
Note: To get more information about why the scheduling fails, you can check the updateRun controller logs.
CRP status without Staged Update Run
When a ClusterResourcePlacement
is created with spec.strategy.type
set to External
, the rollout does not start immediately.
A sample status of such ClusterResourcePlacement
is as follows:
$ kubectl describe crp example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 0
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-12T23:01:32Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: RolloutStarted
Selected Resources:
...
Events: <none>
SchedulingPolicyFulfilled
condition indicates the CRP has been fully scheduled, while RolloutStartedUnknown
condition shows that the rollout has not started.
In the Placement Statuses section, the detailed status of each cluster is displayed. Both selected clusters are in the Scheduled state, but the RolloutStarted condition is still Unknown because the rollout has not kicked off yet.
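For reference, here is a minimal sketch of the strategy stanza that puts a CRP into this externally driven rollout mode (the placement name and resource selector are hypothetical):
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: example-placement
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1
      name: test-namespace
  policy:
    placementType: PickAll
  strategy:
    type: External   # the rollout waits for a staged update run instead of starting immediately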
Investigate ClusterStagedUpdateRun initialization failure
An updateRun initialization failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 1 0 False 2s
The INITIALIZED
field is False
, indicating the initialization failed.
Describe the updateRun to get more details:
$ kubectl describe csur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-13T07:28:29Z
Message: cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
Observed Generation: 1
Reason: UpdateRunInitializedFailed
Status: False
Type: Initialized
Deletion Stage Status:
Clusters:
Stage Name: kubernetes-fleet.io/deleteStage
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
...
The condition clearly indicates that the initialization failed, and the condition message gives more details about the failure. In this case, the updateRun referenced a non-existent resource snapshot index 1.
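To see which resource snapshot indexes actually exist for the placement, you can list the ClusterResourceSnapshot objects on the hub cluster; this assumes the snapshots carry the same kubernetes-fleet.io/parent-CRP label used elsewhere in this guide:
kubectl get clusterresourcesnapshots -l kubernetes-fleet.io/parent-CRP=example-placement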
Investigate ClusterStagedUpdateRun execution failure
An updateRun execution failure can be easily detected by getting the resource:
$ kubectl get csur example-run
NAME PLACEMENT RESOURCE-SNAPSHOT-INDEX POLICY-SNAPSHOT-INDEX INITIALIZED SUCCEEDED AGE
example-run example-placement 0 0 True False 24m
The SUCCEEDED
field is False
, indicating the execution failure.
An updateRun execution failure is mainly caused by one of two scenarios:
- When the updateRun controller is triggered to reconcile an in-progress updateRun, it starts by doing a bunch of validations, including retrieving the CRP and checking its rollout strategy, gathering all the bindings, and regenerating the execution plan. If any failure happens during validation, the updateRun execution fails with the corresponding validation error. For example, in the status below, the CRP referenced by the updateRun was deleted during the execution; the updateRun controller detects this and aborts the release:
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:21Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T22:15:23Z"
    message: 'cannot continue the ClusterStagedUpdateRun: failed to initialize the
      clusterStagedUpdateRun: failed to process the request due to a client error:
      parent clusterResourcePlacement not found'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded
- The updateRun controller triggers an update to a member cluster by updating the corresponding binding spec and setting its status to RolloutStarted. It then waits for a default of 15 seconds and checks whether the resources have been successfully applied by inspecting the binding again. If there are multiple concurrent updateRuns and, during the 15-second wait, some other updateRun preempts and updates the binding with a new configuration, the current updateRun detects this and fails with a clear error message:
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:10:58Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: The stages are aborted due to a non-recoverable error
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Progressing
  - lastTransitionTime: "2025-05-13T21:11:13Z"
    message: 'cannot continue the ClusterStagedUpdateRun: unexpected behavior which cannot be handled by the controller:
      the clusterResourceBinding of the updating cluster `member1` in the stage `staging` does not have expected status:
      binding spec diff: binding has different resourceSnapshotName, want: example-placement-0-snapshot, got: example-placement-1-snapshot;
      binding state (want Bound): Bound; binding RolloutStarted (want true): true,
      please check if there is concurrent clusterStagedUpdateRun'
    observedGeneration: 1
    reason: UpdateRunFailed
    status: "False"
    type: Succeeded
The Succeeded condition is set to False with reason UpdateRunFailed. In the message, we can see that the member1 cluster in the staging stage gets preempted, and the resourceSnapshotName field is changed from example-placement-0-snapshot to example-placement-1-snapshot, which means that some other updateRun is probably rolling out a newer resource version. The message also prints the current binding state and whether the RolloutStarted condition is set to true, which gives a hint about whether there is a concurrent clusterStagedUpdateRun running. Upon such a failure, the user can list updateRuns or check the binding state:
kubectl get clusterresourcebindings
NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
example-placement-member1-2afc7d7f   True               True               51m
example-placement-member2-fc081413                                         51m
The binding is named as <crp-name>-<cluster-name>-<suffix>. Since the error message says the member1 cluster fails the updateRun, we can check its binding:
kubectl get clusterresourcebindings example-placement-member1-2afc7d7f -o yaml
...
spec:
  ...
  resourceSnapshotName: example-placement-1-snapshot
  schedulingPolicySnapshotName: example-placement-0
  state: Bound
  targetCluster: member1
status:
  conditions:
  - lastTransitionTime: "2025-05-13T21:11:06Z"
    message: 'Detected the new changes on the resources and started the rollout process,
      resourceSnapshotIndex: 1, clusterStagedUpdateRun: example-run-1'
    observedGeneration: 3
    reason: RolloutStarted
    status: "True"
    type: RolloutStarted
  ...
As the binding RolloutStarted condition shows, it is updated by another updateRun, example-run-1.
An updateRun aborted due to execution failures is not recoverable at the moment. If the failure happens due to a validation error, one can fix the issue and create a new updateRun. If preemption happens, in most cases the user is releasing a new resource version, and they can just let the new updateRun run to completion.
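To confirm whether another updateRun preempted the failed one, listing all updateRuns on the hub cluster is usually enough (csur is the short name used elsewhere in this guide):
# Look for another in-progress updateRun that targets the same placement with a newer resource snapshot index.
kubectl get csur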
Investigate ClusterStagedUpdateRun rollout stuck
A ClusterStagedUpdateRun
can get stuck when resource placement fails on some clusters. Getting the updateRun will show the cluster name and stage that is in stuck state:
$ kubectl get csur example-run -o yaml
...
status:
conditions:
- lastTransitionTime: "2025-05-13T23:15:35Z"
message: ClusterStagedUpdateRun initialized successfully
observedGeneration: 1
reason: UpdateRunInitializedSuccessfully
status: "True"
type: Initialized
- lastTransitionTime: "2025-05-13T23:21:18Z"
message: The updateRun is stuck waiting for cluster member1 in stage staging to
finish updating, please check crp status for potential errors
observedGeneration: 1
reason: UpdateRunStuck
status: "False"
type: Progressing
...
The message shows that the updateRun is stuck waiting for the cluster member1
in stage staging
to finish releasing.
The updateRun controller rolls resources out to a member cluster by updating its corresponding binding. It then periodically checks whether the update has completed or not. If the binding is still not available after the current default of 5 minutes, the updateRun controller decides the rollout is stuck and reports the condition.
This usually indicates that something went wrong on the cluster or that the resources have some issue. To further investigate, you can check the ClusterResourcePlacement
status:
$ kubectl describe crp example-placement
...
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-05-13T23:11:14Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-05-13T23:15:35Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-05-13T23:15:35Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-05-13T23:15:35Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-05-13T23:15:35Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-05-13T23:15:35Z
Message: Work object example-placement-work-configmap-c5971133-2779-4f6f-8681-3e05c4458c82 is not yet available
Observed Generation: 1
Reason: NotAllWorkAreAvailable
Status: False
Type: Available
Failed Placements:
Condition:
Last Transition Time: 2025-05-13T23:15:35Z
Message: Manifest is trackable but not available yet
Observed Generation: 1
Reason: ManifestNotAvailableYet
Status: False
Type: Available
Envelope:
Name: envelope-nginx-deploy
Namespace: test-namespace
Type: ConfigMap
Group: apps
Kind: Deployment
Name: nginx
Namespace: test-namespace
Version: v1
...
The Applied condition is False and says not all work objects have been applied. In the Failed Placements section, it shows that the nginx deployment wrapped by the envelope-nginx-deploy configMap is not ready. Checking the member1 cluster, we can see that there is an image pull failure:
kubectl config use-context member1
kubectl get deploy -n test-namespace
NAME READY UP-TO-DATE AVAILABLE AGE
nginx 0/1 1 0 16m
kubectl get pods -n test-namespace
NAME READY STATUS RESTARTS AGE
nginx-69b9cb5485-sw24b 0/1 ErrImagePull 0 16m
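To see the exact image pull error (registry, tag, authentication), describe the failing pod; the pod name below comes from the output above:
kubectl describe pod nginx-69b9cb5485-sw24b -n test-namespace
# Check the Events section at the bottom for the ErrImagePull/ImagePullBackOff reason.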
For more debugging instructions, you can refer to ClusterResourcePlacement TSG.
After resolving the issue, you can always create a new updateRun to restart the rollout. Stuck updateRuns can be deleted.
11 - ClusterResourcePlacementEviction TSG
This guide provides troubleshooting steps for issues related to placement eviction.
An eviction object, once created, is ideally reconciled only once and reaches a terminal state. The terminal states for an eviction are:
- Eviction is Invalid
- Eviction is Valid, Eviction failed to Execute
- Eviction is Valid, Eviction executed successfully
Note: If an eviction object doesn't reach a terminal state, i.e. neither the Valid condition nor the Executed condition is set, it is likely due to a failure in the reconciliation process, such as the controller being unable to reach the API server.
The first step in troubleshooting is to check the status of the eviction object to understand if the eviction reached a terminal state or not.
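For example, assuming an eviction named example-eviction (a hypothetical name), you can inspect its conditions with:
kubectl describe clusterresourceplacementeviction example-eviction
# Or dump just the conditions:
kubectl get clusterresourceplacementeviction example-eviction -o jsonpath='{.status.conditions}'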
Invalid eviction
Missing/Deleting CRP object
Example status with missing CRP
object:
status:
conditions:
- lastTransitionTime: "2025-04-17T22:16:59Z"
message: Failed to find ClusterResourcePlacement targeted by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
Example status with deleting CRP
object:
status:
conditions:
- lastTransitionTime: "2025-04-21T19:53:42Z"
message: Found deleting ClusterResourcePlacement targeted by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In both cases the Eviction object reached a terminal state; its status has the Valid condition set to False.
The user should verify whether the ClusterResourcePlacement object is missing or being deleted, recreate the ClusterResourcePlacement object if needed, and retry the eviction.
Missing CRB object
Example status with missing CRB
object:
status:
conditions:
- lastTransitionTime: "2025-04-17T22:21:51Z"
message: Failed to find scheduler decision for placement in cluster targeted by
eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:
kubectl get rb -l kubernetes-fleet.io/parent-CRP=<CRPName>
The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-randomsuffix.
In this case the Eviction object reached a terminal state; its status has the Valid condition set to False because the ClusterResourceBinding object, i.e. the placement for the target cluster, is not found. The user should verify whether the ClusterResourcePlacement object is propagating resources to the target cluster:
- If yes, the next step is to check whether the ClusterResourceBinding object is present for the target cluster (or why it was not created) and to create an eviction object once the ClusterResourceBinding is created.
- If no, the cluster is not picked by the scheduler and hence there is no need to retry the eviction.
Multiple CRB objects present
Example status with multiple CRB
objects:
status:
conditions:
- lastTransitionTime: "2025-04-17T23:48:08Z"
message: Found more than one scheduler decision for placement in cluster targeted
by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In this case the Eviction object reached a terminal state; its status has the Valid condition set to False because there is more than one ClusterResourceBinding object (placement) present for the ClusterResourcePlacement object targeting the member cluster. This is a rare scenario: it is an in-between state where bindings are being re-created due to the member cluster being selected again, and it will normally resolve quickly.
PickFixed CRP is targeted by CRP Eviction
Example status for ClusterResourcePlacementEviction
object targeting a PickFixed ClusterResourcePlacement
object:
status:
conditions:
- lastTransitionTime: "2025-04-21T23:19:06Z"
message: Found ClusterResourcePlacement with PickFixed placement type targeted
by eviction
observedGeneration: 1
reason: ClusterResourcePlacementEvictionInvalid
status: "False"
type: Valid
In this case the Eviction object reached a terminal state; its status has the Valid condition set to False because the ClusterResourcePlacement object is of type PickFixed. Users cannot use ClusterResourcePlacementEviction objects to evict resources propagated by ClusterResourcePlacement objects of type PickFixed. The user can instead remove the member cluster name from the clusterNames field in the policy of the ClusterResourcePlacement object, as shown in the sketch below.
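For instance, to withdraw resources from member1 in a PickFixed placement (cluster names here are hypothetical), edit the CRP policy and drop the cluster from clusterNames instead of creating an eviction:
spec:
  policy:
    placementType: PickFixed
    clusterNames:
      - member2        # member1 removed; resources placed on it will be withdrawn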
Failed to execute eviction
Eviction blocked because placement is missing
status:
conditions:
- lastTransitionTime: "2025-04-23T23:54:03Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-23T23:54:03Z"
message: Eviction is blocked, placement has not propagated resources to target
cluster yet
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state; its status has the Executed condition set to False because, for the targeted ClusterResourcePlacement, the corresponding ClusterResourceBinding object's state is set to Scheduled in its spec, meaning the rollout of resources has not started yet.
Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:
kubectl get rb -l kubernetes-fleet.io/parent-CRP=<CRPName>
The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-randomsuffix.
spec:
applyStrategy:
type: ClientSideApply
clusterDecision:
clusterName: kind-cluster-3
clusterScore:
affinityScore: 0
priorityScore: 0
reason: 'Successfully scheduled resources for placement in "kind-cluster-3" (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
selected: true
resourceSnapshotName: ""
schedulingPolicySnapshotName: test-crp-1
state: Scheduled
targetCluster: kind-cluster-3
Here the user can wait for the ClusterResourceBinding object to be updated to the Bound state, which means that resources have been propagated to the target cluster, and then retry the eviction. In some cases this can take a while or not happen at all; in that case the user should verify whether the rollout is stuck for the ClusterResourcePlacement object.
Eviction blocked by Invalid CRPDB
Example status for ClusterResourcePlacementEviction
object with invalid ClusterResourcePlacementDisruptionBudget
,
status:
conditions:
- lastTransitionTime: "2025-04-21T23:39:42Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-21T23:39:42Z"
message: Eviction is blocked by misconfigured ClusterResourcePlacementDisruptionBudget,
either MaxUnavailable is specified or MinAvailable is specified as a percentage
for PickAll ClusterResourcePlacement
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state; its status has the Executed condition set to False because the ClusterResourcePlacementDisruptionBudget object is invalid. For ClusterResourcePlacement objects of type PickAll, when specifying a ClusterResourcePlacementDisruptionBudget, the minAvailable field should be set to an absolute number and not a percentage, and the maxUnavailable field should not be set, since the total number of placements is non-deterministic. A sketch of a valid budget for this case follows.
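A minimal sketch of a valid budget for a PickAll placement, following the same name-matching convention as the sample shown later in this section, uses an absolute minAvailable and omits maxUnavailable:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  name: pick-all-crp       # matches the ClusterResourcePlacement name, as in the sample below
spec:
  minAvailable: 1          # absolute number; maxUnavailable is omitted for PickAll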
Eviction blocked by specified CRPDB
Example status for ClusterResourcePlacementEviction
object blocked by a ClusterResourcePlacementDisruptionBudget
object,
status:
conditions:
- lastTransitionTime: "2025-04-24T18:54:30Z"
message: Eviction is valid
observedGeneration: 1
reason: ClusterResourcePlacementEvictionValid
status: "True"
type: Valid
- lastTransitionTime: "2025-04-24T18:54:30Z"
message: 'Eviction is blocked by specified ClusterResourcePlacementDisruptionBudget,
availablePlacements: 2, totalPlacements: 2'
observedGeneration: 1
reason: ClusterResourcePlacementEvictionNotExecuted
status: "False"
type: Executed
In this case the Eviction object reached a terminal state; its status has the Executed condition set to False because the ClusterResourcePlacementDisruptionBudget object is blocking the eviction. The message from the Executed condition reads that 2 of 2 placements are available, which means that the ClusterResourcePlacementDisruptionBudget is protecting all placements propagated by the ClusterResourcePlacement object.
Taking a look at the ClusterResourcePlacementDisruptionBudget
object,
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
annotations:
kubectl.kubernetes.io/last-applied-configuration: |
{"apiVersion":"placement.kubernetes-fleet.io/v1beta1","kind":"ClusterResourcePlacementDisruptionBudget","metadata":{"annotations":{},"name":"pick-all-crp"},"spec":{"minAvailable":2}}
creationTimestamp: "2025-04-24T18:47:22Z"
generation: 1
name: pick-all-crp
resourceVersion: "1749"
uid: 7d3a0ac5-0225-4fb6-b5e9-fc28d58cefdc
spec:
minAvailable: 2
We can see that the minAvailable
is set to 2
, which means that at least 2 placements should be available for the
ClusterResourcePlacement
object.
Let’s take a look at the ClusterResourcePlacement object’s status to verify the list of available placements:
status:
conditions:
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: found all cluster needed as specified by the scheduling policy, found
2 cluster(s)
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All 2 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: Works(s) are succcesfully created or updated in 2 target cluster(s)'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: The selected resources are successfully applied to 2 cluster(s)
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: The selected resources in 2 cluster(s) are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: 'Successfully scheduled resources for placement in "kind-cluster-1"
(affinity score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2025-04-24T18:50:19Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
- clusterName: kind-cluster-2
conditions:
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: 'Successfully scheduled resources for placement in "kind-cluster-2"
(affinity score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: No override rules are configured for the selected resources
observedGeneration: 1
reason: NoOverrideSpecified
status: "True"
type: Overridden
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2025-04-24T18:46:38Z"
message: All corresponding work objects are available
observedGeneration: 1
reason: AllWorkAreAvailable
status: "True"
type: Available
selectedResources:
- kind: Namespace
name: test-ns
version: v1
From the status we can see that the ClusterResourcePlacement object has 2 placements available, where resources have been successfully applied and are available in kind-cluster-1 and kind-cluster-2. The user can check the individual member clusters to verify that the resources are available, but it is recommended to check the ClusterResourcePlacement object status to verify placement availability, since that status is aggregated and updated by the controller.
Here the user can either remove the ClusterResourcePlacementDisruptionBudget object or update minAvailable to 1 to allow the ClusterResourcePlacementEviction object to execute successfully; example commands follow below. In general the user should carefully check the availability of placements and act accordingly when changing the ClusterResourcePlacementDisruptionBudget object.
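For example (the object name is taken from the sample above), either of the following would unblock the eviction; double-check placement availability before relaxing the budget:
# Lower the budget so that one placement may be evicted...
kubectl patch clusterresourceplacementdisruptionbudget pick-all-crp --type merge -p '{"spec":{"minAvailable":1}}'
# ...or remove the budget entirely.
kubectl delete clusterresourceplacementdisruptionbudget pick-all-crp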