Troubleshooting Guides

Guides for identifying and fixing common KubeFleet issues

KubeFleet documentation features a number of troubleshooting guides to help you identify and fix KubeFleet issues you encounter. Pick one below to proceed.

1 - ClusterResourcePlacement TSG

Identify and fix KubeFleet issues associated with the ClusterResourcePlacement API

This TSG is meant to help you troubleshoot issues with the ClusterResourcePlacement API in Fleet.

Cluster Resource Placement

Internal Objects to keep in mind when troubleshooting CRP related errors on the hub cluster:

  • ClusterResourceSnapshot
  • ClusterSchedulingPolicySnapshot
  • ClusterResourceBinding
  • Work

Please read the Fleet API reference for more details about each object.

Complete Progress of the ClusterResourcePlacement

Understanding the progression and the status of the ClusterResourcePlacement custom resource is crucial for diagnosing and identifying failures. You can view the status of the ClusterResourcePlacement custom resource by using the following command:

kubectl describe clusterresourceplacement <name>

The complete progression of ClusterResourcePlacement is as follows:

  1. ClusterResourcePlacementScheduled: Indicates a resource has been scheduled for placement.
  2. ClusterResourcePlacementRolloutStarted: Indicates the rollout process has begun.
  3. ClusterResourcePlacementOverridden: Indicates the resource has been overridden.
  4. ClusterResourcePlacementWorkSynchronized: Indicates the work objects have been synchronized.
  5. ClusterResourcePlacementApplied: Indicates the resource has been applied. This condition will only be populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  6. ClusterResourcePlacementAvailable: Indicates the resource is available. This condition will only be populated if the apply strategy in use is of the type ClientSideApply (default) or ServerSideApply.
  7. ClusterResourcePlacementDiffReported: Indicates whether diff reporting has completed on all resources. This condition will only be populated if the apply strategy in use is of the type ReportDiff.

How can I debug if some clusters are not selected as expected?

Check the status of the ClusterSchedulingPolicySnapshot to determine which clusters were selected along with the reason.
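
For example, the following command (assuming a ClusterResourcePlacement named test-crp) prints the latest snapshot, including which clusters were picked and why:

kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=test-crp -o yaml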

How can I debug if a selected cluster does not have the expected resources on it or if CRP doesn’t pick up the latest changes?

Please check the following cases:

  • Check whether the ClusterResourcePlacementRolloutStarted condition in the ClusterResourcePlacement status is set to true or false.
  • If false, see the CRP Rollout Failure TSG.
  • If true:
    • Check whether the ClusterResourcePlacementApplied condition is set to unknown, false, or true.
    • If unknown, wait for the process to finish, as the resources are still being applied to the member cluster. If the state remains unknown for a while, create an issue, as this is unusual behavior.
    • If false, refer to the CRP Work-Application Failure TSG.
    • If true, verify that the resource exists on the hub cluster.

You can also take a look at the placementStatuses section in the ClusterResourcePlacement status for that particular cluster. Within placementStatuses, the failedPlacements section lists the reasons why resources failed to apply.
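
For example, assuming the ClusterResourcePlacement is named test-crp and jq is installed, you can print the per-cluster placement statuses directly:

kubectl get clusterresourceplacement test-crp -o jsonpath='{.status.placementStatuses}' | jq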

How can I debug if the drift detection result or the configuration difference check result is different from my expectations?

See the Drift Detection and Configuration Difference Check Unexpected Result TSG for more information.

How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?

To find the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement resource, run the following command:

kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}

NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.

Then, compare the ClusterSchedulingPolicySnapshot with the ClusterResourcePlacement policy to make sure that they match, excluding the numberOfClusters field from the ClusterResourcePlacement spec.

If the placement type is PickN, check whether the number of clusters that’s requested in the ClusterResourcePlacement policy matches the value of the kubernetes-fleet.io/number-of-clusters annotation on the snapshot.
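
As a sketch, the following command (again assuming the CRP is named test-crp) prints both the snapshot policy and the number-of-clusters value for a quick comparison:

kubectl get clusterschedulingpolicysnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=test-crp -o jsonpath='{.items[0].spec.policy}{"\n"}{.items[0].metadata.annotations.kubernetes-fleet\.io/number-of-clusters}{"\n"}'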

How can I find the latest ClusterResourceBinding resource?

The following command lists all ClusterResourceBinding instances that are associated with a ClusterResourcePlacement:

kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP={CRPName}

NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.

Example

In this example, we have a ClusterResourcePlacement called test-crp.

  1. List the ClusterResourcePlacement to get the name of the CRP:
kubectl get crp test-crp
NAME       GEN   SCHEDULED   SCHEDULEDGEN   APPLIED   APPLIEDGEN   AGE
test-crp   1     True        1              True      1            15s
  2. Run the following command to view the status of the ClusterResourcePlacement:
kubectl describe clusterresourceplacement test-crp
  3. Here’s an example output. From the placementStatuses section of the test-crp status, notice that it has distributed resources to two member clusters and, therefore, has two ClusterResourceBinding instances:
status:
  conditions:
  - lastTransitionTime: "2023-11-23T00:49:29Z"
    ...
  placementStatuses:
  - clusterName: kind-cluster-1
    conditions:
      ...
      type: ResourceApplied
  - clusterName: kind-cluster-2
    conditions:
      ...
      reason: ApplySucceeded
      status: "True"
      type: ResourceApplied
  4. To get the ClusterResourceBinding instances, run the following command:
kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp
  5. The output lists all ClusterResourceBinding instances that are associated with test-crp:
NAME                               WORKCREATED   RESOURCESAPPLIED   AGE
test-crp-kind-cluster-1-be990c3e   True          True               33s
test-crp-kind-cluster-2-ec4d953c   True          True               33s

The ClusterResourceBinding resource name uses the following format: {CRPName}-{clusterName}-{suffix}. Find the ClusterResourceBinding for the target cluster you are looking for based on the clusterName.
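
If many bindings exist, you can also filter on the spec.targetCluster field; for example, to find the binding that targets kind-cluster-1 under the test-crp placement:

kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=test-crp -o jsonpath='{range .items[?(@.spec.targetCluster=="kind-cluster-1")]}{.metadata.name}{"\n"}{end}'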

How can I find the latest ClusterResourceSnapshot resource?

To find the latest ClusterResourceSnapshot resource, run the following command:

kubectl get clusterresourcesnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP={CRPName}

NOTE: In this command, replace {CRPName} with your ClusterResourcePlacement name.

How can I find the correct work resource that’s associated with ClusterResourcePlacement?

To find the correct work resource, follow these steps:

  1. Identify the member cluster namespace and the ClusterResourcePlacement name. The format for the namespace is fleet-member-{clusterName}.
  2. To get the work resource, run the following command:
kubectl get work -n fleet-member-{clusterName} -l kubernetes-fleet.io/parent-CRP={CRPName}

NOTE: In this command, replace {clusterName} and {CRPName} with the names that you identified in the first step.
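
For example, with the names used in the case study above (member cluster kind-cluster-1 and CRP test-crp), the command becomes:

kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=test-crp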

2 - CRP Schedule Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementScheduled” condition set to false

The ClusterResourcePlacementScheduled condition is set to false when the scheduler cannot find all the clusters needed as specified by the scheduling policy.

Note: To get more information about why the scheduling fails, you can check the scheduler logs.
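
For example, assuming a default installation where the hub-side controllers (including the scheduler) run in the fleet-hub-agent deployment in the fleet-system namespace, you can pull the scheduler logs with:

kubectl logs deployment/fleet-hub-agent -n fleet-system | grep -i schedul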

Common scenarios

Instances where this condition may arise:

  • When the placement policy is set to PickFixed, but the specified cluster names do not match any joined member cluster name in the fleet, or the specified cluster is no longer connected to the fleet.
  • When the placement policy is set to PickN, and N clusters are specified, but there are fewer than N clusters that have joined the fleet or satisfy the placement policy.
  • When the ClusterResourcePlacement resource selector selects a reserved namespace.

Note: When the placement policy is set to PickAll, the ClusterResourcePlacementScheduled condition is always set to true.

Case Study

In the following example, the ClusterResourcePlacement with a PickN placement policy is trying to propagate resources to two clusters labeled env:prod. The two clusters, named kind-cluster-1 and kind-cluster-2, have joined the fleet. However, only one member cluster, kind-cluster-1, has the label env:prod.

CRP spec:

spec:
  policy:
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                env: prod
    numberOfClusters: 2
    placementType: PickN
  resourceSelectors:
  ...
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate

ClusterResourcePlacement status

status:
  conditions:
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: All 1 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: Works(s) are succcesfully created or updated in the 1 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: The selected resources are successfully applied to 1 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: The selected resources in 1 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - conditions:
    - lastTransitionTime: "2024-05-07T22:36:33Z"
      message: 'kind-cluster-2 is not selected: ClusterUnschedulable, cluster does not
        match with any of the required cluster affinity terms'
      observedGeneration: 1
      reason: ScheduleFailed
      status: "False"
      type: Scheduled
  selectedResources:
  ...

The ClusterResourcePlacementScheduled condition is set to false because the goal is to select two clusters with the label env:prod, but only one member cluster possesses the correct label as specified in clusterAffinity.

We can also take a look at the ClusterSchedulingPolicySnapshot status to figure out why the scheduler could not schedule the resource for the placement policy specified. To learn how to get the latest ClusterSchedulingPolicySnapshot, see How can I find and verify the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement?.

The corresponding ClusterSchedulingPolicySnapshot spec and status give us even more information on why scheduling failed.

Latest ClusterSchedulingPolicySnapshot

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterSchedulingPolicySnapshot
metadata:
  annotations:
    kubernetes-fleet.io/CRP-generation: "1"
    kubernetes-fleet.io/number-of-clusters: "2"
  creationTimestamp: "2024-05-07T22:36:33Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-2
    kubernetes-fleet.io/policy-index: "0"
  name: crp-2-0
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-2
    uid: 48bc1e92-a8b9-4450-a2d5-c6905df2cbf0
  resourceVersion: "10090"
  uid: 2137887e-45fd-4f52-bbb7-b96f39854625
spec:
  policy:
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                env: prod
    placementType: PickN
  policyHash: ZjE0Yjk4YjYyMTVjY2U3NzQ1MTZkNWRhZjRiNjQ1NzQ4NjllNTUyMzZkODBkYzkyYmRkMGU3OTI3MWEwOTkyNQ==
status:
  conditions:
  - lastTransitionTime: "2024-05-07T22:36:33Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: Scheduled
  observedCRPGeneration: 1
  targetClusters:
  - clusterName: kind-cluster-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  - clusterName: kind-cluster-2
    reason: ClusterUnschedulable, cluster does not match with any of the required
      cluster affinity terms
    selected: false

Resolution

The solution here is to add the env:prod label to the member cluster resource for kind-cluster-2 as well, so that the scheduler can select the cluster to propagate resources.
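
A minimal sketch of the fix, assuming the MemberCluster API object on the hub cluster is named kind-cluster-2:

kubectl label membercluster kind-cluster-2 env=prod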

3 - CRP Rollout Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementRolloutStarted” condition set to false

When using the ClusterResourcePlacement API object in KubeFleet to propagate resources, the selected resources aren’t rolled out in all scheduled clusters, and the ClusterResourcePlacementRolloutStarted condition status shows as False.

This TSG only applies to the RollingUpdate rollout strategy, which is the default if you don’t specify one in the ClusterResourcePlacement. If you specify the External rollout strategy in the ClusterResourcePlacement instead, please refer to the Staged Update Run Troubleshooting Guide.

Note: To get more information about why the rollout doesn’t start, you can check the rollout controller logs.

Common scenarios

Instances where this condition may arise:

  • The Cluster Resource Placement rollout strategy is blocked because the RollingUpdate configuration is too strict.

Troubleshooting Steps

  1. In the ClusterResourcePlacement status section, check the placementStatuses to identify clusters with the RolloutStarted status set to False.
  2. Locate the corresponding ClusterResourceBinding for the identified cluster. For more information, see How can I find the latest ClusterResourceBinding resource?. This resource should indicate the status of the Work whether it was created or updated.
  3. Verify the values of maxUnavailable and maxSurge to ensure they align with your expectations (see the sketch after this list).
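
For reference, here is a sketch of what an explicit rollingUpdate configuration looks like in the ClusterResourcePlacement spec (the values shown are illustrative, not recommendations):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 25%
    maxSurge: 25%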

Case Study

In the following example, the ClusterResourcePlacement is trying to propagate a namespace to three member clusters. However, during the initial creation of the ClusterResourcePlacement, the namespace didn’t exist on the hub cluster, and the fleet currently comprises two member clusters named kind-cluster-1 and kind-cluster-2.

ClusterResourcePlacement spec

spec:
  policy:
    numberOfClusters: 3
    placementType: PickN
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate

ClusterResourcePlacement status

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: Works(s) are succcesfully created or updated in the 2 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources are successfully applied to 2 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: The selected resources in 2 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available

The previous output indicates that the test-ns namespace never existed on the hub cluster and shows the following ClusterResourcePlacement condition statuses:

  • ClusterResourcePlacementScheduled is set to False, as the specified policy aims to pick three clusters, but the scheduler can only accommodate placement in two currently available and joined clusters.
  • ClusterResourcePlacementRolloutStarted is set to True, as the rollout process has commenced with 2 clusters being selected.
  • ClusterResourcePlacementOverridden is set to True, as no override rules are configured for the selected resources.
  • ClusterResourcePlacementWorkSynchronized is set to True.
  • ClusterResourcePlacementApplied is set to True.
  • ClusterResourcePlacementAvailable is set to True.

To ensure seamless propagation of the namespace across the relevant clusters, proceed to create the test-ns namespace on the hub cluster.

ClusterResourcePlacement status after namespace test-ns is created on the hub cluster

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The rollout is being blocked by the rollout strategy in 2 cluster(s)
    observedGeneration: 1
    reason: RolloutNotStartedYet
    status: "False"
    type: ClusterResourcePlacementRolloutStarted
  observedResourceIndex: "1"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:08:53Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:13:51Z"
      message: The rollout is being blocked by the rollout strategy
      observedGeneration: 1
      reason: RolloutNotStartedYet
      status: "False"
      type: RolloutStarted
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1

Upon examination, the ClusterResourcePlacementScheduled condition status is shown as False. The ClusterResourcePlacementRolloutStarted status is also shown as False with the message The rollout is being blocked by the rollout strategy in 2 cluster(s).

Let’s check the latest ClusterResourceSnapshot by running the command in How can I find the latest ClusterResourceSnapshot resource?.

Latest ClusterResourceSnapshot

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
  annotations:
    kubernetes-fleet.io/number-of-enveloped-object: "0"
    kubernetes-fleet.io/number-of-resource-snapshots: "1"
    kubernetes-fleet.io/resource-hash: 72344be6e268bc7af29d75b7f0aad588d341c228801aab50d6f9f5fc33dd9c7c
  creationTimestamp: "2024-05-07T23:13:51Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-3
    kubernetes-fleet.io/resource-index: "1"
  name: crp-3-1-snapshot
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-3
    uid: b4f31b9a-971a-480d-93ac-93f093ee661f
  resourceVersion: "14434"
  uid: 85ee0e81-92c9-4362-932b-b0bf57d78e3f
spec:
  selectedResources:
  - apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: test-ns
      name: test-ns
    spec:
      finalizers:
      - kubernetes

Upon inspecting the ClusterResourceSnapshot spec, the selectedResources section now shows the namespace test-ns.

Let’s check the ClusterResourceBinding for kind-cluster-1 to see whether it was updated after the namespace test-ns was created, by running the command in How can I find the latest ClusterResourceBinding resource?.

ClusterResourceBinding for kind-cluster-1

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceBinding
metadata:
  creationTimestamp: "2024-05-07T23:08:53Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 2
  labels:
    kubernetes-fleet.io/parent-CRP: crp-3
  name: crp-3-kind-cluster-1-7114c253
  resourceVersion: "14438"
  uid: 0db4e480-8599-4b40-a1cc-f33bcb24b1a7
spec:
  applyStrategy:
    type: ClientSideApply
  clusterDecision:
    clusterName: kind-cluster-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-3-0-snapshot
  schedulingPolicySnapshotName: crp-3-0
  state: Bound
  targetCluster: kind-cluster-1
status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:13:51Z"
    message: The resources cannot be updated to the latest because of the rollout
      strategy
    observedGeneration: 2
    reason: RolloutNotStartedYet
    status: "False"
    type: RolloutStarted
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 2
    reason: NoOverrideSpecified
    status: "True"
    type: Overridden
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All of the works are synchronized to the latest
    observedGeneration: 2
    reason: AllWorkSynced
    status: "True"
    type: WorkSynchronized
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are applied
    observedGeneration: 2
    reason: AllWorkHaveBeenApplied
    status: "True"
    type: Applied
  - lastTransitionTime: "2024-05-07T23:08:53Z"
    message: All corresponding work objects are available
    observedGeneration: 2
    reason: AllWorkAreAvailable
    status: "True"
    type: Available

Upon inspection, it is observed that the ClusterResourceBinding remains unchanged. Notably, in the spec, the resourceSnapshotName still references the old ClusterResourceSnapshot name.

This issue arises due to the absence of explicit rollingUpdate input from the user. Consequently, the default values are applied:

  • The maxUnavailable value is configured to 25% x 3 (the desired number of clusters) = 0.75, rounded up to 1.
  • The maxSurge value is configured to 25% x 3 (the desired number of clusters) = 0.75, rounded up to 1.

Why isn’t the ClusterResourceBinding updated?

Initially, when the ClusterResourcePlacement was created, two ClusterResourceBindings were generated. Because the rollout strategy does not block the initial placement, the ClusterResourcePlacementRolloutStarted condition was set to True.

Upon creating the test-ns namespace on the hub cluster, the rollout controller attempted to update the two existing ClusterResourceBindings. However, maxUnavailable defaulted to 1, and the fleet is already short one member cluster (two joined versus three desired), which counts against that budget and makes the RollingUpdate configuration too strict to update either binding.

NOTE: During the update, if one of the bindings fails to apply, it also counts against the maxUnavailable budget of 1, further restricting the rollout.

Resolution

In this situation, to address this issue, consider manually setting maxUnavailable to a value greater than 1 to relax the RollingUpdate configuration. Alternatively, you can join a third member cluster.
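
A minimal sketch of the relaxed configuration:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 2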

4 - CRP Override Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementOverridden” condition set to false

The status of the ClusterResourcePlacementOverridden condition is set to false when there is an Override API related issue.

Note: To get more information, look into the logs for the overrider controller (includes controller for ClusterResourceOverride and ResourceOverride).

Common scenarios

Instances where this condition may arise:

  • The ClusterResourceOverride or ResourceOverride is created with an invalid field path for the resource.

Case Study

In the following example, an attempt is made to override the cluster role secret-reader that is being propagated by the ClusterResourcePlacement to the selected clusters. However, the ClusterResourceOverride is created with an invalid path for the field within the resource.

ClusterRole

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: "2024-05-14T15:36:48Z"
  name: secret-reader
  resourceVersion: "81334"
  uid: 108e6312-3416-49be-aa3d-a665c5df58b4
rules:
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - watch
  - list

This is the ClusterRole secret-reader that is being propagated to the member clusters by the ClusterResourcePlacement.

ClusterResourceOverride spec

spec:
  clusterResourceSelectors:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    name: secret-reader
    version: v1
  policy:
    overrideRules:
    - clusterSelector:
        clusterSelectorTerms:
        - labelSelector:
            matchLabels:
              env: canary
      jsonPatchOverrides:
      - op: add
        path: /metadata/labels/new-label
        value: new-value

The ClusterResourceOverride is created to override the ClusterRole secret-reader by adding a new label (new-label) that has the value new-value for the clusters with the label env: canary.

ClusterResourcePlacement Spec

spec:
  resourceSelectors:
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      name: secret-reader
      version: v1
  policy:
    placementType: PickN
    numberOfClusters: 1
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  env: canary
  strategy:
    type: RollingUpdate
    applyStrategy:
      allowCoOwnership: true

ClusterResourcePlacement Status

status:
  conditions:
  - lastTransitionTime: "2024-05-14T16:16:18Z"
    message: found all cluster needed as specified by the scheduling policy, found
      1 cluster(s)
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-14T16:16:18Z"
    message: All 1 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-14T16:16:18Z"
    message: Failed to override resources in 1 cluster(s)
    observedGeneration: 1
    reason: OverriddenFailed
    status: "False"
    type: ClusterResourcePlacementOverridden
  observedResourceIndex: "0"
  placementStatuses:
  - applicableClusterResourceOverrides:
    - cro-1-0
    clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-14T16:16:18Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-14T16:16:18Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-14T16:16:18Z"
      message: 'Failed to apply the override rules on the resources: add operation
        does not apply: doc is missing path: "/metadata/labels/new-label": missing
        value'
      observedGeneration: 1
      reason: OverriddenFailed
      status: "False"
      type: Overridden
  selectedResources:
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
    name: secret-reader
    version: v1

The CRP attempted to override the propagated resource by using the applicable ClusterResourceOverrideSnapshot. Because the ClusterResourcePlacementOverridden condition remains false, looking at the placement status for the cluster where the Overridden condition failed offers insights into the exact cause of the failure.

In this situation, the message indicates that the override failed because the path /metadata/labels/new-label and its corresponding value are missing. Based on the previous example of the cluster role secret-reader, you can see that the path /metadata/labels/ doesn’t exist, meaning the labels field doesn’t exist on the resource. Therefore, a new label can’t be added.

Resolution

To successfully override the cluster role secret-reader, correct the path and value in ClusterResourceOverride, as shown in the following code:

jsonPatchOverrides:
  - op: add
    path: /metadata/labels
    value: 
      newlabel: new-value

This will successfully add the new label newlabel with the value new-value to the ClusterRole secret-reader, as we are creating the labels field and adding a new value newlabel: new-value to it.

5 - CRP Work-Synchronization Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementWorkSynchronized” condition set to false

The ClusterResourcePlacementWorkSynchronized condition is false when the CRP has been recently updated but the associated work objects have not yet been synchronized with the changes.

Note: In addition, it may be helpful to look into the logs for the work generator controller to get more information on why the work synchronization failed.

Common Scenarios

Instances where this condition may arise:

  • The controller encounters an error while trying to generate the corresponding work object.
  • The enveloped object is not well formatted.

Case Study

The CRP is attempting to propagate a resource to a selected cluster, but the work object has not been updated to reflect the latest changes because the selected cluster is being terminated (its reserved fleet-member-* namespace on the hub cluster is being deleted).

ClusterResourcePlacement Spec

spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: test-ns
      version: v1
  policy:
    placementType: PickN
    numberOfClusters: 1
  strategy:
    type: RollingUpdate

ClusterResourcePlacement Status

spec:
  policy:
    numberOfClusters: 1
    placementType: PickN
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
status:
  conditions:
  - lastTransitionTime: "2024-05-14T18:05:04Z"
    message: found all cluster needed as specified by the scheduling policy, found
      1 cluster(s)
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-14T18:05:05Z"
    message: All 1 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-14T18:05:05Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-14T18:05:05Z"
    message: There are 1 cluster(s) which have not finished creating or updating work(s)
      yet
    observedGeneration: 1
    reason: WorkNotSynchronizedYet
    status: "False"
    type: ClusterResourcePlacementWorkSynchronized
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-14T18:05:04Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-14T18:05:05Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-14T18:05:05Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-14T18:05:05Z"
      message: 'Failed to synchronize the work to the latest: works.placement.kubernetes-fleet.io
        "crp1-work" is forbidden: unable to create new content in namespace fleet-member-kind-cluster-1
        because it is being terminated'
      observedGeneration: 1
      reason: SyncWorkFailed
      status: "False"
      type: WorkSynchronized
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1

In the ClusterResourcePlacement status, the ClusterResourcePlacementWorkSynchronized condition status shows as False. The message for it indicates that the work object crp1-work is prohibited from generating new content within the namespace fleet-member-kind-cluster-1 because it’s currently terminating.
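
To confirm this situation, you can check whether the reserved member namespace on the hub cluster is still terminating (kind-cluster-1 is the example cluster name here):

kubectl get namespace fleet-member-kind-cluster-1 -o jsonpath='{.status.phase}{"\n"}'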

Resolution

To address the issue at hand, there are several potential solutions:

  • Modify the ClusterResourcePlacement with a newly selected cluster.
  • Delete the ClusterResourcePlacement to remove work through garbage collection.
  • Rejoin the member cluster. The namespace can only be regenerated after rejoining the cluster.

In other situations, you might opt to wait for the work to finish propagating.

6 - CRP Work-Application Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementApplied” condition set to false

The ClusterResourcePlacementApplied condition is set to false when KubeFleet fails to apply one or more of the selected resources to a member cluster.

Note: To get more information about why the resources are not applied, you can check the work applier logs.

Common scenarios

Instances where this condition may arise:

  • The resource already exists on the cluster and isn’t managed by the fleet controller.
  • Another ClusterResourcePlacement deployment is already managing the resource for the selected cluster by using a different apply strategy.
  • The ClusterResourcePlacement deployment doesn’t apply the manifest because of syntax errors or invalid resource configurations. This might also occur if a resource is propagated through an envelope object.

Investigation steps

  1. Check placementStatuses: In the ClusterResourcePlacement status section, inspect the placementStatuses to identify which clusters have the ResourceApplied condition set to false and note down their clusterName.
  2. Locate the Work Object in Hub Cluster: Use the identified clusterName to locate the Work object associated with the member cluster. Please refer to this section to learn how to get the correct Work resource.
  3. Check Work object status: Inspect the status of the Work object to understand the specific issues preventing successful resource application (see the example command after this list).
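
For example, with the names from the case study below (member cluster kind-cluster-1 and CRP crp-4), steps 2 and 3 reduce to:

kubectl get work -n fleet-member-kind-cluster-1 -l kubernetes-fleet.io/parent-CRP=crp-4 -o yaml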

Case Study

In the following example, ClusterResourcePlacement is trying to propagate a namespace that contains a deployment to two member clusters. However, the namespace already exists on one member cluster, specifically kind-cluster-1.

ClusterResourcePlacement spec

spec:
  policy:
    clusterNames:
    - kind-cluster-1
    - kind-cluster-2
    placementType: PickFixed
  resourceSelectors:
  - group: ""
    kind: Namespace
    name: test-ns
    version: v1
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate

ClusterResourcePlacement status

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: could not find all the clusters needed as specified by the scheduling
      policy
    observedGeneration: 1
    reason: SchedulingPolicyUnfulfilled
    status: "False"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: Works(s) are succcesfully created or updated in the 2 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: Failed to apply resources to 1 clusters, please check the `failedPlacements`
      status
    observedGeneration: 1
    reason: ApplyFailed
    status: "False"
    type: ClusterResourcePlacementApplied
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-2 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:32:49Z"
      message: The availability of work object crp-4-work is not trackable
      observedGeneration: 1
      reason: WorkNotTrackable
      status: "True"
      type: Available
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: Work object crp-4-work is not applied
      observedGeneration: 1
      reason: NotAllWorkHaveBeenApplied
      status: "False"
      type: Applied
    failedPlacements:
    - condition:
        lastTransitionTime: "2024-05-07T23:32:40Z"
        message: 'Failed to apply manifest: failed to process the request due to a
          client error: resource exists and is not managed by the fleet controller
          and co-ownernship is disallowed'
        reason: ManifestsAlreadyOwnedByOthers
        status: "False"
        type: Applied
      kind: Namespace
      name: test-ns
      version: v1
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1
  - group: apps
    kind: Deployment
    name: test-nginx
    namespace: test-ns
    version: v1

In the ClusterResourcePlacement status, within the failedPlacements section for kind-cluster-1, we get a clear message as to why the resource failed to apply on the member cluster. In the preceding conditions section, the Applied condition for kind-cluster-1 is flagged as false and shows the NotAllWorkHaveBeenApplied reason. This indicates that the Work object intended for the member cluster kind-cluster-1 has not been applied.

For more information, see this section.

Work status of kind-cluster-1

status:
  conditions:
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: 'Apply manifest {Ordinal:0 Group: Version:v1 Kind:Namespace Resource:namespaces
      Namespace: Name:test-ns} failed'
    observedGeneration: 1
    reason: WorkAppliedFailed
    status: "False"
    type: Applied
  - lastTransitionTime: "2024-05-07T23:32:40Z"
    message: ""
    observedGeneration: 1
    reason: WorkAppliedFailed
    status: Unknown
    type: Available
  manifestConditions:
  - conditions:
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: 'Failed to apply manifest: failed to process the request due to a client
        error: resource exists and is not managed by the fleet controller and co-ownernship
        is disallowed'
      reason: ManifestsAlreadyOwnedByOthers
      status: "False"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: Manifest is not applied yet
      reason: ManifestApplyFailed
      status: Unknown
      type: Available
    identifier:
      kind: Namespace
      name: test-ns
      ordinal: 0
      resource: namespaces
      version: v1
  - conditions:
    - lastTransitionTime: "2024-05-07T23:32:40Z"
      message: Manifest is already up to date
      observedGeneration: 1
      reason: ManifestAlreadyUpToDate
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T23:32:51Z"
      message: Manifest is trackable and available now
      observedGeneration: 1
      reason: ManifestAvailable
      status: "True"
      type: Available
    identifier:
      group: apps
      kind: Deployment
      name: test-nginx
      namespace: test-ns
      ordinal: 1
      resource: deployments
      version: v1

From looking at the Work status, specifically the manifestConditions section, you can see that the namespace could not be applied but the deployment within the namespace got propagated from the hub to the member cluster.

Resolution

In this situation, a potential solution is to set allowCoOwnership to true in the applyStrategy policy. However, it’s important to note that this decision should be made by the user because the resources might not be meant to be shared.
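
A minimal sketch of that change in the ClusterResourcePlacement spec (allowCoOwnership is the same field shown in the override case study earlier):

strategy:
  type: RollingUpdate
  applyStrategy:
    allowCoOwnership: true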

7 - CRP Availability Failure TSG

Troubleshooting guide for CRP status “ClusterResourcePlacementAvailable” condition set to false

The ClusterResourcePlacementAvailable condition is false when some of the resources are not available yet. Detailed failure information is placed in the failedPlacements array of the per-cluster placement status.

Note: To get more information about why resources are unavailable, check the work applier logs.

Common scenarios

Instances where this condition may arise:

  • The member cluster doesn’t have enough resource availability.
  • The deployment contains an invalid image name.

Case Study

The example output below demonstrates a scenario where the CRP is unable to propagate a deployment to a member cluster because the deployment has a bad image name.

ClusterResourcePlacement spec

spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: test-ns
      version: v1
  policy:
    placementType: PickN
    numberOfClusters: 1
  strategy:
    type: RollingUpdate

ClusterResourcePlacement status

status:
  conditions:
  - lastTransitionTime: "2024-05-14T18:52:30Z"
    message: found all cluster needed as specified by the scheduling policy, found
      1 cluster(s)
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: All 1 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: Works(s) are succcesfully created or updated in 1 target cluster(s)'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: The selected resources are successfully applied to 1 cluster(s)
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: The selected resources in 1 cluster(s) are still not available yet
    observedGeneration: 1
    reason: ResourceNotAvailableYet
    status: "False"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-14T18:52:30Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Work object crp1-work is not available
      observedGeneration: 1
      reason: NotAllWorkAreAvailable
      status: "False"
      type: Available
    failedPlacements:
    - condition:
        lastTransitionTime: "2024-05-14T18:52:31Z"
        message: Manifest is trackable but not available yet
        observedGeneration: 1
        reason: ManifestNotAvailableYet
        status: "False"
        type: Available
      group: apps
      kind: Deployment
      name: my-deployment
      namespace: test-ns
      version: v1
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1
  - group: apps
    kind: Deployment
    name: my-deployment
    namespace: test-ns
    version: v1

In the ClusterResourcePlacement status, within the failedPlacements section for kind-cluster-1, we get a clear message as to why the resource is not yet available on the member cluster. In the preceding conditions section, the Available condition for kind-cluster-1 is flagged as false and shows the NotAllWorkAreAvailable reason. This signifies that the Work object intended for the member cluster kind-cluster-1 is not yet available.

For more information, see this section.

Work status of kind-cluster-1

status:
  conditions:
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: Work is applied successfully
    observedGeneration: 1
    reason: WorkAppliedCompleted
    status: "True"
    type: Applied
  - lastTransitionTime: "2024-05-14T18:52:31Z"
    message: Manifest {Ordinal:1 Group:apps Version:v1 Kind:Deployment Resource:deployments
      Namespace:test-ns Name:my-deployment} is not available yet
    observedGeneration: 1
    reason: WorkNotAvailableYet
    status: "False"
    type: Available
  manifestConditions:
  - conditions:
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Manifest is already up to date
      reason: ManifestAlreadyUpToDate
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Manifest is trackable and available now
      reason: ManifestAvailable
      status: "True"
      type: Available
    identifier:
      kind: Namespace
      name: test-ns
      ordinal: 0
      resource: namespaces
      version: v1
  - conditions:
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Manifest is already up to date
      observedGeneration: 1
      reason: ManifestAlreadyUpToDate
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-14T18:52:31Z"
      message: Manifest is trackable but not available yet
      observedGeneration: 1
      reason: ManifestNotAvailableYet
      status: "False"
      type: Available
    identifier:
      group: apps
      kind: Deployment
      name: my-deployment
      namespace: test-ns
      ordinal: 1
      resource: deployments
      version: v1

Check the Available status for kind-cluster-1. You can see that the my-deployment deployment isn’t yet available on the member cluster. This suggests that an issue might be affecting the deployment manifest.

Resolution

In this situation, a potential solution is to check the deployment on the member cluster, which reveals that the root cause of the issue is a bad image name. After the bad image name is identified, you can correct the deployment manifest on the hub cluster and update it. After you fix and update the resource manifest, the ClusterResourcePlacement API automatically propagates the corrected resource to the member cluster.
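
For example, running the following commands on the member cluster (my-deployment and test-ns are the names from this case study) should surface the failing pods and the bad image name:

kubectl describe deployment my-deployment -n test-ns
kubectl get pods -n test-ns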

For all other situations, make sure that the propagated resource is configured correctly. Additionally, verify that the selected cluster has sufficient available capacity to accommodate the new resources.

8 - CRP Drift Detection and Configuration Difference Check Unexpected Result TSG

Troubleshoot situations where CRP drift detection and configuration difference check features are returning unexpected results

This document helps you troubleshoot unexpected drift and configuration difference detection results when using the KubeFleet CRP API.

Note

If you are looking for troubleshooting steps on diff reporting failures, i.e., when the ClusterResourcePlacementDiffReported condition on your CRP object is set to False, see the CRP Diff Reporting Failure TSG instead.

Note

This document focuses on unexpected drift and configuration difference detection results. If you have encountered drift and configuration difference detection failures (e.g., no detection results at all with the ClusterResourcePlacementApplied condition being set to False with a detection related error), see the CRP Apply Op Failure TSG instead.

Common scenarios

A drift occurs when a non-KubeFleet agent modifies a KubeFleet-managed resource (i.e., a resource that has been applied by KubeFleet). Drift details are reported in the CRP status on a per-cluster basis (.status.placementStatuses[*].driftedPlacements field). Drift detection is always on when your CRP uses a ClientSideApply (default) or ServerSideApply typed apply strategy; however, note the following limitations:

  • When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only detect drifts in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. A non-KubeFleet agent can then add a field (e.g., a label or an annotation) to the resource without KubeFleet complaining about it. To check for such changes (field additions), use the fullComparison option for the comparisonOption field (see the apply strategy sketch after this list).
  • Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that will process KubeFleet-managed resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as drifts. For any unexpected drift reports, first verify whether a known source in your cluster triggers the changes.
  • When you set the whenToApply setting (.spec.strategy.applyStrategy.whenToApply field) to Always and the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, no drifts will ever be found, as apply ops from KubeFleet will overwrite any drift in managed fields, and drifts in unmanaged fields are always ignored.
  • Drift detection does not apply to resources that are not yet managed by KubeFleet. If a resource has not been created on the hub cluster or has not been selected by the CRP API, there will not be any drift reports about it, even if the resource lives within a KubeFleet-managed namespace. Similarly, if KubeFleet has been blocked from taking over a pre-existing resource due to your takeover setting (.spec.strategy.applyStrategy.whenToTakeOver field), no drift detection will run on the resource.
  • Resource deletion is not considered a drift; if a KubeFleet-managed resource has been deleted by a non-KubeFleet agent, KubeFleet will attempt to re-create it as soon as it finds out about the deletion.
  • Drift detection will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, drifts on the member cluster side might have been overwritten.
  • When a rollout is in progress, drifts will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.
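For reference, below is a minimal apply strategy sketch that surfaces drifts in unmanaged fields as well; field names and value casing follow the descriptions above:

spec:
  strategy:
    applyStrategy:
      type: ClientSideApply            # drift detection runs for ClientSideApply (default) and ServerSideApply
      comparisonOption: fullComparison # also report field additions made by non-KubeFleet agents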

KubeFleet will check for configuration differences under the following two conditions:

  • When KubeFleet encounters a pre-existing resource, and the whenToTakeOver setting (.spec.strategy.applyStrategy.whenToTakeOver field) is set to IfNoDiff.
  • When the CRP uses an apply strategy of the ReportDiff type.
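As a sketch, an apply strategy that only reports configuration differences without applying resources (field names as described above):

spec:
  strategy:
    applyStrategy:
      type: ReportDiff                    # report configuration differences only; no apply ops
      comparisonOption: partialComparison # compare managed fields only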

Configuration difference details are reported in the CRP status on a per-cluster basis (.status.placementStatuses[*].diffedPlacements field). Note that the following limitations apply:

  • When you set the comparisonOption setting (.spec.strategy.applyStrategy.comparisonOption field) to partialComparison, KubeFleet will only check for configuration differences in managed fields, i.e., fields that have been explicitly specified on the hub cluster side. Unmanaged fields, such as additional labels and annotations, will not be considered configuration differences. To check for such changes (field additions), use the fullComparison option for the comparisonOption field.
  • Depending on your cluster setup, there might exist Kubernetes webhooks/controllers (built-in or from a third party) that will process resources and add/modify fields as they see fit. The API server on the member cluster side might also add/modify fields (e.g., enforcing default values) on resources. If your comparison option allows, KubeFleet will report these as configuration differences. For any unexpected configuration difference reports, first verify whether a known source in your cluster triggers the changes.
  • KubeFleet checks for configuration differences regardless of resource ownership; resources not managed by KubeFleet will also be checked.
  • The absence of a resource will be considered a configuration difference.
  • Configuration differences will not block resource rollouts. If you have just updated the resources on the hub cluster side and triggered a rollout, configuration difference check will be re-run based on the newer versions of resources.
  • When a rollout is in progress, configuration differences will not be reported on the CRP status for a member cluster if the cluster has not received the latest round of updates.

Note also that drift detection and configuration difference checks in KubeFleet run periodically; the reports in the CRP status might not be up-to-date.

Investigation steps

If you find an unexpected drift detection or configuration difference check result on a member cluster, follow the steps below for investigation:

  • Double-check the apply strategy of your CRP; confirm that your settings allow proper drift detection and/or configuration difference reporting.
  • Verify that rollout has completed on all member clusters; see the CRP Rollout Failure TSG for more information.
  • Log onto your member cluster and retrieve the resources with unexpected reportings.
    • Check whether its generation (.metadata.generation field) matches the observedInMemberClusterGeneration value in the drift detection and/or configuration difference check reports. A mismatch might signal that the reports are not yet up-to-date; they should get refreshed soon.
    • The kubectl.kubernetes.io/last-applied-configuration annotation and/or the .metadata.managedFields field might have some relevant information on which agents have attempted to update/patch the resource. KubeFleet changes are executed under the manager name work-api-agent; if you see other manager names, check whether they come from a known source (e.g., a Kubernetes controller) in your cluster.
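A couple of command sketches for these checks (my-deployment and test-ns are stand-ins for your own resource; run against the member cluster):

# Compare the live generation with observedInMemberClusterGeneration in the CRP status.
kubectl get deployment my-deployment -n test-ns -o jsonpath='{.metadata.generation}'

# List the field managers that have updated/patched the resource; KubeFleet shows up as work-api-agent.
kubectl get deployment my-deployment -n test-ns -o jsonpath='{.metadata.managedFields[*].manager}'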

File an issue to the KubeFleet team if you believe that the unexpected reportings come from a bug in KubeFleet.

9 - CRP Diff Reporting Failure TSG

Troubleshoot failures in the CRP diff reporting process

This document helps you troubleshoot diff reporting failures when using the KubeFleet CRP API, specifically when you find that the ClusterResourcePlacementDiffReported status condition has been set to False in the CRP status.

Note

If you are looking for troubleshooting steps on unexpected drift detection and/or configuration difference detection results, see the CRP Drift Detection and Configuration Difference Check Unexpected Result TSG instead.

Note

The ClusterResourcePlacementDiffReported status condition will only be set if the CRP has an apply strategy of the ReportDiff type. If your CRP uses ClientSideApply (default) or ServerSideApply typed apply strategies, it is perfectly normal if the ClusterResourcePlacementDiffReported status condition is absent in the CRP status.

Common scenarios

The ClusterResourcePlacementDiffReported status condition will be set to False if KubeFleet cannot complete the configuration difference checking process for one or more of the selected resources.

Depending on your CRP configuration, KubeFleet might use one of three approaches for configuration difference checking:

  • If the resource cannot be found on a member cluster, KubeFleet will simply report a full object difference.
  • If you ask KubeFleet to perform partial comparisons, i.e., the comparisonOption field in the CRP apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to partialComparison, KubeFleet will perform a dry-run apply op (server-side apply with conflict overriding enabled) and compare the returned apply result against the current state of the resource on the member cluster side for configuration differences. A manual approximation of this dry-run op is sketched after this list.
  • If you ask KubeFleet to perform full comparisons, i.e., the comparisonOption field in the CRP apply strategy (.spec.strategy.applyStrategy.comparisonOption field) is set to fullComparison, KubeFleet will directly compare the given manifest (the resource created on the hub cluster side) against the current state of the resource on the member cluster side for configuration differences.
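The dry-run apply op in the second approach can be approximated manually with kubectl, which may help reproduce the comparison input (a sketch; deployment.yaml is a hypothetical manifest exported from the hub cluster):

# Server-side dry-run apply with conflict overriding; the returned object includes defaulted fields.
kubectl apply --server-side --force-conflicts --dry-run=server -f deployment.yaml -o yaml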

Failures might arise if:

  • The dry-run apply op does not complete successfully; or
  • An unexpected error occurs during the comparison process, such as a JSON path parsing/evaluation error.

Investigation steps

If you encounter such a failure, follow the steps below for investigation:

  • First, identify the specific resources that have failed in the diff reporting process. In the CRP status, find the individual member clusters with diff reporting failures: inspect the .status.placementStatuses field of the CRP object; each entry corresponds to a member cluster. For each entry, check whether it has a ClusterResourcePlacementDiffReported condition in the .status.placementStatuses[*].conditions field that has been set to False. Write down the name of the member cluster.

  • For each cluster name that has been written down, list all the work objects that have been created for the cluster in correspondence with the CRP object:

    # Replace [YOUR-CLUSTER-NAME] and [YOUR-CRP-NAME] with values of your own.
    kubectl get work -n fleet-member-[YOUR-CLUSTER-NAME] -l kubernetes-fleet.io/parent-CRP=[YOUR-CRP-NAME]
    
  • For each found work object, inspect its status. The .status.manifestConditions field features an array in which each item describes the processing result of a resource on the given member cluster. Find all items with a DiffReported condition in the .status.manifestConditions[*].conditions field that has been set to False. The .status.manifestConditions[*].identifier field tells you the GVK, namespace, and name of the failing resource. A command sketch follows these steps.

  • Read the message field of the DiffReported condition (.status.manifestConditions[*].conditions[*].message); KubeFleet will include the details about the diff reporting failures in the field.

  • If you are familiar with the cause of the error (for example, a dry-run apply op fails due to API server traffic control measures), fixing the cause (e.g., tweaking traffic control limits) should resolve the failure; KubeFleet will periodically retry diff reporting in the face of failures. Otherwise, file an issue to the KubeFleet team.
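A command sketch for the work object inspection steps above (replace the bracketed placeholders with values of your own):

# Dump the per-resource processing results, including any DiffReported conditions.
kubectl get work [YOUR-WORK-NAME] -n fleet-member-[YOUR-CLUSTER-NAME] -o jsonpath='{.status.manifestConditions}'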

10 - ClusterStagedUpdateRun TSG

Identify and fix KubeFleet issues associated with the ClusterStagedUpdateRun API

This guide provides troubleshooting steps for common issues related to Staged Update Run.

Note: To get more information about why the scheduling fails, you can check the updateRun controller logs.
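Assuming a default installation where the hub agent runs as a Deployment named hub-agent in the fleet-system namespace (adjust the names to your setup), the logs can be pulled with:

kubectl logs deployment/hub-agent -n fleet-system | grep -i updaterun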

CRP status without Staged Update Run

When a ClusterResourcePlacement is created with spec.strategy.type set to External, the rollout does not start immediately.

A sample status of such a ClusterResourcePlacement is as follows:

$ kubectl describe crp example-placement
...
Status:
  Conditions:
    Last Transition Time:   2025-03-12T23:01:32Z
    Message:                found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
    Observed Generation:    1
    Reason:                 SchedulingPolicyFulfilled
    Status:                 True
    Type:                   ClusterResourcePlacementScheduled
    Last Transition Time:   2025-03-12T23:01:32Z
    Message:                There are still 2 cluster(s) in the process of deciding whether to roll out the latest resources or not
    Observed Generation:    1
    Reason:                 RolloutStartedUnknown
    Status:                 Unknown
    Type:                   ClusterResourcePlacementRolloutStarted
  Observed Resource Index:  0
  Placement Statuses:
    Cluster Name:  member1
    Conditions:
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               In the process of deciding whether to roll out the latest resources or not
      Observed Generation:   1
      Reason:                RolloutStartedUnknown
      Status:                Unknown
      Type:                  RolloutStarted
    Cluster Name:            member2
    Conditions:
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               In the process of deciding whether to roll out the latest resources or not
      Observed Generation:   1
      Reason:                RolloutStartedUnknown
      Status:                Unknown
      Type:                  RolloutStarted
  Selected Resources:
    ...
Events:         <none>

The SchedulingPolicyFulfilled condition indicates that the CRP has been fully scheduled, while the RolloutStartedUnknown condition shows that the rollout has not started.

The Placement Statuses section displays the detailed status of each cluster. Both selected clusters are in the Scheduled state, but the RolloutStarted condition is still Unknown because the rollout has not kicked off yet.

Investigate ClusterStagedUpdateRun initialization failure

An updateRun initialization failure can be easily detected by getting the resource:

$ kubectl get csur example-run 
NAME          PLACEMENT           RESOURCE-SNAPSHOT-INDEX   POLICY-SNAPSHOT-INDEX   INITIALIZED   SUCCEEDED   AGE
example-run   example-placement   1                         0                       False                     2s

The INITIALIZED field is False, indicating the initialization failed.

Describe the updateRun to get more details:

$ kubectl describe csur example-run
...
Status:
  Conditions:
    Last Transition Time:  2025-03-13T07:28:29Z
    Message:               cannot continue the ClusterStagedUpdateRun: failed to initialize the clusterStagedUpdateRun: failed to process the request due to a client error: no clusterResourceSnapshots with index `1` found for clusterResourcePlacement `example-placement`
    Observed Generation:   1
    Reason:                UpdateRunInitializedFailed
    Status:                False
    Type:                  Initialized
  Deletion Stage Status:
    Clusters:
    Stage Name:                   kubernetes-fleet.io/deleteStage
  Policy Observed Cluster Count:  2
  Policy Snapshot Index Used:     0
...

The condition clearly indicates that initialization failed, and the condition message gives more details about the failure. In this case, a nonexistent resource snapshot index (1) was specified for the updateRun.
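To find a valid index, you can list the resource snapshots that exist for the CRP (a sketch reusing the parent-CRP label shown elsewhere in this guide):

kubectl get clusterresourcesnapshots -l kubernetes-fleet.io/parent-CRP=example-placement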

Investigate ClusterStagedUpdateRun execution failure

An updateRun execution failure can be easily detected by getting the resource:

$ kubectl get csur example-run
NAME          PLACEMENT           RESOURCE-SNAPSHOT-INDEX   POLICY-SNAPSHOT-INDEX   INITIALIZED   SUCCEEDED   AGE
example-run   example-placement   0                         0                       True          False       24m

The SUCCEEDED field is False, indicating the execution failure.

An updateRun execution failure is mainly caused by two scenarios:

  1. When the updateRun controller is triggered to reconcile an in-progress updateRun, it starts by running a set of validations: retrieving the CRP and checking its rollout strategy, gathering all the bindings, and regenerating the execution plan. If any of these validations fail, the updateRun execution fails with the corresponding validation error.
    status:
      conditions:
      - lastTransitionTime: "2025-05-13T21:11:06Z"
        message: ClusterStagedUpdateRun initialized successfully
        observedGeneration: 1
        reason: UpdateRunInitializedSuccessfully
        status: "True"
        type: Initialized
      - lastTransitionTime: "2025-05-13T21:11:21Z"
        message: The stages are aborted due to a non-recoverable error
        observedGeneration: 1
        reason: UpdateRunFailed
        status: "False"
        type: Progressing
      - lastTransitionTime: "2025-05-13T22:15:23Z"
        message: 'cannot continue the ClusterStagedUpdateRun: failed to initialize the
          clusterStagedUpdateRun: failed to process the request due to a client error:
          parent clusterResourcePlacement not found'
        observedGeneration: 1
        reason: UpdateRunFailed
        status: "False"
        type: Succeeded
    
    In the above case, the CRP referenced by the updateRun was deleted during the execution. The updateRun controller detects this and aborts the release.
  2. The updateRun controller triggers an update to a member cluster by updating the corresponding binding spec and setting its status to RolloutStarted. It then waits (15 seconds by default) and checks the binding again to verify whether the resources have been successfully applied. If there are multiple concurrent updateRuns and, during the 15-second wait, some other updateRun preempts and updates the binding with a new configuration, the current updateRun detects this and fails with a clear error message.
    status:
     conditions:
     - lastTransitionTime: "2025-05-13T21:10:58Z"
       message: ClusterStagedUpdateRun initialized successfully
       observedGeneration: 1
       reason: UpdateRunInitializedSuccessfully
       status: "True"
       type: Initialized
     - lastTransitionTime: "2025-05-13T21:11:13Z"
       message: The stages are aborted due to a non-recoverable error
       observedGeneration: 1
       reason: UpdateRunFailed
       status: "False"
       type: Progressing
     - lastTransitionTime: "2025-05-13T21:11:13Z"
       message: 'cannot continue the ClusterStagedUpdateRun: unexpected behavior which
         cannot be handled by the controller: the clusterResourceBinding of the updating
         cluster `member1` in the stage `staging` does not have expected status: binding
         spec diff: binding has different resourceSnapshotName, want: example-placement-0-snapshot,
         got: example-placement-1-snapshot; binding state (want Bound): Bound; binding
         RolloutStarted (want true): true, please check if there is concurrent clusterStagedUpdateRun'
       observedGeneration: 1
       reason: UpdateRunFailed
       status: "False"
       type: Succeeded
    
    The Succeeded condition is set to False with reason UpdateRunFailed. The message shows that the member1 cluster in the staging stage got preempted: the resourceSnapshotName field changed from example-placement-0-snapshot to example-placement-1-snapshot, which means some other updateRun is probably rolling out a newer resource version. The message also prints the current binding state and whether the RolloutStarted condition is set to true, and it gives a hint about whether there is a concurrent clusterStagedUpdateRun running. Upon such a failure, the user can list updateRuns or check the binding state:
    kubectl get clusterresourcebindings
    NAME                                 WORKSYNCHRONIZED   RESOURCESAPPLIED   AGE
    example-placement-member1-2afc7d7f   True               True               51m
    example-placement-member2-fc081413                                         51m
    
    The binding is named <crp-name>-<cluster-name>-<suffix>. Since the error message says the member1 cluster failed the updateRun, we can check its binding:
    kubectl get clusterresourcebindings example-placement-member1-2afc7d7f -o yaml
    ...
    spec:
      ...
      resourceSnapshotName: example-placement-1-snapshot
      schedulingPolicySnapshotName: example-placement-0
      state: Bound
      targetCluster: member1
    status:
      conditions:
      - lastTransitionTime: "2025-05-13T21:11:06Z"
        message: 'Detected the new changes on the resources and started the rollout process,
          resourceSnapshotIndex: 1, clusterStagedUpdateRun: example-run-1'
        observedGeneration: 3
        reason: RolloutStarted
        status: "True"
        type: RolloutStarted
      ...
    
    As the binding's RolloutStarted condition shows, it was updated by another updateRun, example-run-1.

An updateRun aborted due to an execution failure is not recoverable at the moment. If the failure is caused by a validation error, you can fix the issue and create a new updateRun. If preemption happens, in most cases the user is releasing a new resource version, and they can simply let the new updateRun run to completion.

Investigate ClusterStagedUpdateRun rollout stuck

A ClusterStagedUpdateRun can get stuck when resource placement fails on some clusters. Getting the updateRun will show the cluster name and stage that are stuck:

$ kubectl get csur example-run -o yaml
...
status:
  conditions:
  - lastTransitionTime: "2025-05-13T23:15:35Z"
    message: ClusterStagedUpdateRun initialized successfully
    observedGeneration: 1
    reason: UpdateRunInitializedSuccessfully
    status: "True"
    type: Initialized
  - lastTransitionTime: "2025-05-13T23:21:18Z"
    message: The updateRun is stuck waiting for cluster member1 in stage staging to
      finish updating, please check crp status for potential errors
    observedGeneration: 1
    reason: UpdateRunStuck
    status: "False"
    type: Progressing
...

The message shows that the updateRun is stuck waiting for the cluster member1 in stage staging to finish updating. The updateRun controller rolls out resources to a member cluster by updating its corresponding binding, and then periodically checks whether the update has completed. If the binding is still not available after the wait period (currently 5 minutes by default), the updateRun controller decides the rollout is stuck and reports the condition.

This usually indicates that something went wrong on the cluster or that the resources themselves have an issue. To investigate further, you can check the ClusterResourcePlacement status:

$ kubectl describe crp example-placement
...
 Placement Statuses:
    Cluster Name:  member1
    Conditions:
      Last Transition Time:  2025-05-13T23:11:14Z
      Message:               Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2025-05-13T23:15:35Z
      Message:               Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 0, clusterStagedUpdateRun: example-run
      Observed Generation:   1
      Reason:                RolloutStarted
      Status:                True
      Type:                  RolloutStarted
      Last Transition Time:  2025-05-13T23:15:35Z
      Message:               No override rules are configured for the selected resources
      Observed Generation:   1
      Reason:                NoOverrideSpecified
      Status:                True
      Type:                  Overridden
      Last Transition Time:  2025-05-13T23:15:35Z
      Message:               All of the works are synchronized to the latest
      Observed Generation:   1
      Reason:                AllWorkSynced
      Status:                True
      Type:                  WorkSynchronized
      Last Transition Time:  2025-05-13T23:15:35Z
      Message:               All corresponding work objects are applied
      Observed Generation:   1
      Reason:                AllWorkHaveBeenApplied
      Status:                True
      Type:                  Applied
      Last Transition Time:  2025-05-13T23:15:35Z
      Message:               Work object example-placement-work-configmap-c5971133-2779-4f6f-8681-3e05c4458c82 is not yet available
      Observed Generation:   1
      Reason:                NotAllWorkAreAvailable
      Status:                False
      Type:                  Available
    Failed Placements:
      Condition:
        Last Transition Time:  2025-05-13T23:15:35Z
        Message:               Manifest is trackable but not available yet
        Observed Generation:   1
        Reason:                ManifestNotAvailableYet
        Status:                False
        Type:                  Available
      Envelope:
        Name:       envelope-nginx-deploy
        Namespace:  test-namespace
        Type:       ConfigMap
      Group:        apps
      Kind:         Deployment
      Name:         nginx
      Namespace:    test-namespace
      Version:      v1
...

The Available condition is False and reports that not all work objects are available. In the Failed Placements section, it shows that the nginx deployment wrapped by the envelope-nginx-deploy configMap is not ready. Checking the member1 cluster, we can see there's an image pull failure:

kubectl config use-context member1

kubectl get deploy -n test-namespace
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   0/1     1            0           16m

kubectl get pods -n test-namespace
NAME                     READY   STATUS         RESTARTS   AGE
nginx-69b9cb5485-sw24b   0/1     ErrImagePull   0          16m

For more debugging instructions, refer to the ClusterResourcePlacement TSG.

After resolving the issue, you can always create a new updateRun to restart the rollout. Stuck updateRuns can be deleted.
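For example, a minimal ClusterStagedUpdateRun sketch to restart the rollout; example-strategy stands in for a ClusterStagedUpdateStrategy of your own:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
  name: example-run-2
spec:
  placementName: example-placement
  resourceSnapshotIndex: "0"                  # the resource snapshot version to roll out
  stagedRolloutStrategyName: example-strategy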

11 - ClusterResourcePlacementEviction TSG

Identify and fix KubeFleet issues associated with the ClusterResourcePlacementEviction API

This guide provides troubleshooting steps for issues related to placement eviction.

An eviction object, once created, is ideally reconciled only once and reaches a terminal state. The terminal states for an eviction are:

  • Eviction is Invalid
  • Eviction is Valid, Eviction failed to Execute
  • Eviction is Valid, Eviction executed successfully

Note: If an eviction object doesn't reach a terminal state, i.e., neither the Valid condition nor the Executed condition is set, it is likely due to a failure in the reconciliation process, where the controller is unable to reach the API server.

The first step in troubleshooting is to check the status of the eviction object to understand if the eviction reached a terminal state or not.

Invalid eviction

Missing/Deleting CRP object

Example status with missing CRP object:

status:
  conditions:
  - lastTransitionTime: "2025-04-17T22:16:59Z"
    message: Failed to find ClusterResourcePlacement targeted by eviction
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionInvalid
    status: "False"
    type: Valid

Example status with deleting CRP object:

status:
  conditions:
  - lastTransitionTime: "2025-04-21T19:53:42Z"
    message: Found deleting ClusterResourcePlacement targeted by eviction
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionInvalid
    status: "False"
    type: Valid

In both cases the Eviction object reached a terminal state: its status has the Valid condition set to False. The user should verify whether the ClusterResourcePlacement object is missing or being deleted, recreate it if needed, and then retry the eviction.

Missing CRB object

Example status with missing CRB object:

status:
  conditions:
  - lastTransitionTime: "2025-04-17T22:21:51Z"
    message: Failed to find scheduler decision for placement in cluster targeted by
      eviction
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionInvalid
    status: "False"
    type: Valid

Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:

kubectl get clusterresourcebindings -l kubernetes-fleet.io/parent-CRP=<CRPName>

The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-randomsuffix.

In this case the Eviction object reached a terminal state with its Valid condition set to False, because the ClusterResourceBinding object (the placement for the target cluster) is not found. The user should verify whether the ClusterResourcePlacement object is propagating resources to the target cluster:

  • If yes, the next step is to check whether the ClusterResourceBinding object is present for the target cluster, or why it was not created, and create an eviction object again once the ClusterResourceBinding is created.
  • If no, the cluster is not picked by the scheduler, and hence there is no need to retry the eviction.

Multiple CRB is present

Example status with multiple CRB objects:

status:
  conditions:
  - lastTransitionTime: "2025-04-17T23:48:08Z"
    message: Found more than one scheduler decision for placement in cluster targeted
      by eviction
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionInvalid
    status: "False"
    type: Valid

In this case the Eviction object reached a terminal state with its Valid condition set to False, because there is more than one ClusterResourceBinding object (placement) present for the ClusterResourcePlacement object targeting the member cluster. This is a rare scenario; it's an in-between state where bindings are being re-created because the member cluster is selected again, and it normally resolves quickly.

PickFixed CRP is targeted by CRP Eviction

Example status for ClusterResourcePlacementEviction object targeting a PickFixed ClusterResourcePlacement object:

status:
  conditions:
  - lastTransitionTime: "2025-04-21T23:19:06Z"
    message: Found ClusterResourcePlacement with PickFixed placement type targeted
      by eviction
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionInvalid
    status: "False"
    type: Valid

In this case the Eviction object reached a terminal state with its Valid condition set to False, because the ClusterResourcePlacement object is of type PickFixed. Users cannot use ClusterResourcePlacementEviction objects to evict resources propagated by ClusterResourcePlacement objects of type PickFixed. The user can instead remove the member cluster name from the clusterNames field in the policy of the ClusterResourcePlacement object.
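A sketch of that edit on the CRP policy (cluster names reuse this guide's examples; adjust to your own):

spec:
  policy:
    placementType: PickFixed
    clusterNames:
    - member2        # member1 removed, so its placement is withdrawn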

Failed to execute eviction

Eviction blocked because placement is missing

status:
  conditions:
  - lastTransitionTime: "2025-04-23T23:54:03Z"
    message: Eviction is valid
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionValid
    status: "True"
    type: Valid
  - lastTransitionTime: "2025-04-23T23:54:03Z"
    message: Eviction is blocked, placement has not propagated resources to target
      cluster yet
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionNotExecuted
    status: "False"
    type: Executed

In this case the Eviction object reached a terminal state with its Executed condition set to False, because for the targeted ClusterResourcePlacement the corresponding ClusterResourceBinding object's state is Scheduled, meaning the rollout of resources has not started yet.

Note: The user can find the corresponding ClusterResourceBinding object by listing all ClusterResourceBinding objects for the ClusterResourcePlacement object:

kubectl get clusterresourcebindings -l kubernetes-fleet.io/parent-CRP=<CRPName>

The ClusterResourceBinding object name is formatted as <CRPName>-<ClusterName>-randomsuffix.

spec:
  applyStrategy:
    type: ClientSideApply
  clusterDecision:
    clusterName: kind-cluster-3
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: 'Successfully scheduled resources for placement in "kind-cluster-3" (affinity
      score: 0, topology spread score: 0): picked by scheduling policy'
    selected: true
  resourceSnapshotName: ""
  schedulingPolicySnapshotName: test-crp-1
  state: Scheduled
  targetCluster: kind-cluster-3

Here the user can wait for the ClusterResourceBinding object to be updated to the Bound state, which means that resources have been propagated to the target cluster, and then retry the eviction. In some cases this can take a while or not happen at all; in that case, the user should verify whether the rollout is stuck for the ClusterResourcePlacement object.

Eviction blocked by Invalid CRPDB

Example status for a ClusterResourcePlacementEviction object with an invalid ClusterResourcePlacementDisruptionBudget:

status:
  conditions:
  - lastTransitionTime: "2025-04-21T23:39:42Z"
    message: Eviction is valid
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionValid
    status: "True"
    type: Valid
  - lastTransitionTime: "2025-04-21T23:39:42Z"
    message: Eviction is blocked by misconfigured ClusterResourcePlacementDisruptionBudget,
      either MaxUnavailable is specified or MinAvailable is specified as a percentage
      for PickAll ClusterResourcePlacement
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionNotExecuted
    status: "False"
    type: Executed

In this case the Eviction object reached a terminal state with its Executed condition set to False, because the ClusterResourcePlacementDisruptionBudget object is invalid. For ClusterResourcePlacement objects of type PickAll, when specifying a ClusterResourcePlacementDisruptionBudget, the minAvailable field should be set to an absolute number and not a percentage, and the maxUnavailable field should not be set, since the total number of placements is non-deterministic.
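A sketch of a valid budget for a PickAll CRP (a ClusterResourcePlacementDisruptionBudget applies to the CRP bearing the same name):

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  name: pick-all-crp   # matches the name of the CRP it protects
spec:
  minAvailable: 1      # absolute number; no percentages or maxUnavailable for PickAll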

Eviction blocked by specified CRPDB

Example status for a ClusterResourcePlacementEviction object blocked by a ClusterResourcePlacementDisruptionBudget object:

status:
  conditions:
  - lastTransitionTime: "2025-04-24T18:54:30Z"
    message: Eviction is valid
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionValid
    status: "True"
    type: Valid
  - lastTransitionTime: "2025-04-24T18:54:30Z"
    message: 'Eviction is blocked by specified ClusterResourcePlacementDisruptionBudget,
      availablePlacements: 2, totalPlacements: 2'
    observedGeneration: 1
    reason: ClusterResourcePlacementEvictionNotExecuted
    status: "False"
    type: Executed

In this case the Eviction object reached a terminal state with its Executed condition set to False, because the ClusterResourcePlacementDisruptionBudget object is blocking the eviction. The message from the Executed condition reads that 2 placements are available out of 2 total, which means that the ClusterResourcePlacementDisruptionBudget is protecting all placements propagated by the ClusterResourcePlacement object.

Taking a look at the ClusterResourcePlacementDisruptionBudget object:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"placement.kubernetes-fleet.io/v1beta1","kind":"ClusterResourcePlacementDisruptionBudget","metadata":{"annotations":{},"name":"pick-all-crp"},"spec":{"minAvailable":2}}
  creationTimestamp: "2025-04-24T18:47:22Z"
  generation: 1
  name: pick-all-crp
  resourceVersion: "1749"
  uid: 7d3a0ac5-0225-4fb6-b5e9-fc28d58cefdc
spec:
  minAvailable: 2

We can see that the minAvailable is set to 2, which means that at least 2 placements should be available for the ClusterResourcePlacement object.

Let’s take a look at the ClusterResourcePlacement object’s status to verify the list of available placements:

status:
  conditions:
  - lastTransitionTime: "2025-04-24T18:46:38Z"
    message: found all cluster needed as specified by the scheduling policy, found
      2 cluster(s)
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2025-04-24T18:50:19Z"
    message: All 2 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2025-04-24T18:50:19Z"
    message: No override rules are configured for the selected resources
    observedGeneration: 1
    reason: NoOverrideSpecified
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2025-04-24T18:50:19Z"
    message: Works(s) are succcesfully created or updated in 2 target cluster(s)'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2025-04-24T18:50:19Z"
    message: The selected resources are successfully applied to 2 cluster(s)
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2025-04-24T18:50:19Z"
    message: The selected resources in 2 cluster(s) are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: 'Successfully scheduled resources for placement in "kind-cluster-1"
        (affinity score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2025-04-24T18:50:19Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  - clusterName: kind-cluster-2
    conditions:
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: 'Successfully scheduled resources for placement in "kind-cluster-2"
        (affinity score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: No override rules are configured for the selected resources
      observedGeneration: 1
      reason: NoOverrideSpecified
      status: "True"
      type: Overridden
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2025-04-24T18:46:38Z"
      message: All corresponding work objects are available
      observedGeneration: 1
      reason: AllWorkAreAvailable
      status: "True"
      type: Available
  selectedResources:
  - kind: Namespace
    name: test-ns
    version: v1

From the status we can see that the ClusterResourcePlacement object has 2 placements available: resources have been successfully applied and are available in kind-cluster-1 and kind-cluster-2. Users can check the individual member clusters to verify that the resources are available, but it is recommended to check the ClusterResourcePlacement object status to verify placement availability, since that status is aggregated and updated by the controller.

Here the user can either remove the ClusterResourcePlacementDisruptionBudget object or update minAvailable to 1 to allow the ClusterResourcePlacementEviction object to execute successfully. In general, the user should carefully check the availability of placements and act accordingly when changing the ClusterResourcePlacementDisruptionBudget object.
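For instance, a one-line sketch to lower the budget from the example above so the eviction can proceed:

kubectl patch clusterresourceplacementdisruptionbudget pick-all-crp --type merge -p '{"spec":{"minAvailable":1}}'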