Concepts

Core concepts in Fleet

The documentation in this section explains core Fleet concepts. Pick one below to proceed.

1 - Fleet components

Concept about the Fleet components

Components

This document provides an overview of the components required for a fully functional and operational Fleet setup.

The fleet consists of the following components:

  • fleet-hub-agent is a Kubernetes controller that creates and reconciles all the fleet-related CRs in the hub cluster.
  • fleet-member-agent is a Kubernetes controller that creates and reconciles all the fleet-related CRs in the member cluster. The fleet-member-agent pulls the latest CRs from the hub cluster and continuously reconciles the member cluster to the desired state.

Fleet implements an agent-based pull model. This distributes the work across the member clusters, which removes the scalability bottleneck of a single control plane by dividing the load among the member clusters. It also means the hub cluster never needs direct access to the member clusters, so Fleet can support member clusters that have only outbound network connectivity and no inbound access.

To allow multiple clusters to run securely, Fleet creates a reserved namespace on the hub cluster for each member cluster to isolate access permissions and resources across clusters.

2 - MemberCluster

Concept about the MemberCluster API

Overview

A fleet is an implementation of a ClusterSet with the following attributes:

  • A collective of clusters managed by a centralized authority.
  • Typically characterized by a high level of mutual trust within the cluster set.
  • Embraces the principle of Namespace Sameness across clusters:
    • Ensures uniform permissions and characteristics for a given namespace across all clusters.
    • While not mandatory for every cluster, namespaces exhibit consistent behavior across those where they are present.

The MemberCluster represents a cluster-scoped API established within the hub cluster, serving as a representation of a cluster within the fleet. This API offers a dependable, uniform, and automated approach for multi-cluster applications (frameworks, toolsets) to identify registered clusters within a fleet. Additionally, it facilitates applications in querying a list of clusters managed by the fleet or observing cluster statuses for subsequent actions.

Some illustrative use cases encompass:

  • The Fleet Scheduler utilizing managed cluster statuses or specific cluster properties (e.g., labels, taints) of a MemberCluster for resource scheduling.
  • Automation tools like GitOps systems (e.g., ArgoCD or Flux) automatically registering/deregistering clusters in compliance with the MemberCluster API.
  • The MCS API automatically generating ServiceImport CRs based on the MemberCluster CR defined within a fleet.

Moreover, it furnishes a user-friendly interface for human operators to monitor the managed clusters.

MemberCluster Lifecycle

Joining the Fleet

The process of joining the fleet involves creating a MemberCluster. The MemberCluster controller, part of the hub agent described in the Components concept, watches the MemberCluster CR and generates a corresponding namespace for the member cluster within the hub cluster. It configures roles and role bindings within the hub cluster, authorizing the specified member cluster identity (as detailed in the MemberCluster spec) to access resources only within that namespace. To collect member cluster status, the controller generates another internal CR named InternalMemberCluster within the newly created namespace. In turn, the InternalMemberCluster controller, part of the member agent running in the member cluster, gathers statistics on cluster usage, such as capacity utilization, and reports its status based on the HeartbeatPeriodSeconds specified in the CR. Meanwhile, the MemberCluster controller consolidates agent statuses and marks the cluster as Joined.
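
For illustration, a minimal MemberCluster object might look like the sketch below. The identity and heartbeatPeriodSeconds fields correspond to the member cluster identity and heartbeat period described above; the names used here are made up, so treat the exact shape as an assumption and consult the MemberCluster API reference for the authoritative definition.

apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata:
  name: aks-member-1
spec:
  # Identity that the member agent uses to access the cluster's reserved
  # namespace on the hub cluster (names below are hypothetical).
  identity:
    kind: ServiceAccount
    name: fleet-member-agent-sa
    namespace: fleet-system
  # How often the member agent reports status back to the hub cluster.
  heartbeatPeriodSeconds: 60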

Leaving the Fleet

Fleet administrators can deregister a cluster by deleting the MemberCluster CR. When the MemberCluster controller on the hub cluster detects the deletion event, it removes the corresponding InternalMemberCluster CR in the member cluster’s reserved namespace. It waits for the InternalMemberCluster controller of the member agent to complete the “leave” process, and then deletes the roles, role bindings, and other resources, including the member cluster’s reserved namespace on the hub cluster.

Taints

Taints are a mechanism to prevent the Fleet Scheduler from scheduling resources to a MemberCluster. We adopt the concept of taints and tolerations introduced in Kubernetes to the multi-cluster use case.

The MemberCluster CR supports the specification of a list of taints, which are applied to the MemberCluster. Each Taint object comprises the following fields:

  • key: The key of the taint.
  • value: The value of the taint.
  • effect: The effect of the taint, which can be NoSchedule for now.

Once a MemberCluster is tainted with a specific taint, it lets the Fleet Scheduler know that the MemberCluster should not receive resources as part of the workload propagation from the hub cluster.

The NoSchedule taint is a signal to the Fleet Scheduler to avoid scheduling resources from a ClusterResourcePlacement to the MemberCluster. Any MemberCluster already selected for resource propagation will continue to receive resources even if a new taint is added.

Taints are only honored by ClusterResourcePlacements with the PickAll or PickN placement policies. For the PickFixed placement policy, taints are ignored because the user has explicitly specified the MemberClusters where the resources should be placed.
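
As an illustrative sketch (the cluster name and taint key/value are made up), a MemberCluster carrying a NoSchedule taint might look like this:

apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata:
  name: aks-member-1
spec:
  # Other required fields (e.g., the member cluster identity) are omitted for brevity.
  taints:
    # Tell the Fleet Scheduler not to place new workloads on this cluster.
    - key: environment
      value: sandbox
      effect: NoSchedule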

For detailed instructions, please refer to this document.

What’s next

3 - ClusterResourcePlacement

Concept about the ClusterResourcePlacement API

Overview

The ClusterResourcePlacement concept is used to dynamically select cluster-scoped resources (especially namespaces and all objects within them) and control how they are propagated to all or a subset of the member clusters. A ClusterResourcePlacement mainly consists of three parts:

  • Resource selection: select which cluster-scoped Kubernetes resource objects need to be propagated from the hub cluster to selected member clusters.

    It supports the following forms of resource selection:

    • Select resources by specifying just the <group, version, kind>. This selection propagates all resources with matching <group, version, kind>.
    • Select resources by specifying the <group, version, kind> and name. This selection propagates only one resource that matches the <group, version, kind> and name.
    • Select resources by specifying the <group, version, kind> and a set of labels using ClusterResourcePlacement -> LabelSelector. This selection propagates all resources that match the <group, version, kind> and the labels specified.

    Note: When a namespace is selected, all the namespace-scoped objects under this namespace are propagated to the selected member clusters along with this namespace.

  • Placement policy: limit propagation of selected resources to a specific subset of member clusters. The following types of target cluster selection are supported:

    • PickAll (Default): select any member clusters with matching cluster Affinity scheduling rules. If the Affinity is not specified, it will select all joined and healthy member clusters.
    • PickFixed: select a fixed list of member clusters defined in the ClusterNames.
    • PickN: select a NumberOfClusters of member clusters with optional matching cluster Affinity scheduling rules or topology spread constraints TopologySpreadConstraints.
  • Strategy: how changes are rolled out (rollout strategy) and how resources are applied on the member cluster side (apply strategy).

A simple ClusterResourcePlacement looks like this:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: crp-1
spec:
  policy:
    placementType: PickN
    numberOfClusters: 2
    topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: "env"
        whenUnsatisfiable: DoNotSchedule
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: test-deployment
      version: v1
  revisionHistoryLimit: 100
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
      unavailablePeriodSeconds: 5
    type: RollingUpdate
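
For comparison, the sketch below selects resources by label and pins the placement to a fixed set of clusters with the PickFixed policy. The names are illustrative, and the exact field spellings (labelSelector, clusterNames) are assumptions based on the selection options described above; check the API reference if in doubt.

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: crp-2
spec:
  resourceSelectors:
    # Select every namespace that carries this label.
    - group: ""
      kind: Namespace
      version: v1
      labelSelector:
        matchLabels:
          app: my-app
  policy:
    # Place the selected resources on an explicit list of member clusters.
    placementType: PickFixed
    clusterNames:
      - aks-member-1
      - aks-member-2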

When To Use ClusterResourcePlacement

ClusterResourcePlacement is useful when you want a general way of managing and running workloads across multiple clusters. Some example scenarios include the following:

  • As a platform operator, I want to place my cluster-scoped resources (especially namespaces and all objects within them) on a cluster that resides in the us-east-1 region.
  • As a platform operator, I want to spread my cluster-scoped resources (especially namespaces and all objects within them) evenly across the different regions/zones.
  • As a platform operator, I prefer to place my test resources into the staging AKS cluster.
  • As a platform operator, I would like to separate the workloads for compliance or policy reasons.
  • As a developer, I want to run my cluster-scoped resources (especially namespaces and all objects within them) on 3 clusters. In addition, each time I update my workloads, the updates take place with zero downtime by rolling out to these three clusters incrementally.

Placement Workflow

The placement controller creates ClusterSchedulingPolicySnapshot and ClusterResourceSnapshot snapshots by watching the ClusterResourcePlacement object, so that it can trigger the scheduling and resource rollout process whenever needed.

The override controller creates the corresponding snapshots by watching the ClusterResourceOverride and ResourceOverride objects, capturing a snapshot of the overrides.

The placement workflow will be divided into several stages:

  1. Scheduling: the multi-cluster scheduler makes the scheduling decision by creating a ClusterResourceBinding for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement.
  2. Rolling out resources: rollout controller applies the resources to the selected member clusters based on the rollout strategy.
  3. Overriding: work generator applies the override rules defined by ClusterResourceOverride and ResourceOverride to the selected resources on the target clusters.
  4. Creating or updating works: work generator creates the work on the corresponding member cluster namespace. Each work contains the (overridden) manifest workload to be deployed on the member clusters.
  5. Applying resources on target clusters: apply work controller applies the manifest workload on the member clusters.
  6. Checking resource availability: apply work controller checks the resource availability on the target clusters.

Resource Selection

Resource selectors identify cluster-scoped objects to include based on standard Kubernetes identifiers - namely, the group, kind, version, and name of the object. Namespace-scoped objects are included automatically when the namespace they are part of is selected. The example ClusterResourcePlacement above would include the test-deployment namespace and any objects that were created in that namespace.

The clusterResourcePlacement controller creates a ClusterResourceSnapshot to store a snapshot of the resources selected by the placement. The ClusterResourceSnapshot spec is immutable. Each time the selected resources are updated, the clusterResourcePlacement controller detects the change and creates a new ClusterResourceSnapshot. In other words, resources can change independently of any modifications to an existing ClusterResourceSnapshot.

The total amount of selected resources may exceed the 1MB limit for a single Kubernetes object. As a result, the controller may produce more than one ClusterResourceSnapshot for all the selected resources.

ClusterResourceSnapshot sample:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
  annotations:
    kubernetes-fleet.io/number-of-enveloped-object: "0"
    kubernetes-fleet.io/number-of-resource-snapshots: "1"
    kubernetes-fleet.io/resource-hash: e0927e7d75c7f52542a6d4299855995018f4a6de46edf0f814cfaa6e806543f3
  creationTimestamp: "2023-11-10T08:23:38Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-1
    kubernetes-fleet.io/resource-index: "4"
  name: crp-1-4-snapshot
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-1
    uid: 757f2d2c-682f-433f-b85c-265b74c3090b
  resourceVersion: "1641940"
  uid: d6e2108b-882b-4f6c-bb5e-c5ec5491dd20
spec:
  selectedResources:
  - apiVersion: v1
    kind: Namespace
    metadata:
      labels:
        kubernetes.io/metadata.name: test
      name: test
    spec:
      finalizers:
      - kubernetes
  - apiVersion: v1
    data:
      key1: value1
      key2: value2
      key3: value3
    kind: ConfigMap
    metadata:
      name: test-1
      namespace: test

Placement Policy

ClusterResourcePlacement supports three types of policy, as mentioned above. A ClusterSchedulingPolicySnapshot is generated whenever policy changes are made to the ClusterResourcePlacement that require new scheduling. Similar to ClusterResourceSnapshot, its spec is immutable.

ClusterSchedulingPolicySnapshot sample:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterSchedulingPolicySnapshot
metadata:
  annotations:
    kubernetes-fleet.io/CRP-generation: "5"
    kubernetes-fleet.io/number-of-clusters: "2"
  creationTimestamp: "2023-11-06T10:22:56Z"
  generation: 1
  labels:
    kubernetes-fleet.io/is-latest-snapshot: "true"
    kubernetes-fleet.io/parent-CRP: crp-1
    kubernetes-fleet.io/policy-index: "1"
  name: crp-1-1
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: ClusterResourcePlacement
    name: crp-1
    uid: 757f2d2c-682f-433f-b85c-265b74c3090b
  resourceVersion: "1639412"
  uid: 768606f2-aa5a-481a-aa12-6e01e6adbea2
spec:
  policy:
    placementType: PickN
  policyHash: NDc5ZjQwNWViNzgwOGNmYzU4MzY2YjI2NDg2ODBhM2E4MTVlZjkxNGZlNjc1NmFlOGRmMGQ2Zjc0ODg1NDE2YQ==
status:
  conditions:
  - lastTransitionTime: "2023-11-06T10:22:56Z"
    message: found all the clusters needed as specified by the scheduling policy
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: Scheduled
  observedCRPGeneration: 5
  targetClusters:
  - clusterName: aks-member-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  - clusterName: aks-member-2
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true

In contrast to the original scheduler framework in Kubernetes, the multi-cluster scheduling process involves selecting a cluster for placement through a structured 5-step operation:

  1. Batch & PostBatch
  2. Filter
  3. Score
  4. Sort
  5. Bind

The batch & postBatch step defines the batch size according to the desired and current ClusterResourceBinding. The postBatch step adjusts the batch size if needed.

The filter step finds the set of clusters where it’s feasible to schedule the placement, for example, whether the cluster is matching required Affinity scheduling rules specified in the Policy. It also filters out any clusters which are leaving the fleet or no longer connected to the fleet, for example, its heartbeat has been stopped for a prolonged period of time.

In the score step (only applied to the pickN type), the scheduler assigns a score to each cluster that survived filtering. Each cluster is given a topology spread score (how much a cluster would satisfy the topology spread constraints specified by the user), and an affinity score (how much a cluster would satisfy the preferred affinity terms specified by the user).

In the sort step (only applied to the pickN type), it sorts all eligible clusters by their scores, sorting first by topology spread score and breaking ties based on the affinity score.

The bind step is to create/update/delete the ClusterResourceBinding based on the desired and current member cluster list.

Strategy

Rollout strategy

Use rollout strategy to control how KubeFleet rolls out a resource change made on the hub cluster to all member clusters. Right now KubeFleet supports two types of rollout strategies out of the box:

  • Rolling update: this rollout strategy helps roll out changes incrementally in a way that ensures system availability, akin to how the Kubernetes Deployment API handles updates. For more information, see the Safe Rollout concept.
  • Staged update: this rollout strategy helps roll out changes in different stages; users may group clusters into different stages and specify the order in which each stage receives the update. The strategy also allows users to set up timed or approval-based gates between stages to fine-control the flow. For more information, see the Staged Update concept and Staged Update How-To Guide.

Apply strategy

Use apply strategy to control how KubeFleet applies a resource to a member cluster. KubeFleet currently features three different types of apply strategies:

  • Client-side apply: this apply strategy sets up KubeFleet to apply resources in a three-way merge that is similar to how the Kubernetes CLI, kubectl, performs client-side apply.
  • Server-side apply: this apply strategy sets up KubeFleet to apply resources via the new server-side apply mechanism.
  • Report Diff mode: this apply strategy instructs KubeFleet to check for configuration differences between the resource on the hub cluster and its counterparts among the member clusters; no apply op will be performed. For more information, see the ReportDiff Mode How-To Guide.

To learn more about the differences between client-side apply and server-side apply, see also the Kubernetes official documentation.
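
As a minimal sketch, switching a placement to server-side apply might look like the snippet below. It assumes the apply strategy is configured under the placement's strategy field and that the type values mirror the options listed above; see the how-to guides for the authoritative field names.

spec:
  strategy:
    type: RollingUpdate
    applyStrategy:
      # One of ClientSideApply, ServerSideApply, or ReportDiff, per the options above.
      type: ServerSideApply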

KubeFleet apply strategy is also the place where users can set up KubeFleet’s drift detection capabilities and takeover settings:

  • Drift detection helps users identify and resolve configuration drifts that are commonly observed in a multi-cluster environment; through this feature, KubeFleet can detect the presence of drifts, reveal their details, and let users decide how and when to handle them. See the Drift Detection How-To Guide for more information.
  • Takeover settings allow users to decide how KubeFleet can best handle pre-existing resources. When you join a cluster with running workloads into a fleet, these settings can help bring the workloads under KubeFleet’s management in a way that avoids interruptions. For specifics, see the Takeover Settings How-To Guide.

Placement status

After a ClusterResourcePlacement is created, details on current status can be seen by performing a kubectl describe crp <name>. The status output will indicate both placement conditions and individual placement statuses on each member cluster that was selected. The list of resources that are selected for placement will also be included in the describe output.

Sample output:

Name:         crp-1
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  placement.kubernetes-fleet.io/v1
Kind:         ClusterResourcePlacement
Metadata:
  ...
Spec:
  Policy:
    Placement Type:  PickAll
  Resource Selectors:
    Group:
    Kind:                  Namespace
    Name:                  application-1
    Version:               v1
  Revision History Limit:  10
  Strategy:
    Rolling Update:
      Max Surge:                   25%
      Max Unavailable:             25%
      Unavailable Period Seconds:  2
    Type:                          RollingUpdate
Status:
  Conditions:
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                found all the clusters needed as specified by the scheduling policy
    Observed Generation:    1
    Reason:                 SchedulingPolicyFulfilled
    Status:                 True
    Type:                   ClusterResourcePlacementScheduled
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                All 3 cluster(s) start rolling out the latest resource
    Observed Generation:    1
    Reason:                 RolloutStarted
    Status:                 True
    Type:                   ClusterResourcePlacementRolloutStarted
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                No override rules are configured for the selected resources
    Observed Generation:    1
    Reason:                 NoOverrideSpecified
    Status:                 True
    Type:                   ClusterResourcePlacementOverridden
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                Work(s) are successfully created or updated in the 3 target clusters' namespaces
    Observed Generation:    1
    Reason:                 WorkSynchronized
    Status:                 True
    Type:                   ClusterResourcePlacementWorkSynchronized
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                The selected resources are successfully applied to 3 clusters
    Observed Generation:    1
    Reason:                 ApplySucceeded
    Status:                 True
    Type:                   ClusterResourcePlacementApplied
    Last Transition Time:   2024-04-29T09:58:20Z
    Message:                The selected resources in 3 cluster are available now
    Observed Generation:    1
    Reason:                 ResourceAvailable
    Status:                 True
    Type:                   ClusterResourcePlacementAvailable
  Observed Resource Index:  0
  Placement Statuses:
    Cluster Name:  kind-cluster-1
    Conditions:
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Successfully scheduled resources for placement in kind-cluster-1 (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Detected the new changes on the resources and started the rollout process
      Observed Generation:   1
      Reason:                RolloutStarted
      Status:                True
      Type:                  RolloutStarted
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               No override rules are configured for the selected resources
      Observed Generation:   1
      Reason:                NoOverrideSpecified
      Status:                True
      Type:                  Overridden
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All of the works are synchronized to the latest
      Observed Generation:   1
      Reason:                AllWorkSynced
      Status:                True
      Type:                  WorkSynchronized
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All corresponding work objects are applied
      Observed Generation:   1
      Reason:                AllWorkHaveBeenApplied
      Status:                True
      Type:                  Applied
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               The availability of work object crp-1-work is not trackable
      Observed Generation:   1
      Reason:                WorkNotTrackable
      Status:                True
      Type:                  Available
    Cluster Name:            kind-cluster-2
    Conditions:
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Successfully scheduled resources for placement in kind-cluster-2 (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Detected the new changes on the resources and started the rollout process
      Observed Generation:   1
      Reason:                RolloutStarted
      Status:                True
      Type:                  RolloutStarted
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               No override rules are configured for the selected resources
      Observed Generation:   1
      Reason:                NoOverrideSpecified
      Status:                True
      Type:                  Overridden
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All of the works are synchronized to the latest
      Observed Generation:   1
      Reason:                AllWorkSynced
      Status:                True
      Type:                  WorkSynchronized
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All corresponding work objects are applied
      Observed Generation:   1
      Reason:                AllWorkHaveBeenApplied
      Status:                True
      Type:                  Applied
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               The availability of work object crp-1-work is not trackable
      Observed Generation:   1
      Reason:                WorkNotTrackable
      Status:                True
      Type:                  Available
    Cluster Name:            kind-cluster-3
    Conditions:
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Successfully scheduled resources for placement in kind-cluster-3 (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               Detected the new changes on the resources and started the rollout process
      Observed Generation:   1
      Reason:                RolloutStarted
      Status:                True
      Type:                  RolloutStarted
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               No override rules are configured for the selected resources
      Observed Generation:   1
      Reason:                NoOverrideSpecified
      Status:                True
      Type:                  Overridden
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All of the works are synchronized to the latest
      Observed Generation:   1
      Reason:                AllWorkSynced
      Status:                True
      Type:                  WorkSynchronized
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               All corresponding work objects are applied
      Observed Generation:   1
      Reason:                AllWorkHaveBeenApplied
      Status:                True
      Type:                  Applied
      Last Transition Time:  2024-04-29T09:58:20Z
      Message:               The availability of work object crp-1-work is not trackable
      Observed Generation:   1
      Reason:                WorkNotTrackable
      Status:                True
      Type:                  Available
  Selected Resources:
    Kind:       Namespace
    Name:       application-1
    Version:    v1
    Kind:       ConfigMap
    Name:       app-config-1
    Namespace:  application-1
    Version:    v1
Events:
  Type    Reason                        Age    From                                   Message
  ----    ------                        ----   ----                                   -------
  Normal  PlacementRolloutStarted       3m46s  cluster-resource-placement-controller  Started rolling out the latest resources
  Normal  PlacementOverriddenSucceeded  3m46s  cluster-resource-placement-controller  Placement has been successfully overridden
  Normal  PlacementWorkSynchronized     3m46s  cluster-resource-placement-controller  Work(s) have been created or updated successfully for the selected cluster(s)
  Normal  PlacementApplied              3m46s  cluster-resource-placement-controller  Resources have been applied to the selected cluster(s)
  Normal  PlacementRolloutCompleted     3m46s  cluster-resource-placement-controller  Resources are available in the selected clusters

Tolerations

Tolerations are a mechanism to allow the Fleet Scheduler to schedule resources to a MemberCluster that has taints specified on it. We adopt the concept of taints & tolerations introduced in Kubernetes to the multi-cluster use case.

The ClusterResourcePlacement CR supports the specification of a list of tolerations, which are applied to the ClusterResourcePlacement object. Each Toleration object comprises the following fields:

  • key: The key of the toleration.
  • value: The value of the toleration.
  • effect: The effect of the toleration, which can be NoSchedule for now.
  • operator: The operator of the toleration, which can be Exists or Equal.

Each toleration is used to tolerate one or more specific taints applied on the MemberCluster. Once all taints on a MemberCluster are tolerated by tolerations on a ClusterResourcePlacement, resources can be propagated to the MemberCluster by the scheduler for that ClusterResourcePlacement resource.

Tolerations cannot be updated or removed from a ClusterResourcePlacement. If a toleration needs to change, the better approach is to add another toleration. If an existing toleration absolutely must be updated or removed, the only option is to delete the existing ClusterResourcePlacement and create a new object with the updated tolerations.
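
As an illustrative sketch (assuming tolerations are specified under the placement policy, with the taint key and value made up), a ClusterResourcePlacement that tolerates a NoSchedule taint with key environment and value sandbox might look like this:

apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: crp-with-toleration
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1
      name: test-ns
  policy:
    placementType: PickAll
    tolerations:
      # Tolerate MemberClusters tainted with environment=sandbox:NoSchedule.
      - key: environment
        operator: Equal
        value: sandbox
        effect: NoSchedule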

For detailed instructions, please refer to this document.

Envelope Object

The ClusterResourcePlacement leverages the fleet hub cluster as a staging environment for customer resources. These resources are then propagated to member clusters that are part of the fleet, based on the ClusterResourcePlacement spec.

In essence, the objective is not to apply or create resources on the hub cluster for local use but to propagate these resources to other member clusters within the fleet.

Certain resources, when created or applied on the hub cluster, may lead to unintended side effects. These include:

  • Validating/Mutating Webhook Configurations
  • Cluster Role Bindings
  • Resource Quotas
  • Storage Classes
  • Flow Schemas
  • Priority Classes
  • Ingress Classes
  • Ingresses
  • Network Policies

To address this, we support the use of a ConfigMap with a fleet-reserved annotation. This allows users to encapsulate resources that might have side effects on the hub cluster within the ConfigMap. For detailed instructions, please refer to this document.
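
As a hedged sketch, an envelope ConfigMap wrapping a ResourceQuota might look like the example below. The annotation key shown is an assumption about the fleet-reserved annotation; refer to the linked document for the exact key and the supported data layout.

apiVersion: v1
kind: ConfigMap
metadata:
  name: envelope-quota
  namespace: test-ns
  annotations:
    # Fleet-reserved annotation marking this ConfigMap as an envelope object
    # (the key shown here is an assumption; see the linked document).
    kubernetes-fleet.io/envelope-configmap: "true"
data:
  # The wrapped resource is carried as data, so it is never applied on the hub
  # cluster; it is only unpacked and applied on the member clusters.
  resourceQuota.yaml: |
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: mem-cpu-quota
      namespace: test-ns
    spec:
      hard:
        cpu: "1"
        memory: 1Gi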

4 - Scheduler

Concept about the Fleet scheduler

The scheduler component is a vital element in Fleet workload scheduling. Its primary responsibility is to make the scheduling decision for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement. By default, the scheduler operates in batch mode, which enhances performance. In this mode, it binds a ClusterResourceBinding from a ClusterResourcePlacement to multiple clusters whenever possible.

Batch in nature

Scheduling resources within a ClusterResourcePlacement involves more dependencies compared with scheduling pods within a deployment in Kubernetes. There are two notable distinctions:

  1. In a ClusterResourcePlacement, multiple replicas of resources cannot be scheduled on the same cluster, whereas pods belonging to the same deployment in Kubernetes can run on the same node.
  2. The ClusterResourcePlacement supports different placement types within a single object.

These requirements necessitate treating the scheduling policy as a whole and feeding it to the scheduler, as opposed to handling individual pods as Kubernetes does today. Specifically:

  1. Scheduling the entire ClusterResourcePlacement at once enables us to increase the parallelism of the scheduler if needed.
  2. Supporting the PickAll mode would otherwise require generating a replica for each cluster in the fleet and feeding each one to the scheduler. This approach is not only inefficient but can also result in the scheduler repeatedly attempting to schedule unassigned replicas when there is no possibility of placing them.
  3. To support the PickN mode, the scheduler needs to compute the filtering and scoring for each replica. Conversely, in batch mode, these calculations are performed once: the scheduler sorts all the eligible clusters and picks the top N clusters.

Placement Decisions

The output of the scheduler is an array of ClusterResourceBindings on the hub cluster.

ClusterResourceBinding sample:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourceBinding
metadata:
  annotations:
    kubernetes-fleet.io/previous-binding-state: Bound
  creationTimestamp: "2023-11-06T09:53:11Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 8
  labels:
    kubernetes-fleet.io/parent-CRP: crp-1
  name: crp-1-aks-member-1-2f8fe606
  resourceVersion: "1641949"
  uid: 3a443dec-a5ad-4c15-9c6d-05727b9e1d15
spec:
  clusterDecision:
    clusterName: aks-member-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-1-4-snapshot
  schedulingPolicySnapshotName: crp-1-1
  state: Bound
  targetCluster: aks-member-1
status:
  conditions:
  - lastTransitionTime: "2023-11-06T09:53:11Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkSynced
    status: "True"
    type: Bound
  - lastTransitionTime: "2023-11-10T08:23:38Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkHasBeenApplied
    status: "True"
    type: Applied

ClusterResourceBinding can have three states:

  • Scheduled: It indicates that the scheduler has selected this cluster for placing the resources. The resource is waiting to be picked up by the rollout controller.
  • Bound: It indicates that the rollout controller has initiated the placement of resources on the target cluster. The resources are actively being deployed.
  • Unscheduled: This state signifies that the target cluster is no longer selected by the scheduler for the placement. The resources associated with this cluster are in the process of being removed; they are awaiting deletion from the cluster.

The scheduler operates by generating scheduling decisions through the creation of new bindings in the “scheduled” state and the removal of existing bindings by marking them as “unscheduled”. A separate rollout controller is responsible for executing these decisions based on the defined rollout strategy.

Enforcing the semantics of “IgnoreDuringExecutionTime”

The ClusterResourcePlacement enforces the semantics of “IgnoreDuringExecutionTime” to prioritize the stability of resources running in production. Therefore, the resources should not be moved or rescheduled without explicit changes to the scheduling policy.

Here are some high-level guidelines outlining the actions that trigger scheduling and corresponding behavior:

  1. Policy changes trigger scheduling:

    • The scheduler makes the placement decisions based on the latest ClusterSchedulingPolicySnapshot.
    • When it’s just a scale-out operation (the NumberOfClusters of the pickN mode is increased), the ClusterResourcePlacement controller updates the label of the existing ClusterSchedulingPolicySnapshot instead of creating a new one, so that the scheduler won’t move any existing resources that are already scheduled and will just fulfill the new requirement.
  2. The following cluster changes trigger scheduling:

    • a cluster, originally ineligible for resource placement for some reason, becomes eligible, for example:
      • the cluster setting changes, specifically the MemberCluster labels have changed
      • an unexpected issue that originally led the scheduler to discard the cluster (for example, agents not joining or networking problems) has been resolved
    • a cluster, originally eligible for resource placement, is leaving the fleet and becomes ineligible

    Note: The scheduler is only going to place the resources on the new cluster and won’t touch the existing clusters.

  3. Resource-only changes do not trigger scheduling, including:

    • ResourceSelectors is updated in the ClusterResourcePlacement spec.
    • The selected resources are updated without directly affecting the ClusterResourcePlacement.

What’s next

5 - Scheduling Framework

Concept about the Fleet scheduling framework

The fleet scheduling framework closely aligns with the native Kubernetes scheduling framework, incorporating several modifications and tailored functionalities.

The primary advantage of this framework lies in its capability to compile plugins directly into the scheduler. Its API facilitates the implementation of diverse scheduling features as plugins, thereby ensuring a lightweight and maintainable core.

The fleet scheduler integrates the following fundamental built-in plugins:

  • Topology Spread Plugin: Supports the TopologySpreadConstraints stipulated in the placement policy.
  • Cluster Affinity Plugin: Facilitates the Affinity clause of the placement policy.
  • Same Placement Affinity Plugin: Uniquely designed for the fleet, preventing multiple replicas (selected resources) from being placed within the same cluster. This distinguishes it from Kubernetes, which allows multiple pods on a node.
  • Cluster Eligibility Plugin: Enables cluster selection based on specific status criteria.
  • Taint & Toleration Plugin: Enables cluster selection based on taints on the cluster & tolerations on the ClusterResourcePlacement.

Compared to the Kubernetes scheduling framework, the fleet framework introduces additional stages for the pickN placement type:

  • Batch & PostBatch:
    • Batch: Defines the batch size based on the desired and current ClusterResourceBinding.
    • PostBatch: Adjusts the batch size as necessary, unlike the Kubernetes scheduler, which schedules pods individually (batch size = 1).
  • Sort:
    • Fleet’s sorting mechanism selects a number of clusters, whereas Kubernetes’ scheduler prioritizes nodes with the highest scores.

To streamline the scheduling framework, certain stages, such as permit and reserve, have been omitted due to the absence of corresponding plugins or APIs enabling customers to reserve or permit clusters for specific placements. However, the framework remains designed for easy extension in the future to accommodate these functionalities.

In-tree plugins

The scheduler includes default plugins, each associated with one or more of the PostBatch, Filter, and Score extension points:

  • Cluster Affinity
  • Same Placement Anti-affinity
  • Topology Spread Constraints
  • Cluster Eligibility
  • Taint & Toleration

The Cluster Affinity Plugin serves as an illustrative example and operates within the following extension points:

  1. PreFilter: Verifies whether the policy contains any required cluster affinity terms. If absent, the plugin bypasses the subsequent Filter stage.
  2. Filter: Filters out clusters that fail to meet the specified required cluster affinity terms outlined in the policy.
  3. PreScore: Determines if the policy includes any preferred cluster affinity terms. If none are found, this plugin will be skipped during the Score stage.
  4. Score: Assigns affinity scores to clusters based on compliance with the preferred cluster affinity terms stipulated in the policy.

6 - Properties and Property Providers

Concept about cluster properties and property providers

This document explains the concepts of property provider and cluster properties in Fleet.

Fleet allows developers to implement a property provider to expose arbitrary properties about a member cluster, such as its node count and available resources for workload placement. Platforms could also enable their property providers to expose platform-specific properties via Fleet. These properties can be useful in a variety of cases: for example, administrators could monitor the health of a member cluster using related properties; Fleet also supports making scheduling decisions based on the property data.

Property provider

A property provider implements Fleet’s property provider interface:

// PropertyProvider is the interface that every property provider must implement.
type PropertyProvider interface {
	// Collect is called periodically by the Fleet member agent to collect properties.
	//
	// Note that this call should complete promptly. Fleet member agent will cancel the
	// context if the call does not complete in time.
	Collect(ctx context.Context) PropertyCollectionResponse
	// Start is called when the Fleet member agent starts up to initialize the property provider.
	// This call should not block.
	//
	// Note that Fleet member agent will cancel the context when it exits.
	Start(ctx context.Context, config *rest.Config) error
}

For the details, see the Fleet source code.

A property provider should be shipped as a part of the Fleet member agent and run alongside it. Refer to the Fleet source code for specifics on how to set it up with the Fleet member agent. At this moment, only one property provider can be set up with the Fleet member agent at a time. Once connected, the Fleet member agent will attempt to start it when the agent itself initializes; the agent will then start collecting properties from the property provider periodically.

A property provider can expose two types of properties: resource properties, and non-resource properties. To learn about the two types, see the section below. In addition, the provider can choose to report its status, such as any errors encountered when preparing the properties, in the form of Kubernetes conditions.

The Fleet member agent can run with or without a property provider. If a provider is not set up, or the given provider fails to start properly, the agent will collect limited properties about the cluster on its own, specifically the node count, plus the total/allocatable CPU and memory capacities of the host member cluster.

Cluster properties

A cluster property is an attribute of a member cluster. There are two types of properties:

  • Resource property: the usage information of a resource in a member cluster; the name of the resource should be in the format of a Kubernetes label key, such as cpu and memory, and the usage information should consist of:

    • the total capacity of the resource, which is the amount of the resource installed in the cluster;
    • the allocatable capacity of the resource, which is the maximum amount of the resource that can be used for running user workloads, as some amount of the resource might be reserved by the OS, kubelet, etc.;
    • the available capacity of the resource, which is the amount of the resource that is currently free for running user workloads.

    Note that you may report a virtual resource via the property provider, if applicable.

  • Non-resource property: a metric about a member cluster, in the form of a key/value pair; the key should be in the format of a Kubernetes label key, such as kubernetes-fleet.io/node-count, and the value at this moment should be a sortable numeric that can be parsed as a Kubernetes quantity.

Eventually, all cluster properties are exposed via the Fleet MemberCluster API, with the non-resource properties in the .status.properties field and the resource properties in the .status.resourceUsage field:

apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata: ...
spec: ...
status:
  agentStatus: ...
  conditions: ...
  properties:
    kubernetes-fleet.io/node-count:
      observationTime: "2024-04-30T14:54:24Z"
      value: "2"
    ...
  resourceUsage:
    allocatable:
      cpu: 32
      memory: "16Gi"
    available:
      cpu: 2
      memory: "800Mi"
    capacity:
      cpu: 40
      memory: "20Gi"

Note that conditions reported by the property provider (if any) are available in the .status.conditions array as well.

Core properties

The following properties are considered core properties in Fleet, which should be supported in all property provider implementations. Fleet agents will collect them even when no property provider has been set up.

  • kubernetes-fleet.io/node-count (non-resource property): the number of nodes in a cluster.
  • cpu (resource property): the usage information (total, allocatable, and available capacity) of the CPU resource in a cluster.
  • memory (resource property): the usage information (total, allocatable, and available capacity) of the memory resource in a cluster.

7 - Safe Rollout

Concept about rolling out changes safely in Fleet

One of the most important features of Fleet is the ability to roll out changes safely across multiple clusters. We do this by rolling out the changes in a controlled manner, ensuring that we only continue to propagate the changes to the next target clusters if the resources are successfully applied to the previous target clusters.

Overview

We automatically propagate any resource changes that are selected by a ClusterResourcePlacement from the hub cluster to the target clusters based on the placement policy defined in the ClusterResourcePlacement. In order to reduce the blast radius of such operations, we provide users a way to safely roll out new changes so that a bad release won’t affect all the running instances at once.

Rollout Strategy

The RollingUpdate rollout strategy updates the resources in the selected target clusters gradually based on the maxUnavailable and maxSurge settings.

In place update policy

We always try to do an in-place update, respecting the rollout strategy, if there is no change in the placement. This avoids unnecessary interruptions to the running workloads when there are only resource changes. For example, if you only change the image tag of the deployment in the namespace you want to place, we will do an in-place update on the deployments already placed on the targeted clusters instead of moving the existing deployments to other clusters, even if the labels or properties of the current clusters are no longer the best match for the current placement policy.

How To Use RollingUpdateConfig

RollingUpdateConfig is used to control the behavior of the rolling update strategy.

MaxUnavailable and MaxSurge

MaxUnavailable specifies the maximum number of clusters, out of the target number of clusters specified in the ClusterResourcePlacement policy, in which resources propagated by the ClusterResourcePlacement can be unavailable. The minimum value for MaxUnavailable is 1, to avoid a stuck rollout during an in-place resource update.

MaxSurge specifies the maximum number of clusters that can be scheduled with resources above the target number of clusters specified in ClusterResourcePlacement policy.

Note: MaxSurge only applies to rollouts to newly scheduled clusters, and doesn’t apply to rollouts of workloads triggered by updates to already propagated resources. For updates to already propagated resources, we always try to do the updates in place with no surge.

The target number of clusters changes based on the ClusterResourcePlacement policy:

  • For PickAll, it’s the number of clusters picked by the scheduler.
  • For PickN, it’s the number of clusters specified in the ClusterResourcePlacement policy.
  • For PickFixed, it’s the length of the list of cluster names specified in the ClusterResourcePlacement policy.

Example 1:

Consider a fleet with 4 connected member clusters (cluster-1, cluster-2, cluster-3 & cluster-4) where every member cluster has the label env: prod. The hub cluster has a namespace called test-ns with a deployment in it.

The ClusterResourcePlacement spec is defined as follows:

spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1
      name: test-ns
  policy:
    placementType: PickN
    numberOfClusters: 3
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  env: prod
  strategy:
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

The rollout will be as follows:

  • We try to pick 3 clusters out of 4; for this scenario, let’s say we pick cluster-1, cluster-2 & cluster-3.

  • Since we can’t track the initial availability for the deployment, we roll out the namespace with the deployment to cluster-1, cluster-2 & cluster-3.

  • Then we update the deployment with a bad image name to update the resource in place on cluster-1, cluster-2 & cluster-3.

  • But since we have maxUnavailable set to 1, we will roll out the bad image name update for the deployment to one of the clusters first (which cluster the resource is rolled out to first is non-deterministic).

  • Once the deployment is updated on the first cluster, we will wait for the deployment’s availability to be true before rolling out to the other clusters.

  • And since we rolled out a bad image name for the deployment, its availability will always be false, and hence the rollout to the other two clusters will be stuck.

  • Users might expect maxSurge of 1 to be utilized here, but in this case, since we are updating the resource in place, maxSurge will not be utilized to surge and pick cluster-4.

Note: maxSurge will be utilized to pick cluster-4 if we change the policy to pick 4 clusters or change the placement type to PickAll.

Example 2:

Consider a fleet with 4 connected member clusters (cluster-1, cluster-2, cluster-3 & cluster-4) where,

  • cluster-1 and cluster-2 have the label loc: west
  • cluster-3 and cluster-4 have the label loc: east

The hub cluster has a namespace called test-ns with a deployment in it.

Initially, the ClusterResourcePlacement spec is defined as follows:

spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1          
      name: test-ns
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
              - labelSelector:
                  matchLabels:
                    loc: west
  strategy:
    rollingUpdate:
      maxSurge: 2

The rollout will be as follows:

  • We try to pick clusters (cluster-1 and cluster-2) by specifying the label selector loc: west.
  • Since we can’t track the initial availability for the deployment, we roll out the namespace with the deployment to cluster-1 and cluster-2 and wait until they become available.

Then we update the ClusterResourcePlacement spec to the following:

spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      version: v1          
      name: test-ns
  policy:
    placementType: PickN
    numberOfClusters: 2
    affinity:
      clusterAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
              - labelSelector:
                  matchLabels:
                    loc: east
  strategy:
    rollingUpdate:
      maxSurge: 2

The rollout will be as follows:

  • We try to pick clusters (cluster-3 and cluster-4) by specifying the label selector loc: east.
  • This time, since we have maxSurge set to 2, we can propagate resources to a maximum of 4 clusters while our specified target number of clusters is 2, so we roll out the namespace with the deployment to both cluster-3 and cluster-4 before removing the deployment from cluster-1 and cluster-2.
  • And since maxUnavailable is always set to 25% by default, which is rounded off to 1, we will remove the resource from one of the existing clusters (cluster-1 or cluster-2), because when maxUnavailable is 1 the policy mandates that at least one cluster remains available.

UnavailablePeriodSeconds

UnavailablePeriodSeconds is used to configure the waiting time between rollout phases when we cannot determine whether the resources have rolled out successfully or not. This field is used only if the availability of the resources we propagate is not trackable. Refer to the Data only objects section for more details.
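
For example, if a placement mostly carries objects whose availability is not trackable, you can lengthen the wait between rollout phases. This sketch reuses the rollingUpdate fields shown in the earlier ClusterResourcePlacement example; the value chosen is illustrative only.

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      # Wait 60 seconds between rollout phases because the availability of the
      # propagated resources cannot be tracked directly.
      unavailablePeriodSeconds: 60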

Availability based Rollout

We have built-in mechanisms to determine the availability of some common Kubernetes native resources. We only mark them as available in the target clusters when they meet the criteria we defined.

How It Works

We have an agent running in the target cluster to check the status of the resources. We have specific criteria for each of the following resources to determine whether they are available. Here is the list of resources we support:

Deployment

We only mark a Deployment as available when all its pods are running, ready and updated according to the latest spec.

DaemonSet

We only mark a DaemonSet as available when all its pods are available and updated according to the latest spec on all desired scheduled nodes.

StatefulSet

We only mark a StatefulSet as available when all its pods are running, ready and updated according to the latest revision.

Job

We only mark a Job as available when it has at least one succeeded pod or one ready pod.

Service

For a Service, availability is determined based on the service type as follows:

  • For ClusterIP & NodePort service, we mark it as available when a cluster IP is assigned.
  • For LoadBalancer service, we mark it as available when a LoadBalancerIngress has been assigned along with an IP or Hostname.
  • For an ExternalName service, checking availability is not supported, so it will be marked as available with the “not trackable” reason.

Data only objects

The objects described below are data resources, so we mark them as available immediately after creation:

  • Namespace
  • Secret
  • ConfigMap
  • Role
  • ClusterRole
  • RoleBinding
  • ClusterRoleBinding

8 - Override

Concept about the override APIs

Overview

The ClusterResourceOverride and ResourceOverride APIs provide a way to customize resource configurations before they are propagated to the target cluster by the ClusterResourcePlacement.

Difference Between ClusterResourceOverride And ResourceOverride

ClusterResourceOverride is a cluster-wide policy that overrides cluster-scoped resources for one or more clusters, while ResourceOverride is a namespace-wide policy that applies to resources within its namespace.

Note: If a namespace is selected by the ClusterResourceOverride, ALL the resources under the namespace are selected automatically.

If a resource is selected by both a ClusterResourceOverride and a ResourceOverride, the ResourceOverride wins when resolving conflicts.

When To Use Override

Overrides are useful when you want to customize resources before they are propagated from the hub cluster to the target clusters. Some example use cases are:

  • As a platform operator, I want to propagate a clusterRoleBinding to cluster-us-east and cluster-us-west and would like to grant the same role to different groups in each cluster.
  • As a platform operator, I want to propagate a clusterRole to cluster-staging and cluster-production and would like to grant more permissions to the cluster-staging cluster than the cluster-production cluster.
  • As a platform operator, I want to propagate a namespace to all the clusters and would like to customize the labels for each cluster.
  • As an application developer, I would like to propagate a deployment to cluster-staging and cluster-production and would like to always use the latest image in the staging cluster and a specific image in the production cluster.
  • As an application developer, I would like to propagate a deployment to all the clusters and would like to use different commands for my container in different regions.

Limits

  • Each resource can only be selected by one override at a time. In the case of namespace-scoped resources, up to two overrides are allowed, considering the potential selection through both a ClusterResourceOverride (selecting its namespace) and a ResourceOverride.
  • At most 100 ClusterResourceOverride can be created.
  • At most 100 ResourceOverride can be created.

Placement

This specifies which placement the override should be applied to.

Resource Selector

ClusterResourceSelector of ClusterResourceOverride selects which cluster-scoped resources need to be overridden before applying to the selected clusters.

It supports the following forms of resource selection:

  • Select resources by specifying the <group, version, kind> and name. This selects only the one resource that matches the <group, version, kind> and name.

Note: Label selector of ClusterResourceSelector is not supported.

ResourceSelector of ResourceOverride selects which namespace-scoped resources need to be overridden before applying to the selected clusters.

It supports the following forms of resource selection:

  • Select resources by specifying the <group, version, kind> and name. This selects only the one resource that matches the <group, version, kind> and name within the ResourceOverride's namespace.

Override Policy

Override policy defines how to override the selected resources on the target clusters.

It contains an array of override rules whose order determines the override order: when two rules select the same fields on the target cluster, the last one wins.

Each override rule contains the following fields:

  • ClusterSelector: which cluster(s) the override rule applies to. It supports the following forms of cluster selection:
    • Select clusters by specifying the cluster labels.
    • An empty selector selects ALL the clusters.
    • A nil selector selects NO target cluster.

    IMPORTANT: Only labelSelector is supported in the clusterSelectorTerms field.

  • OverrideType: the type of override to apply to the selected resources. The default type is JSONPatch.
    • JSONPatch: applies JSON patches to the selected resources following RFC 6902.
    • Delete: deletes the selected resources on the target cluster.
  • JSONPatchOverrides: a list of JSON patch override rules applied to the selected resources following RFC 6902 when the override type is JSONPatch.

Note: Updating fields in the TypeMeta (e.g., apiVersion, kind) is not allowed.

Note: Updating fields in the ObjectMeta (e.g., name, namespace) is not allowed, with the exception of annotations and labels.

Note: Updating fields under status is not allowed.
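
As an illustration of the Delete type, the sketch below removes the selected ClusterRole from clusters labeled env: test. It reuses the crp-example placement and secret-reader ClusterRole from elsewhere in this document; the name example-cro-delete and the env: test label are hypothetical.

apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
  name: example-cro-delete # hypothetical name for illustration
spec:
  placement:
    name: crp-example
  clusterResourceSelectors:
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      version: v1
      name: secret-reader
  policy:
    overrideRules:
      - clusterSelector:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  env: test # hypothetical label; selects the clusters to delete from
        overrideType: Delete # delete the ClusterRole on the selected clusters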

Reserved Variables in the JSON Patch Override Value

The following reserved variables can be used in the value of a JSON patch override rule; they are replaced with the actual values when the override is applied:

  • ${MEMBER-CLUSTER-NAME}: this will be replaced by the name of the memberCluster that represents this cluster.

For example, to add a label to the ClusterRole named secret-reader on clusters with the label env: prod, you can use the following configuration:

apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
  name: example-cro
spec:
  placement:
    name: crp-example
  clusterResourceSelectors:
    - group: rbac.authorization.k8s.io
      kind: ClusterRole
      version: v1
      name: secret-reader
  policy:
    overrideRules:
      - clusterSelector:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  env: prod
        jsonPatchOverrides:
          - op: add
            path: /metadata/labels
            value:
              {"cluster-name":"${MEMBER-CLUSTER-NAME}"}

The ClusterResourceOverride object above will add a label cluster-name with the value of the memberCluster name to the ClusterRole named secret-reader on clusters with the label env: prod.

When To Trigger Rollout

Each override change is captured as a snapshot, in the form of a ClusterResourceOverrideSnapshot or a ResourceOverrideSnapshot. The snapshot is used to determine whether the override change should be applied to an existing ClusterResourcePlacement. If applicable, the new resources start rolling out to the target clusters, respecting the rollout strategy defined in the ClusterResourcePlacement.

Examples

Add annotations to the configmap by using ClusterResourceOverride

Suppose we create a configmap named app-config-1 under the namespace application-1 in the hub cluster, and we want to add an annotation to it on all the member clusters.

apiVersion: v1
data:
  data: test
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-07T08:06:27Z"
  name: app-config-1
  namespace: application-1
  resourceVersion: "1434"
  uid: b4109de8-32f2-4ac8-9e1a-9cb715b3261d

Create a ClusterResourceOverride named cro-1 to add an annotation to the namespace application-1. Because the namespace is selected, all resources under it (including app-config-1) are selected automatically and receive the annotation.

apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
  creationTimestamp: "2024-05-07T08:06:27Z"
  finalizers:
    - kubernetes-fleet.io/override-cleanup
  generation: 1
  name: cro-1
  resourceVersion: "1436"
  uid: 32237804-7eb2-4d5f-9996-ff4d8ce778e7
spec:
  placement:
    name: crp-example
  clusterResourceSelectors:
    - group: ""
      kind: Namespace
      name: application-1
      version: v1
  policy:
    overrideRules:
      - clusterSelector:
          clusterSelectorTerms: []
        jsonPatchOverrides:
          - op: add
            path: /metadata/annotations
            value:
              cro-test-annotation: cro-test-annotation-val

Check the configmap on one of the member clusters by running the kubectl get configmap app-config-1 -n application-1 -o yaml command:

apiVersion: v1
data:
  data: test
kind: ConfigMap
metadata:
  annotations:
    cro-test-annotation: cro-test-annotation-val
    kubernetes-fleet.io/last-applied-configuration: '{"apiVersion":"v1","data":{"data":"test"},"kind":"ConfigMap","metadata":{"annotations":{"cro-test-annotation":"cro-test-annotation-val","kubernetes-fleet.io/spec-hash":"4dd5a08aed74884de455b03d3b9c48be8278a61841f3b219eca9ed5e8a0af472"},"name":"app-config-1","namespace":"application-1","ownerReferences":[{"apiVersion":"placement.kubernetes-fleet.io/v1beta1","blockOwnerDeletion":false,"kind":"AppliedWork","name":"crp-1-work","uid":"77d804f5-f2f1-440e-8d7e-e9abddacb80c"}]}}'
    kubernetes-fleet.io/spec-hash: 4dd5a08aed74884de455b03d3b9c48be8278a61841f3b219eca9ed5e8a0af472
  creationTimestamp: "2024-05-07T08:06:27Z"
  name: app-config-1
  namespace: application-1
  ownerReferences:
  - apiVersion: placement.kubernetes-fleet.io/v1beta1
    blockOwnerDeletion: false
    kind: AppliedWork
    name: crp-1-work
    uid: 77d804f5-f2f1-440e-8d7e-e9abddacb80c
  resourceVersion: "1449"
  uid: a8601007-1e6b-4b64-bc05-1057ea6bd21b

Add annotations to the configmap by using ResourceOverride

You can use a ResourceOverride to add an annotation explicitly to the configmap app-config-1 in the namespace application-1.

apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ResourceOverride
metadata:
  creationTimestamp: "2024-05-07T08:25:31Z"
  finalizers:
  - kubernetes-fleet.io/override-cleanup
  generation: 1
  name: ro-1
  namespace: application-1
  resourceVersion: "3859"
  uid: b4117925-bc3c-438d-a4f6-067bc4577364
spec:
  placement:
    name: crp-example
  policy:
    overrideRules:
    - clusterSelector:
        clusterSelectorTerms: []
      jsonPatchOverrides:
      - op: add
        path: /metadata/annotations
        value:
          ro-test-annotation: ro-test-annotation-val
  resourceSelectors:
  - group: ""
    kind: ConfigMap
    name: app-config-1
    version: v1

How To Validate If Overrides Are Applied

You can validate whether the overrides are applied by checking the ClusterResourcePlacement status, for example with kubectl get crp crp-example -o yaml. The status output indicates both the placement conditions and the individual placement status of each member cluster that was overridden.

Sample output:

status:
  conditions:
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: found all the clusters needed as specified by the scheduling policy
    observedGeneration: 1
    reason: SchedulingPolicyFulfilled
    status: "True"
    type: ClusterResourcePlacementScheduled
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: All 3 cluster(s) start rolling out the latest resource
    observedGeneration: 1
    reason: RolloutStarted
    status: "True"
    type: ClusterResourcePlacementRolloutStarted
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: The selected resources are successfully overridden in the 3 clusters
    observedGeneration: 1
    reason: OverriddenSucceeded
    status: "True"
    type: ClusterResourcePlacementOverridden
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: Works(s) are succcesfully created or updated in the 3 target clusters'
      namespaces
    observedGeneration: 1
    reason: WorkSynchronized
    status: "True"
    type: ClusterResourcePlacementWorkSynchronized
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: The selected resources are successfully applied to 3 clusters
    observedGeneration: 1
    reason: ApplySucceeded
    status: "True"
    type: ClusterResourcePlacementApplied
  - lastTransitionTime: "2024-05-07T08:06:27Z"
    message: The selected resources in 3 cluster are available now
    observedGeneration: 1
    reason: ResourceAvailable
    status: "True"
    type: ClusterResourcePlacementAvailable
  observedResourceIndex: "0"
  placementStatuses:
  - applicableClusterResourceOverrides:
    - cro-1-0
    clusterName: kind-cluster-1
    conditions:
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
        score: 0, topology spread score: 0): picked by scheduling policy'
      observedGeneration: 1
      reason: Scheduled
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: Detected the new changes on the resources and started the rollout process
      observedGeneration: 1
      reason: RolloutStarted
      status: "True"
      type: RolloutStarted
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: Successfully applied the override rules on the resources
      observedGeneration: 1
      reason: OverriddenSucceeded
      status: "True"
      type: Overridden
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: All of the works are synchronized to the latest
      observedGeneration: 1
      reason: AllWorkSynced
      status: "True"
      type: WorkSynchronized
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: All corresponding work objects are applied
      observedGeneration: 1
      reason: AllWorkHaveBeenApplied
      status: "True"
      type: Applied
    - lastTransitionTime: "2024-05-07T08:06:27Z"
      message: The availability of work object crp-1-work is not trackable
      observedGeneration: 1
      reason: WorkNotTrackable
      status: "True"
      type: Available
...

applicableClusterResourceOverrides in placementStatuses indicates which ClusterResourceOverrideSnapshot is applied to the target cluster. Similarly, applicableResourceOverrides is set if a ResourceOverrideSnapshot is applied.

9 - Staged Update

Concept about Staged Update

While users rely on the RollingUpdate rollout strategy to safely roll out their workloads, there is also a requirement for a staged rollout mechanism at the cluster level to enable more controlled and systematic continuous delivery (CD) across the fleet. Introducing a staged update run feature would address this need by enabling gradual deployments, reducing risk, and ensuring greater reliability and consistency in workload updates across clusters.

Overview

We introduce two new Custom Resources, ClusterStagedUpdateStrategy and ClusterStagedUpdateRun.

ClusterStagedUpdateStrategy defines a reusable orchestration pattern that organizes member clusters into distinct stages, controlling both the rollout sequence within each stage and incorporating post-stage validation tasks that must succeed before proceeding to subsequent stages. For brevity, we’ll refer to ClusterStagedUpdateStrategy as updateRun strategy throughout this document.

ClusterStagedUpdateRun orchestrates resource deployment across clusters by executing a ClusterStagedUpdateStrategy. It requires three key inputs: the target ClusterResourcePlacement name, a resource snapshot index specifying the version to deploy, and the strategy name that defines the rollout rules. The term updateRun will be used to represent ClusterStagedUpdateRun in this document.

Specify Rollout Strategy for ClusterResourcePlacement

While ClusterResourcePlacement uses RollingUpdate as its default strategy, switching to staged updates requires setting the rollout strategy to External:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: example-placement
spec:
  resourceSelectors:
    - group: ""
      kind: Namespace
      name: test-namespace
      version: v1
  policy:
    placementType: PickAll
    tolerations:
      - key: gpu-workload
        operator: Exists
  strategy:
    type: External # specify External here to use the stagedUpdateRun strategy.

Deploy a ClusterStagedUpdateStrategy

The ClusterStagedUpdateStrategy custom resource enables users to organize member clusters into stages and define their rollout sequence. This strategy is reusable across multiple updateRuns, with each updateRun creating an immutable snapshot of the strategy at startup. This ensures that modifications to the strategy do not impact any in-progress updateRun executions.

An example ClusterStagedUpdateStrategy looks like below:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateStrategy
metadata:
  name: example-strategy
spec:
  stages:
    - name: staging
      labelSelector:
        matchLabels:
          environment: staging
      afterStageTasks:
        - type: TimedWait
          waitTime: 1h
    - name: canary
      labelSelector:
        matchLabels:
          environment: canary
      afterStageTasks:
        - type: Approval
    - name: production
      labelSelector:
        matchLabels:
          environment: production
      sortingLabelKey: order
      afterStageTasks:
        - type: Approval
        - type: TimedWait
          waitTime: 1h

ClusterStagedUpdateStrategy is a cluster-scoped resource. Its spec contains a list of stageConfig entries defining the configuration for each stage. Stages execute sequentially in the order specified. Each stage must have a unique name and uses a labelSelector to identify member clusters for update. In the above example, we define 3 stages: staging selects member clusters labeled with environment: staging, canary selects member clusters labeled with environment: canary, and production selects member clusters labeled with environment: production.

Each stage can optionally specify sortingLabelKey and afterStageTasks. sortingLabelKey defines a label whose integer value determines the update sequence within a stage. In the above example, assuming there are 3 clusters selected in the production stage (all 3 clusters carry the environment: production label), the fleet admin can label them with order: 1, order: 2, and order: 3 respectively to control the rollout sequence. Without sortingLabelKey, clusters are updated in alphabetical order by name, as shown below.
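
For instance, assuming the three production clusters are registered as MemberCluster objects named member-4, member-5, and member-6 (hypothetical names), the admin could set the ordering labels on the hub cluster like this:

# label the MemberCluster objects so member-4 is updated first, then member-5, then member-6
kubectl label membercluster member-4 order=1
kubectl label membercluster member-5 order=2
kubectl label membercluster member-6 order=3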

By default, the next stage begins immediately after the current stage completes. A user can control this cross-stage behavior by specifying afterStageTasks in each stage. These tasks execute after all clusters in a stage have updated successfully. We currently support two types of tasks: Approval and TimedWait. Each stage can include one task of each type (a maximum of two tasks). All specified tasks must be satisfied before advancing to the next stage.

The TimedWait task requires a specified waitTime duration. The updateRun waits for that duration to pass before executing the next stage. For the Approval task, the controller automatically generates a ClusterApprovalRequest object named <updateRun name>-<stage name>. The name is also shown in the updateRun status. The ClusterApprovalRequest object is pretty simple:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterApprovalRequest
metadata:
  name: example-run-canary
  labels:
    kubernetes-fleet.io/targetupdaterun: example-run
    kubernetes-fleet.io/targetUpdatingStage: canary
    kubernetes-fleet.io/isLatestUpdateRunApproval: "true"
spec:
  parentStageRollout: example-run
  targetStage: canary

The user then needs to manually approve the task by patching its status:

kubectl patch clusterapprovalrequests example-run-canary --type='merge' -p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}}' --subresource=status

The updateRun only continues to the next stage after the ClusterApprovalRequest is approved.

Trigger rollout with ClusterStagedUpdateRun

When using External rollout strategy, a ClusterResourcePlacement begins deployment only when triggered by a ClusterStagedUpdateRun. An example ClusterStagedUpdateRun is shown below:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
  name: example-run
spec:
  placementName: example-placement
  resourceSnapshotIndex: "0"
  stagedRolloutStrategyName: example-strategy

This cluster-scoped resource requires three key parameters: the placementName specifying the target ClusterResourcePlacement, the resourceSnapshotIndex identifying which version of resources to deploy (learn how to find resourceSnapshotIndex here), and the stagedRolloutStrategyName indicating the ClusterStagedUpdateStrategy to follow.
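
One way to inspect the available resource snapshot indexes (a rough sketch, assuming the hub cluster exposes the captured versions as ClusterResourceSnapshot objects with the index recorded in their labels) is to list them with their labels:

kubectl get clusterresourcesnapshots --show-labels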

An updateRun executes in two phases. During the initialization phase, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted ClusterResourceBindings, generates the cluster update sequence, and records all this information in the updateRun status.

In the execution phase, the controller processes each stage sequentially, updates clusters within each stage one at a time, and enforces completion of after-stage tasks. It then executes a final delete stage to clean up resources from unscheduled clusters. The updateRun succeeds when all stages complete successfully. However, it fails if any execution-affecting event occurs, for example, the target ClusterResourcePlacement being deleted, or member cluster changes triggering new scheduling. In such cases, error details are recorded in the updateRun status. Remember that once initialized, an updateRun operates on its strategy snapshot, making it immune to subsequent strategy modifications.

Understand ClusterStagedUpdateRun status

Let’s take a deep look into the status of a completed ClusterStagedUpdateRun. It displays details about the rollout status of every cluster and stage.

$ kubectl describe csur example-run
...
Status:
  Conditions:
    Last Transition Time:  2025-03-12T23:21:39Z
    Message:               ClusterStagedUpdateRun initialized successfully
    Observed Generation:   1
    Reason:                UpdateRunInitializedSuccessfully
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2025-03-12T23:21:39Z
    Message:               
    Observed Generation:   1
    Reason:                UpdateRunStarted
    Status:                True
    Type:                  Progressing
    Last Transition Time:  2025-03-12T23:26:15Z
    Message:               
    Observed Generation:   1
    Reason:                UpdateRunSucceeded
    Status:                True
    Type:                  Succeeded
  Deletion Stage Status:
    Clusters:
    Conditions:
      Last Transition Time:       2025-03-12T23:26:15Z
      Message:                    
      Observed Generation:        1
      Reason:                     StageUpdatingStarted
      Status:                     True
      Type:                       Progressing
      Last Transition Time:       2025-03-12T23:26:15Z
      Message:                    
      Observed Generation:        1
      Reason:                     StageUpdatingSucceeded
      Status:                     True
      Type:                       Succeeded
    End Time:                     2025-03-12T23:26:15Z
    Stage Name:                   kubernetes-fleet.io/deleteStage
    Start Time:                   2025-03-12T23:26:15Z
  Policy Observed Cluster Count:  2
  Policy Snapshot Index Used:     0
  Staged Update Strategy Snapshot:
    Stages:
      After Stage Tasks:
        Type:       Approval
        Wait Time:  0s
        Type:       TimedWait
        Wait Time:  1m0s
      Label Selector:
        Match Labels:
          Environment:  staging
      Name:             staging
      After Stage Tasks:
        Type:       Approval
        Wait Time:  0s
      Label Selector:
        Match Labels:
          Environment:    canary
      Name:               canary
      Sorting Label Key:  name
      After Stage Tasks:
        Type:       TimedWait
        Wait Time:  1m0s
        Type:       Approval
        Wait Time:  0s
      Label Selector:
        Match Labels:
          Environment:    production
      Name:               production
      Sorting Label Key:  order
  Stages Status:
    After Stage Task Status:
      Approval Request Name:  example-run-staging
      Conditions:
        Last Transition Time:  2025-03-12T23:21:54Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestCreated
        Status:                True
        Type:                  ApprovalRequestCreated
        Last Transition Time:  2025-03-12T23:22:55Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestApproved
        Status:                True
        Type:                  ApprovalRequestApproved
      Type:                    Approval
      Conditions:
        Last Transition Time:  2025-03-12T23:22:54Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskWaitTimeElapsed
        Status:                True
        Type:                  WaitTimeElapsed
      Type:                    TimedWait
    Clusters:
      Cluster Name:  member1
      Conditions:
        Last Transition Time:  2025-03-12T23:21:39Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingStarted
        Status:                True
        Type:                  Started
        Last Transition Time:  2025-03-12T23:21:54Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingSucceeded
        Status:                True
        Type:                  Succeeded
    Conditions:
      Last Transition Time:  2025-03-12T23:21:54Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingWaiting
      Status:                False
      Type:                  Progressing
      Last Transition Time:  2025-03-12T23:22:55Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingSucceeded
      Status:                True
      Type:                  Succeeded
    End Time:                2025-03-12T23:22:55Z
    Stage Name:              staging
    Start Time:              2025-03-12T23:21:39Z
    After Stage Task Status:
      Approval Request Name:  example-run-canary
      Conditions:
        Last Transition Time:  2025-03-12T23:23:10Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestCreated
        Status:                True
        Type:                  ApprovalRequestCreated
        Last Transition Time:  2025-03-12T23:25:15Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestApproved
        Status:                True
        Type:                  ApprovalRequestApproved
      Type:                    Approval
    Clusters:
      Cluster Name:  member2
      Conditions:
        Last Transition Time:  2025-03-12T23:22:55Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingStarted
        Status:                True
        Type:                  Started
        Last Transition Time:  2025-03-12T23:23:10Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingSucceeded
        Status:                True
        Type:                  Succeeded
    Conditions:
      Last Transition Time:  2025-03-12T23:23:10Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingWaiting
      Status:                False
      Type:                  Progressing
      Last Transition Time:  2025-03-12T23:25:15Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingSucceeded
      Status:                True
      Type:                  Succeeded
    End Time:                2025-03-12T23:25:15Z
    Stage Name:              canary
    Start Time:              2025-03-12T23:22:55Z
    After Stage Task Status:
      Conditions:
        Last Transition Time:  2025-03-12T23:26:15Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskWaitTimeElapsed
        Status:                True
        Type:                  WaitTimeElapsed
      Type:                    TimedWait
      Approval Request Name:   example-run-production
      Conditions:
        Last Transition Time:  2025-03-12T23:25:15Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestCreated
        Status:                True
        Type:                  ApprovalRequestCreated
        Last Transition Time:  2025-03-12T23:25:25Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestApproved
        Status:                True
        Type:                  ApprovalRequestApproved
      Type:                    Approval
    Clusters:
    Conditions:
      Last Transition Time:  2025-03-12T23:25:15Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingWaiting
      Status:                False
      Type:                  Progressing
      Last Transition Time:  2025-03-12T23:26:15Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingSucceeded
      Status:                True
      Type:                  Succeeded
    End Time:                2025-03-12T23:26:15Z
    Stage Name:              production
Events:                      <none>

UpdateRun overall status

At the very top, Status.Conditions gives the overall status of the updateRun. The execution of an updateRun consists of two phases: initialization and execution. During initialization, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted ClusterResourceBindings, generates the cluster update sequence, and records all this information in the updateRun status. The UpdateRunInitializedSuccessfully condition indicates the initialization was successful.

After initialization, the controller starts executing the updateRun. The UpdateRunStarted condition indicates the execution has started.

After all clusters are updated, all after-stage tasks are completed, and thus all stages are finished, the UpdateRunSucceeded condition is set to True, indicating the updateRun has succeeded.

Fields recorded in the updateRun status during initialization

During initialization, the controller records the following fields in the updateRun status:

  • PolicySnapshotIndexUsed: the index of the policy snapshot used for the updateRun; it should be the latest one.
  • PolicyObservedClusterCount: the number of clusters selected by the scheduling policy.
  • StagedUpdateStrategySnapshot: the snapshot of the updateRun strategy, which ensures any strategy changes will not affect executing updateRuns.

Stages and clusters status

The Stages Status section displays the status of each stage and cluster. As shown in the strategy snapshot, the updateRun has three stages: staging, canary, and production. During initialization, the controller generates the rollout plan, classifies the scheduled clusters into these three stages, and dumps the plan into the updateRun status. As the execution progresses, the controller updates the status of each stage and cluster. Take the staging stage as an example: member1 is included in this stage. The ClusterUpdatingStarted condition indicates the cluster is being updated, and the ClusterUpdatingSucceeded condition shows the cluster was updated successfully.

After all clusters in a stage are updated, the controller executes the specified after-stage tasks. Stage staging has two after-stage tasks: Approval and TimedWait. The Approval task requires the admin to manually approve a ClusterApprovalRequest generated by the controller. The name of the ClusterApprovalRequest is also included in the status, which is example-run-staging. The AfterStageTaskApprovalRequestCreated condition indicates the approval request was created, and the AfterStageTaskApprovalRequestApproved condition indicates the approval request has been approved. The TimedWait task suspends the rollout until the specified wait time has elapsed; in this case, the wait time is 1 minute. The AfterStageTaskWaitTimeElapsed condition indicates the wait time has elapsed and the rollout can proceed to the next stage.

Each stage also has its own conditions. When a stage starts, the Progressing condition is set to True. When all the cluster updates complete, the Progressing condition is set to False with reason StageUpdatingWaiting as shown above, meaning the stage is waiting for the after-stage tasks to pass. The lastTransitionTime of the Progressing condition therefore also serves as the start time of the wait when there is a TimedWait task. When all after-stage tasks pass, the Succeeded condition is set to True. Each stage status also has Start Time and End Time fields, making the timeline easier to read.

There’s also a Deletion Stage Status section, which displays the status of the deletion stage. The deletion stage is the last stage of the updateRun. It deletes resources from the unscheduled clusters. The status is pretty much the same as a normal update stage, except that there are no after-stage tasks.

Note that all these conditions have lastTransitionTime set to the time when the controller updates the status. It can help debug and check the progress of the updateRun.

Relationship between ClusterStagedUpdateRun and ClusterResourcePlacement

A ClusterStagedUpdateRun serves as the trigger mechanism for rolling out a ClusterResourcePlacement. The key points of this relationship are:

  • The ClusterResourcePlacement remains in a scheduled state without being deployed until a corresponding ClusterStagedUpdateRun is created.
  • During rollout, the ClusterResourcePlacement status is continuously updated with detailed information from each target cluster.
  • While a ClusterStagedUpdateRun only indicates whether updates have started and completed for each member cluster (as described in the previous section), the ClusterResourcePlacement provides comprehensive details including:
    • Success/failure of resource creation
    • Application of overrides
    • Specific error messages

For example, below is the status of an in-progress ClusterStagedUpdateRun:

kubectl describe csur example-run
Name:         example-run
...
Status:
  Conditions:
    Last Transition Time:  2025-03-17T21:37:14Z
    Message:               ClusterStagedUpdateRun initialized successfully
    Observed Generation:   1
    Reason:                UpdateRunInitializedSuccessfully
    Status:                True
    Type:                  Initialized
    Last Transition Time:  2025-03-17T21:37:14Z
    Message:               
    Observed Generation:   1
    Reason:                UpdateRunStarted # updateRun started
    Status:                True
    Type:                  Progressing
...
  Stages Status:
    After Stage Task Status:
      Approval Request Name:  example-run-staging
      Conditions:
        Last Transition Time:  2025-03-17T21:37:29Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskApprovalRequestCreated
        Status:                True
        Type:                  ApprovalRequestCreated
      Type:                    Approval
      Conditions:
        Last Transition Time:  2025-03-17T21:38:29Z
        Message:               
        Observed Generation:   1
        Reason:                AfterStageTaskWaitTimeElapsed
        Status:                True
        Type:                  WaitTimeElapsed
      Type:                    TimedWait
    Clusters:
      Cluster Name:  member1
      Conditions:
        Last Transition Time:  2025-03-17T21:37:14Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingStarted
        Status:                True
        Type:                  Started
        Last Transition Time:  2025-03-17T21:37:29Z
        Message:               
        Observed Generation:   1
        Reason:                ClusterUpdatingSucceeded # member1 has updated successfully
        Status:                True
        Type:                  Succeeded
    Conditions:
      Last Transition Time:  2025-03-17T21:37:29Z
      Message:               
      Observed Generation:   1
      Reason:                StageUpdatingWaiting # waiting for approval
      Status:                False
      Type:                  Progressing
    Stage Name:              staging
    Start Time:              2025-03-17T21:37:14Z
    After Stage Task Status:
      Approval Request Name:  example-run-canary
      Type:                   Approval
    Clusters:
      Cluster Name:  member2
    Stage Name:      canary
    After Stage Task Status:
      Type:                   TimedWait
      Approval Request Name:  example-run-production
      Type:                   Approval
    Clusters:
    Stage Name:  production
...

In the above status, member1 in the staging stage has been updated successfully, and the stage is waiting for approval to proceed to the next stage. member2 in the canary stage has not been updated yet.

Let’s take a look at the status of the ClusterResourcePlacement example-placement:

kubectl describe crp example-placement
Name:         example-placement
...
Status:
  Conditions:
    Last Transition Time:   2025-03-12T23:01:32Z
    Message:                found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
    Observed Generation:    1
    Reason:                 SchedulingPolicyFulfilled
    Status:                 True
    Type:                   ClusterResourcePlacementScheduled
    Last Transition Time:   2025-03-13T07:35:25Z
    Message:                There are still 1 cluster(s) in the process of deciding whether to roll out the latest resources or not
    Observed Generation:    1
    Reason:                 RolloutStartedUnknown
    Status:                 Unknown
    Type:                   ClusterResourcePlacementRolloutStarted
  Observed Resource Index:  5
  Placement Statuses:
    Cluster Name:  member1
    Conditions:
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2025-03-17T21:37:14Z
      Message:               Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 5, clusterStagedUpdateRun: example-run
      Observed Generation:   1
      Reason:                RolloutStarted
      Status:                True
      Type:                  RolloutStarted
      Last Transition Time:  2025-03-17T21:37:14Z
      Message:               No override rules are configured for the selected resources
      Observed Generation:   1
      Reason:                NoOverrideSpecified
      Status:                True
      Type:                  Overridden
      Last Transition Time:  2025-03-17T21:37:14Z
      Message:               All of the works are synchronized to the latest
      Observed Generation:   1
      Reason:                AllWorkSynced
      Status:                True
      Type:                  WorkSynchronized
      Last Transition Time:  2025-03-17T21:37:14Z
      Message:               All corresponding work objects are applied
      Observed Generation:   1
      Reason:                AllWorkHaveBeenApplied
      Status:                True
      Type:                  Applied
      Last Transition Time:  2025-03-17T21:37:14Z
      Message:               All corresponding work objects are available
      Observed Generation:   1
      Reason:                AllWorkAreAvailable # member1 is all good
      Status:                True
      Type:                  Available
    Cluster Name:            member2
    Conditions:
      Last Transition Time:  2025-03-12T23:01:32Z
      Message:               Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
      Observed Generation:   1
      Reason:                Scheduled
      Status:                True
      Type:                  Scheduled
      Last Transition Time:  2025-03-13T07:35:25Z
      Message:               In the process of deciding whether to roll out the latest resources or not
      Observed Generation:   1
      Reason:                RolloutStartedUnknown # member2 is not updated yet
      Status:                Unknown
      Type:                  RolloutStarted
...

In the Placement Statuses section, we can see the status of each member cluster. For member1, the RolloutStarted condition is set to True, indicating the rollout has started. The condition message includes the ClusterStagedUpdateRun name, example-run, which indicates the most recent cluster update was triggered by example-run. It also displays the detailed update status: the works are synced, applied, and detected as available. In comparison, member2 is still only in the Scheduled state.

When troubleshooting a stalled updateRun, examining the ClusterResourcePlacement status offers valuable diagnostic information that can help identify the root cause. For comprehensive troubleshooting steps, refer to the troubleshooting guide.

Concurrent updateRuns

Multiple concurrent ClusterStagedUpdateRuns can be created for the same ClusterResourcePlacement, allowing fleet administrators to pipeline the rollout of different resource versions. However, to maintain consistency across the fleet and prevent member clusters from running different resource versions simultaneously, we enforce a key constraint: all concurrent ClusterStagedUpdateRuns must use identical ClusterStagedUpdateStrategy settings.

This strategy consistency requirement is validated during the initialization phase of each updateRun. This validation ensures predictable rollout behavior and prevents configuration drift across your cluster fleet, even when multiple updates are in progress.
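
For example, a second run that rolls out a newer resource snapshot (a hypothetical index "1") through the same pipeline would reuse the exact same strategy:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
  name: example-run-2
spec:
  placementName: example-placement
  resourceSnapshotIndex: "1" # hypothetical newer snapshot index
  stagedRolloutStrategyName: example-strategy # must match the strategy used by any in-progress updateRuns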

10 - Eviction and Placement Disruption Budget

Concept about Eviction and Placement Disruption Budget

This document explains the concept of Eviction and Placement Disruption Budget in the context of the fleet.

Overview

Eviction provides a way to forcibly remove resources from a target cluster once the resources have already been propagated from the hub cluster by a Placement object. Eviction is considered a voluntary disruption triggered by the user. Eviction alone doesn’t guarantee that resources won’t be propagated to the target cluster again by the scheduler. Users need to use taints in conjunction with Eviction to prevent the scheduler from picking the target cluster again.

The Placement Disruption Budget object protects against voluntary disruptions.

The only voluntary disruption that can occur in the fleet is the eviction of resources from a target cluster which can be achieved by creating the ClusterResourcePlacementEviction object.

Some cases of involuntary disruptions in the context of the fleet:

  • The removal of resources from a member cluster by the scheduler due to scheduling policy changes.
  • Users manually deleting workload resources running on a member cluster.
  • Users manually deleting the ClusterResourceBinding object, which is an internal resource that represents the placement of resources on a member cluster.
  • Workloads failing to run properly on a member cluster due to misconfiguration or cluster related issues.

The Placement Disruption Budget object does not protect against any of the involuntary disruptions described above.

ClusterResourcePlacementEviction

An eviction object is used to remove resources from a member cluster once the resources have already been propagated from the hub cluster.

The eviction object is reconciled only once, after which it reaches a terminal state. The terminal states for ClusterResourcePlacementEviction are:

  • ClusterResourcePlacementEviction is valid and it’s executed successfully.
  • ClusterResourcePlacementEviction is invalid.
  • ClusterResourcePlacementEviction is valid but it’s not executed.

To successfully evict resources from a cluster, the user needs to specify:

  • The name of the ClusterResourcePlacement object which propagated resources to the target cluster.
  • The name of the target cluster from which we need to evict resources.
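
A minimal sketch of such an eviction, assuming the spec exposes placementName and clusterName fields for these two inputs, and reusing the example-placement and member1 names from earlier in this document:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementEviction
metadata:
  name: example-eviction # hypothetical name for illustration
spec:
  placementName: example-placement # the CRP that propagated the resources
  clusterName: member1             # the cluster to evict the resources from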

When specifying the ClusterResourcePlacement object in the eviction’s spec, the user needs to consider the following cases:

  • For PickFixed CRP, eviction is not allowed; it is recommended that one directly edit the list of target clusters on the CRP object.
  • For PickAll & PickN CRPs, eviction is allowed because the users cannot deterministically pick or unpick a cluster based on the placement strategy; it’s up to the scheduler.

Note: After an eviction is executed, there is no guarantee that the cluster won’t be picked again by the scheduler to propagate resources for a ClusterResourcePlacement resource. The user needs to specify a taint on the cluster to prevent the scheduler from picking the cluster again. This is especially true for PickAll ClusterResourcePlacement because the scheduler will try to propagate resources to all the clusters in the fleet.
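
One way to do this (a sketch, assuming the MemberCluster spec exposes a taints list with key/value/effect entries, mirroring the tolerations shown earlier on the ClusterResourcePlacement) is to patch the evicted cluster with a NoSchedule taint; the evicted key/value pair here is hypothetical:

kubectl patch membercluster member1 --type='merge' -p '{"spec":{"taints":[{"key":"evicted","value":"true","effect":"NoSchedule"}]}}'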

ClusterResourcePlacementDisruptionBudget

The ClusterResourcePlacementDisruptionBudget is used to protect resources propagated by a ClusterResourcePlacement to a target cluster from voluntary disruption, i.e., ClusterResourcePlacementEviction.

Note: When specifying a ClusterResourcePlacementDisruptionBudget, the name should be the same as the ClusterResourcePlacement that it’s trying to protect.

Users may specify exactly one of the following two fields in the ClusterResourcePlacementDisruptionBudget spec, since they are mutually exclusive:

  • MaxUnavailable - specifies the maximum number of clusters in which a placement can be unavailable due to any form of disruptions.
  • MinAvailable - specifies the minimum number of clusters in which placements are available despite any form of disruptions.

For both MaxUnavailable and MinAvailable, the user can specify the number of clusters as an integer or as a percentage of the total number of clusters in the fleet.
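
For illustration, a budget that keeps the placement available on at least 2 clusters (a sketch assuming the placement.kubernetes-fleet.io/v1beta1 API group used by the other placement resources in this document and a minAvailable field in the spec) could look like:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  name: example-placement # must match the name of the CRP it protects
spec:
  minAvailable: 2 # keep the placement available on at least 2 clusters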

Note: For both MaxUnavailable and MinAvailable, involuntary disruptions are not blocked by the disruption budget, but they still count against it (i.e., they reduce the number of placements considered available).

When specifying a disruption budget for a particular ClusterResourcePlacement, the user needs to consider the following cases:

CRP type    | MinAvailable DB with an integer | MinAvailable DB with a percentage | MaxUnavailable DB with an integer | MaxUnavailable DB with a percentage
PickFixed   | not allowed                     | not allowed                       | not allowed                       | not allowed
PickAll     | allowed                         | not allowed                       | not allowed                       | not allowed
PickN       | allowed                         | allowed                           | allowed                           | allowed

Note: Eviction is not allowed for a PickFixed CRP, so specifying a ClusterResourcePlacementDisruptionBudget for a PickFixed CRP does nothing. For a PickAll CRP, the user can only specify MinAvailable because the total number of clusters selected by a PickAll CRP is non-deterministic. If the user creates an invalid ClusterResourcePlacementDisruptionBudget object, any eviction created against it will not be executed successfully.