The documentation in this section explains core Fleet concepts. Pick one below to proceed.
Concepts
- 1: Fleet components
- 2: MemberCluster
- 3: ClusterResourcePlacement
- 4: Scheduler
- 5: Scheduling Framework
- 6: Properties and Property Providers
- 7: Safe Rollout
- 8: Override
- 9: Staged Update
- 10: Eviction and Placement Disruption Budget
1 - Fleet components
Components
This document provides an overview of the components required for a fully functional and operational Fleet setup.
The fleet consists of the following components:
- fleet-hub-agent is a Kubernetes controller that creates and reconciles all the fleet-related CRs in the hub cluster.
- fleet-member-agent is a Kubernetes controller that creates and reconciles all the fleet-related CRs in the member cluster. The fleet-member-agent pulls the latest CRs from the hub cluster and continuously reconciles the member cluster to the desired state.
Fleet implements an agent-based pull model. This distributes the work to the member clusters and helps remove the scalability bottleneck by dividing the load among them. It also means the hub cluster does not need direct access to the member clusters, so Fleet can support member clusters that only have outbound network connectivity and no inbound network access.
To allow multiple clusters to run securely, fleet will create a reserved namespace on the hub cluster to isolate the access permissions and resources across multiple clusters.
2 - MemberCluster
Overview
The fleet constitutes an implementation of a ClusterSet and encompasses the following attributes:
- A collective of clusters managed by a centralized authority.
- Typically characterized by a high level of mutual trust within the cluster set.
- Embraces the principle of Namespace Sameness across clusters:
- Ensures uniform permissions and characteristics for a given namespace across all clusters.
- While not mandatory for every cluster, namespaces exhibit consistent behavior across those where they are present.
The MemberCluster is a cluster-scoped API established within the hub cluster, serving as a representation of a cluster within the fleet. This API offers a dependable, uniform, and automated approach for multi-cluster applications (frameworks, toolsets) to identify registered clusters within a fleet. Additionally, it facilitates applications in querying a list of clusters managed by the fleet or observing cluster statuses for subsequent actions.
Some illustrative use cases include:
- The Fleet Scheduler utilizing managed cluster statuses or specific cluster properties (e.g., labels, taints) of a MemberCluster for resource scheduling.
- Automation tools like GitOps systems (e.g., ArgoCD or Flux) automatically registering/deregistering clusters in compliance with the MemberCluster API.
- The MCS API automatically generating ServiceImport CRs based on the MemberCluster CR defined within a fleet.
Moreover, it furnishes a user-friendly interface for human operators to monitor the managed clusters.
MemberCluster Lifecycle
Joining the Fleet
The process to join the fleet involves creating a MemberCluster CR. The MemberCluster controller, a constituent of the hub-cluster-agent described in Fleet components, watches the MemberCluster CR and generates a corresponding namespace for the member cluster within the hub cluster. It configures roles and role bindings within the hub cluster, authorizing the specified member cluster identity (as detailed in the MemberCluster spec) access solely to resources within that namespace. To collect member cluster status, the controller generates another internal CR named InternalMemberCluster within the newly formed namespace. Simultaneously, the InternalMemberCluster controller, a component of the member-cluster-agent situated in the member cluster, gathers statistics on cluster usage, such as capacity utilization, and reports its status based on the HeartbeatPeriodSeconds specified in the CR. Meanwhile, the MemberCluster controller consolidates agent statuses and marks the cluster as Joined.
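For illustration, a minimal MemberCluster CR might look like the sketch below; the identity subject and heartbeat value are placeholders, so consult the join how-to guide and API reference for the authoritative fields.
```yaml
apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata:
  name: kind-cluster-1
spec:
  # Identity used by the member agent to access its reserved namespace on the hub (placeholder values).
  identity:
    name: fleet-member-agent-cluster-1
    kind: ServiceAccount
    namespace: fleet-system
    apiGroup: ""
  # How often the member agent reports its status back to the hub cluster.
  heartbeatPeriodSeconds: 60
```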
Leaving the Fleet
Fleet administrators can deregister a cluster by deleting the MemberCluster CR. Upon detection of the deletion event by the MemberCluster controller within the hub cluster, it removes the corresponding InternalMemberCluster CR in the reserved namespace of the member cluster. It awaits completion of the "leave" process by the InternalMemberCluster controller of the member agent, and then deletes the roles, role bindings, and other resources, including the member cluster's reserved namespace, on the hub cluster.
Taints
Taints are a mechanism to prevent the Fleet Scheduler from scheduling resources to a MemberCluster. We adopt the concept of taints and tolerations introduced in Kubernetes to the multi-cluster use case.
The MemberCluster CR supports the specification of a list of taints, which are applied to the MemberCluster. Each Taint object comprises the following fields:
- key: The key of the taint.
- value: The value of the taint.
- effect: The effect of the taint, which can be NoSchedule for now.
Once a MemberCluster is tainted with a specific taint, it lets the Fleet Scheduler know that the MemberCluster should not receive resources as part of the workload propagation from the hub cluster.
The NoSchedule taint is a signal to the Fleet Scheduler to avoid scheduling resources from a ClusterResourcePlacement to the MemberCluster.
Any MemberCluster already selected for resource propagation will continue to receive resources even if a new taint is added.
Taints are only honored by ClusterResourcePlacement objects with the PickAll or PickN placement policies. In the case of the PickFixed placement policy, the taints are ignored because the user has explicitly specified the MemberClusters where the resources should be placed.
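For illustration, a taint is declared on the MemberCluster spec as sketched below; the layout follows the Taint fields listed above, but treat it as an assumption and see the taints how-to guide for the authoritative format.
```yaml
apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata:
  name: member-cluster-1
spec:
  # NoSchedule keeps the Fleet Scheduler from placing new resources on this cluster.
  taints:
    - key: environment
      value: canary
      effect: NoSchedule
```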
For detailed instructions, please refer to this document.
What’s next
- Get hands-on experience with how to add a member cluster to a fleet.
- Explore the ClusterResourcePlacement concept to place cluster-scoped resources among managed clusters.
3 - ClusterResourcePlacement
Overview
The ClusterResourcePlacement concept is used to dynamically select cluster-scoped resources (especially namespaces and all objects within them) and control how they are propagated to all or a subset of the member clusters.
A ClusterResourcePlacement mainly consists of three parts:
Resource selection: select which cluster-scoped Kubernetes resource objects need to be propagated from the hub cluster to selected member clusters.
It supports the following forms of resource selection:
- Select resources by specifying just the <group, version, kind>. This selection propagates all resources with matching <group, version, kind>.
- Select resources by specifying the <group, version, kind> and name. This selection propagates only one resource that matches the <group, version, kind> and name.
- Select resources by specifying the <group, version, kind> and a set of labels using ClusterResourcePlacement -> LabelSelector. This selection propagates all resources that match the <group, version, kind> and labels specified.
Note: When a namespace is selected, all the namespace-scoped objects under this namespace are propagated to the selected member clusters along with this namespace.
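As an illustrative sketch, the three selection forms map to resourceSelectors entries like the following; the groups, kinds, names, and labels are placeholders.
```yaml
resourceSelectors:
  # Form 1: every resource matching <group, version, kind>.
  - group: rbac.authorization.k8s.io
    version: v1
    kind: ClusterRole
  # Form 2: a single resource matching <group, version, kind> and name.
  - group: ""
    version: v1
    kind: Namespace
    name: test-deployment
  # Form 3: all resources matching <group, version, kind> and the given labels.
  - group: ""
    version: v1
    kind: Namespace
    labelSelector:
      matchLabels:
        app: web
```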
Placement policy: limit propagation of selected resources to a specific subset of member clusters. The following types of target cluster selection are supported:
- PickAll (Default): select any member clusters with matching cluster Affinity scheduling rules. If the Affinity is not specified, it will select all joined and healthy member clusters.
- PickFixed: select a fixed list of member clusters defined in the ClusterNames.
- PickN: select a NumberOfClusters of member clusters with optional matching cluster Affinity scheduling rules or topology spread constraints TopologySpreadConstraints.
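For example, a PickFixed policy and a PickAll policy with a required affinity term could be sketched as follows; the cluster names and labels are placeholders.
```yaml
# PickFixed: place resources only on the explicitly listed clusters.
policy:
  placementType: PickFixed
  clusterNames:
    - aks-member-1
    - aks-member-2
---
# PickAll: place resources on every joined, healthy cluster matching the affinity.
policy:
  placementType: PickAll
  affinity:
    clusterAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                env: prod
```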
Strategy: how changes are rolled out (rollout strategy) and how resources are applied on the member cluster side (apply strategy).
A simple ClusterResourcePlacement looks like this:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
name: crp-1
spec:
policy:
placementType: PickN
numberOfClusters: 2
topologySpreadConstraints:
- maxSkew: 1
topologyKey: "env"
whenUnsatisfiable: DoNotSchedule
resourceSelectors:
- group: ""
kind: Namespace
name: test-deployment
version: v1
revisionHistoryLimit: 100
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
unavailablePeriodSeconds: 5
type: RollingUpdate
When To Use ClusterResourcePlacement
ClusterResourcePlacement is useful when you want a general way of managing and running workloads across multiple clusters.
Some example scenarios include the following:
- As a platform operator, I want to place my cluster-scoped resources (especially namespaces and all objects within them) on a cluster that resides in the us-east-1 region.
- As a platform operator, I want to spread my cluster-scoped resources (especially namespaces and all objects within them) evenly across the different regions/zones.
- As a platform operator, I prefer to place my test resources into the staging AKS cluster.
- As a platform operator, I would like to separate the workloads for compliance or policy reasons.
- As a developer, I want to run my cluster-scoped resources (especially namespaces and all objects within them) on 3 clusters. In addition, each time I update my workloads, the updates take place with zero downtime by rolling out to these three clusters incrementally.
Placement Workflow
The placement controller creates ClusterSchedulingPolicySnapshot and ClusterResourceSnapshot snapshots by watching the ClusterResourcePlacement object, so that it can trigger the scheduling and resource rollout process whenever needed.
The override controller creates the corresponding snapshots by watching the ClusterResourceOverride and ResourceOverride objects, capturing a snapshot of the overrides.
The placement workflow will be divided into several stages:
- Scheduling: the multi-cluster scheduler makes the schedule decision by creating the clusterResourceBinding for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement.
- Rolling out resources: the rollout controller applies the resources to the selected member clusters based on the rollout strategy.
- Overriding: the work generator applies the override rules defined by ClusterResourceOverride and ResourceOverride to the selected resources on the target clusters.
- Creating or updating works: the work generator creates the work in the corresponding member cluster namespace. Each work contains the (overridden) manifest workload to be deployed on the member cluster.
- Applying resources on target clusters: the apply work controller applies the manifest workload on the member clusters.
- Checking resource availability: the apply work controller checks the resource availability on the target clusters.
Resource Selection
Resource selectors identify cluster-scoped objects to include based on standard Kubernetes identifiers - namely, the group, kind, version, and name of the object. Namespace-scoped objects are included automatically when the namespace they are part of is selected. The example ClusterResourcePlacement above would include the test-deployment namespace and any objects that were created in that namespace.
The clusterResourcePlacement controller creates the ClusterResourceSnapshot to store a snapshot of the resources selected by the placement. The ClusterResourceSnapshot spec is immutable. Each time the selected resources are updated, the clusterResourcePlacement controller detects the resource changes and creates a new ClusterResourceSnapshot. This implies that resources can change independently of any modifications to the ClusterResourceSnapshot. In other words, resource changes can occur without directly affecting the ClusterResourceSnapshot itself.
The total amount of selected resources may exceed the 1MB limit for a single Kubernetes object. As a result, the controller may produce more than one ClusterResourceSnapshot for all the selected resources.
ClusterResourceSnapshot sample:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourceSnapshot
metadata:
annotations:
kubernetes-fleet.io/number-of-enveloped-object: "0"
kubernetes-fleet.io/number-of-resource-snapshots: "1"
kubernetes-fleet.io/resource-hash: e0927e7d75c7f52542a6d4299855995018f4a6de46edf0f814cfaa6e806543f3
creationTimestamp: "2023-11-10T08:23:38Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-1
kubernetes-fleet.io/resource-index: "4"
name: crp-1-4-snapshot
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-1
uid: 757f2d2c-682f-433f-b85c-265b74c3090b
resourceVersion: "1641940"
uid: d6e2108b-882b-4f6c-bb5e-c5ec5491dd20
spec:
selectedResources:
- apiVersion: v1
kind: Namespace
metadata:
labels:
kubernetes.io/metadata.name: test
name: test
spec:
finalizers:
- kubernetes
- apiVersion: v1
data:
key1: value1
key2: value2
key3: value3
kind: ConfigMap
metadata:
name: test-1
namespace: test
Placement Policy
ClusterResourcePlacement supports three types of policy as mentioned above. A ClusterSchedulingPolicySnapshot will be generated whenever policy changes are made to the ClusterResourcePlacement that require a new scheduling. Similar to ClusterResourceSnapshot, its spec is immutable.
ClusterSchedulingPolicySnapshot sample:
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterSchedulingPolicySnapshot
metadata:
annotations:
kubernetes-fleet.io/CRP-generation: "5"
kubernetes-fleet.io/number-of-clusters: "2"
creationTimestamp: "2023-11-06T10:22:56Z"
generation: 1
labels:
kubernetes-fleet.io/is-latest-snapshot: "true"
kubernetes-fleet.io/parent-CRP: crp-1
kubernetes-fleet.io/policy-index: "1"
name: crp-1-1
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1
blockOwnerDeletion: true
controller: true
kind: ClusterResourcePlacement
name: crp-1
uid: 757f2d2c-682f-433f-b85c-265b74c3090b
resourceVersion: "1639412"
uid: 768606f2-aa5a-481a-aa12-6e01e6adbea2
spec:
policy:
placementType: PickN
policyHash: NDc5ZjQwNWViNzgwOGNmYzU4MzY2YjI2NDg2ODBhM2E4MTVlZjkxNGZlNjc1NmFlOGRmMGQ2Zjc0ODg1NDE2YQ==
status:
conditions:
- lastTransitionTime: "2023-11-06T10:22:56Z"
message: found all the clusters needed as specified by the scheduling policy
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: Scheduled
observedCRPGeneration: 5
targetClusters:
- clusterName: aks-member-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
- clusterName: aks-member-2
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
In contrast to the original scheduler framework in Kubernetes, the multi-cluster scheduling process involves selecting a cluster for placement through a structured 5-step operation:
- Batch & PostBatch
- Filter
- Score
- Sort
- Bind
The batch & postBatch step is to define the batch size according to the desired and current ClusterResourceBinding. The postBatch step adjusts the batch size if needed.
The filter step finds the set of clusters where it's feasible to schedule the placement; for example, whether the cluster matches the required Affinity scheduling rules specified in the Policy. It also filters out any clusters which are leaving the fleet or no longer connected to the fleet, for example, because their heartbeat has been stopped for a prolonged period of time.
In the score step (only applied to the pickN type), the scheduler assigns a score to each cluster that survived filtering. Each cluster is given a topology spread score (how much a cluster would satisfy the topology spread constraints specified by the user), and an affinity score (how much a cluster would satisfy the preferred affinity terms specified by the user).
In the sort step (only applied to the pickN type), it sorts all eligible clusters by their scores, sorting first by topology spread score and breaking ties based on the affinity score.
The bind step is to create/update/delete the ClusterResourceBinding based on the desired and current member cluster list.
Strategy
Rollout strategy
Use rollout strategy to control how KubeFleet rolls out a resource change made on the hub cluster to all member clusters. Right now KubeFleet supports two types of rollout strategies out of the box:
- Rolling update: this rollout strategy helps roll out changes incrementally in a way that ensures system availability, akin to how the Kubernetes Deployment API handles updates. For more information, see the Safe Rollout concept.
- Staged update: this rollout strategy helps roll out changes in different stages; users may group clusters into different stages and specify the order in which each stage receives the update. The strategy also allows users to set up timed or approval-based gates between stages to fine-control the flow. For more information, see the Staged Update concept and Staged Update How-To Guide.
Apply strategy
Use apply strategy to control how KubeFleet applies a resource to a member cluster. KubeFleet currently features three different types of apply strategies:
- Client-side apply: this apply strategy sets up KubeFleet to apply resources in a three-way merge that is similar to how the Kubernetes CLI, kubectl, performs client-side apply.
- Server-side apply: this apply strategy sets up KubeFleet to apply resources via the new server-side apply mechanism.
- Report Diff mode: this apply strategy instructs KubeFleet to check for configuration differences between the resource on the hub cluster and its counterparts among the member clusters; no apply op will be performed. For more information, see the ReportDiff Mode How-To Guide.
To learn more about the differences between client-side apply and server-side apply, see also the Kubernetes official documentation.
KubeFleet apply strategy is also the place where users can set up KubeFleet’s drift detection capabilities and takeover settings:
- Drift detection helps users identify and resolve configuration drifts that are commonly observed in a multi-cluster environment; through this feature, KubeFleet can detect the presence of drifts, reveal their details, and let users decide how and when to handle them. See the Drift Detection How-To Guide for more information.
- Takeover settings allows users to decide how KubeFleet can best handle pre-existing resources. When you join a cluster with running workloads into a fleet, these settings can help bring the workloads under KubeFleet’s management in a way that avoids interruptions. For specifics, see the Takeover Settings How-To Guide.
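As a hedged sketch, the apply strategy lives alongside the rollout settings under the CRP strategy field; the applyStrategy field and the type values shown below are assumptions inferred from the strategy types described above, so verify them against the API reference.
```yaml
strategy:
  type: RollingUpdate
  applyStrategy:
    # Assumed value names: ClientSideApply, ServerSideApply, or ReportDiff.
    type: ServerSideApply
```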
Placement status
After a ClusterResourcePlacement is created, details on the current status can be seen by performing a kubectl describe crp <name>.
The status output will indicate both placement conditions and individual placement statuses on each member cluster that was selected.
The list of resources that are selected for placement will also be included in the describe output.
Sample output:
Name: crp-1
Namespace:
Labels: <none>
Annotations: <none>
API Version: placement.kubernetes-fleet.io/v1
Kind: ClusterResourcePlacement
Metadata:
...
Spec:
Policy:
Placement Type: PickAll
Resource Selectors:
Group:
Kind: Namespace
Name: application-1
Version: v1
Revision History Limit: 10
Strategy:
Rolling Update:
Max Surge: 25%
Max Unavailable: 25%
Unavailable Period Seconds: 2
Type: RollingUpdate
Status:
Conditions:
Last Transition Time: 2024-04-29T09:58:20Z
Message: found all the clusters needed as specified by the scheduling policy
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2024-04-29T09:58:20Z
Message: All 3 cluster(s) start rolling out the latest resource
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: ClusterResourcePlacementRolloutStarted
Last Transition Time: 2024-04-29T09:58:20Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: ClusterResourcePlacementOverridden
Last Transition Time: 2024-04-29T09:58:20Z
Message: Works(s) are succcesfully created or updated in the 3 target clusters' namespaces
Observed Generation: 1
Reason: WorkSynchronized
Status: True
Type: ClusterResourcePlacementWorkSynchronized
Last Transition Time: 2024-04-29T09:58:20Z
Message: The selected resources are successfully applied to 3 clusters
Observed Generation: 1
Reason: ApplySucceeded
Status: True
Type: ClusterResourcePlacementApplied
Last Transition Time: 2024-04-29T09:58:20Z
Message: The selected resources in 3 cluster are available now
Observed Generation: 1
Reason: ResourceAvailable
Status: True
Type: ClusterResourcePlacementAvailable
Observed Resource Index: 0
Placement Statuses:
Cluster Name: kind-cluster-1
Conditions:
Last Transition Time: 2024-04-29T09:58:20Z
Message: Successfully scheduled resources for placement in kind-cluster-1 (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2024-04-29T09:58:20Z
Message: Detected the new changes on the resources and started the rollout process
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2024-04-29T09:58:20Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2024-04-29T09:58:20Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2024-04-29T09:58:20Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2024-04-29T09:58:20Z
Message: The availability of work object crp-1-work is not trackable
Observed Generation: 1
Reason: WorkNotTrackable
Status: True
Type: Available
Cluster Name: kind-cluster-2
Conditions:
Last Transition Time: 2024-04-29T09:58:20Z
Message: Successfully scheduled resources for placement in kind-cluster-2 (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2024-04-29T09:58:20Z
Message: Detected the new changes on the resources and started the rollout process
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2024-04-29T09:58:20Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2024-04-29T09:58:20Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2024-04-29T09:58:20Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2024-04-29T09:58:20Z
Message: The availability of work object crp-1-work is not trackable
Observed Generation: 1
Reason: WorkNotTrackable
Status: True
Type: Available
Cluster Name: kind-cluster-3
Conditions:
Last Transition Time: 2024-04-29T09:58:20Z
Message: Successfully scheduled resources for placement in kind-cluster-3 (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2024-04-29T09:58:20Z
Message: Detected the new changes on the resources and started the rollout process
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2024-04-29T09:58:20Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2024-04-29T09:58:20Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2024-04-29T09:58:20Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2024-04-29T09:58:20Z
Message: The availability of work object crp-1-work is not trackable
Observed Generation: 1
Reason: WorkNotTrackable
Status: True
Type: Available
Selected Resources:
Kind: Namespace
Name: application-1
Version: v1
Kind: ConfigMap
Name: app-config-1
Namespace: application-1
Version: v1
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal PlacementRolloutStarted 3m46s cluster-resource-placement-controller Started rolling out the latest resources
Normal PlacementOverriddenSucceeded 3m46s cluster-resource-placement-controller Placement has been successfully overridden
Normal PlacementWorkSynchronized 3m46s cluster-resource-placement-controller Work(s) have been created or updated successfully for the selected cluster(s)
Normal PlacementApplied 3m46s cluster-resource-placement-controller Resources have been applied to the selected cluster(s)
Normal PlacementRolloutCompleted 3m46s cluster-resource-placement-controller Resources are available in the selected clusters
Tolerations
Tolerations are a mechanism to allow the Fleet Scheduler to schedule resources to a MemberCluster that has taints specified on it. We adopt the concept of taints & tolerations introduced in Kubernetes to the multi-cluster use case.
The ClusterResourcePlacement CR supports the specification of a list of tolerations, which are applied to the ClusterResourcePlacement object. Each Toleration object comprises the following fields:
- key: The key of the toleration.
- value: The value of the toleration.
- effect: The effect of the toleration, which can be NoSchedule for now.
- operator: The operator of the toleration, which can be Exists or Equal.
Each toleration is used to tolerate one or more specific taints applied on the MemberCluster. Once all taints on a MemberCluster are tolerated by tolerations on a ClusterResourcePlacement, resources can be propagated to the MemberCluster by the scheduler for that ClusterResourcePlacement resource.
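For illustration, tolerations are listed on the ClusterResourcePlacement; the sketch below assumes they sit under spec.policy.tolerations, so double-check the field path against the tolerations how-to guide.
```yaml
apiVersion: placement.kubernetes-fleet.io/v1
kind: ClusterResourcePlacement
metadata:
  name: crp-with-tolerations
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-ns
  policy:
    placementType: PickAll
    tolerations:
      # Tolerates the taint environment=canary:NoSchedule on a MemberCluster.
      - key: environment
        operator: Equal
        value: canary
        effect: NoSchedule
```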
Tolerations cannot be updated or removed from a ClusterResourcePlacement. If a toleration needs to be updated, a better approach is to add another toleration. If you absolutely need to update or remove existing tolerations, the only option is to delete the existing ClusterResourcePlacement and create a new object with the updated tolerations.
For detailed instructions, please refer to this document.
Envelope Object
The ClusterResourcePlacement leverages the fleet hub cluster as a staging environment for customer resources. These resources are then propagated to member clusters that are part of the fleet, based on the ClusterResourcePlacement spec.
In essence, the objective is not to apply or create resources on the hub cluster for local use but to propagate these resources to other member clusters within the fleet.
Certain resources, when created or applied on the hub cluster, may lead to unintended side effects. These include:
- Validating/Mutating Webhook Configurations
- Cluster Role Bindings
- Resource Quotas
- Storage Classes
- Flow Schemas
- Priority Classes
- Ingress Classes
- Ingresses
- Network Policies
To address this, we support the use of a ConfigMap with a fleet-reserved annotation. This allows users to encapsulate resources that might have side effects on the hub cluster within the ConfigMap. For detailed instructions, please refer to this document.
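A minimal sketch of an envelope object is shown below; the annotation key is a placeholder for the fleet-reserved annotation mentioned above, so refer to the linked document for the exact key and format.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: quota-envelope
  namespace: app
  annotations:
    # Placeholder for the fleet-reserved envelope annotation; see the envelope object guide.
    kubernetes-fleet.io/envelope-configmap: "true"
data:
  # The wrapped resource is carried as data so it is not applied on the hub cluster itself.
  resourcequota.yaml: |
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: app-quota
      namespace: app
    spec:
      hard:
        cpu: "10"
```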
4 - Scheduler
The scheduler component is a vital element in Fleet workload scheduling. Its primary responsibility is to determine the schedule decision for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement.
By default, the scheduler operates in batch mode, which enhances performance. In this mode, it binds a ClusterResourceBinding from a ClusterResourcePlacement to multiple clusters whenever possible.
Batch in nature
Scheduling resources within a ClusterResourcePlacement involves more dependencies compared with scheduling pods within a deployment in Kubernetes. There are two notable distinctions:
- In a ClusterResourcePlacement, multiple replicas of resources cannot be scheduled on the same cluster, whereas pods belonging to the same deployment in Kubernetes can run on the same node.
- The ClusterResourcePlacement supports different placement types within a single object.
These requirements necessitate treating the scheduling policy as a whole and feeding it to the scheduler, as opposed to handling individual pods as Kubernetes does today. Specifically:
- Scheduling the entire ClusterResourcePlacement at once enables us to increase the parallelism of the scheduler if needed.
- Supporting the PickAll mode would otherwise require generating a replica for each cluster in the fleet to feed to the scheduler. This approach is not only inefficient but can also result in the scheduler repeatedly attempting to schedule unassigned replicas when there is no possibility of placing them.
- To support the PickN mode, the scheduler would need to compute the filtering and scoring for each replica. Conversely, in batch mode, these calculations are performed once. The scheduler sorts all the eligible clusters and picks the top N clusters.
Placement Decisions
The output of the scheduler is an array of ClusterResourceBindings on the hub cluster.
ClusterResourceBinding sample:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourceBinding
metadata:
annotations:
kubernetes-fleet.io/previous-binding-state: Bound
creationTimestamp: "2023-11-06T09:53:11Z"
finalizers:
- kubernetes-fleet.io/work-cleanup
generation: 8
labels:
kubernetes-fleet.io/parent-CRP: crp-1
name: crp-1-aks-member-1-2f8fe606
resourceVersion: "1641949"
uid: 3a443dec-a5ad-4c15-9c6d-05727b9e1d15
spec:
clusterDecision:
clusterName: aks-member-1
clusterScore:
affinityScore: 0
priorityScore: 0
reason: picked by scheduling policy
selected: true
resourceSnapshotName: crp-1-4-snapshot
schedulingPolicySnapshotName: crp-1-1
state: Bound
targetCluster: aks-member-1
status:
conditions:
- lastTransitionTime: "2023-11-06T09:53:11Z"
message: ""
observedGeneration: 8
reason: AllWorkSynced
status: "True"
type: Bound
- lastTransitionTime: "2023-11-10T08:23:38Z"
message: ""
observedGeneration: 8
reason: AllWorkHasBeenApplied
status: "True"
type: Applied
A ClusterResourceBinding can have three states:
- Scheduled: It indicates that the scheduler has selected this cluster for placing the resources. The resource is waiting to be picked up by the rollout controller.
- Bound: It indicates that the rollout controller has initiated the placement of resources on the target cluster. The resources are actively being deployed.
- Unscheduled: This state signifies that the target cluster is no longer selected by the scheduler for the placement. The resources associated with this cluster are in the process of being removed; they are awaiting deletion from the cluster.
The scheduler operates by generating scheduling decisions through the creation of new bindings in the "scheduled" state and the removal of existing bindings by marking them as "unscheduled". A separate rollout controller is responsible for executing these decisions based on the defined rollout strategy.
Enforcing the semantics of “IgnoreDuringExecutionTime”
The ClusterResourcePlacement enforces the semantics of "IgnoreDuringExecutionTime" to prioritize the stability of resources running in production. Therefore, the resources should not be moved or rescheduled without explicit changes to the scheduling policy.
Here are some high-level guidelines outlining the actions that trigger scheduling and corresponding behavior:
- Policy changes trigger scheduling:
  - The scheduler makes the placement decisions based on the latest ClusterSchedulingPolicySnapshot.
  - When it's just a scale-out operation (the NumberOfClusters of pickN mode is increased), the ClusterResourcePlacement controller updates the label of the existing ClusterSchedulingPolicySnapshot instead of creating a new one, so that the scheduler won't move any existing resources that are already scheduled and will just fulfill the new requirement.
- The following cluster changes trigger scheduling:
  - a cluster, originally ineligible for resource placement for some reason, becomes eligible, such as:
    - the cluster settings change, specifically the MemberCluster labels have changed
    - an unexpected issue which originally led the scheduler to discard the cluster (for example, agents not joining, networking issues, etc.) has been resolved
  - a cluster, originally eligible for resource placement, is leaving the fleet and becomes ineligible
  Note: The scheduler is only going to place the resources on the new cluster and won't touch the existing clusters.
- Resource-only changes do not trigger scheduling, including:
  - ResourceSelectors is updated in the ClusterResourcePlacement spec.
  - The selected resources are updated without directly affecting the ClusterResourcePlacement.
What’s next
- Read about Scheduling Framework
5 - Scheduling Framework
The fleet scheduling framework closely aligns with the native Kubernetes scheduling framework, incorporating several modifications and tailored functionalities.
The primary advantage of this framework lies in its capability to compile plugins directly into the scheduler. Its API facilitates the implementation of diverse scheduling features as plugins, thereby ensuring a lightweight and maintainable core.
The fleet scheduler integrates the following fundamental built-in plugin types:
- Topology Spread Plugin: Supports the TopologySpreadConstraints stipulated in the placement policy.
- Cluster Affinity Plugin: Facilitates the Affinity clause of the placement policy.
- Same Placement Affinity Plugin: Uniquely designed for the fleet, preventing multiple replicas (selected resources) from being placed within the same cluster. This distinguishes it from Kubernetes, which allows multiple pods on a node.
- Cluster Eligibility Plugin: Enables cluster selection based on specific status criteria.
- Taint & Toleration Plugin: Enables cluster selection based on taints on the cluster & tolerations on the ClusterResourcePlacement.
Compared to the Kubernetes scheduling framework, the fleet framework introduces additional stages for the pickN placement type:
- Batch & PostBatch:
  - Batch: Defines the batch size based on the desired and current ClusterResourceBinding.
  - PostBatch: Adjusts the batch size as necessary, unlike the Kubernetes scheduler, which schedules pods individually (batch size = 1).
- Sort:
- Fleet’s sorting mechanism selects a number of clusters, whereas Kubernetes’ scheduler prioritizes nodes with the highest scores.
To streamline the scheduling framework, certain stages, such as permit and reserve, have been omitted due to the absence of corresponding plugins or APIs enabling customers to reserve or permit clusters for specific placements. However, the framework remains designed for easy extension in the future to accommodate these functionalities.
In-tree plugins
The scheduler includes default plugins, each associated with distinct extension points:
Plugin | PostBatch | Filter | Score |
---|---|---|---|
Cluster Affinity | ❌ | ✅ | ✅ |
Same Placement Anti-affinity | ❌ | ✅ | ❌ |
Topology Spread Constraints | ✅ | ✅ | ✅ |
Cluster Eligibility | ❌ | ✅ | ❌ |
Taint & Toleration | ❌ | ✅ | ❌ |
The Cluster Affinity Plugin serves as an illustrative example and operates within the following extension points:
- PreFilter: Verifies whether the policy contains any required cluster affinity terms. If absent, the plugin bypasses the subsequent Filter stage.
- Filter: Filters out clusters that fail to meet the specified required cluster affinity terms outlined in the policy.
- PreScore: Determines if the policy includes any preferred cluster affinity terms. If none are found, this plugin will be skipped during the Score stage.
- Score: Assigns affinity scores to clusters based on compliance with the preferred cluster affinity terms stipulated in the policy.
6 - Properties and Property Providers
This document explains the concepts of property provider and cluster properties in Fleet.
Fleet allows developers to implement a property provider to expose arbitrary properties about a member cluster, such as its node count and available resources for workload placement. Platforms could also enable their property providers to expose platform-specific properties via Fleet. These properties can be useful in a variety of cases: for example, administrators could monitor the health of a member cluster using related properties; Fleet also supports making scheduling decisions based on the property data.
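As a hedged sketch of property-based scheduling, a ClusterResourcePlacement can require a minimum node count via a property selector in its cluster affinity; the propertySelector field and operator names below are assumptions, so confirm them against the property-based scheduling documentation.
```yaml
policy:
  placementType: PickAll
  affinity:
    clusterAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        clusterSelectorTerms:
          # Assumed field names: only clusters reporting at least 5 nodes are eligible.
          - propertySelector:
              matchExpressions:
                - name: kubernetes-fleet.io/node-count
                  operator: Ge
                  values:
                    - "5"
```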
Property provider
A property provider implements Fleet’s property provider interface:
// PropertyProvider is the interface that every property provider must implement.
type PropertyProvider interface {
// Collect is called periodically by the Fleet member agent to collect properties.
//
// Note that this call should complete promptly. Fleet member agent will cancel the
// context if the call does not complete in time.
Collect(ctx context.Context) PropertyCollectionResponse
// Start is called when the Fleet member agent starts up to initialize the property provider.
// This call should not block.
//
// Note that Fleet member agent will cancel the context when it exits.
Start(ctx context.Context, config *rest.Config) error
}
For the details, see the Fleet source code.
A property provider should be shipped as a part of the Fleet member agent and run alongside it. Refer to the Fleet source code for specifics on how to set it up with the Fleet member agent. At this moment, only one property provider can be set up with the Fleet member agent at a time. Once connected, the Fleet member agent will attempt to start it when the agent itself initializes; the agent will then start collecting properties from the property provider periodically.
A property provider can expose two types of properties: resource properties, and non-resource properties. To learn about the two types, see the section below. In addition, the provider can choose to report its status, such as any errors encountered when preparing the properties, in the form of Kubernetes conditions.
The Fleet member agent can run with or without a property provider. If a provider is not set up, or the given provider fails to start properly, the agent will collect limited properties about the cluster on its own, specifically the node count, plus the total/allocatable CPU and memory capacities of the host member cluster.
Cluster properties
A cluster property is an attribute of a member cluster. There are two types of properties:
- Resource property: the usage information of a resource in a member cluster; the name of the resource should be in the format of a Kubernetes label key, such as cpu and memory, and the usage information should consist of:
  - the total capacity of the resource, which is the amount of the resource installed in the cluster;
  - the allocatable capacity of the resource, which is the maximum amount of the resource that can be used for running user workloads, as some amount of the resource might be reserved by the OS, kubelet, etc.;
  - the available capacity of the resource, which is the amount of the resource that is currently free for running user workloads.
  Note that you may report a virtual resource via the property provider, if applicable.
- Non-resource property: a metric about a member cluster, in the form of a key/value pair; the key should be in the format of a Kubernetes label key, such as kubernetes-fleet.io/node-count, and the value at this moment should be a sortable numeric that can be parsed as a Kubernetes quantity.
Eventually, all cluster properties are exposed via the Fleet MemberCluster API, with the non-resource properties in the .status.properties field and the resource properties in the .status.resourceUsage field:
apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata: ...
spec: ...
status:
agentStatus: ...
conditions: ...
properties:
kubernetes-fleet.io/node-count:
observationTime: "2024-04-30T14:54:24Z"
value: "2"
...
resourceUsage:
allocatable:
cpu: 32
memory: "16Gi"
available:
cpu: 2
memory: "800Mi"
capacity:
cpu: 40
memory: "20Gi"
Note that conditions reported by the property provider (if any) would be available in the .status.conditions array as well.
Core properties
The following properties are considered core properties in Fleet, which should be supported in all property provider implementations. Fleet agents will collect them even when no property provider has been set up.
Property Type | Name | Description |
---|---|---|
Non-resource property | kubernetes-fleet.io/node-count | The number of nodes in a cluster. |
Resource property | cpu | The usage information (total, allocatable, and available capacity) of CPU resource in a cluster. |
Resource property | memory | The usage information (total, allocatable, and available capacity) of memory resource in a cluster. |
7 - Safe Rollout
One of the most important features of Fleet is the ability to safely roll out changes across multiple clusters. We do this by rolling out the changes in a controlled manner, ensuring that we only continue to propagate the changes to the next target clusters if the resources are successfully applied to the previous target clusters.
Overview
We automatically propagate any resource changes that are selected by a ClusterResourcePlacement from the hub cluster to the target clusters based on the placement policy defined in the ClusterResourcePlacement. In order to reduce the blast radius of such an operation, we provide users a way to safely roll out the new changes so that a bad release won't affect all the running instances all at once.
Rollout Strategy
We currently only support the RollingUpdate rollout strategy. It updates the resources in the selected target clusters gradually based on the maxUnavailable and maxSurge settings.
In place update policy
We always try to do an in-place update by respecting the rollout strategy if there is no change in the placement. This avoids unnecessary interruptions to the running workloads when there are only resource changes. For example, if you only change the tag of the deployment in the namespace you want to place, we will do an in-place update on the deployments already placed on the targeted clusters instead of moving the existing deployments to other clusters, even if the labels or properties of the current clusters are not the best match for the current placement policy.
How To Use RollingUpdateConfig
RollingUpdateConfig is used to control the behavior of the rolling update strategy.
MaxUnavailable and MaxSurge
MaxUnavailable specifies the maximum number of clusters connected to the fleet, relative to the target number of clusters specified in the ClusterResourcePlacement policy, in which resources propagated by the ClusterResourcePlacement can be unavailable. The minimum value for MaxUnavailable is 1, to avoid a stuck rollout during an in-place resource update.
MaxSurge specifies the maximum number of clusters that can be scheduled with resources above the target number of clusters specified in the ClusterResourcePlacement policy.
Note: MaxSurge only applies to rollouts to newly scheduled clusters, and doesn't apply to rollouts of workloads triggered by updates to already propagated resources. For updates to already propagated resources, we always try to do the updates in place with no surge.
The target number of clusters changes based on the ClusterResourcePlacement policy:
- For PickAll, it's the number of clusters picked by the scheduler.
- For PickN, it's the number of clusters specified in the ClusterResourcePlacement policy.
- For PickFixed, it's the length of the list of cluster names specified in the ClusterResourcePlacement policy.
Example 1:
Consider a fleet with 4 connected member clusters (cluster-1, cluster-2, cluster-3 & cluster-4) where every member cluster has the label env: prod. The hub cluster has a namespace called test-ns with a deployment in it.
The ClusterResourcePlacement spec is defined as follows:
spec:
resourceSelectors:
- group: ""
kind: Namespace
version: v1
name: test-ns
policy:
placementType: PickN
numberOfClusters: 3
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
strategy:
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
The rollout will be as follows:
- We try to pick 3 clusters out of 4; for this scenario, let's say we pick cluster-1, cluster-2 & cluster-3.
- Since we can't track the initial availability of the deployment, we roll out the namespace with the deployment to cluster-1, cluster-2 & cluster-3.
- Then we update the deployment with a bad image name to update the resource in place on cluster-1, cluster-2 & cluster-3.
- But since we have maxUnavailable set to 1, we will roll out the bad image name update for the deployment to one of the clusters (which cluster the resource is rolled out to first is non-deterministic). Once the deployment is updated on the first cluster, we will wait for the deployment's availability to be true before rolling out to the other clusters.
- And since we rolled out a bad image name update for the deployment, its availability will always be false and hence the rollout for the other two clusters will be stuck.
- Users might think maxSurge of 1 might be utilized here, but in this case, since we are updating the resource in place, maxSurge will not be utilized to surge and pick cluster-4.
Note: maxSurge will be utilized to pick cluster-4 if we change the policy to pick 4 clusters or change the placement type to PickAll.
Example 2:
Consider a fleet with 4 connected member clusters (cluster-1, cluster-2, cluster-3 & cluster-4) where,
- cluster-1 and cluster-2 have the label loc: west
- cluster-3 and cluster-4 have the label loc: east
The hub cluster has a namespace called test-ns with a deployment in it.
Initially, the ClusterResourcePlacement spec is defined as follows:
spec:
resourceSelectors:
- group: ""
kind: Namespace
version: v1
name: test-ns
policy:
placementType: PickN
numberOfClusters: 2
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
loc: west
strategy:
rollingUpdate:
maxSurge: 2
The rollout will be as follows:
- We try to pick clusters (cluster-1 and cluster-2) by specifying the label selector loc: west.
- Since we can't track the initial availability of the deployment, we roll out the namespace with the deployment to cluster-1 and cluster-2 and wait till they become available.
Then we update the ClusterResourcePlacement spec to the following:
spec:
resourceSelectors:
- group: ""
kind: Namespace
version: v1
name: test-ns
policy:
placementType: PickN
numberOfClusters: 2
affinity:
clusterAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
clusterSelectorTerms:
- labelSelector:
matchLabels:
loc: east
strategy:
rollingUpdate:
maxSurge: 2
The rollout will be as follows:
- We try to pick clusters (cluster-3 and cluster-4) by specifying the label selector loc: east.
- But this time around, since we have maxSurge set to 2, we are saying we can propagate resources to a maximum of 4 clusters while our specified target number of clusters is 2, so we will roll out the namespace with the deployment to both cluster-3 and cluster-4 before removing the deployment from cluster-1 and cluster-2.
- And since maxUnavailable is always set to 25% by default, which is rounded off to 1, we will remove the resource from one of the existing clusters (cluster-1 or cluster-2), because when maxUnavailable is 1 the policy mandates at least one cluster to be available.
UnavailablePeriodSeconds
UnavailablePeriodSeconds is used to configure the waiting time between rollout phases when we cannot determine whether the resources have rolled out successfully or not. This field is used only if the availability of the resources we propagate is not trackable. Refer to the Data only objects section for more details.
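For example, a placement that mostly carries untrackable resources can lengthen the wait between rollout phases; the values below are illustrative.
```yaml
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
    # Wait 60 seconds between rollout phases when resource availability cannot be tracked.
    unavailablePeriodSeconds: 60
```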
Availability based Rollout
We have built-in mechanisms to determine the availability of some common Kubernetes native resources. We only mark them as available in the target clusters when they meet the criteria we defined.
How It Works
We have an agent running in the target cluster to check the status of the resources. We have specific criteria for each of the following resources to determine if they are available or not. Here is the list of resources we support:
Deployment
We only mark a Deployment as available when all its pods are running, ready, and updated according to the latest spec.
DaemonSet
We only mark a DaemonSet as available when all its pods are available and updated according to the latest spec on all desired scheduled nodes.
StatefulSet
We only mark a StatefulSet as available when all its pods are running, ready, and updated according to the latest revision.
Job
We only mark a Job as available when it has at least one succeeded pod or one ready pod.
Service
For a Service, the availability is determined based on the service type as follows:
- For ClusterIP & NodePort services, we mark it as available when a cluster IP is assigned.
- For LoadBalancer services, we mark it as available when a LoadBalancerIngress has been assigned along with an IP or hostname.
- For ExternalName services, checking availability is not supported, so it will be marked as available with the "not trackable" reason.
Data only objects
For the objects described below, since they are data resources, we mark them as available immediately after creation:
- Namespace
- Secret
- ConfigMap
- Role
- ClusterRole
- RoleBinding
- ClusterRoleBinding
8 - Override
Overview
The ClusterResourceOverride and ResourceOverride provide a way to customize resource configurations before they are propagated to the target cluster by the ClusterResourcePlacement.
Difference Between ClusterResourceOverride And ResourceOverride
ClusterResourceOverride represents the cluster-wide policy that overrides cluster-scoped resources for one or more clusters, while ResourceOverride applies to resources in its own namespace as the namespace-wide policy.
Note: If a namespace is selected by the ClusterResourceOverride, ALL the resources under the namespace are selected automatically.
If a resource is selected by both ClusterResourceOverride and ResourceOverride, the ResourceOverride will win when resolving the conflicts.
When To Use Override
Overrides are useful when you want to customize the resources before they are propagated from the hub cluster to the target clusters. Some example use cases are:
- As a platform operator, I want to propagate a clusterRoleBinding to cluster-us-east and cluster-us-west and would like to grant the same role to different groups in each cluster.
- As a platform operator, I want to propagate a clusterRole to cluster-staging and cluster-production and would like to grant more permissions to the cluster-staging cluster than the cluster-production cluster.
- As a platform operator, I want to propagate a namespace to all the clusters and would like to customize the labels for each cluster.
- As an application developer, I would like to propagate a deployment to cluster-staging and cluster-production and would like to always use the latest image in the staging cluster and a specific image in the production cluster.
- As an application developer, I would like to propagate a deployment to all the clusters and would like to use different commands for my container in different regions.
Limits
- Each resource can only be selected by one override simultaneously. In the case of namespace-scoped resources, up to two overrides are allowed, considering the potential selection through both ClusterResourceOverride (selecting its namespace) and ResourceOverride.
- At most 100 ClusterResourceOverride objects can be created.
- At most 100 ResourceOverride objects can be created.
Placement
This specifies which placement the override should be applied to.
Resource Selector
ClusterResourceSelector of ClusterResourceOverride selects which cluster-scoped resources need to be overridden before applying to the selected clusters.
It supports the following forms of resource selection:
It supports the following forms of resource selection:
- Select resources by specifying the <group, version, kind> and name. This selection propagates only one resource that matches the <group, version, kind> and name.
Note: Label selector of ClusterResourceSelector is not supported.
ResourceSelector of ResourceOverride selects which namespace-scoped resources need to be overridden before applying to the selected clusters.
It supports the following forms of resource selection:
- Select resources by specifying the <group, version, kind> and name. This selection propagates only one resource that matches the <group, version, kind> and name under the ResourceOverride namespace.
Override Policy
Override policy defines how to override the selected resources on the target clusters.
It contains an array of override rules and its order determines the override order. For example, when there are two rules selecting the same fields on the target cluster, the last one will win.
Each override rule contains the following fields:
- ClusterSelector: which cluster(s) the override rule applies to. It supports the following forms of cluster selection:
  - Select clusters by specifying the cluster labels.
  - An empty selector selects ALL the clusters.
  - A nil selector selects NO target cluster.
  IMPORTANT: Only labelSelector is supported in the clusterSelectorTerms field.
- OverrideType: which type of override should be applied to the selected resources. The default type is JSONPatch.
  - JSONPatch: applies the JSON patch to the selected resources using RFC 6902.
  - Delete: deletes the selected resources on the target cluster.
- JSONPatchOverrides: a list of JSON patch override rules applied to the selected resources following RFC 6902 when the override type is JSONPatch.
Note: Updating the fields in the TypeMeta (e.g., apiVersion, kind) is not allowed.
Note: Updating the fields in the ObjectMeta (e.g., name, namespace) excluding annotations and labels is not allowed.
Note: Updating the fields in the Status (e.g., status) is not allowed.
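For instance, a rule that removes the selected resources from clusters labeled env: test could be sketched as below; the overrideType field name is inferred from the OverrideType field described above, so verify it against the API reference.
```yaml
policy:
  overrideRules:
    - clusterSelector:
        clusterSelectorTerms:
          - labelSelector:
              matchLabels:
                env: test
      # Delete the selected resources on the matching clusters instead of patching them.
      overrideType: Delete
```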
Reserved Variables in the JSON Patch Override Value
There is a list of reserved variables that will be replaced by the actual values used in the value of the JSON patch override rule:
- ${MEMBER-CLUSTER-NAME}: this will be replaced by the name of the memberCluster that represents this cluster.
For example, to add a label to the ClusterRole named secret-reader on clusters with the label env: prod, you can use the following configuration:
apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
name: example-cro
spec:
placement:
name: crp-example
clusterResourceSelectors:
- group: rbac.authorization.k8s.io
kind: ClusterRole
version: v1
name: secret-reader
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms:
- labelSelector:
matchLabels:
env: prod
jsonPatchOverrides:
- op: add
path: /metadata/labels
value:
{"cluster-name":"${MEMBER-CLUSTER-NAME}"}
The ClusterResourceOverride object above will add a label cluster-name with the value of the memberCluster name to the ClusterRole named secret-reader on clusters with the label env: prod.
When To Trigger Rollout
Fleet takes a snapshot of each override change in the form of a ClusterResourceOverrideSnapshot and a ResourceOverrideSnapshot. The snapshot is used to determine whether the override change should be applied to the existing ClusterResourcePlacement or not. If applicable, it will start rolling out the new resources to the target clusters by respecting the rollout strategy defined in the ClusterResourcePlacement.
Examples
add annotations to the configmap by using clusterResourceOverride
Suppose we create a configmap named app-config-1 under the namespace application-1 in the hub cluster, and we want to add an annotation to it, which is applied to all the member clusters.
apiVersion: v1
data:
data: test
kind: ConfigMap
metadata:
creationTimestamp: "2024-05-07T08:06:27Z"
name: app-config-1
namespace: application-1
resourceVersion: "1434"
uid: b4109de8-32f2-4ac8-9e1a-9cb715b3261d
Create a ClusterResourceOverride named cro-1 to add an annotation to the namespace application-1.
apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ClusterResourceOverride
metadata:
creationTimestamp: "2024-05-07T08:06:27Z"
finalizers:
- kubernetes-fleet.io/override-cleanup
generation: 1
name: cro-1
resourceVersion: "1436"
uid: 32237804-7eb2-4d5f-9996-ff4d8ce778e7
spec:
placement:
name: crp-example
clusterResourceSelectors:
- group: ""
kind: Namespace
name: application-1
version: v1
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms: []
jsonPatchOverrides:
- op: add
path: /metadata/annotations
value:
cro-test-annotation: cro-test-annotation-val
Check the configmap on one of the member clusters by running the kubectl get configmap app-config-1 -n application-1 -o yaml command:
apiVersion: v1
data:
data: test
kind: ConfigMap
metadata:
annotations:
cro-test-annotation: cro-test-annotation-val
kubernetes-fleet.io/last-applied-configuration: '{"apiVersion":"v1","data":{"data":"test"},"kind":"ConfigMap","metadata":{"annotations":{"cro-test-annotation":"cro-test-annotation-val","kubernetes-fleet.io/spec-hash":"4dd5a08aed74884de455b03d3b9c48be8278a61841f3b219eca9ed5e8a0af472"},"name":"app-config-1","namespace":"application-1","ownerReferences":[{"apiVersion":"placement.kubernetes-fleet.io/v1beta1","blockOwnerDeletion":false,"kind":"AppliedWork","name":"crp-1-work","uid":"77d804f5-f2f1-440e-8d7e-e9abddacb80c"}]}}'
kubernetes-fleet.io/spec-hash: 4dd5a08aed74884de455b03d3b9c48be8278a61841f3b219eca9ed5e8a0af472
creationTimestamp: "2024-05-07T08:06:27Z"
name: app-config-1
namespace: application-1
ownerReferences:
- apiVersion: placement.kubernetes-fleet.io/v1beta1
blockOwnerDeletion: false
kind: AppliedWork
name: crp-1-work
uid: 77d804f5-f2f1-440e-8d7e-e9abddacb80c
resourceVersion: "1449"
uid: a8601007-1e6b-4b64-bc05-1057ea6bd21b
Add annotations to the configmap by using ResourceOverride
You can use the ResourceOverride to explicitly add an annotation to the configmap app-config-1 in the namespace application-1.
apiVersion: placement.kubernetes-fleet.io/v1alpha1
kind: ResourceOverride
metadata:
creationTimestamp: "2024-05-07T08:25:31Z"
finalizers:
- kubernetes-fleet.io/override-cleanup
generation: 1
name: ro-1
namespace: application-1
resourceVersion: "3859"
uid: b4117925-bc3c-438d-a4f6-067bc4577364
spec:
placement:
name: crp-example
policy:
overrideRules:
- clusterSelector:
clusterSelectorTerms: []
jsonPatchOverrides:
- op: add
path: /metadata/annotations
value:
ro-test-annotation: ro-test-annotation-val
resourceSelectors:
- group: ""
kind: ConfigMap
name: app-config-1
version: v1
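As with the ClusterResourceOverride example above, you can verify the result on one of the member clusters; once the rollout completes, the ro-test-annotation should be present on the configmap:
kubectl get configmap app-config-1 -n application-1 -o jsonpath='{.metadata.annotations}'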
How To Validate If Overrides Are Applied
You can validate whether the overrides are applied by checking the ClusterResourcePlacement status. The status output indicates both the placement conditions and the individual placement status on each member cluster that was overridden.
Sample output:
status:
conditions:
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: found all the clusters needed as specified by the scheduling policy
observedGeneration: 1
reason: SchedulingPolicyFulfilled
status: "True"
type: ClusterResourcePlacementScheduled
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: All 3 cluster(s) start rolling out the latest resource
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: ClusterResourcePlacementRolloutStarted
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: The selected resources are successfully overridden in the 3 clusters
observedGeneration: 1
reason: OverriddenSucceeded
status: "True"
type: ClusterResourcePlacementOverridden
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: Works(s) are succcesfully created or updated in the 3 target clusters'
namespaces
observedGeneration: 1
reason: WorkSynchronized
status: "True"
type: ClusterResourcePlacementWorkSynchronized
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: The selected resources are successfully applied to 3 clusters
observedGeneration: 1
reason: ApplySucceeded
status: "True"
type: ClusterResourcePlacementApplied
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: The selected resources in 3 cluster are available now
observedGeneration: 1
reason: ResourceAvailable
status: "True"
type: ClusterResourcePlacementAvailable
observedResourceIndex: "0"
placementStatuses:
- applicableClusterResourceOverrides:
- cro-1-0
clusterName: kind-cluster-1
conditions:
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: 'Successfully scheduled resources for placement in kind-cluster-1 (affinity
score: 0, topology spread score: 0): picked by scheduling policy'
observedGeneration: 1
reason: Scheduled
status: "True"
type: Scheduled
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: Detected the new changes on the resources and started the rollout process
observedGeneration: 1
reason: RolloutStarted
status: "True"
type: RolloutStarted
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: Successfully applied the override rules on the resources
observedGeneration: 1
reason: OverriddenSucceeded
status: "True"
type: Overridden
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: All of the works are synchronized to the latest
observedGeneration: 1
reason: AllWorkSynced
status: "True"
type: WorkSynchronized
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: All corresponding work objects are applied
observedGeneration: 1
reason: AllWorkHaveBeenApplied
status: "True"
type: Applied
- lastTransitionTime: "2024-05-07T08:06:27Z"
message: The availability of work object crp-1-work is not trackable
observedGeneration: 1
reason: WorkNotTrackable
status: "True"
type: Available
...
applicableClusterResourceOverrides in placementStatuses indicates which ClusterResourceOverrideSnapshot is applied to the target cluster. Similarly, applicableResourceOverrides will be set if a ResourceOverrideSnapshot is applied.
9 - Staged Update
While users rely on the RollingUpdate rollout strategy to safely roll out their workloads, there is also a need for a staged rollout mechanism at the cluster level to enable more controlled and systematic continuous delivery (CD) across the fleet.
The staged update run feature addresses this need by enabling gradual deployments, reducing risk, and ensuring greater reliability and consistency in workload updates across clusters.
Overview
We introduce two new Custom Resources, ClusterStagedUpdateStrategy
and ClusterStagedUpdateRun
.
ClusterStagedUpdateStrategy
defines a reusable orchestration pattern that organizes member clusters into distinct stages, controlling both the rollout sequence within each stage and incorporating post-stage validation tasks that must succeed before proceeding to subsequent stages. For brevity, we’ll refer to ClusterStagedUpdateStrategy
as updateRun strategy throughout this document.
ClusterStagedUpdateRun
orchestrates resource deployment across clusters by executing a ClusterStagedUpdateStrategy
. It requires three key inputs: the target ClusterResourcePlacement
name, a resource snapshot index specifying the version to deploy, and the strategy name that defines the rollout rules. The term updateRun will be used to represent ClusterStagedUpdateRun
in this document.
Specify Rollout Strategy for ClusterResourcePlacement
While ClusterResourcePlacement uses RollingUpdate as its default strategy, switching to staged updates requires setting the rollout strategy to External:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
name: example-placement
spec:
resourceSelectors:
- group: ""
kind: Namespace
name: test-namespace
version: v1
policy:
placementType: PickAll
tolerations:
- key: gpu-workload
operator: Exists
strategy:
type: External # specify External here to use the stagedUpdateRun strategy.
Deploy a ClusterStagedUpdateStrategy
The ClusterStagedUpdateStrategy
custom resource enables users to organize member clusters into stages and define their rollout sequence. This strategy is reusable across multiple updateRuns, with each updateRun creating an immutable snapshot of the strategy at startup. This ensures that modifications to the strategy do not impact any in-progress updateRun executions.
An example ClusterStagedUpdateStrategy is shown below:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateStrategy
metadata:
name: example-strategy
spec:
stages:
- name: staging
labelSelector:
matchLabels:
environment: staging
afterStageTasks:
- type: TimedWait
waitTime: 1h
- name: canary
labelSelector:
matchLabels:
environment: canary
afterStageTasks:
- type: Approval
- name: production
labelSelector:
matchLabels:
environment: production
sortingLabelKey: order
afterStageTasks:
- type: Approval
- type: TimedWait
waitTime: 1h
ClusterStagedUpdateStrategy is a cluster-scoped resource. Its spec contains a list of stageConfig entries defining the configuration for each stage.
Stages execute sequentially in the order specified. Each stage must have a unique name and uses a labelSelector to identify the member clusters to update. In the above example, we define 3 stages: staging selecting member clusters labeled with environment: staging, canary selecting member clusters labeled with environment: canary, and production selecting member clusters labeled with environment: production.
Each stage can optionally specify sortingLabelKey and afterStageTasks. sortingLabelKey defines a label whose integer value determines the update sequence within a stage. In the above example, assuming 3 clusters are selected for the production stage (all 3 clusters have the environment: production label), the fleet admin can label them with order: 1, order: 2, and order: 3 respectively to control the rollout sequence. Without sortingLabelKey, clusters are updated in alphabetical order by name.
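For instance, assuming the member clusters are registered as MemberCluster objects named member-cluster-1 through member-cluster-3 (hypothetical names), the ordering labels can be applied on the hub cluster with plain kubectl commands:
# Assign an explicit update order to the production-stage clusters.
kubectl label membercluster member-cluster-1 order=1
kubectl label membercluster member-cluster-2 order=2
kubectl label membercluster member-cluster-3 order=3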
By default, the next stage begins immediately after the current stage completes. A user can control this cross-stage behavior by specifying afterStageTasks in each stage. These tasks execute after all clusters in a stage are updated successfully. We currently support two types of tasks: Approval and TimedWait. Each stage can include one task of each type (maximum of two tasks). All specified tasks must be satisfied before advancing to the next stage.
A TimedWait task requires a specified waitTime duration. The updateRun waits for the duration to pass before executing the next stage. For an Approval task, the controller automatically generates a ClusterApprovalRequest object named <updateRun name>-<stage name>. The name is also shown in the updateRun status. The ClusterApprovalRequest object is pretty simple:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterApprovalRequest
metadata:
name: example-run-canary
labels:
kubernetes-fleet.io/targetupdaterun: example-run
kubernetes-fleet.io/targetUpdatingStage: canary
kubernetes-fleet.io/isLatestUpdateRunApproval: "true"
spec:
parentStageRollout: example-run
targetStage: canary
The user then needs to manually approve the task by patching its status:
kubectl patch clusterapprovalrequests example-run-canary --type='merge' -p '{"status":{"conditions":[{"type":"Approved","status":"True","reason":"lgtm","message":"lgtm","lastTransitionTime":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","observedGeneration":1}]}}' --subresource=status
The updateRun will only continue to the next stage after the ClusterApprovalRequest is approved.
Trigger rollout with ClusterStagedUpdateRun
When using the External rollout strategy, a ClusterResourcePlacement begins deployment only when triggered by a ClusterStagedUpdateRun. An example ClusterStagedUpdateRun is shown below:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
name: example-run
spec:
placementName: example-placement
resourceSnapshotIndex: "0"
stagedRolloutStrategyName: example-strategy
This cluster-scoped resource requires three key parameters: the placementName
specifying the target ClusterResourcePlacement
, the resourceSnapshotIndex
identifying which version of resources to deploy (learn how to find resourceSnapshotIndex here), and the stagedRolloutStrategyName
indicating the ClusterStagedUpdateStrategy
to follow.
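As a quick sketch of how to find the index (the exact label names can vary by Fleet version, so treat this as an assumption and check your cluster), you can list the ClusterResourceSnapshot objects on the hub cluster and inspect their labels, which record the index for each snapshot:
# List resource snapshots and show their labels;
# the snapshot index appears as a label on each object.
kubectl get clusterresourcesnapshots --show-labels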
An updateRun executes in two phases. During the initialization phase, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted ClusterResourceBindings
, generates the cluster update sequence, and records all this information in the updateRun status.
In the execution phase, the controller processes each stage sequentially, updates the clusters within each stage one at a time, and enforces completion of the after-stage tasks. It then executes a final delete stage to clean up resources from unscheduled clusters. The updateRun succeeds when all stages complete successfully. However, it will fail if any execution-affecting event occurs, for example, the target ClusterResourcePlacement being deleted or a member cluster change triggering new scheduling. In such cases, error details are recorded in the updateRun status. Remember that once initialized, an updateRun operates on its strategy snapshot, making it immune to subsequent strategy modifications.
Understand ClusterStagedUpdateRun status
Let’s take a deep look into the status of a completed ClusterStagedUpdateRun. It displays details about the rollout status for every cluster and stage.
$ kubectl describe csur example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:21:39Z
Message: ClusterStagedUpdateRun initialized successfully
Observed Generation: 1
Reason: UpdateRunInitializedSuccessfully
Status: True
Type: Initialized
Last Transition Time: 2025-03-12T23:21:39Z
Message:
Observed Generation: 1
Reason: UpdateRunStarted
Status: True
Type: Progressing
Last Transition Time: 2025-03-12T23:26:15Z
Message:
Observed Generation: 1
Reason: UpdateRunSucceeded
Status: True
Type: Succeeded
Deletion Stage Status:
Clusters:
Conditions:
Last Transition Time: 2025-03-12T23:26:15Z
Message:
Observed Generation: 1
Reason: StageUpdatingStarted
Status: True
Type: Progressing
Last Transition Time: 2025-03-12T23:26:15Z
Message:
Observed Generation: 1
Reason: StageUpdatingSucceeded
Status: True
Type: Succeeded
End Time: 2025-03-12T23:26:15Z
Stage Name: kubernetes-fleet.io/deleteStage
Start Time: 2025-03-12T23:26:15Z
Policy Observed Cluster Count: 2
Policy Snapshot Index Used: 0
Staged Update Strategy Snapshot:
Stages:
After Stage Tasks:
Type: Approval
Wait Time: 0s
Type: TimedWait
Wait Time: 1m0s
Label Selector:
Match Labels:
Environment: staging
Name: staging
After Stage Tasks:
Type: Approval
Wait Time: 0s
Label Selector:
Match Labels:
Environment: canary
Name: canary
Sorting Label Key: name
After Stage Tasks:
Type: TimedWait
Wait Time: 1m0s
Type: Approval
Wait Time: 0s
Label Selector:
Match Labels:
Environment: production
Name: production
Sorting Label Key: order
Stages Status:
After Stage Task Status:
Approval Request Name: example-run-staging
Conditions:
Last Transition Time: 2025-03-12T23:21:54Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestCreated
Status: True
Type: ApprovalRequestCreated
Last Transition Time: 2025-03-12T23:22:55Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestApproved
Status: True
Type: ApprovalRequestApproved
Type: Approval
Conditions:
Last Transition Time: 2025-03-12T23:22:54Z
Message:
Observed Generation: 1
Reason: AfterStageTaskWaitTimeElapsed
Status: True
Type: WaitTimeElapsed
Type: TimedWait
Clusters:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:21:39Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingStarted
Status: True
Type: Started
Last Transition Time: 2025-03-12T23:21:54Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingSucceeded
Status: True
Type: Succeeded
Conditions:
Last Transition Time: 2025-03-12T23:21:54Z
Message:
Observed Generation: 1
Reason: StageUpdatingWaiting
Status: False
Type: Progressing
Last Transition Time: 2025-03-12T23:22:55Z
Message:
Observed Generation: 1
Reason: StageUpdatingSucceeded
Status: True
Type: Succeeded
End Time: 2025-03-12T23:22:55Z
Stage Name: staging
Start Time: 2025-03-12T23:21:39Z
After Stage Task Status:
Approval Request Name: example-run-canary
Conditions:
Last Transition Time: 2025-03-12T23:23:10Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestCreated
Status: True
Type: ApprovalRequestCreated
Last Transition Time: 2025-03-12T23:25:15Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestApproved
Status: True
Type: ApprovalRequestApproved
Type: Approval
Clusters:
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:22:55Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingStarted
Status: True
Type: Started
Last Transition Time: 2025-03-12T23:23:10Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingSucceeded
Status: True
Type: Succeeded
Conditions:
Last Transition Time: 2025-03-12T23:23:10Z
Message:
Observed Generation: 1
Reason: StageUpdatingWaiting
Status: False
Type: Progressing
Last Transition Time: 2025-03-12T23:25:15Z
Message:
Observed Generation: 1
Reason: StageUpdatingSucceeded
Status: True
Type: Succeeded
End Time: 2025-03-12T23:25:15Z
Stage Name: canary
Start Time: 2025-03-12T23:22:55Z
After Stage Task Status:
Conditions:
Last Transition Time: 2025-03-12T23:26:15Z
Message:
Observed Generation: 1
Reason: AfterStageTaskWaitTimeElapsed
Status: True
Type: WaitTimeElapsed
Type: TimedWait
Approval Request Name: example-run-production
Conditions:
Last Transition Time: 2025-03-12T23:25:15Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestCreated
Status: True
Type: ApprovalRequestCreated
Last Transition Time: 2025-03-12T23:25:25Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestApproved
Status: True
Type: ApprovalRequestApproved
Type: Approval
Clusters:
Conditions:
Last Transition Time: 2025-03-12T23:25:15Z
Message:
Observed Generation: 1
Reason: StageUpdatingWaiting
Status: False
Type: Progressing
Last Transition Time: 2025-03-12T23:26:15Z
Message:
Observed Generation: 1
Reason: StageUpdatingSucceeded
Status: True
Type: Succeeded
End Time: 2025-03-12T23:26:15Z
Stage Name: production
Events: <none>
UpdateRun overall status
At the very top, Status.Conditions gives the overall status of the updateRun. The execution of an updateRun consists of two phases: initialization and execution.
During initialization, the controller performs a one-time setup where it captures a snapshot of the updateRun strategy, collects scheduled and to-be-deleted ClusterResourceBindings, generates the cluster update sequence, and records all this information in the updateRun status.
The UpdateRunInitializedSuccessfully
condition indicates the initialization is successful.
After initialization, the controller starts executing the updateRun. The UpdateRunStarted
condition indicates the execution has started.
After all clusters are updated, all after-stage tasks are completed, and thus all stages are finished, the UpdateRunSucceeded
condition is set to True
, indicating the updateRun has succeeded.
Fields recorded in the updateRun status during initialization
During initialization, the controller records the following fields in the updateRun status:
- PolicySnapshotIndexUsed: the index of the policy snapshot used for the updateRun; it should be the latest one.
- PolicyObservedClusterCount: the number of clusters selected by the scheduling policy.
- StagedUpdateStrategySnapshot: the snapshot of the updateRun strategy, which ensures that strategy changes will not affect executing updateRuns.
Stages and clusters status
The Stages Status
section displays the status of each stage and cluster. As shown in the strategy snapshot, the updateRun has three stages: staging
, canary
, and production
. During initialization, the controller generates the rollout plan, classifies the scheduled clusters into these three stages, and records the plan in the updateRun status. As the execution progresses, the controller updates the status of each stage and cluster. Take the staging stage as an example: member1 is included in this stage. The ClusterUpdatingStarted condition indicates the cluster is being updated, and the ClusterUpdatingSucceeded condition shows the cluster has been updated successfully.
After all clusters are updated in a stage, the controller executes the specified after-stage tasks. Stage staging
has two after-stage tasks: Approval
and TimedWait
. The Approval
task requires the admin to manually approve a ClusterApprovalRequest
generated by the controller. The name of the ClusterApprovalRequest
is also included in the status, which is example-run-staging
. AfterStageTaskApprovalRequestCreated
condition indicates the approval request is created and AfterStageTaskApprovalRequestApproved
condition indicates the approval request has been approved. The TimedWait
task suspends the rollout until the specified wait time has elapsed; in this case, the wait time is 1 minute. The AfterStageTaskWaitTimeElapsed condition indicates the wait time has elapsed and the rollout can proceed to the next stage.
Each stage also has its own conditions. When a stage starts, the Progressing
condition is set to True
. When all the cluster updates complete, the Progressing
condition is set to False
with reason StageUpdatingWaiting
as shown above, meaning the stage is waiting for the after-stage tasks to pass.
The lastTransitionTime of the Progressing condition therefore also serves as the start time of the wait when there is a TimedWait task. When all after-stage tasks pass, the Succeeded
condition is set to True
. Each stage status also has Start Time
and End Time
fields, making it easier to read.
There’s also a Deletion Stage Status
section, which displays the status of the deletion stage. The deletion stage is the last stage of the updateRun. It deletes resources from the unscheduled clusters. The status is pretty much the same as a normal update stage, except that there are no after-stage tasks.
Note that all these conditions have lastTransitionTime set to the time when the controller updates the status. This helps with debugging and checking the progress of the updateRun.
Relationship between ClusterStagedUpdateRun and ClusterResourcePlacement
A ClusterStagedUpdateRun
serves as the trigger mechanism for rolling out a ClusterResourcePlacement
. The key points of this relationship are:
- The
ClusterResourcePlacement
remains in a scheduled state without being deployed until a correspondingClusterStagedUpdateRun
is created. - During rollout, the
ClusterResourcePlacement
status is continuously updated with detailed information from each target cluster. - While a
ClusterStagedUpdateRun
only indicates whether updates have started and completed for each member cluster (as described in previous section), theClusterResourcePlacement
provides comprehensive details including:- Success/failure of resource creation
- Application of overrides
- Specific error messages
For example, below is the status of an in-progress ClusterStagedUpdateRun
:
kubectl describe csur example-run
Name: example-run
...
Status:
Conditions:
Last Transition Time: 2025-03-17T21:37:14Z
Message: ClusterStagedUpdateRun initialized successfully
Observed Generation: 1
Reason: UpdateRunInitializedSuccessfully
Status: True
Type: Initialized
Last Transition Time: 2025-03-17T21:37:14Z
Message:
Observed Generation: 1
Reason: UpdateRunStarted # updateRun started
Status: True
Type: Progressing
...
Stages Status:
After Stage Task Status:
Approval Request Name: example-run-staging
Conditions:
Last Transition Time: 2025-03-17T21:37:29Z
Message:
Observed Generation: 1
Reason: AfterStageTaskApprovalRequestCreated
Status: True
Type: ApprovalRequestCreated
Type: Approval
Conditions:
Last Transition Time: 2025-03-17T21:38:29Z
Message:
Observed Generation: 1
Reason: AfterStageTaskWaitTimeElapsed
Status: True
Type: WaitTimeElapsed
Type: TimedWait
Clusters:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-17T21:37:14Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingStarted
Status: True
Type: Started
Last Transition Time: 2025-03-17T21:37:29Z
Message:
Observed Generation: 1
Reason: ClusterUpdatingSucceeded # member1 has updated successfully
Status: True
Type: Succeeded
Conditions:
Last Transition Time: 2025-03-17T21:37:29Z
Message:
Observed Generation: 1
Reason: StageUpdatingWaiting # waiting for approval
Status: False
Type: Progressing
Stage Name: staging
Start Time: 2025-03-17T21:37:14Z
After Stage Task Status:
Approval Request Name: example-run-canary
Type: Approval
Clusters:
Cluster Name: member2
Stage Name: canary
After Stage Task Status:
Type: TimedWait
Approval Request Name: example-run-production
Type: Approval
Clusters:
Stage Name: production
...
In the above status, member1 in stage staging has been updated successfully. The stage is waiting for approval to proceed to the next stage, and member2 in stage canary has not been updated yet.
Let’s take a look at the status of the ClusterResourcePlacement
example-placement
:
kubectl describe crp example-placement
Name: example-placement
...
Status:
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: found all cluster needed as specified by the scheduling policy, found 2 cluster(s)
Observed Generation: 1
Reason: SchedulingPolicyFulfilled
Status: True
Type: ClusterResourcePlacementScheduled
Last Transition Time: 2025-03-13T07:35:25Z
Message: There are still 1 cluster(s) in the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown
Status: Unknown
Type: ClusterResourcePlacementRolloutStarted
Observed Resource Index: 5
Placement Statuses:
Cluster Name: member1
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member1" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-17T21:37:14Z
Message: Detected the new changes on the resources and started the rollout process, resourceSnapshotIndex: 5, clusterStagedUpdateRun: example-run
Observed Generation: 1
Reason: RolloutStarted
Status: True
Type: RolloutStarted
Last Transition Time: 2025-03-17T21:37:14Z
Message: No override rules are configured for the selected resources
Observed Generation: 1
Reason: NoOverrideSpecified
Status: True
Type: Overridden
Last Transition Time: 2025-03-17T21:37:14Z
Message: All of the works are synchronized to the latest
Observed Generation: 1
Reason: AllWorkSynced
Status: True
Type: WorkSynchronized
Last Transition Time: 2025-03-17T21:37:14Z
Message: All corresponding work objects are applied
Observed Generation: 1
Reason: AllWorkHaveBeenApplied
Status: True
Type: Applied
Last Transition Time: 2025-03-17T21:37:14Z
Message: All corresponding work objects are available
Observed Generation: 1
Reason: AllWorkAreAvailable # member1 is all good
Status: True
Type: Available
Cluster Name: member2
Conditions:
Last Transition Time: 2025-03-12T23:01:32Z
Message: Successfully scheduled resources for placement in "member2" (affinity score: 0, topology spread score: 0): picked by scheduling policy
Observed Generation: 1
Reason: Scheduled
Status: True
Type: Scheduled
Last Transition Time: 2025-03-13T07:35:25Z
Message: In the process of deciding whether to roll out the latest resources or not
Observed Generation: 1
Reason: RolloutStartedUnknown # member2 is not updated yet
Status: Unknown
Type: RolloutStarted
...
In the Placement Statuses
section, we can see the status of each member cluster. For member1, the RolloutStarted
condition is set to True
, indicating the rollout has started.
In the condition message, we print the ClusterStagedUpdateRun name, which is example-run. This indicates that the most recent cluster update was triggered by example-run.
It also displays the detailed update status: the works are synced, applied, and detected as available. In comparison, member2 is still only in the Scheduled state.
When troubleshooting a stalled updateRun, examining the ClusterResourcePlacement
status offers valuable diagnostic information that can help identify the root cause.
For comprehensive troubleshooting steps, refer to the troubleshooting guide.
Concurrent updateRuns
Multiple concurrent ClusterStagedUpdateRun
s can be created for the same ClusterResourcePlacement
, allowing fleet administrators to pipeline the rollout of different resource versions. However, to maintain consistency across the fleet and prevent member clusters from running different resource versions simultaneously, we enforce a key constraint: all concurrent ClusterStagedUpdateRun
s must use identical ClusterStagedUpdateStrategy
settings.
This strategy consistency requirement is validated during the initialization phase of each updateRun. This validation ensures predictable rollout behavior and prevents configuration drift across your cluster fleet, even when multiple updates are in progress.
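For example, a second updateRun that rolls out a newer resource snapshot for the same placement can be created while example-run is still in progress, as long as it references the same strategy; the name and snapshot index below are illustrative:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterStagedUpdateRun
metadata:
  name: example-run-2                 # a second, concurrent updateRun (hypothetical name)
spec:
  placementName: example-placement
  resourceSnapshotIndex: "1"          # an illustrative, newer snapshot index
  stagedRolloutStrategyName: example-strategy   # must match the strategy used by example-run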
Next Steps
- Learn how to rollout and rollback CRP resources with Staged Update Run
- Learn how to troubleshoot a Staged Update Run
10 - Eviction and Placement Disruption Budget
This document explains the concept of Eviction
and Placement Disruption Budget
in the context of the fleet.
Overview
Eviction provides a way to forcibly remove resources from a target cluster once the resources have already been propagated from the hub cluster by a Placement object.
Eviction is considered a voluntary disruption triggered by the user. Eviction alone doesn’t guarantee that resources won’t be propagated to the target cluster again by the scheduler.
Users need to use taints in conjunction with Eviction to prevent the scheduler from picking the target cluster again.
The Placement Disruption Budget object protects against voluntary disruptions.
The only voluntary disruption that can occur in the fleet is the eviction of resources from a target cluster, which can be achieved by creating a ClusterResourcePlacementEviction object.
Some cases of involuntary disruptions in the context of fleet are:
- The removal of resources from a member cluster by the scheduler due to scheduling policy changes.
- Users manually deleting workload resources running on a member cluster.
- Users manually deleting the ClusterResourceBinding object, which is an internal resource that represents the placement of resources on a member cluster.
- Workloads failing to run properly on a member cluster due to misconfiguration or cluster-related issues.
The Placement Disruption Budget object does not protect against any of the involuntary disruptions described above.
ClusterResourcePlacementEviction
An eviction object is used to remove resources from a member cluster once the resources have already been propagated from the hub cluster.
The eviction object is only reconciled once, after which it reaches a terminal state. The terminal states for ClusterResourcePlacementEviction are:
- ClusterResourcePlacementEviction is valid and it’s executed successfully.
- ClusterResourcePlacementEviction is invalid.
- ClusterResourcePlacementEviction is valid but it’s not executed.
To successfully evict resources from a cluster, the user needs to specify:
- The name of the ClusterResourcePlacement object that propagated resources to the target cluster.
- The name of the target cluster from which we need to evict resources.
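For example, a minimal ClusterResourcePlacementEviction might look like the following sketch; the API version and the placementName/clusterName field names follow the description above but are assumptions, so verify them against the CRD installed in your hub cluster:
apiVersion: placement.kubernetes-fleet.io/v1beta1   # assumed API version; verify against your installed CRD
kind: ClusterResourcePlacementEviction
metadata:
  name: example-eviction                            # hypothetical eviction name
spec:
  placementName: crp-example                        # the CRP that propagated the resources
  clusterName: kind-cluster-1                       # the member cluster to evict resources from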
When specifying the ClusterResourcePlacement
object in the eviction’s spec, the user needs to consider the following cases:
- For
PickFixed
CRP, eviction is not allowed; it is recommended that one directly edit the list of target clusters on the CRP object. - For
PickAll
&PickN
CRPs, eviction is allowed because the users cannot deterministically pick or unpick a cluster based on the placement strategy; it’s up to the scheduler.
Note: After an eviction is executed, there is no guarantee that the cluster won’t be picked again by the scheduler to propagate resources for a
ClusterResourcePlacement
resource. The user needs to specify a taint on the cluster to prevent the scheduler from picking the cluster again. This is especially true forPickAll ClusterResourcePlacement
because the scheduler will try to propagate resources to all the clusters in the fleet.
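As a sketch of that pattern, a taint with the NoSchedule effect can be added to the evicted cluster's MemberCluster spec on the hub cluster; only the relevant portion of the spec is shown, and the taint key and value are illustrative:
spec:
  # other MemberCluster spec fields (e.g., identity) are omitted in this sketch
  taints:
    - key: evicted-workloads     # illustrative taint key
      value: "true"              # illustrative taint value
      effect: NoSchedule         # prevents the scheduler from picking this cluster again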
ClusterResourcePlacementDisruptionBudget
The ClusterResourcePlacementDisruptionBudget
is used to protect resources propagated by a ClusterResourcePlacement
to a target cluster from voluntary disruption, i.e., ClusterResourcePlacementEviction
.
Note: When specifying a
ClusterResourcePlacementDisruptionBudget
, the name should be the same as theClusterResourcePlacement
that it’s trying to protect.
Users are allowed to specify one of two fields in the ClusterResourcePlacementDisruptionBudget
spec since they are mutually exclusive:
- MaxUnavailable - specifies the maximum number of clusters in which a placement can be unavailable due to any form of disruptions.
- MinAvailable - specifies the minimum number of clusters in which placements are available despite any form of disruptions.
For both MaxUnavailable and MinAvailable, the user can specify the number of clusters as an integer or as a percentage of the total number of clusters in the fleet.
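For instance, a minimal ClusterResourcePlacementDisruptionBudget protecting a CRP named crp-example might look like the sketch below; the API version is an assumption, and only one of the two fields may be set:
apiVersion: placement.kubernetes-fleet.io/v1beta1   # assumed API version; verify against your installed CRD
kind: ClusterResourcePlacementDisruptionBudget
metadata:
  name: crp-example            # must match the name of the ClusterResourcePlacement it protects
spec:
  minAvailable: 2              # an integer; a percentage such as "50%" is also accepted for PickN CRPs
  # maxUnavailable: 1          # mutually exclusive with minAvailable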
Note: For both MaxUnavailable and MinAvailable, involuntary disruptions are not subject to the disruption budget but will still count against it.
When specifying a disruption budget for a particular ClusterResourcePlacement
, the user needs to consider the following cases:
| CRP type | MinAvailable DB with an integer | MinAvailable DB with a percentage | MaxUnavailable DB with an integer | MaxUnavailable DB with a percentage |
|---|---|---|---|---|
| PickFixed | ❌ | ❌ | ❌ | ❌ |
| PickAll | ✅ | ❌ | ❌ | ❌ |
| PickN | ✅ | ✅ | ✅ | ✅ |
Note: We don’t allow eviction for PickFixed CRPs, so specifying a ClusterResourcePlacementDisruptionBudget for a PickFixed CRP does nothing. For a PickAll CRP, the user can only specify MinAvailable because the total number of clusters selected by a PickAll CRP is non-deterministic. If the user creates an invalid ClusterResourcePlacementDisruptionBudget object, an eviction that is then created won’t be successfully executed.