Scheduler
The scheduler component is a vital element in Fleet workload scheduling. Its primary responsibility is to determine the schedule decision for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement.
By default, the scheduler operates in batch mode, which enhances performance. In this mode, it binds a ClusterResourcePlacement to multiple clusters in a single pass whenever possible, producing one ClusterResourceBinding per selected cluster.
Batch in nature
Scheduling resources within a ClusterResourcePlacement involves more dependencies than scheduling pods within a deployment in Kubernetes. There are two notable distinctions:
- In a ClusterResourcePlacement, multiple replicas of resources cannot be scheduled on the same cluster, whereas pods belonging to the same deployment in Kubernetes can run on the same node.
- The ClusterResourcePlacement supports different placement types within a single object.
These requirements necessitate treating the scheduling policy as a whole and feeding it to the scheduler, as opposed to handling individual pods the way Kubernetes does today. Specifically:
- Scheduling the entire ClusterResourcePlacement at once enables us to increase the parallelism of the scheduler if needed.
- Supporting the PickAll mode would otherwise require generating a replica for each cluster in the fleet and feeding it to the scheduler. This approach is not only inefficient but can also result in the scheduler repeatedly attempting to schedule unassigned replicas when there is no possibility of placing them.
- To support the PickN mode, the scheduler would need to compute the filtering and scoring for each replica. In batch mode, by contrast, these calculations are performed once: the scheduler sorts all the eligible clusters and picks the top N, as in the policy sketched below.
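For illustration, here is a minimal sketch of a ClusterResourcePlacement using the PickN placement type; the selected namespace name is hypothetical:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: crp-1
spec:
  resourceSelectors:
    # Select an entire (hypothetical) namespace for placement.
    - group: ""
      version: v1
      kind: Namespace
      name: test-app
  policy:
    # The scheduler evaluates this policy as a whole: it filters and scores
    # every eligible cluster once, then picks the 3 highest-scoring clusters.
    placementType: PickN
    numberOfClusters: 3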
Placement Decisions
The output of the scheduler is an array of ClusterResourceBindings on the hub cluster. A ClusterResourceBinding sample:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourceBinding
metadata:
  annotations:
    kubernetes-fleet.io/previous-binding-state: Bound
  creationTimestamp: "2023-11-06T09:53:11Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 8
  labels:
    kubernetes-fleet.io/parent-CRP: crp-1
  name: crp-1-aks-member-1-2f8fe606
  resourceVersion: "1641949"
  uid: 3a443dec-a5ad-4c15-9c6d-05727b9e1d15
spec:
  clusterDecision:
    clusterName: aks-member-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-1-4-snapshot
  schedulingPolicySnapshotName: crp-1-1
  state: Bound
  targetCluster: aks-member-1
status:
  conditions:
  - lastTransitionTime: "2023-11-06T09:53:11Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkSynced
    status: "True"
    type: Bound
  - lastTransitionTime: "2023-11-10T08:23:38Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkHasBeenApplied
    status: "True"
    type: Applied
ClusterResourceBinding can have three states:
- Scheduled: It indicates that the scheduler has selected this cluster for placing the resources. The resources are waiting to be picked up by the rollout controller.
- Bound: It indicates that the rollout controller has initiated the placement of resources on the target cluster. The resources are actively being deployed.
- Unscheduled: This state signifies that the target cluster is no longer selected by the scheduler for the placement. The resources associated with this cluster are in the process of being removed; they are awaiting deletion from the cluster.
The scheduler generates scheduling decisions by creating new bindings in the "Scheduled" state and by marking existing bindings as "Unscheduled" for removal, as sketched below. A separate rollout controller is responsible for executing these decisions based on the defined rollout strategy.
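For illustration, a freshly created binding might carry a spec like the following before the rollout controller picks it up; the name and target cluster are hypothetical, mirroring the sample above:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourceBinding
metadata:
  name: crp-1-aks-member-2-5c0a8d7b   # hypothetical generated name
  labels:
    kubernetes-fleet.io/parent-CRP: crp-1
spec:
  clusterDecision:
    clusterName: aks-member-2
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  resourceSnapshotName: crp-1-4-snapshot
  schedulingPolicySnapshotName: crp-1-1
  # Created in the Scheduled state; the rollout controller moves it to
  # Bound once it starts placing resources on the target cluster.
  state: Scheduled
  targetCluster: aks-member-2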
Enforcing the semantics of “IgnoreDuringExecutionTime”
The ClusterResourcePlacement enforces the semantics of "IgnoreDuringExecutionTime" to prioritize the stability of resources running in production. Therefore, the resources should not be moved or rescheduled without explicit changes to the scheduling policy.
Here are some high-level guidelines outlining the actions that trigger scheduling and the corresponding behavior:
Policy changes trigger scheduling:
- The scheduler makes the placement decisions based on the latest ClusterSchedulingPolicySnapshot.
- When it's just a scale-out operation (numberOfClusters of the PickN mode is increased), the ClusterResourcePlacement controller updates the label of the existing ClusterSchedulingPolicySnapshot instead of creating a new one, so that the scheduler won't move any existing resources that are already scheduled and will just fulfill the new requirement (see the sketch after this list).
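For example, a scale-out edit like the following (values hypothetical) only increases numberOfClusters; the existing ClusterSchedulingPolicySnapshot is relabeled rather than regenerated, so resources already placed stay where they are:
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: crp-1
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-app   # hypothetical namespace
  policy:
    placementType: PickN
    # Increased from 3 to 5: the scheduler only schedules the two additional
    # placements and leaves the three existing bindings untouched.
    numberOfClusters: 5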
The following cluster changes trigger scheduling:
- A cluster, originally ineligible for resource placement for some reason, becomes eligible, such as:
  - the cluster setting changes, specifically the MemberCluster labels have changed (see the sketch after this list);
  - an unexpected deployment which originally led the scheduler to discard the cluster (for example, agents not joining, networking issues, etc.) has been resolved.
- A cluster, originally eligible for resource placement, is leaving the fleet and becomes ineligible.
Note: The scheduler will only place the resources on the newly eligible cluster and won't touch the existing clusters.
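As a concrete example of the first case, adding a label to a MemberCluster can make it newly eligible for a placement whose policy selects on that label. A minimal sketch, assuming the Fleet cluster API group and a hypothetical label:
# Fragment of a MemberCluster; the environment label key/value are made up.
apiVersion: cluster.kubernetes-fleet.io/v1beta1
kind: MemberCluster
metadata:
  name: aks-member-3
  labels:
    # Newly added label: a ClusterResourcePlacement whose cluster affinity
    # requires environment=prod would now consider this cluster, which
    # triggers a new scheduling pass.
    environment: prod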
Resource-only changes do not trigger scheduling, including:
- ResourceSelectors is updated in the ClusterResourcePlacement spec.
- The selected resources are updated without directly affecting the ClusterResourcePlacement.
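For instance, appending another selector to resourceSelectors, as in the fragment below, changes which resources are propagated but leaves the placement decisions untouched (the resource names are hypothetical):
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-app
    # Newly appended selector: a resource-only change, so the scheduler does
    # not recompute placement decisions because of it.
    - group: rbac.authorization.k8s.io
      version: v1
      kind: ClusterRole
      name: test-app-reader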
What’s next
- Read about Scheduling Framework