Scheduler

Concepts of the Fleet scheduler

The scheduler component is a vital part of Fleet workload scheduling. Its primary responsibility is to make scheduling decisions for a bundle of resources based on the latest ClusterSchedulingPolicySnapshot generated by the ClusterResourcePlacement. By default, the scheduler operates in batch mode, which enhances performance: whenever possible, it binds a bundle of resources from a ClusterResourcePlacement to multiple clusters in a single run, producing one ClusterResourceBinding per selected cluster.
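For reference, here is a trimmed sketch of what the latest ClusterSchedulingPolicySnapshot for a ClusterResourcePlacement might look like. The snapshot name and the parent-CRP label mirror the ClusterResourceBinding sample later on this page; the other labels and the policy contents are illustrative assumptions rather than authoritative output:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterSchedulingPolicySnapshot
metadata:
  labels:
    # Ties the snapshot back to its owning ClusterResourcePlacement.
    kubernetes-fleet.io/parent-CRP: crp-1
    # Assumed marker for the latest snapshot, which the scheduler consumes.
    kubernetes-fleet.io/is-latest-snapshot: "true"
    # Assumed label tracking the PickN target count; per the scale-out note
    # later on this page, this is updated in place rather than creating a
    # new snapshot.
    kubernetes-fleet.io/number-of-clusters: "3"
  name: crp-1-1
spec:
  # A copy of the scheduling policy taken from the ClusterResourcePlacement.
  policy:
    placementType: PickN
  # Hash of the policy, used to detect policy changes (placeholder value).
  policyHash: <hash-of-the-policy>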

Batch in nature

Scheduling resources within a ClusterResourcePlacement involves more dependencies than scheduling pods within a deployment in Kubernetes. There are two notable distinctions:

  1. In a ClusterResourcePlacement, multiple replicas of resources cannot be scheduled on the same cluster, whereas pods belonging to the same deployment in Kubernetes can run on the same node.
  2. The ClusterResourcePlacement supports different placement types within a single object.

These requirements necessitate treating the scheduling policy as a whole and feeding it to the scheduler, as opposed to handling individual pods as Kubernetes does today. Specifically:

  1. Scheduling the entire ClusterResourcePlacement at once enables us to increase the parallelism of the scheduler if needed.
  2. Supporting the PickAll mode would otherwise require generating a replica for each cluster in the fleet to schedule. This approach is not only inefficient but can also result in the scheduler repeatedly attempting to schedule unassigned replicas when there is no possibility of placing them.
  3. To support the PickN mode, the scheduler would need to compute the filtering and scoring for each replica. Conversely, in batch mode, these calculations are performed once: the scheduler sorts all eligible clusters and picks the top N (see the policy sketch after this list).
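As a rough illustration of the two placement types, here is a minimal sketch of a ClusterResourcePlacement for each mode. The object names and the resource selector are hypothetical:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: crp-pickall   # hypothetical name
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-ns   # hypothetical namespace
  policy:
    # PickAll: place the selected resources on every eligible cluster.
    placementType: PickAll
---
apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: crp-pickn   # hypothetical name
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-ns
  policy:
    # PickN: filter and score the eligible clusters once, sort them,
    # and pick the top N (here, 3).
    placementType: PickN
    numberOfClusters: 3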

Placement Decisions

The output of the scheduler is an array of ClusterResourceBindings on the hub cluster.

ClusterResourceBinding sample:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourceBinding
metadata:
  annotations:
    kubernetes-fleet.io/previous-binding-state: Bound
  creationTimestamp: "2023-11-06T09:53:11Z"
  finalizers:
  - kubernetes-fleet.io/work-cleanup
  generation: 8
  labels:
    kubernetes-fleet.io/parent-CRP: crp-1
  name: crp-1-aks-member-1-2f8fe606
  resourceVersion: "1641949"
  uid: 3a443dec-a5ad-4c15-9c6d-05727b9e1d15
spec:
  # The scheduler's decision for this cluster, including the score it earned.
  clusterDecision:
    clusterName: aks-member-1
    clusterScore:
      affinityScore: 0
      priorityScore: 0
    reason: picked by scheduling policy
    selected: true
  # The resource snapshot and scheduling policy snapshot this binding was made against.
  resourceSnapshotName: crp-1-4-snapshot
  schedulingPolicySnapshotName: crp-1-1
  # One of Scheduled, Bound, or Unscheduled (explained below).
  state: Bound
  targetCluster: aks-member-1
status:
  conditions:
  - lastTransitionTime: "2023-11-06T09:53:11Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkSynced
    status: "True"
    type: Bound
  - lastTransitionTime: "2023-11-10T08:23:38Z"
    message: ""
    observedGeneration: 8
    reason: AllWorkHasBeenApplied
    status: "True"
    type: Applied

A ClusterResourceBinding can be in one of three states:

  • Scheduled: It indicates that the scheduler has selected this cluster for placing the resources. The resources are waiting to be picked up by the rollout controller.
  • Bound: It indicates that the rollout controller has initiated the placement of resources on the target cluster. The resources are actively being deployed.
  • Unscheduled: It indicates that the target cluster is no longer selected by the scheduler for placement. The resources associated with this cluster are in the process of being removed and are awaiting deletion from the cluster.

The scheduler generates scheduling decisions by creating new bindings in the “scheduled” state and by marking removed bindings as “unscheduled”. A separate rollout controller is responsible for executing these decisions based on the defined rollout strategy.
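To make these transitions concrete, here is an illustrative sketch of the fields a binding might carry in each state. The field names follow the sample above, while the target cluster is hypothetical:

# 1. Created by the scheduler; waiting to be picked up by the rollout controller.
spec:
  state: Scheduled
  targetCluster: aks-member-2   # hypothetical cluster
---
# 2. The rollout controller has initiated placement; resources are being deployed.
spec:
  state: Bound
  targetCluster: aks-member-2
---
# 3. The cluster is no longer selected; resources are awaiting removal.
metadata:
  annotations:
    # The prior state is preserved, as in the annotation on the sample above.
    kubernetes-fleet.io/previous-binding-state: Bound
spec:
  state: Unscheduled
  targetCluster: aks-member-2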

Enforcing the semantics of “IgnoreDuringExecutionTime”

The ClusterResourcePlacement enforces the semantics of “IgnoreDuringExecutionTime” to prioritize the stability of resources running in production: resources are not moved or rescheduled without explicit changes to the scheduling policy.
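The term echoes the requiredDuringSchedulingIgnoredDuringExecution affinity naming in Kubernetes. As a hedged sketch, a policy using the analogous cluster affinity might look like the following; the field names follow the Fleet placement API, and the label is made up:

spec:
  policy:
    placementType: PickAll
    affinity:
      clusterAffinity:
        # "IgnoredDuringExecution": once resources are placed, later drift in
        # a cluster's labels does not move them; only a policy change does.
        requiredDuringSchedulingIgnoredDuringExecution:
          clusterSelectorTerms:
            - labelSelector:
                matchLabels:
                  environment: production   # hypothetical label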

Here are some high-level guidelines outlining the actions that trigger scheduling and corresponding behavior:

  1. Policy changes trigger scheduling:

    • The scheduler makes the placement decisions based on the latest ClusterSchedulingPolicySnapshot.
    • When the change is only a scale-out operation (the NumberOfClusters of a PickN policy is increased), the ClusterResourcePlacement controller updates the label of the existing ClusterSchedulingPolicySnapshot instead of creating a new one, so that the scheduler won’t move any resources that are already scheduled and will only fulfill the new requirement (see the scale-out sketch after this list).
  2. The following cluster changes trigger scheduling:

    • a cluster, originally ineligible for resource placement for some reason, becomes eligible, such as:
      • the cluster settings change, specifically when the MemberCluster labels have changed
      • an unexpected deployment issue which originally led the scheduler to discard the cluster (for example, agents not joining, networking issues, etc.) has been resolved
    • a cluster, originally eligible for resource placement, leaves the fleet and becomes ineligible

    Note: The scheduler only places the resources on the newly eligible clusters and does not touch the existing placements.

  3. Resource-only changes do not trigger scheduling, including:

    • The ResourceSelectors field is updated in the ClusterResourcePlacement spec.
    • The selected resources are updated without directly affecting the ClusterResourcePlacement.
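For the scale-out case in guideline 1, here is a minimal sketch of the only change required; the object names are hypothetical, and only numberOfClusters is modified:

apiVersion: placement.kubernetes-fleet.io/v1beta1
kind: ClusterResourcePlacement
metadata:
  name: crp-1
spec:
  resourceSelectors:
    - group: ""
      version: v1
      kind: Namespace
      name: test-ns   # hypothetical namespace
  policy:
    placementType: PickN
    # Increased from 3 to 5: a pure scale-out. The existing
    # ClusterSchedulingPolicySnapshot is relabeled instead of replaced,
    # so resources already scheduled are left untouched.
    numberOfClusters: 5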
