Gang scheduling solves a limitation of the native Kubernetes scheduler, which cannot schedule jobs on an All-or-Nothing basis.

Prerequisites

The Intelligent Computing Suite is installed.

Background

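Gang scheduling is a scheduling policy that guarantees a group of related tasks runs at the same time: when a job consists of multiple tasks, either all of them are scheduled or none of them are. A classic use case is distributed machine learning training. When a large model is trained, the data may be distributed across multiple nodes, each of which runs a replica of the model. These replicas must start training together so that parameter updates stay synchronized. As large and complex workloads become common on Kubernetes, the scheduling policy has to accommodate this pattern to avoid wasted resources and delays. Because the core Kubernetes scheduler does not support Gang scheduling by default, some workloads cannot be migrated to Kubernetes smoothly. To address this, the cloud container engine implements Gang scheduling on top of the scheduler framework, so the capability is readily available in the cloud container engine.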
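To make the resource-waste problem concrete, the following minimal sketch simulates a scheduler that places Pods one at a time with no awareness of jobs. The cluster size, the job names, and the `schedule_pod_by_pod` helper are hypothetical and exist only for illustration; they do not describe how any real scheduler is implemented.

```python
def schedule_pod_by_pod(free_gpus, jobs):
    """Place single-GPU workers one at a time, round-robin across jobs.

    jobs maps a job name to the number of workers it needs.
    Returns a dict of job name -> number of workers actually placed.
    """
    placed = {name: 0 for name in jobs}
    progress = True
    while free_gpus > 0 and progress:
        progress = False
        for name, needed in jobs.items():
            if free_gpus > 0 and placed[name] < needed:
                placed[name] += 1
                free_gpus -= 1
                progress = True
    return placed


# Hypothetical cluster: 4 free GPUs and two training jobs that need 3 workers each.
jobs = {"job-a": 3, "job-b": 3}
placed = schedule_pod_by_pod(4, jobs)

# Jobs that hold GPUs but cannot start because some workers are missing.
stuck = [name for name, count in placed.items() if count < jobs[name]]
print(placed)
print(stuck)
```

In this run each job ends up with two of its three workers, so all four GPUs are occupied while neither job can begin training. An All-or-Nothing policy would instead place one job completely and leave the other waiting.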

Feature overview

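To achieve All-or-Nothing behavior, the Pods that must be scheduled together are first marked as a group through annotations; this group is called a PodGroup. When a job is submitted, the scheduler reads the scheduling configuration from the workload's annotations and schedules accordingly. The Pods are scheduled together only when cluster resources can satisfy the job's minimum number of running tasks; otherwise the job stays in the Pending state.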
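The admission rule described above can be sketched in a few lines. This is a simplified model under stated assumptions, not the scheduler's actual code: the annotation key `example.com/pod-group`, the `gang_admit` function, and the single pool of free GPUs are all hypothetical.

```python
from collections import defaultdict

# Hypothetical annotation key used to mark group membership in this sketch.
GROUP_KEY = "example.com/pod-group"


def gang_admit(pods, min_member, free_gpus):
    """Bind a group only if at least min_member of its pods fit; otherwise bind none.

    pods: list of dicts with "name", "annotations", and "gpu" keys.
    min_member: dict of group name -> minimum number of pods that must run.
    Returns (bound_names, pending_names).
    """
    groups = defaultdict(list)
    for pod in pods:
        groups[pod["annotations"][GROUP_KEY]].append(pod)

    bound, pending = [], []
    for group, members in groups.items():
        fitting, remaining = [], free_gpus
        for pod in members:
            if pod["gpu"] <= remaining:
                fitting.append(pod["name"])
                remaining -= pod["gpu"]
        if len(fitting) >= min_member[group]:
            free_gpus = remaining  # commit the reservation for the whole group
            bound.extend(fitting)
            pending.extend(p["name"] for p in members if p["name"] not in fitting)
        else:
            pending.extend(p["name"] for p in members)  # nothing is bound
    return bound, pending


def make_pods(group, count):
    return [
        {"name": f"{group}-{i}", "annotations": {GROUP_KEY: group}, "gpu": 1}
        for i in range(count)
    ]


pods = make_pods("train-a", 2) + make_pods("train-b", 3)
bound, pending = gang_admit(pods, {"train-a": 2, "train-b": 3}, free_gpus=4)
print(bound)
print(pending)
```

With four free GPUs, `train-a` is bound as a whole, while `train-b` stays entirely pending even though two of its three Pods would fit, which keeps the remaining GPUs available to other jobs.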

Usage

The following example uses a Kubeflow TFJob to demonstrate Gang scheduling.

apiVersion: "kubeflow.org/v1"
kind: TFJob
metadata:
  name: gang-example
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          schedulerName: roc  # use the Intelligent Computing Suite scheduler
          containers:
            - name: tensorflow
              image: busybox:latest
              imagePullPolicy: IfNotPresent
              command: ["sleep", "5m"]
              resources:
                limits:
                  nvidia.com/gpu: 1

After the job is submitted to the cluster, the scheduling component automatically creates a PodGroup custom resource for it:

[root@pm-b86b yaml]# kubectl get pg
NAME           STATUS    MINMEMBER   RUNNINGS   AGE
gang-example   Running   2                      21s

[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:32:54Z"
  generation: 5
  name: gang-example
  namespace: default
  ownerReferences:
  ...
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:33:19Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: 2afbaf4b-5424-414c-b89a-a3416925b9b0
    type: Scheduled
  phase: Running
  running: 2

Key fields

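minMember: The minimum number of Pods or tasks that must run in the PodGroup. If cluster resources cannot run minMember tasks, the scheduler does not schedule any task in the PodGroup.
queue: The queue that the PodGroup belongs to. The queue must already exist and be in the open state.
priorityClassName: The priority of the PodGroup, which the scheduler uses to order all PodGroups in the queue. system-node-critical and system-cluster-critical are two reserved values that denote the highest priority. If no value is specified, the default priority or zero priority is used.
minResources: The minimum resources required to run the PodGroup. If the cluster's allocatable resources cannot satisfy minResources, the scheduler does not schedule any task in the PodGroup.
phase: The current state of the PodGroup.
conditions: The detailed status log of the PodGroup, which records the key events in its lifecycle.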
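As a rough illustration of how priorityClassName and minResources interact, the sketch below walks PodGroups from highest to lowest priority and admits those whose minimum resources still fit. The `PodGroup` dataclass, the numeric `priority` field (standing in for a value resolved from priorityClassName), and the `admissible` function are assumptions made for this example; queue handling is left out, and the real scheduler's logic may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PodGroup:
    name: str
    min_member: int
    priority: int = 0  # assumed numeric value resolved from priorityClassName
    min_resources: dict = field(default_factory=dict)


def admissible(groups, allocatable):
    """Return the names of groups admitted in priority order.

    A group is admitted only if every entry in its min_resources
    fits into what is still unreserved in allocatable.
    """
    remaining = dict(allocatable)
    admitted = []
    for group in sorted(groups, key=lambda g: g.priority, reverse=True):
        fits = all(
            remaining.get(resource, 0) >= quantity
            for resource, quantity in group.min_resources.items()
        )
        if fits:
            for resource, quantity in group.min_resources.items():
                remaining[resource] = remaining.get(resource, 0) - quantity
            admitted.append(group.name)
    return admitted


groups = [
    PodGroup("a", min_member=3, priority=0, min_resources={"nvidia.com/gpu": 3}),
    PodGroup("b", min_member=3, priority=100, min_resources={"nvidia.com/gpu": 3}),
    PodGroup("c", min_member=10, priority=50, min_resources={"nvidia.com/gpu": 10}),
]
admitted = admissible(groups, {"nvidia.com/gpu": 4})
print(admitted)
```

Here group `b` is considered first because of its higher priority and reserves three of the four GPUs; `c` and `a` then fail the minResources check and are not admitted.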

Check the running status

Because the cluster has enough resources for all of the job's Pods, the following command shows that the Pods are already running.

[root@pm-b86b yaml]# kubectl get po | grep gang
gang-example-worker-0       1/1     Running                  0             31s
gang-example-worker-1       1/1     Running                  0             31s

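If the cluster does not have enough resources to run all of the Pods, none of them is scheduled. You can inspect the scheduling status through the PodGroup.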

[root@pm-b86b yaml]# kubectl get pg gang-example -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gang-example","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":10,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:47:09Z"
  generation: 4
  name: gang-example
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gang-example
    uid: 8caecc94-7220-4bbc-bde2-6c94fe478a35
  resourceVersion: "40583543"
  uid: 69034c3f-3c51-4159-b8db-8a965a3838f7
spec:
  minMember: 10
  minResources:
    nvidia.com/gpu: "10"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:47:40Z"
    message: '10/0 tasks in gang unschedulable: pod group is not ready, 10 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 3359ff1a-d558-4148-949f-f3f53f501a4c
    type: Unschedulable
  phase: Pending
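When many PodGroups need to be checked, reading the phase and the most recent condition programmatically can be quicker than scanning the YAML. The sketch below is a hypothetical helper, not part of the suite: `summarize` works on a plain dictionary that mirrors the fields shown above, such as one parsed from the JSON form of the same kubectl output.

```python
def summarize(pod_group):
    """Return a one-line summary built from status.phase and the newest condition."""
    name = pod_group["metadata"]["name"]
    status = pod_group.get("status", {})
    phase = status.get("phase", "Unknown")
    conditions = status.get("conditions", [])
    # Timestamps in the same UTC format sort correctly as plain strings.
    latest = max(conditions, key=lambda c: c["lastTransitionTime"], default=None)
    if latest is None:
        return f"{name}: {phase}"
    return f"{name}: {phase} ({latest['type']}, reason: {latest['reason']})"


# Dictionary mirroring the relevant fields of the PodGroup shown above.
pod_group = {
    "metadata": {"name": "gang-example"},
    "spec": {"minMember": 10, "minResources": {"nvidia.com/gpu": "10"}},
    "status": {
        "conditions": [
            {
                "lastTransitionTime": "2024-04-14T03:47:40Z",
                "reason": "NotEnoughResources",
                "status": "True",
                "type": "Unschedulable",
            }
        ],
        "phase": "Pending",
    },
}
summary = summarize(pod_group)
print(summary)
```

For the PodGroup above, the summary reports the Pending phase together with the Unschedulable condition and its NotEnoughResources reason.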
