Gang Scheduling

(1) Specify the intelligent-computing scheduler.

    containers:
    - name: tensorflow
      image: busybox:latest
      imagePullPolicy: IfNotPresent
      command: ["sleep", "30s"]
      resources:
        limits:
          nvidia.com/gpu: 1

After the job is submitted to the cluster, the scheduling component automatically creates a PodGroup custom resource object for it:

    [root@pmb86b yaml] kubectl get pg
    NAME STATUS MINMEMBER RUNNINGS AGE
    gangexample Running 2 21s

    [root@pmb86b yaml] kubectl get pg gangexample -oyaml
    apiVersion: scheduling.roc/v1beta1
    kind: PodGroup
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gangexample","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":2,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
      creationTimestamp: "2024-04-14T03:32:54Z"
      generation: 5
      name: gangexample
      namespace: default
      ownerReferences:
      ...
      - lastTransitionTime: "2024-04-14T03:33:19Z"
        reason: tasks in gang are ready to be scheduled
        status: "True"
        transitionID: 2afbaf4b5424414cb89aa3416925b9b0
        type: Scheduled
      phase: Running
      running: 2

Key fields

- minMember: the minimum number of pods or tasks that must run in this podgroup. If cluster resources cannot accommodate minMember tasks, the scheduler does not schedule any task in the podgroup.
- queue: the queue this podgroup belongs to. The queue must already exist and be in the open state.
- priorityClassName: the priority of this podgroup, which the scheduler uses to order all podgroups in the queue when scheduling. system-node-critical and system-cluster-critical are two reserved values that denote the highest priority. If no value is specified, the default priority or zero priority is used.
- minResources: the minimum resources required to run this podgroup. If the cluster's allocatable resources do not satisfy minResources, the scheduler does not schedule any task in the podgroup.
- phase: the current state of this podgroup.
- conditions: the detailed status log of this podgroup, which records the key events in its lifecycle.
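The all-or-nothing behavior described for minMember and minResources can be illustrated with a small sketch. This is not the scheduler's actual implementation. It assumes a simplified model in which the cluster is a single pool of free resources and per-node placement is ignored, and the names `PodGroup`, `fits`, and `gang_admit` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PodGroup:
    """Hypothetical, simplified stand-in for a PodGroup (illustration only)."""
    name: str
    min_member: int  # mirrors minMember
    min_resources: Dict[str, int] = field(default_factory=dict)  # mirrors minResources
    pod_requests: List[Dict[str, int]] = field(default_factory=list)  # one entry per pod


def fits(request: Dict[str, int], free: Dict[str, int]) -> bool:
    """True if every requested resource quantity is available in `free`."""
    return all(free.get(res, 0) >= qty for res, qty in request.items())


def gang_admit(pg: PodGroup, free: Dict[str, int]) -> List[int]:
    """Return the indices of pods to start, or [] if the gang cannot start.

    Assumption: `free` is the cluster-wide sum of allocatable resources.
    """
    # Gate 1: minResources must be satisfiable, otherwise nothing is scheduled.
    if not fits(pg.min_resources, free):
        return []

    # Tentatively place pods against a copy of the free pool.
    remaining = dict(free)
    placed: List[int] = []
    for index, request in enumerate(pg.pod_requests):
        if fits(request, remaining):
            for res, qty in request.items():
                remaining[res] = remaining.get(res, 0) - qty
            placed.append(index)

    # Gate 2: fewer than minMember placements means no pod is scheduled at all.
    return placed if len(placed) >= pg.min_member else []
```

With two workers that each request one `nvidia.com/gpu` and `min_member=2`, a pool with one free GPU yields an empty result, so neither pod starts. A pool with two free GPUs admits both pods together.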