Gang Scheduling

(1) Check the running status

Because the cluster has enough resources to run all of the job's pods, the following command shows that the pods are already running.

```plaintext
[root@pmb86b yaml]# kubectl get po | grep gang
gangexampleworker0   1/1   Running   0   31s
gangexampleworker1   1/1   Running   0   31s
```

If the cluster does not have enough resources to run all of the pods, none of them are scheduled. You can check the scheduling status through the PodGroup.

```plaintext
[root@pmb86b yaml]# kubectl get pg gangexample -oyaml
apiVersion: scheduling.roc/v1beta1
kind: PodGroup
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubeflow.org/v1","kind":"TFJob","metadata":{"annotations":{},"name":"gangexample","namespace":"default"},"spec":{"tfReplicaSpecs":{"Worker":{"replicas":10,"restartPolicy":"OnFailure","template":{"spec":{"containers":[{"command":["sleep","5m"],"image":"busybox:latest","imagePullPolicy":"IfNotPresent","name":"tensorflow","resources":{"limits":{"nvidia.com/gpu":1}}}],"schedulerName":"roc"}}}}}}
  creationTimestamp: "2024-04-14T03:47:09Z"
  generation: 4
  name: gangexample
  namespace: default
  ownerReferences:
  - apiVersion: kubeflow.org/v1
    blockOwnerDeletion: true
    controller: true
    kind: TFJob
    name: gangexample
    uid: 8caecc94-7220-4bbc-bde2-6c94fe478a35
  resourceVersion: "40583543"
  uid: 69034c3f-3c51-4159-b8db-8a965a3838f7
spec:
  minMember: 10
  minResources:
    nvidia.com/gpu: "10"
status:
  conditions:
  - lastTransitionTime: "2024-04-14T03:47:40Z"
    message: '10/0 tasks in gang unschedulable: pod group is not ready, 10 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: 3359ff1a-d558-4148-949f-f3f53f501a4c
    type: Unschedulable
  phase: Pending
```

In this output, `phase` is `Pending` and the `Unschedulable` condition reports the reason `NotEnoughResources`: the PodGroup requires 10 members (`minMember: 10`) and 10 GPUs (`minResources`), so no pod is scheduled until the whole gang can be placed.
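When checking many jobs, reading the full PodGroup YAML by hand gets tedious. The following is a minimal sketch, not part of the product, that condenses a PodGroup object into a one-line scheduling summary. The function name `summarize_podgroup` and the `SAMPLE` dictionary are hypothetical; `SAMPLE` only copies the `spec` and `status` fields from the PodGroup output above.

```python
def summarize_podgroup(pg: dict) -> str:
    """Return a one-line scheduling summary for a PodGroup object (hypothetical helper)."""
    spec = pg.get("spec", {})
    status = pg.get("status", {})
    phase = status.get("phase", "Unknown")
    min_member = spec.get("minMember")

    # Look for an active Unschedulable condition, as in the example output.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Unschedulable" and cond.get("status") == "True":
            return (
                f"{phase}: unschedulable ({cond.get('reason')}), "
                f"minMember={min_member}: {cond.get('message')}"
            )
    return f"{phase}: no Unschedulable condition, minMember={min_member}"


# Sample data: spec and status trimmed from the PodGroup shown above.
SAMPLE = {
    "spec": {"minMember": 10, "minResources": {"nvidia.com/gpu": "10"}},
    "status": {
        "conditions": [
            {
                "message": "10/0 tasks in gang unschedulable: "
                           "pod group is not ready, 10 minAvailable",
                "reason": "NotEnoughResources",
                "status": "True",
                "type": "Unschedulable",
            }
        ],
        "phase": "Pending",
    },
}

if __name__ == "__main__":
    print(summarize_podgroup(SAMPLE))
```

In practice, the dictionary could be loaded with `json.load` from the output of `kubectl get pg gangexample -o json` instead of being hard-coded; this assumes the JSON output carries the same fields as the YAML shown above.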