1. Affinity Overview
In HPC (high-performance computing) scenarios, affinity is the ability to schedule compute tasks onto precisely matched hardware resources (CPUs, GPUs, etc.):
Homogeneous affinity: binding tasks to physical cores / processor sockets on same-class CPU architectures such as Intel, Kunpeng, and Hygon.
Heterogeneous affinity: flexibly binding tasks to a single GPU or to multiple GPUs on heterogeneous hardware such as NVIDIA GPUs (CUDA) and Hygon accelerators (ROCm).
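As a concrete picture of what core binding means in practice, the sketch below pins each Slurm task to its own cores and prints the CPU mask the kernel enforces. It is a minimal illustration using standard tooling (srun, taskset), not part of ctbatch; the partition name batch is borrowed from the examples later in this document.
#!/usr/bin/env bash
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
# With --cpu-bind=cores, each task is pinned to its own cores;
# taskset -pc prints the affinity list of the launched process.
srun --cpu-bind=cores bash -c 'echo "rank $SLURM_PROCID -> $(taskset -pc $$)"'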
2. ctbatch
Traditional HPC scheduling relies on a matched combination of scheduler (e.g. Slurm), compiler, and MPI library, and the adaptation rules differ across schedulers (Slurm/PBS) and hardware architectures, which makes job deployment slow and resource utilization poor.
ctbatch is 天翼云 (CTyun)'s HPC job-submission tool: an abstracted CLI that unifies affinity-scheduling logic across schedulers and hardware architectures.
$ ctbatch [-h] [-q QUEUE_NAME] [-J JOB_NAME] [-N NODE_NUM] [-w NODE_LIST]
[-ppn TASKS_PER_NODE] [-c CPUS_PER_TASK] [-g GPUS_PER_NODE] [-t MAX_TIME]
[--exclusive] [--mpi] [--gpu_bind] [--cpu_bind] [--env] [--user_defined_module] [--no_run]
[command]

| Parameter | Description |
|---|---|
| -q QUEUE_NAME, --queue QUEUE_NAME | Job queue name; required |
| -J JOB_NAME, --job_name JOB_NAME | Job name; at most 50 characters |
| -N NODE_NUM, --nodes NODE_NUM | Number of nodes; positive integer |
| -w NODE_LIST, --node_list NODE_LIST | Node list (comma-separated) |
| -ppn TASKS_PER_NODE, --tasks_per_node TASKS_PER_NODE | Tasks per node |
| -c CPUS_PER_TASK, --cpus_per_task CPUS_PER_TASK | CPU cores per task |
| -g GPUS_PER_NODE, --gpus_per_node GPUS_PER_NODE | GPUs per node |
| -t MAX_TIME, --max_time MAX_TIME | Maximum run time |
| --exclusive | Exclusive use of the allocated nodes |
| --mpi | MPI implementation to use: OpenMPI or OpenMPI-cuda-aware |
| --gpu_bind | Enable GPU affinity; MPI-to-GPU mapping modes: 11, 1n, n1 |
| --cpu_bind | Enable CPU affinity; currently only cores (see 3.4) |
| --env | User-defined environment variables |
| --user_defined_module | User-defined modules to load |
| --no_run | Only generate the job script, do not submit it |
| command | Program to run |
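As the examples in Section 3 show, the same options can also be read from a parameter file, one key=value entry per line, with keys mirroring the long option names (queue=, N=, ppn=, c=, g=, env=, mpi=, gpu_bind=, max_time=, and a bare exclusive). The equivalence sketched below is inferred from those samples rather than from a formal specification:
$ cat params.ctbatch
queue=batch
N=1
ppn=1
c=1
$ ctbatch params.ctbatch --no_run -- hostname
# behaves like: ctbatch -q batch -N 1 -ppn 1 -c 1 --no_run -- hostname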
3. Worked Examples
3.1 Basic Job
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=1
ppn=1
c=1
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test --exclusive -t 00:5:00 --no_run -- hostname
Final Slurm script
$ cat run_script_20250813171711.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
hostname
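With --no_run, ctbatch writes the job script without submitting it. On a standard Slurm installation the generated file can then be submitted and monitored by hand; a minimal illustration using the file name from the listing above:
$ sbatch run_script_20250813171711.sh.slurm
$ squeue -u $USER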
3.2 GPU Affinity & Ethernet
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=2
ppn=8
c=7
g=8
env=OMP_PROC_BIND=true;OMP_PLACES=cores
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test_affinity --exclusive -t 00:5:00 --mpi OpenMPI-cuda-aware --gpu_bind 11 --no_run -- test_app
Final Slurm script
$ cat run_script_20250818101338.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test_affinity
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
module load openmpi/5.0.8/gcc-8.5.0-cuda-12.8
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-1}
export SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
export OMPI_MCA_pml=ucx
export OMPI_MCA_pml_ucx_tls=tcp
export UCX_NET_DEVICES=bond0
export SLURM_MPI_TYPE=pmix_v5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores --gpu-bind=single:1 -m plane=$SLURM_NTASKS_PER_NODE test_app
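Before running a real workload, the rank-to-GPU mapping can be sanity-checked without a dedicated test binary. The hedged one-liner below (run from inside the same job script) relies on Slurm exporting a per-task CUDA_VISIBLE_DEVICES when --gpu-bind is in effect; with the 11 mapping, every rank should report a different single device:
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK \
  --cpu-bind=cores --gpu-bind=single:1 \
  bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'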
3.2.1 Test Runs
Mode 11: one task (MPI rank) maps to one GPU; set -ppn equal to -g.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 2 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind 11 -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: each rank is bound to one GPU.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (91258)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (91258)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (91275)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (91259)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (91259)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (91274)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0F:00.0 (default mode)
Mode 1n: one task (MPI rank) maps to n GPUs; set -ppn to a divisor of -g.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 4 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind 1n -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: each rank is bound to two GPUs.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (92566)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (92566)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (92598)
GPU: 2 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
device 1: 0000:0F:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (92567)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (92567)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (92599)
GPU: 2 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:1F:00.0 (default mode)
device 1: 0000:20:00.0 (default mode)
Mode n1: n tasks (MPI ranks) share one GPU; set -g to a divisor of -ppn.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 1 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind n1 -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: every two ranks share one GPU.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (93051)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (93051)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (93065)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (93052)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (93052)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (93064)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
3.3 GPU Affinity & IB
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=2
ppn=8
c=7
g=8
env=OMP_PROC_BIND=true;OMP_PLACES=cores
exclusive
gpu_bind=11
mpi=OpenMPI-cuda-aware
max_time=00:5:00
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test_affinity --no_run -- test_app
Final Slurm script
$ cat run_script_20250813222459.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test_affinity
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
module load openmpi/5.0.8/gcc-8.5.0-cuda-12.8
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-1}
export SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_ib0:1,mlx5_ib1:1,mlx5_ib2:1,mlx5_ib3:1
export SLURM_MPI_TYPE=pmix_v5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores --gpu-bind=single:1 -m plane=$SLURM_NTASKS_PER_NODE test_app
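Compared with the Ethernet variant in 3.2, the material difference is the UCX transport selection: over InfiniBand, UCX_NET_DEVICES lists the mlx5 HCA ports, and the TCP-only restriction (OMPI_MCA_pml_ucx_tls=tcp) is dropped. Assuming the UCX utilities are installed on the compute nodes, one way to confirm which HCAs UCX can see is:
$ ucx_info -d | grep -i mlx5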
3.4 CPU Affinity
NOTE: currently only the cores option is supported.
Mode cores: each task is pinned to its own CPU cores; choose -ppn and -c so that their product does not exceed the total core count of a node.
Test command:
ctbatch -q test -N 1 -ppn 3 -c 2 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J cpu_affinity --mpi OpenMPI --cpu_bind cores -- /home/affinity_cpu/bin_gcc_openmpi_cpu_affinity
Result: each rank is bound to 2 CPU cores, 3 ranks in total.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (6718)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (6718)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (6758)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (6719)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (6719)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (6757)
rank 2 maps to 1 core [ 4] on gpuqinhexingcompute0001 (6720)
thread 0 maps to 1 core [ 4] on gpuqinhexingcompute0001 (6720)
thread 1 maps to 1 core [ 5] on gpuqinhexingcompute0001 (6756)
Appendix 1: Affinity Topology Examples
mpi:gpu 1:1
Multiple nodes; each node runs 8 MPI ranks; each rank uses 7 OpenMP threads and 1 GPU.
mpi:gpu 1:n
Multiple nodes; each node runs 4 MPI ranks; each rank uses 14 OpenMP threads and 2 GPUs.
mpi:gpu n:1
Multiple nodes; each node runs 56 MPI ranks; each rank is single-threaded; every 7 ranks share 1 GPU.
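Under the parameter-file conventions from Section 3, these three topologies would translate into settings like the sketch below. The node shape (56 cores and 8 GPUs per node) is an assumption chosen so the arithmetic works out; the # lines are annotations for this document, not necessarily valid params.ctbatch syntax.
# mpi:gpu 1:1 -- 8 ranks x 7 threads = 56 cores; 1 GPU per rank
ppn=8
c=7
g=8
gpu_bind=11
# mpi:gpu 1:n -- 4 ranks x 14 threads = 56 cores; g/ppn = 2 GPUs per rank
ppn=4
c=14
g=8
gpu_bind=1n
# mpi:gpu n:1 -- 56 single-threaded ranks; ppn/g = 7 ranks per GPU
ppn=56
c=1
g=8
gpu_bind=n1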