1. Affinity Overview
In HPC (high-performance computing) scenarios, affinity is the ability to schedule compute tasks onto precisely matched hardware resources (CPUs, GPUs, etc.):
Homogeneous affinity: binding tasks to physical cores / processor sockets on same-class CPU architectures such as Intel, Kunpeng, and Hygon.
Heterogeneous affinity: flexibly binding tasks to a single GPU or to multiple GPUs on heterogeneous hardware such as NVIDIA GPUs (CUDA) and Hygon accelerators (ROCm).
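As a concrete picture of what core binding means in practice, the sketch below pins each Slurm task to its own cores and prints the CPU mask the kernel enforces. It is a minimal illustration using standard tooling (srun, taskset), not part of ctbatch; the partition name batch is borrowed from the examples later in this document.
#!/usr/bin/env bash
#SBATCH --partition=batch
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
# With --cpu-bind=cores, each task is pinned to its own cores;
# taskset -pc prints the affinity list of the launched process.
srun --cpu-bind=cores bash -c 'echo "rank $SLURM_PROCID -> $(taskset -pc $$)"'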
2. ctbatch
Traditional HPC scheduling relies on a matched combination of scheduler (e.g. Slurm), compiler, and MPI library, and the adaptation rules differ across schedulers (Slurm/PBS) and hardware architectures, which makes job deployment slow and resource utilization poor.
ctbatch is 天翼云 (CTyun)'s HPC job-submission tool: an abstracted CLI that unifies affinity-scheduling logic across schedulers and hardware architectures.
$ ctbatch [-h] [-q QUEUE_NAME] [-J JOB_NAME] [-N NODE_NUM] [-w NODE_LIST]
[-ppn TASKS_PER_NODE] [-c CPUS_PER_TASK] [-g GPUS_PER_NODE] [-t MAX_TIME]
[--exclusive] [--mpi] [--gpu_bind] [--cpu_bind] [--env] [--user_defined_module] [--no_run]
[command]

| Parameter | Description |
|---|---|
| -q QUEUE_NAME, --queue QUEUE_NAME | Job queue name; required |
| -J JOB_NAME, --job_name JOB_NAME | Job name; at most 50 characters |
| -N NODE_NUM, --nodes NODE_NUM | Number of nodes; positive integer |
| -w NODE_LIST, --node_list NODE_LIST | Node list (comma-separated) |
| -ppn TASKS_PER_NODE, --tasks_per_node TASKS_PER_NODE | Tasks per node |
| -c CPUS_PER_TASK, --cpus_per_task CPUS_PER_TASK | CPU cores per task |
| -g GPUS_PER_NODE, --gpus_per_node GPUS_PER_NODE | GPUs per node |
| -t MAX_TIME, --max_time MAX_TIME | Maximum run time |
| --exclusive | Exclusive use of the allocated nodes |
| --mpi | MPI implementation to use: OpenMPI or OpenMPI-cuda-aware |
| --gpu_bind | Enable GPU affinity; MPI-to-GPU mapping modes: 11, 1n, n1 |
| --cpu_bind | Enable CPU affinity; currently only cores (see 3.4) |
| --env | User-defined environment variables |
| --user_defined_module | User-defined modules to load |
| --no_run | Only generate the job script, do not submit it |
| command | Program to run |
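As the examples in Section 3 show, the same options can also be read from a parameter file, one key=value entry per line, with keys mirroring the long option names (queue=, N=, ppn=, c=, g=, env=, mpi=, gpu_bind=, max_time=, and a bare exclusive). The equivalence sketched below is inferred from those samples rather than from a formal specification:
$ cat params.ctbatch
queue=batch
N=1
ppn=1
c=1
$ ctbatch params.ctbatch --no_run -- hostname
# behaves like: ctbatch -q batch -N 1 -ppn 1 -c 1 --no_run -- hostname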
3. Worked Examples
3.1 Basic Job
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=1
ppn=1
c=1
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test --exclusive -t 00:5:00 --no_run -- hostname
Final Slurm script
$ cat run_script_20250813171711.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
hostname
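With --no_run, ctbatch writes the job script without submitting it. On a standard Slurm installation the generated file can then be submitted and monitored by hand; a minimal illustration using the file name from the listing above:
$ sbatch run_script_20250813171711.sh.slurm
$ squeue -u $USER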
3.2 GPU Affinity & Ethernet
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=2
ppn=8
c=7
g=8
env=OMP_PROC_BIND=true;OMP_PLACES=cores
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test_affinity --exclusive -t 00:5:00 --mpi OpenMPI-cuda-aware --gpu_bind 11 --no_run -- test_app
Final Slurm script
$ cat run_script_20250818101338.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test_affinity
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
module load openmpi/5.0.8/gcc-8.5.0-cuda-12.8
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-1}
export SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
export OMPI_MCA_pml=ucx
export OMPI_MCA_pml_ucx_tls=tcp
export UCX_NET_DEVICES=bond0
export SLURM_MPI_TYPE=pmix_v5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores --gpu-bind=single:1 -m plane=$SLURM_NTASKS_PER_NODE test_app
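Before running a real workload, the rank-to-GPU mapping can be sanity-checked without a dedicated test binary. The hedged one-liner below (run from inside the same job script) relies on Slurm exporting a per-task CUDA_VISIBLE_DEVICES when --gpu-bind is in effect; with the 11 mapping, every rank should report a different single device:
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK \
  --cpu-bind=cores --gpu-bind=single:1 \
  bash -c 'echo "rank $SLURM_PROCID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'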
3.2.1 Test Runs
Mode 11: one task (MPI rank) maps to one GPU; set -ppn equal to -g.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 2 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind 11 -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: each rank is bound to one GPU.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (91258)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (91258)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (91275)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (91259)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (91259)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (91274)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0F:00.0 (default mode)
Mode 1n: one task (MPI rank) maps to n GPUs; set -ppn to a divisor of -g.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 4 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind 1n -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: each rank is bound to two GPUs.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (92566)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (92566)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (92598)
GPU: 2 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
device 1: 0000:0F:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (92567)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (92567)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (92599)
GPU: 2 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:1F:00.0 (default mode)
device 1: 0000:20:00.0 (default mode)
Mode n1: n tasks (MPI ranks) share one GPU; set -g to a divisor of -ppn.
Test command:
ctbatch -q test -N 1 -ppn 2 -c 2 -g 1 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J gpu_affinity --mpi OpenMPI-cuda-aware --gpu_bind n1 -- /home/affinity/bin_affinity_gcc_openmpi_cuda12_gpu
Result: every two ranks share one GPU.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (93051)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (93051)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (93065)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (93052)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (93052)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (93064)
GPU: 1 device(s) CUDA Version [Runtime 12040 / Driver 12040]
device 0: 0000:0E:00.0 (default mode)
3.3 GPU Affinity & IB
ctbatch parameter file
$ cat params.ctbatch
queue=batch
N=2
ppn=8
c=7
g=8
env=OMP_PROC_BIND=true;OMP_PLACES=cores
exclusive
gpu_bind=11
mpi=OpenMPI-cuda-aware
max_time=00:5:00
Generating the job script with ctbatch
$ ctbatch params.ctbatch -J test_affinity --no_run -- test_app
Final Slurm script
$ cat run_script_20250813222459.sh.slurm
#!/usr/bin/env bash
#SBATCH --job-name=test_affinity
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=7
#SBATCH --gpus-per-node=8
#SBATCH --nodes=2
#SBATCH --partition=batch
#SBATCH --exclusive
#SBATCH --time=00:5:00
module load openmpi/5.0.8/gcc-8.5.0-cuda-12.8
export OMP_PROC_BIND=true
export OMP_PLACES=cores
export SLURM_NTASKS_PER_NODE=${SLURM_NTASKS_PER_NODE:-1}
export SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-1}
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_ib0:1,mlx5_ib1:1,mlx5_ib2:1,mlx5_ib3:1
export SLURM_MPI_TYPE=pmix_v5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --ntasks-per-node=$SLURM_NTASKS_PER_NODE --cpus-per-task=$SLURM_CPUS_PER_TASK --cpu-bind=cores --gpu-bind=single:1 -m plane=$SLURM_NTASKS_PER_NODE test_app
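Compared with the Ethernet variant in 3.2, the material difference is the UCX transport selection: over InfiniBand, UCX_NET_DEVICES lists the mlx5 HCA ports, and the TCP-only restriction (OMPI_MCA_pml_ucx_tls=tcp) is dropped. Assuming the UCX utilities are installed on the compute nodes, one way to confirm which HCAs UCX can see is:
$ ucx_info -d | grep -i mlx5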
3.4 CPU Affinity
NOTE: currently only the cores option is supported.
Mode cores: each task is pinned to its own CPU cores; choose -ppn and -c so that their product does not exceed the total core count of a node.
Test command:
ctbatch -q test -N 1 -ppn 3 -c 2 --env "OMP_PROC_BIND=true;OMP_PLACES=cores" -J cpu_affinity --mpi OpenMPI --cpu_bind cores -- /home/affinity_cpu/bin_gcc_openmpi_cpu_affinity
Result: each rank is bound to 2 CPU cores, 3 ranks in total.
rank 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (6718)
thread 0 maps to 1 core [ 0] on gpuqinhexingcompute0001 (6718)
thread 1 maps to 1 core [ 1] on gpuqinhexingcompute0001 (6758)
rank 1 maps to 1 core [ 2] on gpuqinhexingcompute0001 (6719)
thread 0 maps to 1 core [ 2] on gpuqinhexingcompute0001 (6719)
thread 1 maps to 1 core [ 3] on gpuqinhexingcompute0001 (6757)
rank 2 maps to 1 core [ 4] on gpuqinhexingcompute0001 (6720)
thread 0 maps to 1 core [ 4] on gpuqinhexingcompute0001 (6720)
thread 1 maps to 1 core [ 5] on gpuqinhexingcompute0001 (6756)
Appendix 1: Affinity Topology Examples
mpi:gpu 1:1
Multiple nodes; each node runs 8 MPI ranks; each rank uses 7 OpenMP threads and 1 GPU.
mpi:gpu 1:n
Multiple nodes; each node runs 4 MPI ranks; each rank uses 14 OpenMP threads and 2 GPUs.
mpi:gpu n:1
Multiple nodes; each node runs 56 MPI ranks; each rank is single-threaded; every 7 ranks share 1 GPU.
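Under the parameter-file conventions from Section 3, these three topologies would translate into settings like the sketch below. The node shape (56 cores and 8 GPUs per node) is an assumption chosen so the arithmetic works out; the # lines are annotations for this document, not necessarily valid params.ctbatch syntax.
# mpi:gpu 1:1 -- 8 ranks x 7 threads = 56 cores; 1 GPU per rank
ppn=8
c=7
g=8
gpu_bind=11
# mpi:gpu 1:n -- 4 ranks x 14 threads = 56 cores; g/ppn = 2 GPUs per rank
ppn=4
c=14
g=8
gpu_bind=1n
# mpi:gpu n:1 -- 56 single-threaded ranks; ppn/g = 7 ranks per GPU
ppn=56
c=1
g=8
gpu_bind=n1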