Kernel Thread ksmd Running on PMD Isolated Cores Causes High Latency in OVS Packet Processing

perf sched

On the physical host, examine how time slices are being used on the isolated core, CPU 43:

perf sched record -C 43
perf sched timehist

The figure below shows the kernel thread ksmd being scheduled periodically, running for roughly 12 ms each time.
20240704154742.png
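
To make the capture reproducible, the recording can be bounded and the timehist output filtered down to ksmd. A minimal sketch (the 10-second window and the grep filter are illustrative choices, not from the original capture):

# record 10 seconds of scheduler events on the isolated core, then show only ksmd's slices
perf sched record -C 43 -- sleep 10
perf sched timehist | grep ksmd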

ksmd

The kernel's Kernel Samepage Merging (KSM) mechanism is designed to merge identical memory pages in order to save physical memory.

Merging identical pages is KSM's core function: it detects pages with identical content and collapses them into a single shared copy, reducing physical memory usage. The mechanism is particularly valuable in virtualization environments, where many VMs run the same operating system and applications and therefore hold large numbers of identical pages.

ksmd kernel parameters

ksmd is the daemon thread of the kernel's KSM mechanism; its knobs and counters live under /sys/kernel/mm/ksm.

In the /sys/kernel/mm/ksm directory you will find the following files, each with a specific purpose (a quick way to dump them all follows the list):

  1. full_scans: number of full scans completed since KSM started.
  2. max_page_sharing: maximum number of references allowed to a single shared page.
  3. merge_across_nodes: whether pages may be merged across NUMA nodes; 1 allows it, 0 does not.
  4. pages_shared: number of shared pages currently in use.
  5. pages_sharing: number of additional page mappings currently sharing those pages.
  6. pages_to_scan: number of pages KSM scans per scan cycle.
  7. pages_unshared: number of pages that are currently not shared.
  8. pages_volatile: number of volatile pages, i.e. pages whose contents change too quickly for KSM to merge them.
  9. run: KSM run state; write 0 to stop KSM, 1 to start it.
  10. sleep_millisecs: interval between KSM scan cycles, in milliseconds.
  11. stable_node_chains: number of stable node chains.
  12. stable_node_chains_prune_millisecs: interval, in milliseconds, at which stable node chains are pruned.
  13. stable_node_dups: number of duplicated stable nodes.
  14. use_zero_pages: whether empty pages are merged with the kernel zero page; 1 enables it, 0 disables it.
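
All of these files can be read (and the writable ones written) like any other sysfs entry. A small shell sketch for dumping the current state; per the kernel's KSM documentation, pages_sharing roughly measures how much memory is being saved:

# print every KSM knob and counter together with its file name
grep -H . /sys/kernel/mm/ksm/*

# rough estimate of memory saved by KSM, in KiB
echo $(( $(cat /sys/kernel/mm/ksm/pages_sharing) * $(getconf PAGE_SIZE) / 1024 ))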

Dynamically adjusting the ksmd parameters

The ksmd thread is scheduled to run under the following conditions:

After KSM is started

KSM is started by writing 1 to /sys/kernel/mm/ksm/run. Once started, ksmd begins to periodically scan and merge memory pages according to its configuration.

The configured scan interval

How often ksmd is scheduled is determined by the scan interval configured in /sys/kernel/mm/ksm/sleep_millisecs. The value is an interval in milliseconds: the time ksmd sleeps between scans. For example:

# set the scan interval to 100 milliseconds
echo 100 | sudo tee /sys/kernel/mm/ksm/sleep_millisecs

With this configuration, ksmd is scheduled roughly every 100 milliseconds to scan and merge memory pages.

Pages scanned per cycle

Each time ksmd runs, the number of pages it scans is determined by the value configured in /sys/kernel/mm/ksm/pages_to_scan. For example:

# scan 256 pages per cycle
echo 256 | sudo tee /sys/kernel/mm/ksm/pages_to_scan

This means that every time ksmd is scheduled, it scans 256 memory pages.
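
Taken together, these two knobs bound how much CPU ksmd consumes. A minimal sketch for throttling it at runtime (the 1000 ms and 64-page values are illustrative, not tuned recommendations):

# make ksmd wake up less often and do less work per wakeup
echo 1000 | sudo tee /sys/kernel/mm/ksm/sleep_millisecs
echo 64 | sudo tee /sys/kernel/mm/ksm/pages_to_scan

# or stop KSM scanning entirely
echo 0 | sudo tee /sys/kernel/mm/ksm/run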

housekeeping_isolcpus_setup

When the isolcpus= parameter is passed on the kernel command line (via grub), the housekeeping_isolcpus_setup function in kernel/sched/isolation.c is invoked.

// vim kernel/sched/isolation.c +126

126 static int __init housekeeping_isolcpus_setup(char *str)  // early-init handler for the isolcpus= boot parameter
127 {
128     unsigned int flags = 0;
129
130     while (isalpha(*str)) {  // consume optional leading flag words
131         if (!strncmp(str, "nohz,", 5)) {  // "nohz," prefix
132             str += 5;
133             flags |= HK_FLAG_TICK;
134             continue;
135         }
136
137         if (!strncmp(str, "domain,", 7)) {  // "domain," prefix
138             str += 7;
139             flags |= HK_FLAG_DOMAIN;
140             continue;
141         }
142
143         pr_warn("isolcpus: Error, unknown flag\n");
144         return 0;
145     }
146
147     /* Default behaviour for isolcpus without flags */
148     if (!flags)
149         flags |= HK_FLAG_DOMAIN;  // a plain "isolcpus=<cpulist>" implies domain isolation
150
151     return housekeeping_setup(str, flags);  // the remainder of str is the CPU list
152 }
153 __setup("isolcpus=", housekeeping_isolcpus_setup);  // register the "isolcpus=" boot parameter handler
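
Based on the parser above, all of the following boot parameter forms are accepted; a bare CPU list implies HK_FLAG_DOMAIN, and the flag words can be stacked (the CPU ranges here are illustrative, matching this machine's layout):

# domain isolation only (the default when no flag word is given)
isolcpus=1-75,77-151
# the same behaviour, spelled explicitly
isolcpus=domain,1-75,77-151
# additionally stop the scheduler tick on the isolated CPUs
isolcpus=nohz,domain,1-75,77-151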

The kernel scheduler

As a kernel thread, ksmd is scheduled by the kernel scheduler, which decides when to run it based on overall system load and the thread's priority. Although ksmd usually runs at a low priority, the scheduler will still try to run it whenever its configured scan interval expires.

start_kernel  // thread 0 (the boot idle thread)
    -> sched_init // scheduler initialization
        -> init_sched_fair_class // CFS class initialization
    -> rest_init
        -> kernel_init  // thread 1 (pid 1), SCHED_NORMAL
                -> kernel_init_freeable
                        -> wait_for_completion(&kthreadd_done) // Wait until kthreadd is all set-up.
                        -> smp_init // bring up the secondary CPUs
                        -> sched_init_smp // SMP scheduler initialization
                                -> sched_init_domains
                                        -> cpumask_and // excludes the isolated cpus
                                -> init_sched_rt_class
                                -> init_sched_vms_class
                                -> init_sched_dl_class
                        -> do_basic_setup
                                -> do_initcalls
                                        -> subsys_initcall(ksm_init)    // creates ksmd, SCHED_NORMAL
        -> kthreadd  // thread 2 (pid 2), SCHED_NORMAL
        -> complete(&kthreadd_done)

Both kthreadd (pid 2) and ksmd (pid 510 on this machine) report an affinity mask covering all CPUs, including the isolated ones:

taskset -pc 2
pid 2's current affinity list: 0-95
taskset -pc 510
pid 510's current affinity list: 0-95

rest_init

// vim init/main.c +397

397 static noinline void __ref rest_init(void)
398 {
399     struct task_struct *tsk;
400     int pid;
401
402     rcu_scheduler_starting();
403     /*
404      * We need to spawn init first so that it obtains pid 1, however
405      * the init task will end up wanting to create kthreads, which, if
406      * we schedule it before we create kthreadd, will OOPS.
407      */
408     pid = kernel_thread(kernel_init, NULL, CLONE_FS);  // create the kernel_init thread (becomes pid 1)
409     /*
410      * Pin init on the boot CPU. Task migration is not properly working
411      * until sched_init_smp() has been run. It will set the allowed
412      * CPUs for init to the non isolated CPUs.
413      */
414     rcu_read_lock();
415     tsk = find_task_by_pid_ns(pid, &init_pid_ns);
416     set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));  // pin init to the boot CPU
417     rcu_read_unlock();
418
419     numa_default_policy();
420     pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);  // create kthreadd (becomes pid 2)
421     rcu_read_lock();
422     kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
423     rcu_read_unlock();
424
425     /*
426      * Enable might_sleep() and smp_processor_id() checks.
427      * They cannot be enabled earlier because with CONFIG_PREEMPT=y
428      * kernel_thread() would trigger might_sleep() splats. With
429      * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled
430      * already, but it's stuck on the kthreadd_done completion.
431      */
432     system_state = SYSTEM_SCHEDULING;
433
434     complete(&kthreadd_done);  // let kernel_init proceed past wait_for_completion(&kthreadd_done)
435
436     /*
437      * The boot idle thread must execute schedule()
438      * at least once to get things moving:
439      */
440     schedule_preempt_disabled();
441     /* Call into cpu_idle with preempt disabled */
442     cpu_startup_entry(CPUHP_ONLINE);
443 }

sched_init_numa

1385 void sched_init_numa(void)
1386 {
1387     int next_distance, curr_distance = node_distance(0, 0);
1388     struct sched_domain_topology_level *tl;
1389     int level = 0;
1390     int i, j, k;
1391
1392     sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL);
1393     if (!sched_domains_numa_distance)
1394         return;
1395
1396     /* Includes NUMA identity node at level 0. */
1397     sched_domains_numa_distance[level++] = curr_distance;
1398     sched_domains_numa_levels = level;
1399
1400     /*
1401      * O(nr_nodes^2) deduplicating selection sort -- in order to find the
1402      * unique distances in the node_distance() table.
1403      *
1404      * Assumes node_distance(0,j) includes all distances in
1405      * node_distance(i,j) in order to avoid cubic time.
1406      */
1407     next_distance = curr_distance;
1408     for (i = 0; i < nr_node_ids; i++) {
1409         for (j = 0; j < nr_node_ids; j++) {
1410             for (k = 0; k < nr_node_ids; k++) {
1411                 int distance = node_distance(i, k);
1412
1413                 if (distance > curr_distance &&
1414                     (distance < next_distance ||
1415                      next_distance == curr_distance))
1416                     next_distance = distance;  // smallest distance not yet recorded
1417
1418                 /*
1419                  * While not a strong assumption it would be nice to know
1420                  * about cases where if node A is connected to B, B is not
1421                  * equally connected to A.
1422                  */
1423                 if (sched_debug() && node_distance(k, i) != distance)
1424                     sched_numa_warn("Node-distance not symmetric");
1425
1426                 if (sched_debug() && i && !find_numa_distance(distance))
1427                     sched_numa_warn("Node-0 not representative");
1428             }
1429             if (next_distance != curr_distance) {
1430                 sched_domains_numa_distance[level++] = next_distance;  // record the new unique distance
1431                 sched_domains_numa_levels = level;
1432                 curr_distance = next_distance;
1433             } else break;
1434         }
1435
1436         /*
1437          * In case of sched_debug() we verify the above assumption.
1438          */
1439         if (!sched_debug())
1440             break;
1441     }
1442
1443     /*
1444      * 'level' contains the number of unique distances
1445      *
1446      * The sched_domains_numa_distance[] array includes the actual distance
1447      * numbers.
1448      */
1449
1450     /*
1451      * Here, we should temporarily reset sched_domains_numa_levels to 0.
1452      * If it fails to allocate memory for array sched_domains_numa_masks[][],
1453      * the array will contain less then 'level' members. This could be
1454      * dangerous when we use it to iterate array sched_domains_numa_masks[][]
1455      * in other functions.
1456      *
1457      * We reset it to 'level' at the end of this function.
1458      */
1459     sched_domains_numa_levels = 0;
1460
1461     sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
1462     if (!sched_domains_numa_masks)
1463         return;
1464
1465     /*
1466      * Now for each level, construct a mask per node which contains all
1467      * CPUs of nodes that are that many hops away from us.
1468      */
1469     for (i = 0; i < level; i++) {
1470         sched_domains_numa_masks[i] =
1471             kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
1472         if (!sched_domains_numa_masks[i])
1473             return;
1474
1475         for (j = 0; j < nr_node_ids; j++) {
1476             struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
1477             if (!mask)
1478                 return;
1479
1480             sched_domains_numa_masks[i][j] = mask;
1481
1482             for_each_node(k) {
1483                 if (node_distance(j, k) > sched_domains_numa_distance[i])
1484                     continue;
1485
1486                 cpumask_or(mask, mask, cpumask_of_node(k));  // add node k's CPUs to the mask
1487             }
1488         }
1489     }
1490
1491     /* Compute default topology size */
1492     for (i = 0; sched_domain_topology[i].mask; i++);
1493
1494     tl = kzalloc((i + level + 1) *
1495             sizeof(struct sched_domain_topology_level), GFP_KERNEL);
1496     if (!tl)
1497         return;
1498
1499     /*
1500      * Copy the default topology bits..
1501      */
1502     for (i = 0; sched_domain_topology[i].mask; i++)
1503         tl[i] = sched_domain_topology[i];
1504
1505     /*
1506      * Add the NUMA identity distance, aka single NODE.
1507      */
1508     tl[i++] = (struct sched_domain_topology_level){
1509         .mask = sd_numa_mask,
1510         .numa_level = 0,
1511         SD_INIT_NAME(NODE)
1512     };
1513
1514     /*
1515      * .. and append 'j' levels of NUMA goodness.
1516      */
1517     for (j = 1; j < level; i++, j++) {
1518         tl[i] = (struct sched_domain_topology_level){
1519             .mask = sd_numa_mask,
1520             .sd_flags = cpu_numa_flags,
1521             .flags = SDTL_OVERLAP,
1522             .numa_level = j,
1523             SD_INIT_NAME(NUMA)
1524         };
1525     }
1526
1527     sched_domain_topology = tl;
1528
1529     sched_domains_numa_levels = level;
1530     sched_max_numa_distance = sched_domains_numa_distance[level - 1];
1531
1532     init_numa_topology_type();
1533 }

init_numa_topology_type

1333 /*
1334  * A system can have three types of NUMA topology:
1335  * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
1336  * NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
1337  * NUMA_BACKPLANE: nodes can reach other nodes through a backplane
1338  *
1339  * The difference between a glueless mesh topology and a backplane
1340  * topology lies in whether communication between not directly
1341  * connected nodes goes through intermediary nodes (where programs
1342  * could run), or through backplane controllers. This affects
1343  * placement of programs.
1344  *
1345  * The type of topology can be discerned with the following tests:
1346  * - If the maximum distance between any nodes is 1 hop, the system
1347  *   is directly connected.
1348  * - If for two nodes A and B, located N > 1 hops away from each other,
1349  *   there is an intermediary node C, which is < N hops away from both
1350  *   nodes A and B, the system is a glueless mesh.
1351  */
1352 static void init_numa_topology_type(void)
1353 {
1354     int a, b, c, n;
1355
1356     n = sched_max_numa_distance;
1357
1358     if (sched_domains_numa_levels <= 2) {
1359         sched_numa_topology_type = NUMA_DIRECT;
1360         return;
1361     }
1362
1363     for_each_online_node(a) {
1364         for_each_online_node(b) {
1365             /* Find two nodes furthest removed from each other. */
1366             if (node_distance(a, b) < n)
1367                 continue;
1368
1369             /* Is there an intermediary node between a and b? */
1370             for_each_online_node(c) {
1371                 if (node_distance(a, c) < n &&
1372                     node_distance(b, c) < n) {
1373                     sched_numa_topology_type =
1374                             NUMA_GLUELESS_MESH;
1375                     return;
1376                 }
1377             }
1378
1379             sched_numa_topology_type = NUMA_BACKPLANE;
1380             return;
1381         }
1382     }
1383 }
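
The node_distance() table that sched_init_numa() deduplicates, and that init_numa_topology_type() classifies, can be inspected from userspace. A quick sketch (numactl must be installed):

# show NUMA nodes, their CPUs, memory, and the node distance matrix
numactl --hardware

# the same distances, straight from sysfs (node 0's row of the matrix)
cat /sys/devices/system/node/node0/distance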

kernel_thread

// vim kernel/fork.c +2348

2348 /*
2349  * Create a kernel thread.
2350  */
2351 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
2352 {
2353     return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
2354         (unsigned long)arg, NULL, NULL, 0);
2355 }

_do_fork

// vim kernel/fork.c +2257

2257 /*
2258  *  Ok, this is the main fork-routine.
2259  *
2260  * It copies the process, and if successful kick-starts
2261  * it and waits for it to finish using the VM if required.
2262  */
2263 long _do_fork(unsigned long clone_flags,
2264           unsigned long stack_start,
2265           unsigned long stack_size,
2266           int __user *parent_tidptr,
2267           int __user *child_tidptr,
2268           unsigned long tls)
2269 {
2270     struct completion vfork;
2271     struct pid *pid;
2272     struct task_struct *p;
2273     int trace = 0;
2274     long nr;
2275
2276     /*
2277      * Determine whether and which event to report to ptracer.  When
2278      * called from kernel_thread or CLONE_UNTRACED is explicitly
2279      * requested, no event is reported; otherwise, report if the event
2280      * for the type of forking is enabled.
2281      */
2282     if (!(clone_flags & CLONE_UNTRACED)) {
2283         if (clone_flags & CLONE_VFORK)
2284             trace = PTRACE_EVENT_VFORK;
2285         else if ((clone_flags & CSIGNAL) != SIGCHLD)
2286             trace = PTRACE_EVENT_CLONE;
2287         else
2288             trace = PTRACE_EVENT_FORK;
2289
2290         if (likely(!ptrace_event_enabled(current, trace)))
2291             trace = 0;
2292     }
2293
2294     p = copy_process(clone_flags, stack_start, stack_size,
2295              child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
2296     add_latent_entropy();
2297
2298     if (IS_ERR(p))
2299         return PTR_ERR(p);
2300
2301     /*
2302      * Do this prior waking up the new thread - the thread pointer
2303      * might get invalid after that point, if the thread exits quickly.
2304      */
2305     trace_sched_process_fork(current, p);
2306
2307     pid = get_task_pid(p, PIDTYPE_PID);
2308     nr = pid_vnr(pid);
2309
2310     if (clone_flags & CLONE_PARENT_SETTID)
2311         put_user(nr, parent_tidptr);
2312
2313     if (clone_flags & CLONE_VFORK) {
2314         p->vfork_done = &vfork;
2315         init_completion(&vfork);
2316         get_task_struct(p);
2317     }
2318
2319     wake_up_new_task(p);
2320
2321     /* forking complete and child started to run, tell ptracer */
2322     if (unlikely(trace))
2323         ptrace_event_pid(trace, pid);
2324
2325     if (clone_flags & CLONE_VFORK) {
2326         if (!wait_for_vfork_done(p, &vfork))
2327             ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
2328     }
2329
2330     put_pid(pid);
2331     return nr;
2332 }

wake_up_new_task

// vim kernel/sched/core.c +2420

2420  * wake_up_new_task - wake up a newly created task for the first time.
2421  *
2422  * This function will do some initial scheduler statistics housekeeping
2423  * that must be done for every newly created context, then puts the task
2424  * on the runqueue and wakes it.
2425  */
2426 void wake_up_new_task(struct task_struct *p)
2427 {
2428     struct rq_flags rf;
2429     struct rq *rq;
2430
2431     raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
2432     p->state = TASK_RUNNING;
2433 #ifdef CONFIG_SMP
2434     /*
2435      * Fork balancing, do it here and not earlier because:
2436      *  - cpus_allowed can change in the fork path
2437      *  - any previously selected CPU might disappear through hotplug
2438      *
2439      * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
2440      * as we're not fully set-up yet.
2441      */
2442     p->recent_used_cpu = task_cpu(p);
2443     rseq_migrate(p);
2444     __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));  // pick the CPU for the new task
2445 #endif
2446     rq = __task_rq_lock(p, &rf);
2447     update_rq_clock(rq);
2448     post_init_entity_util_avg(&p->se);
2449
2450     activate_task(rq, p, ENQUEUE_NOCLOCK);
2451     p->on_rq = TASK_ON_RQ_QUEUED;
2452     trace_sched_wakeup_new(p);
2453     check_preempt_curr(rq, p, WF_FORK);
2454 #ifdef CONFIG_SMP
2455     if (p->sched_class->task_woken) {
2456         /*
2457          * Nothing relies on rq->lock after this, so its fine to
2458          * drop it.
2459          */
2460         rq_unpin_lock(rq, &rf);
2461         p->sched_class->task_woken(rq, p);
2462         rq_repin_lock(rq, &rf);
2463     }
2464 #endif
2465     task_rq_unlock(rq, p, &rf);
2466 }

select_task_rq_fair

The call sites of select_task_rq (the vim quickfix list, :cl):
9 kernel/sched/core.c:2066: <<try_to_wake_up>> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
10 kernel/sched/core.c:2444: <<wake_up_new_task>> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
11 kernel/sched/core.c:2994: <<sched_exec>> dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
// vim CTKernel/kernel/sched/fair.c +6437

6424 /*
6425  * select_task_rq_fair: Select target runqueue for the waking task in domains that have the 'sd_flag' flag set;
6426  * in practice this is SD_BALANCE_WAKE, SD_BALANCE_FORK or SD_BALANCE_EXEC.
6427  *
6428  * Balances load by selecting the idlest CPU in the idlest group, or, under certain conditions, an idle sibling CPU if the domain has SD_WAKE_AFFINE set.
6429  *
6430  * Returns the target CPU number.
6431  *
6432  * preempt must be disabled.
6433  */
6434 static int
6435 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
6436 {
6437     struct sched_domain *tmp, *sd = NULL;
6438     int cpu = smp_processor_id();
6439     int new_cpu = prev_cpu;
6440     int want_affine = 0;
6441     int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
6442
6443     if (sd_flag & SD_BALANCE_WAKE) {
6444         record_wakee(p);
6445         want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
6446                   && cpumask_test_cpu(cpu, &p->cpus_allowed);
6447     }
6448
6449     rcu_read_lock();
6450     for_each_domain(cpu, tmp) {
6451         if (!(tmp->flags & SD_LOAD_BALANCE))
6452             break;
6453
6454         /*
6455          * If both 'cpu' and 'prev_cpu' are part of this domain,
6456          * 'cpu' is a valid SD_WAKE_AFFINE target.
6457          */
6458         if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
6459             cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {
6460             if (cpu != prev_cpu)
6461                 new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
6462
6463             sd = NULL; /* Prefer wake_affine over balance flags */
6464             break;
6465         }
6466
6467         if (tmp->flags & sd_flag)
6468             sd = tmp;
6469         else if (!want_affine)
6470             break;
6471     }
6472
6473     if (unlikely(sd)) {
6474         /* Slow path */
6475         new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);
6476     } else if (sd_flag & SD_BALANCE_WAKE) {
6477         /* Fast path */
6478
6479         new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
6480
6481         if (want_affine)
6482             current->recent_used_cpu = cpu;
6483     }
6484     rcu_read_unlock();
6485
6486     return new_cpu;
6487 }
// vim CTKernel/include/linux/sched/topology.h +20

20 #define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
21 #define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
22 #define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
23 #define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
24 #define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */
25 #define SD_WAKE_AFFINE      0x0020  /* Wake task to waking CPU */
26 #define SD_ASYM_CPUCAPACITY 0x0040  /* Groups have different max cpu capacities */
27 #define SD_SHARE_CPUCAPACITY    0x0080  /* Domain members share cpu capacity */
28 #define SD_SHARE_POWERDOMAIN    0x0100  /* Domain members share power domain */
29 #define SD_SHARE_PKG_RESOURCES  0x0200  /* Domain members share cpu pkg resources */
30 #define SD_SERIALIZE        0x0400  /* Only a single load balancing instance */
31 #define SD_ASYM_PACKING     0x0800  /* Place busy groups earlier in the domain */
32 #define SD_PREFER_SIBLING   0x1000  /* Prefer to place tasks in a sibling domain */
33 #define SD_OVERLAP      0x2000  /* sched_domains of this level overlap */
34 #define SD_NUMA         0x4000  /* cross-node balancing */
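
With CONFIG_SCHED_DEBUG enabled, these per-domain flags can be read back for any CPU, which is handy for confirming that isolated CPUs really have been dropped from the scheduling domains. A sketch, assuming a kernel of this era where the files live under /proc/sys/kernel/sched_domain:

# list every scheduling domain of cpu0 with its name and flag word
for d in /proc/sys/kernel/sched_domain/cpu0/domain*; do
    echo "$d: $(cat $d/name) flags=$(cat $d/flags)"
done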

select_idle_sibling

6258 /*
6259  * Try and locate an idle core/thread in the LLC cache domain.
6260  */
6261 static int select_idle_sibling(struct task_struct *p, int prev, int target)
6262 {
6263     struct sched_domain *sd;
6264     int i, recent_used_cpu;
6265
6266     if (available_idle_cpu(target))
6267         return target;
6268
6269     /*
6270      * If the previous CPU is cache affine and idle, don't be stupid:
6271      */
6272     if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
6273         return prev;
6274
6275     /* Check a recently used CPU as a potential idle candidate: */
6276     recent_used_cpu = p->recent_used_cpu;
6277     if (recent_used_cpu != prev &&
6278         recent_used_cpu != target &&
6279         cpus_share_cache(recent_used_cpu, target) &&
6280         available_idle_cpu(recent_used_cpu) &&
6281         cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
6282         /*
6283          * Replace recent_used_cpu with prev as it is a potential
6284          * candidate for the next wake:
6285          */
6286         p->recent_used_cpu = prev;
6287         return recent_used_cpu;
6288     }
6289
6290     sd = rcu_dereference(per_cpu(sd_llc, target));  // the LLC scheduling domain of the target CPU
6291     if (!sd)
6292         return target;
6293
6294     i = select_idle_core(p, sd, target);  // look for a fully idle core
6295     if ((unsigned)i < nr_cpumask_bits)
6296         return i;
6297
6298     i = select_idle_cpu(p, sd, target);  // look for an idle CPU in the LLC domain
6299     if ((unsigned)i < nr_cpumask_bits)
6300         return i;
6301
6302     i = select_idle_smt(p, sd, target);  // look for an idle SMT sibling of target
6303     if ((unsigned)i < nr_cpumask_bits)
6304         return i;
6305
6306     return target;
6307 }

kthreadd

// vim kernel/kthread.c +569

569 int kthreadd(void *unused)
570 {
571     struct task_struct *tsk = current;
572
573     /* Setup a clean context for our children to inherit. */
574     set_task_comm(tsk, "kthreadd");
575     ignore_signals(tsk);
576     set_cpus_allowed_ptr(tsk, cpu_all_mask);  // kthreadd itself may run on any CPU
577     set_mems_allowed(node_states[N_MEMORY]);
578
579     current->flags |= PF_NOFREEZE;
580     cgroup_init_kthreadd();
581
582     for (;;) {
583         set_current_state(TASK_INTERRUPTIBLE);
584         if (list_empty(&kthread_create_list))  // no pending creation requests:
585             schedule();                        // sleep until woken
586         __set_current_state(TASK_RUNNING);
587
588         spin_lock(&kthread_create_lock);
589         while (!list_empty(&kthread_create_list)) {
590             struct kthread_create_info *create;
591
592             create = list_entry(kthread_create_list.next,
593                         struct kthread_create_info, list);
594             list_del_init(&create->list);
595             spin_unlock(&kthread_create_lock);
596
597             create_kthread(create);  // fork the requested kernel thread
598
599             spin_lock(&kthread_create_lock);
600         }
601         spin_unlock(&kthread_create_lock);
602     }
603
604     return 0;
605 }

create_kthread

271 static void create_kthread(struct kthread_create_info *create)
272 {
273     int pid;
274
275 #ifdef CONFIG_NUMA
276     current->pref_node_fork = create->node;
277 #endif
278     /* We want our own signal handler (we take no signals by default). */
279     pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
280     if (pid < 0) {
281         /* If user was SIGKILLed, I release the structure. */
282         struct completion *done = xchg(&create->done, NULL);
283         __set_current_state(TASK_RUNNING);  // set task state to running
284         if (!done) {
285             kfree(create);
286             return;
287         }
288         create->result = ERR_PTR(pid);
289         complete(done);
290     }
291 }

ksm_init

// vim mm/ksm.c +3172

3172 static int __init ksm_init(void)
3173 {
3174     struct task_struct *ksm_thread;
3175     int err;
3176
3177     /* The correct value depends on page size and endianness */
3178     zero_checksum = calc_checksum(ZERO_PAGE(0));
3179     /* Default to false for backwards compatibility */
3180     ksm_use_zero_pages = false;
3181
3182     err = ksm_slab_init();
3183     if (err)
3184         goto out;
3185
3186     ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
3187     if (IS_ERR(ksm_thread)) {
3188         pr_err("ksm: creating kthread failed\n");
3189         err = PTR_ERR(ksm_thread);
3190         goto out_free;
3191     }
3192
3193 #ifdef CONFIG_SYSFS
3194     err = sysfs_create_group(mm_kobj, &ksm_attr_group);
3195     if (err) {
3196         pr_err("ksm: register sysfs failed\n");
3197         kthread_stop(ksm_thread);
3198         goto out_free;
3199     }
3200 #else
3201     ksm_run = KSM_RUN_MERGE;    /* no way for user to start it */
3202
3203 #endif /* CONFIG_SYSFS */
3204
3205 #ifdef CONFIG_MEMORY_HOTREMOVE
3206     /* There is no significance to this priority 100 */
3207     hotplug_memory_notifier(ksm_memory_callback, 100);
3208 #endif
3209     return 0;
3210
3211 out_free:
3212     ksm_slab_free();
3213 out:
3214     return err;
3215 }
// vim include/linux/kthread.h +34

34 /**
35  * kthread_run - create and wake a thread.
36  * @threadfn: the function to run until signal_pending(current).
37  * @data: data ptr for @threadfn.
38  * @namefmt: printf-style name for the thread.
39  *
40  * Description: Convenient wrapper for kthread_create() followed by
41  * wake_up_process().  Returns the kthread or ERR_PTR(-ENOMEM).
42  */
43 #define kthread_run(threadfn, data, namefmt, ...)              \
44 ({                                     \
45     struct task_struct *__k                        \
46         = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
47     if (!IS_ERR(__k))                          \
48         wake_up_process(__k);                      \
49     __k;                                   \
50 })

kthread_create

// vim include/linux/kthread.h +14

14 /**
15  * kthread_create - create a kthread on the current node
16  * @threadfn: the function to run in the thread
17  * @data: data pointer for @threadfn()
18  * @namefmt: printf-style format string for the thread name
19  * @arg...: arguments for @namefmt.
20  *
21  * This macro will create a kthread on the current node, leaving it in
22  * the stopped state.  This is just a helper for kthread_create_on_node();
23  * see the documentation there for more details.
24  */
25 #define kthread_create(threadfn, data, namefmt, arg...) \
26     kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

kthread_create_on_node

// kernel/kthread.c +357

357 /**
358  * kthread_create_on_node - create a kthread.
359  * @threadfn: the function to run until signal_pending(current).
360  * @data: data ptr for @threadfn.
361  * @node: task and thread structures for the thread are allocated on this node
362  * @namefmt: printf-style name for the thread.
363  *
364  * Description: This helper function creates and names a kernel
365  * thread.  The thread will be stopped: use wake_up_process() to start
366  * it.  See also kthread_run().  The new thread has SCHED_NORMAL policy and
367  * is affine to all CPUs.
368  *
369  * If thread is going to be bound on a particular cpu, give its node
370  * in @node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE.
371  * When woken, the thread will run @threadfn() with @data as its
372  * argument. @threadfn() can either call do_exit() directly if it is a
373  * standalone thread for which no one will call kthread_stop(), or
374  * return when 'kthread_should_stop()' is true (which means
375  * kthread_stop() has been called).  The return value should be zero
376  * or a negative error number; it will be passed to kthread_stop().
377  *
378  * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
379  */
380 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
381                        void *data, int node,
382                        const char namefmt[],
383                        ...)
384 {
385     struct task_struct *task;
386     va_list args;
387
388     va_start(args, namefmt);
389     task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
390     va_end(args);
391
392     return task;
393 }
394 EXPORT_SYMBOL(kthread_create_on_node);

__kthread_create_on_node

293 static __printf(4, 0)
294 struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
295                             void *data, int node,
296                             const char namefmt[],
297                             va_list args)
298 {
299     DECLARE_COMPLETION_ONSTACK(done);
300     struct task_struct *task;
301     struct kthread_create_info *create = kmalloc(sizeof(*create),
302                              GFP_KERNEL);
303
304     if (!create)
305         return ERR_PTR(-ENOMEM);
306     create->threadfn = threadfn;
307     create->data = data;
308     create->node = node;
309     create->done = &done;
310
311     spin_lock(&kthread_create_lock);
312     list_add_tail(&create->list, &kthread_create_list);
313     spin_unlock(&kthread_create_lock);
314
315     wake_up_process(kthreadd_task);
316     /*
317      * Wait for completion in killable state, for I might be chosen by
318      * the OOM killer while kthreadd is trying to allocate memory for
319      * new kernel thread.
320      */
321     if (unlikely(wait_for_completion_killable(&done))) {
322         /*
323          * If I was SIGKILLed before kthreadd (or new kernel thread)
324          * calls complete(), leave the cleanup of this structure to
325          * that thread.
326          */
327         if (xchg(&create->done, NULL))
328             return ERR_PTR(-EINTR);
329         /*
330          * kthreadd (or new kernel thread) will call complete()
331          * shortly.
332          */
333         wait_for_completion(&done);
334     }
335     task = create->result;
336     if (!IS_ERR(task)) {
337         static const struct sched_param param = { .sched_priority = 0 };
338         char name[TASK_COMM_LEN];
339
340         /*
341          * task is already visible to other tasks, so updating
342          * COMM must be protected.
343          */
344         vsnprintf(name, sizeof(name), namefmt, args);
345         set_task_comm(task, name);
346         /*
347          * root may have changed our (kthreadd's) priority or CPU mask.
348          * The kernel thread should not inherit these properties.
349          */
350         sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
351         set_cpus_allowed_ptr(task, cpu_all_mask);
352     }
353     kfree(create);
354     return task;
355 }

wake_up_process

2141 /**
2142  * wake_up_process - Wake up a specific process
2143  * @p: The process to be woken up.
2144  *
2145  * Attempt to wake up the nominated process and move it to the set of runnable
2146  * processes.
2147  *
2148  * Return: 1 if the process was woken up, 0 if it was already running.
2149  *
2150  * This function executes a full memory barrier before accessing the task state.
2151  */
2152 int wake_up_process(struct task_struct *p)
2153 {
2154     return try_to_wake_up(p, TASK_NORMAL, 0);
2155 }
2156 EXPORT_SYMBOL(wake_up_process);

try_to_wake_up

1959 /**
1960  * try_to_wake_up - wake up a thread
1961  * @p: the thread to be awakened
1962  * @state: the mask of task states that can be woken
1963  * @wake_flags: wake modifier flags (WF_*)
1964  *
1965  * If (@state & @p->state) @p->state = TASK_RUNNING.
1966  *
1967  * If the task was not queued/runnable, also place it back on a runqueue.
1968  *
1969  * Atomic against schedule() which would dequeue a task, also see
1970  * set_current_state().
1971  *
1972  * This function executes a full memory barrier before accessing the task
1973  * state; see set_current_state().
1974  *
1975  * Return: %true if @p->state changes (an actual wakeup was done),
1976  *     %false otherwise.
1977  */
1978 static int
1979 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1980 {
1981     unsigned long flags;
1982     int cpu, success = 0;
1983
1984     /*
1985      * If we are going to wake up a thread waiting for CONDITION we
1986      * need to ensure that CONDITION=1 done by the caller can not be
1987      * reordered with p->state check below. This pairs with mb() in
1988      * set_current_state() the waiting thread does.
1989      */
1990     raw_spin_lock_irqsave(&p->pi_lock, flags);
1991     smp_mb__after_spinlock();
1992     if (!(p->state & state))
1993         goto out;
1994
1995     trace_sched_waking(p);
1996
1997     /* We're going to change ->state: */
1998     success = 1;
1999     cpu = task_cpu(p);
2000
2001     /*
2002      * Ensure we load p->on_rq _after_ p->state, otherwise it would
2003      * be possible to, falsely, observe p->on_rq == 0 and get stuck
2004      * in smp_cond_load_acquire() below.
2005      *
2006      * sched_ttwu_pending()         try_to_wake_up()
2007      *   STORE p->on_rq = 1           LOAD p->state
2008      *   UNLOCK rq->lock
2009      *
2010      * __schedule() (switch to task 'p')
2011      *   LOCK rq->lock            smp_rmb();
2012      *   smp_mb__after_spinlock();
2013      *   UNLOCK rq->lock
2014      *
2015      * [task p]
2016      *   STORE p->state = UNINTERRUPTIBLE     LOAD p->on_rq
2017      *
2018      * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
2019      * __schedule().  See the comment for smp_mb__after_spinlock().
2020      */
2021     smp_rmb();
2022     if (p->on_rq && ttwu_remote(p, wake_flags))
2023         goto stat;
2024
2025 #ifdef CONFIG_SMP
2026     /*
2027      * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
2028      * possible to, falsely, observe p->on_cpu == 0.
2029      *
2030      * One must be running (->on_cpu == 1) in order to remove oneself
2031      * from the runqueue.
2032      *
2033      * __schedule() (switch to task 'p')    try_to_wake_up()
2034      *   STORE p->on_cpu = 1          LOAD p->on_rq
2035      *   UNLOCK rq->lock
2036      *
2037      * __schedule() (put 'p' to sleep)
2038      *   LOCK rq->lock            smp_rmb();
2039      *   smp_mb__after_spinlock();
2040      *   STORE p->on_rq = 0           LOAD p->on_cpu
2041      *
2042      * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
2043      * __schedule().  See the comment for smp_mb__after_spinlock().
2044      */
2045     smp_rmb();
2046
2047     /*
2048      * If the owning (remote) CPU is still in the middle of schedule() with
2049      * this task as prev, wait until its done referencing the task.
2050      *
2051      * Pairs with the smp_store_release() in finish_task().
2052      *
2053      * This ensures that tasks getting woken will be fully ordered against
2054      * their previous state and preserve Program Order.
2055      */
2056     smp_cond_load_acquire(&p->on_cpu, !VAL);
2057
2058     p->sched_contributes_to_load = !!task_contributes_to_load(p);
2059     p->state = TASK_WAKING;
2060
2061     if (p->in_iowait) {
2062         delayacct_blkio_end(p);
2063         atomic_dec(&task_rq(p)->nr_iowait);
2064     }
2065
2066     cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
2067     if (task_cpu(p) != cpu) {
2068         wake_flags |= WF_MIGRATED;
2069         psi_ttwu_dequeue(p);
2070         set_task_cpu(p, cpu);
2071     }
2072
2073 #else /* CONFIG_SMP */
2074
2075     if (p->in_iowait) {
2076         delayacct_blkio_end(p);
2077         atomic_dec(&task_rq(p)->nr_iowait);
2078     }
2079
2080 #endif /* CONFIG_SMP */
2081
2082     ttwu_queue(p, cpu, wake_flags);
2083 stat:
2084     ttwu_stat(p, cpu, wake_flags);
2085 out:
2086     raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2087
2088     return success;
2089 }

The fix

Reproducing the bug

17210964434930.png

Process 0

Check which CPU process 0 runs on before SMP bring-up during system initialization:

Process 0 runs on CPU 0.

ksmd

Search /var/log/messages for the log lines showing ksmd being scheduled onto an isolated core:

grep 'p->comm:ksmd' /var/log/messages -inr --color > ksmd.log
vim ksmd.log


kthreadd

Search /var/log/messages for the log lines showing kthreadd being scheduled onto an isolated core:

grep 'p->comm:kthreadd' /var/log/messages -inr --color > kthreadd.log
vim kthreadd.log


Key code

The locations of the code that emitted the log lines above are shown in the figures below.

select_task_rq_fair:

vim kernel/sched/fair.c +6554

select_idle_sibling:

vim -t select_idle_sibling

select_idle_smt:

// vim -t select_idle_smt
6170 /*
6171  * Scan the local SMT mask for idle CPUs.
6172  */
6173 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
6174 {
6175     int cpu;
6176
6177     if (!static_branch_likely(&sched_smt_present))
6178         return -1;
6179
6180     for_each_cpu(cpu, cpu_smt_mask(target)) {
6181         if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
6182             continue;
6183         if (available_idle_cpu(cpu))
6184             return cpu;
6185     }
6186
6187     return -1;
6188 }


Fixing the bug in the source

A one-line change fixes the problem: when looking for an idle SMT sibling, also require the candidate CPU to lie within the scheduling domain being searched, so hyper-thread topology alone can no longer pull the task onto an isolated CPU:

git show HEAD
commit a90c573c74f55930f3b770ec67d7a84528e8dac8 (HEAD -> master)
Author: QiLiang Yuan <yuanql9@chinatelecom.cn>
Date:   Thu Jul 18 23:18:24 2024 +0800

    sched/fair: Fix wrong cpu selecting from isolated domain

    Signed-off-by: QiLiang Yuan <yuanql9@chinatelecom.cn>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f58f13545450..30f32641a45c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6178,7 +6178,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
                return -1;

        for_each_cpu(cpu, cpu_smt_mask(target)) {
-               if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+               if (!cpumask_test_cpu(cpu, &p->cpus_allowed) || !cpumask_test_cpu(cpu, sched_domain_span(sd)))
                        continue;
                if (available_idle_cpu(cpu))
                        return cpu;

sched_domain_span

Historically, both select_idle_smt() and its cpumask_test_cpu(cpu, sched_domain_span(sd)) check have been removed and re-added upstream several times, most recently about six months ago:

git log -S 'cpumask_test_cpu(cpu, sched_domain_span(sd))' --oneline kernel/sched/fair.c

8aeaffef8c6e sched/fair: Take the scheduling domain into account in select_idle_smt()
3e6efe87cd5c sched/fair: Remove redundant check in select_idle_smt()
3e8c6c9aac42 sched/fair: Remove task_util from effective utilization in feec()
c722f35b513f sched/fair: Bring back select_idle_smt(), but differently
6cd56ef1df39 sched/fair: Remove select_idle_smt()
df3cb4ea1fb6 sched/fair: Fix wrong cpu selecting from isolated domain

Use git show to inspect the full contents of all the commits above:

git log -S 'cpumask_test_cpu(cpu, sched_domain_span(sd))' --oneline kernel/sched/fair.c | awk '{print $1}' | xargs git show > sched_domain_span.log

Working around the bug without changing the source

Include the SMT sibling domain when choosing isolated cores

Check the hyper-thread sibling of logical CPU 0:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
76

As the figures below show, with isolcpus=1-75,77-151 (CPU 0 and its sibling CPU 76 both left as housekeeping CPUs), ksmd is no longer scheduled onto the isolated cores.

17212691514223.png

17212686898314.png

17212686615589.png

Manually migrating ksmd

This approach suits production hosts that cannot be taken offline: ksmd is moved directly onto the non-isolated cores.

  • PF_NO_SETAFFINITY check
    Look up the definition of PF_NO_SETAFFINITY in the kernel source:

# linux/sched.h

#define PF_NO_SETAFFINITY 0x04000000        /* Userland is not allowed to meddle with cpus_allowed */

Find the ksmd process ID on the physical host:

ps aux | grep -i ksmd
root         510 11.1  0.0      0     0 ?        RN   Jan19 26874:37 [ksmd]
root     1979672  0.0  0.0 213952  1600 pts/16   S+   16:13   0:00 grep -i ksmd

Inspect the ksmd process state:

[root@gzinf-kunpeng-55e235e17e57 yql]# cat /proc/510/stat
510 (ksmd) S 2 0 0 0 -1 2097216 0 0 0 0 3 161248078 0 0 25 5 1 0 13 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 69 0 0 0 0 0 0 0 0 0 0 0 0 0

Extract the process flags (field 9 of /proc/<pid>/stat):

[root@gzinf-kunpeng-55e235e17e57 yql]# flags=$(cut -f 9 -d ' ' /proc/510/stat)

Check whether the flags include PF_NO_SETAFFINITY (0x04000000):

[root@gzinf-kunpeng-55e235e17e57 yql]# echo $(($flags & 0x04000000))
0

20240704161455.png

As shown, ksmd's flags do not include PF_NO_SETAFFINITY (0x04000000), so its CPU affinity can be changed with taskset or cgroups to bind it to CPUs that do not host guest VMs.

Set ksmd's affinity to CPU 10:

taskset -pc 10 510

Set ksmd's affinity to the non-isolated cores:

Here the hyper-thread siblings must be carved out as well. Suppose logical CPU 0 and logical CPU 76 sit on the same physical core, i.e. they are siblings in the same SMT scheduling domain. If CPU 0 is an isolated core, then CPU 76 must also be excluded when setting ksmd's affinity.

taskset -pc 0-63 510

With the two steps above, ksmd is moved off the isolated CPUs, and it is guaranteed not to be scheduled onto them again.
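
A hedged end-to-end sketch of the two steps above (the pgrep lookup and the 0,76 housekeeping list are illustrative for this machine's layout):

#!/bin/bash
# move ksmd off the isolated cores, after verifying its affinity may be changed
KSMD_PID=$(pgrep -x ksmd)
FLAGS=$(cut -f 9 -d ' ' /proc/$KSMD_PID/stat)
if [ $(( FLAGS & 0x04000000 )) -eq 0 ]; then   # PF_NO_SETAFFINITY is not set
    taskset -pc 0,76 "$KSMD_PID"               # pin to the housekeeping CPUs
else
    echo "ksmd has PF_NO_SETAFFINITY; affinity cannot be changed" >&2
fi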

One-click migration via CI/CD

This has been packaged as an rpm: installing the ksmd-taskset.rpm package performs the migration in one step. See the ksmd-taskset packaging notes for the detailed procedure.

Stress testing with stress-ng

Clone the stress-ng source from GitHub, build it, and run the workload:

git clone https://github.com/ColinIanKing/stress-ng.git
cd stress-ng
make
make install
stress-ng --timeout 10m --cpu 110 --vm 20 --vm-bytes 16G --vm-madvise mergeable --mmap 110 --mmap-stressful --mmap-mergeable --mmaphuge 25 --sock 110

20240802115921.png

stress-ng --timeout 10m --cpu 110 --vm 20 --vm-bytes 16G --vm-madvise mergeable --mmap 110 --mmap-stressful --mmap-mergeable --mmaphuge 25 --sock 110 --ksm

20240802120305.png

In practice, migrating ksmd to the non-isolated cores had no significant impact; the pidstat results show ksmd consuming about 0.19% of total CPU.
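
The per-process number can be reproduced with pidstat from the sysstat package; a sketch sampling ksmd once per second:

# sample ksmd's CPU usage every second
pidstat -p $(pgrep -x ksmd) 1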

Summary

The source fix will be merged into the next release.

Production machines are hot-patched with kpatch, and ksmd's CPU affinity is set so that it is bound to CPUs that do not host guest VMs.

Beyond that, the ksmd parameters can be tuned to further improve VM performance.

0条评论
0 / 1000
c****e
2文章数
0粉丝数
c****e
2 文章 | 0 粉丝
c****e
2文章数
0粉丝数
c****e
2 文章 | 0 粉丝
原创

内核线程ksmd跑到PMD核心,导致OVS报文延时大

2025-06-09 10:08:18
3
0

Kernel Thread ksmd Running on PMD Isolated Cores Causes High Latency in OVS Packet Processing

内核线程ksmd跑到PMD核心,导致OVS报文延时大

perf sched

在物理机上查看隔离核心cpu 43 的时间片使用情况:

perf sched record -C 43
perf sched timehist

通过下图可以看到内核线程ksmd被周期性调度,每次运行时长12ms左右。
20240704154742.png

ksmd

内核中的Kernel Samepage Merging (KSM)机制被设计用于合并相同的内存页,以节省物理内存。

合并相同的内存页是Kernel Samepage Merging (KSM) 的核心功能,其目的是通过检测并合并相同的内存页来减少物理内存的使用。这一机制特别适用于虚拟化环境,在这些环境中,多个虚拟机可能会运行相同的操作系统和应用程序,从而拥有大量相同的内存页。

ksmd内核参数

ksmd是内核KSM机制的守护进程,它的相关信息可以在/sys/kernel/mm/ksm目录中找到。

/sys/kernel/mm/ksm目录中,您会找到以下文件,每个文件都有特定的功能和用途:

  1. full_scans​:显示自KSM启动以来完成的完整 scan 次数。
  2. max_page_sharing​:控制一个共享页面的最大引用次数。
  3. merge_across_nodes​:控制是否允许在不同NUMA节点之间合并页面。值为1表示允许,0表示不允许。
  4. pages_shared​:显示当前被共享的内存页数。
  5. pages_sharing​:显示当前有多少页在共享同一页。
  6. pages_to_scan​:控制每个 scan 周期中KSM scan 的页数。
  7. pages_unshared​:显示当前未被共享的页数。
  8. pages_volatile​:显示当前不稳定的页数,这些页可能会被KSM合并。
  9. run​:控制KSM的运行状态。写入0停止KSM,写入1启动KSM。
  10. sleep_millisecs​:控制KSM scan 周期的时间间隔,以毫秒为单位。
  11. stable_node_chains​:显示稳定节点链的数量。
  12. stable_node_chains_prune_millisecs​:控制修剪稳定节点链的时间间隔,以毫秒为单位。
  13. stable_node_dups​:显示稳定节点的副本数量。
  14. use_zero_pages​:控制是否使用零页来合并空闲页。值为1表示使用,0表示不使用。

动态调整ksmd参数

ksmd 线程在以下情况下会被调度运行:

启动KSM后

KSM通过将/sys/kernel/mm/ksm/run文件的值设置为1来启动。一旦启动,ksmd线程会按照配置开始周期性地 scan 和合并内存页。

配置的 scan 周期

ksmd线程的调度频率由/sys/kernel/mm/ksm/sleep_millisecs文件中配置的 scan 周期决定。该文件的值是一个以毫秒为单位的时间间隔,表示ksmd线程在每次 scan 之间的睡眠时间。例如:

# 设置 scan 周期为100毫秒
echo 100 | sudo tee /sys/kernel/mm/ksm/sleep_millisecs

在这种配置下,ksmd线程会每隔100毫秒被调度运行一次,以 scan 和合并内存页。

每次 scan 的页数

ksmd线程在每次被调度运行时,会根据/sys/kernel/mm/ksm/pages_to_scan文件中配置的值来决定每次 scan 的页数。例如:

# 设置每次 scan 的页数为256页
echo 256 | sudo tee /sys/kernel/mm/ksm/pages_to_scan

这意味着每次ksmd线程被调度运行时,它将 scan 256个内存页。

housekeeping_isolcpus_setup

在grub中传入isolcpus=参数时,kernel/sched/isolation.c文件中的housekeeping_isolcpus_setup函数会被调用。

// vim kernel/sched/isolation.c +126

126 static int __init housekeeping_isolcpus_setup(char *str)  // 定义一个静态的初始化函数 housekeeping_isolcpus_setup ,接收一个字符指针参数
127 {
128     unsigned int flags = 0;  // 定义一个无符号整型变量 flags 并初始化为 0
129
130     while (isalpha(*str)) {  // 当字符串指针所指字符为字母时进行循环
131         if (!strncmp(str, "nohz,", 5)) {  // 如果字符串开头为 "nohz,"
132             str += 5;  // 字符串指针向后移动 5 个位置
133             flags |= HK_FLAG_TICK;  // 对 flags 进行位或操作,设置相应标志位
134             continue;  // 继续下一次循环
135         }
136
137         if (!strncmp(str, "domain,", 7)) {  // 如果字符串开头为 "domain,"
138             str += 7;  // 字符串指针向后移动 7 个位置
139             flags |= HK_FLAG_DOMAIN;  // 对 flags 进行位或操作,设置相应标志位
140             continue;  // 继续下一次循环
141         }
142
143         pr_warn("isolcpus: Error, unknown flag\n");  // 打印警告信息
144         return 0;  // 返回 0
145     }
146
147     /* Default behaviour for isolcpus without flags */  // 对于没有标志的 isolcpus 的默认行为
148     if (!flags)  // 如果 flags 为 0
149         flags |= HK_FLAG_DOMAIN;  // 对 flags 进行位或操作,设置相应标志位
150
151     return housekeeping_setup(str, flags);  // 返回 housekeeping_setup 函数的执行结果
152 }
153 __setup("isolcpus=", housekeeping_isolcpus_setup);  // 定义一个设置项 "isolcpus=" ,关联处理函数 housekeeping_isolcpus_setup

内核调度器

ksmd线程作为内核线程,由内核调度器进行调度。内核调度器会根据系统的整体 load和ksmd线程的优先级来决定何时运行ksmd线程。尽管ksmd线程的优先级通常较低,但在配置的 scan 周期到达时,调度器会尝试调度它运行。

start_kernel  // 0号线程
    -> sched_init // 调度器初始化
        -> init_sched_fair_class // cfs_class初始化
    -> reset_init
        -> kernel_init  // 1号线程, sched_normal
                -> kernel_init_freeable
                        -> wait_for_completion(&kthreadd_done) // Wait until kthreadd is all set-up.
                        -> smp_init // 多核初始化
                        -> sched_init_smp // 多核调度器初始化
                                -> sched_init_domains
                                        -> cpumask_and // 排除isolated cpus
                                -> init_sched_rt_class
                                -> init_sched_vms_class
                                -> init_sched_dl_class
                        -> do_basic_setup
                                -> do_initcalls
                                        -> subsys_initcall(ksm_init)    // ksmd, sched_normal
        -> kthreadd  // 2号线程,sched_normal
        -> complete(&kthreadd_done)
taskset -pc 2
pid 2's current affinity list: 0-95
taskset -pc 510
pid 510's current affinity list: 0-95

rest_init

// vim init/main.c +397

397 static noinline void __ref rest_init(void)  // 定义静态内联函数 rest_init
398 {
399     struct task_struct *tsk;  // 定义任务结构体指针 tsk
400     int pid;  // 定义进程 ID 变量 pid
401
402     rcu_scheduler_starting();  // 启动 RCU 调度器
403     /*
404      * We need to spawn init first so that it obtains pid 1, however  // 我们首先需要生成 init 进程,以便它获得 PID 1,然而
405      * the init task will end up wanting to create kthreads, which, if  // init 任务最终会想要创建内核线程,如果
406      * we schedule it before we create kthreadd, will OOPS.  // 我们在创建 kthreadd 之前对其进行调度,将会出错
407      */
408     pid = kernel_thread(kernel_init, NULL, CLONE_FS);  // 创建内核线程 kernel_init,并获取其进程 ID
409     /*
410      * Pin init on the boot CPU. Task migration is not properly working  // 将 init 固定在启动 CPU 上。任务迁移尚未正常工作
411      * until sched_init_smp() has been run. It will set the allowed  // 直到 sched_init_smp() 运行。将设置
412      * CPUs for init to the non isolated CPUs.  //  init 到非隔离的 CPU 上
413      */
414     rcu_read_lock();  // 读取锁
415     tsk = find_task_by_pid_ns(pid, &init_pid_ns);  // 通过 PID 和命名空间查找任务
416     set_cpus_allowed_ptr(tsk, cpumask_of(smp_processor_id()));  // 设置允许的 CPU 掩码
417     rcu_read_unlock();  // 读取解锁
418
419     numa_default_policy();  // 设置 NUMA 默认策略
420     pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);  // 创建 kthreadd 内核线程,并获取其进程 ID
421     rcu_read_lock();  // 读取锁
422     kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);  // 通过 PID 和命名空间查找 kthreadd 任务
423     rcu_read_unlock();  // 读取解锁
424
425     /*
426      * Enable might_sleep() and smp_processor_id() checks.  // 启用 might_sleep() 和 smp_processor_id() 检查
427      * They cannot be enabled earlier because with CONFIG_PREEMPT=y  // 它们不能更早启用,因为在 CONFIG_PREEMPT=y 时
428      * kernel_thread() would trigger might_sleep() splats. With  // kernel_thread() 会触发 might_sleep() 问题。在
429      * CONFIG_PREEMPT_VOLUNTARY=y the init task might have scheduled  // CONFIG_PREEMPT_VOLUNTARY=y 时,init 任务可能已经调度
430      * already, but it's stuck on the kthreadd_done completion.  // 但它会卡在 kthreadd_done 完成处
431      */
432     system_state = SYSTEM_SCHEDULING;  // 设置系统状态为 SYSTEM_SCHEDULING
433
434     complete(&kthreadd_done);  // 完成 kthreadd_done 操作
435
436     /*
437      * The boot idle thread must execute schedule()  // 启动空闲线程必须执行 schedule()
438      * at least once to get things moving:  // 至少一次以启动进程
439      */
440     schedule_preempt_disabled();  // 执行调度,禁用抢占
441     /* Call into cpu_idle with preempt disabled */  // 以禁用抢占的方式调用 cpu_idle
442     cpu_startup_entry(CPUHP_ONLINE);  // CPU 启动入口
443 }

sched_init_numa

1385 void sched_init_numa(void)  // discover the unique NUMA distances and build per-level CPU masks
1386 {
1387     int next_distance, curr_distance = node_distance(0, 0);  // start from the node-local distance
1388     struct sched_domain_topology_level *tl;  // the topology table to be built
1389     int level = 0;  // number of unique distances found so far
1390     int i, j, k;
1391
1392     sched_domains_numa_distance = kzalloc(sizeof(int) * (nr_node_ids + 1), GFP_KERNEL);  // allocate the distance table
1393     if (!sched_domains_numa_distance)  // allocation failed
1394         return;
1395
1396     /* Includes NUMA identity node at level 0. */
1397     sched_domains_numa_distance[level++] = curr_distance;  // record the identity distance, bump level
1398     sched_domains_numa_levels = level;
1399
1400     /*
1401      * O(nr_nodes^2) deduplicating selection sort -- in order to find the
1402      * unique distances in the node_distance() table.
1403      *
1404      * Assumes node_distance(0,j) includes all distances in
1405      * node_distance(i,j) in order to avoid cubic time.
1406      */
1407     next_distance = curr_distance;  // initialize next_distance
1408     for (i = 0; i < nr_node_ids; i++) {
1409         for (j = 0; j < nr_node_ids; j++) {
1410             for (k = 0; k < nr_node_ids; k++) {
1411                 int distance = node_distance(i, k);  // distance from node i to node k
1412
1413                 if (distance > curr_distance &&  // candidate for the next unique distance:
1414                     (distance < next_distance ||  // the smallest distance above curr_distance
1415                      next_distance == curr_distance))
1416                     next_distance = distance;
1417
1418                 /*
1419                  * While not a strong assumption it would be nice to know
1420                  * about cases where if node A is connected to B, B is not
1421                  * equally connected to A.
1422                  */
1423                 if (sched_debug() && node_distance(k, i) != distance)  // asymmetric distance table
1424                     sched_numa_warn("Node-distance not symmetric");
1425
1426                 if (sched_debug() && i && !find_numa_distance(distance))  // distance missing from node 0's row
1427                     sched_numa_warn("Node-0 not representative");
1428             }
1429             if (next_distance != curr_distance) {  // found another unique distance
1430                 sched_domains_numa_distance[level++] = next_distance;  // record it, bump level
1431                 sched_domains_numa_levels = level;
1432                 curr_distance = next_distance;
1433             } else break;  // no larger distance left
1434         }
1435
1436         /*
1437          * In case of sched_debug() we verify the above assumption.
1438          */
1439         if (!sched_debug())
1440             break;
1441     }
1442
1443     /*
1444      * 'level' contains the number of unique distances
1445      *
1446      * The sched_domains_numa_distance[] array includes the actual distance
1447      * numbers.
1448      */
1449
1450     /*
1451      * Here, we should temporarily reset sched_domains_numa_levels to 0.
1452      * If it fails to allocate memory for array sched_domains_numa_masks[][],
1453      * the array will contain less then 'level' members. This could be
1454      * dangerous when we use it to iterate array sched_domains_numa_masks[][]
1455      * in other functions.
1456      *
1457      * We reset it to 'level' at the end of this function.
1458      */
1459     sched_domains_numa_levels = 0;  // temporarily reset; see the comment above
1460
1461     sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);  // one mask array per level
1462     if (!sched_domains_numa_masks)
1463         return;
1464
1465     /*
1466      * Now for each level, construct a mask per node which contains all
1467      * CPUs of nodes that are that many hops away from us.
1468      */
1469     for (i = 0; i < level; i++) {
1470         sched_domains_numa_masks[i] =
1471             kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);  // one mask per node at this level
1472         if (!sched_domains_numa_masks[i])
1473             return;
1474
1475         for (j = 0; j < nr_node_ids; j++) {
1476             struct cpumask *mask = kzalloc(cpumask_size(), GFP_KERNEL);
1477             if (!mask)
1478                 return;
1479
1480             sched_domains_numa_masks[i][j] = mask;
1481
1482             for_each_node(k) {
1483                 if (node_distance(j, k) > sched_domains_numa_distance[i])  // node k is too far for this level
1484                     continue;
1485
1486                 cpumask_or(mask, mask, cpumask_of_node(k));  // add node k's CPUs to the mask
1487             }
1488         }
1489     }
1490
1491     /* Compute default topology size */
1492     for (i = 0; sched_domain_topology[i].mask; i++);
1493
1494     tl = kzalloc((i + level + 1) *  // room for the default levels, NODE, and the NUMA levels
1495             sizeof(struct sched_domain_topology_level), GFP_KERNEL);
1496     if (!tl)
1497         return;
1498
1499     /*
1500      * Copy the default topology bits..
1501      */
1502     for (i = 0; sched_domain_topology[i].mask; i++)
1503         tl[i] = sched_domain_topology[i];
1504
1505     /*
1506      * Add the NUMA identity distance, aka single NODE.
1507      */
1508     tl[i++] = (struct sched_domain_topology_level){
1509         .mask = sd_numa_mask,
1510         .numa_level = 0,
1511         SD_INIT_NAME(NODE)
1512     };
1513
1514     /*
1515      * .. and append 'j' levels of NUMA goodness.
1516      */
1517     for (j = 1; j < level; i++, j++) {
1518         tl[i] = (struct sched_domain_topology_level){
1519             .mask = sd_numa_mask,
1520             .sd_flags = cpu_numa_flags,
1521             .flags = SDTL_OVERLAP,
1522             .numa_level = j,
1523             SD_INIT_NAME(NUMA)
1524         };
1525     }
1526
1527     sched_domain_topology = tl;  // install the extended topology table
1528
1529     sched_domains_numa_levels = level;  // restore the real level count
1530     sched_max_numa_distance = sched_domains_numa_distance[level - 1];  // largest distance seen
1531
1532     init_numa_topology_type();  // classify the NUMA topology, see below
1533 }

init_numa_topology_type

1333 /*
1334  * A system can have three types of NUMA topology:
1335  * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
1336  * NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
1337  * NUMA_BACKPLANE: nodes can reach other nodes through a backplane
1338  *
1339  * The difference between a glueless mesh topology and a backplane
1340  * topology lies in whether communication between not directly
1341  * connected nodes goes through intermediary nodes (where programs
1342  * could run), or through backplane controllers. This affects
1343  * placement of programs.
1344  *
1345  * The type of topology can be discerned with the following tests:
1346  * - If the maximum distance between any nodes is 1 hop, the system
1347  *   is directly connected.
1348  * - If for two nodes A and B, located N > 1 hops away from each other,
1349  *   there is an intermediary node C, which is < N hops away from both
1350  *   nodes A and B, the system is a glueless mesh.
1351  */
1352 static void init_numa_topology_type(void)
1353 {
1354     int a, b, c, n;
1355
1356     n = sched_max_numa_distance;  // the largest node distance in the system
1357
1358     if (sched_domains_numa_levels <= 2) {  // identity level plus at most one more: no multi-hop paths
1359         sched_numa_topology_type = NUMA_DIRECT;
1360         return;
1361     }
1362
1363     for_each_online_node(a) {
1364         for_each_online_node(b) {
1365             /* Find two nodes furthest removed from each other. */
1366             if (node_distance(a, b) < n)
1367                 continue;
1368
1369             /* Is there an intermediary node between a and b? */
1370             for_each_online_node(c) {
1371                 if (node_distance(a, c) < n &&
1372                     node_distance(b, c) < n) {  // c lies between a and b
1373                     sched_numa_topology_type =
1374                             NUMA_GLUELESS_MESH;
1375                     return;
1376                 }
1377             }
1378
1379             sched_numa_topology_type = NUMA_BACKPLANE;  // no intermediary node exists
1380             return;
1381         }
1382     }
1383 }
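
The distance table this code deduplicates is the same node distance data exposed through sysfs, so the levels can be sanity-checked on a live machine (standard sysfs paths; numactl is optional):

# Distances from node 0 to every node (10 = local, larger = more hops);
# sched_init_numa() derives its levels from these values
cat /sys/devices/system/node/node0/distance

# Full node distance matrix, if numactl is installed
numactl --hardware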

kernel_thread

// vim kernel/fork.c +2348

2348 /*
2349  * Create a kernel thread.
2350  */
2351 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
2352 {
2353     return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
2354         (unsigned long)arg, NULL, NULL, 0);
2355 }

_do_fork

// vim kernel/fork.c +2257

2257 /*
2258  *  Ok, this is the main fork-routine.
2259  *
2260  * It copies the process, and if successful kick-starts
2261  * it and waits for it to finish using the VM if required.
2262  */
2263 long _do_fork(unsigned long clone_flags,
2264           unsigned long stack_start,
2265           unsigned long stack_size,
2266           int __user *parent_tidptr,
2267           int __user *child_tidptr,
2268           unsigned long tls)
2269 {
2270     struct completion vfork;
2271     struct pid *pid;
2272     struct task_struct *p;
2273     int trace = 0;
2274     long nr;
2275
2276     /*
2277      * Determine whether and which event to report to ptracer.  When
2278      * called from kernel_thread or CLONE_UNTRACED is explicitly
2279      * requested, no event is reported; otherwise, report if the event
2280      * for the type of forking is enabled.
2281      */
2282     if (!(clone_flags & CLONE_UNTRACED)) {
2283         if (clone_flags & CLONE_VFORK)
2284             trace = PTRACE_EVENT_VFORK;
2285         else if ((clone_flags & CSIGNAL) != SIGCHLD)
2286             trace = PTRACE_EVENT_CLONE;
2287         else
2288             trace = PTRACE_EVENT_FORK;
2289
2290         if (likely(!ptrace_event_enabled(current, trace)))
2291             trace = 0;
2292     }
2293
2294     p = copy_process(clone_flags, stack_start, stack_size,
2295              child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
2296     add_latent_entropy();
2297
2298     if (IS_ERR(p))
2299         return PTR_ERR(p);
2300
2301     /*
2302      * Do this prior waking up the new thread - the thread pointer
2303      * might get invalid after that point, if the thread exits quickly.
2304      */
2305     trace_sched_process_fork(current, p);
2306
2307     pid = get_task_pid(p, PIDTYPE_PID);
2308     nr = pid_vnr(pid);
2309
2310     if (clone_flags & CLONE_PARENT_SETTID)
2311         put_user(nr, parent_tidptr);
2312
2313     if (clone_flags & CLONE_VFORK) {
2314         p->vfork_done = &vfork;
2315         init_completion(&vfork);
2316         get_task_struct(p);
2317     }
2318
2319     wake_up_new_task(p);
2320
2321     /* forking complete and child started to run, tell ptracer */
2322     if (unlikely(trace))
2323         ptrace_event_pid(trace, pid);
2324
2325     if (clone_flags & CLONE_VFORK) {
2326         if (!wait_for_vfork_done(p, &vfork))
2327             ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
2328     }
2329
2330     put_pid(pid);
2331     return nr;
2332 }

wake_up_new_task

// vim kernel/sched/core.c +2420

2419 /**
2420  * wake_up_new_task - wake up a newly created task for the first time.
2421  *
2422  * This function will do some initial scheduler statistics housekeeping
2423  * that must be done for every newly created context, then puts the task
2424  * on the runqueue and wakes it.
2425  */
2426 void wake_up_new_task(struct task_struct *p)
2427 {
2428     struct rq_flags rf;
2429     struct rq *rq;
2430
2431     raw_spin_lock_irqsave(&p->pi_lock, rf.flags);  // lock the task and save IRQ state
2432     p->state = TASK_RUNNING;  // mark the new task runnable
2433 #ifdef CONFIG_SMP
2434     /*
2435      * Fork balancing, do it here and not earlier because:
2436      *  - cpus_allowed can change in the fork path
2437      *  - any previously selected CPU might disappear through hotplug
2438      *
2439      * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
2440      * as we're not fully set-up yet.
2441      */
2442     p->recent_used_cpu = task_cpu(p);  // remember the CPU we were forked on
2443     rseq_migrate(p);
2444     __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));  // pick the CPU the new task will start on
2445 #endif
2446     rq = __task_rq_lock(p, &rf);  // lock the chosen runqueue
2447     update_rq_clock(rq);
2448     post_init_entity_util_avg(&p->se);  // seed the entity's utilization average
2449
2450     activate_task(rq, p, ENQUEUE_NOCLOCK);  // enqueue the task
2451     p->on_rq = TASK_ON_RQ_QUEUED;
2452     trace_sched_wakeup_new(p);
2453     check_preempt_curr(rq, p, WF_FORK);  // maybe preempt the currently running task
2454 #ifdef CONFIG_SMP
2455     if (p->sched_class->task_woken) {
2456         /*
2457          * Nothing relies on rq->lock after this, so its fine to
2458          * drop it.
2459          */
2460         rq_unpin_lock(rq, &rf);
2461         p->sched_class->task_woken(rq, p);
2462         rq_repin_lock(rq, &rf);
2463     }
2464 #endif
2465     task_rq_unlock(rq, p, &rf);
2466 }

select_task_rq_fair

Call sites of select_task_rq() (vim quickfix list, :cl):
9 kernel/sched/core.c:2066: <<try_to_wake_up>> cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
10 kernel/sched/core.c:2444: <<wake_up_new_task>> __set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
11 kernel/sched/core.c:2994: <<sched_exec>> dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
// vim CTKernel/kernel/sched/fair.c +6437

6424 /*
6425  * select_task_rq_fair: Select the target runqueue for a waking task in
6426  * domains that have the 'sd_flag' flag set; in practice this is
6427  * SD_BALANCE_WAKE, SD_BALANCE_FORK, or SD_BALANCE_EXEC.
6428  *
6429  * Balances load by selecting the idlest CPU in the idlest group, or an
6430  * idle sibling CPU if the domain has SD_WAKE_AFFINE set.
6431  *
6432  * Returns the target CPU number. Preemption must be disabled.
6433  */
6434 static int
6435 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
6436 {
6437     struct sched_domain *tmp, *sd = NULL;
6438     int cpu = smp_processor_id();  // CPU the waker is running on
6439     int new_cpu = prev_cpu;  // default: keep the task on its previous CPU
6440     int want_affine = 0;
6441     int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
6442
6443     if (sd_flag & SD_BALANCE_WAKE) {
6444         record_wakee(p);
6445         want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)  // prefer a cache-affine wakeup?
6446                   && cpumask_test_cpu(cpu, &p->cpus_allowed);
6447     }
6448
6449     rcu_read_lock();
6450     for_each_domain(cpu, tmp) {  // walk up this CPU's scheduling domains
6451         if (!(tmp->flags & SD_LOAD_BALANCE))  // domain does not load-balance
6452             break;
6453
6454         /*
6455          * If both 'cpu' and 'prev_cpu' are part of this domain,
6456          * 'cpu' is a valid SD_WAKE_AFFINE target.
6457          */
6458         if (want_affine && (tmp->flags & SD_WAKE_AFFINE) &&
6459             cpumask_test_cpu(prev_cpu, sched_domain_span(tmp))) {  // and prev_cpu is inside the domain
6460             if (cpu != prev_cpu)
6461                 new_cpu = wake_affine(tmp, p, cpu, prev_cpu, sync);
6462
6463             sd = NULL; /* Prefer wake_affine over balance flags */
6464             break;
6465         }
6466
6467         if (tmp->flags & sd_flag)  // highest domain carrying the requested balance flag
6468             sd = tmp;
6469         else if (!want_affine)
6470             break;
6471     }
6472
6473     if (unlikely(sd)) {
6474         /* Slow path */
6475         new_cpu = find_idlest_cpu(sd, p, cpu, prev_cpu, sd_flag);  // SD_BALANCE_FORK/EXEC take this path
6476     } else if (sd_flag & SD_BALANCE_WAKE) {
6477         /* Fast path */
6478
6479         new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);  // pick an idle CPU in the LLC domain
6480
6481         if (want_affine)
6482             current->recent_used_cpu = cpu;
6483     }
6484     rcu_read_unlock();
6485
6486     return new_cpu;  // the CPU the task will be woken on
6487 }
// vim CTKernel/include/linux/sched/topology.h +20

20 #define SD_LOAD_BALANCE     0x0001  /* Do load balancing on this domain. */
21 #define SD_BALANCE_NEWIDLE  0x0002  /* Balance when about to become idle */
22 #define SD_BALANCE_EXEC     0x0004  /* Balance on exec */
23 #define SD_BALANCE_FORK     0x0008  /* Balance on fork, clone */
24 #define SD_BALANCE_WAKE     0x0010  /* Balance on wakeup */
25 #define SD_WAKE_AFFINE      0x0020  /* Wake task to waking CPU */
26 #define SD_ASYM_CPUCAPACITY 0x0040  /* Groups have different max cpu capacities */
27 #define SD_SHARE_CPUCAPACITY    0x0080  /* Domain members share cpu capacity */
28 #define SD_SHARE_POWERDOMAIN    0x0100  /* Domain members share power domain */
29 #define SD_SHARE_PKG_RESOURCES  0x0200  /* Domain members share cpu pkg resources */
30 #define SD_SERIALIZE        0x0400  /* Only a single load balancing instance */
31 #define SD_ASYM_PACKING     0x0800  /* Place busy groups earlier in the domain */
32 #define SD_PREFER_SIBLING   0x1000  /* Prefer to place tasks in a sibling domain */
33 #define SD_OVERLAP      0x2000  /* sched_domains of this level overlap */
34 #define SD_NUMA         0x4000  /* cross-node balancing */
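
On kernels of this generation built with CONFIG_SCHED_DEBUG, each CPU's domain hierarchy and its flag bitmask are exported under /proc/sys/kernel/sched_domain, which makes it easy to check which of the flags above are set at each level (newer kernels moved this to /sys/kernel/debug/sched/domains/):

# Domain name and flags bitmask for CPU 0, one directory per level
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name \
       /proc/sys/kernel/sched_domain/cpu0/domain*/flags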

select_idle_sibling

6258 /*
6259  * Try and locate an idle core/thread in the LLC cache domain.
6260  */
6261 static int select_idle_sibling(struct task_struct *p, int prev, int target)
6262 {
6263     struct sched_domain *sd;
6264     int i, recent_used_cpu;
6265
6266     if (available_idle_cpu(target))  // target itself is idle
6267         return target;
6268
6269     /*
6270      * If the previous CPU is cache affine and idle, don't be stupid:
6271      */
6272     if (prev != target && cpus_share_cache(prev, target) && available_idle_cpu(prev))
6273         return prev;
6274
6275     /* Check a recently used CPU as a potential idle candidate: */
6276     recent_used_cpu = p->recent_used_cpu;
6277     if (recent_used_cpu != prev &&
6278         recent_used_cpu != target &&
6279         cpus_share_cache(recent_used_cpu, target) &&
6280         available_idle_cpu(recent_used_cpu) &&
6281         cpumask_test_cpu(p->recent_used_cpu, &p->cpus_allowed)) {
6282         /*
6283          * Replace recent_used_cpu with prev as it is a potential
6284          * candidate for the next wake:
6285          */
6286         p->recent_used_cpu = prev;
6287         return recent_used_cpu;
6288     }
6289
6290     sd = rcu_dereference(per_cpu(sd_llc, target));  // LLC domain of the target CPU
6291     if (!sd)
6292         return target;
6293
6294     i = select_idle_core(p, sd, target);  // 1) look for a fully idle core
6295     if ((unsigned)i < nr_cpumask_bits)
6296         return i;
6297
6298     i = select_idle_cpu(p, sd, target);  // 2) look for any idle CPU in the LLC
6299     if ((unsigned)i < nr_cpumask_bits)
6300         return i;
6301
6302     i = select_idle_smt(p, sd, target);  // 3) fall back to an idle SMT sibling of target
6303     if ((unsigned)i < nr_cpumask_bits)
6304         return i;
6305
6306     return target;
6307 }
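
These lookups are meant to stay inside sd_llc, the last-level-cache domain of target; as shown later, select_idle_smt() failing to check that domain is exactly the bug. Which CPUs share the LLC can be read from the cacheinfo sysfs tree (index3 is usually the LLC; check its level file if unsure):

# Cache level and the CPUs sharing it with CPU 0
cat /sys/devices/system/cpu/cpu0/cache/index3/level
cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list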

kthreadd

// vim kernel/kthread.c +569

569 int kthreadd(void *unused)
570 {
571     struct task_struct *tsk = current;
572
573     /* Setup a clean context for our children to inherit. */
574     set_task_comm(tsk, "kthreadd");
575     ignore_signals(tsk);
576     set_cpus_allowed_ptr(tsk, cpu_all_mask);  // kthreadd itself may run on any CPU
577     set_mems_allowed(node_states[N_MEMORY]);  // and allocate from any memory node
578
579     current->flags |= PF_NOFREEZE;
580     cgroup_init_kthreadd();
581
582     for (;;) {  // main loop: wait for kthread creation requests
583         set_current_state(TASK_INTERRUPTIBLE);
584         if (list_empty(&kthread_create_list))  // nothing to create
585             schedule();  // sleep until woken
586         __set_current_state(TASK_RUNNING);
587
588         spin_lock(&kthread_create_lock);
589         while (!list_empty(&kthread_create_list)) {
590             struct kthread_create_info *create;
591
592             create = list_entry(kthread_create_list.next,  // take the first pending request
593                         struct kthread_create_info, list);
594             list_del_init(&create->list);
595             spin_unlock(&kthread_create_lock);
596
597             create_kthread(create);  // fork the new kernel thread
598
599             spin_lock(&kthread_create_lock);
600         }
601         spin_unlock(&kthread_create_lock);
602     }
603
604     return 0;
605 }
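
Every kernel thread, ksmd included, is created by and parented to kthreadd (PID 2), and as the code above shows, kthreadd itself is allowed on every CPU (cpu_all_mask). A quick look on a running system:

# Kernel threads are children of kthreadd (PID 2); PSR is the last CPU used
ps --ppid 2 -o pid,psr,comm | head

# kthreadd's affinity spans all CPUs
taskset -pc 2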

create_kthread

271 static void create_kthread(struct kthread_create_info *create)
272 {
273     int pid;
274
275 #ifdef CONFIG_NUMA
276     current->pref_node_fork = create->node;  // preferred NUMA node for the new thread
277 #endif
278     /* We want our own signal handler (we take no signals by default). */
279     pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);  // fork the new kernel thread
280     if (pid < 0) {  // creation failed
281         /* If user was SIGKILLed, I release the structure. */
282         struct completion *done = xchg(&create->done, NULL);
283         __set_current_state(TASK_RUNNING);  // set the task state back to running
284         if (!done) {  // the requester is gone
285             kfree(create);
286             return;
287         }
288         create->result = ERR_PTR(pid);  // report the error to the waiter
289         complete(done);
290     }
291 }

ksm_init

// vim mm/ksm.c +3172

3172 static int __init ksm_init(void)
3173 {
3174     struct task_struct *ksm_thread;
3175     int err;
3176
3177     /* The correct value depends on page size and endianness */
3178     zero_checksum = calc_checksum(ZERO_PAGE(0));
3179     /* Default to false for backwards compatibility */
3180     ksm_use_zero_pages = false;
3181
3182     err = ksm_slab_init();
3183     if (err)
3184         goto out;
3185
3186     ksm_thread = kthread_run(ksm_scan_thread, NULL, "ksmd");
3187     if (IS_ERR(ksm_thread)) {
3188         pr_err("ksm: creating kthread failed\n");
3189         err = PTR_ERR(ksm_thread);
3190         goto out_free;
3191     }
3192
3193 #ifdef CONFIG_SYSFS
3194     err = sysfs_create_group(mm_kobj, &ksm_attr_group);
3195     if (err) {
3196         pr_err("ksm: register sysfs failed\n");
3197         kthread_stop(ksm_thread);
3198         goto out_free;
3199     }
3200 #else
3201     ksm_run = KSM_RUN_MERGE;    /* no way for user to start it */
3202
3203 #endif /* CONFIG_SYSFS */
3204
3205 #ifdef CONFIG_MEMORY_HOTREMOVE
3206     /* There is no significance to this priority 100 */
3207     hotplug_memory_notifier(ksm_memory_callback, 100);
3208 #endif
3209     return 0;
3210
3211 out_free:
3212     ksm_slab_free();
3213 out:
3214     return err;
3215 }
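
ksm_init() is an ordinary initcall, so the ksmd thread exists on any kernel built with CONFIG_KSM, whether or not merging has been enabled through sysfs. To confirm it is there and see which CPU it last ran on:

# PSR is the processor ksmd last ran on; on an isolcpus= machine this
# should stay within the housekeeping CPUs
ps -o pid,psr,ni,comm -p "$(pgrep -x ksmd)"
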
kthread_run

// vim include/linux/kthread.h +34

34 /**
35  * kthread_run - create and wake a thread.
36  * @threadfn: the function to run until signal_pending(current).
37  * @data: data ptr for @threadfn.
38  * @namefmt: printf-style name for the thread.
39  *
40  * Description: Convenient wrapper for kthread_create() followed by
41  * wake_up_process().  Returns the kthread or ERR_PTR(-ENOMEM).
42  */
43 #define kthread_run(threadfn, data, namefmt, ...)              \
44 ({                                     \
45     struct task_struct *__k                        \
46         = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
47     if (!IS_ERR(__k))                          \
48         wake_up_process(__k);                      \
49     __k;                                   \
50 })

kthread_create

// vim include/linux/kthread.h +14

14 /**
15  * kthread_create - create a kthread on the current node
16  * @threadfn: the function to run in the thread
17  * @data: data pointer for @threadfn()
18  * @namefmt: printf-style format string for the thread name
19  * @arg...: arguments for @namefmt.
20  *
21  * This macro will create a kthread on the current node, leaving it in
22  * the stopped state.  This is just a helper for kthread_create_on_node();
23  * see the documentation there for more details.
24  */
25 #define kthread_create(threadfn, data, namefmt, arg...) \
26     kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

kthread_create_on_node

// kernel/kthread.c +357

357 /**
358  * kthread_create_on_node - create a kthread.
359  * @threadfn: the function to run until signal_pending(current).
360  * @data: data ptr for @threadfn.
361  * @node: task and thread structures for the thread are allocated on this node
362  * @namefmt: printf-style name for the thread.
363  *
364  * Description: This helper function creates and names a kernel
365  * thread.  The thread will be stopped: use wake_up_process() to start
366  * it.  See also kthread_run().  The new thread has SCHED_NORMAL policy and
367  * is affine to all CPUs.
368  *
369  * If thread is going to be bound on a particular cpu, give its node
370  * in @node, to get NUMA affinity for kthread stack, or else give NUMA_NO_NODE.
371  * When woken, the thread will run @threadfn() with @data as its
372  * argument. @threadfn() can either call do_exit() directly if it is a
373  * standalone thread for which no one will call kthread_stop(), or
374  * return when 'kthread_should_stop()' is true (which means
375  * kthread_stop() has been called).  The return value should be zero
376  * or a negative error number; it will be passed to kthread_stop().
377  *
378  * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
379  */
380 struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
381                        void *data, int node,
382                        const char namefmt[],
383                        ...)
384 {
385     struct task_struct *task;
386     va_list args;
387
388     va_start(args, namefmt);
389     task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
390     va_end(args);
391
392     return task;
393 }
394 EXPORT_SYMBOL(kthread_create_on_node);

__kthread_create_on_node

293 static __printf(4, 0)
294 struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
295                             void *data, int node,
296                             const char namefmt[],
297                             va_list args)
298 {
299     DECLARE_COMPLETION_ONSTACK(done);
300     struct task_struct *task;
301     struct kthread_create_info *create = kmalloc(sizeof(*create),
302                              GFP_KERNEL);
303
304     if (!create)
305         return ERR_PTR(-ENOMEM);
306     create->threadfn = threadfn;
307     create->data = data;
308     create->node = node;
309     create->done = &done;
310
311     spin_lock(&kthread_create_lock);
312     list_add_tail(&create->list, &kthread_create_list);
313     spin_unlock(&kthread_create_lock);
314
315     wake_up_process(kthreadd_task);
316     /*
317      * Wait for completion in killable state, for I might be chosen by
318      * the OOM killer while kthreadd is trying to allocate memory for
319      * new kernel thread.
320      */
321     if (unlikely(wait_for_completion_killable(&done))) {
322         /*
323          * If I was SIGKILLed before kthreadd (or new kernel thread)
324          * calls complete(), leave the cleanup of this structure to
325          * that thread.
326          */
327         if (xchg(&create->done, NULL))
328             return ERR_PTR(-EINTR);
329         /*
330          * kthreadd (or new kernel thread) will call complete()
331          * shortly.
332          */
333         wait_for_completion(&done);
334     }
335     task = create->result;
336     if (!IS_ERR(task)) {
337         static const struct sched_param param = { .sched_priority = 0 };
338         char name[TASK_COMM_LEN];
339
340         /*
341          * task is already visible to other tasks, so updating
342          * COMM must be protected.
343          */
344         vsnprintf(name, sizeof(name), namefmt, args);
345         set_task_comm(task, name);
346         /*
347          * root may have changed our (kthreadd's) priority or CPU mask.
348          * The kernel thread should not inherit these properties.
349          */
350         sched_setscheduler_nocheck(task, SCHED_NORMAL, &param);
351         set_cpus_allowed_ptr(task, cpu_all_mask);
352     }
353     kfree(create);
354     return task;
355 }

wake_up_process

2141 /**
2142  * wake_up_process - Wake up a specific process
2143  * @p: The process to be woken up.
2144  *
2145  * Attempt to wake up the nominated process and move it to the set of runnable
2146  * processes.
2147  *
2148  * Return: 1 if the process was woken up, 0 if it was already running.
2149  *
2150  * This function executes a full memory barrier before accessing the task state.
2151  */
2152 int wake_up_process(struct task_struct *p)
2153 {
2154     return try_to_wake_up(p, TASK_NORMAL, 0);
2155 }
2156 EXPORT_SYMBOL(wake_up_process);

try_to_wake_up

1959 /**
1960  * try_to_wake_up - wake up a thread
1961  * @p: the thread to be awakened
1962  * @state: the mask of task states that can be woken
1963  * @wake_flags: wake modifier flags (WF_*)
1964  *
1965  * If (@state & @p->state) @p->state = TASK_RUNNING.
1966  *
1967  * If the task was not queued/runnable, also place it back on a runqueue.
1968  *
1969  * Atomic against schedule() which would dequeue a task, also see
1970  * set_current_state().
1971  *
1972  * This function executes a full memory barrier before accessing the task
1973  * state; see set_current_state().
1974  *
1975  * Return: %true if @p->state changes (an actual wakeup was done),
1976  *     %false otherwise.
1977  */
1978 static int
1979 try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
1980 {
1981     unsigned long flags;
1982     int cpu, success = 0;
1983
1984     /*
1985      * If we are going to wake up a thread waiting for CONDITION we
1986      * need to ensure that CONDITION=1 done by the caller can not be
1987      * reordered with p->state check below. This pairs with mb() in
1988      * set_current_state() the waiting thread does.
1989      */
1990     raw_spin_lock_irqsave(&p->pi_lock, flags);
1991     smp_mb__after_spinlock();
1992     if (!(p->state & state))
1993         goto out;
1994
1995     trace_sched_waking(p);
1996
1997     /* We're going to change ->state: */
1998     success = 1;
1999     cpu = task_cpu(p);
2000
2001     /*
2002      * Ensure we load p->on_rq _after_ p->state, otherwise it would
2003      * be possible to, falsely, observe p->on_rq == 0 and get stuck
2004      * in smp_cond_load_acquire() below.
2005      *
2006      * sched_ttwu_pending()         try_to_wake_up()
2007      *   STORE p->on_rq = 1           LOAD p->state
2008      *   UNLOCK rq->lock
2009      *
2010      * __schedule() (switch to task 'p')
2011      *   LOCK rq->lock            smp_rmb();
2012      *   smp_mb__after_spinlock();
2013      *   UNLOCK rq->lock
2014      *
2015      * [task p]
2016      *   STORE p->state = UNINTERRUPTIBLE     LOAD p->on_rq
2017      *
2018      * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
2019      * __schedule().  See the comment for smp_mb__after_spinlock().
2020      */
2021     smp_rmb();
2022     if (p->on_rq && ttwu_remote(p, wake_flags))
2023         goto stat;
2024
2025 #ifdef CONFIG_SMP
2026     /*
2027      * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
2028      * possible to, falsely, observe p->on_cpu == 0.
2029      *
2030      * One must be running (->on_cpu == 1) in order to remove oneself
2031      * from the runqueue.
2032      *
2033      * __schedule() (switch to task 'p')    try_to_wake_up()
2034      *   STORE p->on_cpu = 1          LOAD p->on_rq
2035      *   UNLOCK rq->lock
2036      *
2037      * __schedule() (put 'p' to sleep)
2038      *   LOCK rq->lock            smp_rmb();
2039      *   smp_mb__after_spinlock();
2040      *   STORE p->on_rq = 0           LOAD p->on_cpu
2041      *
2042      * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
2043      * __schedule().  See the comment for smp_mb__after_spinlock().
2044      */
2045     smp_rmb();
2046
2047     /*
2048      * If the owning (remote) CPU is still in the middle of schedule() with
2049      * this task as prev, wait until its done referencing the task.
2050      *
2051      * Pairs with the smp_store_release() in finish_task().
2052      *
2053      * This ensures that tasks getting woken will be fully ordered against
2054      * their previous state and preserve Program Order.
2055      */
2056     smp_cond_load_acquire(&p->on_cpu, !VAL);
2057
2058     p->sched_contributes_to_load = !!task_contributes_to_load(p);
2059     p->state = TASK_WAKING;
2060
2061     if (p->in_iowait) {
2062         delayacct_blkio_end(p);
2063         atomic_dec(&task_rq(p)->nr_iowait);
2064     }
2065
2066     cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
2067     if (task_cpu(p) != cpu) {
2068         wake_flags |= WF_MIGRATED;
2069         psi_ttwu_dequeue(p);
2070         set_task_cpu(p, cpu);
2071     }
2072
2073 #else /* CONFIG_SMP */
2074
2075     if (p->in_iowait) {
2076         delayacct_blkio_end(p);
2077         atomic_dec(&task_rq(p)->nr_iowait);
2078     }
2079
2080 #endif /* CONFIG_SMP */
2081
2082     ttwu_queue(p, cpu, wake_flags);
2083 stat:
2084     ttwu_stat(p, cpu, wake_flags);
2085 out:
2086     raw_spin_unlock_irqrestore(&p->pi_lock, flags);
2087
2088     return success;
2089 }

Fix options

Reproducing the bug

17210964434930.png

Process 0

Check which CPU process 0 is running on before the remaining SMP cores are brought up during system initialization:

Process 0 runs on CPU 0.

ksmd

Search /var/log/messages for entries showing ksmd being scheduled onto an isolated core (the p->comm entries come from debug prints added at the code paths shown later in this section):

grep 'p->comm:ksmd' /var/log/messages -inr --color > ksmd.log
vim ksmd.log

ksmd.log

kthreadd

Search /var/log/messages for entries showing kthreadd being scheduled onto an isolated core:

grep 'p->comm:kthreadd' /var/log/messages -inr --color > kthreadd.log
vim kthreadd.log

kthreadd.log

Key code

The code locations that produced the log entries above are shown in the figures below.

select_task_rq_fair
vim kernel/sched/fair.c +6554

select_idle_sibling
vim -t select_idle_sibling

select_idle_smt

// vim -t select_idle_smt
6170 /*
6171  * Scan the local SMT mask for idle CPUs.
6172  */
6173 static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
6174 {
6175     int cpu;
6176
6177     if (!static_branch_likely(&sched_smt_present))
6178         return -1;
6179
6180     for_each_cpu(cpu, cpu_smt_mask(target)) {
6181         if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
6182             continue;
6183         if (available_idle_cpu(cpu))
6184             return cpu;
6185     }
6186
6187     return -1;
6188 }

Fixing the bug in the source

A one-line change fixes the problem: when select_idle_smt() scans target's SMT siblings, additionally reject any CPU that lies outside the scheduling domain. CPUs isolated with isolcpus= belong to no scheduling domain, so the extra check keeps them from being selected:

git show HEAD
commit a90c573c74f55930f3b770ec67d7a84528e8dac8 (HEAD -> master)
Author: QiLiang Yuan <yuanql9@chinatelecom.cn>
Date:   Thu Jul 18 23:18:24 2024 +0800

    sched/fair: Fix wrong cpu selecting from isolated domain

    Signed-off-by: QiLiang Yuan <yuanql9@chinatelecom.cn>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f58f13545450..30f32641a45c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6178,7 +6178,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
                return -1;

        for_each_cpu(cpu, cpu_smt_mask(target)) {
-               if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+               if (!cpumask_test_cpu(cpu, &p->cpus_allowed) || !cpumask_test_cpu(cpu, sched_domain_span(sd)))
                        continue;
                if (available_idle_cpu(cpu))
                        return cpu;

sched_domain_span

sched_domain_span(sd) returns the cpumask spanned by a scheduling domain; isolcpus= CPUs are excluded from every domain, which is why testing membership filters them out. Upstream, select_idle_smt() and the cpumask_test_cpu(cpu, sched_domain_span(sd)) check have been removed and re-merged several times, most recently about six months ago:

git log -S 'cpumask_test_cpu(cpu, sched_domain_span(sd))' --oneline kernel/sched/fair.c

8aeaffef8c6e sched/fair: Take the scheduling domain into account in select_idle_smt()
3e6efe87cd5c sched/fair: Remove redundant check in select_idle_smt()
3e8c6c9aac42 sched/fair: Remove task_util from effective utilization in feec()
c722f35b513f sched/fair: Bring back select_idle_smt(), but differently
6cd56ef1df39 sched/fair: Remove select_idle_smt()
df3cb4ea1fb6 sched/fair: Fix wrong cpu selecting from isolated domain

To dump the full contents of all the commits above with git show:

git log -S 'cpumask_test_cpu(cpu, sched_domain_span(sd))' --oneline kernel/sched/fair.c | awk '{print $1}' | xargs git show > sched_domain_span.log

Avoiding the bug without source changes

Account for the SMT domain when choosing isolated cores

Check logical CPU 0's sibling in the SMT scheduling domain:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
76

As the figures below show, with isolcpus=1-75,77-151 (leaving CPU 0 and its SMT sibling CPU 76 both non-isolated), ksmd is no longer scheduled onto the isolated cores.

17212691514223.png

17212686898314.png

17212686615589.png
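
When composing an isolcpus= value by hand, it helps to list every CPU's SMT siblings so that each sibling pair ends up entirely isolated or entirely housekeeping; a small sketch over the standard topology sysfs:

# Print each CPU's SMT sibling list; keep sibling pairs on the same
# side of the isolation boundary
for f in /sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list; do
    cpu=${f%/topology/*}
    echo "${cpu##*/}: $(cat "$f")"
done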

Manually migrating ksmd away

This option suits machines in production that cannot be taken offline: ksmd is migrated directly onto the non-isolated cores.

  • Checking PF_NO_SETAFFINITY
    Look up the PF_NO_SETAFFINITY definition in the kernel source:
// include/linux/sched.h

#define PF_NO_SETAFFINITY 0x04000000        /* Userland is not allowed to meddle with cpus_allowed */

Find ksmd's PID on the physical machine:

ps aux | grep -i ksmd
root         510 11.1  0.0      0     0 ?        RN   Jan19 26874:37 [ksmd]
root     1979672  0.0  0.0 213952  1600 pts/16   S+   16:13   0:00 grep -i ksmd

Then inspect ksmd's state in /proc:

[root@gzinf-kunpeng-55e235e17e57 yql]# cat /proc/510/stat
510 (ksmd) S 2 0 0 0 -1 2097216 0 0 0 0 3 161248078 0 0 25 5 1 0 13 0 0 18446744073709551615 0 0 0 0 0 0 0 2147483647 0 1 0 0 17 69 0 0 0 0 0 0 0 0 0 0 0 0 0

Extract the flags field (field 9 of /proc/PID/stat):

[root@gzinf-kunpeng-55e235e17e57 yql]# flags=$(cut -f 9 -d ' ' /proc/510/stat)

Test whether the flags include PF_NO_SETAFFINITY (0x04000000):

[root@gzinf-kunpeng-55e235e17e57 yql]# echo $(($flags & 0x04000000))
0

20240704161455.png

The flags do not include PF_NO_SETAFFINITY (0x04000000), so ksmd's CPU affinity can be set from user space with taskset or cgroups, binding it to CPUs that do not host guest VMs.

Pin ksmd's affinity to CPU 10:

taskset -pc 10 510
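
To confirm the change took effect (510 being the ksmd PID found above):

# Show the updated affinity list and the CPU ksmd last ran on
taskset -pc 510
ps -o pid,psr,comm -p 510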

Pin ksmd's affinity to the non-isolated cores:

Here the SMT siblings of isolated cores must be excluded as well. Suppose logical CPU 0 and logical CPU 76 sit on the same physical core, i.e. they are siblings in one SMT scheduling domain, and CPU 0 is an isolated core; then CPU 76 must also be excluded when setting ksmd's affinity.

taskset -pc 0-63 510

These two steps migrate ksmd off the isolated CPUs and also guarantee that it will not be scheduled onto an isolated core again.
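
Putting the pieces together, the exclusion can be scripted; a sketch that derives the housekeeping CPU list (excluding isolated CPUs and their SMT siblings) and pins ksmd to it, assuming /sys/devices/system/cpu/isolated reflects the isolcpus= boot parameter and pgrep/taskset are available:

#!/usr/bin/env bash
# Illustrative sketch, not a turnkey tool: compute a CPU list that
# excludes isolated CPUs and their SMT siblings, then pin ksmd to it.

expand_cpulist() {                 # "1-3,7" -> one CPU number per line
    local part
    for part in ${1//,/ }; do
        if [[ $part == *-* ]]; then
            seq "${part%-*}" "${part#*-}"
        else
            echo "$part"
        fi
    done
}

declare -A isolated
for c in $(expand_cpulist "$(cat /sys/devices/system/cpu/isolated)"); do
    isolated[$c]=1
done

allowed=()
for dir in /sys/devices/system/cpu/cpu[0-9]*; do
    cpu=${dir##*cpu}
    keep=1
    # Drop this CPU if it, or any of its SMT siblings, is isolated
    for s in $(expand_cpulist "$(cat "$dir/topology/thread_siblings_list")"); do
        [[ ${isolated[$s]-} ]] && keep=0
    done
    (( keep )) && allowed+=("$cpu")
done

cpulist=$(IFS=,; echo "${allowed[*]}")
echo "pinning ksmd to CPUs: $cpulist"
taskset -pc "$cpulist" "$(pgrep -x ksmd)"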

One-click migration via CI/CD

This has been packaged as an RPM: installing ksmd-taskset.rpm performs the migration in one step; see the ksmd-taskset packaging notes for the details.

Stress testing with stress-ng

Clone the stress-ng source from GitHub, build it, and run the workload:

git clone https://github.com/ColinIanKing/stress-ng.git
cd stress-ng
make
make install
stress-ng --timeout 10m --cpu 110 --vm 20 --vm-bytes 16G --vm-madvise mergeable --mmap 110 --mmap-stressful --mmap-mergeable --mmaphuge 25 --sock 110

20240802115921.png

stress-ng --timeout 10m --cpu 110 --vm 20 --vm-bytes 16G --vm-madvise mergeable --mmap 110 --mmap-stressful --mmap-mergeable --mmaphuge 25 --sock 110 --ksm

20240802120305.png

In practice, migrating ksmd to the non-isolated cores had little impact: per pidstat, ksmd consumed about 0.19% of total CPU.
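
That measurement can be reproduced with pidstat from the sysstat package (interval and sample count below are illustrative):

# Sample ksmd's CPU usage once per second, ten samples
pidstat -p "$(pgrep -x ksmd)" 1 10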

Summary

The source fix will be merged into the next release.

Machines already online are hot-fixed via kpatch, and ksmd's CPU affinity is set so that it is bound to CPUs that do not host guest VMs.

Going further, the ksmd parameters can be tuned to improve VM performance.
