Introduction to the numafast Tuning Tool

openEuler

Published on 2025/06/09

Author | 刘长庚

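Introduction

numafast is a user-space tuning tool for Kunpeng processors. It aims to reduce cross-NUMA memory accesses and thereby improve system performance.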

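Usage Restrictions

Available only on Kunpeng. The source code is currently closed; only the binary is publicly released.
Runs only on physical machines, not inside virtual machines.
Depends on SPE (Statistical Profiling Extension), a hardware-assisted CPU profiling mechanism, which must be enabled before numafast starts.
Must be run as root.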
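
The restrictions above can be turned into a quick preflight check. The following is a minimal sketch and not part of numafast: the function names check_restrictions and preflight are made up for illustration, systemd-detect-virt is assumed to be installed, and an aarch64 result from uname -m is only a necessary condition for Kunpeng, not proof of it.

```shell
# Decide whether the restrictions are met, given already-collected facts.
# $1 = numeric user id, $2 = machine architecture, $3 = virtualization type.
# Prints one line per problem and returns non-zero if any was found.
check_restrictions() {
    rc=0
    [ "$1" -eq 0 ]       || { echo "not running as root"; rc=1; }
    [ "$2" = "aarch64" ] || { echo "not an aarch64 machine: $2"; rc=1; }
    [ "$3" = "none" ]    || { echo "virtualized environment detected: $3"; rc=1; }
    return "$rc"
}

# Collect the facts from the running system and check them.
# systemd-detect-virt prints "none" (with a non-zero exit status) on bare metal.
preflight() {
    virt=$(systemd-detect-virt 2>/dev/null) || true
    check_restrictions "$(id -u)" "$(uname -m)" "${virt:-unknown}"
}
```

Calling preflight before starting numafast prints nothing and returns 0 when all three checks pass. The SPE requirement is not covered here because it needs perf.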

How to Enable

Enabling SPE

Run perf list | grep arm_spe to check whether SPE is already enabled. If it is, the output looks like this:

[root@master ~]# perf list | grep arm_spe
  arm_spe_0//                                        [Kernel PMU event]

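If nothing like the above is printed, SPE is not enabled and can be turned on as follows (the Kunpeng 920 high-performance model supports SPE by default and needs no manual configuration).

Check the BIOS setting MISC Config--> SPE. If it is Disabled, change it to Enabled. If the option is missing, the BIOS version may be too old.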
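
The SPE check can also be scripted. This is a small sketch under the assumption that perf is installed; has_spe and check_spe are illustrative names, not commands shipped with numafast.

```shell
# Return 0 if the text on stdin (expected: the output of `perf list`)
# contains an arm_spe PMU entry such as "arm_spe_0//".
has_spe() {
    grep -q 'arm_spe'
}

# Run the real check and print a hint that points back to the BIOS option.
check_spe() {
    if perf list 2>/dev/null | has_spe; then
        echo "SPE is enabled"
    else
        echo "SPE not found: check BIOS MISC Config--> SPE" >&2
        return 1
    fi
}
```

Splitting the text match (has_spe) from the perf call keeps the match easy to try on saved output from another machine.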

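Installation

numafast is currently published on the OEPKGS site: https://search.oepkgs.net/zh-CN/list?s=numafast&exactSearch=exact&searchType=default

Pick the highest version (second column of the results). Each version number currently ships packages for three OS releases, built for the 4.19, 5.10, and 6.6 kernels respectively. The OS release in the package name does not have to match your OS exactly; only the kernel has to match. For example, the package for 22.03-LTS-SP4 runs on any OS release that uses the 5.10 kernel.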
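
Since the package is chosen by kernel series rather than by OS release, a helper like the one below can tell which of the three builds applies. It is only a sketch: numafast_kernel_series is a made-up name, and the release strings in the usage note are illustrative.

```shell
# Print the kernel series (4.19, 5.10 or 6.6) that a kernel release string
# belongs to, or "unsupported" if it is none of the three.
numafast_kernel_series() {
    # Keep only "major.minor", e.g. 5.10.0-216.oe2203sp4.aarch64 -> 5.10
    series=$(printf '%s\n' "$1" | cut -d. -f1,2)
    case "$series" in
        4.19|5.10|6.6) echo "$series" ;;
        *) echo "unsupported"; return 1 ;;
    esac
}
```

On a target machine it would be called as numafast_kernel_series "$(uname -r)", and the printed series is then matched against the packages listed on OEPKGS.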

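Because the oepkgs web page may lag behind, the latest packages are also linked for download from the documentation in https://gitee.com/src-oepkgs/numafast.

The download is an rpm package. Install it on the target machine with rpm -ivh numafast-xxx.rpm

Installing the rpm adds the numafast binary to /usr/bin/. To uninstall: rpm -e numafast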

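Running

Run it in either of the following ways.

Method 1: Run the binary directly

numafast tunes dynamically and continuously, so it has to keep running for as long as the workload runs.

Run numafast in a terminal. It keeps running until you press CTRL+C.

To run it in the background, use nohup numafast &. When stopping it, do not use kill -9 pid; use kill -2 pid instead, because numafast has to perform some reset steps on exit. kill -9 skips the reset, which leaves the CPU bindings it applied in place.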
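
To avoid reaching for kill -9 by habit, the graceful stop can be wrapped in a small helper. This is a sketch under assumptions: stop_numafast and the KILL variable are invented for illustration, and pgrep is assumed to be available for looking up the process by name.

```shell
# Command used to deliver signals; can be overridden (e.g. KILL=echo) for a dry run.
KILL=${KILL:-kill}

# Send SIGINT (signal 2, the same signal CTRL+C sends) to the given PIDs,
# or to every process named exactly "numafast" when no PID is given.
stop_numafast() {
    if [ "$#" -eq 0 ]; then
        # Word splitting is intended: pgrep prints one PID per line.
        set -- $(pgrep -x numafast)
    fi
    [ "$#" -gt 0 ] || { echo "numafast is not running" >&2; return 1; }
    for pid in "$@"; do
        $KILL -2 "$pid"
    done
}
```

Running stop_numafast with no arguments stops a background instance started with nohup numafast &. Because only signal 2 is ever sent, numafast gets the chance to run its reset steps.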

The approach above is recommended for temporary debugging only. After version v241-1, numafast can also be started through systemd:

systemctl start numafast

Method 2: Run as an oeAware plugin

numafast can run as a standalone binary or as a plugin of oeAware, the open-source tuning framework of openEuler. After installing numafast, install oeAware:

yum install oeAware-manager

Start the service:

systemctl start oeaware

Enable NUMA tuning:

oeawarectl -e tune_numa_mem_access

Disable NUMA tuning:

oeawarectl -d tune_numa_mem_access

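For tuning purposes there is no essential difference between enabling through oeAware and running the numafast binary, but oeAware is the approach that will be promoted going forward. For oeAware usage, see the oeAware repository documentation: https://gitee.com/openeuler/oeAware-manager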

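Notes

NUMA balancing

NUMA balancing is a NUMA tuning feature built into Linux and is usually enabled out of the box. numafast turns it off automatically on startup and restores the previous state on exit.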
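
To observe this behaviour, the kernel setting can be read before numafast starts, while it runs, and after it exits. The sketch below assumes the standard /proc/sys/kernel/numa_balancing file; numa_balancing_state is an illustrative name, and any non-zero value is simply reported as enabled.

```shell
# Print "enabled" or "disabled" according to the kernel's NUMA balancing switch.
# $1 is optional and lets the function read a different file than the default.
numa_balancing_state() {
    file=${1:-/proc/sys/kernel/numa_balancing}
    [ -r "$file" ] || { echo "unavailable"; return 1; }
    case "$(cat "$file")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}
```

If the state printed after numafast exits differs from the one recorded before it started, the reset did not run, for example because the process was killed with kill -9.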

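Workload CPU binding

numafast currently force-migrates all processes (or only those on the whitelist, or all except those on the blacklist). If a workload has been pinned to cores in advance, numafast will still migrate it.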

Unsupported environments

numafast was developed and validated on openEuler. It may not run correctly on other operating systems.

Maintenance

If you run into problems with this software, open an issue in the community:

https://gitee.com/src-oepkgs/numafast/issues

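Tuning Test

A simple test case is enough to check the effect quickly. sysbench is used as the example here.

Test environment:

OS: openEuler 22.03 (LTS-SP4)
Server: Kunpeng 920 (128 cores)

Baseline test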

[root@localhost ~]#  yum install sysbench
# What the sysbench memory command below does:
# random memory access with 20 threads, 8K per access, 1000G in total
[root@localhost ~]# sysbench memory --threads=20 --memory-block-size=8K --memory-total-size=1000G --memory-access-mode=rnd run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 20
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 8KiB
  total size: 1024000MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 5433883 (543278.35 per second)

42452.21 MiB transferred (4244.36 MiB/sec)


General statistics:
    total time:                          10.0003s
    total number of events:              5433883

Latency (ms):
         min:                                    0.00
         avg:                                    0.04
         max:                                    0.35
         95th percentile:                        0.04
         sum:                               197373.88

Threads fairness:
    events (avg/stddev):           271694.1500/18602.22
    execution time (avg/stddev):   9.8687/0.04

With numafast enabled

[root@localhost ~]# sysbench memory --threads=20 --memory-block-size=8K --memory-total-size=1000G --memory-access-mode=rnd run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 20
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 8KiB
  total size: 1024000MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 7699846 (769859.71 per second)

60155.05 MiB transferred (6014.53 MiB/sec)


General statistics:
    total time:                          10.0003s
    total number of events:              7699846

Latency (ms):
         min:                                    0.00
         avg:                                    0.03
         max:                                   16.03
         95th percentile:                        0.03
         sum:                               197853.89

Threads fairness:
    events (avg/stddev):           384992.3000/7217.71
    execution time (avg/stddev):   9.8927/0.01

Throughput improvement: (6014.53 / 4244.36 - 1) = 41.7%
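
The same arithmetic can be reproduced from saved sysbench output. This is a minimal sketch: throughput_mib_s and improvement_pct are made-up helper names, and the sed pattern assumes the "(... MiB/sec)" summary line format shown in the runs above.

```shell
# Extract the MiB/sec figure from sysbench memory output read on stdin.
throughput_mib_s() {
    sed -n 's/.*(\([0-9.][0-9.]*\) MiB\/sec).*/\1/p'
}

# Print the relative improvement of $2 (tuned) over $1 (baseline) in percent.
improvement_pct() {
    awk -v base="$1" -v tuned="$2" \
        'BEGIN { printf "%.1f\n", (tuned / base - 1) * 100 }'
}
```

With the two figures from this test, improvement_pct 4244.36 6014.53 prints 41.7. If each run is saved to a file, throughput_mib_s < baseline.log supplies the first argument (baseline.log is a hypothetical file name).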