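Running the Test
- Set the number of test threads to the maximum number of threads per NUMA node, so that the test measures the full bandwidth of each NUMA node.
- Test the DDR bandwidth of each NUMA node. Reference command (the loop steps through 608 cores in blocks of 38, so it runs once for each of the 16 NUMA nodes, pinning 38 threads to that node's cores and binding memory to node i/38):
  for ((i = 0; i < 608; i += 38)); do
      OMP_NUM_THREADS=38 OMP_PROC_BIND=close taskset -c ${i}-$((${i}+37)) numactl -m $((${i}/38)) ./stream_c.exe
  done
  Check the Triad value in the results. Each NUMA node should report 120000 to 150000 MB/s: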
  -------------------------------------------------------------
  STREAM version $Revision: 5.10 $
  -------------------------------------------------------------
  This system uses 8 bytes per array element.
  -------------------------------------------------------------
  Array size = 141648512 (elements), Offset = 0 (elements)
  Memory per array = 1080.7 MiB (= 1.1 GiB).
  Total memory required = 3242.1 MiB (= 3.2 GiB).
  Each kernel will be executed 20 times.
   The *best* time for each kernel (excluding the first iteration)
   will be used to compute the reported bandwidth.
  -------------------------------------------------------------
  Number of Threads requested = 38
  Number of Threads counted = 38
  -------------------------------------------------------------
  Your clock granularity/precision appears to be 1 microseconds.
  Each test below will take on the order of 17826 microseconds.
     (= 17826 clock ticks)
  Increase the size of the arrays if this shows that
  you are not getting at least 20 clock ticks per test.
  -------------------------------------------------------------
  WARNING -- The above is only a rough guideline.
  For best results, please be sure you know the
  precision of your system timer.
  -------------------------------------------------------------
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          125994.0     0.018811     0.017988     0.020024
  Scale:         135105.3     0.017003     0.016775     0.017418
  Add:           144631.7     0.023637     0.023505     0.023769
  Triad:         144521.8     0.023850     0.023523     0.024183
  -------------------------------------------------------------
  Solution Validates: avg error less than 1.000000e-13 on all three arrays
  -------------------------------------------------------------
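- Test the on-chip memory bandwidth of each NUMA node. Reference command (identical to the DDR test except that memory is bound to node i/38+16 instead of i/38):
  for ((i = 0; i < 608; i += 38)); do
      OMP_NUM_THREADS=38 OMP_PROC_BIND=close taskset -c ${i}-$((${i}+37)) numactl -m $((${i}/38+16)) ./stream_c.exe
  done
  Check the Triad value in the results. Each NUMA node should report roughly 380000 to 400000 MB/s: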
  -------------------------------------------------------------
  STREAM version $Revision: 5.10 $
  -------------------------------------------------------------
  This system uses 8 bytes per array element.
  -------------------------------------------------------------
  Array size = 141648512 (elements), Offset = 0 (elements)
  Memory per array = 1080.7 MiB (= 1.1 GiB).
  Total memory required = 3242.1 MiB (= 3.2 GiB).
  Each kernel will be executed 20 times.
   The *best* time for each kernel (excluding the first iteration)
   will be used to compute the reported bandwidth.
  -------------------------------------------------------------
  Number of Threads requested = 38
  Number of Threads counted = 38
  -------------------------------------------------------------
  Your clock granularity/precision appears to be 1 microseconds.
  Each test below will take on the order of 5151 microseconds.
     (= 5151 clock ticks)
  Increase the size of the arrays if this shows that
  you are not getting at least 20 clock ticks per test.
  -------------------------------------------------------------
  WARNING -- The above is only a rough guideline.
  For best results, please be sure you know the
  precision of your system timer.
  -------------------------------------------------------------
  Function    Best Rate MB/s  Avg time     Min time     Max time
  Copy:          403423.6     0.005716     0.005618     0.005873
  Scale:         399288.9     0.005857     0.005676     0.006221
  Add:           432229.1     0.007951     0.007865     0.008594
  Triad:         406941.0     0.008642     0.008354     0.009097
  -------------------------------------------------------------
  Solution Validates: avg error less than 1.000000e-13 on all three arrays
  -------------------------------------------------------------
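With 16 runs per test, checking each Triad value by eye is tedious. The following is a minimal sketch of how the check could be scripted. The helper names `triad_rate` and `in_range` are hypothetical and not part of STREAM; the range limits are the DDR reference values given above, and the sample input is the Add and Triad lines from the DDR reference output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of STREAM): extract the Triad
# "Best Rate MB/s" value from STREAM output and compare it with a range.

# Read STREAM output on stdin and print the second field of the "Triad:" line.
triad_rate() {
    awk '$1 == "Triad:" { print $2 }'
}

# Succeed if $1 lies within [$2, $3]; awk performs the decimal comparison.
in_range() {
    awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN { exit !(v >= lo && v <= hi) }'
}

# Example: two lines taken from the DDR reference output.
rate=$(printf '%s\n' \
    'Add:           144631.7     0.023637     0.023505     0.023769' \
    'Triad:         144521.8     0.023850     0.023523     0.024183' \
    | triad_rate)

if in_range "$rate" 120000 150000; then
    echo "Triad ${rate} MB/s: within the DDR reference range"
else
    echo "Triad ${rate} MB/s: outside the DDR reference range"
fi
```

In a real run, the output of `./stream_c.exe` inside the loop could be piped into `triad_rate` in place of the `printf` sample, with the limits changed to 380000 and 400000 for the on-chip memory test.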