Determine NUMA layout via latency/performance measurements


Problem Description


Recently I have been observing performance effects in memory-intensive workloads that I was unable to explain. Trying to get to the bottom of this, I started running several microbenchmarks in order to determine common performance parameters like cache line size and L1/L2/L3 cache sizes (I knew them already; I just wanted to see whether my measurements reflected the actual values).

For the cache line test my code roughly looks as follows (Linux C, but the concept is similar on Windows etc., of course):

char *array = malloc(ARRAY_SIZE);
int count = ARRAY_SIZE / STEP;          // number of elements touched per pass
clock_gettime(CLOCK_REALTIME, &start_time);

for (int i = 0; i < ARRAY_SIZE; i += STEP) {
  array[i]++;                           // touch one byte every STEP bytes
}
clock_gettime(CLOCK_REALTIME, &end_time);

// calculate time per element here:
[..]

Varying STEP from 1 to 128 showed that from STEP=64 on, the time per element did not increase further, i.e. every iteration needs to fetch a new cache line, which dominates the runtime. Varying ARRAY_SIZE from 1K to 16384K while keeping STEP=64, I was able to create a nice plot exhibiting a step pattern that roughly corresponds to L1, L2 and L3 latency. It was necessary to repeat the for loop a number of times, though, for very small array sizes even 100,000s of times, to get reliable numbers. With that, on my IvyBridge notebook I can clearly see L1 ending at 64K, L2 at 256K and even L3 at 6M.
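
For reference, a minimal, self-contained sketch of such a stride test could look like the following; the REPEAT count, the volatile qualifier and the ns-per-element arithmetic are my assumptions, not the original code:

// Minimal stride-test sketch; REPEAT, the volatile qualifier and the
// ns-per-element arithmetic are assumptions, not the original code.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE (1024 * 1024)
#define STEP       64
#define REPEAT     1000    // repeat the pass so small arrays still give stable numbers

int main(void)
{
    volatile char *array = malloc(ARRAY_SIZE);  // volatile keeps the stores from being optimized away
    struct timespec start_time, end_time;

    clock_gettime(CLOCK_REALTIME, &start_time);
    for (int r = 0; r < REPEAT; r++)
        for (int i = 0; i < ARRAY_SIZE; i += STEP)
            array[i]++;
    clock_gettime(CLOCK_REALTIME, &end_time);

    double ns = (end_time.tv_sec - start_time.tv_sec) * 1e9
              + (end_time.tv_nsec - start_time.tv_nsec);
    printf("%.2f ns per element\n",
           ns / ((double)REPEAT * (ARRAY_SIZE / STEP)));
    return 0;
}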

Now on to my real question: In a NUMA system, any single core will access remote main memory, and even shared caches, that are not necessarily as close as its local cache and memory. I was hoping to see a difference in latency/performance, thus determining how much memory I could allocate while staying in my fast caches/part of memory.

For this, I refined my test to walk through the memory in 1/10 MB chunks, measuring the latency of each chunk separately, and to later collect the fastest chunks, roughly like this:

for (int chunk_start = 0; chunk_start < ARRAY_SIZE; chunk_start += CHUNK_SIZE) {
  int chunk_end = MIN(ARRAY_SIZE, chunk_start + CHUNK_SIZE);
  int chunk_els = CHUNK_SIZE / STEP;    // elements touched in this chunk
  for (int i = chunk_start; i < chunk_end; i += STEP) {
    array[i]++;
  }
  // calculate time per element
[..]
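
A fuller, self-contained sketch of that per-chunk measurement might look like the following; the chunk_times bookkeeping, the qsort at the end and the volatile qualifier are my additions for illustration, not the original code:

// Per-chunk latency sketch; chunk_times[], the sort and the volatile
// qualifier are illustrative additions, not the original code.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_SIZE  (64 * 1024 * 1024)
#define CHUNK_SIZE  (1024 * 1024 / 10)   // roughly 1/10 MB chunks, as in the question
#define STEP        64
#define MIN(a, b)   ((a) < (b) ? (a) : (b))

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

int main(void)
{
    volatile char *array = malloc(ARRAY_SIZE);
    int n_chunks = (ARRAY_SIZE + CHUNK_SIZE - 1) / CHUNK_SIZE;
    double *chunk_times = malloc(n_chunks * sizeof(double)); // ns per element, per chunk
    struct timespec t0, t1;

    for (int c = 0; c < n_chunks; c++) {
        int chunk_start = c * CHUNK_SIZE;
        int chunk_end = MIN(ARRAY_SIZE, chunk_start + CHUNK_SIZE);
        int chunk_els = (chunk_end - chunk_start + STEP - 1) / STEP;

        clock_gettime(CLOCK_REALTIME, &t0);
        for (int i = chunk_start; i < chunk_end; i += STEP) {
            array[i]++;
        }
        clock_gettime(CLOCK_REALTIME, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        chunk_times[c] = ns / chunk_els;
    }

    // "collect the fastest chunks": sort the per-chunk latencies
    qsort(chunk_times, n_chunks, sizeof(double), cmp_double);
    printf("fastest chunk: %.2f ns/el, slowest: %.2f ns/el\n",
           chunk_times[0], chunk_times[n_chunks - 1]);

    free(chunk_times);
    free((void *)array);
    return 0;
}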

As soon as I start increasing ARRAY_SIZE to something larger than the L3 size, I get wildly unreliable numbers that not even a large number of repeats can even out. There is no way I can make out a pattern usable for performance evaluation this way, let alone determine where exactly a NUMA stripe starts, ends or is located.

Then I figured the hardware prefetcher is smart enough to recognize my simple access pattern and simply fetch the needed lines into the cache before I access them. Adding a random number to the array index increased the time per element but otherwise did not seem to help much, probably because I made a rand() call in every iteration. Precomputing some random values and storing them in an array did not seem a good idea to me, as that array would itself be kept in a hot cache and skew my measurements. Increasing STEP to 4097 or 8193 did not help much either; the prefetcher must be smarter than me.
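
A common way around both problems at once (per-iteration rand() overhead and a separate index array polluting the cache) is to build a random chain inside the measured buffer itself and then chase it, so every load depends on the previous one and the prefetcher has nothing to predict. This is not the original code, just a sketch of the technique:

// Pointer-chasing sketch: the buffer itself stores a random cycle of
// cache-line-sized slots, so the only data touched is the buffer under test.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define LINE 64   // assumed cache line size in bytes

// Shuffle the slot order, then link consecutive slots into one cycle:
// the first sizeof(size_t) bytes of each 64B slot hold the byte offset
// of the next slot to visit.
static void build_chain(char *buf, size_t bytes)
{
    size_t n = bytes / LINE;
    size_t *order = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {          // Fisher-Yates shuffle
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (size_t i = 0; i < n; i++) {
        size_t next = order[(i + 1) % n] * LINE;
        memcpy(buf + order[i] * LINE, &next, sizeof(next));
    }
    free(order);
}

// Each load depends on the value just read, so the accesses cannot be
// prefetched; returning the final offset keeps the loop from being elided.
static size_t chase(const char *buf, size_t ops)
{
    size_t off = 0;
    for (size_t i = 0; i < ops; i++)
        memcpy(&off, buf + off, sizeof(off));
    return off;
}

int main(void)
{
    size_t bytes = 64u * 1024 * 1024;
    char *buf = malloc(bytes);
    build_chain(buf, bytes);
    // Time this call (e.g. with clock_gettime) exactly like the loops above.
    printf("%zu\n", chase(buf, 10u * 1000 * 1000));
    free(buf);
    return 0;
}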

Is my approach sensible/viable, or did I miss the bigger picture? Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong? I disabled address space randomization just to be sure and to preclude strange cache aliasing effects. Is there something else, operating-system-wise, that has to be tuned before measuring?

Solution

Is it possible to observe NUMA latencies like this at all? If yes, what am I doing wrong?

Memory allocators are NUMA-aware, so by default you will not observe any NUMA effects until you explicitly ask to allocate memory on another node. The simplest way to achieve the effect is numactl(8). Just run your application on one node and bind its memory allocations to another, like so:

numactl --cpunodebind 0 --membind 1 ./my-benchmark

See also numa_alloc_onnode(3).
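
If you would rather do this from within the program than via numactl, a minimal libnuma sketch (my illustration, not code from the question; link with -lnuma, error handling omitted) could look like this:

// Minimal libnuma sketch: run on node 0, allocate on node 1.
// Link with -lnuma; error handling is omitted for brevity.
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    numa_run_on_node(0);                        // pin execution to node 0
    size_t bytes = 256 * 1024 * 1024;
    char *remote = numa_alloc_onnode(bytes, 1); // memory physically on node 1

    // ... run the latency benchmark against `remote` here ...

    numa_free(remote, bytes);
    return 0;
}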

Is there something else, operating-system-wise, that has to be tuned before measuring?

Turn off CPU frequency scaling, otherwise your measurements might be noisy:

find '/sys/devices/system/cpu/' -name 'scaling_governor' | while read F; do
        echo "==> ${F}"
        echo "performance" | sudo tee "${F}" > /dev/null
done

Now regarding the test itself. To measure latency, the access pattern must be (pseudo-)random; otherwise your measurements will be contaminated with fast cache hits.

Here is an example of how you could achieve this:

Data Initialization

Fill the array with random numbers:

static void random_data_init()
{
    // arr is a global array of ARR_SZ elements, declared elsewhere
    for (size_t i = 0; i < ARR_SZ; i++) {
        arr[i] = rand();   // the random contents drive the random walk below
    }
}

Benchmark

Perform 1M operations per benchmark iteration to reduce measurement noise. Use the random numbers stored in the array to jump over a few cache lines at a time:

const size_t OPERATIONS = 1 * 1000 * 1000; // 1M operations per iteration

int random_step_sizeK(size_t size)
{
    size_t idx = 0;

    for (size_t i = 0; i < OPERATIONS; i++) {
        arr[idx & (size - 1)]++;           // the mask keeps the index within the first `size` elements
        idx += arr[idx & (size - 1)] * 64; // data-dependent jump; assuming cache line is 64B
    }
    return 0;
}
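
A plain-C driver that exercises these two functions and reports ns per operation for a few size arguments might look like the sketch below. The element type (size_t), ARR_SZ, the size list and the clock_gettime timing loop are my assumptions for illustration; the two helper functions follow the snippets above.

// Hypothetical driver for the snippets above; element type, ARR_SZ and the
// timing loop are illustrative assumptions.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARR_SZ (16u * 1024u * 1024u)   // 16M elements (assumption)
static size_t *arr;                    // element type is an assumption

const size_t OPERATIONS = 1 * 1000 * 1000;

static void random_data_init(void)
{
    for (size_t i = 0; i < ARR_SZ; i++)
        arr[i] = rand();
}

static int random_step_sizeK(size_t size)
{
    size_t idx = 0;
    for (size_t i = 0; i < OPERATIONS; i++) {
        arr[idx & (size - 1)]++;
        idx += arr[idx & (size - 1)] * 64;
    }
    return 0;
}

int main(void)
{
    arr = malloc(ARR_SZ * sizeof(*arr));
    random_data_init();

    // Same size arguments as some rows of the results table below.
    size_t sizes[] = { 4, 32, 64, 256, 512, 4096, 8192 };
    for (size_t s = 0; s < sizeof(sizes) / sizeof(sizes[0]); s++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        random_step_sizeK(sizes[s]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("random_step_sizeK/%-6zu %8.1f ns/op\n", sizes[s], ns / OPERATIONS);
    }
    free(arr);
    return 0;
}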

Results

Here are the results on an i5-4460 CPU @ 3.20GHz:

----------------------------------------------------------------
Benchmark                         Time           CPU Iterations
----------------------------------------------------------------
random_step_sizeK/4         4217004 ns    4216880 ns        166
random_step_sizeK/8         4146458 ns    4146227 ns        168
random_step_sizeK/16        4188168 ns    4187700 ns        168
random_step_sizeK/32        4180545 ns    4179946 ns        163
random_step_sizeK/64        5420788 ns    5420140 ns        129
random_step_sizeK/128       6187776 ns    6187337 ns        112
random_step_sizeK/256       7856840 ns    7856549 ns         89
random_step_sizeK/512      11311684 ns   11311258 ns         57
random_step_sizeK/1024     13634351 ns   13633856 ns         51
random_step_sizeK/2048     16922005 ns   16921141 ns         48
random_step_sizeK/4096     15263547 ns   15260469 ns         41
random_step_sizeK/6144     15262491 ns   15260913 ns         46
random_step_sizeK/8192     45484456 ns   45482016 ns         23
random_step_sizeK/16384    54070435 ns   54064053 ns         14
random_step_sizeK/32768    59277722 ns   59273523 ns         11
random_step_sizeK/65536    63676848 ns   63674236 ns         10
random_step_sizeK/131072   66383037 ns   66380687 ns         11

There are obvious steps between 32K/64K (so my L1 cache is ~32K), 256K/512K (so my L2 cache size is ~256K) and 6144K/8192K (so my L3 cache size is ~6M).
