How does CLFLUSH work for an address that is not in cache yet?

Question

We are trying to use the Intel CLFLUSH instruction to flush a process's cache contents from userspace on Linux.

We wrote a very simple C program that first accesses a large array and then calls CLFLUSH over the virtual address range of the whole array. We measure how long it takes CLFLUSH to flush the whole array. The array size is an input to the program, and we vary it from 1 MB to 40 MB in steps of 2 MB.
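The original program was removed from the post, so the following is only a minimal sketch of such a measurement, not the authors' code. It assumes an x86 machine, a 64-byte cache line, and a C11 toolchain; the names `touch_buffer`, `flush_buffer`, and `time_flush` are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2, x86 only) */
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LINE_SIZE 64     /* assumed cache-line size in bytes */

/* Read one byte per cache line so the lines are likely to be cached. */
static uint64_t touch_buffer(const volatile uint8_t *buf, size_t size)
{
    uint64_t sum = 0;
    for (size_t off = 0; off < size; off += LINE_SIZE)
        sum += buf[off];
    return sum;
}

/* Issue CLFLUSH for every line in [buf, buf + size); return the line count. */
static size_t flush_buffer(const uint8_t *buf, size_t size)
{
    size_t lines = 0;
    for (size_t off = 0; off < size; off += LINE_SIZE) {
        _mm_clflush(buf + off);
        lines++;
    }
    _mm_mfence();        /* let the flushes complete before the timer stops */
    return lines;
}

static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);   /* C11 wall-clock time */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Allocate, write, read, then time a CLFLUSH pass over `size` bytes.
 * The memset dirties the lines, so this resembles a read-and-write run. */
static double time_flush(size_t size, size_t *lines_out)
{
    assert(size % LINE_SIZE == 0);
    uint8_t *buf = aligned_alloc(LINE_SIZE, size);
    assert(buf != NULL);
    memset(buf, 1, size);
    volatile uint64_t sink = touch_buffer(buf, size);
    (void)sink;

    double start = now_seconds();
    size_t lines = flush_buffer(buf, size);
    double elapsed = now_seconds() - start;

    free(buf);
    if (lines_out != NULL)
        *lines_out = lines;
    return elapsed;
}
```

Calling `time_flush` for each size from 1 MB to 40 MB would give a sweep of the kind described here. Note that the loop issues one CLFLUSH per line whether or not that line is still cached.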

As we understand it, CLFLUSH flushes content that is in the cache. So we expected the latency of flushing the whole array to increase linearly with the array size at first, and then to stop increasing once the array is larger than 20 MB, which is the size of our LLC (last-level cache).

However, the experimental result is quite surprising, as shown in the figure: the latency keeps increasing after the array size exceeds 20 MB.

We are wondering whether CLFLUSH might first bring a line into the cache before flushing it out, if the address is not in the cache yet. We also searched the Intel Software Developer's Manual and found no explanation of what CLFLUSH does when an address is not in the cache.

Below is the data we used to draw the figure. The first column is the size of the array in KB, and the second column is the latency of flushing the whole array in seconds.

Any suggestion/advice is more than appreciated.

[Modified]

The previous code was unnecessary. CLFLUSH can be issued from userspace much more easily, and the performance is similar, so I deleted the messy code to avoid confusion.

SCENARIO=Read Only
1024,.00158601000000000000
3072,.00299244000000000000
5120,.00464945000000000000
7168,.00630479000000000000
9216,.00796194000000000000
11264,.00961576000000000000
13312,.01126760000000000000
15360,.01300500000000000000
17408,.01480760000000000000
19456,.01696180000000000000
21504,.01968410000000000000
23552,.02300760000000000000
25600,.02634970000000000000
27648,.02990350000000000000
29696,.03403090000000000000
31744,.03749210000000000000
33792,.04092470000000000000
35840,.04438390000000000000
37888,.04780050000000000000
39936,.05163220000000000000

SCENARIO=Read and Write
1024,.00200558000000000000
3072,.00488687000000000000
5120,.00775943000000000000
7168,.01064760000000000000
9216,.01352920000000000000
11264,.01641430000000000000
13312,.01929260000000000000
15360,.02217750000000000000
17408,.02516330000000000000
19456,.02837180000000000000
21504,.03183180000000000000
23552,.03509240000000000000
25600,.03845220000000000000
27648,.04178440000000000000
29696,.04519920000000000000
31744,.04858340000000000000
33792,.05197220000000000000
35840,.05526950000000000000
37888,.05865630000000000000
39936,.06202170000000000000

Solution

You may want to look at the new optimization guide for the Skylake microarchitecture. Intel introduced another version of CLFLUSH, called CLFLUSHOPT, which is weakly ordered and would perform much better in your scenario.

See section 7.5.7 here: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

In general, CLFLUSHOPT throughput is higher than that of CLFLUSH, because CLFLUSHOPT orders itself with respect to a smaller set of memory traffic as described above and in Section 7.5.6. The throughput of CLFLUSHOPT will also vary. When using CLFLUSHOPT, flushing modified cache lines will experience a higher cost than flushing cache lines in non-modified states. CLFLUSHOPT will provide a performance benefit over CLFLUSH for cache lines in any coherence states. CLFLUSHOPT is more suitable to flush large buffers (e.g. greater than many KBytes), compared to CLFLUSH. In single-threaded applications, flushing buffers using CLFLUSHOPT may be up to 9X better than using CLFLUSH with Skylake microarchitecture.

The section also explains that flushing modified data is slower, which obviously comes from the writeback penalty.
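As a rough sketch of how CLFLUSHOPT could be used for a buffer flush, the code below checks CPUID for the instruction and falls back to CLFLUSH when it is missing. It assumes GCC or Clang on x86 and a 64-byte cache line; `has_clflushopt`, `flush_with_clflushopt`, and `flush_range` are illustrative names, not part of any library.

```c
#include <assert.h>
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang, x86 only) */
#include <immintrin.h>   /* _mm_clflush, _mm_clflushopt, _mm_sfence, _mm_mfence */
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE 64     /* assumed cache-line size in bytes */

/* CPUID.(EAX=07H, ECX=0):EBX bit 23 reports CLFLUSHOPT support. */
static int has_clflushopt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* Compiled with CLFLUSHOPT enabled for this one function only. */
static __attribute__((target("clflushopt")))
size_t flush_with_clflushopt(uint8_t *buf, size_t size)
{
    size_t lines = 0;
    for (size_t off = 0; off < size; off += LINE_SIZE) {
        _mm_clflushopt(buf + off);
        lines++;
    }
    _mm_sfence();        /* CLFLUSHOPT is weakly ordered; fence once at the end */
    return lines;
}

/* Flush [buf, buf + size) with CLFLUSHOPT if available, else CLFLUSH. */
static size_t flush_range(uint8_t *buf, size_t size)
{
    assert(buf != NULL);
    if (has_clflushopt())
        return flush_with_clflushopt(buf, size);

    size_t lines = 0;
    for (size_t off = 0; off < size; off += LINE_SIZE) {
        _mm_clflush(buf + off);
        lines++;
    }
    _mm_mfence();
    return lines;
}
```

The single fence after the loop is the design point: because CLFLUSHOPT instructions are not ordered against each other, the flushes can overlap, which is where the throughput gain described in the quoted section would come from.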

As for the increasing latency, are you measuring the overall time it takes to go over the address range and CLFLUSH each line? In that case the time depends linearly on the array size, even after it passes the LLC size. Even if a line is not cached, each CLFLUSH still has to be processed by the execution engine and the memory unit, and has to look up the entire cache hierarchy for that line.
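To make the per-line view concrete, the posted totals can be divided by the number of lines flushed. This is a small worked calculation under the assumption of 64-byte cache lines; the helper names `lines_in_kb` and `ns_per_line` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 64     /* assumed cache-line size in bytes */

/* Number of cache lines covered by an array of size_kb kilobytes. */
static size_t lines_in_kb(size_t size_kb)
{
    return size_kb * 1024 / LINE_SIZE;
}

/* Average cost of one CLFLUSH in nanoseconds, given the total flush time. */
static double ns_per_line(size_t size_kb, double total_seconds)
{
    return total_seconds * 1e9 / (double)lines_in_kb(size_kb);
}
```

For the Read Only data, `ns_per_line(39936, 0.0516322)` comes to about 80.8 ns per line, and `ns_per_line(1024, 0.00158601)` to about 96.8 ns per line. The per-line cost stays in the same range well past 20 MB, which fits the explanation that every CLFLUSH carries a cost even when the line is no longer cached.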
