Why is prefetch speedup not greater in this example?


Question


In section 6.3.2 of this excellent paper, Ulrich Drepper writes about software prefetching. He says this is the "familiar pointer chasing framework", which I gather is the test he gives earlier about traversing randomized pointers. It makes sense in his graph that performance tails off when the working set exceeds the cache size, because then we are going to main memory more and more often.


But why does prefetch help only 8% here? If we are telling the processor exactly what we want to load, and we tell it far enough ahead of time (he does it 160 cycles ahead), why isn't every access satisfied by the cache? He doesn't mention his node size, so could there be some waste due to fetching a full line when only some of the data is needed?


I am trying to use _mm_prefetch with a tree and I see no noticeable speed up. I'm doing something like this:

_mm_prefetch((const char *)pNode->m_pLeft, _MM_HINT_T0);
// do some work
traverse(pNode->m_pLeft);
traverse(pNode->m_pRight);


Now that should only help one side of the traversal, but I see no change at all in performance. I did add /arch:SSE to the project settings. I'm using Visual Studio 2012 with an i7 4770. In this thread a few people also talk about getting only a 1% speedup with prefetch. Why does prefetch not work wonders for random access of data that's in main memory?

Answer


Prefetch can't increase the throughput of your main memory; it can only help you get closer to using all of it.


If your code spends many cycles on computation before even requesting data from the next node in a linked list, it won't keep the memory 100% busy. A prefetch of the next node as soon as the address is known will help, but there's still an upper limit. The upper limit is approximately what you'd get with no prefetching but also no work between loading a node and chasing the pointer to the next, i.e. the memory system fetching results 100% of the time.
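The overlap described above can be sketched for a simple linked-list walk; this is an illustrative example, not the question's actual code (`Node`, `sum_with_prefetch`, and the `payload` field are made-up names):

```cpp
#include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_T0

struct Node { Node* next; int payload; };

// Issue the prefetch for the *next* node first, then do this node's
// work, so the computation overlaps the in-flight memory access.
int sum_with_prefetch(Node* n) {
    int total = 0;
    while (n) {
        if (n->next)
            _mm_prefetch((const char*)n->next, _MM_HINT_T0);
        total += n->payload;  // stands in for "many cycles of computation"
        n = n->next;
    }
    return total;
}
```

If the per-node work here is tiny, the prefetch buys almost nothing, because the loop was already requesting the next node nearly as early as possible; that's the upper limit the answer is describing.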


Even prefetching 160 cycles of work ahead isn't far enough for the data to be ready, according to the graph in that paper. Random-access latency is apparently really high, since DRAM has to open a new page, a new row, and a new column.


I didn't read the paper in enough detail to see how he could prefetch multiple steps ahead, or to understand why a prefetch thread helped more than prefetch instructions. This was on a P4, not a Core or Sandybridge microarchitecture, and I don't think prefetch threads are still a thing. (Modern CPUs with hyperthreading have enough execution units and big enough caches that running two independent things on the two hardware threads of each core actually makes sense, unlike on P4, where there were fewer spare execution resources going unused for hyperthreading to exploit. The I-cache especially was a problem on P4, because it had only that small trace cache.)


If your code already loads the next node essentially right away, prefetching can't magically make it faster. Prefetching helps when it can increase the overlap between CPU computation and waiting for memory. Or maybe in your tests the ->left pointers were mostly sequential from when you allocated the memory, so hardware prefetching worked? If the tree is shallow enough, prefetching the ->right node (into the last-level cache, not L1) before descending the left side might be a win.
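That last suggestion might look like the sketch below. The `m_pLeft`/`m_pRight` names come from the question; `TreeNode`, `value`, and the summing traversal are illustrative assumptions, and `_MM_HINT_T2` is the hint that targets an outer cache level rather than L1:

```cpp
#include <xmmintrin.h>  // _mm_prefetch / _MM_HINT_T2

struct TreeNode {
    TreeNode* m_pLeft;
    TreeNode* m_pRight;
    int value;
};

// Hint the right child toward an outer cache level before the recursion
// disappears into the left subtree; by the time we come back for
// m_pRight, the line may already be much closer to the core.
long traverse(TreeNode* pNode) {
    if (!pNode) return 0;
    if (pNode->m_pRight)
        _mm_prefetch((const char*)pNode->m_pRight, _MM_HINT_T2);
    long total = pNode->value;
    total += traverse(pNode->m_pLeft);
    total += traverse(pNode->m_pRight);
    return total;
}
```

The point of `_MM_HINT_T2` here is to avoid evicting hot L1 lines for data that won't be touched until the entire left subtree has been visited.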


Software prefetching is only needed when the access pattern is not recognizable to the CPU's hardware prefetchers. (They're quite good, and can spot patterns with a decent-size stride, and track something like 10 forward streams (increasing addresses). Check http://agner.org/optimize/ for details.)
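For contrast, the kind of pattern hardware prefetchers *can't* follow is the randomized pointer chase from the paper's test. A minimal way to build one (`link_randomly` is a made-up helper, not from the paper or the question):

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

struct Node { Node* next; };

// Link the nodes of a contiguous array in shuffled order, so each ->next
// points to an unpredictable address: no stride or forward stream for
// the hardware prefetchers to latch onto.
Node* link_randomly(std::vector<Node>& nodes) {
    if (nodes.empty()) return nullptr;
    std::vector<size_t> order(nodes.size());
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937(42));
    for (size_t i = 0; i + 1 < order.size(); ++i)
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order.back()].next = nullptr;
    return &nodes[order.front()];  // head of the chain
}
```

Walking a chain built this way, with a working set larger than the last-level cache, is what makes each step pay the full random-access DRAM latency discussed above.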
