Decreasing cache misses through good design


Question

How can I decrease the number of possible cache misses when designing a C++ program?

Does inlining functions always help, or is it useful only when the program is CPU-bound (that is, computation-oriented rather than I/O-oriented)?

Recommended answer

Here are some things that I like to consider when working on this kind of code.


  • Consider whether you want "structures of arrays" or "arrays of structures". The better choice can differ for each part of the data, depending on which fields are accessed together.
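  • Try to keep structure sizes a multiple of 32 bytes, or more generally a size that evenly divides or is a multiple of the target's cache-line size (64 bytes on most current x86 processors), so they pack cache lines evenly.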
  • Partition your data into hot and cold elements. If you have an array of objects of class o, and you frequently use o.x, o.y, and o.z together but only occasionally need to access o.i, o.j, and o.k, then consider putting o.x, o.y, and o.z together and moving the i, j, and k parts to a parallel auxiliary data structure.
  • If you have multidimensional arrays of data, then with the usual row-major layout, access will be very fast when scanning along the preferred dimension and very slow along the others. Mapping the data along a space-filling curve instead will help to balance access speeds when traversing in any dimension. (Blocking techniques are similar: they're just Z-order with a larger radix.)
  • If you must incur a cache miss, then try to do as much as possible with that data in order to amortize the cost.
  • Are you doing anything multi-threaded? Watch out for slowdowns from cache coherence protocols (false sharing). Pad flags and small counters so that they'll be on separate cache lines.
  • SSE on Intel provides some prefetch intrinsics if you know what you'll be accessing far enough ahead of time.
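
To make the "structures of arrays" versus "arrays of structures" choice concrete, here is a minimal sketch. The particle types and the `total_mass` functions are hypothetical names invented for illustration; they are not from the original answer. Summing only the masses touches every cache line of the AoS array, while the SoA version reads one densely packed array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of structures: all fields of one particle sit next to each other.
struct ParticleAoS {
    float x, y, z;
    float mass;
};

// Structure of arrays: each field is stored contiguously across particles.
// (Hypothetical layout for illustration.)
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> mass;

    void add(float px, float py, float pz, float m) {
        x.push_back(px);
        y.push_back(py);
        z.push_back(pz);
        mass.push_back(m);
    }
};

// Reads 4 useful bytes out of every 16-byte element.
float total_mass(const std::vector<ParticleAoS>& particles) {
    float total = 0.0f;
    for (const ParticleAoS& p : particles) total += p.mass;
    return total;
}

// Reads only the mass array, so every byte fetched is used.
float total_mass(const ParticlesSoA& particles) {
    assert(particles.mass.size() == particles.x.size());
    float total = 0.0f;
    for (float m : particles.mass) total += m;
    return total;
}
```

If most passes instead use x, y, z, and mass of the same particle together, the AoS layout is likely the better fit, which is why the choice depends on the access pattern.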
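
The hot/cold partitioning idea can be sketched as follows. `HotPart`, `ColdPart`, and `Objects` are hypothetical names chosen for this example; the field names follow the o.x, o.y, o.z and o.i, o.j, o.k wording above, and the field types are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hot fields: used together on every frequent pass.
struct HotPart {
    float x, y, z;
};

// Cold fields: needed only occasionally.
struct ColdPart {
    int i, j, k;
};

// Hypothetical container: hot[n] and cold[n] describe the same object.
struct Objects {
    std::vector<HotPart> hot;
    std::vector<ColdPart> cold;

    void add(const HotPart& h, const ColdPart& c) {
        hot.push_back(h);
        cold.push_back(c);
    }
};

// The frequent pass walks only the hot array, so the cold fields
// never occupy cache lines during this loop.
float sum_of_coordinates(const Objects& objs) {
    assert(objs.hot.size() == objs.cold.size());
    float total = 0.0f;
    for (const HotPart& h : objs.hot) total += h.x + h.y + h.z;
    return total;
}
```

The cost of this layout is that code needing both halves of an object must index two arrays, so it pays off mainly when the cold fields really are rarely touched.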
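
For the space-filling-curve suggestion, one common choice is Z-order (Morton order), which interleaves the bits of the coordinates. The sketch below is one possible 2D version, not necessarily what the answer's author had in mind. `spread_bits`, `morton_index`, and `MortonGrid` are hypothetical names, and the grid assumes a square side length that is a power of two no larger than 32768.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Spread the low 16 bits of v so they occupy the even bit positions.
inline std::uint32_t spread_bits(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Interleave x (even bits) and y (odd bits) into a Z-order index.
inline std::uint32_t morton_index(std::uint32_t x, std::uint32_t y) {
    return spread_bits(x) | (spread_bits(y) << 1);
}

// Hypothetical square grid stored in Z-order.
class MortonGrid {
public:
    explicit MortonGrid(std::uint32_t side)
        : side_(side), data_(static_cast<std::size_t>(side) * side, 0.0f) {
        // Assumption: side is a power of two, so indices fill [0, side*side).
        assert(side != 0 && (side & (side - 1)) == 0 && side <= 32768u);
    }

    float& at(std::uint32_t x, std::uint32_t y) {
        assert(x < side_ && y < side_);
        return data_[morton_index(x, y)];
    }

    std::size_t size() const { return data_.size(); }

private:
    std::uint32_t side_;
    std::vector<float> data_;
};
```

With this layout, neighbours in either x or y tend to land in the same or a nearby cache line, at the price of a more expensive index computation than plain row-major addressing.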
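
Padding small counters onto separate cache lines might look like the sketch below. The 64-byte line size is an assumption that holds on many x86-64 processors but is not universal, and `PaddedCounter`, `PerThreadCounters`, and `cache_line_of` are hypothetical names for this example. The intent is that each thread updates only its own slot.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; check the target hardware before relying on it.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kMaxThreads = 8;

// alignas rounds the size of the struct up to a full cache line,
// so adjacent array elements never share a line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};
static_assert(sizeof(PaddedCounter) % kCacheLine == 0,
              "each counter should fill whole cache lines");

struct PerThreadCounters {
    PaddedCounter slots[kMaxThreads];

    // Intended to be called by thread number thread_index only.
    void increment(std::size_t thread_index) {
        assert(thread_index < kMaxThreads);
        slots[thread_index].value.fetch_add(1, std::memory_order_relaxed);
    }

    long total() const {
        long sum = 0;
        for (std::size_t i = 0; i < kMaxThreads; ++i)
            sum += slots[i].value.load(std::memory_order_relaxed);
        return sum;
    }
};

// Index of the cache line holding p, under the kCacheLine assumption.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}
```

Without the padding, eight `std::atomic<long>` counters would fit in a single 64-byte line, and every increment by one thread would invalidate that line in the caches of the others.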
