Measuring TLB miss handling cost in x86-64

Question

I want to estimate the performance overhead due to TLB misses on an x86-64 (Intel Nehalem) machine running Linux, and I would like to get this estimate from hardware performance counters. Does anybody have pointers on the best way to do this?

Thanks, 阿卡

Recommended answer

If you can get access to a "Westmere"-based system, the performance characteristics of your code should be quite similar to what you see on "Nehalem", and you will have access to a new hardware performance counter event that measures almost exactly what you want.

On Westmere, the best estimate of performance lost while waiting for TLB misses to be handled is probably from the hardware performance counter Event 08H, Mask 04H "DTLB_LOAD_MISSES.WALK_CYCLES", which is described as counting "Cycles Page Miss Handler is busy with a page walk due to a load miss in the Second Level TLB". This is described in "Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2" (document number: 253669), available online at http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html
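
As a rough illustration of how the event and mask above might be used, the sketch below builds the raw event string that Linux `perf` accepts (`r` followed by the unit mask and event select in hex) and turns two counter readings into a fraction of cycles spent in page walks. The function names and the sample counts are hypothetical, not from the answer, and the fraction is only an estimate: the event covers load misses only, and walk cycles can overlap with other work.

```python
def raw_event_code(event_select: int, umask: int) -> str:
    """Build a perf raw event string (rUUEE) from an Intel event select and unit mask.

    Assumes the usual Intel layout: event select in bits 0-7, unit mask in bits 8-15.
    """
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and unit mask must each fit in one byte")
    return f"r{(umask << 8) | event_select:04x}"


def walk_cycle_fraction(walk_cycles: int, total_cycles: int) -> float:
    """Fraction of all cycles during which the page miss handler was busy."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return walk_cycles / total_cycles


if __name__ == "__main__":
    # Event 08H, Mask 04H: DTLB_LOAD_MISSES.WALK_CYCLES on Westmere.
    event = raw_event_code(0x08, 0x04)
    print(f"perf stat -e {event},cycles ./your_program")

    # Made-up counter readings, for illustration only.
    print(walk_cycle_fraction(walk_cycles=1_500_000, total_cycles=100_000_000))
```

The printed command line is one plausible way to collect both counts in a single run; `./your_program` is a placeholder for the workload being measured.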

The reason this event is necessary is that TLB miss processing time is dominated by the time required to read the cache line containing the page table entry. If that cache line is in the L2 cache, then the overhead of a TLB miss will be very small (on the order of 10 cycles). If the line is in the L3 cache, then maybe 25 cycles. If the line is in memory, then ~200 cycles.

  • If there is also a miss in the upper-level page translation caches, it will take multiple trips to memory to find and retrieve the desired page table entry (e.g., https://stackoverflow.com/a/9674980/1264917).
  • On some processors the L2 cache counters can tell you how many table walks hit and missed in the L2, but not on Nehalem. (It would not help a lot in this case since TLB walks that hit in the L3 are also fairly fast and what you really want are the TLB walks that have to go to memory.)
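
To see why the location of the page table entry matters so much, here is a minimal sketch that averages the approximate latencies quoted above (about 10, 25, and 200 cycles) over an assumed split of where page walks find their entry. The split itself is hypothetical, since Nehalem does not expose it directly, and the model ignores the extra memory trips caused by misses in the upper-level translation caches.

```python
# Approximate per-walk latencies in cycles, taken from the answer's rough figures.
LATENCY_CYCLES = {"l2": 10, "l3": 25, "memory": 200}


def expected_walk_cycles(fractions: dict) -> float:
    """Average cycles per page walk, given the fraction of walks served by each level.

    `fractions` maps "l2", "l3", and "memory" to values that must sum to 1.
    """
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(share * LATENCY_CYCLES[level] for level, share in fractions.items())


if __name__ == "__main__":
    # Hypothetical split: most walks hit in cache, a few go to memory.
    split = {"l2": 0.70, "l3": 0.25, "memory": 0.05}
    print(expected_walk_cycles(split))
```

Even a small share of walks that go to memory dominates the average, which is why a cycle-counting event is more informative than a plain count of TLB misses.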
