In what circumstances can large pages produce a speedup?


Question

Modern x86 CPUs have the ability to support larger page sizes than the legacy 4K (i.e. 2MB or 4MB), and there are OS facilities (Linux, Windows) to access this functionality.
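For concreteness, here is a minimal sketch (not from the original question) of the explicit Linux route: backing an allocation with huge pages via mmap's MAP_HUGETLB flag, assuming huge pages have already been reserved through /proc/sys/vm/nr_hugepages.

  #include <sys/mman.h>  // mmap, munmap, MAP_HUGETLB (Linux-specific)
  #include <cstdio>
  #include <cstdlib>

  int main()
  {
    const size_t len=512u<<20;  // 512MB; must be a multiple of the huge page size
    void* p=mmap(0,len,PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB,-1,0);
    if (p==MAP_FAILED) {
      perror("mmap(MAP_HUGETLB)");  // fails if no huge pages are reserved/available
      return EXIT_FAILURE;
    }
    // ... use p: with 2MB pages, 512MB needs 256 TLB entries instead of 131072 ...
    munmap(p,len);
    return EXIT_SUCCESS;
  }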

The Microsoft link above states that large pages "increase the efficiency of the translation buffer, which can increase performance for frequently accessed memory". That isn't very helpful in predicting whether large pages will improve any given situation. I'm interested in concrete, preferably quantified, examples of where moving some program logic (or a whole application) to use huge pages has resulted in some performance improvement. Anyone got any success stories?

There's one particular case I know of myself: using huge pages can dramatically reduce the time needed to fork a large process (presumably because the number of page-table entries needing to be copied is reduced by a factor on the order of 1000). I'm interested in whether huge pages can also be a benefit in less exotic scenarios.

Answer

I tried to contrive some code which would maximise thrashing of the TLB with 4K pages, in order to examine the gains possible from large pages. The code below runs 2.6 times faster than with 4K pages when 2MByte pages are provided by libhugetlbfs's malloc (Intel i7, 64-bit Debian Lenny); hopefully it's obvious what scoped_timer and random0n do.

  // Requires <vector> and <algorithm>; scoped_timer (an RAII timer that
  // reports a rate on destruction) and random0n (a functor returning a
  // random value in [0,n), as std::random_shuffle expects) are the
  // author's own helpers.
  volatile char force_result;  // sink so the compiler can't elide the loop

  const size_t mb=512;
  const size_t stride=4096;    // one access per 4K page
  std::vector<char> src(mb<<20,0xff);
  std::vector<size_t> idx;
  for (size_t i=0;i<src.size();i+=stride) idx.push_back(i);
  random0n r0n(/*seed=*/23);
  std::random_shuffle(idx.begin(),idx.end(),r0n);  // visit pages in random order

  {
    scoped_timer t
      ("TLB thrash random",mb/static_cast<float>(stride),"MegaAccess");
    char hash=0;
    for (size_t i=0;i<idx.size();++i)
      hash=(hash^src[idx[i]]);  // each access touches a different 4K page
    force_result=hash;
  }
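Note that libhugetlbfs can supply 2MByte pages to malloc without source changes: one common route on Linux is to reserve huge pages and then run the binary with LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes, though the exact invocation depends on the libhugetlbfs version and distribution.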

A simpler "straight line" version with just hash=hash^src[i] gained only 16% from large pages, but (wild speculation) Intel's fancy prefetching hardware may be helping the 4K case when accesses are predictable (I suppose I could disable prefetching to investigate whether that's true).
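For reference, that "straight line" variant would look something like the sketch below; this is a reconstruction from the description above (reusing the same src, stride, and helpers), not the author's original code.

  {
    scoped_timer t
      ("TLB thrash sequential",mb/static_cast<float>(stride),"MegaAccess");
    char hash=0;
    for (size_t i=0;i<src.size();i+=stride)  // predictable strides: prefetch-friendly
      hash=(hash^src[i]);
    force_result=hash;  // same sink as before
  }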

