Forcing GCC to perform loop unswitching of memcpy runtime size checks?


Question


Is there any reliable way to force GCC (or any compiler) to factor out runtime size checks in memcpy() outside of a loop (where that size is not compile-time constant, but constant within that loop), specializing the loop for each relevant size range rather than repeatedly checking the size within it?


This is a test case reduced down from a performance regression reported here for an open source library designed for efficient in-memory analysis of large data sets. (The regression happens to be because of one of my commits...)


The original code is in Cython, but I've reduced it down to a pure C proxy as the following:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, k_local;
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        for(j = 0; j < k_local; ++j)
            out[i * stride_out_0 + j * stride_out_1] =
            in[idx * stride_in_0 + j * stride_in_1];
    }
}


The strides are variable; in general, the arrays are not even guaranteed to be contiguous (since they might be non-contiguous slices of larger arrays). However, for the particular case of C-contiguous arrays, I've optimized the above to the following:

#include <assert.h>
#include <string.h>

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, k_local;
    assert(stride_out_0 == k);
    assert(stride_out_0 == stride_in_0);
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        memcpy(&out[i * k_local], &in[idx * k_local],
               k_local * sizeof(double));
    }
}


(The asserts are not present in the original code; instead it checks for contiguity and calls the optimized version if possible, and the unoptimized one if not.)
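(The original code is Cython, but the dispatch described above might look roughly like the following C sketch; the function names and the exact contiguity predicate here are my assumptions, not code from the post:)

```c
#include <string.h>

/* Generic version: handles arbitrary strides. */
static void take_generic(double *out, double *in,
                         int so0, int so1, int si0, int si1,
                         int *indexer, int n, int k)
{
    int i, idx, j;
    for (i = 0; i < n; ++i) {
        idx = indexer[i];
        for (j = 0; j < k; ++j)
            out[i * so0 + j * so1] = in[idx * si0 + j * si1];
    }
}

/* Fast path: both arrays C-contiguous, so each row is one memcpy(). */
static void take_contiguous(double *out, double *in,
                            int *indexer, int n, int k)
{
    int i;
    for (i = 0; i < n; ++i)
        memcpy(&out[i * k], &in[indexer[i] * k],
               (size_t)k * sizeof(double));
}

/* Hypothetical dispatcher: check contiguity once, then pick a version. */
void take_dispatch(double *out, double *in,
                   int so0, int so1, int si0, int si1,
                   int *indexer, int n, int k)
{
    if (so0 == k && si0 == k && so1 == 1 && si1 == 1)
        take_contiguous(out, in, indexer, n, k);
    else
        take_generic(out, in, so0, so1, si0, si1, indexer, n, k);
}
```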


This version optimizes very well in most cases, since the normal use case is for small n and large k. However, the opposite use case does happen as well (large n and small k), and it turns out that for the particular case of n == 10000 and k == 4 (which cannot be ruled out as representative of an important part of a hypothetical workflow), the memcpy() version is 3.6x slower than the original. This is, apparently, mainly due to the fact that k is not a compile-time constant, as evidenced by the fact that this next version performs (almost or exactly, depending on optimization settings) as well as the original (or better, sometimes) for the particular case of k == 4:

    if (k_local == 4) {
        /* this optimizes */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }


Obviously, it's not practical to hardcode a loop for every particular value of k, so I attempted the following instead (as a first attempt that could later be generalized, if it worked):

    if (k_local >= 0 && k_local <= 4) {
        /* this does not optimize */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }



Unfortunately, this last version is no faster than the original memcpy() version, which is somewhat disheartening for my faith in GCC's optimization abilities.
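(One workaround along these lines, sketched here as my own suggestion rather than anything from the post, is to do the unswitching by hand: hoist the size test out of the row loop and give each small-k branch a memcpy() whose length is a compile-time constant, which GCC will expand inline.)

```c
#include <string.h>

/* Hand-unswitched sketch: each switch arm instantiates the row loop
 * with a compile-time-constant copy size, so GCC can inline-expand
 * the memcpy(). The default arm is the generic runtime-size fallback. */
#define TAKE_LOOP(K)                                        \
    for (i = 0; i < n; ++i) {                               \
        idx = indexer[i];                                   \
        memcpy(&out[i * (K)], &in[idx * (K)],               \
               (size_t)(K) * sizeof(double));               \
    }

void take_unswitched(double *out, double *in,
                     int *indexer, int n, int k)
{
    int i, idx;
    switch (k) {
    case 1:  TAKE_LOOP(1); break;
    case 2:  TAKE_LOOP(2); break;
    case 3:  TAKE_LOOP(3); break;
    case 4:  TAKE_LOOP(4); break;
    default: TAKE_LOOP(k); break; /* runtime-size fallback */
    }
}
```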


Is there any way I can give extra "hints" to GCC (through any means) that will help it do the right thing here? (And even better, are there "hints" that could reliably work across different compilers? This library is compiled for many different targets.)


The results quoted are for GCC 4.6.3 on 32-bit Ubuntu with the "-O2" flag, but I've also tested GCC 4.7.2 and "-O3" versions with similar (but not identical) results. I've posted my test harness to LiveWorkspace, but the timings are from my own machine using the time(1) command (I don't know how reliable LiveWorkspace timings are.)


EDIT: I've also considered just setting a "magic number" for some minimum size to call memcpy() with, and I could find such a value with repeated testing, but I'm not sure how generalizable my results would be across different compilers/platforms. Is there any rule of thumb I could use here?


FURTHER EDIT: Realized the k_local variable is kind of useless in this case, actually, since no aliasing is possible; this was reduced from some experiments I ran where it was possible (k was global) and I forgot I changed it. Just ignore that part.


EDIT TAG: Realized I can also use C++ in newer versions of Cython, so tagging as C++ in case there's anything that can help from C++...


FINAL EDIT: In lieu (for now) of dropping down to assembly for a specialized memcpy(), the following seems to be the best empirical solution on my local machine:

    int i, idx, j;
    double * subout, * subin;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    if (k < 32 /* i.e. 256 bytes: magic! */) {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            for(j = 0; j < k; ++j)
                subout[j] = subin[j];
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            memcpy(subout, subin, k * sizeof(double));
        }
    }


This uses a "magic number" to decide whether to call memcpy() or not, but still optimizes the case for small arrays that are known to be contiguous (so it's faster than the original, in most cases, since the original makes no such assumption).

Answer


Ultimately, the issue at hand is one of asking the optimizer to make assumptions about runtime behavior based on multiple variables. While it is possible to provide the optimizer some compile-time hints via 'const' and 'register' declarations on the key variables, ultimately you're depending on the optimizer to make a lot of assumptions. Further, while memcpy() may well be an intrinsic, it's not guaranteed to be, and even if/when it is, the implementations can vary fairly widely.


If the goal is to achieve maximum performance, sometimes you just can't rely on the technology to figure it out for you; you have to do it directly. The best advice for this situation is to use inline assembler to address the problem. Doing so allows you to avoid all of the pitfalls of a "black box" solution, courtesy of the heuristics of the compiler and optimizer, and to definitively state your intent. The key benefit of inline assembler is the ability to avoid any pushes/pops and extraneous "generalization" code in the solution to the memory-copy problem, and the ability to take direct advantage of the processor's capabilities. The downside is maintenance, but given that you really only need to address Intel and AMD to cover most of the market, it's not insurmountable.
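(As a minimal illustration of that approach, here is my own sketch of an x86-64 GNU-style inline-asm copy; it assumes GCC/Clang on x86-64 and falls back to memcpy() elsewhere, and is not code from the question.)

```c
#include <stddef.h>
#include <string.h>

/* Minimal x86-64 sketch: `rep movsq` copies n quadwords (8-byte units)
 * from RSI to RDI, using RCX as the count. The SysV ABI guarantees the
 * direction flag is clear on function entry, so the copy runs forward.
 * The "memory" clobber tells GCC that memory is read and written. */
static void copy_doubles(double *dst, const double *src, size_t n)
{
#if defined(__x86_64__)
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     : /* no other inputs */
                     : "memory");
#else
    memcpy(dst, src, n * sizeof(double)); /* portable fallback */
#endif
}
```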


I might add, too, that this solution could well allow you to take advantage of multiple cores/threads and/or a GPU if/when available to do the copying in parallel and truly get a performance gain. While the latency might be higher, the throughput would very likely be much higher, as well. If, for example, you could take advantage of a GPU when present, you could well launch one kernel per copy and copy thousands of elements in a single operation.


The alternative to this is to depend on the compiler/optimizer to make the best guesses for you, use the 'const' and 'register' declarations where you can to offer the compiler hints and use magic numbers to branch based on "best solution" paths... this, however, is going to be exceptionally compiler/system dependent and your mileage will vary widely from one platform/environment to another.
