Forcing GCC to perform loop unswitching of memcpy runtime size checks?


Question


Is there any reliable way to force GCC (or any compiler) to factor out runtime size checks in memcpy() outside of a loop (where that size is not compile-time constant, but constant within that loop), specializing the loop for each relevant size range rather than repeatedly checking the size within it?


This is a test case reduced down from a performance regression reported here for an open source library designed for efficient in-memory analysis of large data sets. (The regression happens to be because of one of my commits...)


The original code is in Cython, but I've reduced it to the following pure C proxy:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, j, k_local;
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        for(j = 0; j < k_local; ++j)
            out[i * stride_out_0 + j * stride_out_1] =
            in[idx * stride_in_0 + j * stride_in_1];
    }
}

The strides are variable; in general, the arrays are not even guaranteed to be contiguous (since they might be non-contiguous slices of larger arrays). However, for the particular case of C-contiguous arrays, I've optimized the above to the following:

void take(double * out, double * in,
          int stride_out_0, int stride_out_1,
          int stride_in_0, int stride_in_1,
          int * indexer, int n, int k)
{
    int i, idx, k_local;
    assert(stride_out_0 == k);
    assert(stride_out_0 == stride_in_0);
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    k_local = k; /* prevent aliasing */
    for(i = 0; i < n; ++i) {
        idx = indexer[i];
        memcpy(&out[i * k_local], &in[idx * k_local],
               k_local * sizeof(double));
    }
}


(The asserts are not present in the original code; instead it checks for contiguity and calls the optimized version if possible, and the unoptimized one if not.)


This version optimizes very well in most cases, since the normal use case is for small n and large k. However, the opposite use case does happen as well (large n and small k), and it turns out that for the particular case of n == 10000 and k == 4 (which cannot be ruled out as representative of an important part of a hypothetical workflow), the memcpy() version is 3.6x slower than the original. This is, apparently, mainly due to the fact that k is not compile-time constant, as evidenced by the fact that this next version performs (almost or exactly, depending on optimization settings) as well as the original (or better, sometimes), for the particular case of k == 4:

    if (k_local == 4) {
        /* this optimizes */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }


Obviously, it's not practical to hardcode a loop for every particular value of k, so I attempted the following instead (as a first attempt that could later be generalized, if it worked):

    if (k_local >= 0 && k_local <= 4) {
        /* this does not optimize */
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            memcpy(&out[i * k_local], &in[idx * k_local],
                   k_local * sizeof(double));
        }
    }


Unfortunately, this last version is no faster than the original memcpy() version, which is somewhat disheartening for my faith in GCC's optimization abilities.


Is there any way I can give extra "hints" to GCC (through any means) that will help it do the right thing here? (And even better, are there "hints" that could reliably work across different compilers? This library is compiled for many different targets.)


The results quoted are for GCC 4.6.3 on 32-bit Ubuntu with the "-O2" flag, but I've also tested GCC 4.7.2 and "-O3" versions with similar (but not identical) results. I've posted my test harness to LiveWorkspace, but the timings are from my own machine using the time(1) command (I don't know how reliable LiveWorkspace timings are.)


EDIT: I've also considered just setting a "magic number" for some minimum size to call memcpy() with, and I could find such a value with repeated testing, but I'm not sure how generalizable my results would be across different compilers/platforms. Is there any rule of thumb I could use here?


FURTHER EDIT: Realized the k_local variable is kind of useless in this case, actually, since no aliasing is possible; this was reduced from some experiments I ran where it was possible (k was global) and I forgot I changed it. Just ignore that part.


EDIT TAG: Realized I can also use C++ in newer versions of Cython, so tagging as C++ in case there's anything that can help from C++...


FINAL EDIT: In lieu (for now) of dropping down to assembly for a specialized memcpy(), the following seems to be the best empirical solution for my local machine:

    int i, idx, j;
    double * subout, * subin;
    assert(stride_out_1 == 1);
    assert(stride_out_1 == stride_in_1);
    if (k < 32 /* i.e. 256 bytes: magic! */) {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            for(j = 0; j < k; ++j)
                subout[j] = subin[j];
        }
    } else {
        for(i = 0; i < n; ++i) {
            idx = indexer[i];
            subout = &out[i * stride_out_0];
            subin = &in[idx * stride_in_0];
            memcpy(subout, subin, k * sizeof(double));
        }
    }


This uses a "magic number" to decide whether to call memcpy() or not, but still optimizes the case for small arrays that are known to be contiguous (so it's faster than the original, in most cases, since the original makes no such assumption).

Answer


Ultimately, the issue at hand is one of asking the optimizer to make assumptions about runtime behavior based on multiple variables. While it is possible to provide the optimizer some compile-time hints via the use of 'const' and 'register' declarations on the key variables, ultimately, you're depending on the optimizer to make a lot of assumptions. Further, while the memcpy() may well be intrinsic, it's not guaranteed to be and even if/when it is, the implementation(s) could vary fairly widely.


If the goal is to achieve maximum performance, sometimes you just have to not rely on technology to figure it out for you, but rather do it directly. The best advice for this situation is to use inline assembler to address the problem. Doing so allows you to avoid all the pitfalls of a "black box" solution courtesy of the compiler's and optimizer's heuristics, and to state your intent precisely. The key benefit of inline assembler is the ability to avoid any pushes/pops and extraneous "generalization" code in the solution to the memory copy problem, and the ability to take direct advantage of the processor's ability to solve the problem. The downside is maintenance, but given that you really only need to address Intel and AMD to cover most of the market, it's not insurmountable.


I might add, too, that this solution could well allow you to take advantage of multiple cores/threads and/or a GPU if/when available to do the copying in parallel and truly get a performance gain. While the latency might be higher, the throughput would very likely be much higher, as well. If, for example, you could take advantage of a GPU when present, you could well launch one kernel per copy and copy thousands of elements in a single operation.


The alternative to this is to depend on the compiler/optimizer to make the best guesses for you, use the 'const' and 'register' declarations where you can to offer the compiler hints and use magic numbers to branch based on "best solution" paths... this, however, is going to be exceptionally compiler/system dependent and your mileage will vary widely from one platform/environment to another.
