Why does Clang generate different code for reference and non-null pointer arguments?

Question

This is related to "Why can't GCC generate an optimal operator== for a struct of two int32s?". I was playing around with the code from that question at godbolt.org and noticed this odd behavior.

struct Point {
    int x, y;
};

bool nonzero_ptr(Point const* a) {
    return a->x || a->y;
}

bool nonzero_ref(Point const& a) {
    return a.x || a.y;
}

https://godbolt.org/z/e49h6d

For nonzero_ptr, clang -O3 (all versions) produces this or similar code:

    mov     al, 1
    cmp     dword ptr [rdi], 0
    je      .LBB0_1
    ret
.LBB0_1:
    cmp     dword ptr [rdi + 4], 0
    setne   al
    ret

This strictly implements the short-circuiting behavior of the C++ function, loading the y field only if the x field is zero.

For nonzero_ref, clang 3.6 and earlier generate the same code as they do for nonzero_ptr, but clang 3.7 through 11.0.1 produce

    mov     eax, dword ptr [rdi + 4]
    or      eax, dword ptr [rdi]
    setne   al
    ret

which loads y unconditionally. No version of clang is willing to do that when the parameter is a pointer. Why?

The only situation I can think of (on the x64 platform) where the behavior of the branching code would be observably different is when there's no memory mapped at [rdi+4], but I'm still unsure why clang would consider that case important for pointers and not references. My best guess is that there is some language-lawyery argument that references must be to "full objects" and pointers needn't be:

char* p = alloc_4k_page_surrounded_by_guard_pages();
int* pi = reinterpret_cast<int*>(p + 4096 - sizeof(int));
Point* ppt = reinterpret_cast<Point*>(pi);  // ok???
ppt->x = 42;  // ok???
Point& rpt = *ppt;  // UB???

But if the spec implies that, I'm not seeing how.

Recommended answer

This is a missed optimization; the branchless code is safe for both C++ source versions.
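To make the equivalence concrete, here is a minimal sketch of the branchless form written by hand in C++. The names nonzero_branchless and check_forms_agree are hypothetical (not from the question), and the static_assert only documents the layout that the [rdi + 4] displacement in the asm suggests. For any pointer to a complete, valid Point the two functions should return the same result; they can only differ in which memory they touch.

```cpp
#include <cassert>
#include <cstddef>

struct Point {
    int x, y;
};

// Assumption: y sits right after x, matching the [rdi + 4] access in the asm.
static_assert(offsetof(Point, y) == sizeof(int), "y directly follows x");

// Short-circuit form from the question: y is read only when x == 0.
bool nonzero_ptr(Point const* a) {
    return a->x || a->y;
}

// Hypothetical hand-written branchless form: both members are read
// unconditionally and OR'ed, similar in spirit to `or` followed by `setne`.
bool nonzero_branchless(Point const* a) {
    return (a->x | a->y) != 0;
}

// Checks that both forms agree on a few sample values.
void check_forms_agree() {
    const Point samples[] = {{0, 0}, {0, 7}, {7, 0}, {-1, 1}};
    for (const Point& p : samples) {
        assert(nonzero_ptr(&p) == nonzero_branchless(&p));
    }
}
```

Calling check_forms_agree() exercises all four combinations of zero and nonzero members; whether a given compiler emits the same asm for both functions is something to check on godbolt.org rather than assume.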

In "Why is gcc allowed to speculatively load from a struct?", GCC actually is speculatively loading both struct members through a pointer even though the C source only references one or the other. So at least the GCC developers have decided that this optimization is 100% safe under their interpretation of the C and C++ standards (I think that's intentional, not a bug). Clang generates a 0 or 1 index to choose which int to load, so clang is still just as reluctant as in your case to invent a load. (C vs C++: same asm with or without -xc, with a version of the source ported to work as either: https://godbolt.org/z/6oPKKd)
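For readers who don't want to follow the link, the shape of code being discussed is roughly the following. This is a hypothetical sketch, not the exact source from that question: Pair, pick, and check_pick are made-up names. Each path through the function names only one member, so a compiler that loads both and then selects is inventing one load.

```cpp
#include <cassert>

struct Pair {
    int first, second;
};

// Only one member is read on each path in the source. A compiler that treats
// *p as a complete object may load both members and select without a branch;
// one that refuses to invent loads may compute an index or offset instead.
int pick(const Pair* p, bool want_second) {
    return want_second ? p->second : p->first;
}

// Small self-check on a complete object, where either strategy is fine.
void check_pick() {
    const Pair sample{10, 20};
    assert(pick(&sample, false) == 10);
    assert(pick(&sample, true) == 20);
}
```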

The obvious difference in your asm is that the pointer version avoids accessing a->y if a->x != 0. That only matters for correctness (see footnote 1) if a->y is in an unmapped page; you're right that this is the relevant corner case.

But ISO C++ doesn't allow partial objects. The page-boundary setup in your example is, I'm pretty sure, undefined behaviour. In a path of execution that reads a->x, the compiler can assume it's safe to also read a->y.

This would of course not be the case for int *p; and p[0] || p[1], because it's totally valid to have an implicit-length 0-terminated array that happens to be 1 element long, in the last 4 bytes of a page.
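A minimal sketch of the array case, using a slightly simpler setup than a page boundary: a one-element array whose only element is nonzero. The function and array names are hypothetical. Here the short-circuit is what keeps the program well-defined, so a compiler may not turn it into an unconditional load of p[1].

```cpp
#include <cassert>

// Short-circuit check over the first two elements of an int array.
// p[1] is evaluated only when p[0] == 0.
bool first_two_nonzero(const int* p) {
    return p[0] || p[1];
}

// With a one-element array, reading p[1] would be out of bounds. The call is
// still valid because p[0] != 0, so the right-hand side is never evaluated.
void check_one_element_array() {
    const int only[1] = {5};
    assert(first_two_nonzero(only));
}
```

Unlike a Point, an int* carries no promise that a second element exists, which is why the same if-conversion would be unsafe here.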

As @Nate suggested in comments, perhaps clang simply doesn't take advantage of that ISO C++ fact when optimizing; maybe it internally transforms the code to something more like an array by the time it's considering this "if-conversion" type of optimization (branchy to branchless). Or maybe LLVM just doesn't let itself invent loads through pointers.

It can always do it for reference args because references are guaranteed non-NULL. It would be "even more" UB for the caller to do nonzero_ref(*ppt), like in your partial-object example, because in C++ terms we're dereferencing a pointer to the whole object.

bool nonzero_ptr_full_deref(Point const* pa) {
    Point a = *pa;
    return a.x || a.y;
}

https://godbolt.org/z/ejrn9h - compiles branchlessly, same as nonzero_ref. Not sure what / how much this tells us. This is what I expected, given that it makes access to a->y effectively unconditional in the C++ source.

Footnote 1: Like all mainstream ISAs, x86-64 doesn't do hardware race detection, so the possibility of loading something another thread might be writing only matters for performance, and then only if the full struct is split across a cache-line boundary since we're already reading one member. If the object doesn't span a cache line, any false-sharing performance effect is already incurred.

Making asm like this doesn't "introduce data-race UB" because x86 asm has well-defined behaviour for this possibility, unlike ISO C++. The asm works for any possible value loaded from [rdi+4] so it correctly implements the semantics of the C++ source. Inventing reads is thread-safe, unlike writes, and is allowed because it's not volatile so the access isn't a visible side-effect. The only question is whether the pointer must point to a full valid Point object.

Part of the reason data races (on non-atomic objects) are Undefined Behaviour is to allow for C++ implementations on hardware with race detection. Another is to allow compilers to assume that it's safe to reload something they accessed once and expect the same value, unless there's an acquire or seq_cst load between the two points; they may even generate code that would crash if the second load differed from the first. That's irrelevant in this case because we're not talking about turning one access into two (instead zero into one, whose value may not matter), but it is why roll-your-own atomics (e.g. in the Linux kernel) need to use volatile* casts for ACCESS_ONCE (https://lwn.net/Articles/793253/#Invented%20Loads).
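The volatile-cast idea can be sketched in C++ roughly as follows. This is a hypothetical helper loosely modeled on the idea behind ACCESS_ONCE, not the kernel's actual macro, and it is only meant for scalar types. In ISO C++ a racing access through it is still a data race; std::atomic is the portable tool.

```cpp
#include <cassert>

// Hypothetical helper: read obj through a volatile-qualified pointer so that
// each call in the source corresponds to one access the compiler must
// perform, rather than one it may drop, merge, or repeat.
template <typename T>
T read_once(const T& obj) {
    return *static_cast<const volatile T*>(&obj);
}

// Two calls in the source mean two volatile reads of x.
int sum_two_reads(const int& x) {
    return read_once(x) + read_once(x);
}

// Single-threaded self-check; it shows the helper's value semantics only,
// not any concurrency guarantee.
void check_read_once() {
    int flag = 42;
    assert(read_once(flag) == 42);
    assert(sum_two_reads(flag) == 84);
}
```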
