返回一个2元组的效率比std :: pair低吗? [英] Is returning a 2-tuple less efficient than std::pair?
问题描述
请考虑以下代码:
#include< utility>
#include< tuple>
std :: pair< int,int> f1()
{
return std :: make_pair(0x111,0x222);
}
std :: tuple< int,int> f2()
{
return std :: make_tuple(0x111,0x222);
}
Clang 3和4在x86-64上生成类似的代码: p>
f1():
movabs rax,0x22200000111
ret
f2():
movabs rax,0x11100000222;相反的包装顺序,不重要
ret
但是Clang 5为<$ c生成不同的代码
$ b
f2():
movabs rax,0x11100000222
mov QWORD PTR [rdi],rax
mov rax,rdi
ret
GCC 4至GCC 7也一样:
f2():
movabs rdx,0x11100000222
mov rax,rdi
mov QWORD PTR [rdi],rdx; GCC 4-6使用2个DWORD存储
ret
为什么生成的代码在返回时变得更糟一个 std :: tuple
,它适合单个寄存器,vs std :: pair
?它看起来特别奇怪,因为叮铛3和4看起来是最优的,但5不是。
在这里试试: https://godbolt.org/g/T2Yqrj
简短的回答是因为 gcc
和 clang 使用的
libstc ++
code>在Linux上使用非平凡的移动构造函数实现 std :: tuple
(特别是 _Tuple_impl
基类有一个非平凡的移动构造函数)。另一方面, std :: pair
的复制和移动构造函数都是默认的。
Gory细节
您在Linux上运行测试,该测试遵守SysV x86-64 ABI。该ABI具有将函数传递或返回到类或结构的特定规则,您可以阅读有关此处的更多信息。我们感兴趣的具体情况是这些结构中的两个 int
字段是否会得到 INTEGER
类或者 MEMORY
类。
A 最近的版本的ABI规范有这样的说法:
$ b
聚合的分类(结构和数组)和联合
类型的工作方式如下:
- 如果一个对象的大小大于八个八字节,或者它包含如果一个C ++对象有一个非平凡的拷贝构造函数或者一个非平凡的析构函数13,它就会被不可见的引用(
对象在参数列表中由具有类
INTEGER)14的指针替换。
- 如果聚集的大小超过单个八字节,则每个对象都被分类分别。每个八字节被初始化为class
NO_CLASS。
- 递归地分类对象的每个字段,以便始终考虑两个字段。
$ b 这里适用的条件(2)。请注意,它只提到拷贝构造函数,而不是移动构造函数 - 但显然它只是可能只是规范中的一个缺陷,因为引入了一般需要包含在任何分类中的移动构造函数算法,其中包含复制构造函数。特别是,gcc
的IA-64 cxx-abi记录在
如果参数类型对于调用而言并非微不足道,
调用者必须为临时分配空间并通过
引用传递临时值。特别是:
- 空间由调用者以通常的方式分配给临时的,通常在堆栈上。
,然后定义非平凡:
如果出现以下情况,则为调用而考虑非平凡:
- 它具有非平凡的拷贝构造函数,移动构造函数或析构函数, / li>
因此,从ABI角度来看,
元组不被认为是可以复制的 ,它会得到
中被调用传入的堆栈分配对象。MEMORY
处理,这意味着你的函数必须填充rdi std :: par
函数可以返回rax
中的整个结构,因为它适合一个EIGHTBYTE
并且拥有类INTEGER
。
这有什么关系吗?是的,严格来说,像你编译的那样的独立函数对于
元组
来说效率会更低,因为这个ABI不同点是烘焙进去的。
通常情况下,编译器将能够看到函数的主体并将其内联或执行跨程序分析,即使没有内联。在这两种情况下,ABI不再重要,至少在体面优化器的情况下,两种方法都可能同样有效。例如让我们调用您的
f1()
和<$ c $int add_pair(){
auto p = f1();
return p.first + p.second;
}
int add_tuple(){
auto t = f2();
return std :: get< 0>(t)+ std :: get< 1>(t);
原则上,
add_tuple
方法从一个缺点开始,因为它必须调用效率较低的f2()
,它还必须在栈上创建一个临时元组对象,以便它可以传递它至f2
作为隐藏参数。无论如何,这两个函数都被完全优化,直接返回正确的值:
pre $ add_pair():
mov eax,819
ret
add_tuple():
mov eax,819
ret
总的来说,你可以说这个ABI问题对
元组的影响会相对较弱:它为函数增加了一个小的固定开销,遵守ABI,但这对于非常小的函数来说只会在相对意义上很重要 - 但是这些函数可能会在可以内联的地方声明(或者如果不是,您将离开性能表)。
libcstc ++ vs libc +++
如上所述,这是一个ABI问题,而不是优化问题,本身。 clang和gcc都已经在ABI的约束下尽可能地优化库代码 - 如果它们为
生成代码如
它们会破坏ABI兼容的调用者。f1()
std :: tuple
如果切换到使用
libc ++
而不是Linux默认的libstdc ++
- 这个实现没有显式的移动构造函数(正如Marc Glisse在注释中提到的,它们是为了向后兼容而坚持使用该实现)。现在clang
(并且可能gcc,尽管我没有尝试过),生成相同的最优代码:
f1():#@ f1()
pre>
movabs rax,2345052143889
ret
f2():#@ f2()
movabs rax,2345052143889
ret
早期版本的Clang
为什么版本
clang
以不同的方式编译它?这只是 clang中的错误或规范中的错误,具体取决于如何你看看它。该规范没有明确包含在需要传递临时指针的情况下移动构建。不符合IA-64 C ++ ABI。例如,编译clang用于执行的方式与gcc
或较新版本的clang
不兼容。规范是最终更新和clang 在版本5.0中改变了行为。
更新: Marc Glisse ,最初有关非重要移动构造函数与C ++ ABI的交互以及
clang 中的混淆 / code>在某些时候改变了他们的行为,这可能解释了开关:
移动
的构造函数不清楚,并且在澄清时,clang将
更改为遵循ABI。这可能是其中的一种情况。
Consider this code:
#include <utility> #include <tuple> std::pair<int, int> f1() { return std::make_pair(0x111, 0x222); } std::tuple<int, int> f2() { return std::make_tuple(0x111, 0x222); }
Clang 3 and 4 generate similar code for both on x86-64:
f1(): movabs rax,0x22200000111 ret f2(): movabs rax,0x11100000222 ; opposite packing order, not important ret
But Clang 5 generates different code for
f2()
:f2(): movabs rax,0x11100000222 mov QWORD PTR [rdi],rax mov rax,rdi ret
As do GCC 4 through GCC 7:
f2(): movabs rdx,0x11100000222 mov rax,rdi mov QWORD PTR [rdi],rdx ; GCC 4-6 use 2 DWORD stores ret
Why is the generated code worse when returning a
std::tuple
that fits in a single register, vsstd::pair
? It seems especially strange since Clang 3 and 4 seemed to be optimal, yet 5 is not.Try it here: https://godbolt.org/g/T2Yqrj
解决方案The short answer is because the
libstc++
standard library implementation used bygcc
andclang
on Linux implementsstd::tuple
with a non-trivial move constructor (in particular, the_Tuple_impl
base class has a non-trivial move constructor). On the other hand, the copy and move constructors forstd::pair
are all defaulted.The Gory Details
You ran your tests on Linux, which adheres to the SysV x86-64 ABI. This ABI has specific rules for passing or returning classes or structures to functions, which you can read more about here. The specific case we are interested in with whether the two
int
fields in these structures will get theINTEGER
class or theMEMORY
class.A recent version of the ABI specification has this to say:
The classification of aggregate (structures and arrays) and union types works as follows:
- If the size of an object is larger than eight eightbytes, or it contains un- aligned fields, it has class MEMORY 12 .
- If a C++ object has either a non-trivial copy constructor or a non-trivial destructor 13 , it is passed by invisible reference (the object is replaced in the parameter list by a pointer that has class INTEGER) 14 .
- If the size of the aggregate exceeds a single eightbyte, each is classified separately. Each eightbyte gets initialized to class NO_CLASS.
- Each field of an object is classified recursively so that always two fields are considered. The resulting class is calculated according to the classes of the fields in the eightbyte
It is condition (2) that applies here. Note that it mentions only copy constructors, and not move constructors - but it is fairly apparently that just is probably just a defect in the specification given the introduction of move constructors which generally need to be included in any classification algorithm where copy constructors were included before. In particular, IA-64 cxx-abi, which
gcc
is documented to follow does include move constructors:If the parameter type is non-trivial for the purposes of calls, the caller must allocate space for a temporary and pass that temporary by reference. Specifically:
- Space is allocated by the caller in the usual manner for a temporary, typically on the stack.
and then the definition of non-trivial:
A type is considered non-trivial for the purposes of calls if:
- it has a non-trivial copy constructor, move constructor, or destructor, or
- all of its copy and move constructors are deleted.
So because
tuple
is not considered to be trivially copyable from an ABI perspective, it getsMEMORY
treatment, which means that your function must populate the stack allocated object passed in by the called inrdi
. Thestd::par
function can just pass back the entire structure inrax
since it fits in oneEIGHTBYTE
and has classINTEGER
.Does it matter? Yeah, strictly speaking, a standalone function like the one you have compiled will be less efficient for
tuple
since this ABI different is "baked in".Often however, the compiler will be able to see the body of the function and inline it or perform inter-procedural analysis even if not inlined. In both cases, the ABI is no longer important and it is likely both approaches would be equally efficient, at least with a decent optimizer. For example let's call your
f1()
andf2()
functions and do some math on the result:int add_pair() { auto p = f1(); return p.first + p.second; } int add_tuple() { auto t = f2(); return std::get<0>(t) + std::get<1>(t); }
In principle the
add_tuple
method starts from a disadvantage, since it has to callf2()
which is less efficient and it also has to create a temporary tuple object on the stack so it can pass it tof2
as the hidden parameter. Well no matter, both functions are fully optimized to just return the right value directly:add_pair(): mov eax, 819 ret add_tuple(): mov eax, 819 ret
So overall you can say that the effect of this ABI issue with
tuple
will be relatively muted: it adds a small fixed overhead to functions that must comply with the ABI, but this will only really matter in a relative sense for very small functions - but such functions are likely to be declared in a place where they can be inlined (or if not, you are leaving performance on the table).libcstc++ vs libc+++
As explained above, this is an ABI issue, not an optimization issue, per se. Both clang and gcc are already optimizing the library code to maximum extent possible under the constraints of the ABI - if they generated code like
f1()
for thestd::tuple
case they would break ABI compliant callers.You can see this clearly if you switch to using
libc++
rather than the Linux default oflibstdc++
- this implementation doesn't have the explicit move constructor (as Marc Glisse mentions in the comments, they are stuck with this implementation for backwards compatibility). Nowclang
(and presumably gcc although I didn't try it), generates the same optimal code in both cases:f1(): # @f1() movabs rax, 2345052143889 ret f2(): # @f2() movabs rax, 2345052143889 ret
Earlier Versions of Clang
Why do versions of
clang
compile it differently? It was simply a bug in clang or a bug in the spec depending on how you look at it. The spec didn't explicitly include move construction in the cases where a hidden pointer to a temporary needed to be passed. wasn't conforming to the IA-64 C++ ABI. For example compiled the way clang used to do it was not compatible withgcc
or newer versions ofclang
. The spec was eventually updated and the clang behavior changed in version 5.0.Update: Marc Glisse mentions in the comments that there was initially confusion about the interaction of non-trivial move constructors and the C++ ABI, and
clang
changed their behavior at some point, which probably explains the switch:The ABI specification for some argument passing cases involving move constructors were unclear, and when they were clarified, clang changed to follow the ABI. This is probably one of those cases.
这篇关于返回一个2元组的效率比std :: pair低吗?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!