C ++模板的性能? [英] C++ templates for performance?

查看:124
本文介绍了C ++模板的性能?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我在网上已经看到过几次有人提到C ++可以使用模板更快。



有人可以解释,包括在低级别为什么这是究竟?我总是认为这样一个好的功能会像最有用的概念那样有开销。



我真的很感兴趣从超低延迟的角度看!

$



在C中, qsort 获取指向比较函数的指针。一般来说,将有一个 qsort 代码的副本,它没有内联。它将通过指向比较例程的指针进行调用 - 这当然也不是内联的。



在C ++中, std :: sort 是一个模板,它可以使用一个函子对象作为比较器。对于用作比较器的每个不同类型,有一个不同的副本 std :: sort 。假设您使用重载 operator()的functor类,那么对比较器的调用可以轻松地内联到 std :: sort



所以,模板给你更多的内联,因为有更多的副本 sort 代码,每个代码可以内联一个不同的比较器。内联是一个很好的优化,并且排序例程做了很多比较,所以你可以经常测量 std :: sort 运行速度比同等的 qsort 。这样做的代价是更大代码的机会 - 如果你的程序使用了很多不同的比较器,那么你会得到很多不同的排序程序副本,每个都有一个不同的比较器。



原则上,没有理由C实现不能将 qsort 嵌入到它被调用的地方。然后,如果使用函数名调用它,则优化器在理论上可以观察到,在使用它的那一点上,函数指针必须仍然指向相同的函数。然后它可以内联对函数的调用,结果将类似于 std :: sort 的结果。但在实践中,编译器往往不采取第一步,内联 qsort 。这是因为(a)它是大的,和(b)它在一个不同的翻译单元,通常编译成一些库,你的程序链接,和(c)这样做,你会有一个内联的副本 qsort ,而不只是每个不同比较器的副本。所以它会比C ++更more肿,除非实现也可以找到一个方法来共同的代码在 qsort 在不同的地方调用相同的比较器。



因此,C中的 qsort 等通用函数往往由于函数调用而产生一些开销指针或其他间接[*]。 C ++中的模板是保持源代码通用,但确保它编译为专用函数(或几个这样的函数)的常见方法。特殊用途代码希望速度更快。



值得注意的是,模板不仅仅是关于性能。 std :: sort 在某些方面比 qsort 更通用。例如 qsort 只排序数组,而 std :: sort 可以排序提供随机访问迭代器的任何东西。它可以例如对 deque 进行排序,其中覆盖是分开分配的几个不相交的数组。因此,使用模板不一定提供任何性能优势,它可能由于其他原因。



[*]另一个排序示例 - qsort 使用一个整数参数说明数组的每个元素有多大,当它移动元素时,因此必须调用 memcpy 或类似于此变量的值。 std :: sort 在编译时知道元素的确切类型,因此确切的大小。它可以内联一个复制构造函数调用,反过来可能转换为复制该字节数的指令。与内联比较器一样,通常可以通过调用复制可变字节数的例程,将值4(或8)转换为4(或8或16)字节, ,或16,或任何)。和以前一样,如果你调用 qsort ,并且大小为字面值,并且调用 qsort 编译器可以在C中执行完全相同的优化。但在实践中,你看不到。


I have seen online a few times it has been mentioned that C++ can be ever faster using templates.

Could someone explain, including at a low level why this is exactly? I always presumed such a "nice" feature would have overhead like most useful concepts.

I am really intrigued by this from a ultra low latency perspective!

解决方案

A common example is sorting.

In C, qsort takes a pointer to a comparison function. Generally speaking, there will be one copy of the qsort code, which is not inlined. It will make a call through the pointer to the comparison routine -- this of course is also not inlined.

In C++, std::sort is a template, and it can take a functor object as comparator. There is a different copy of std::sort for each different type used as a comparator. Assuming you use a functor class with overloaded operator(), then the call to the comparator can easily be inlined into this copy of std::sort.

So, templates give you more inlining because there are more copies of the sort code, each of which can inline a different comparator. Inlining is quite a good optimization, and sort routines do a lot of comparisons, so you can often measure std::sort running faster than an equivalent qsort. The cost of this is the chance of much larger code -- if your program uses a lot of different comparators then you get a lot of different copies of the sort routine, each with a different comparator baked into it.

In principle there's no reason why a C implementation can't inline qsort into the place it is called. Then if it was called with the name of the function, the optimizer could in theory observe that at the point it is used, the function pointer must still point to that same function. Then it can inline the call to the function, and the result would be similar to the result with std::sort. But in practice, compilers tend not to take the first step, inlining qsort. That's because (a) it's large, and (b) it's in a different translation unit, usually compiled into some library that your program is linked against, and (c) to do it this way, you'd have an inlined copy of qsort for every call to it, not just a copy for every different comparator. So it would be even more bloated than the C++, unless the implementation could also find a way to common up the code in cases where qsort is called in different places with the same comparator.

So, general-purpose functions like qsort in C tend to have some overheads on account of calls through function pointers, or other indirection[*]. Templates in C++ are a common way of keeping the source code generic, but ensuring that it compiles to a special-purpose function (or several such functions). The special-purpose code hopefully is faster.

It's worth noting that templates are not by any means just about performance. std::sort is itself more general-purpose than qsort in some ways. For example qsort only sorts arrays, whereas std::sort can sort anything that provides a random-access iterator. It can for example sort a deque, which under the covers is several disjoint arrays allocated separately. So the use of templates doesn't necessarily provide any performance benefit, it might be done for other reasons. It just happens that templates do affect performance.

[*] another example with sorting - qsort takes an integer parameter saying how big each element of the array is, and when it moves elements it therefore must call memcpy or similar with the value of this variable. std::sort knows at compile-time the exact type of the elements, and hence the exact size. It can inline a copy constructor call that in turn might translate to instructions to copy that number of bytes. As with the inlined comparator, it's often possible to copy exactly 4 (or 8, or 16, or whatever) bytes faster than you'd get by calling a routine that copies a variable number of bytes, passing it the value 4 (or 8, or 16, or whatever). As before, if you called qsort with a literal value for the size, and that call to qsort was inlined, then the compiler could perform the exact same optimization in C. But in practice you don't see that.

这篇关于C ++模板的性能?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆