为什么std :: sort比手工编码的内容快〜20%? [英] Why std::sort is ~20% faster then introsort coded by hand?

查看:68
本文介绍了为什么std :: sort比手工编码的内容快〜20%?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Hello C ++ Gurus!


我正在比较排序算法的研究目标。我已经在数百万(数千万)的
整数数组排序中比较了STL

std :: sort和手工编码的introsort。它是随机测试的,经过排序,并在

反向数组中排序。在所有测试中对这么大的数组进行std :: sort比我的版本更快〜/ 4
。所以它似乎有一些调用优化。


我的''手工编码''版本基于D. Musser的建议和

一些我在不同的STL安装中找到的STL标题(GCC,

和MSVS2008)。在两种情况下都使用了最大优化标志,对于g ++和cl,使用了
。但是''手工编码''版本仍然较慢。


为什么?可能是生产std :: sort版本它是用于内部或汇编器优化的一些SIMD

优化?


提前谢谢。

Alex。


内容版本的Pseudocode(C ++,可能是错误):


//代码

模板< typename T>

内联无效交换(T * a,T * b)

{

T temp = * a;

* a = * b;

* b = temp;

}


模板< typename T>

inline void cmpswap(T * a,T * b)

{

if(* a * b){

T temp = * a;

* a = * b;

* b = temp;

}

}


模板< typename T>

内联T *分区(T * first,T * last )

{

T * v = last-1;

first--;


做{

while(*(++ first)* v);

while(* last - < * v)if(first == last)break;


if(first> = last)b reak;

swap(first,last);

} while(true);


swap(v,first);

首先返回++;

}


模板< typename T>

inline void median3(T * first,T * last)

{

T * r = last-1,* rr = last-2;

swap(first +(last - first)/ 2,rr);

cmpswap(first,rr);

cmpswap(first,r);

cmpswap(rr,r);

}


模板< typename T>

inline void _introsort(T *首先,T * last,int recurs_depth)

{

while(last - first M){

if(recurs_depth--)return hsort(第一个,最后一个);


median3(第一个,最后一个);

T * k =分区(第一个+ 1,最后一个);

introsort(k,last,recurs_depth);

last = k;

}

}


模板< typename T>

void introsort(T * first,T * last,int recurs_depth)

{

_introsort(first,last,recur s_depth);

插入(第一个,最后一个);

}


//代码结束

Hello C++ Gurus!

I''m comparing sorting algorithm for study goals. I''ve compared STL
std::sort and hand-coded introsort on millions (tens of millions) of
integers array sorting. It was tested on random, sorted, and sorted in
reverse arrays. In all tests on such a huge arrays std::sort is faster
by ~1/4 then my version. So it seems that it some call optimization.

My ''hand coded'' version is based on D. Musser''s suggestions and on
some STL header which I''ve found in diffrent STL installations (GCC,
and MSVS2008). The maximum optimization flags was used in both cases,
for g++ and cl. But ''hand-coded'' version is still slower.

Why? May be in production std::sort version it is used some SIMD
optimization internally or assembler optiomization?

Thanks in advance.
Alex.

Pseudocode (C++, may be bugs) of introsort version:

// CODE

template <typename T>
inline void swap(T *a, T *b)
{
T temp = *a;
*a = *b;
*b = temp;
}

template <typename T>
inline void cmpswap(T *a, T *b)
{
if (*a *b) {
T temp = *a;
*a = *b;
*b = temp;
}
}

template <typename T>
inline T *partition(T *first, T *last)
{
T *v = last-1;
first--;

do {
while(*(++first) *v);
while(*last-- < *v) if (first == last) break;

if (first >= last) break;
swap(first,last);
} while(true);

swap(v,first);
return ++first;
}

template <typename T>
inline void median3(T *first, T *last)
{
T *r = last-1, *rr = last-2;
swap(first + (last - first)/2 , rr);
cmpswap(first, rr);
cmpswap(first, r);
cmpswap(rr, r);
}

template <typename T>
inline void _introsort(T *first, T *last, int recurs_depth)
{
while(last - first M) {
if (recurs_depth--) return hsort(first, last);

median3(first, last);
T *k = partition(first+1, last-1);
introsort(k,last, recurs_depth);
last = k;
}
}

template <typename T>
void introsort(T *first, T *last, int recurs_depth)
{
_introsort(first, last, recurs_depth);
insertion(first, last);
}

// CODE END

推荐答案

ikarus写道:
ikarus wrote:

为什么?
Why?



因为std :: sort()跳过小于

特定大小的子分区分区,而是使用插入排序对这些分区进行排序。 br />
这大约增加了20%的速度。


我不知道分区在跳过之前必须有多小。

对它进行子分区,但根据我自己的测试,15 / b
和25个元素之间的东西似乎是最优的(尽管它取决于元素类型)。


如果你将这个额外的步骤添加到你的排序算法中,你将很可能看到~20%的速度增加。

Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.
That accounts for approximately that 20% speed increase.

I don''t know exactly how small the partition must be before it skips
subpartitioning it, but according to my own tests something between 15
and 25 elements seems optimal (although it depends on the element type).

If you add this additional step to your sorting algorithm you will
most probably see the ~20% speed increase.


Juha Nieminen写道:
Juha Nieminen wrote:

因为std :: sort()跳过小于

特定大小的子分区分区,而是使用插入排序对这些分区进行排序。
Because std::sort() skips subpartitioning partitions smaller than a
certain size, and instead sorts those partitions with insertion sort.



哦,顺便说一句:当我说插入排序时,我的意思是插入排序。

我做*不*意味着冒泡排序,选择排序或任何其他类型的排序。

I * do *表示插入排序。没有别的。


(是的,我已经看到很多关于这个主题的误解。插入

sort是最优化的基于比较的排序算法非常

小数组。别打扰其他任何东西。)

Oh, and by the way: When I say insertion sort, I mean insertion sort.
I do *not* mean bubble sort, selection sort or any other type of sort.
I *do* mean insertion sort. Nothing else.

(Yes, I have seen tons of misconceptions about this subject. Insertion
sort is the most optimal comparison-based sorting algorithm for very
small arrays. Don''t even bother trying anything else.)


文章< aae2f3e6-16c1-4e60-a214-
90**********@c58g2000hsc.googlegroups.com > ;, Ka ************* @ gmail.com

说...
In article <aae2f3e6-16c1-4e60-a214-
90**********@c58g2000hsc.googlegroups.com>, Ka*************@gmail.com
says...

你好C ++大师!


我在比较研究目标的排序算法。我已经在数百万(数千万)的
整数数组排序中比较了STL

std :: sort和手工编码的introsort。它是随机测试的,经过排序,并在

反向数组中排序。在所有测试中对这么大的数组进行std :: sort比我的版本更快〜/ 4
。所以它似乎有些调用优化。
Hello C++ Gurus!

I''m comparing sorting algorithm for study goals. I''ve compared STL
std::sort and hand-coded introsort on millions (tens of millions) of
integers array sorting. It was tested on random, sorted, and sorted in
reverse arrays. In all tests on such a huge arrays std::sort is faster
by ~1/4 then my version. So it seems that it some call optimization.



虽然不知道_sure_是不可能的,甚至不知道你正在查看哪个确实的b $ b库实现,显而易见猜测将是

图书馆正在使用修改后的内含物,这基本上与修改后的快速排序相同,但是(显然已经足够)与额外的

预订和更换内存的转换。


改进的quicksort的想法非常简单:低于一些

项,简单排序(插入排序,选择排序,冒泡排序)

通常比快速排序快。因此,当你达到那个

的门槛时,你最好切换到别的东西(并且

门槛很少非常关键)。


进一步的优化是由于Robert Sedgewick:他注意到你

仍有相当多的开销,反复调用所有那些简单的类别

小分区。他还指出,插入分类

(与其他任何东西不同),速度并不真正取决于被分类的

项目的数量,而是严格按照距离他们需要在排序过程中移动
- 因此,不是每隔一次调用一次

时分区达到一定的大小阈值,你只需要_stop_做

当大小达到阈值时更多排序,然后当所有

分区减少到那个大小时,你调用插入排序

_once_为整个系列。即使你在一个大型集合上使用插入

排序,它的速度也与

分区的大小成比例(当你停止分区时)而不是

整个系列的大小。


这个差异取决于_heavily_对

函数调用的速度。例如,它在x86

上比在SPARC上产生更大的差异。基于您注意到的20%的差异,这是一个

公平的猜测,您可能正在测试x86或者非常类似于
的东西。更重要的是,这个开销也会影响你应该停止分区的那一点 - 没有一个尺寸'

适合所有人架构(虽然,如上所述,门槛

通常并不重要)。


-

后来,

杰瑞。


宇宙是自己想象的虚构。

While it''s impossible to say for _sure_ without even knowing which exact
library implementation you''re looking at, the obvious guess would be
that the library is using a "modified introsort", which is essentially
the same as a modified quicksort, but (obviously enough) with the extra
book-keeping and switching of an introsort added.

The idea of a modified quicksort is pretty simple: below some number of
items, the "simple" sorts (insertion sort, selection sort, bubble sort)
are generally faster than a quicksort. Therefore, when you reach that
threshold, you''re better off switching to something else (and the
threshold is rarely very critical).

A further optimization is due to Robert Sedgewick: he noted that you
still have a fair amount of overhead repeatedly calling the simple sort
on all those small partitions. He also noted that with an insertion sort
(unlike anything else) the speed doesn''t really depend on the number of
items being sorted, but strictly upon the distances they need to be
moved during sorting -- therefore, rather than calling it once every
time a partition reaches a certain size threshold, you just _stop_ doing
more sorting when the size reaches the threshold, and then when all the
partitions have been reduced to that size, you call an insertion sort
_once_ for the entire collection. Even though you''re using an insertion
sort on a large colletion, its speed is proportional to the sizes of the
partitions (when you stopped partitioning) rather than to the size of
the entire collection.

The difference this makes will depend _heavily_ upon the speed of
function calls. For example, it makes a MUCH larger difference on an x86
than it does on a SPARC. Based on the 20% difference you noted, it''s a
fair guess that you were probably testing on an x86 or something very
similar. More importantly, that overhead also has an effect on the point
at which you should stop partitioning -- there''s no one size that''s
right for all architectures (though, as I mentioned above, the threshold
isn''t usually critical).

--
Later,
Jerry.

The universe is a figment of its own imagination.


这篇关于为什么std :: sort比手工编码的内容快〜20%?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆