Which sorting algorithm works best on very large data sets?


Question

I was searching the Internet to find which sorting algorithm is best suited to a very large data set. Many people hold that merge sort is best because it is fair and guarantees O(n log n) time complexity, whereas quicksort is not safe. It is also true that variations of quicksort can be unsafe, because a real data set can be anything.

If swapping two elements has negligible time cost, then why can't we choose heap sort as the best sorting algorithm in this case, since it is in place as well as O(n log n)?

Merge sort requires another O(n) space; if the data is very large, then we can't use this algorithm.

Please tell me: which algorithm should be the best in this scenario?

Recommended answer

There's no one algorithm that's clearly the "best" algorithm. It depends on a bunch of factors.

For starters, can you fit your data into main memory? If you can't, then you'd need to rely on an external sorting algorithm. These algorithms are often based on quicksort and mergesort.
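To make the external-sorting idea concrete, here is a minimal sketch of a mergesort-based external sort. It is an illustration under stated assumptions, not a production design: the function name `external_sort`, the `chunk_size` parameter, and the one-integer-per-line run files are all made up for this example.

```python
import heapq
import os
import tempfile


def external_sort(values, chunk_size):
    """Yield the ints from `values` in sorted order.

    Assumption for this sketch: only about `chunk_size` items are held
    in memory at a time. Phase 1 spills sorted runs to temporary files;
    phase 2 does a k-way merge of the runs with heapq.merge.
    """
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to disk as one run.
        chunk.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in chunk)
        run_paths.append(path)
        chunk.clear()

    for v in values:
        chunk.append(v)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(p) for p in run_paths]
    try:
        # Each run is read lazily, one line at a time.
        runs = [(int(line) for line in f) for f in files]
        yield from heapq.merge(*runs)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3))` writes three runs and merges them. Real external sorts also tune run size, buffering, and the number of merge passes, which this sketch ignores.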

Second, do you know anything about your input distribution? If it's mostly sorted, then something like Timsort might be a great option, since it's designed to work well on sorted data. If it's mostly random, Timsort is probably not a good choice.
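One way to see the effect of input order is to count comparisons. The sketch below assumes CPython, whose built-in `sorted` is a Timsort-derived algorithm; the `Counted` wrapper and `count_comparisons` helper are hypothetical names introduced only for this illustration.

```python
import random


class Counted:
    """Wraps a value and counts how often two instances are compared."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value


def count_comparisons(values):
    # Reset the shared counter, sort wrapped copies, report the count.
    Counted.comparisons = 0
    sorted(Counted(v) for v in values)
    return Counted.comparisons


n = 10_000
already_sorted = list(range(n))
shuffled = already_sorted[:]
random.Random(0).shuffle(shuffled)  # fixed seed so runs are repeatable

sorted_cost = count_comparisons(already_sorted)
random_cost = count_comparisons(shuffled)
print(sorted_cost, random_cost)
```

On already-sorted input the comparison count stays roughly linear in `n`, while on shuffled input it grows like n log n, which is the adaptivity the paragraph above describes.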

Third, what kind of elements are you sorting? If you are sorting generic objects, then you're pretty much locked into comparison sorting. If not, perhaps you could use a non-comparison sort like counting sort or radix sort.
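As an illustration of a non-comparison sort, here is a minimal LSD radix sort built from stable counting-sort passes. It is a sketch that assumes the keys are non-negative integers; the function name `radix_sort` and the `base` parameter are choices made for this example.

```python
def radix_sort(nums, base=256):
    """LSD radix sort for non-negative integers; returns a new list."""
    out = list(nums)
    if not out:
        return out
    if min(out) < 0:
        raise ValueError("this sketch only handles non-negative integers")
    max_val = max(out)
    place = 1
    while place <= max_val:
        # Stable counting sort keyed on the current digit.
        counts = [0] * base
        for x in out:
            counts[(x // place) % base] += 1
        # Prefix sums turn digit counts into starting positions.
        total = 0
        for d in range(base):
            counts[d], total = total, total + counts[d]
        tmp = [0] * len(out)
        for x in out:
            d = (x // place) % base
            tmp[counts[d]] = x
            counts[d] += 1
        out = tmp
        place *= base
    return out
```

No element is ever compared with another; the running time is proportional to the number of elements times the number of digits, at the cost of O(n + base) extra space.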

Fourth, how many cores do you have? Some sorting algorithms (quicksort, mergesort, MSD radix sort) parallelize really well, while others do not (heapsort).

Fifth, how are your data represented? If they're stored in an array, quicksort or a quicksort variant will likely do well because of locality of reference, while mergesort might be slow due to the extra memory needed. If they're in a linked list, though, the locality of reference from quicksort goes away and mergesort suddenly becomes competitive again.
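The linked-list case can be sketched as follows. One commonly cited reason mergesort suits linked lists (not spelled out in the answer itself) is that merging only relinks nodes, so no auxiliary array is needed. The `Node` class and the helper names are assumptions made for this example.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def merge_sort_list(head):
    """Sort a singly linked list by relinking nodes; returns the new head."""
    if head is None or head.next is None:
        return head
    # Find the middle with slow/fast pointers and cut the list in two.
    slow, fast = head, head.next
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
    right = slow.next
    slow.next = None
    left = merge_sort_list(head)
    right = merge_sort_list(right)
    # Merge by splicing nodes; ties take the left node, keeping it stable.
    dummy = tail = Node(None)
    while left is not None and right is not None:
        if right.value < left.value:
            tail.next, right = right, right.next
        else:
            tail.next, left = left, left.next
        tail = tail.next
    tail.next = left if left is not None else right
    return dummy.next


def from_list(values):
    # Build a linked list from a Python list (helper for the example).
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def to_list(head):
    # Collect node values back into a Python list (helper for the example).
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out
```

For instance, `to_list(merge_sort_list(from_list([4, 1, 3, 2])))` gives the values in ascending order. The only extra space is the recursion stack, which is O(log n) deep.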

The best option is probably to take a lot of different factors into account and then make a decision from there. One of the reasons it's so fun to design and study algorithms is that there's rarely one single best choice; often, the best option depends a ton on your particular situation and changes based on what you're seeing.

(You mentioned a few details about quicksort, heapsort, and mergesort that I wanted to touch on before wrapping up this answer. While you're right that quicksort has a degenerate O(n^2) worst case, there are many ways to avoid this. The introsort algorithm keeps track of the recursion depth and switches to heapsort if it looks like the quicksort will degenerate. This guarantees O(n log n) worst-case behavior with low memory overhead and maximizes the amount of benefit you get from quicksort. Randomized quicksort, while still having an O(n^2) worst case, has a vanishingly small probability of actually hitting that worst case.

Heapsort is a good algorithm in practice, but isn't as fast as the other algorithms in some cases because it doesn't have good locality of reference. That said, the fact that it never degenerates and needs only O(1) auxiliary space is a huge selling point.

Mergesort does need a lot of auxiliary memory, which is one reason why you might not want to use it if you have a huge amount of data to sort. It's worth knowing about, though, since its variants are widely used.)
