Scala集合对Performance进行排序,sortWith和sortBy [英] Scala Collection sorted, sortWith and sortBy Performance

查看:387
本文介绍了Scala集合对Performance进行排序,sortWith和sortBy的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

Scala在标准库中包括几种对列表进行排序的方法,例如,对列表 list 进行排序,可以使用以下方法:

Scala includes several methods in the standard library for sorting a list, for example to sort the list list, one can use:

list.sorted
list.sortWith(_<_)
list.sortBy(x=>x)

虽然这可能是对列表进行排序的最简单方法,但我发现对于较大的列表,它们具有明显的性能缺陷.

While these might be the simplest ways to sort a list, I found that for larger lists, they have significant performance drawback.

例如,要对一百万个整数进行排序,排序平均需要500毫秒,而sortWith和sortBy则需要700毫秒左右.相比之下,scala.util.Sorting.quickSort需要120毫秒左右,而java.util.Arrays.sort需要100毫秒左右.对于较大的列表,随着我们进一步扩展,会观察到这种多因素差异.如下图所示.

For example, to sort one million integers, sorted takes on average 500ms, while sortWith and sortBy take around 700ms. This is compared to scala.util.Sorting.quickSort which takes around 120ms and java.util.Arrays.sort which takes around 100ms. For larger lists, this multiple factor difference is observed as we scale further. The pattern is shown in the following chart.

这种性能滞后的原因是什么?为什么不将更有效的算法/实现用于标准方法呢?

What is the reason for this lag in performance? And why aren't more efficient algorithms/implementations used for the standard methods?

推荐答案

注意两条线的斜率如何相同,但彼此偏移吗?使用对数标度,我们正在寻找一个恒定的因子差异. sorted和朋友支付将List转换为Array,排序(实际上是使用java.util.Arrays.sort)并转换回List的成本. scala.util.Sorting.quickSortjava.util.Arrays.sort直接对数组进行操作. quicksort的n log n性能中的log n因素在很大程度上是不相关的,因此,创建数组所需的线性时间以及结果列表中,我们最终得到了一个恒定的因子差.性能下降五倍可能看起来很可怕,但请记住List每个元素都有一个cons单元,这会在创建Array时产生大量的随机访问,然后创建新的List则需要花费一些时间来分配内存,很可能是一两个垃圾收集周期.

Notice how the lines have the same slope, but are offset from each other? With the logarithmic scale, we're looking at a constant factor difference. sorted and friends pay the cost of converting a List to an Array, sorting (with java.util.Arrays.sort, in fact), and converting back to List. scala.util.Sorting.quickSort and java.util.Arrays.sort operate on arrays directly. The log n factor in quicksort's n log n performance is largely irrelevant, so with the linear time needed to create the Array and the resulting list we end up with a constant factor difference. Five times worse performance might looking horrible, but remember the List has a cons cell for each element, which makes for massive amounts of random access when creating the Array, and then creating the new List requires time spent allocating memory, and in all likelihood, a garbage collection cycle or two.

对于基元列表,甚至更糟. List是通用的,因此必须将所有原语装箱,这会增加另一层间接寻址.不幸的是,所创建的Array也包含装箱的值.实际上,当您真的要对Array[Int]进行排序时,您最终对Array[java.lang.Integer]进行了排序.

For lists of primitives, it's even worse. List is generic, so any primitives must be boxed, which adds another layer of indirection. And unfortunately the Array that is created holds boxed values as well. Effectively, you end up sorting an Array[java.lang.Integer] when you really want to be sorting an Array[Int].

总结:排序算法是相同的,但是有充分的理由说明可变数组的性能优于不可变的单链列表.

To summarize: the sorting algorithms are identical, but there are good reasons why mutable arrays outperform immutable singly-linked lists.

这篇关于Scala集合对Performance进行排序,sortWith和sortBy的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆