在Python中对集合进行排序与​​对列表进行排序之间的时间差异很大 [英] Huge difference in timing between sorting a set vs sorting a list in Python

查看:135
本文介绍了在Python中对集合进行排序与​​对列表进行排序之间的时间差异很大的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想知道我应该将数据结构设置为集合还是列表.通常,我会进行设置操作,但最后需要对它进行排序.

I was wondering whether I should have my data structure as a set or a list. Mostly I will do set operations, but in the end I will need to sort it.

我想知道是否应该先将集合设为列表,然后使用sorted(list(my_set)),还是立即对集合进行排序sorted(my_set).可以说,我可能会考虑一个一般的列出"阶段,因为在那个时间点有一个有序的迭代可能是有意义的.

I wondered whether I should first make the set a list, and then use sorted(list(my_set)), or just sort the set immediately sorted(my_set). Arguably, I might consider a general "to list" phase, since having an ordered iterable at that point in time might make sense anyway.

所以我决定测试一下,希望列表更快.

So I decided to test it, expecting a list to be quicker.

基准标记:

import time
def sorter(x):
    t1 = time.time()
    for i in range(1000000):
        sorted(x)
    return time.time() - t1

数据:

one = range(1000)
a1 = list(one)
b1 = set(one)
sorter(a1)
# time: 16.5 s 
sorter(b1)
# time: 20.7 s

然后,我意识到这可能与元素已经存在的事实有关,并记住了

I then realized it might have to do with the fact that the elements are already in place, and remembered this amazing question & answer.

然后,我尝试了一些随机数据:

Then, I tried some random data:

two = numpy.random.randint(1, 1000, 1000)
a2 = list(two)
b2 = set(two)

结果:

sorter(a2)
# time: 4min 49s
sorter(b2)
# time: 18.9 s

巨大的差异,这是怎么回事?

Huge difference, what is going on?

奖金:sorted(set(a_list))甚至在一分钟的时间出现,比sorted(a_list)快得多.

Bonus: It even appears at a timing of one minute, that sorted(set(a_list)) is impressively faster than sorted(a_list).

实际上,在第二种情况下,可能会有重复项被过滤,从而加快了排序速度.

Indeed in the second case, there can be duplicates which would be filtered, and thus speed up sorting.

推荐答案

我稍微扩展了您的代码,并希望这可以使您洞悉正在发生的事情:

I extended your code a bit and hope that this will give you insight into what is happening:

import numpy
import uuid
import random
import time

def sorter(x):
    t1 = time.time()
    for i in range(10000):
        sorted(x)
    return time.time() - t1

def pr(name, x):
    print('sorter {:<12s} {:<11} (length {:>4})'.format(
        name, '{:.8}'.format(sorter(x)), len(x)))

a2sizes = []
b2sizes = []

for x in range(1000):
    two = numpy.random.randint(1, 1000, 1000)
    a2 = list(two)
    b2 = set(two)
    a2sizes.append(len(a2))
    b2sizes.append(len(b2))

print 'average number of elements in a2', sum(a2sizes)/len(a2sizes)
n = sum(b2sizes)/len(b2sizes)
print 'average number of elements in b2', n

此打印输出:

average number of elements in a2 1000
average number of elements in b2 632

这是由于随机数范围内的冲突

This is because of the collision in the random number range

print
pr('a2', a2)
# making a list of set gives you already sorted elements
y = list(b2)
pr('y', y)
random.shuffle(y)
pr('shuffled y ', y)
pr('b2', b2)

给出输出:

sorter a2           2.492537    (length 1000)
sorter b2           0.25028086  (length  633)
sorter y            0.19689608  (length  633)
sorter shuffled y   1.4935901   (length  633)

b2会更快,因为逻辑上元素更少,但是如果您首先列出该集合的列表必须有一定原因,这会更快.如果重新整理列表是合乎逻辑的,那再慢一点,并且在补偿列表的长度时,重新整理的结果与a2的结果相当接近.

That b2 would be faster because there are less elements is logical, but that this is even faster if you first make a list of the set must have some reason. That it again is slower if you shuffle that list is again logical and the shuffled result is rather close to the result for a2 when compensated for the length of the lists.

因此,让我们尝试在列表中添加其他内容:

So lets try to put something else in the list:

b3 = set()
for x in range(1000):
    b3.add(uuid.uuid4())

print '\nuuid elements', len(b3)

a3 = list(b3)
pr('a3', a3)
random.shuffle(a3)
pr('shuffled a3', a3)
pr('b3', b3)

给出(如果少于1000个元素,我会感到非常惊讶):

gives (I would have been rather surprised to have less than 1000 elements):

uuid elements 1000
sorter a3           32.437758   (length 1000)
sorter shuffled a3  32.178433   (length 1000)
sorter b3           32.163802   (length 1000)

因此,它必须与集合中的数字有关:

So it must have something to do with having numbers in the set:

previous = -1
ordered = True
for popped in b2:
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

给您

Ordered True

set 代替了迭代,它具有pop()函数您可以尝试使用:

Instead of iterating , a set has a pop() function you can try and use:

pop()

pop()

从集合中删除并返回任意元素.如果集合为空,则引发KeyError.

Remove and return an arbitrary element from the set. Raises KeyError if the set is empty.

因此,任意可以从集合b2中检索元素,并查看是否有特殊之处:

So lets arbitrarily retrieve elements from the set b2 and see if there is something special:

previous = -1
ordered = True
while(b2):
    popped = b2.pop()
    if popped < previous:
        print 'popped', popped, previous
        ordered = False
    previous = popped

print '\nOrdered', ordered

给出相同的结果:

Ordered True

因此,任意检索一组数字的元素将按顺序检索这些数字,与这些数字的放置顺序无关. 由于迭代是列表制作者一次检索要添加到列表中的元素的方式,因此list(b2)的结果是一个有序列表,并且可以使用

So arbitrarily retrieving elements of set of number retrieves those numbers in order, independent off the ordering how these numbers were put in. As the iteration is how the list-making retrieves an element at a time to append to the list, the result of list(b2) is an ordered list and that gets sorted quite fast using the Timsort algorithm used in Python.

这篇关于在Python中对集合进行排序与​​对列表进行排序之间的时间差异很大的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆