获得列表联合的最快方法 - Python [英] Fastest way to get union of lists - Python

查看:39
本文介绍了获得列表联合的最快方法 - Python的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有一个 C++ 比较来从列表列表中获取列表的并集:找到集合并集的最快方法

还有其他几个与 python 相关的问题,但没有一个建议联合列表的最快方法:

从答案中,我发现至少有两种方法可以做到:

<预><代码>>>>从 itertools 导入链>>>x = [[1,2,3], [3,4,5], [1,7,8]]>>>列表(设置().联合(* x))[1, 2, 3, 4, 5, 7, 8]>>>列表(设置(链(* x)))[1, 2, 3, 4, 5, 7, 8]

请注意,我之后将集合投射到列表中,因为我需要修复列表的顺序以进行进一步处理.

经过一些比较,似乎list(set(chain(*x)))更稳定,花费的时间更少:

来自 itertools 导入链导入时间随机导入# 试运行.x = [[随机选择(范围(10000))对于 i in range(10)] for j in range(10)]列表(设置().联合(* x))列表(设置(链(* x)))y_time = 0z_time = 0对于 _ 范围(1000):x = [[随机选择(范围(10000))对于 i in range(10)] for j in range(10)]开始 = time.time()y = list(set().union(*x))y_time += time.time() - 开始#print 'list(set().union(*x)):\t', y_time开始 = time.time()z = 列表(设置(链(* x)))z_time += time.time() - 开始#print 'list(set(chain(*x))):\t', z_time断言排序(y)==排序(z)#打印打印 y_time/1000.打印 z_time/1000.

[输出]:

1.39586925507e-051.09834671021e-05

取出铸造集的变量列表:

y_time = 0z_time = 0对于 _ 范围(1000):x = [[随机选择(范围(10000))对于 i in range(10)] for j in range(10)]开始 = time.time()y = set().union(*x)y_time += time.time() - 开始开始 = time.time()z = 设置(链(* x))z_time += time.time() - 开始断言排序(y)==排序(z)打印 y_time/1000.打印 z_time/1000.

[输出]:

1.22241973877e-051.02684497833e-05

这是我尝试打印中间时间时的完整输出(没有列表转换):http://pastebin.com/raw/y3i6dXZ8

为什么 list(set(chain(*x)))list(set().union(*x)) 花费的时间少?

是否有另一种方法可以实现相同的列表联合?使用 numpypandassframe 之类的? 替代方案是否更快?

解决方案

什么最快取决于 x 的性质 -- 无论是长列表还是短列表,子列表很多还是很少子列表,子列表的长短,重复多还是少.

以下是一些比较一些替代方案的时间结果.有很多可能性,这绝不是一个完整的分析,但也许这将为您提供一个研究用例的框架.

func |× |时间unique_concatenate |many_uniques |0.863empty_set_union |many_uniques |1.191short_set_union_rest |many_uniques |1.192long_set_union_rest |many_uniques |1.194set_chain |many_uniques |1.224功能 |× |时间long_set_union_rest |many_duplicates |0.958short_set_union_rest |many_duplicates |0.969empty_set_union |many_duplicates |0.971set_chain |many_duplicates |1.128unique_concatenate |many_duplicates |2.411功能 |× |时间empty_set_union |many_small_lists |1.023long_set_union_rest |many_small_lists |1.028set_chain |many_small_lists |1.032short_set_union_rest |many_small_lists |1.036unique_concatenate |many_small_lists |1.351功能 |× |时间long_set_union_rest |几个大列表|0.791empty_set_union |几个大列表|0.813unique_concatenate |几个大列表|0.814set_chain |几个大列表|0.829short_set_union_rest |几个大列表|0.849

请务必在您自己的机器上运行 timeit 基准测试,因为结果可能会有所不同.

<小时>

from __future__ import print_function随机导入导入时间从 itertools 导入链将 numpy 导入为 npdef unique_concatenate(x):返回 np.unique(np.concatenate(x))def short_set_union_rest(x):# 这里假设 x[0] 是 x 中最短的列表返回列表(set(x[0]).union(*x[1:]))def long_set_union_rest(x):# 这里假设 x[-1] 是 x 中最长的列表返回列表(set(x[-1]).union(*x[1:]))def empty_set_union(x):返回列表(set().union(*x))def set_chain(x):返回列表(设置(链(* x)))big_range = 列表(范围(10**7))small_range = 列表(范围(10**5))many_uniques = [[random.choice(big_range) for i in range(j)]对于范围内的 j (10, 10000, 10)]many_duplicates = [[random.choice(small_range) for i in range(j)]对于范围内的 j (10, 10000, 10)]many_small_lists = [[random.choice(big_range) for i in range(10)]对于范围内的 j (10, 10000, 10)]很少的大列表 = [[random.choice(big_range) for i in range(1000)]对于范围内的 j (10, 100, 10)]如果 __name__=='__main__':对于 x, n 在 [('many_uniques', 1), ('many_duplicates', 4),('many_small_lists', 800), ('few_large_lists', 800)]:时间 = dict()对于 func 在 ['unique_concatenate', 'short_set_union_rest', 'long_set_union_rest','empty_set_union', 'set_chain']:计时[func, x] = timeit.timeit('{}({})'.format(func, x), number=n,setup='from __main__ import {}, {}'.format(func, x))打印('{:20} | {:20} | {}'.format('func', 'x', 'time'))对于 key, t in sorted(timing.items(), key=lambda item: item[1]):功能,x = 键打印('{:20} | {:20} | {:.3f}'.format(func, x, t))打印(结束='\n')

There's a C++ comparison to get union of lists from lists of lists: The fastest way to find union of sets

And there's several other python related questions but none suggest the fastest way to unionize the lists:

From the answers, I've gathered that there are at least 2 ways to do it:

>>> from itertools import chain
>>> x = [[1,2,3], [3,4,5], [1,7,8]]
>>> list(set().union(*x))
[1, 2, 3, 4, 5, 7, 8]
>>> list(set(chain(*x)))
[1, 2, 3, 4, 5, 7, 8]

Note that I'm casting the set to list afterwards because I need the order of the list to be fixed for further processing.

After some comparison, it seems like list(set(chain(*x))) is more stable and takes less time:

from itertools import chain
import time
import random

# Dry run.
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)]
list(set().union(*x))
list(set(chain(*x)))

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]
    start = time.time()
    y = list(set().union(*x))
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time
    start = time.time()
    z = list(set(chain(*x)))
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time
    assert sorted(y) == sorted(z)
    #print 

print y_time / 1000.
print z_time / 1000. 

[out]:

1.39586925507e-05
1.09834671021e-05

Taking out the variable of casting sets to list:

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]

    start = time.time()
    y = set().union(*x)
    y_time += time.time() - start 

    start = time.time()
    z = set(chain(*x))
    z_time += time.time() - start 

    assert sorted(y) == sorted(z)

print y_time / 1000.
print z_time / 1000. 

[out]:

1.22241973877e-05
1.02684497833e-05

Here's the full output when I try to print the intermediate timings (without list casting): http://pastebin.com/raw/y3i6dXZ8

Why is it that list(set(chain(*x))) takes less time than list(set().union(*x))?

Is there another way of achieving the same union of lists? Using numpy or pandas or sframe or something? Is the alternative faster?

解决方案

What's fastest depends on the nature of x -- whether it is a long list or a short list, with many sublists or few sublists, whether the sublists are long or short, and whether there are many duplicates or few duplicates.

Here are some timeit results comparing some alternatives. There are so many possibilities that this is by no means a complete analysis, but perhaps this will give you a framework for studying your use case.

func                 | x                    | time
unique_concatenate   | many_uniques         | 0.863
empty_set_union      | many_uniques         | 1.191
short_set_union_rest | many_uniques         | 1.192
long_set_union_rest  | many_uniques         | 1.194
set_chain            | many_uniques         | 1.224

func                 | x                    | time
long_set_union_rest  | many_duplicates      | 0.958
short_set_union_rest | many_duplicates      | 0.969
empty_set_union      | many_duplicates      | 0.971
set_chain            | many_duplicates      | 1.128
unique_concatenate   | many_duplicates      | 2.411

func                 | x                    | time
empty_set_union      | many_small_lists     | 1.023
long_set_union_rest  | many_small_lists     | 1.028
set_chain            | many_small_lists     | 1.032
short_set_union_rest | many_small_lists     | 1.036
unique_concatenate   | many_small_lists     | 1.351

func                 | x                    | time
long_set_union_rest  | few_large_lists      | 0.791
empty_set_union      | few_large_lists      | 0.813
unique_concatenate   | few_large_lists      | 0.814
set_chain            | few_large_lists      | 0.829
short_set_union_rest | few_large_lists      | 0.849

Be sure to run the timeit benchmarks on your own machine since results may vary.


from __future__ import print_function
import random
import timeit
from itertools import chain
import numpy as np

def unique_concatenate(x):
    return np.unique(np.concatenate(x))

def short_set_union_rest(x):
    # This assumes x[0] is the shortest list in x
    return list(set(x[0]).union(*x[1:]))

def long_set_union_rest(x):
    # This assumes x[-1] is the longest list in x
    return list(set(x[-1]).union(*x[1:]))

def empty_set_union(x):
    return list(set().union(*x))

def set_chain(x):
    return list(set(chain(*x)))


big_range = list(range(10**7))
small_range = list(range(10**5))
many_uniques = [[random.choice(big_range) for i in range(j)] 
                for j in range(10, 10000, 10)]
many_duplicates = [[random.choice(small_range) for i in range(j)] 
              for j in range(10, 10000, 10)]
many_small_lists = [[random.choice(big_range) for i in range(10)] 
                    for j in range(10, 10000, 10)]
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
                    for j in range(10, 100, 10)]

if __name__=='__main__':
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
                 ('many_small_lists', 800), ('few_large_lists', 800)]:
        timing = dict()
        for func in [
                'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest',
                'empty_set_union', 'set_chain']:
            timing[func, x] = timeit.timeit(
                '{}({})'.format(func, x), number=n,
                setup='from __main__ import {}, {}'.format(func, x))
        print('{:20} | {:20} | {}'.format('func', 'x', 'time'))
        for key, t in sorted(timing.items(), key=lambda item: item[1]):
            func, x = key
            print('{:20} | {:20} | {:.3f}'.format(func, x, t))
        print(end='\n')

这篇关于获得列表联合的最快方法 - Python的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆