Optimizing parsing of massive python dictionary, multi-threading

Question

Let's take a small example python dictionary, where the values are lists of integers.

example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821],
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846], 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]}

Let's say I need to parse the values of the lists, which I've implemented in the following function:

def manipulate_values(input_list):
    return_values = []
    for i in input_list:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

Now, I can easily parse the values of this dictionary as follows:

for key, value in example_dict1.items():
    example_dict1[key] = manipulate_values(value)

resulting in the following:

example_dict1 = {'key1': [134676, 887, 717396, 232311, 786756, 427703, 120396, 254003, 170556, 674028], 
     'key2': [568503, 837212, 386871, 22188, 77828, 36851, 97331, 41196, 550551, 715703], 
     'key3': [343383, 271428, 220887, 226563, 480236, 181463, 556503, 537276, 278771, 319212]}

That works very well for small dictionaries.

My problem is, I have a massive dictionary with millions of keys and long lists. If I were to apply the above approach, the algorithm would be prohibitively slow.

How can I optimize the above?

(1) Multithreading---are there more efficient options available for multithreading this for loop over the dictionary besides the traditional threading module?

(2) Would a better data structure be appropriate?

I'm asking this question as I'm quite stuck on how best to proceed in this case. I don't see a better data structure than a dictionary, but the for loops across the dictionary (and then across the value lists) are quite slow. There may be something here which has been designed to be faster.

As you can imagine, this is somewhat of a toy example---the function in question is a bit more complicated than x**2-13.

I'm more interested in how to best work with a dictionary with millions of keys and long lists of values.
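
To make (1) concrete, below is a minimal sketch of the kind of approach I could imagine, using the standard multiprocessing module rather than threading (assuming manipulate_values is CPU-bound and picklable; plain threads would be limited by the GIL here):

from multiprocessing import Pool

if __name__ == "__main__":
    with Pool() as pool:
        keys = list(example_dict1.keys())
        # pool.map preserves order, so the results line up with the keys
        results = pool.map(manipulate_values, example_dict1.values())
        example_dict1 = dict(zip(keys, results))

Is something along these lines a sensible direction, or is there a better option?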

Answer

If you can store everything inside a numpy array, processing will be faster. I increased the size of each list by a factor of 500,000 to test scalability, and these are my results:

from timeit import timeit
import numpy as np

n = 500000
example_dict1 = {'key1':[367, 30, 847, 482, 887, 654, 347, 504, 413, 821]*n,
    'key2':[754, 915, 622, 149, 279, 192, 312, 203, 742, 846]*n, 
    'key3':[586, 521, 470, 476, 693, 426, 746, 733, 528, 565]*n}

def manipulate_values(input_list):
    return_values = []
    for i in input_list:
        new_value = i ** 2 - 13
        return_values.append(new_value)
    return return_values

Using your approach:

for_with_dictionary = timeit("""
for key, value in example_dict1.items():
    example_dict1[key] = manipulate_values(value)
""", "from __main__ import example_dict1,manipulate_values ",number=5)

print(for_with_dictionary)

>>> 33.2095841

Using numpy:

numpy_broadcasting = timeit("""
array = np.array(list(example_dict1.values()))
array = array ** 2 - 13
""", "from __main__ import example_dict1, np",number=5)
print(numpy_broadcasting)

>>> 5.039885

That is a significant speedup, at least a factor of 6.
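
Note that the timed snippet above only builds the array; if the results have to end up back in a dictionary, a minimal sketch of mapping the rows back onto the keys (assuming all value lists have the same length, as in the example) would be:

keys = list(example_dict1.keys())
array = np.array(list(example_dict1.values()))  # shape: (number of keys, list length)
result = array ** 2 - 13
# each row corresponds to one key; use row.tolist() if plain lists are needed downstream
example_dict1 = {key: row for key, row in zip(keys, result)}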
