a question on python dict


Problem Description


Python dict is a hash table, isn't it?

I know that hashtable has the concept of "bucket size" and "min bucket
count" stuff, and they should be configurable so I can set them to the
proper value when I know how many items I'm going to handle.

If these two values can't be set, the hashtable will give them default
values. When more and more items are added to the hashtable, it
increases its buckets, copies the old items into the new one, and
re-calculates the hash value of each item.

I think this will waste some time doing the increasing-copying thing.
If I know I'm going to handle about 20000 items, I can set the size of
the hashtable to 30000.

So, can I do this in Python? I can't figure it out, so I came here for
help.

Thanks!
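
For reference, the grow-and-rehash behaviour the question describes is
typical of a chained (bucket-based) hash table rather than of Python's
dict, as the answer below explains. A minimal sketch of that scheme
(a hypothetical illustration, not CPython code):

# A chained hash table that rehashes every item into new buckets when it
# grows -- the behaviour the question assumes. Not how CPython's dict works.
class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]
        self.size = 0

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 2 * len(self.buckets):  # load factor exceeded
            self._grow()

    def _grow(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                # every item is re-bucketed against the new table size
                self.buckets[hash(key) % len(self.buckets)].append((key, value))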

Answer

Tim Peters wrote:

[co*******@gmail.com]

> Python dict is a hash table, isn't it?

Yup.


> I know that hashtable has the concept of "bucket size" and "min bucket
> count" stuff,

Some implementations of hash tables do. Python's does not. Python's
uses what's called "open addressing" instead.
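
Open addressing keeps all entries in one flat array and, on a collision,
probes other slots instead of chaining into a bucket. A minimal sketch
with linear probing (illustrative only; CPython's actual probe sequence
in Objects/dictobject.c is more elaborate, and table growth is omitted
here):

# Minimal open-addressing table with linear probing. Illustrative only:
# CPython's dict uses a more sophisticated probe sequence and grows the
# table before it fills up (growth is omitted from this sketch).
class OpenAddressingTable:
    EMPTY = object()  # sentinel marking unused slots

    def __init__(self, n_slots=64):
        self.slots = [self.EMPTY] * n_slots  # one flat array, no buckets

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self.EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)  # collision: probe the next slot
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self.EMPTY:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)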


> and they should be configurable so I can set them to the proper value
> when I know how many items I'm going to handle.
>
> If these two values can't be set, the hashtable will give them default
> values. When more and more items are added to the hashtable, it
> increases its buckets, copies the old items into the new one, and
> re-calculates the hash value of each item.

That's several distinct claims, each of which is true of some hash
table implementations but not of others. Python's implementation has
no "buckets", so all that's simply irrelevant. Python's
implementation (not all do) saves the hash value of each item, so when
it does need to grow (or shrink) the space allocated to the table, it
does not need to recalculate the hash value of any item.

You should also note that copying a dict key or value (no matter of
what type) consists in its entirety of copying one machine address (a
4- or 8-byte pointer, depending on platform).
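
One way to see the cached-hash behaviour for yourself (in CPython) is a
key type that counts calls to __hash__: building a dict large enough to
force several internal resizes still hashes each key exactly once. A
small demonstration, with a hypothetical class name:

# CPython stores each key's hash value inside the table, so resizing
# re-positions entries using the stored hashes without calling __hash__
# again.
class CountingKey:
    calls = 0  # total __hash__ invocations across all keys

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        CountingKey.calls += 1
        return hash(self.n)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.n == other.n

d = {}
for i in range(20000):  # plenty of inserts to trigger several resizes
    d[CountingKey(i)] = None

print(CountingKey.calls)  # 20000 -- exactly one __hash__ call per insert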


> I think this will waste some time doing the increasing-copying thing.

It does take time to resize. "Waste" is a value judgment ;-)


> If I know I'm going to handle about 20000 items, I can set the size of
> the hashtable to 30000.
>
> So, can I do this in Python?

You can't. Unless you build up to 20000 items over and over (many
thousands of times), it's unlikely you'd be able to measure the time
difference even if you could.


Thank you very much for your explanation!

I made a mistake when I said the hash value should be recalculated each
time the dict resizes; what I wanted to say was that the position of
each item should be recalculated.

Maybe I should take a look at the source code of the Python dict some
day. I'll come here for help then.

Thanks again!



co*******@gmail.com wrote:

> Thank you very much for your explanation!
>
> I made a mistake when I said the hash value should be recalculated
> each time the dict resizes; what I wanted to say was that the position
> of each item should be recalculated.
>
> Maybe I should take a look at the source code of the Python dict some
> day. I'll come here for help then.



In your original message you said:

> I think this will waste some time doing the increasing-copying thing.
> If I know I'm going to handle about 20000 items, I can set the size of
> the hashtable to 30000.




Have you actually tried timing how much time is wasted doing this? 20000
items really isn't very many for a dictionary to handle.

I ran a quick test to see how long it takes to create a dictionary with
20000 items. Remember, as has been explained to you, the items are not
copied, nor are the hash values recalculated when the dictionary is
resized, so if your dictionary takes much longer to create than the test
below gives you (on your machine, that is, rather than mine), then the
problem lies elsewhere, e.g. a slow hash function, or the time taken to
create whatever you are storing in the dictionary.

Just creating the dictionary with no other Python overhead:

timeit.py -s "import random; r = [random.random() for i in range(20000)]"
"d = dict.fromkeys(r)"
100 loops, best of 3: 11.3 msec per loop

Creating the dictionary using a Python for loop:

timeit.py -s "import random; r = [random.random() for i in range(20000)]"
"d = {}" "for k in r: d[k] = None"
100 loops, best of 3: 11.7 msec per loop

Assigning to the elements in a pre-populated (and therefore the right
size) dictionary:

timeit.py -s "import random; r = [random.random() for i in range(20000)];
d = dict.fromkeys(r)" "for k in r: d[k] = None"
100 loops, best of 3: 10.1 msec per loop

How many times do you create this dictionary, that shaving 1.5 milliseconds
off 11 milliseconds matters to you?

FWIW, for a 2,000,000-element dictionary the times are 2.5 s and 1.48 s
respectively, so the situation does get progressively worse as dictionaries
get very large.
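
On a current Python, the same comparison can be reproduced with the
timeit module programmatically; a sketch (absolute numbers will differ
by machine and Python version from the figures quoted above):

# Reproduce the three measurements above with the timeit module.
import timeit

setup = "import random; r = [random.random() for i in range(20000)]"

# dict.fromkeys in one call vs. a Python-level insertion loop:
for label, stmt in [
    ("dict.fromkeys", "d = dict.fromkeys(r)"),
    ("for loop, empty dict", "d = {}\nfor k in r: d[k] = None"),
]:
    t = min(timeit.repeat(stmt, setup=setup, number=100, repeat=3))
    print(f"{label}: {t / 100 * 1000:.2f} msec per loop")

# Assigning into a pre-populated (already right-sized) dict:
setup2 = setup + "; d = dict.fromkeys(r)"
t = min(timeit.repeat("for k in r: d[k] = None", setup=setup2,
                      number=100, repeat=3))
print(f"pre-populated dict: {t / 100 * 1000:.2f} msec per loop")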

