Orders of magnitude


Problem description



x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]

# list version
c = []
for i in x:
    if i not in c:
        c.append(i)

t1.timeit(1)
2.0331145438875637

# dict version
c = {}
for i in x:
    if i not in c:
        c[i] = None

t2.timeit(1)
0.0067952770534134288

# bsddb version
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None

t3.timeit(1)
0.18430750276922936

Wow. Dicts are *fast*.
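(The post does not show how the t1, t2 and t3 timers were built. A plausible reconstruction with timeit.Timer, in Python 2 to match the xrange/bsddb idioms above, might look like this; the statement strings are illustrative, not the poster's actual harness.)

import timeit

setup = "x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]"

# list-based dedup
t1 = timeit.Timer(
    "c = []\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c.append(i)",
    setup)

# dict-based dedup
t2 = timeit.Timer(
    "c = {}\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c[i] = None",
    setup)

# bsddb btree-based dedup (btopen(None) builds an in-memory btree)
t3 = timeit.Timer(
    "c = bsddb.btopen(None)\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c[i] = None",
    "import bsddb\n" + setup)

print t1.timeit(1), t2.timeit(1), t3.timeit(1)

timeit(1) runs each statement a single time, so the figures quoted above are single wall-clock runs rather than averages.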

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?
Robert Brewer
MIS
Amor Ministries
fu******@amor.org

Solutions

"Robert Brewer" <fu******@amor.org> writes:

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?



Sort, then remove dups.


Sort, then remove dups.



A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.

I would suggest a multi-phase method. Break the sequence up into
100k-element blocks and remove duplicates within each block using
dictionaries. Sort each block individually and write each one to a file.

Merge the 100 files using a heapq.
# the heap is structured like...
[("next string in file a", file_handle_a),
 ("next string in file b", file_handle_b), ...]

If you keep using heappop and heappush from heapq, along with a single
string held in memory, you can remove duplicates quite easily, and you
certainly won't run out of memory.
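(A rough sketch of the multi-phase approach described above; this is not code from the thread. It assumes one key per line in plain-text chunk files, and the chunk size, directory and function names are illustrative.)

import heapq
import os

CHUNK_SIZE = 100000  # roughly 100k keys per in-memory block

def write_sorted_chunks(keys, tmpdir):
    """Dedup each block with a set, sort it, and spill it to its own file."""
    paths = []
    block = set()

    def flush():
        if block:
            path = os.path.join(tmpdir, "chunk_%d.txt" % len(paths))
            with open(path, "w") as f:
                f.writelines(k + "\n" for k in sorted(block))
            paths.append(path)
            block.clear()

    for key in keys:
        block.add(key)
        if len(block) >= CHUNK_SIZE:
            flush()
    flush()
    return paths

def merge_unique(paths, out_path):
    """Merge the sorted chunk files with a heap of (next line, file index)
    pairs, holding only the last emitted string to drop duplicates."""
    files = [open(p) for p in paths]
    heap = []
    for idx, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, idx))
    last = None
    with open(out_path, "w") as out:
        while heap:
            line, idx = heapq.heappop(heap)
            if line != last:          # emit each key only once
                out.write(line)
                last = line
            nxt = files[idx].readline()
            if nxt:
                heapq.heappush(heap, (nxt, idx))
    for f in files:
        f.close()

The heap entries carry a file index rather than the file handle itself so that two equal lines never force a comparison of file objects; the working set is one block of keys during the split phase and one line per chunk file during the merge.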

- Josiah




It all boils down to how much space your keys take.
When you look for dupes, you only need to hold the keys in memory, not the
data (it'll be a lot faster this way).

I'd say create a btree-backed bsddb to hold all your keys. It should take
about 20 minutes to fill it. Then scan it in sorted key order, and
duplicates will appear next to each other.
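(For reference, a minimal version of that idea in Python 2, where bsddb is in the standard library; `records`, `make_key` and `handle` are hypothetical placeholders for however the dataset is read and consumed, and the filename is illustrative.)

import bsddb

db = bsddb.btopen("keys.db", "c")    # disk-backed btree; keys stay sorted
for record in records:               # `records`: whatever yields the 10M rows
    key = make_key(record)           # hypothetical key-extraction helper
    if key not in db:
        db[key] = ""                 # value is irrelevant; only the key matters

for key in db.keys():                # a btree hands keys back in sorted order
    handle(key)                      # hypothetical consumer of the deduped keys

db.close()

Passing a real filename instead of None (as in the timing snippet above) keeps the btree on disk, so only BerkeleyDB's cache lives in RAM; db.first() and db.next() can be used instead of db.keys() if even the sorted key list is too large to materialize.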

