在巨大的文件中排序 [英] Sorting in huge files
问题描述
大家好
我有排序问题,但我对Python的体验相当有限(3天),所以我运行的是首先是列表。
我有一个15GB的大型数据库,包含10 ^ 8个
的条目,每个大约100个字节。我设计了一个相对简单的关键地图
我的数据库,我想订购关于
键的数据库。
我希望大多数键能重复几次,而这实际上是我最终想要弄清楚的一部分。 (松散地说,我想分组
所有具有相似键的数据条目。为此,我需要先对
键进行排序(具有_same_键的数据条目) ),然后弄清楚哪些
键是相似的。
对此有几点想法:
- 空间不会成为问题。我有一个Tb可用。
- 列表上的Python sort()应该足够好,如果我可以将
整个数据库加载到list / dict中>
- 每个数据输入相对较小,所以我不应该使用指针
- 键可以是字符串,整数与通常的顺序,无论是什么是
$ b $方便的,对我来说没关系。选择可能需要使用什么sort()更喜欢
。
- 我也会对任何关键空间大小感到满意。所以我想数据库将会有100 *大小的
。
有什么意见吗?
我希望这种情况需要多长时间会采取?这听起来很奇怪,但是我实际上有12个不同的键映射,我想用
对每张地图进行排序,所以我要排序12次。
Paul
Hi all
I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.
I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.
I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").
A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.
- The Python sort() on list should be good enough, if I can load the
whole database into a list/dict
- each data entry is relatively small, so I shouldn''t use pointers
- Keys could be strings, integers with the usual order, whatever is
handy, it doesn''t matter to me. The choice will probably have to do
with what sort() prefers.
- Also I will be happy with any key space size. So I guess 100*size of
the database will do.
Any comments?
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.
Paul
推荐答案
Paul写道:
I对于大多数键来说,期望重复几次,而这实际上是我想要最终弄清楚的部分。 (松散地说,我想将所有具有相似键的数据条目分组。为此我需要先对
键进行排序(具有_same_键的数据条目),然后找出哪个
键是相似的。)
I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").
如果这确实是你的最终目标,你可能不想排序。考虑
代码如下:
If this is really your final goal, you may not want to sort. Consider
code like the following:
entries = [(''a' ',''4''),
......(''''',''7''),
......(''''', ''2''),
....(''b'',''7''),
....(''x' ',''4'')] count = {}
用于输入条目:
.... key = entry [0]
....计数。 setdefault(key,[])。append(entry)
.... for key in count:
.... print key,counts [key]
....
a [('''',''4''),('''',''2'')
x [(''x'',''7''),(''x'',''4'')]
b [(''b'',''7' ')]
我使用dict对象将所有条目与相同的键组合在一起
,无需任何排序。如果你有一个很好的定义
''相似'',你可以将所有''相似''键映射到
dict中的相同值。 />
如果你确实需要排序,Python 2.4提供了一个非常好的方法来按特定键排序:
导入运算符
entries = [('''',''4''),
....(''''',''7''),
....(''a'',''2''),
......(''''',''7''),
....(''x'',''4'')] entries.sort(key = operator.itemgetter(1))
条目
[(''a''' ,'''''),('''',''4''),(''''',''4''),(''x'',''7''), (''b'',''7'')]
这里,我已经按照每个元组中的第二项对条目进行了排序。如果你走这条路,你也应该看看itertools.groupby:
import itertools
entries = [(''a'',''4 ''),
....(''x'',''7''),
....('''',''2'' ),
....(''b'',''7''),
....(''x'',''4 '')] entries.sort(key = operator.itemgetter(1))
对于key,值在itertools.groupby(entries,operator.itemgetter(1))中:
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] counts = {}
for entry in entries: .... key = entry[0]
.... counts.setdefault(key, []).append(entry)
.... for key in counts: .... print key, counts[key]
....
a [(''a'', ''4''), (''a'', ''2'')]
x [(''x'', ''7''), (''x'', ''4'')]
b [(''b'', ''7'')]
I''ve grouped all entries with the same key together using a dict object
and without the need for any sorting. If you had a good definition of
''similar'', you could perhaps map all ''similar'' keys to the same value in
the dict.
If you really do need to sort, Python 2.4 provides a very nice way to
sort by a particular key:
import operator
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
entries [(''a'', ''2''), (''a'', ''4''), (''x'', ''4''), (''x'', ''7''), (''b'', ''7'')]
Here, I''ve sorted the entries by the second item in each tuple. If you
go this route, you should also look at itertools.groupby:
import itertools
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
for key, values in itertools.groupby(entries, operator.itemgetter(1)):
....打印键,列表(值)
....
2 [(''a'', ''2'')]
4 [('''',''4''),(''''',''4'')]
7 [(''x'',''7''),(''b'',''7'')]
groupby基本上是排序列表的分类,我认为你想到了...
史蒂夫
.... print key, list(values)
....
2 [(''a'', ''2'')]
4 [(''a'', ''4''), (''x'', ''4'')]
7 [(''x'', ''7''), (''b'', ''7'')]
The groupby basically does the sort of grouping of a sorted list that I
think you had in mind...
Steve
我真的这么做需要排序。它很复杂,我没有说明原因,但
它将有助于以后找到类似的密钥。对不起,我不能更精确,这与我的研究有关。
你对itertools和operator的另外两个建议更有用,
但是我主要想知道性能问题。
这对于10 ^ 8个元素在键中重复是否合理?我猜想我应该亲自去看看。
I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.
Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.
Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.
我真的需要排序。它很复杂,我没有说明原因,但
它将有助于以后找到类似的密钥。对不起,我不能更精确,这与我的研究有关。
你对itertools和operator的另外两个建议更有用,
但是我主要想知道性能问题。
这对于10 ^ 8个元素在键中重复是否合理?我想b $ b猜我应该亲自去看看。
I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.
Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.
Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.
这篇关于在巨大的文件中排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!