在巨大的文件中排序 [英] Sorting in huge files

查看:75
本文介绍了在巨大的文件中排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

大家好


我有排序问题,但我对Python的体验相当有限(3天),所以我运行的是首先是列表。


我有一个15GB的大型数据库,包含10 ^ 8个

的条目,每个大约100个字节。我设计了一个相对简单的关键地图

我的数据库,我想订购关于

键的数据库。


我希望大多数键能重复几次,而这实际上是我最终想要弄清楚的一部分。 (松散地说,我想分组

所有具有相似键的数据条目。为此,我需要先对

键进行排序(具有_same_键的数据条目) ),然后弄清楚哪些

键是相似的。


对此有几点想法:

- 空间不会成为问题。我有一个Tb可用。

- 列表上的Python sort()应该足够好,如果我可以将

整个数据库加载到list / dict中
- 每个数据输入相对较小,所以我不应该使用指针

- 键可以是字符串,整数与通常的顺序,无论是什么是
$ b $方便的,对我来说没关系。选择可能需要使用什么sort()更喜欢



- 我也会对任何关键空间大小感到满意。所以我想数据库将会有100 *大小的




有什么意见吗?

我希望这种情况需要多长时间会采取?这听起来很奇怪,但是我实际上有12个不同的键映射,我想用

对每张地图进行排序,所以我要排序12次。


Paul

Hi all

I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").

A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.
- The Python sort() on list should be good enough, if I can load the
whole database into a list/dict
- each data entry is relatively small, so I shouldn''t use pointers
- Keys could be strings, integers with the usual order, whatever is
handy, it doesn''t matter to me. The choice will probably have to do
with what sort() prefers.
- Also I will be happy with any key space size. So I guess 100*size of
the database will do.

Any comments?
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.

Paul

推荐答案

Paul写道:
I对于大多数键来说,期望重复几次,而这实际上是我想要最终弄清楚的部分。 (松散地说,我想将所有具有相似键的数据条目分组。为此我需要先对
键进行排序(具有_same_键的数据条目),然后找出哪个
键是相似的。)
I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").




如果这确实是你的最终目标,你可能不想排序。考虑

代码如下:



If this is really your final goal, you may not want to sort. Consider
code like the following:

entries = [(''a' ',''4''),
......(''''',''7''),

......(''''', ''2''),

....(''b'',''7''),

....(''x' ',''4'')] count = {}
用于输入条目:
.... key = entry [0]

....计数。 setdefault(key,[])。append(entry)

.... for key in count:
.... print key,counts [key]

....

a [('''',''4''),('''',''2'')

x [(''x'',''7''),(''x'',''4'')]

b [(''b'',''7' ')]


我使用dict对象将所有条目与相同的键组合在一起

,无需任何排序。如果你有一个很好的定义

''相似'',你可以将所有''相似''键映射到

dict中的相同值。 />

如果你确实需要排序,Python 2.4提供了一个非常好的方法来按特定键排序:
导入运算符
entries = [('''',''4''),
....(''''',''7''),

....(''a'',''2''),

......(''''',''7''),

....(''x'',''4'')] entries.sort(key = operator.itemgetter(1))
条目
[(''a''' ,'''''),('''',''4''),(''''',''4''),(''x'',''7''), (''b'',''7'')]


这里,我已经按照每个元组中的第二项对条目进行了排序。如果你走这条路,你也应该看看itertools.groupby:

import itertools
entries = [(''a'',''4 ''),
....(''x'',''7''),

....('''',''2'' ),

....(''b'',''7''),

....(''x'',''4 '')] entries.sort(key = operator.itemgetter(1))
对于key,值在itertools.groupby(entries,operator.itemgetter(1))中:
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] counts = {}
for entry in entries: .... key = entry[0]
.... counts.setdefault(key, []).append(entry)
.... for key in counts: .... print key, counts[key]
....
a [(''a'', ''4''), (''a'', ''2'')]
x [(''x'', ''7''), (''x'', ''4'')]
b [(''b'', ''7'')]

I''ve grouped all entries with the same key together using a dict object
and without the need for any sorting. If you had a good definition of
''similar'', you could perhaps map all ''similar'' keys to the same value in
the dict.

If you really do need to sort, Python 2.4 provides a very nice way to
sort by a particular key:
import operator
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
entries [(''a'', ''2''), (''a'', ''4''), (''x'', ''4''), (''x'', ''7''), (''b'', ''7'')]

Here, I''ve sorted the entries by the second item in each tuple. If you
go this route, you should also look at itertools.groupby:
import itertools
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
for key, values in itertools.groupby(entries, operator.itemgetter(1)):



....打印键,列表(值)

....

2 [(''a'', ''2'')]

4 [('''',''4''),(''''',''4'')]

7 [(''x'',''7''),(''b'',''7'')]


groupby基本上是排序列表的分类,我认为你想到了...


史蒂夫


.... print key, list(values)
....
2 [(''a'', ''2'')]
4 [(''a'', ''4''), (''x'', ''4'')]
7 [(''x'', ''7''), (''b'', ''7'')]

The groupby basically does the sort of grouping of a sorted list that I
think you had in mind...

Steve


我真的这么做需要排序。它很复杂,我没有说明原因,但

它将有助于以后找到类似的密钥。对不起,我不能更精确,这与我的研究有关。


你对itertools和operator的另外两个建议更有用,

但是我主要想知道性能问题。


这对于10 ^ 8个元素在键中重复是否合理?我猜想我应该亲自去看看。

I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.

Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.

Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.


我真的需要排序。它很复杂,我没有说明原因,但

它将有助于以后找到类似的密钥。对不起,我不能更精确,这与我的研究有关。


你对itertools和operator的另外两个建议更有用,

但是我主要想知道性能问题。


这对于10 ^ 8个元素在键中重复是否合理?我想b $ b猜我应该亲自去看看。

I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.

Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.

Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.


这篇关于在巨大的文件中排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆