在巨大的文件中排序 [英] Sorting in huge files

查看：75 发布时间：2019/6/5 13:42:21 python

本文介绍了在巨大的文件中排序的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

大家好

我有排序问题，但我对Python的体验相当有限（3天），所以我运行的是首先是列表。

我有一个15GB的大型数据库，包含10 ^ 8个

的条目，每个大约100个字节。我设计了一个相对简单的关键地图

我的数据库，我想订购关于

键的数据库。

我希望大多数键能重复几次，而这实际上是我最终想要弄清楚的一部分。（松散地说，我想分组

所有具有相似键的数据条目。为此，我需要先对

键进行排序（具有_same_键的数据条目）），然后弄清楚哪些

键是相似的。

对此有几点想法：

- 空间不会成为问题。我有一个Tb可用。

- 列表上的Python sort（）应该足够好，如果我可以将

整个数据库加载到list / dict中
- 每个数据输入相对较小，所以我不应该使用指针

- 键可以是字符串，整数与通常的顺序，无论是什么是
$ b $方便的，对我来说没关系。选择可能需要使用什么sort（）更喜欢

。

- 我也会对任何关键空间大小感到满意。所以我想数据库将会有100 *大小的

。

有什么意见吗？

我希望这种情况需要多长时间会采取？这听起来很奇怪，但是我实际上有12个不同的键映射，我想用

对每张地图进行排序，所以我要排序12次。

Paul

Hi all

I have a sorting problem, but my experience with Python is rather
limited (3 days), so I am running this by the list first.

I have a large database of 15GB, consisting of 10^8 entries of
approximately 100 bytes each. I devised a relatively simple key map on
my database, and I would like to order the database with respect to the
key.

I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").

A few thoughts on this:
- Space is not going to be an issue. I have a Tb available.
- The Python sort() on list should be good enough, if I can load the
whole database into a list/dict
- each data entry is relatively small, so I shouldn''t use pointers
- Keys could be strings, integers with the usual order, whatever is
handy, it doesn''t matter to me. The choice will probably have to do
with what sort() prefers.
- Also I will be happy with any key space size. So I guess 100*size of
the database will do.

Any comments?
How long should I hope this sort will take? It will sound weird, but I
actually have 12 different key maps and I want to sort this with
respect to each map, so I will have to sort 12 times.

Paul

推荐答案

Paul写道：

I对于大多数键来说，期望重复几次，而这实际上是我想要最终弄清楚的部分。（松散地说，我想将所有具有相似键的数据条目分组。为此我需要先对
键进行排序（具有_same_键的数据条目），然后找出哪个
键是相似的。）

I expect a few repeats for most of the keys, and that s actually part
of what I want to figure out in the end. (Said loosely, I want to group
all the data entries having "similar" keys. For this I need to sort the
keys first (data entries having _same_ key), and then figure out which
keys are "similar").

如果这确实是你的最终目标，你可能不想排序。考虑

代码如下：

If this is really your final goal, you may not want to sort. Consider
code like the following:

entries = [（''a' '，''4''），
......（'''''，''7''），

......（'''''， ''2''），

....（''b''，''7''），

....（''x' '，''4''）] count = {}
用于输入条目：
.... key = entry [0]

....计数。 setdefault（key，[]）。append（entry）

.... for key in count：
.... print key，counts [key]

....

a [（''''，''4''），（''''，''2''）

x [（''x''，''7''），（''x''，''4''）]

b [（''b''，''7' '）]

我使用dict对象将所有条目与相同的键组合在一起

，无需任何排序。如果你有一个很好的定义

''相似''，你可以将所有''相似''键映射到

dict中的相同值。 />

如果你确实需要排序，Python 2.4提供了一个非常好的方法来按特定键排序：
导入运算符
entries = [（''''，''4''），
....（'''''，''7''），

....（''a''，''2''），

......（'''''，''7''），

....（''x''，''4''）] entries.sort（key = operator.itemgetter（1））
条目
[（''a''' ，'''''），（''''，''4''），（'''''，''4''），（''x''，''7''），（''b''，''7''）]

这里，我已经按照每个元组中的第二项对条目进行了排序。如果你走这条路，你也应该看看itertools.groupby：

import itertools
entries = [（''a''，''4 ''），
....（''x''，''7''），

....（''''，''2'' ），

....（''b''，''7''），

....（''x''，''4 ''）] entries.sort（key = operator.itemgetter（1））
对于key，值在itertools.groupby（entries，operator.itemgetter（1））中：

entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] counts = {}
for entry in entries: .... key = entry[0]
.... counts.setdefault(key, []).append(entry)
.... for key in counts: .... print key, counts[key]
....
a [(''a'', ''4''), (''a'', ''2'')]
x [(''x'', ''7''), (''x'', ''4'')]
b [(''b'', ''7'')]

I''ve grouped all entries with the same key together using a dict object
and without the need for any sorting. If you had a good definition of
''similar'', you could perhaps map all ''similar'' keys to the same value in
the dict.

If you really do need to sort, Python 2.4 provides a very nice way to
sort by a particular key:
import operator
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
entries [(''a'', ''2''), (''a'', ''4''), (''x'', ''4''), (''x'', ''7''), (''b'', ''7'')]

Here, I''ve sorted the entries by the second item in each tuple. If you
go this route, you should also look at itertools.groupby:
import itertools
entries = [(''a'', ''4''), .... (''x'', ''7''),
.... (''a'', ''2''),
.... (''b'', ''7''),
.... (''x'', ''4'')] entries.sort(key=operator.itemgetter(1))
for key, values in itertools.groupby(entries, operator.itemgetter(1)):

....打印键，列表（值）

....

2 [（''a''， ''2''）]

4 [（''''，''4''），（'''''，''4''）]

7 [（''x''，''7''），（''b''，''7''）]

groupby基本上是排序列表的分类，我认为你想到了...

史蒂夫

.... print key, list(values)
....
2 [(''a'', ''2'')]
4 [(''a'', ''4''), (''x'', ''4'')]
7 [(''x'', ''7''), (''b'', ''7'')]

The groupby basically does the sort of grouping of a sorted list that I
think you had in mind...

Steve

我真的这么做需要排序。它很复杂，我没有说明原因，但

它将有助于以后找到类似的密钥。对不起，我不能更精确，这与我的研究有关。

你对itertools和operator的另外两个建议更有用，

但是我主要想知道性能问题。

这对于10 ^ 8个元素在键中重复是否合理？我猜想我应该亲自去看看。

I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.

Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.

Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.

我真的需要排序。它很复杂，我没有说明原因，但

它将有助于以后找到类似的密钥。对不起，我不能更精确，这与我的研究有关。

你对itertools和operator的另外两个建议更有用，

但是我主要想知道性能问题。

这对于10 ^ 8个元素在键中重复是否合理？我想b $ b猜我应该亲自去看看。

I really do need to sort. It is complicated and I haven''t said why, but
it will help in finding similar keys later on. Sorry I can''t be more
precise, this has to do with my research.

Your two other suggestions with itertools and operator are more useful,
but I was mostly wondering about performance issue.

Is this reasonnable to do on 10^8 elements with repeats in the keys? I
guess I should just try and see for myself.

这篇关于在巨大的文件中排序的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

在巨大的文件中排序 [英] Sorting in huge files

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录关闭

在巨大的文件中排序 [英] Sorting in huge files

问题描述

推荐答案

相关文章

Python最新文章

热门教程

热门工具

登录 关闭

登录关闭