实施外部合并排序 [英] Implementing an external merge sort

查看:93
本文介绍了实施外部合并排序的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试学习Python,并且正在使用带有int的输入文件进行外部合并排序.我正在使用heapq.merge,我的代码几乎可以正常工作,但是它似乎将我的行作为字符串而不是整数进行排序.如果我尝试转换为整数,则写行将不接受数据.谁能帮我找到替代方案?另外,我认为这将使我能够排序一个大于内存的文件(如果有足够的磁盘空间)是正确的

I'm trying to learn Python and am working on making an external merge sort using an input file with ints. I'm using heapq.merge, and my code almost works, but it seems to be sorting my lines as strings instead of ints. If I try to convert to ints, writelines won't accept the data. Can anyone help me find an alternative? Additionally, am I correct in thinking this will allow me to sort a file bigger than memory (given adequate disk space)

import itertools
from itertools import islice
import tempfile
import heapq

#converts heapq.merge to ints
#def merge(*temp_files):
#    return heapq.merge(*[itertools.imap(int, s) for s in temp_files])

with open("path\to\input", "r") as f:
    temp_file = tempfile.TemporaryFile()
    temp_files = []
    elements = []
    while True:
        elements = list(islice(f, 1000))
        if not elements:
           break
        elements.sort(key=int)
        temp_files.append(elements)
        temp_file.writelines(elements)
        temp_file.flush()
        temp_file.seek(0)
        with open("path\to\output", "w") as output_file:
            output_file.writelines(heapq.merge(*temp_files))

推荐答案

您的代码对我来说意义不大(temp_files.append(elements)?是否在循环中合并?),但是这是一种以数字方式合并文件的方式:

Your code doesn't make much sense to me (temp_files.append(elements)? Merging inside the loop?), but here's a way to merge files sorting numerically:

import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    out.writelines(map('{}\n'.format,
                       heapq.merge(*(map(int, f)
                                     for f in files))))

首先,map(int, ...)将每个文件的行转换为整数.然后将这些与heapq.merge合并.然后map('{}\n'.format用换行符将每个整数重新转换成字符串.然后writelines写入这些行.换句话说,您已经很亲密,只需要在编写它们之前将int转换回字符串即可.

First the map(int, ...) turns each file's lines into ints. Then those get merged with heapq.merge. Then map('{}\n'.format turns each of the integers back into a string, with newline. Then writelines writes those lines. In other words, you were already close, just had to convert the ints back to strings before writing them.

一种不同的编写方式(有些人可能会更清楚):

A different way to write it (might be clearer for some):

import heapq
files = open('a.txt'), open('b.txt')
with open('merged.txt', 'w') as out:
    int_streams = (map(int, f) for f in files)
    int_stream = heapq.merge(*int_streams)
    line_stream = map('{}\n'.format, int_stream)
    out.writelines(line_stream)

在任何情况下,如果您使用的是Python 2,请务必使用itertools.imap,否则它将立即将整个文件读入内存.在Python 3中,您可以只使用普通的map.

And in any case, do use itertools.imap if you're using Python 2 as otherwise it'll read the whole files into memory at once. In Python 3, you can just use the normal map.

是的,如果操作正确,这将使您可以用很少的内存对巨大的文件进行排序.

And yes, if you do it right, this will allow you to sort gigantic files with very little memory.

这篇关于实施外部合并排序的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆