Optimizing find and replace over large files in Python


Question

I am a complete beginner to Python or any serious programming language for that matter. I finally got a prototype code to work but I think it will be too slow.

My goal is to find and replace some Chinese characters across all files (they are csv) in a directory with integers as per a csv file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.

I will be looping across about 25 files that are in the neighborhood of 500mb each (and about a million lines). The dictionary I will be using will have about 300 elements and I will be changing unicode (Chinese character) to integers. I tried with a test run and, assuming everything scales up linearly (?), it looks like it would take about a week for this to run.

Thanks in advance. Here is my code (don't laugh!):

# -*- coding: utf-8 -*-

import os, codecs

dir = "C:/Users/Roy/Desktop/test/"

Dict = {'hello' : 'good', 'world' : 'bad'}

for dirs, subdirs, files in os.walk(dir):
    for file in files:
        inFile = codecs.open(dir + file, "r", "utf-8")
        inFileStr = inFile.read()
        inFile.close()
        inFile = codecs.open(dir + file, "w", "utf-8")
        for key in Dict:
            inFileStr = inFileStr.replace(key, Dict[key])
        inFile.write(inFileStr)
        inFile.close()

Answer

In your current code, you're reading the whole file into memory at once. Since they're 500Mb files, that means 500Mb strings. And then you do repeated replacements of them, which means Python has to create a new 500Mb string with the first replacement, then destroy the first string, then create a second 500Mb string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.

If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:

for file in files:
    fname = os.path.join(dir, file)
    inFile = codecs.open(fname, "r", "utf-8")
    outFile = codecs.open(fname + ".new", "w", "utf-8")
    for line in inFile:
        newline = do_replacements_on(line)
        outFile.write(newline)
    inFile.close()
    outFile.close()
    os.rename(fname + ".new", fname)

If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.

Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:

>>> u"\xff and \ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
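If the replacements are kept as an ordinary {character: string} dict, the integer-keyed table that translate() expects can be built with ord(). A small sketch, mirroring the sample data above:

```python
# Build the codepoint-keyed table translate() wants from an
# ordinary {character: replacement} dict (sample data only).
char_map = {u"\xff": u"255", u"\ubd23": u"something else"}
table = dict((ord(k), v) for k, v in char_map.items())
result = u"\xff and \ubd23".translate(table)
```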

For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the first argument, which will then be called for each match:

>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'saussages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
...     return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, saussages and vikings'
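Combined with the line-by-line loop earlier, the do_replacements_on helper could then be a single precompiled substitution. The mapping below is illustrative; in practice it would hold the ~300 character-to-integer pairs loaded from the csv:

```python
import re

# Illustrative mapping; the real one would be loaded from the csv.
mapping = {u"spam": u"spam, ham, spam and eggs", u"eggs": u"saussages"}

# Longest keys first, so overlapping keys prefer the longer match.
pattern = re.compile(u"|".join(
    re.escape(k) for k in sorted(mapping, key=len, reverse=True)))

def do_replacements_on(line):
    # One pass over the line, replacing every match from the dict.
    return pattern.sub(lambda m: mapping[m.group(0)], line)
```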

