Optimizing find and replace over large files in Python


Problem Description


I am a complete beginner to Python or any serious programming language for that matter. I finally got a prototype code to work but I think it will be too slow.

My goal is to find and replace some Chinese characters across all files (they are csv) in a directory with integers as per a csv file I have. The files are nicely numbered by year-month, for example 2000-01.csv, and will be the only files in that directory.

I will be looping across about 25 files that are in the neighborhood of 500 MB each (and about a million lines). The dictionary I will be using will have about 300 elements, and I will be changing Unicode (Chinese characters) to integers. I tried a test run and, assuming everything scales up linearly (?), it looks like it would take about a week to run.

Thanks in advance. Here is my code (don't laugh!):

# -*- coding: utf-8 -*-

import os, codecs

dir = "C:/Users/Roy/Desktop/test/"

Dict = {'hello' : 'good', 'world' : 'bad'}

for dirs, subdirs, files in os.walk(dir):
    for file in files:
        inFile = codecs.open(dir + file, "r", "utf-8")
        inFileStr = inFile.read()
        inFile.close()
        inFile = codecs.open(dir + file, "w", "utf-8")
        for key in Dict:
            inFileStr = inFileStr.replace(key, Dict[key])
        inFile.write(inFileStr)
        inFile.close()

Solution

In your current code, you're reading the whole file into memory at once. Since they're 500 MB files, that means 500 MB strings. Then you do repeated replacements on them, which means Python has to create a new 500 MB string for the first replacement, destroy the original string, create another 500 MB string for the second replacement, destroy that one, and so on for every replacement. That turns out to be a lot of data copied back and forth, not to mention a lot of memory used.

If you know the replacements will always be contained within a single line, you can read the file line by line by iterating over it. Python buffers the reads, so this is already fairly well optimized. You should open a second file, under a new name, and write the converted lines to it as you go. Perform the replacement on each line in turn and write it out immediately. Doing this greatly reduces both the amount of memory used and the amount of data copied back and forth as you do the replacements:

# this goes inside the `for dirs, subdirs, files in os.walk(dir):` loop
for file in files:
    fname = os.path.join(dir, file)
    inFile = codecs.open(fname, "r", "utf-8")
    outFile = codecs.open(fname + ".new", "w", "utf-8")
    for line in inFile:
        # do_replacements_on() stands for whichever replacement
        # strategy you choose (see the sketches below)
        newline = do_replacements_on(line)
        outFile.write(newline)
    inFile.close()
    outFile.close()
    # note: on Windows, os.rename() will not overwrite an existing file,
    # so you may need to os.remove(fname) first
    os.rename(fname + ".new", fname)
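For completeness, a minimal do_replacements_on could simply apply your dictionary to each line. Here `mapping` is a stand-in for your roughly 300-entry dict (the entries shown are made up); this still does one replace per key, which the rest of the answer improves on:

# -*- coding: utf-8 -*-
# a minimal sketch of do_replacements_on(); `mapping` stands in for the
# real dict of Chinese characters -> integer strings (entries are made up)
mapping = {u'你好': u'1', u'世界': u'2'}

def do_replacements_on(line):
    # one str.replace() per key, applied to a single line at a time
    for old, new in mapping.items():
        line = line.replace(old, new)
    return line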

If you can't be certain the matches will always fall within one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize), and keep careful track of whether there might be a partial match at the end of each block. Not as easy to do, but usually still worth it to avoid the 500 MB strings.
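A rough sketch of that block-wise variant follows. The names (replace_in_blocks, safe_cut, blocksize) and the mapping/do_replacements_on helpers are illustrative, not from the original answer; the idea is to only process text up to a split point that no key occurrence can straddle, and carry the rest over into the next read:

def safe_cut(buf, keys, max_key):
    # find a split point near the end of buf that no key occurrence crosses;
    # everything before it can be replaced now, the rest is carried forward
    cut = len(buf) - (max_key - 1)
    while cut > 0:
        straddles = any(buf.startswith(k, s)
                        for s in range(max(cut - max_key + 1, 0), cut)
                        for k in keys
                        if s + len(k) > cut)
        if not straddles:
            return cut
        cut -= 1
    return 0

def replace_in_blocks(inFile, outFile, blocksize=1024 * 1024):
    keys = list(mapping)
    max_key = max(len(k) for k in keys)   # longest key, in characters
    pending = u""
    while True:
        chunk = inFile.read(blocksize)
        if not chunk:
            # end of file: whatever is left can be processed in full
            outFile.write(do_replacements_on(pending))
            break
        buf = pending + chunk
        cut = safe_cut(buf, keys, max_key)
        outFile.write(do_replacements_on(buf[:cut]))
        pending = buf[cut:]

In the worst case safe_cut simply defers the whole block, so the result stays correct even with awkwardly overlapping keys, at the cost of a larger carry-over.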

Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:

>>> u"xff and ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
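Since the OP's keys are single Chinese characters mapped to integers, the whole dict could be turned into one translate table up front and applied per line. This sketch reuses the hypothetical `mapping` name from above, assumes every key really is a single character, and would replace the naive do_replacements_on:

# build the table once: Unicode codepoint (int) -> replacement string
table = dict((ord(char), unicode(number)) for char, number in mapping.items())

def do_replacements_on(line):
    # a single pass over the line, no per-key intermediate strings
    return line.translate(table)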

For replacing substrings (and not just single characters), you could use the re module. The re.sub function (and the sub method of compiled regexps) can take a callable (a function) as the first argument, which will then be called for each match:

>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'sausages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
...     return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, sausages and vikings'
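To tie this back to the earlier loop: for the OP's case the compiled pattern plus callable can itself serve as the do_replacements_on used in the line-by-line version (again a sketch built on the hypothetical `mapping`; sorting keys longest-first only matters if some keys are prefixes of others):

import re

# build one alternation of all escaped keys, longest first so that longer
# keys win over shorter keys that happen to be their prefix
pattern = re.compile(u"|".join(re.escape(k)
                               for k in sorted(mapping, key=len, reverse=True)))

def do_replacements_on(line):
    # one regex pass per line; each match is looked up in the dict
    return pattern.sub(lambda m: mapping[m.group(0)], line)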
