How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?

Problem description

I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.

On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.

Here are my ideas so far:


  1. Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
  2. Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
  3. Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.

My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?

Recommended answer

It's easy to use Python to delete the DOS EOF chars; for example,

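import sys

def delete_eof(fin, fout):
    """Copy fin to fout in chunks, dropping every DOS EOF (Ctrl-Z, 0x1A) byte."""
    BUFSIZE = 2**15
    # Use a bytes literal: chr(26) is a str on Python 3 and would raise TypeError
    EOFCHAR = b"\x1a"
    data = fin.read(BUFSIZE)
    while data:
        # translate(None, delete) removes the listed bytes without remapping any others
        fout.write(data.translate(None, EOFCHAR))
        data = fin.read(BUFSIZE)

ipath = sys.argv[1]
opath = ipath + ".new"
# Binary mode reads the bytes as-is, so Ctrl-Z is not interpreted as end-of-file
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
    delete_eof(fin, fout)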

That takes a file path as its first argument, and copies the file but without chr(26) bytes to the same file path with .new appended. Fiddle to taste.
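If writing a second 15GB copy is undesirable, one variation is to filter the bytes on the fly and hand the cleaned lines straight to the parser. The following is only a sketch under stated assumptions: it targets Python 3, the helper name lines_without_eof is hypothetical, the default encoding of UTF-8 is a guess (any ASCII-compatible encoding should behave the same, but UTF-16 would not), and it assumes the files have ordinary line breaks so that no single line is too large for memory.

```python
import csv
import io

EOFCHAR = b"\x1a"  # DOS EOF / Ctrl-Z

def lines_without_eof(binary_file, encoding="utf-8"):
    """Yield decoded text lines from a binary file object, minus Ctrl-Z bytes.

    Hypothetical helper; the encoding default is an assumption.
    """
    # Iterating a binary file object yields one bytes line at a time (split on b"\n")
    for raw_line in binary_file:
        yield raw_line.translate(None, EOFCHAR).decode(encoding)

# Small in-memory demo; for a real file use open(path, "rb") instead of BytesIO
sample = io.BytesIO(b"a,b,c\r\n1,\x1a2,3\r\n4,5,6\x1a\r\n")
rows = list(csv.reader(lines_without_eof(sample)))
print(rows)
```

csv.reader accepts any iterable of strings, so the generator can stand in for a text-mode file object, which is close to the "read as text and ignore Ctrl-Z" behavior asked for in the question.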

By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.
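One way to check that suspicion is to count every unexpected control byte in the file before deciding what to strip. This is a rough sketch for Python 3, not part of the original answer: the function name count_control_bytes, the 1 MiB buffer size, and the choice to treat tab, LF, and CR as the only legitimate control bytes are all assumptions.

```python
import io

# Control bytes (0-31) other than tab, LF, and CR; treating these three as
# the only "normal" ones is an assumption about the data
SUSPECT = [b for b in range(32) if b not in (9, 10, 13)]

def count_control_bytes(fin, bufsize=2**20):
    """Return {byte value: occurrences} for suspicious control bytes in fin."""
    counts = {}
    data = fin.read(bufsize)
    while data:
        for b in SUSPECT:
            # bytes.count is a fast scan over the current chunk
            n = data.count(bytes([b]))
            if n:
                counts[b] = counts.get(b, 0) + n
        data = fin.read(bufsize)
    return counts

# Small in-memory demo; for a real file use open(path, "rb") instead of BytesIO
demo = io.BytesIO(b"a,b\r\n1,\x1a2\r\n\x003,4\x1a\r\n")
print(count_control_bytes(demo))
```

If the result contains keys other than 26, the files probably hold more than stray Ctrl-Z characters, and stripping 0x1A alone may not be enough.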
