How to process huge text files that contain EOF / Ctrl-Z characters using Python on Windows?


Problem description

I have a number of large comma-delimited text files (the biggest is about 15GB) that I need to process using a Python script. The problem is that the files sporadically contain DOS EOF (Ctrl-Z) characters in the middle of them. (Don't ask me why, I didn't generate them.) The other problem is that the files are on a Windows machine.

On Windows, when my script encounters one of these characters, it assumes it is at the end of the file and stops processing. For various reasons, I am not allowed to copy the files to any other machine. But I still need to process them.

Here are my ideas so far:

  1. Read the file in binary mode, throwing out bytes that equal chr(26). This would work, but it would take approximately forever.
  2. Use something like sed to eliminate the EOF characters. Unfortunately, as far as I can tell, sed on Windows has the same problem and will quit when it sees the EOF.
  3. Use some kind of Notepad program and do a find-and-replace. But it turns out that Notepad-type programs don't cope well with 15GB files.

My IDEAL solution would be some way to just read the file as text and simply ignore the Ctrl-Z characters. Is there a reasonable way to accomplish this?

Recommended answer

It's easy to use Python to delete the DOS EOF chars; for example,

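import sys

def delete_eof(fin, fout):
    """Copy fin to fout, dropping every DOS EOF (Ctrl-Z, 0x1A) byte."""
    BUFSIZE = 2**15
    EOFCHAR = b"\x1a"  # bytes literal: chr(26) is a text string on Python 3 and would fail here
    data = fin.read(BUFSIZE)
    while data:
        # translate(None, EOFCHAR) deletes the listed bytes and leaves all others unchanged
        fout.write(data.translate(None, EOFCHAR))
        data = fin.read(BUFSIZE)

ipath = sys.argv[1]
opath = ipath + ".new"
# Binary mode: bytes are read as-is, so Ctrl-Z is not interpreted as end-of-file
with open(ipath, "rb") as fin, open(opath, "wb") as fout:
    delete_eof(fin, fout)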

That takes a file path as its first argument, and copies the file but without chr(26) bytes to the same file path with .new appended. Fiddle to taste.
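
If writing a second 15GB copy is undesirable, the same filtering can probably be done on the fly, which is closer to the "read as text and ignore Ctrl-Z" behaviour the question asks for. The following is only a sketch, assuming Python 3: the helper names `iter_clean_lines` and `iter_rows` are made up for illustration, and the `latin-1` default is an assumption chosen because it can decode any byte, so substitute the files' real encoding if it is known.

```python
import csv

EOFCHAR = b"\x1a"  # DOS EOF / Ctrl-Z

def iter_clean_lines(path, encoding="latin-1"):
    """Yield decoded text lines from path with every Ctrl-Z byte removed."""
    # Binary mode: bytes are read as-is, so Ctrl-Z never ends the read early.
    with open(path, "rb") as fin:
        for raw in fin:  # a binary file iterates line by line, split on b"\n"
            yield raw.translate(None, EOFCHAR).decode(encoding)

def iter_rows(path, encoding="latin-1"):
    """Parse the cleaned lines as comma-separated rows (hypothetical helper)."""
    # csv.reader accepts any iterable of strings, not only file objects.
    return csv.reader(iter_clean_lines(path, encoding))
```

A loop such as `for row in iter_rows(ipath): ...` then processes one record at a time, so memory use stays small regardless of file size.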

By the way, are you sure that DOS EOF characters are your only problem? It's hard to conceive of a sane way in which they could end up in files intended to be treated as text files.
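
One way to check is to count every unexpected control byte in a file before deciding what to strip. This is a rough sketch, assuming Python 3; the function name `count_control_bytes` and the choice to treat only tab, line feed, and carriage return as "expected" are assumptions made for illustration.

```python
from collections import Counter

# Control bytes (values below 32) other than tab, line feed, and carriage return.
SUSPECT = bytes(b for b in range(32) if b not in b"\t\n\r")
# Every other byte value; deleting these leaves only the suspect bytes behind.
NOT_SUSPECT = bytes(b for b in range(256) if b not in SUSPECT)

def count_control_bytes(path, bufsize=2**20):
    """Return a Counter mapping each suspect byte value to its number of occurrences."""
    counts = Counter()
    with open(path, "rb") as fin:
        while True:
            data = fin.read(bufsize)
            if not data:
                break
            # translate() does the bulk filtering in C; only the few leftovers are counted in Python.
            counts.update(data.translate(None, NOT_SUSPECT))
    return counts
```

If the result contains keys other than 26, the files hold more than stray Ctrl-Z characters and may not be plain text at all.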
