Convert UTF-8 with BOM to UTF-8 with no BOM in Python


Question

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this, but I don't see any good usage examples. Would this be the best way to handle it?

source files:
Tue Jan 17$ file brh-m-157.json 
brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I have seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output UTF-8 without BOM?

Edit 1: proposed solution from below (thanks!)

fp = open('brh-m-157.json','rw')
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding  
fp.write(s)

This gives me the following error:

IOError: [Errno 9] Bad file descriptor

Newsflash

I'm being told in the comments that the mistake is opening the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
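
For reference, here is a minimal sketch of what the corrected snippet might look like, assuming Python 3 and a file small enough to read into memory. The helper name `strip_bom_in_place` is made up for illustration; the points that matter are the 'r+b' mode and rewinding before the write.

```python
def strip_bom_in_place(path):
    """Rewrite a UTF-8 file without its BOM (hypothetical helper name)."""
    # 'r+b' opens an existing file for reading and writing, in binary mode.
    with open(path, "r+b") as fp:
        raw = fp.read()
        # utf-8-sig drops a leading BOM if there is one; input without a
        # BOM comes back unchanged after the round trip.
        data = raw.decode("utf-8-sig").encode("utf-8")
        fp.seek(0)      # go back to the start before writing
        fp.write(data)
        fp.truncate()   # cut off the leftover bytes at the end
```

Calling `strip_bom_in_place('brh-m-157.json')` would rewrite that file in place. Without the `seek(0)` the new content would be appended after the old content rather than replacing it.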

Recommended answer

Simply use the "utf-8-sig" codec:

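fp = open("file.txt", "rb")  # binary mode, so read() returns raw bytes
s = fp.read()
fp.close()
u = s.decode("utf-8-sig")  # decodes UTF-8 and drops a leading BOM if present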

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, you should avoid reading them entirely into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

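import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0  # write position: where the next chunk belongs
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)
            fp.write(chunk)
            i += len(chunk)
            # The read position is always BOMLEN bytes ahead of the write position.
            fp.seek(BOMLEN, os.SEEK_CUR)
            chunk = fp.read(BUFSIZE)
        # Drop the last BOMLEN bytes, which are now stale duplicates.
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()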

It opens the file, reads a chunk, and writes it back to the file 3 bytes earlier than where it was read. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but it uses twice the disk space for a short period.
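
The new-file approach could be sketched roughly as follows. This is not newtover's actual code, just one way to do it under stated assumptions: Python 3.3+ (for `os.replace`), and a hypothetical function name and signature chosen for illustration.

```python
import codecs
import os
import shutil
import tempfile

def strip_bom_via_copy(path, bufsize=4096):
    """Copy the file minus its BOM to a temp file, then swap it in.

    Returns True if a BOM was removed. Name and signature are illustrative.
    """
    with open(path, "rb") as src:
        if src.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
            return False  # no BOM, leave the file untouched
        # Create the temp file in the same directory so the final rename
        # stays on one filesystem.
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as dst:
                # src is already positioned just past the BOM.
                shutil.copyfileobj(src, dst, bufsize)
        except BaseException:
            os.remove(tmp_path)
            raise
    os.replace(tmp_path, path)  # swap the new file in over the original
    return True
```

The original stays intact until the final rename, so an interrupted run does not leave a half-shifted file behind. One caveat: `tempfile.mkstemp` creates the file with restrictive permissions, so the original file's mode is not carried over unless you copy it yourself.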

As for guessing the encoding, you can loop through the candidate encodings from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try UTF-8 first. If that fails, we try UTF-16. Finally, we use Latin-1, which will always work since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
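
A minimal sketch of the "return None" variant, together with the re-encoding step the question asks for. The names `decode_or_none` and `to_utf8_no_bom` are made up for illustration, and the code assumes the input is a byte string (for example, read from a file opened in binary mode).

```python
def decode_or_none(data):
    """Like decode() above, but return None instead of falling back to Latin-1."""
    for encoding in ("utf-8-sig", "utf-16"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None  # nothing matched; let the caller decide what to do


def to_utf8_no_bom(data):
    """Guess the encoding of data and re-encode it as UTF-8 without a BOM."""
    text = decode_or_none(data)
    if text is None:
        raise ValueError("could not guess the input encoding")
    # The plain "utf-8" codec never writes a BOM.
    return text.encode("utf-8")
```

Keep in mind that this is only a guess: many byte strings of even length decode as UTF-16 without error, so a successful decode does not prove the file really was UTF-16.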
