Convert UTF-8 with BOM to UTF-8 with no BOM in Python

Question

Two questions here. I have a set of files which are usually UTF-8 with BOM. I'd like to convert them (ideally in place) to UTF-8 with no BOM. It seems like codecs.StreamRecoder(stream, encode, decode, Reader, Writer, errors) would handle this. But I don't really see any good examples on usage. Would this be the best way to handle this?
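
For reference, a minimal sketch of how codecs.StreamRecoder might be wired up for this, going by the codecs documentation (untested; the Reader/Writer pair handles the encoding of the underlying file, while the encode/decode pair handles what read() returns):

import codecs

recoder = codecs.StreamRecoder(
    open('brh-m-157.json', 'rb'),
    codecs.getencoder('utf-8'),      # frontend: unicode -> plain UTF-8 (no BOM)
    codecs.getdecoder('utf-8'),      # frontend: UTF-8 bytes -> unicode
    codecs.getreader('utf-8-sig'),   # backend: decodes the file, stripping the BOM
    codecs.getwriter('utf-8-sig'),   # backend writer (unused when only reading)
)
data = recoder.read()                # UTF-8 bytes with the BOM removed
recoder.close()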

source files:
Tue Jan 17$ file brh-m-157.json 
brh-m-157.json: UTF-8 Unicode (with BOM) text

Also, it would be ideal if we could handle different input encodings without explicitly knowing them (I've seen ASCII and UTF-16). It seems like this should all be feasible. Is there a solution that can take any known Python encoding and output as UTF-8 without a BOM?

edit 1: proposed solution from below (thanks!)

fp = open('brh-m-157.json','rw')   # bug: 'rw' is not a valid mode, so the write below fails (see Newsflash)
s = fp.read()
u = s.decode('utf-8-sig')
s = u.encode('utf-8')
print fp.encoding  
fp.write(s)

This gives me the following error:

IOError: [Errno 9] Bad file descriptor

Newsflash

I'm being told in the comments that the mistake is that I open the file with mode 'rw' instead of 'r+'/'r+b', so I should eventually re-edit my question and remove the solved part.
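
For completeness, here is the snippet above with that fix applied, plus the seek and truncate an in-place rewrite needs (a sketch, assuming 'r+b' as suggested in the comments):

fp = open('brh-m-157.json', 'r+b')
s = fp.read()
u = s.decode('utf-8-sig')  # drops the BOM if present
s = u.encode('utf-8')      # re-encode without a BOM
fp.seek(0)                 # rewind; read() left the position at end-of-file
fp.write(s)
fp.truncate()              # the BOM-less content is 3 bytes shorter
fp.close()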

Answer

Simply use the "utf-8-sig" codec:

fp = open("file.txt")
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s. If your files are big, then you should avoid reading them all into memory. The BOM is simply three bytes at the beginning of the file, so you can use this code to strip them out of the file:

import os, sys, codecs

BUFSIZE = 4096
BOMLEN = len(codecs.BOM_UTF8)

path = sys.argv[1]
with open(path, "r+b") as fp:
    chunk = fp.read(BUFSIZE)
    if chunk.startswith(codecs.BOM_UTF8):
        i = 0
        chunk = chunk[BOMLEN:]
        while chunk:
            fp.seek(i)                    # jump back to the write position
            fp.write(chunk)               # rewrite the chunk BOMLEN bytes earlier
            i += len(chunk)
            fp.seek(BOMLEN, os.SEEK_CUR)  # move forward to the next unread byte
            chunk = fp.read(BUFSIZE)
        fp.seek(-BOMLEN, os.SEEK_CUR)
        fp.truncate()                     # chop off the BOMLEN leftover bytes at the end

It opens the file, reads a chunk, and writes it back out 3 bytes earlier than where it read it. The file is rewritten in place. An easier solution is to write the shorter content to a new file, as in newtover's answer. That would be simpler, but uses twice the disk space for a short period.
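
A sketch of that simpler copy-to-a-new-file approach (the output filename is hypothetical; chunked I/O keeps large files out of memory):

import codecs

src = 'brh-m-157.json'
dst = 'brh-m-157.nobom.json'  # hypothetical output name
with open(src, 'rb') as infile, open(dst, 'wb') as outfile:
    chunk = infile.read(4096)
    if chunk.startswith(codecs.BOM_UTF8):
        chunk = chunk[len(codecs.BOM_UTF8):]  # drop the BOM from the first chunk
    while chunk:
        outfile.write(chunk)
        chunk = infile.read(4096)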

As for guessing the encoding, you can loop through the encodings from most to least specific:

def decode(s):
    for encoding in "utf-8-sig", "utf-16":
        try:
            return s.decode(encoding)
        except UnicodeDecodeError:
            continue
    return s.decode("latin-1") # will always work

A UTF-16 encoded file won't decode as UTF-8, so we try with UTF-8 first. If that fails, we try with UTF-16. Finally, we use Latin-1; this will always work, since all 256 byte values are legal in Latin-1. You may want to return None instead in this case, since it's really a fallback and your code might want to handle it more carefully (if it can).
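
Putting it together, a short sketch of using the decode() function above to rewrite a file as BOM-less UTF-8 (filename from the question):

with open('brh-m-157.json', 'rb') as fp:
    u = decode(fp.read())        # unicode, whatever the input encoding was
with open('brh-m-157.json', 'wb') as fp:
    fp.write(u.encode('utf-8'))  # written back as UTF-8 without a BOM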
