Python utf-8-sig BOM in the middle of the file when appending to the end

Question

I've noticed recently that Python behaves in a non-obvious way when appending to a file using the utf-8-sig encoding. See below:

>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')

The following text ends up in the file:

<BOM>123
<BOM>123

Isn't that a bug? It seems illogical. Could anyone explain why it was done this way? Why isn't the BOM prepended only when the file doesn't exist and needs to be created?

Recommended answer

No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much has already been written to a file. For example, you could use it to append to a pre-created but empty file: the file would not be new, but it would not contain a BOM either.

Then there are other use cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()), so there is no file to test at all, or where the developer wants to enforce a BOM at the start of the output, always.

Only use utf-8-sig on a new file; the codec will write the BOM at the start of its output every time you use it.
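To make that concrete, here is a minimal sketch that uses the codec on plain bytestrings, with no file involved. The variable names are illustrative only. Each independent encode call emits its own BOM, and decoding with utf-8-sig strips only the leading one.

```python
import codecs

# Each independent encode call starts with the BOM, because the codec
# has no knowledge of any earlier output.
first = u'123\n'.encode('utf-8-sig')
second = u'123\n'.encode('utf-8-sig')

# Concatenating the two results mimics what two separate appends
# leave on disk.
combined = first + second

# Decoding with utf-8-sig strips only a leading BOM; the second one
# survives as a U+FEFF character in the middle of the text.
decoded = combined.decode('utf-8-sig')

print(repr(combined))
print(repr(decoded))
```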

If you are working directly with files, you can test for the start of the file yourself; use utf-8 instead and write the BOM manually, which is just an encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:

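import io

filename = 'example.txt'  # placeholder path; substitute your own

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # file is empty, so we are at the start: write the BOM once
        outfh.write(u'\ufeff')
    outfh.write(u'123\n')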

I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3 (where the built-in open() is the same function), and in my experience it is more robust than codecs for handling encoded files.
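As a rough sketch of how the manual-BOM approach could be wrapped for repeated appends, the snippet below defines a hypothetical helper, append_text (not part of any library), and runs it twice against a temporary file. The file name and directory are assumptions made for the demonstration only.

```python
import codecs
import io
import os
import tempfile


def append_text(path, text):
    """Hypothetical helper: append text, writing a BOM only if the file is empty."""
    with io.open(path, 'a', encoding='utf8') as outfh:
        # In append mode the position starts at the end of the file,
        # so 0 means the file is new or empty.
        if outfh.tell() == 0:
            outfh.write(u'\ufeff')
        outfh.write(text)


# Demonstration in a throwaway directory (assumed paths).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.txt')
append_text(demo_path, u'123\n')
append_text(demo_path, u'123\n')

# Read the raw bytes back to see where the BOM ended up.
with io.open(demo_path, 'rb') as infh:
    raw = infh.read()

print(repr(raw))
```

With this check in place, the second append finds a non-empty file and skips the BOM, so the file should contain a single BOM at the very start.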

Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so its BOM can take only one form and says nothing about byte order. UTF-16 or UTF-32, on the other hand, can be written with one of two distinct byte orders, which is why a BOM is needed there.
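The difference can be seen by encoding the same character under each byte order. This is only an illustrative sketch; the variable names are made up for the example.

```python
import codecs

text = u'a'

# UTF-16 has two possible byte orders for the same code unit.
little = text.encode('utf-16-le')  # b'a\x00'
big = text.encode('utf-16-be')     # b'\x00a'

# The UTF-16 BOM differs per byte order, which is what makes it useful.
print(repr(codecs.BOM_UTF16_LE))   # b'\xff\xfe'
print(repr(codecs.BOM_UTF16_BE))   # b'\xfe\xff'

# UTF-8 encodes U+FEFF in exactly one way, so its "BOM" only signals
# "this is UTF-8", never a byte order.
utf8_bom = u'\ufeff'.encode('utf-8')
print(repr(utf8_bom))              # b'\xef\xbb\xbf'
```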

The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. to tell that it is UTF-8 rather than one of the legacy code pages).
