python utf-8-sig BOM in the middle of the file when appending to the end
Question
I've noticed recently that Python behaves in a rather non-obvious way when appending to a file using the utf-8-sig encoding. See below:
>>> import codecs, os
>>> os.path.isfile('123')
False
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
>>> codecs.open('123', 'a', encoding='utf-8-sig').write('123\n')
The following text ends up in the file:
<BOM>123
<BOM>123
Isn't that a bug? This seems illogical. Could anyone explain to me why it was done this way? Why didn't they prepend the BOM only when the file doesn't exist and needs to be created?
Answer
No, it's not a bug; that's perfectly normal, expected behavior. The codec cannot detect how much has already been written to a file; you could use it, for example, to append to a pre-created but empty file. The file would not be new, but it would not contain a BOM either.
Then there are other use cases where the codec is used on a stream or bytestring (i.e. not with codecs.open()), where there is no file at all to test, or where the developer wants to always enforce a BOM at the start of the output.
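That file-independent behavior is easy to see without touching the filesystem at all; a minimal sketch of encoding a plain string:

```python
# 'utf-8-sig' always prepends the BOM (the UTF-8 encoding of U+FEFF,
# i.e. the bytes EF BB BF) when encoding, regardless of whether the
# output is a new file, an existing file, or no file at all.
encoded = '123\n'.encode('utf-8-sig')
print(encoded)  # b'\xef\xbb\xbf123\n'
```

This is exactly why the codec writes a BOM on every append: each write session starts a fresh encoder, and the encoder itself has no way to know what came before.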
Only use utf-8-sig on a new file; the codec will always write the BOM out whenever you use it.
If you are working directly with files, you can test for the start yourself; use utf-8 instead and write the BOM manually, which is just the encoded U+FEFF ZERO WIDTH NO-BREAK SPACE:
import io

with io.open(filename, 'a', encoding='utf8') as outfh:
    if outfh.tell() == 0:
        # start of file
        outfh.write(u'\ufeff')
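As a quick sketch of how that check behaves across repeated appends (the append_with_bom helper name and the temporary-file setup are mine, not part of the answer), appending twice yields a single BOM at the start of the file:

```python
import io
import os
import tempfile

# Hypothetical helper: append text, writing the BOM only when the
# file is still empty (tell() == 0 in append mode means empty file).
def append_with_bom(filename, text):
    with io.open(filename, 'a', encoding='utf8') as outfh:
        if outfh.tell() == 0:
            outfh.write(u'\ufeff')
        outfh.write(text)

path = os.path.join(tempfile.mkdtemp(), '123')
append_with_bom(path, u'123\n')
append_with_bom(path, u'123\n')

with open(path, 'rb') as f:
    print(f.read())  # b'\xef\xbb\xbf123\n123\n' (on POSIX; Windows translates \n)
```

This works because opening a file in append mode positions the stream at the end, so tell() is 0 only when the file is empty or newly created.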
I used the newer io.open() instead of codecs.open(); io is the new I/O framework developed for Python 3, and in my experience it is more robust than codecs for handling encoded files.
Note that the UTF-8 BOM is next to useless, really. UTF-8 has no variable byte order, so there is only one possible byte ordering. UTF-16 or UTF-32, on the other hand, can be written in one of two distinct byte orders, which is why a BOM is needed.
The UTF-8 BOM is mostly used by Microsoft products to auto-detect the encoding of a file (i.e. that it is not one of the legacy code pages).