Python cannot open UTF-8 encoded text file


Question

I have a .py script that contains the following code to open a specific text file (which was generated by Exchange PowerShell):

import codecs

with codecs.open("C:\\Temp\\myfile.txt", encoding="utf_8", mode="r", errors="replace") as myfile:
    content = myfile.readlines()  # read all lines into a list
    print(content)

However, I also tried utf-16-be and utf-16-le (and standard ASCII, obviously), but the output still looks like this (this is just part of it):

['��\r', '\x00\n', '\x00D\x00o\x00m\x00a\x00i\x00n\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r', '\x00\n', '\x00-\x00-\x00-\x00-\x00-\x00-\x00 

The file I am trying to open is located here.

Does anybody know what I am doing wrong? Is this some different kind of encoding?

Recommended answer

First, this text is definitely not UTF-8, which is why Python can't open it as a UTF-8-encoded text file.

Second, you claim you "tried also utf-16-be and utf-16-le", but you didn't show how you did that, and I suspect you did it wrong.

From the output, this is very likely UTF-16-LE with a BOM.

Because of the way you've printed them, we can't tell exactly which two bytes the file starts with, but this is what \xFF and \xFE bytes look like when printed this way. The rest of the output is NUL bytes alternating with reasonable-looking bytes, which almost always means UTF-16-LE. Plus, the most common two-byte encoding with a BOM in the wild is UTF-16-LE, and the fact that you're using all Microsoft tools makes that even more likely.
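As a rough illustration of that reasoning, the sketch below builds a small in-memory sample (a hypothetical stand-in for the first bytes of the real file, which is not available here), decodes it as UTF-8 with errors="replace" the way the question does, and then checks the leading bytes against the BOM constants in the codecs module. The helper name sniff_bom is made up for this example.

```python
import codecs

def sniff_bom(raw: bytes) -> str:
    """Guess a codec name from a leading BOM, or return 'unknown'."""
    # Check UTF-32 first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return "unknown"

# Hypothetical sample: a UTF-16-LE BOM followed by "\r\nDomain" in UTF-16-LE.
sample = codecs.BOM_UTF16_LE + "\r\nDomain".encode("utf-16-le")

# Decoding as UTF-8 with errors="replace" turns each BOM byte into U+FFFD
# and leaves the NUL bytes in place, much like the output in the question.
garbled = sample.decode("utf_8", errors="replace").splitlines(True)
print(garbled)

# Looking at the raw bytes instead points at UTF-16.
print(sniff_bom(sample))
```

A BOM check like this is only a heuristic: a file without a BOM yields 'unknown', and the encoding then has to be determined some other way.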

So, if you had really tried utf-16-le, you would almost certainly have gotten the right string, but with an extra \ufeff at the start.
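A minimal sketch of that difference, using an in-memory sample rather than the real file (the sample text "Domain" is an assumption borrowed from the output in the question):

```python
import codecs

# Hypothetical sample: UTF-16-LE BOM followed by the text "Domain".
raw = codecs.BOM_UTF16_LE + "Domain".encode("utf-16-le")

# The endian-specific codec treats the BOM as an ordinary character.
as_le = raw.decode("utf-16-le")
print(repr(as_le))

# The generic codec reads the BOM to pick the byte order, then drops it.
as_generic = raw.decode("utf-16")
print(repr(as_generic))
```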

But of course the right answer is to just decode it as 'utf-16', which will consume and use the BOM properly.
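Applied to a file, that might look like the sketch below. It assumes Python 3 and uses the built-in open() instead of codecs.open(); to stay runnable it writes a small temporary file with made-up contents rather than reading C:\Temp\myfile.txt.

```python
import codecs
import os
import tempfile

# Hypothetical file contents standing in for the Exchange PowerShell output.
payload = "Domain\r\n------\r\n"

# Write the sample as UTF-16-LE with a BOM, the suspected layout of the real file.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF16_LE + payload.encode("utf-16-le"))
    path = tmp.name

try:
    # encoding="utf-16" reads the BOM to choose the byte order and strips it.
    with open(path, encoding="utf-16") as myfile:
        content = myfile.readlines()
finally:
    os.remove(path)

print(content)
```

In text mode the built-in open() also translates \r\n line endings to \n by default, so the lines come back without stray \r characters.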
