Python 3: Read UTF-8 file containing German umlauts
Question
I searched and found many similar questions and articles, but none of them let me resolve the issue.
I use Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on Windows 10.
I have a simple text file, encoded in UTF-8 on Windows, like so:
All I want to do is read its contents.
Here is a first attempt that fails miserably:
file_name=r'c:\temp\encoding_test.txt'
fh=open(file_name,'r')
f_str=fh.read()
fh.close()
print(f_str)
The print statement raises an exception:
'charmap' codec can't encode character '\u201e' in position 100: character maps to <undefined>
Using a debugger, f_str contains the following:
'I would like the following characters to display correctly after reading this file into Python:\n\nÃ„Ã–ÃœÃ¤Ã¶Ã¼ÃŸ\n'
This is already very puzzling to me. Doesn't Python 3 use UTF-8 as a default everywhere? What other encoding would work? I tried all of the ones Notepad++ supports, and none of them works.
OK, a bit more sophisticated, I tried:
import codecs
file_name=r'c:\temp\encoding_test.txt'
my_encoding='utf-8'
fh=codecs.open(file_name,'r',encoding=my_encoding)
f_str=fh.read().encode(my_encoding)
fh.close()
print(f_str)
This does not raise an exception, at least, but yields
b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'
This is a complete mess to me. Can anyone here please help me sort this out?
Recommended answer
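You are encoding to bytes after using codecs.open. codecs.open already decodes the file into a str, so the extra .encode(my_encoding) turns it back into bytes, and that is why print shows a b'...' literal. Just printing the data should give you what you want, as you can see when we decode back: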
In [31]: s = b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'
In [32]: print(s)
b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'
In [33]: print(s.decode("utf-8"))
I would like the following characters to display correctly after reading this file into Python:
ÄÖÜäöüß
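For illustration, the same read can be done with the built-in open by passing the encoding explicitly, with no encode call afterwards. This is only a minimal sketch: the temporary file stands in for c:\temp\encoding_test.txt, the helper name read_text_utf8 is made up for this example, and cp1252 is assumed as the Windows default codec that open falls back to when no encoding is given. Decoding the same bytes with that codec reproduces the kind of garbled string the debugger showed, including the '\u201e' character from the error message.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a whole text file, decoding it explicitly as UTF-8 (hypothetical helper)."""
    with open(path, 'r', encoding='utf-8') as fh:
        return fh.read()


# Stand-in for the real file: write the sample text as raw UTF-8 bytes.
sample = 'ÄÖÜäöüß\n'
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as fh:
    fh.write(sample.encode('utf-8'))

try:
    # Correct: the bytes are decoded with the codec they were written in.
    f_str = read_text_utf8(tmp_path)

    # Wrong codec (assumed Windows default): each 2-byte UTF-8 sequence
    # turns into two unrelated cp1252 characters.
    with open(tmp_path, 'r', encoding='cp1252') as fh:
        garbled = fh.read()
finally:
    os.remove(tmp_path)

# ascii() escapes non-ASCII characters, so these lines are safe on any console.
print(ascii(f_str))
print(ascii(garbled))
```

f_str is an ordinary str here, so there is nothing left to decode or encode before printing it.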
If you are not seeing the correct output, then your shell encoding is the problem. The Windows console encoding is not UTF-8, so both where you run the code from and the shell's encoding matter.
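One way to check this is to look at sys.stdout.encoding, which is the codec print uses for the console. The sketch below is an assumption-laden illustration rather than a fix: the helper name printable is made up, and it merely replaces characters the console codec cannot represent with backslash escapes so that print no longer raises. To actually display the characters, the console itself has to use a codec that contains them (setting the PYTHONIOENCODING environment variable or changing the console code page are possible routes, depending on the setup).

```python
import sys


def printable(text, encoding=None):
    """Return text with characters missing from `encoding` replaced by escapes.

    Hypothetical helper: `encoding` defaults to the console's codec.
    """
    encoding = encoding or sys.stdout.encoding or 'ascii'
    # backslashreplace swaps unencodable characters for \xNN / \uNNNN escapes.
    return text.encode(encoding, errors='backslashreplace').decode(encoding)


# The codec print() uses for this console, e.g. a legacy code page on Windows.
print(sys.stdout.encoding)

# Never raises UnicodeEncodeError, whatever the console codec is.
print(printable('ÄÖÜäöüß \u201e'))
```

With a legacy code page such as cp437, the umlauts come through unchanged because that code page contains them, while '\u201e' is shown as an escape instead of aborting the program.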