Python 3: Read UTF-8 file containing German umlaut


Question

I searched and found many similar questions and articles but none would allow me to resolve the issue.

I use Python 3.5.0 (v3.5.0:374f501f4567, Sep 13 2015, 02:27:37) [MSC v.1900 64 bit (AMD64)] on Windows 10.

I have a simple text file which is encoded for Windows in UTF-8 like so:

All I want to do is read the contents.

Here is a first attempt that fails miserably:

    file_name=r'c:\temp\encoding_test.txt'
    fh=open(file_name,'r')
    f_str=fh.read()
    fh.close()
    print(f_str)

The print statement raises an exception:

'charmap' codec can't encode character '\u201e' in position 100: character maps to <undefined>

Using a debugger, f_str contains the following:


'I would like the following characters to display correctly after reading this file into Python:\n\nÄÖÜäöüß\n'

This is already very puzzling to me. Doesn't Python 3 use UTF-8 as a default everywhere? What other encoding would work? I tried all of the ones Notepad++ supports, none works.

OK, a bit more sophisticated, I tried:

    import codecs
    file_name=r'c:\temp\encoding_test.txt'
    my_encoding='utf-8'
    fh=codecs.open(file_name,'r',encoding=my_encoding)
    f_str=fh.read().encode(my_encoding)
    fh.close()
    print(f_str)

This does not raise an exception, at least, but yields

b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

This is a complete mess to me. Can anyone here please help me sort this out?

Recommended answer

You are encoding the text back to bytes after reading it with codecs.open. Just printing the data should give you what you want, as you can see when we decode the bytes back:

In [31]: s = b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [32]: print(s)
b'I would like the following characters to display correctly after reading this file into Python:\r\n\r\n\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n'

In [33]: print(s.decode("utf-8"))
I would like the following characters to display correctly after reading this file into Python:

ÄÖÜäöüß
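
To see where the '\u201e' in the first attempt's error comes from, here is a small sketch. It assumes that a plain open(file_name, 'r') fell back to the locale's preferred encoding and that this was cp1252 on the asker's machine; the question does not confirm that. The output goes through ascii() so the result does not depend on the console encoding.

```python
import locale

# The UTF-8 bytes of the umlaut line, copied from the output in the question.
raw = b'\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f'

# Decoding with the encoding the file was saved in gives the seven umlauts.
as_utf8 = raw.decode('utf-8')

# Decoding the same bytes as cp1252 (an assumption about the Windows default)
# turns every umlaut into two characters, one of them U+201E.
as_cp1252 = raw.decode('cp1252')

print(ascii(as_utf8))
print(ascii(as_cp1252))

# The encoding that open() falls back to when no encoding argument is given.
print(locale.getpreferredencoding(False))
```

If the last line prints something other than a UTF-8 name, a bare open() call will not decode the file as UTF-8.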

If you are not seeing the correct output, then your shell encoding is the problem. The Windows console encoding is not UTF-8, so where you run the code from, and the encoding of that shell, both matter.
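
Putting both points together, a minimal sketch might look like the following. The helper names read_text and console_safe are hypothetical, the temporary file stands in for c:\temp\encoding_test.txt, and 'utf-8-sig' is chosen on the assumption that the file may or may not start with a byte order mark. The built-in open() accepts an encoding argument, so codecs.open is not needed here.

```python
import os
import sys
import tempfile


def read_text(path, encoding='utf-8-sig'):
    """Read a text file with an explicit encoding instead of the locale default."""
    with open(path, 'r', encoding=encoding) as fh:
        return fh.read()


def console_safe(text, stream=None):
    """Escape characters that the target stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    enc = getattr(stream, 'encoding', None) or 'ascii'
    return text.encode(enc, errors='backslashreplace').decode(enc)


# Stand-in for the file in the question, built from the bytes shown there.
raw = (b'I would like the following characters to display correctly '
       b'after reading this file into Python:\r\n\r\n'
       b'\xc3\x84\xc3\x96\xc3\x9c\xc3\xa4\xc3\xb6\xc3\xbc\xc3\x9f\r\n')
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as fh:
    fh.write(raw)

try:
    text = read_text(path)
finally:
    os.remove(path)

# Prints the umlauts where the console can show them, escapes otherwise.
print(console_safe(text))
```

The string in text is correct either way; console_safe only changes what gets printed, so a console that cannot show the umlauts prints escapes instead of raising UnicodeEncodeError. Setting the PYTHONIOENCODING environment variable is another way to influence the encoding Python uses for standard output.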
