UTF-8 in Python on Windows

Question

I have an HTML file to read and parse. It is encoded as Unicode (I checked in Notepad), but when I tried

infile = open("path", "r") 
infile.read()

it failed with the well-known error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position xx: character maps to <undefined>

So, as a test, I copied the contents of the file into a new one, saved it as UTF-8, and then tried to open it with codecs like this:

import codecs

inFile = codecs.open("path", "r", encoding="utf-8")
outputStream = inFile.read()

But I get this error message:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

I really don't understand, because I created this file in UTF-8.

Recommended answer

UnicodeEncodeError suggests that the code fails while encoding Unicode text to bytes, i.e., your actual code tries to print to the Windows console. See "Python, Unicode, and the Windows console".
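
To illustrate the distinction, here is a minimal sketch in which reading (decoding) succeeds and only the step that print() performs for the console (encoding) fails. The sample bytes and the choice of cp437 as the console code page are assumptions for illustration; the actual code page depends on the machine.

```python
# Decoding bytes -> str succeeds; the BOM survives as the character U+FEFF.
raw = b"\xef\xbb\xbf<html>caf\xc3\xa9</html>"
text = raw.decode("utf-8")

# Encoding str -> bytes is what print() has to do for the console.
# cp437 is used here only as an example of a legacy console code page.
try:
    text.encode("cp437")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, "at position", error.start)

# One possible workaround: replace unencodable characters instead of failing.
safe = text.encode("cp437", errors="replace").decode("cp437")
print(safe)
```

The failure is at position 0 because U+FEFF has no representation in that code page, which matches the second error message in the question.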

The link above fixes the UnicodeEncodeError. The next issue is to find out which character encoding the text in your "path" file uses. If notepad.exe shows the text correctly, then either it is encoded using locale.getpreferredencoding(False) (something like cp1252 on Windows) or the file has a BOM.
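
Both possibilities can be checked from Python. The sketch below prints the locale's preferred encoding and looks for a BOM at the start of the raw bytes. The helper name sniff_bom is hypothetical (it is not part of any library), and the sample bytes stand in for the first bytes of the real file, read in binary mode.

```python
import codecs
import locale


def sniff_bom(data):
    """Return a codec name implied by a leading BOM, or None if there is none."""
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None


# The encoding open() falls back to when none is given.
print(locale.getpreferredencoding(False))

# For a real file: open("path", "rb").read(4)
print(sniff_bom(b"\xef\xbb\xbf<html>"))
```

If the file was saved from Notepad as "Unicode", it is probably UTF-16 with a BOM, in which case this check would report utf-16 rather than utf-8-sig.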

If you are sure that the encoding is UTF-8, then pass it to open() directly. Don't use codecs.open():

with open('path', encoding='utf-8') as file:
    html = file.read()
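
If the UTF-8 file starts with a BOM (as the u'\ufeff' in the question suggests), the standard "utf-8-sig" codec may be a better fit, since it drops the BOM while decoding. A small sketch using a temporary file; the file contents are made up for the demonstration:

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM, as Notepad may do.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbf<html>ok</html>")

# Plain "utf-8" keeps the BOM as the first character of the text.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" strips a leading BOM if present and otherwise behaves like "utf-8".
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

os.remove(path)

print(repr(with_bom[0]))
print(without_bom)
```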

Sometimes the input contains text encoded with multiple, inconsistent encodings. For example, smart quotes may be encoded as cp1252 while the rest of the HTML is UTF-8. You could fix it using bs4.UnicodeDammit. See also "A good way to get the charset/encoding of an HTTP response in Python".
