u'\ufeff' in Python string
Question
I got an error with the following exception message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)
I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I remedy the situation? The .replace() string method doesn't work on it.
Recommended answer
The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell the difference between big- and little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Examples:
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8') # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16') # encode with BOM
e16le = u.encode('utf-16le') # encode without BOM
e16be = u.encode('utf-16be') # encode without BOM
print 'utf-8 %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le %r' % e16le
print 'utf-16be %r' % e16be
print
print 'utf-8 w/ BOM decoded with utf-8 %r' % e8s.decode('utf-8')
print 'utf-8 w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le %r' % e16.decode('utf-16le')
Note that EF BB BF is a UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).
Output:
utf-8 'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le 'A\x00B\x00C\x00'
utf-16be '\x00A\x00B\x00C'

utf-8 w/ BOM decoded with utf-8 u'\ufeffABC' # doesn't remove BOM if present.
utf-8 w/ BOM decoded with utf-8-sig u'ABC' # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC' # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le u'\ufeffABC' # doesn't remove BOM if present.
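For the scraping case in the question, the same behavior can be applied to downloaded bytes. What follows is a minimal sketch, assuming the page is UTF-8 with a leading signature. The name raw is a made-up stand-in for the downloaded body, and strip_bom is a hypothetical helper, not part of any library. It is written to run on both Python 2.7 and Python 3.

```python
BOM = u'\ufeff'  # U+FEFF as a single text character (note the backslash escape)


def strip_bom(text):
    """Hypothetical helper: drop one leading U+FEFF from already-decoded text."""
    return text[1:] if text.startswith(BOM) else text


# Made-up stand-in for bytes downloaded from a web page (UTF-8 with signature).
raw = b'\xef\xbb\xbfABC'

decoded_plain = raw.decode('utf-8')    # the BOM survives as a leading U+FEFF
decoded_sig = raw.decode('utf-8-sig')  # the codec removes the BOM if present

print(repr(decoded_plain))
print(repr(decoded_sig))

# If only decoded text is available, remove the character afterwards.
print(repr(strip_bom(decoded_plain)))

# replace() with the real U+FEFF character also removes it.
print(repr(decoded_plain.replace(BOM, u'')))
```

Decoding with utf-8-sig is usually the simpler choice when the raw bytes are available, since it also works unchanged on pages that have no BOM.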
Note that the utf-16 codec needs the BOM to determine the byte order; without one, Python cannot tell from the data whether it is big- or little-endian (a one-shot .decode('utf-16') then falls back to the platform's native byte order).
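The BOM check can also be made explicit before decoding. This is a sketch under stated assumptions: decode_utf16 is a hypothetical helper, and its default fallback of utf-16-le for data without a BOM is an arbitrary choice for illustration, not something the byte data can confirm. It uses the BOM constants from the standard codecs module and runs on Python 2.7 and Python 3.

```python
import codecs


def decode_utf16(data, default='utf-16-le'):
    """Hypothetical helper: decode UTF-16 bytes, using the BOM when present.

    `default` is an assumed fallback codec for data that has no BOM.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The 'utf-16' codec reads the BOM to pick the byte order and drops it.
        return data.decode('utf-16')
    # No BOM: the byte order has to come from outside the data (assumed here).
    return data.decode(default)


print(repr(decode_utf16(b'\xff\xfeA\x00B\x00C\x00')))  # little-endian with BOM
print(repr(decode_utf16(b'\xfe\xff\x00A\x00B\x00C')))  # big-endian with BOM
print(repr(decode_utf16(b'A\x00B\x00C\x00')))          # no BOM, fallback used
```

In practice the fallback should come from whatever the source declares, such as an HTTP Content-Type header, rather than from a hard-coded guess.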