Why is Python insisting on using ASCII?
Problem description
When parsing an HTML file with Requests and Beautiful Soup, the following line is throwing an exception on some web pages:
if 'var' in str(tag.string):
Here is the context:
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))
for tag in soup.findAll('script'):
    if 'var' in str(tag.string):  # This is the line throwing the exception
        print(tag.string)
Here is the exception:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)
I have tried both with and without the encode('utf-8') call in the BeautifulSoup line; it makes no difference. I do note that on the pages throwing the exception there is a character Ã in a comment in the JavaScript, even though the encoding reported by response.encoding is ISO-8859-1. I realise that I can remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to UTF-8 and keep the characters. None of the following methods help to change the variable to UTF-8:
tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')
What do I have to do to this string to convert it into usable UTF-8? Thanks!
Recommended answer
OK, so basically you're getting an HTTP response encoded in Latin-1. The character giving you problems is indeed Ã: in the Latin-1 code table, 0xC3 is exactly that character.
I think you blindly tested every decoding/encoding combination you could think of. First of all, if you do this: if 'var' in str(tag.string): then whenever tag.string contains non-ASCII characters, Python will complain, because str() in Python 2 implicitly converts through the ASCII codec.
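To make that failure concrete, here is a minimal sketch that forces the ASCII codec explicitly instead of relying on an implicit conversion, so it behaves the same way on Python 3. The byte string `data` is a made-up stand-in for the script content, not data from the question; it places the byte 0xC3 at position 15 so that the error matches the one reported.

```python
# Hypothetical script content: UTF-8 bytes containing a non-ASCII character.
# The byte 0xC3 sits at index 15 of this byte string.
data = b'var comment = "\xc3\x83";'

try:
    # The ASCII codec only accepts bytes in the range 0-127.
    data.decode('ascii')
    message = None
    position = None
except UnicodeDecodeError as exc:
    message = str(exc)
    position = exc.start  # index of the first byte that could not be decoded

# Naming a codec that covers the byte avoids the error entirely.
decoded = data.decode('utf-8')

print(message)
print(position)
```

The point of the sketch is that the ASCII codec is only the fallback: as soon as the conversion names a suitable encoding, the same bytes decode without complaint.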
Looking at the code you've shared with us, the right approach IMHO would be:
import bs4
import requests

response = requests.get(url)
# Pass the raw bytes (response.content) and let Beautiful Soup decode them;
# from_encoding is ignored when the markup is already a unicode string
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)
for tag in soup.findAll('script'):
    # the soup now holds unicode strings, so compare against a unicode literal;
    # tag.string is None for script tags without inline content
    if tag.string and u'var' in tag.string:
        # now if you want output in utf-8
        print(tag.string.encode('utf-8'))
Edit: It will be useful for you to take a look at the encoding section of the Beautiful Soup 4 documentation.
Basically, the logic is:
- You get some bytes encoded in encoding X
- You decode them from X by doing bytes.decode('X'), which returns a unicode string
- You work with unicode
- You encode the unicode to some encoding Y for the output: ubytes.encode('Y')
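The steps above can be sketched as follows. This is only an illustration under stated assumptions: the byte string `raw` is invented for the example, and Latin-1 as X and UTF-8 as Y are simply the encodings discussed in this thread.

```python
# Bytes as they might arrive from the network, encoded in Latin-1 (encoding X).
# The sample content is hypothetical; 0xC3 is the character Ã in Latin-1.
raw = b'var name = "S\xc3o Paulo";'

# 1. Decode the bytes from encoding X into a unicode string.
text = raw.decode('latin-1')

# 2. Work with unicode: the substring test needs no ASCII conversion.
has_var = u'var' in text

# 3. Encode to encoding Y (UTF-8 here) only when producing output.
utf8_bytes = text.encode('utf-8')

print(has_var)
print(len(raw), len(utf8_bytes))
```

Note that the UTF-8 output is one byte longer than the Latin-1 input, because Ã takes a single byte in Latin-1 but two bytes in UTF-8.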
Hope this sheds some light on the problem.