'charmap' codec can't encode character '\xae' While Scraping a Webpage
Problem description
I am web scraping with Python using BeautifulSoup, and I am getting this error:
'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>
when scraping a webpage.
This is my Python code:
hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried: print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')
This problem usually shows up (in Python 2) when you call .encode() on a byte string that is already encoded. Try decoding it to Unicode first, as in:
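import urllib  # Python 2

html = urllib.urlopen(link).read()            # raw bytes from the server
unicode_str = html.decode(<source encoding>)  # substitute the page's actual encoding
encoded_str = unicode_str.encode("utf8")      # re-encode as UTF-8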
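For readers on Python 3, where bytes and str are separate types, the same decode-then-encode idea can be sketched as below. This is only an illustration: `to_utf8` is a hypothetical helper name, `raw` stands in for the bytes you would read from the HTTP response, and "windows-1252" is an assumed source encoding, not something known about your page.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes with their real encoding, then re-encode as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> str (Unicode text)
    return text.encode("utf-8")         # str -> UTF-8 bytes


# Stand-in for the bytes returned by reading the HTTP response.
raw = b"\xae"

# "windows-1252" is an assumption made for this example only.
utf8_bytes = to_utf8(raw, "windows-1252")
print(utf8_bytes)
```

The important part is the order: decode with the encoding the bytes really use, and only then encode to the encoding you want.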
As a Python 2 example:
html = '\xae'
encoded_str = html.encode("utf8")
fails with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
While:
html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®
Succeeds without error. Note that "windows-1252" is only an example. I got it from chardet, which reported just 0.5 confidence that it is right (not surprising for a one-character string). Replace it with whatever encoding actually applies to the byte string returned from .urlopen().read() for the content you retrieved.
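If you do not know the source encoding up front, one rough option is to try a short list of likely candidates in order. The sketch below is written for Python 3 and is only an assumption-laden illustration: `decode_with_fallback` is a hypothetical helper, and the candidate list is a guess rather than a rule. A detector such as chardet, or the charset the server declares, is usually a better source when available.

```python
from typing import Sequence, Tuple


def decode_with_fallback(
    raw: bytes,
    candidates: Sequence[str] = ("utf-8", "windows-1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback(b"\xae")
# ascii() keeps the output printable even on consoles that cannot show the symbol.
print(used, ascii(text))
```

Note that "latin-1" accepts every byte value, so with the default list the helper never raises. A successful decode therefore only means the bytes were valid in that encoding, not that the guess was right.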