'charmap' codec can't encode character '\xae' While Scraping a Webpage

Problem description

I am web scraping with Python using BeautifulSoup, and I am getting this error:

'charmap' codec can't encode character '\xae' in position 69: character maps to <undefined>

It occurs when scraping a webpage.

This is my Python code:

hotel = BeautifulSoup(state.)
print (hotel.select("div.details.cf span.hotel-name a"))
# Tried:  print (hotel.select("div.details.cf span.hotel-name a")).encode('utf-8')

Solution

This problem usually appears when you call .encode() on a byte string that is already encoded. Try decoding it first, as in:

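import urllib  # Python 2; in Python 3 the equivalent is urllib.request.urlopen

html = urllib.urlopen(link).read()            # raw bytes from the server
unicode_str = html.decode(<source encoding>)  # placeholder: the page's actual encoding
encoded_str = unicode_str.encode("utf8")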
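
In Python 3 the same two steps can be sketched on a literal byte string. This is a minimal illustration, not the asker's code: the sample bytes, the variable names, and the choice of windows-1252 and cp437 are all assumptions made for the example. cp437 stands in for a legacy console code page that has no ® character, which is one way a 'charmap' error like the one in the question can arise.

```python
# Python 3 sketch. The sample bytes and both encodings are assumptions.
raw = b"Hotel\xae"  # stand-in for bytes read from a response

# Step 1: decode the bytes with the encoding the page actually uses.
text = raw.decode("windows-1252")

# Step 2: encode the text to UTF-8 for storage or output.
utf8_bytes = text.encode("utf-8")

# Encoding the same text to a code page that lacks the character fails.
try:
    text.encode("cp437")
    charmap_error = None
except UnicodeEncodeError as exc:
    charmap_error = str(exc)

print(utf8_bytes)
print(charmap_error)
```

Under these assumptions the UTF-8 step succeeds, while the cp437 step raises a UnicodeEncodeError from the 'charmap' codec.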

As an example:

html = '\xae'
encoded_str = html.encode("utf8")

Fails with

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)
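
The message mentions decoding even though the code calls .encode(). In Python 2, calling .encode() on a byte string first decodes it implicitly with the ASCII codec, and that hidden step is what fails. The sketch below makes that step explicit, written in Python 3 syntax; the variable names are illustrative only.

```python
# Explicit version of the decode step that Python 2 performs implicitly
# when .encode() is called on a byte string.
raw = b"\xae"

try:
    raw.decode("ascii")
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```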

By contrast, decoding first:

html = '\xae'
decoded_str = html.decode("windows-1252")
encoded_str = decoded_str.encode("utf8")
print encoded_str
®

This succeeds without error. Note that "windows-1252" is only an example: I got it from chardet, which reported just 0.5 confidence that it was right (not surprising for a 1-character string). Replace it with whatever encoding actually applies to the byte string returned by .urlopen().read() for the content you retrieved.
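
When the encoding is not known in advance, one possible approach is to try a short list of candidates in order. The helper below is hypothetical (its name, its default candidate list, and its fallback behavior are assumptions for illustration, not part of any library), and an encoding declared by the server or the page itself is usually a better first choice than a guess.

```python
def decode_with_candidates(raw, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"


# b"Hotel\xae" is not valid UTF-8, so the second candidate is used.
text, used = decode_with_candidates(b"Hotel\xae")
print(used)
```

A guess from a detector such as chardet could be placed at the front of the candidate list, keeping in mind that a low confidence value means the guess may be wrong.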
