Python 'ascii' codec can't encode character with requests.get
Question
I have a Python program that crawls data from a site and returns JSON. The crawled site has the meta tag charset=ISO-8859-1. Here is the source code:
import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text
After that I get the information with Beautiful Soup and then create a JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode them afterwards as UTF-8, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
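The underlying issue is that the page declares ISO-8859-1 but actually serves Windows-1252 (cp1252), a common mismatch: cp1252 assigns printable characters such as € to the 0x80–0x9F range that ISO-8859-1 reserves for invisible control codes. A minimal sketch with a made-up byte string:

```python
# The byte 0x80 means different things in different encodings.
raw = b"Price: 100 \x80"  # bytes as a server might send them

# In cp1252 (Windows-1252), 0x80 is the euro sign:
euro_text = raw.decode("cp1252")
assert euro_text == "Price: 100 €"

# In ISO-8859-1 (latin-1), 0x80 is the invisible control character
# U+0080 -- which is exactly why the JSON output shows \u0080:
control_text = raw.decode("iso-8859-1")
assert control_text == "Price: 100 \u0080"
```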
EDIT

The new code after @ChrisKoston's suggestion, using .content instead of .text:
import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')
Encoding and decoding now work, but the character problem remains.
EDIT2
The solution is to use .content.decode('cp1252'):
url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')
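Once the text is decoded correctly, the \u0080 escapes also disappear from the generated JSON. A small sketch with a hypothetical scraped value, using the standard json module; ensure_ascii=False keeps the euro sign literal instead of escaping it, which is convenient for consumers such as PHP:

```python
import json

# Hypothetical scraped value, already decoded with cp1252:
price = b"100 \x80".decode("cp1252")  # "100 €"

# ensure_ascii=False emits the euro sign itself, not \u20ac:
payload = json.dumps({"price": price}, ensure_ascii=False)
assert payload == '{"price": "100 €"}'
```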
Special thanks to Tomalak for providing the solution.
Answer
You must actually store the result of decode() somewhere, because it does not modify the original variable.
Another thing:

- decode() turns a list of bytes into a string.
- encode() does the opposite: it turns a string into a list of bytes.
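The two directions can be sketched as one round trip (Python 3 shown, where str is Unicode text and bytes is the raw byte sequence):

```python
text = "Zürich € 100"             # str: Unicode text
data = text.encode("utf-8")       # encode: str -> bytes
back = data.decode("utf-8")       # decode: bytes -> str

assert isinstance(data, bytes)
assert isinstance(back, str)
assert back == text               # lossless round trip
```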
BeautifulSoup is happy with strings; you don't need to use encode() at all.
import requests
from bs4 import BeautifulSoup
url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')
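An alternative to decoding by hand is to tell requests the real encoding and let .text do the decoding. The sketch below avoids a network call by constructing a Response object manually through its private _content attribute; that attribute is an implementation detail of requests, used here purely for offline illustration:

```python
import requests

# In real code:
#   response = requests.get(url)
#   response.encoding = "cp1252"   # override the wrong/missing charset
#   html = response.text
#
# Offline stand-in (demo only, relies on a private attribute):
response = requests.models.Response()
response.status_code = 200
response._content = b"100 \x80"    # bytes the server would send
response.encoding = "cp1252"       # tell requests the real encoding
html = response.text               # decoded using the encoding we set
assert html == "100 €"
```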
Hint: for working with HTML you might want to look at pyquery instead of BeautifulSoup.