Python 'ascii' codec can't encode character with requests.get


Problem description

I have a Python program that crawls data from a site and returns JSON. The crawled site declares charset=ISO-8859-1 in its meta tag. Here is the source code:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text

After that I extract the information with Beautiful Soup and build the JSON. The problem is that some symbols, for example the € symbol, show up as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode the result as UTF-8 afterwards, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).

EDIT

The new code, after @ChrisKoston's suggestion to use .content instead of .text:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')

Encoding and decoding now work, but the character problem remains.

EDIT2

The solution is to use .content.decode('cp1252') and pass the resulting string straight to BeautifulSoup:

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')

Special thanks to Tomalak for the solution.

Recommended answer

You must actually store the result of decode() somewhere, because it returns a new value and does not modify the original variable.

Another thing:

  • decode() turns a sequence of bytes into a string.
  • encode() does the opposite: it turns a string into a sequence of bytes.
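
To make the two points above concrete, here is a minimal Python 3 sketch. The sample bytes are hypothetical (they stand in for what response.content might hold on a cp1252 page) and are not taken from the asker's site. It also hints at why the original attempt failed: .text is already a decoded string, so calling decode() on it makes no sense. The u'...' in the error message suggests Python 2, where doing so first triggers an implicit ASCII encode.

```python
# Hypothetical sample: "Preis: 5 €, schön" as cp1252-encoded bytes.
raw = b"Preis: 5 \x80, sch\xf6n"

# decode(): bytes -> str. The result must be stored; raw is not modified.
text = raw.decode("cp1252")
print(type(raw).__name__, type(text).__name__)  # bytes str

# encode(): str -> bytes. Only needed when bytes are actually required,
# e.g. when writing to a binary file or socket.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```

BeautifulSoup accepts the decoded string directly, so the encode() step is shown only for completeness.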

BeautifulSoup is happy with strings; you don't need to use encode() at all.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')
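
As a rough illustration of why cp1252 works here even though the page declares ISO-8859-1: the two encodings disagree on byte 0x80. ISO-8859-1 maps it to the control character U+0080, which matches the \x80 and \u0080 seen in the question, while cp1252 maps it to the euro sign. The sample bytes and the "price" key below are made up for the example.

```python
import json

# Hypothetical sample: the byte 0x80 as it might appear in the page body.
raw = b"5 \x80"

as_latin1 = raw.decode("iso-8859-1")  # 0x80 -> U+0080 (a control character)
as_cp1252 = raw.decode("cp1252")      # 0x80 -> U+20AC (the euro sign)

print(repr(as_latin1))  # '5 \x80'
print(repr(as_cp1252))  # '5 €'

# json.dumps escapes non-ASCII characters by default, so the consumer
# (PHP's json_decode, for instance) receives a valid euro sign escape.
print(json.dumps({"price": as_cp1252}))  # {"price": "5 \u20ac"}
```
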

Hint: For working with HTML, you might want to look at pyquery instead of BeautifulSoup.
