Python 'ascii' codec can't encode character with requests.get
Question
I have a Python program that crawls data from a site and returns JSON. The crawled site has the meta tag charset=ISO-8859-1. Here is the source code:
import requests

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.text
After that I get the information with Beautiful Soup and then create a JSON. The problem is that some symbols, e.g. the € symbol, are displayed as \u0080 or \x80 (in Python), so I can't use or decode them in PHP. So I tried plain_text.decode('ISO-8859-1') and plain_text.decode('cp1252') so that I could encode them afterwards as UTF-8, but every time I get the error: 'ascii' codec can't encode character u'\xf6' in position 8496: ordinal not in range(128).
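The underlying issue is that the page declares ISO-8859-1 but actually serves Windows-1252 (cp1252), a common mismatch: cp1252 assigns printable characters such as € to the 0x80–0x9F range that ISO-8859-1 reserves for invisible control codes. A minimal sketch with a made-up byte string:

```python
# The byte 0x80 means different things in different encodings.
raw = b"Price: 100 \x80"  # bytes as a server might send them

# In cp1252 (Windows-1252), 0x80 is the euro sign:
euro_text = raw.decode("cp1252")
assert euro_text == "Price: 100 €"

# In ISO-8859-1 (latin-1), 0x80 is the invisible control character
# U+0080 -- which is exactly why the JSON output shows \u0080:
control_text = raw.decode("iso-8859-1")
assert control_text == "Price: 100 \u0080"
```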
EDIT

The new code after @ChrisKoston's suggestion, using .content instead of .text:
import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content
the_sourcecode = plain_text.decode('cp1252').encode('UTF-8')
soup = BeautifulSoup(the_sourcecode, 'html.parser')
Encoding and decoding now work, but the character problem remains.
EDIT2
The solution is to use .content.decode('cp1252'):
url = 'https://www.example.com'
source_code = requests.get(url)
plain_text = source_code.content.decode('cp1252')
soup = BeautifulSoup(plain_text, 'html.parser')
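Once the text is decoded correctly, the \u0080 escapes also disappear from the generated JSON. A small sketch with a hypothetical scraped value, using the standard json module; ensure_ascii=False keeps the euro sign literal instead of escaping it, which is convenient for consumers such as PHP:

```python
import json

# Hypothetical scraped value, already decoded with cp1252:
price = b"100 \x80".decode("cp1252")  # "100 €"

# ensure_ascii=False emits the euro sign itself, not \u20ac:
payload = json.dumps({"price": price}, ensure_ascii=False)
assert payload == '{"price": "100 €"}'
```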
Special thanks to Tomalak for providing the solution.
Answer
You must actually store the result of decode() somewhere, because it does not modify the original variable.
Another thing:

- decode() turns a list of bytes into a string.
- encode() does the opposite: it turns a string into a list of bytes.
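The two directions can be sketched as one round trip (Python 3 shown, where str is Unicode text and bytes is the raw byte sequence):

```python
text = "Zürich € 100"             # str: Unicode text
data = text.encode("utf-8")       # encode: str -> bytes
back = data.decode("utf-8")       # decode: bytes -> str

assert isinstance(data, bytes)
assert isinstance(back, str)
assert back == text               # lossless round trip
```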
BeautifulSoup is happy with strings; you don't need to use encode() at all.
import requests
from bs4 import BeautifulSoup
url = 'https://www.example.com'
response = requests.get(url)
html = response.content.decode('cp1252')
soup = BeautifulSoup(html, 'html.parser')
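An alternative to decoding by hand is to tell requests the real encoding and let .text do the decoding. The sketch below avoids a network call by constructing a Response object manually through its private _content attribute; that attribute is an implementation detail of requests, used here purely for offline illustration:

```python
import requests

# In real code:
#   response = requests.get(url)
#   response.encoding = "cp1252"   # override the wrong/missing charset
#   html = response.text
#
# Offline stand-in (demo only, relies on a private attribute):
response = requests.models.Response()
response.status_code = 200
response._content = b"100 \x80"    # bytes the server would send
response.encoding = "cp1252"       # tell requests the real encoding
html = response.text               # decoded using the encoding we set
assert html == "100 €"
```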
Hint: for working with HTML you might want to look at pyquery instead of BeautifulSoup.