Python requests response encoded in utf-8 but cannot be decoded
Question
I am trying to scrape my messenger.com (Facebook Messenger) chats using Python. I used Google Chrome's developer tools to inspect the POST request for the chat history, and I copied the entire header and body into a format that requests can use.
I get HTTP code 200, implying the request at least got something, and I can print res.encoding to see the encoding it returned, which it says is utf-8. But I cannot decode it!
Here is the function:
def download_thread(self, limit, offset, message_timestamp):
    """Download the specified number of messages from the
    provided thread, with an optional offset
    """
    data = request_data(self.thread, offset=offset,
                        limit=limit, group=self.group,
                        timestamp=message_timestamp)
    res = self.ses.post(url_thread, data=data, headers=headers)
    print(res.content)
    thread_contents = json.loads(res.content)
    print(thread_contents)
    return thread_contents
This yields
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
when it tries to json.load (or loads) the data.
But res.encoding does return utf-8.
I tried unzipping with gzip, but that says it is not gzipped content.
If I just try to do print(res.content) I get
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
0fx82x048xbbxb9=x87xebK0.xffx90xddxebxfax16xc6xbbzx8bx82)xe8xaaVx01^xdax8bxbdx15d-xb1x10@x17\xd43xa8x92wxe8xc0xcdUxc4xffxc7xfax90xb2xb3xf5x84x11ux0b x8fx83rxf3}xe5!y$xe6xf6c0xf0xb4x98xcat_x0cx08xb5xddx8ctxx91xa9x95
B%xe2x93xa52x85_xa6x10xc2xc9xa3xee4SDbxa5x18QJx83Xx19)xaa$xf4xb4xb7x0bx84x15&x88x08Lxc9iPxa2xb9xf2xafx96x96Nxd8xcf=x05xc1x18x8dxa0xf2Yx8e
xcfxc8x0fE4xd6)xa1xd4xb7Dxd6{ixc8Px96Rx11HCxacxbcKyT#~}x93xf7@Kxc7r/x82xb0xe4xefXxf9jx08xa6Hpxfcnx06xfdox9axd0wJxb4fJ(x89+x1cxf6x0eOIx90xacx9eDDxfd,xa5xe9x89x1blhx86Zx98x05xdd9xc7xf4x80xfcYx8exadxeex99!x15x13+x9bx07xe8Fdjxfcx11xfcxfe7x06hx02x00@>]Wx92xc9x02xb1c3x82xcdxa4xefN9x90xe6x81yx9cx84erxd4xc3x06x1cx06x14xcfxc7x07hjxbfHxdcxf5~xf7zx18Cexaf^x8cxab xdfVxcexb8x11xf8x06x03'
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
File "FBChatScraper.py", line 43, in run
thread_contents = self.download_thread(limit, offset, message_timestamp)
File "FBChatScraper.py", line 74, in download_thread
thread_contents = json.loads(res.content)
File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
Oddly, the content is printed in the middle of the traceback, which leads me to think there are some invisible characters pushing it down.
I am unable to load the response as JSON because, no matter how I handle the response content, it isn't properly formatted for the json library to interpret.
Moreover, if I just do print(res.text) I get garbage:
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
}sP���c���f�u0���� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z���l4����t=i��%Ҳu�x��%�x�
F <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��v�{b��
� f/V�i̴��_��83� �_����*��O��
������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a� "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
� �g���SWQHӳ.��BӄG�,����@E��������
nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
File "FBChatScraper.py", line 43, in run
thread_contents = self.download_thread(limit, offset, message_timestamp)
File "FBChatScraper.py", line 74, in download_thread
thread_contents = json.loads(res.content)
File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
Here is an MWE as best I can manage; I am not sure what data from my POST request is private, so I left some out.
Using this data:
url_thread = "https://www.messenger.com/api/graphqlbatch/"
request_data = {
    "batch_name": "MessengerGraphQLThreadFetcher",
    "__user": "<user_id>",
    "__a": "1",
    "__dyn": "<dyn>",
    "__req": "9",
    '__be' : '-1',
    '__pc' : 'PHASED:messengerdotcom_pkg',
    "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
    "ttstamp": "265817254666710077746711957586581715370521181008510710777",
    "__rev": "3791607",
    "jazoest": "<jazoest>",
    "queries": '<queries>'
}
headers = {
    "authority": "www.messenger.com",
    "method": "POST",
    "path": "/api/graphqlbatch/",
    "scheme": "https",
    "accept": "*/*",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "content-length": "754",
    "content-type" : "application/x-www-form-urlencoded",
    "cookie": "<cookies>",
    "origin": "https://www.messenger.com",
    "pragma": "no-cache",
    "referer": "https://www.messenger.com/t/<chatID>",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
You can get all the <items> by using Chrome developer tools and looking on the Network tab for a POST request to Request URL: https://www.messenger.com/api/graphqlbatch/.
It's easy to find if you scroll up to reload old messages while Chrome dev tools is recording.
Then put together a simple request with Python:
import json
import time
import requests as rq

ses = rq.Session()
# Fill in the two placeholders below before running.
thread = <ID of thread found in URL of messenger.com>
conversation_type = <'thread_fbids' if group chat else 'user_ids'>
data = request_data
# Paging parameters for the thread being fetched.
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000
res = ses.post(url_thread, data=data, headers=headers)
print(res.content)
# This is the call that raises UnicodeDecodeError.
thread_contents = json.loads(res.content)
print(thread_contents)
As for what my dev tools got back, you can see the start of the JSON here.
Answer
The problem is this line in your request headers:
"accept-encoding": "gzip, deflate, br",
That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.
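One way to confirm this diagnosis is to look at the Content-Encoding response header. The sketch below is only illustrative: the helper name undecoded_codings and the NATIVE_CODINGS set are made up for this example, and it assumes no Brotli package is installed (newer urllib3 versions can decode br when one is available).

```python
# Codings assumed to be decoded automatically by requests/urllib3
# when no Brotli package is installed (an assumption of this sketch).
NATIVE_CODINGS = {"gzip", "deflate", "identity", ""}


def undecoded_codings(content_encoding):
    """Return the codings in a Content-Encoding header value that
    would be left undecoded, so the body stays compressed."""
    codings = [c.strip().lower() for c in content_encoding.split(",")]
    return [c for c in codings if c not in NATIVE_CODINGS]
```

With a real response you would call it as undecoded_codings(res.headers.get("content-encoding", "")). A non-empty result such as ['br'] means res.content is still compressed, which would explain the "invalid start byte" error.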
You can pip install brotli and register the decompressor, or just call it manually on res.content. But a simpler solution is to just remove the br:
"accept-encoding": "gzip, deflate",
... and then you should get gzip, which you and requests already know how to handle.
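Both options can be sketched in a few lines. The helper names without_brotli and decode_brotli are hypothetical, chosen for this example. The first strips br from a copied header dict. The second is the manual route and assumes the third-party brotli package is installed.

```python
def without_brotli(headers):
    """Return a copy of headers whose accept-encoding no longer offers br."""
    cleaned = dict(headers)
    for name, value in headers.items():
        if name.lower() == "accept-encoding":
            kept = [
                coding.strip()
                for coding in value.split(",")
                # Compare only the coding name, ignoring any ";q=..." weight.
                if coding.split(";")[0].strip().lower() != "br"
            ]
            cleaned[name] = ", ".join(kept)
    return cleaned


def decode_brotli(raw_body):
    """Manually decompress a Brotli-compressed body such as res.content."""
    import brotli  # third-party; assumed installed via `pip install brotli`
    return brotli.decompress(raw_body)
```

In the question's code, the first option would amount to passing headers=without_brotli(headers) to ses.post, after which json.loads(res.content) should receive plain JSON bytes. The second would be json.loads(decode_brotli(res.content)) with the headers left unchanged.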