Python请求以utf-8编码的响应,但无法解码 [英] Python requests response encoded in utf-8 but cannot be decoded
问题描述
我正在尝试使用python抓取我的messenger.com(facebook Messenger)聊天,并且我已使用Google chromes开发者工具查看聊天历史记录的POST请求,并且我已将整个标头和正文复制为一种请求格式可以使用.
I am trying to scrape my messenger.com (facebook messenger) chats using python and i have used google chromes developer tools to see the POST request for the chat history and i have copied the entire header and body into a format that requests can use.
我得到的HTTP代码200暗示该请求至少得到了某物,但是我可以print res.encoding
使其返回的编码表示为utf-8.但是我无法解码!
I get HTTP code 200 implying the request at least got something but and i can print res.encoding
to get the encoding it returned in which it says is utf-8. But i cannot decode it!
这是函数:
def download_thread(self, limit, offset, message_timestamp):
"""Download the specified number of messages from the
provided thread, with an optional offset
"""
data = request_data(self.thread, offset=offset,
limit=limit, group=self.group,
timestamp=message_timestamp)
res = self.ses.post(url_thread, data=data, headers=headers)
print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)
return thread_contents
收益
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
尝试json.load
(或loads
)数据时
但是res.encoding
确实返回utf-8.
But res.encoding
does return utf-8.
我尝试用gzip解压缩,但那不是gzip压缩的内容.
I tried unzipping with gzip but that says it is is not gzipped content.
如果我只是想做print(res.content)
我会得到
If i just try to do print(res.content)
i get
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b\t\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95\rB%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e\n\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03'
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
File "FBChatScraper.py", line 43, in run
thread_contents = self.download_thread(limit, offset, message_timestamp)
File "FBChatScraper.py", line 74, in download_thread
thread_contents = json.loads(res.content)
File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
在回溯的中间奇怪地打印内容,使我认为有一些看不见的字符将其压下.
oddly printing the content in the middle of the traceback leading me to think there are some invisible characters pushing it down.
我无法将响应加载为json格式,因为无论我如何处理响应内容,其格式都不适合json库解释.
I am unable to get the response loaded into a json format because no matter how i handle the response content it isn't properly formatted for json library to interpret.
此外,如果我只是做print(res.text)
我会得到垃圾:
Moreover if i just do print(res.text)
i get garbage:
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
}sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x�
F <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��\v�{b��
� f/V�i̴��_��83� �_����*��O��
������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a� "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
� �g���SWQHӳ.��BӄG�,����@E��������
nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�
Traceback (most recent call last):
File "FBChatScraper.py", line 200, in <module>
main()
File "FBChatScraper.py", line 134, in main
fbms.run()
File "FBChatScraper.py", line 43, in run
thread_contents = self.download_thread(limit, offset, message_timestamp)
File "FBChatScraper.py", line 74, in download_thread
thread_contents = json.loads(res.content)
File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte
我会尽力开展MWE,不确定发帖请求中的哪些数据是私人的,所以我忽略了一些
MWE as best i can, not sure what data from my post request is private so i left some out
使用此数据
url_thread = "https://www.messenger.com/api/graphqlbatch/"
request_data = {
"batch_name": "MessengerGraphQLThreadFetcher",
"__user": "<user_id>",
"__a": "1",
"__dyn": "<dyn>",
"__req": "9",
'__be' : '-1',
'__pc' : 'PHASED:messengerdotcom_pkg',
"fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
"ttstamp": "265817254666710077746711957586581715370521181008510710777",
"__rev": "3791607",
"jazoest": "<jazoest>",
"queries": '<queries>'
}
headers = {
"authority": "www.messenger.com",
"method": "POST",
"path": "/api/graphqlbatch/",
"scheme": "https",
"accept": "*/*",
"accept-encoding": "gzip, deflate, br",
"accept-language": "en-US,en;q=0.9",
"cache-control": "no-cache",
"content-length": "754",
"content-type" : "application/x-www-form-urlencoded",
"cookie": "<cookies>",
"origin": "https://www.messenger.com",
"pragma": "no-cache",
"referer": "https://www.messenger.com/t/<chatID>",
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}
您可以使用chrome开发人员工具获取所有<items>
,并在网络标签上查找对Request URL: https://www.messenger.com/api/graphqlbatch/
的POST请求.
You can get all the <items>
by using chrome developer tools and lookng on the network tab for a POST request to Request URL: https://www.messenger.com/api/graphqlbatch/
.
在chrome开发工具正在录制时,很容易找到是否向上滚动以重新加载旧消息.
Its easy to find if you scroll up to reload old messages while chrome dev tools is recording.
然后将简单的请求与python放在一起
Then put together a simple request with python
import requests as rq
import time
ses = rq.Session()
thread = <ID of thread found in URL of messenger.com>
conversation_type = <'thread_fbids' if group chat else 'user_ids'>
data = request_data
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000
res = ses.post(url_thread, data=data, headers=headers)
print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)
作为我的开发工具返回的内容,您可以看到json的开头此处
As what my dev tools got back you can see the start of the json here
推荐答案
问题是您的请求标头中的这一行:
The problem is this line in your request headers:
"accept-encoding": "gzip, deflate, br",
br
请求 Brotli压缩,这是一种新的压缩标准(请参见 RFC 7932 ). Chrome之所以要求Brotli,是因为Chrome的最新版本本身就可以理解.您之所以要Brotli,是因为您是从Chrome复制标题的.但是requests
并不了解Brotli.
That br
requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests
doesn't understand Brotli natively.
您可以pip install brotli
并注册解压缩器,或者只是在res.content
上手动调用它.但是更简单的解决方案是删除br
:
You can pip install brotli
and register the decompresser or just call it manually on res.content
. But a simpler solution is to just remove the br
:
"accept-encoding": "gzip, deflate",
…,然后您应该得到gzip
,您和requests
已经知道该如何处理.
… and then you should get gzip
, which you and requests
already know how to handle.
这篇关于Python请求以utf-8编码的响应,但无法解码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!