Python请求以utf-8编码的响应,但无法解码 [英] Python requests response encoded in utf-8 but cannot be decoded

查看:545
本文介绍了Python请求以utf-8编码的响应,但无法解码的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试使用python抓取我的messenger.com(facebook Messenger)聊天,并且我已使用Google chromes开发者工具查看聊天历史记录的POST请求,并且我已将整个标头和正文复制为一种请求格式可以使用.

I am trying to scrape my messenger.com (facebook messenger) chats using python and i have used google chromes developer tools to see the POST request for the chat history and i have copied the entire header and body into a format that requests can use.

我得到的HTTP代码200暗示该请求至少得到了某物,但是我可以print res.encoding使其返回的编码表示为utf-8.但是我无法解码!

I get HTTP code 200 implying the request at least got something but and i can print res.encoding to get the encoding it returned in which it says is utf-8. But i cannot decode it!

这是函数:

def download_thread(self, limit, offset, message_timestamp):
    """Download the specified number of messages from the
    provided thread, with an optional offset
    """
    data = request_data(self.thread, offset=offset,
                        limit=limit, group=self.group,
                        timestamp=message_timestamp)

    res = self.ses.post(url_thread, data=data, headers=headers)

    print(res.content)

    thread_contents = json.loads(res.content)
    print(thread_contents)
    return thread_contents

收益

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

尝试json.load(或loads)数据时

但是res.encoding确实返回utf-8.

But res.encoding does return utf-8.

我尝试用gzip解压缩,但那不是gzip压缩的内容.

I tried unzipping with gzip but that says it is is not gzipped content.

如果我只是想做print(res.content)我会得到

If i just try to do print(res.content) i get

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
0f\x82\x048\xbb\xb9=\x87\xebK0.\xff\x90\xdd\xeb\xfa\x16\xc6\xbbz\x8b\x82)\xe8\xaaV\x01^\xda\x8b\xbd\x15d-\xb1\x10@\x17\\\xd43\xa8\x92w\xe8\xc0\xcdU\xc4\xff\xc7\xfa\x90\xb2\xb3\xf5\x84\x11u\x0b\t\x8f\x83r\xf3}\xe5!y$\xe6\xf6c0\xf0\xb4\x98\xcat_\x0c\x08\xb5\xdd\x8ctx\x91\xa9\x95\rB%\xe2\x93\xa52\x85_\xa6\x10\xc2\xc9\xa3\xee4SDb\xa5\x18QJ\x83X\x19)\xaa$\xf4\xb4\xb7\x0b\x84\x15&\x88\x08L\xc9iP\xa2\xb9\xf2\xaf\x96\x96N\xd8\xcf=\x05\xc1\x18\x8d\xa0\xf2Y\x8e\n\xcf\xc8\x0fE4\xd6)\xa1\xd4\xb7D\xd6{i\xc8P\x96R\x11HC\xac\xbcKyT#~}\x93\xf7@K\xc7r/\x82\xb0\xe4\xefX\xf9j\x08\xa6Hp\xfcn\x06\xfdo\x9a\xd0wJ\xb4fJ(\x89+\x1c\xf6\x0eOI\x90\xac\x9eDD\xfd,\xa5\xe9\x89\x1blh\x86Z\x98\x05\xdd9\xc7\xf4\x80\xfcY\x8e\xad\xee\x99!\x15\x13+\x9b\x07\xe8Fdj\xfc\x11\xfc\xfe7\x06h\x02\x00@>]W\x92\xc9\x02\xb1c3\x82\xcd\xa4\xefN9\x90\xe6\x81y\x9c\x84er\xd4\xc3\x06\x1c\x06\x14\xcf\xc7\x07hj\xbfH\xdc\xf5~\xf7z\x18Ce\xaf^\x8c\xab \xdfV\xce\xb8\x11\xf8\x06\x03'

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

在回溯的中间奇怪地打印内容,使我认为有一些看不见的字符将其压下.

oddly printing the content in the middle of the traceback leading me to think there are some invisible characters pushing it down.

我无法将响应加载为json格式,因为无论我如何处理响应内容,其格式都不适合json库解释.

I am unable to get the response loaded into a json format because no matter how i handle the response content it isn't properly formatted for json library to interpret.

此外,如果我只是做print(res.text)我会得到垃圾:

Moreover if i just do print(res.text) i get garbage:

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
}sP���c���f�u0���\� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z�\��l4����t=i��%Ҳu�x��%�x�
       F    <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��\v�{b��
                                                           �    f/V�i̴��_��83�  �_����*��O��
                                                                                            ������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a�   "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
�       �g���SWQHӳ.��BӄG�,����@E��������
                                        nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

我会尽力开展MWE,不确定发帖请求中的哪些数据是私人的,所以我忽略了一些

MWE as best i can, not sure what data from my post request is private so i left some out

使用此数据

url_thread = "https://www.messenger.com/api/graphqlbatch/"


request_data = {
  "batch_name": "MessengerGraphQLThreadFetcher",
  "__user": "<user_id>",
  "__a": "1",
  "__dyn": "<dyn>",
  "__req": "9",
  '__be'      : '-1',
  '__pc'      : 'PHASED:messengerdotcom_pkg',
  "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
  "ttstamp": "265817254666710077746711957586581715370521181008510710777",
  "__rev": "3791607",
  "jazoest": "<jazoest>",
  "queries": '<queries>'
  }

headers = {
  "authority": "www.messenger.com",
  "method": "POST",
  "path": "/api/graphqlbatch/",
  "scheme": "https",
  "accept": "*/*",
  "accept-encoding": "gzip, deflate, br",
  "accept-language": "en-US,en;q=0.9",
  "cache-control": "no-cache",
  "content-length": "754",
  "content-type" : "application/x-www-form-urlencoded",
  "cookie": "<cookies>",
  "origin": "https://www.messenger.com",
  "pragma": "no-cache",
  "referer": "https://www.messenger.com/t/<chatID>",
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

您可以使用chrome开发人员工具获取所有<items>,并在网络标签上查找对Request URL: https://www.messenger.com/api/graphqlbatch/的POST请求.

You can get all the <items> by using chrome developer tools and lookng on the network tab for a POST request to Request URL: https://www.messenger.com/api/graphqlbatch/.

在chrome开发工具正在录制时,很容易找到是否向上滚动以重新加载旧消息.

Its easy to find if you scroll up to reload old messages while chrome dev tools is recording.

然后将简单的请求与python放在一起

Then put together a simple request with python

import requests as rq
import time

ses = rq.Session()
thread = <ID of thread found in URL of messenger.com>

conversation_type = <'thread_fbids' if group chat else 'user_ids'>

data = request_data
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

res = ses.post(url_thread, data=data, headers=headers)

print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)

作为我的开发工具返回的内容,您可以看到json的开头此处

As what my dev tools got back you can see the start of the json here

推荐答案

问题是您的请求标头中的这一行:

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

br请求 Brotli压缩,这是一种新的压缩标准(请参见 RFC 7932 ). Chrome之所以要求Brotli,是因为Chrome的最新版本本身就可以理解.您之所以要Brotli,是因为您是从Chrome复制标题的.但是requests并不了解Brotli.

That br requests Brotli compression, a new-ish compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome is asking for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively.

您可以pip install brotli并注册解压缩器,或者只是在res.content上手动调用它.但是更简单的解决方案是删除br:

You can pip install brotli and register the decompresser or just call it manually on res.content. But a simpler solution is to just remove the br:

"accept-encoding": "gzip, deflate",

…,然后您应该得到gzip,您和requests已经知道该如何处理.

… and then you should get gzip, which you and requests already know how to handle.

这篇关于Python请求以utf-8编码的响应,但无法解码的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆