Python requests response encoded in utf-8 but cannot be decoded


Question

I am trying to scrape my messenger.com (Facebook Messenger) chats using Python. I used Google Chrome's developer tools to inspect the POST request for the chat history, and I copied the entire header and body into a format that requests can use.

I get HTTP code 200, implying the request at least got something back, and printing res.encoding tells me the response encoding is utf-8. But I cannot decode it!

Here is the function:

def download_thread(self, limit, offset, message_timestamp):
    """Download the specified number of messages from the
    provided thread, with an optional offset
    """
    data = request_data(self.thread, offset=offset,
                        limit=limit, group=self.group,
                        timestamp=message_timestamp)

    res = self.ses.post(url_thread, data=data, headers=headers)

    print(res.content)

    thread_contents = json.loads(res.content)
    print(thread_contents)
    return thread_contents

which yields

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

when it tries to json.load (or json.loads) the data.

But res.encoding does return utf-8.

I tried decompressing it with gzip, but that reports that the content is not gzipped.

If I just do print(res.content), I get:

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
0fx82x048xbbxb9=x87xebK0.xffx90xddxebxfax16xc6xbbzx8bx82)xe8xaaVx01^xdax8bxbdx15d-xb1x10@x17\xd43xa8x92wxe8xc0xcdUxc4xffxc7xfax90xb2xb3xf5x84x11ux0b	x8fx83rxf3}xe5!y$xe6xf6c0xf0xb4x98xcat_x0cx08xb5xddx8ctxx91xa9x95
B%xe2x93xa52x85_xa6x10xc2xc9xa3xee4SDbxa5x18QJx83Xx19)xaa$xf4xb4xb7x0bx84x15&x88x08Lxc9iPxa2xb9xf2xafx96x96Nxd8xcf=x05xc1x18x8dxa0xf2Yx8e
xcfxc8x0fE4xd6)xa1xd4xb7Dxd6{ixc8Px96Rx11HCxacxbcKyT#~}x93xf7@Kxc7r/x82xb0xe4xefXxf9jx08xa6Hpxfcnx06xfdox9axd0wJxb4fJ(x89+x1cxf6x0eOIx90xacx9eDDxfd,xa5xe9x89x1blhx86Zx98x05xdd9xc7xf4x80xfcYx8exadxeex99!x15x13+x9bx07xe8Fdjxfcx11xfcxfe7x06hx02x00@>]Wx92xc9x02xb1c3x82xcdxa4xefN9x90xe6x81yx9cx84erxd4xc3x06x1cx06x14xcfxc7x07hjxbfHxdcxf5~xf7zx18Cexaf^x8cxab xdfVxcexb8x11xf8x06x03'

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

Oddly, the content is printed in the middle of the traceback, which leads me to think there are some invisible characters pushing it down.

I am unable to load the response as JSON because, no matter how I handle the response content, it isn't in a form the json library can interpret.

Moreover, if I just do print(res.text), I get garbage:

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
}sP���c���f�u0���� QZed�C��� M$x�Ҹ�H�����eǘ�]���5���^�*�ӄaM�Y��b���/ڶ�JW/���>H6z���l4����t=i��%Ҳu�x��%�x�
       F    <���{1i�#%;�rɲ=Rχm��1B�Z(+�(S-���#��v�{b��
                                                           �    f/V�i̴��_��83�  �_����*��O��
                                                                                            ������Z��i-�TVeaG54�!v�a?ǯ|gu-g��.���"J$�L`&�tΊ#s)�H����s���q���^׷0��[)���j�ॽ�T���U���J�ЁwW���!eg�#j ��r��$y���3�4��4.��M�@Kb�AX�SDb�QJ�X)�,���a�   "Sp�h�����sOA0Vé|�������:%�rKdKC���@ M��.�^
�       �g���SWQHӳ.��BӄG�,����@E��������
                                        nras��L�/��ch@>]W���c3�ͤ�N9��y��er����hj�H��~�zCe�^�� �Vθ�

Traceback (most recent call last):
  File "FBChatScraper.py", line 200, in <module>
    main()
  File "FBChatScraper.py", line 134, in main
    fbms.run()
  File "FBChatScraper.py", line 43, in run
    thread_contents = self.download_thread(limit, offset, message_timestamp)
  File "FBChatScraper.py", line 74, in download_thread
    thread_contents = json.loads(res.content)
  File "/Users/silman/anaconda/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 0: invalid start byte

Here is an MWE as best I can manage. I'm not sure which data from my POST request is private, so I left some of it out.

Using this data:

url_thread = "https://www.messenger.com/api/graphqlbatch/"


request_data = {
  "batch_name": "MessengerGraphQLThreadFetcher",
  "__user": "<user_id>",
  "__a": "1",
  "__dyn": "<dyn>",
  "__req": "9",
  '__be'      : '-1',
  '__pc'      : 'PHASED:messengerdotcom_pkg',
  "fb_dtsg": "AQFni7TU2nes:AQGSC8FSDqyw",
  "ttstamp": "265817254666710077746711957586581715370521181008510710777",
  "__rev": "3791607",
  "jazoest": "<jazoest>",
  "queries": '<queries>'
  }

headers = {
  "authority": "www.messenger.com",
  "method": "POST",
  "path": "/api/graphqlbatch/",
  "scheme": "https",
  "accept": "*/*",
  "accept-encoding": "gzip, deflate, br",
  "accept-language": "en-US,en;q=0.9",
  "cache-control": "no-cache",
  "content-length": "754",
  "content-type" : "application/x-www-form-urlencoded",
  "cookie": "<cookies>",
  "origin": "https://www.messenger.com",
  "pragma": "no-cache",
  "referer": "https://www.messenger.com/t/<chatID>",
  "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36"
}

You can get all the <...> placeholder values by using Chrome developer tools and looking on the Network tab for a POST request to the Request URL https://www.messenger.com/api/graphqlbatch/.

It's easy to find if you scroll up to reload old messages while Chrome dev tools is recording.

Then put together a simple request with Python:

import json  # needed for json.loads below
import time

import requests as rq

ses = rq.Session()
thread = <ID of thread found in URL of messenger.com>

conversation_type = <'thread_fbids' if group chat else 'user_ids'>

data = request_data
data['messages[{}][{}][offset]'.format(conversation_type, thread)] = 0
data['messages[{}][{}][timestamp]'.format(conversation_type, thread)] = int(time.time())
data['messages[{}][{}][limit]'.format(conversation_type, thread)] = 2000

res = ses.post(url_thread, data=data, headers=headers)

print(res.content)
thread_contents = json.loads(res.content)
print(thread_contents)

As for what my dev tools got back, you can see the start of the JSON here

Recommended answer

The problem is this line in your request headers:

"accept-encoding": "gzip, deflate, br",

That br requests Brotli compression, a fairly new compression standard (see RFC 7932) that Google is pushing to replace gzip on the web. Chrome asks for Brotli because recent versions of Chrome understand it natively. You're asking for Brotli because you copied the headers from Chrome. But requests doesn't understand Brotli natively, so the body you get back is still compressed. res.encoding only reports the character set of the text, not how the bytes were compressed.
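One way to confirm this diagnosis is to look at the response headers rather than at res.encoding. The sketch below is only an illustration: summarize_encodings is a hypothetical helper, not part of requests, and it assumes the server reports compression in the standard Content-Encoding header and the character set in Content-Type.

```python
def summarize_encodings(headers):
    """Return (charset, compression) parsed from a mapping of response headers.

    Hypothetical helper for illustration; header names are matched
    case-insensitively.
    """
    lowered = {name.lower(): value for name, value in headers.items()}

    # The character set lives in Content-Type, e.g. "application/json; charset=utf-8".
    charset = None
    for part in lowered.get("content-type", "").split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            charset = value.strip('"').lower()

    # The compression scheme lives in Content-Encoding; absent means uncompressed.
    compression = lowered.get("content-encoding", "identity").strip().lower()
    return charset, compression


# Example headers of the kind this question presumably received (assumed values).
example = {
    "Content-Type": "application/json; charset=utf-8",
    "Content-Encoding": "br",
}
print(summarize_encodings(example))
```

With a real response you would pass res.headers. If the second value comes back as br, the bytes in res.content are Brotli-compressed, which would explain why decoding them as utf-8 fails.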

You can pip install brotli and register the decompressor, or just call it manually on res.content. But a simpler solution is to just remove the br:

"accept-encoding": "gzip, deflate",

… and then you should get gzip, which you and requests already know how to handle.
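If you would rather keep the headers copied from Chrome untouched and strip br programmatically, something like the following could work. It is a minimal sketch: drop_brotli is a hypothetical helper name, and it assumes the header value is a plain comma-separated list like the one in the question.

```python
def drop_brotli(headers):
    """Return a copy of headers whose accept-encoding no longer lists br.

    Hypothetical helper for illustration; the input dict is not modified.
    """
    cleaned = dict(headers)
    for name, value in headers.items():
        if name.lower() == "accept-encoding":
            tokens = [token.strip() for token in value.split(",")]
            # Compare only the coding name, ignoring any ";q=..." weight.
            kept = [t for t in tokens if t.split(";")[0].strip().lower() != "br"]
            cleaned[name] = ", ".join(kept)
    return cleaned


copied = {"accept": "*/*", "accept-encoding": "gzip, deflate, br"}
print(drop_brotli(copied))
```

You would then send the request with headers=drop_brotli(headers), so the server is only offered compression schemes that requests can undo on its own.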
