Python: Convert Raw String to Bytes String without adding escape characters


Question

I have a string:

'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

I want:

b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'

But I keep getting:

b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'

Context

I scraped a string from a webpage and stored it in the variable un. Now I want to decompress it using BZip2:

bz2.decompress(un)

However, since un is a str object, I get this error:

TypeError: a bytes-like object is required, not 'str'

Therefore, I need to convert un to a bytes-like object without changing each single backslash into an escaped backslash.

Edit 1: Thank you for all the help! @wim I understand what you mean now, but I am at a loss as to how I can retrieve a bytes-like object from my web scraping method:

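import re

import requests
from lxml import html

r = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')

# Parse the page and take lines 1-2 of the first HTML comment
doc = html.fromstring(r.content)
comment = doc.xpath('//comment()')[0].text.split('\n')[1:3]

# Matches entries of the form  xx: '...'
pattern = re.compile("[a-z]{2}: '(.+)'")

# un ends up as a str here, which is the problem described above
un = re.search(pattern, comment[0]).group(1)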

The packages that I am using are requests, lxml.html, re, and bz2.

Once again, my goal is to decompress un using bz2, but I am having difficulty getting a bytes-like object from my web scraping process.

Any pointers?

Recommended answer

Your bug occurs earlier. The only acceptable solution is to change the scraping code so that it returns a bytes object rather than a text object. Do not try to "convert" your string un into bytes; it cannot be done reliably.

Do not do this:

>>> import bz2
>>> un = 'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>> bz2.decompress(un.encode('raw_unicode_escape'))
b'huge'

The "raw_unicode_escape" codec is essentially Latin-1 with a built-in fallback for characters outside that range: it writes other code points as \uXXXX or \UXXXXXXXX. Existing backslashes are not escaped in any way. It is used in the Python pickle protocol. Any character in your string that cannot be represented as a single \xXX byte will therefore corrupt your data.
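To make the failure mode concrete, here is a small sketch of how the codec treats three kinds of input. The sample strings are illustrative only and are not taken from the question.

```python
# Code points up to U+00FF become the single byte with the same value.
latin = "\xaf\x82".encode("raw_unicode_escape")

# Code points above U+00FF are written out as a literal \uXXXX escape,
# so one character turns into six ASCII bytes.
euro = "\u20ac".encode("raw_unicode_escape")

# An existing backslash passes through unchanged (it is not doubled).
backslash = "\\".encode("raw_unicode_escape")

print(latin)      # b'\xaf\x82'
print(euro)       # b'\\u20ac'
print(backslash)  # b'\\'
```

If any character of the scraped text falls into the second case, the bytes handed to bz2 no longer match the original compressed stream.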

The web scraping code has no business returning bz2-encoded bytes as a str, so that is where you need to address the cause of the problem, rather than attempting to deal with the symptoms.
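One way to keep the scraping step in bytes is sketched below. It is only a sketch under stated assumptions, not the answerer's code: the helper name extract_escaped_bytes is hypothetical, and it assumes the page source spells the payload out as ASCII text with literal escapes such as \xaf (which the doubled backslashes in the question's output suggest) and that the payload contains no unescaped single quote. The sample page stands in for r.content so the example runs without network access.

```python
import ast
import bz2
import re


def extract_escaped_bytes(page: bytes, key: bytes) -> bytes:
    """Hypothetical helper: find  key: '...'  in raw page bytes, return real bytes."""
    match = re.search(re.escape(key) + rb": '(.+)'", page)
    if match is None:
        raise ValueError(f"no entry found for {key!r}")
    # Assumption: the captured text is ASCII with escapes written out literally.
    # Wrapping it in b'...' lets ast.literal_eval interpret those escapes the
    # same way Python would for a bytes literal.
    literal = "b'" + match.group(1).decode("ascii") + "'"
    return ast.literal_eval(literal)


# Stand-in for r.content. The raw literal keeps each backslash as a literal
# character, as it would appear in the page source.
escaped = rb"BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"
sample_page = b"<!--\nun: '" + escaped + b"'\n-->"

un = extract_escaped_bytes(sample_page, b"un")
print(bz2.decompress(un))  # b'huge', the same result as the session above
```

With requests, the equivalent call would pass r.content (bytes) to the helper, so the data never goes through a str on its way to bz2.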
