Getting a raw string from a unicode string in Python


Question

I have a Unicode string that I'm retrieving from a web service in Python.

I need to access a URL I've parsed from this string, and the URL includes various diacritics.

However, if I pass the unicode string to urllib2, it raises a Unicode encoding error. The exact same string, written as a "raw" string literal r"some string", works properly.

How can I get the raw binary representation of a unicode string in Python, without converting it to the system locale?

I've been through the Python docs, and everything seems to come back to the codecs module. However, the documentation for the codecs module is sparse at best, and the whole thing seems to be extremely file oriented.

I'm on Windows, in case that matters.

Recommended answer

You need to encode the URL from unicode to a bytestring. In Python 2, u'' and r'' produce two different kinds of objects: a unicode string and a bytestring.
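To illustrate the distinction, here is a minimal sketch of turning a text string into bytes with an explicitly named codec, so the result does not depend on the system locale. The sample string is made up for illustration and is not from the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text; \u00e9 is "e" with an acute accent.
text = u"caf\u00e9"

# Naming the codec explicitly means the system locale is never consulted.
raw = text.encode("utf-8")

# The two objects have different types: text versus bytes.
print(type(text).__name__, type(raw).__name__)
print(repr(raw))

# Decoding with the same codec restores the original text.
print(raw.decode("utf-8") == text)
```

The encoded form is what a byte-oriented API can safely consume; the text form is what you normally keep inside your program.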

You can encode a unicode string to a bytestring with the .encode() method, but you need to know which encoding to use. For URLs, UTF-8 is usually a good choice, but you also need to escape the bytes to fit the URL scheme:

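import urlparse, urllib

# url is the unicode URL string parsed from the web service response
parts = list(urlparse.urlsplit(url))
# parts[2] is the path: encode it to UTF-8 bytes, then percent-escape them
parts[2] = urllib.quote(parts[2].encode('utf8'))
url = urlparse.urlunsplit(parts)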
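The same steps can be wrapped in a small helper. This is only a sketch: the function name quote_url_path and the example.com URL are hypothetical, and the import fallback is an addition so the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:  # Python 3
    from urllib.parse import urlsplit, urlunsplit, quote
except ImportError:  # Python 2, as used in the answer
    from urlparse import urlsplit, urlunsplit
    from urllib import quote


def quote_url_path(url):
    """Return url with its path UTF-8 encoded and percent-escaped.

    Hypothetical helper: only the path component is touched.
    """
    parts = list(urlsplit(url))
    # Index 2 of the split result is the path.
    parts[2] = quote(parts[2].encode("utf8"))
    return urlunsplit(parts)


# Made-up URL with non-ASCII characters in the path.
print(quote_url_path(u"http://example.com/caf\u00e9/men\u00fc"))
```

Note that quote() also escapes "%" by default, so a path that is already percent-escaped would be escaped a second time; this sketch assumes the incoming path is unescaped.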

The above example is based on an educated guess that the problem you are facing is caused by non-ASCII characters in the path part of the URL, but without further details from you it has to remain a guess.

For domain names, you need to apply the IDNA (RFC 3490) encoding:

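# Assumes urlparse is already imported, as in the previous example
parts = list(urlparse.urlsplit(url))
# parts[1] is the network location (host name)
parts[1] = parts[1].encode('idna')
# Encode any remaining unicode parts so that all parts are bytestrings
parts = [p.encode('utf8') if isinstance(p, unicode) else p for p in parts]
url = urlparse.urlunsplit(parts)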
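As a rough illustration of what the IDNA step does to a host name, the sketch below uses a hypothetical helper idna_encode_host and a made-up host. Unlike the snippet above, it decodes the encoded host back to text so that all URL parts stay the same type, which is an assumption made here so the code also runs on Python 3.

```python
try:  # Python 3
    from urllib.parse import urlsplit, urlunsplit
except ImportError:  # Python 2, as used in the answer
    from urlparse import urlsplit, urlunsplit


def idna_encode_host(url):
    """Return url with its network location converted to IDNA form.

    Hypothetical helper: the path, query and fragment are left unchanged.
    """
    parts = list(urlsplit(url))
    # encode('idna') yields ASCII bytes; decode so every part is text again.
    parts[1] = parts[1].encode("idna").decode("ascii")
    return urlunsplit(parts)


# Made-up host name containing a non-ASCII letter.
print(idna_encode_host(u"http://b\u00fccher.example/index.html"))
```

Host labels that are already ASCII pass through unchanged, so the helper can be applied to every URL. Non-ASCII characters in the path would still need the percent-escaping shown earlier.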

See the Python Unicode HOWTO for more information. I also strongly recommend reading the Joel on Software article on Unicode as a good primer on the subject of encodings.
