Converting Unicode objects with non-ASCII symbols in them into string objects (in Python)


Question

I want to send Chinese characters to be translated by an online service, and have the resulting English string returned. I'm using simple JSON and urllib for this.

And yes, I am declaring

# -*- coding: utf-8 -*-

at the top of my code.

Now everything works fine if I feed urllib a string type object, even if that object contains what would be Unicode information. My function is called translate.

For example:

stringtest1 = '無與倫比的美麗'

print translate(stringtest1)

results in the proper translation, and doing

type(stringtest1) 

confirms this to be a string object.

But if I do

stringtest1 = u'無與倫比的美麗'

and try to use my translation function I get this error:

  File "C:\Python27\lib\urllib.py", line 1275, in urlencode
    v = quote_plus(str(v))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-8: ordinal not in range(128)

After researching a bit, it seems this is a common problem:

Now, if I type in a script

stringtest1 = '無與倫比的美麗' 
stringtest2 = u'無與倫比的美麗'
print 'stringtest1',stringtest1
print 'stringtest2',stringtest2

Executing it returns:

stringtest1 無與倫æ¯"的美麗
stringtest2 無與倫比的美麗

But just typing the variables in the console:

>>> stringtest1
'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97'
>>> stringtest2
u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'

gets me that.

My problem is that I don't control how the information to be translated reaches my function. It seems to arrive in Unicode form, which the function does not accept.

So, how do I convert one thing into the other?

I've read Stack Overflow question Convert Unicode to a string in Python (containing extra symbols).

But this is not what I'm after. urllib accepts the string object but not the Unicode object, even though both contain the same information.

Well, at least they do in the eyes of the web application I'm sending the unchanged information to. I'm not sure whether they are still equivalent things in Python.

Solution

When you get a unicode object and want to return a UTF-8 encoded byte string from it, use theobject.encode('utf8').
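As a minimal sketch of that call, the snippet below encodes the text from the question and compares the result with the byte string the console displayed for stringtest1. The variable names are illustrative, and the text is written with escapes so the example does not depend on the source file's encoding.

```python
# The text from the question, u'無與倫比的美麗', written with escapes.
text = u'\u7121\u8207\u502b\u6bd4\u7684\u7f8e\u9e97'

# encode() turns the text into a UTF-8 byte string.
encoded = text.encode('utf8')

# The byte string the console displayed for stringtest1 in the question.
expected = (b'\xe7\x84\xa1\xe8\x88\x87\xe5\x80\xab\xe6\xaf\x94'
            b'\xe7\x9a\x84\xe7\xbe\x8e\xe9\xba\x97')

print(encoded == expected)              # True
# decode() reverses the conversion.
print(encoded.decode('utf8') == text)   # True
```

So the two objects in the question do carry the same information. One is text, and the other is that text encoded as UTF-8 bytes.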

It seems strange that you don't know whether the incoming object is a str or unicode -- surely you do control the call sites to that function, too?! But if that is indeed the case, for whatever weird reason, you may need something like:

def ensureutf8(s):
    if isinstance(s, unicode):
        s = s.encode('utf8')
    return s

which only encodes conditionally, that is, if it receives a unicode object, not if the object it receives is already a byte string. It returns a byte string in either case.
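Here is a sketch of how such a helper might be used before calling urlencode, written so that it runs on both Python 2 and Python 3. The names ensure_utf8 and text_type and the parameter name 'q' are assumptions made for illustration. They are not part of the original translate function or of any real service.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# unicode on Python 2, str on Python 3.
text_type = type(u'')

def ensure_utf8(s):
    """Return s as a UTF-8 byte string, encoding only if it is text."""
    if isinstance(s, text_type):
        s = s.encode('utf8')
    return s

# 'q' is a hypothetical parameter name used only for this example.
query = urlencode({'q': ensure_utf8(u'\u7121\u8207')})
print(query)  # q=%E7%84%A1%E8%88%87
```

Because the value is already a byte string when it reaches urlencode, no implicit ASCII encoding takes place and the bytes are percent-escaped as they are.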

BTW, part of your confusion seems to be due to the fact that you don't know that just entering an expression at the interpreter prompt will show you its repr, which is not the same effect you get with print;-).
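A small illustration of that difference, using one character from the question. The interpreter prompt echoes repr(value), which is an ASCII-only rendering with escapes, while the value itself holds the raw bytes. The variable names are illustrative.

```python
# The character u'\u7f8e' encoded as UTF-8: three raw bytes.
data = u'\u7f8e'.encode('utf8')

# What the interactive prompt would echo for this value.
shown = repr(data)

print(shown)        # ASCII text containing \xe7\xbe\x8e escapes
print(len(data))    # 3
```

The escapes appear only in the displayed representation. The object itself holds the same three bytes either way.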
