When to use unicode(string) and string.encode('utf-8') in Python

Question

I had some odd characters coming through in spreadsheet cell data and tried to resolve it with encode('utf-8'), as was suggested. That didn't resolve the problem, but when I used unicode(string) it worked. My question: is there a standard way to deal with all types of text data?

Recommended answer

To put it very simply, a "string" (a "unicode string" in Python 2, just a "string" in Python 3) is a sequence of "characters". But a "character" is an abstraction: there is no way to store a character in a file system or send it over a network (it sounds weird, but there really isn't). File systems, networks, consoles and other devices only understand "bytes". It is therefore your job as a programmer to translate characters to bytes, and back, correctly whenever you talk to a device or an external program.

Characters-to-bytes translation is called "encode()" in Python. When you send a string to a device, you "encode()" your characters to bytes:

some_chunk_of_bytes = some_string.encode(how_exactly)

There are many ways (called "character encodings") to represent a character as a combination of bytes, so you have to tell the encoder exactly how you want it done.
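As a minimal sketch of that point, the snippet below encodes one string with three different encodings and gets three different byte sequences. It uses Python 3 names (`str` for text, `bytes` for bytes); in Python 2 the corresponding types are `unicode` and `str`. The sample text and variable names are illustrative, not from the original answer.

```python
# Python 3: str is a sequence of characters, bytes is a sequence of bytes.
text = "caf\u00e9"  # 'café', four characters

utf8_bytes = text.encode("utf-8")       # the accented letter becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")   # the accented letter becomes one byte: 0xE9
utf16_bytes = text.encode("utf-16-le")  # each of these characters takes two bytes

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'
print(len(text), len(utf8_bytes), len(latin1_bytes), len(utf16_bytes))  # 4 5 4 8
```

The same four characters occupy 5, 4, or 8 bytes depending on the encoding, which is why the receiver has to know which one was used.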

When you read data from somewhere, you only get raw bytes and have to "decode()" them into meaningful characters:

some_string = some_chunk_of_bytes.decode(how_exactly)

Again, you have to specify how you think these bytes are encoded (there is no way to tell for sure).
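A short sketch of what "how you think these bytes are encoded" means in practice, again in Python 3 syntax. The byte string and variable names are made up for illustration. One wrong guess fails loudly; another fails silently and produces garbage.

```python
raw = b"caf\xc3\xa9"  # bytes as they might arrive from a file or a socket

# Correct guess: the bytes really are UTF-8.
text = raw.decode("utf-8")  # 'café'

# Wrong guess that fails loudly: 0xC3 is not a valid ASCII byte.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Wrong guess that fails silently: Latin-1 accepts any byte value,
# so decoding succeeds but the result is 'cafÃ©' instead of 'café'.
garbled = raw.decode("latin-1")

print(ascii_failed)               # True
print(len(text), len(garbled))    # 4 5
```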

There are a number of shortcuts in Python that hide this encode/decode step from you. For example,

 string = unicode(bytes)

does this behind the scenes:

 string = bytes.decode(default_encoding)

and when you do something like

print string

it is actually:

sys.stdout.write(string.encode(default_encoding))
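The shortcuts above are Python 2 constructs. As a rough Python 3 analogue (an illustration, not part of the original answer), text-mode file objects do the same hidden work: `io.TextIOWrapper`, the type that text-mode `open()` returns, encodes on write and decodes on read. The in-memory `io.BytesIO` stream below stands in for a real file or socket.

```python
import io

# A byte stream standing in for a file or socket; it only ever holds bytes.
device = io.BytesIO()

# The wrapper encodes text on write, using the encoding you name
# (or a platform-dependent default if you do not name one).
writer = io.TextIOWrapper(device, encoding="utf-8")
writer.write("caf\u00e9")
writer.flush()

stored = device.getvalue()  # the encode step happened implicitly
print(stored)               # b'caf\xc3\xa9'

# Reading back through a wrapper with the same encoding decodes implicitly.
reader = io.TextIOWrapper(io.BytesIO(stored), encoding="utf-8")
round_tripped = reader.read()  # 'café'
```

Naming the encoding explicitly, rather than relying on the default, keeps the hidden step predictable across machines.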

But even if you don't use encode/decode explicitly, you have to realize that it still must take place at some point. If you get garbled characters in your program, it is always because you:

  • forgot the "encode" step, or
  • forgot the "decode" step, or
  • specified an incorrect encoding.
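The three causes above can be sketched as follows. This is a hedged Python 3 illustration with made-up names; Python 2 behaved differently in the first two cases, since it attempted implicit ASCII conversions instead.

```python
import io

original = "caf\u00e9"           # 'café'
wire = original.encode("utf-8")  # b'caf\xc3\xa9'

# Forgot the decode step: str() on bytes returns the repr, not the text.
not_decoded = str(wire)          # "b'caf\\xc3\\xa9'"

# Forgot the encode step: a byte stream rejects text outright in Python 3.
try:
    io.BytesIO().write(original)
    encode_error = None
except TypeError as exc:
    encode_error = exc

# Incorrect encoding: decoding UTF-8 bytes as Latin-1 yields 'cafÃ©'.
garbled = wire.decode("latin-1")

# Latin-1 maps every byte to exactly one character, so this particular
# mistake can be reversed by undoing the wrong step and redoing it correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == original)  # True
```

The repair trick only works when the wrong decoding was lossless; the reliable fix is to use the right encoding at the point where the bytes enter the program.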

As noted, this description is very basic. If you want to understand all the details, please read
