Unicode (UTF-8) reading and writing to files in Python


Question

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

Solution

In the notation

u'Capit\xe1n\n'

the "\xe1" represents just one character. "\x" tells you that "e1" is in hexadecimal. When you write

Capit\xc3\xa1n

into your file you have "\xc3" in it. Those are 4 bytes and in your code you read them all. You can see this when you display them:

>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

You can see that the backslash is escaped by a backslash. So you have four bytes in your string: "\", "x", "c" and "3".
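
To make the difference concrete, here is a small sketch. It is written in Python 3 syntax, which is an assumption on my part (the question uses Python 2.4, where the same comparison applies to plain str values). The variable names are only illustrative.

```python
# What the editor saved when the escape sequence was typed literally:
# a backslash, 'x', 'c', '3', then a backslash, 'x', 'a', '1'.
typed = b'Capit\\xc3\\xa1n'

# What encoding the real character to UTF-8 produces: two bytes, 0xC3 0xA1.
encoded = 'Capit\xe1n'.encode('utf-8')

print(len(typed), typed)      # 14 bytes
print(len(encoded), encoded)  # 8 bytes
print(typed[5:9])             # the four literal bytes that spell out \xc3
```

The typed version is plain ASCII text that merely looks like an escape sequence, which is why decoding it as UTF-8 changes nothing.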

Edit:

As others pointed out in their answers you should just enter the characters in the editor and your editor should then handle the conversion to UTF-8 and save it.

If you actually have a string in this format you can use the string_escape codec to decode it into a normal string:

In [15]: print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a string that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \xc3\xa1 in the original string. If you want to have a unicode string you have to decode again with UTF-8.
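
The string_escape codec only exists in Python 2. As a rough sketch of the same two-step idea in Python 3 (an assumption, since the question targets Python 2.4), one common recipe goes through unicode_escape and latin-1. It relies on the input being pure ASCII apart from the escapes, and the variable names are hypothetical.

```python
# Bytes as read from a file where the escapes were typed literally.
raw = b'Capit\\xc3\\xa1n\n'

# Step 1: interpret the backslash escapes. 'unicode_escape' yields a str
# whose code points equal the intended byte values (0xC3 and 0xA1).
unescaped = raw.decode('unicode_escape')

# Step 2: map those code points back to single bytes, which gives the
# real UTF-8 byte sequence, then decode that as UTF-8.
utf8_bytes = unescaped.encode('latin-1')
text = utf8_bytes.decode('utf-8')

print(repr(utf8_bytes))
print(text)
```

Here utf8_bytes corresponds to the UTF-8 encoded string mentioned above, and text is the result of decoding it once more.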

To your edit: you don't have UTF-8 in your file. To actually see what it would look like:

s = u'Capit\xe1n\n'
sutf8 = s.encode('UTF-8')
open('utf-8.out', 'w').write(sutf8)

Compare the content of the file utf-8.out to the content of the file you saved with your editor.
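
For comparison, here is a sketch of the same round trip where open() does the encoding and decoding itself. This assumes Python 3, where open() accepts an encoding argument (in Python 2, codecs.open offers a comparable interface). The temporary directory is just a hypothetical place to put the file.

```python
import os
import tempfile

text = 'Capit\xe1n\n'
path = os.path.join(tempfile.mkdtemp(), 'utf-8.out')  # hypothetical location

# Let open() encode on write; newline='' avoids platform newline translation.
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write(text)

# Read the raw bytes to see what actually landed on disk.
with open(path, 'rb') as f:
    on_disk = f.read()

# Read again as text, letting open() decode the UTF-8 bytes.
with open(path, encoding='utf-8', newline='') as f:
    round_trip = f.read()

print(repr(on_disk))
print(round_trip == text)
```

The raw bytes contain the two-byte sequence 0xC3 0xA1 for the accented character, which is what a UTF-8 aware editor would also write when you type the character itself.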
