Interpret "plain text" as utf-8 text in python

Views: 223  Published: 2017/8/16 22:14:37  Tags: python, string, text, encoding, utf-8

Question

I have a text file with text that should have been interpreted as utf-8 but wasn't (it was given to me this way). Here is an example of a typical line of the file:

\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f

when it should be:

ロンドン在住

Now, I can do it manually on python by typing the following in the command line:

>>> h1 = u'\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'    
>>> print h1
ロンドン在住

This gives me what I want. Is there a way to do this automatically? I've tried things like this:

>>> import codecs
>>> f = codecs.open('testfile.txt', encoding='utf-8')
>>> h = f.next()
>>> print h
\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f

I've also tried with the 'encode' and 'decode' functions, any ideas?

Thanks!

Recommended answer

\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f is not UTF-8; it is text in the Python Unicode escape format. Use the unicode_escape codec instead:

>>> print '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape')
ロンドン在住
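
The session above is Python 2. As a rough Python 3 sketch of the same idea applied to a whole file: read each line as text, convert it to bytes, and decode those bytes with unicode_escape. The helper name decode_escapes is made up for illustration, and the sketch assumes the file contains only ASCII characters plus \uXXXX escapes. An in-memory io.StringIO stands in for the real testfile.txt.

```python
import io

def decode_escapes(line):
    """Turn literal \\uXXXX sequences in an ASCII-only string into real characters."""
    # str has no .decode() in Python 3, so go through bytes first.
    return line.encode('ascii').decode('unicode_escape')

# Stand-in for open('testfile.txt', encoding='ascii').
# The content is literal backslash-u text, exactly as it appears in the file.
fake_file = io.StringIO('\\u30ed\\u30f3\\u30c9\\u30f3\\u5728\\u4f4f\n')

# Strip the trailing newline, then decode each line.
decoded_lines = [decode_escapes(line.rstrip('\n')) for line in fake_file]
print(decoded_lines[0])
```

If the file can contain real non-ASCII characters as well, the encode('ascii') call will raise UnicodeEncodeError, which is arguably a useful early warning.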

Here is the UTF-8 encoding of the above phrase, for comparison:

>>> '\u30ed\u30f3\u30c9\u30f3\u5728\u4f4f'.decode('unicode_escape').encode('utf-8')
'\xe3\x83\xad\xe3\x83\xb3\xe3\x83\x89\xe3\x83\xb3\xe5\x9c\xa8\xe4\xbd\x8f'

Note that unicode_escape treats every byte that is not part of a recognised Python escape sequence as Latin-1.
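
To illustrate the Latin-1 caveat, here is a small Python 3 sketch with an invented input that mixes a real non-ASCII character with literal escapes. The regex-based helper replace_u_escapes is a hypothetical workaround, not part of the original answer; it only handles four-digit \uXXXX sequences and ignores other escapes such as \n, \\ or surrogate pairs.

```python
import re

# Invented sample: a real 'é' sits next to literal \u escapes.
raw = 'café \\u30ed\\u30f3'

# Naive approach: the two UTF-8 bytes of 'é' are read back as two Latin-1 characters.
mangled = raw.encode('utf-8').decode('unicode_escape')

# Workaround sketch: substitute only the \uXXXX sequences and leave the rest alone.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def replace_u_escapes(text):
    """Replace each literal \\uXXXX with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

fixed = replace_u_escapes(raw)
print(repr(mangled))  # the 'é' comes out as two wrong characters
print(fixed)          # the 'é' survives and the escapes are decoded
```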

Be careful, however: you may really be looking at JSON-encoded data, which uses the same notation for character escapes. Use json.loads() to decode actual JSON data; JSON strings with such escapes are delimited with " quotes and are usually part of larger structures (such as JSON lists or objects).
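
A minimal Python 3 sketch of the JSON case follows. The object and its key name "location" are invented for the example; real data would come from whatever produced the file.

```python
import json

# A JSON document: the string is wrapped in double quotes and sits inside an object.
# The key name "location" is hypothetical.
json_text = '{"location": "\\u30ed\\u30f3\\u30c9\\u30f3\\u5728\\u4f4f"}'

data = json.loads(json_text)
print(data['location'])

# A bare line can also be parsed if it is wrapped in quotes first. This only
# works when the line has no unescaped double quotes or control characters.
bare_line = '\\u30ed\\u30f3\\u30c9\\u30f3'
print(json.loads('"' + bare_line + '"'))
```

Unlike unicode_escape, json.loads() operates on text, so real non-ASCII characters elsewhere in the input are left untouched.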
