Outputting Unicode text to an RTF file in Python

Question

I am trying to output unicode text to an RTF file from a python script. For background, Wikipedia says

For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.
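
As a rough illustration of what "16-bit signed decimal integer giving the UTF-16 code unit number" means in practice, here is a minimal sketch. It assumes Python 3 (the code in this thread is Python 2), and the helper name `signed_utf16_units` is made up for this example. It encodes one character as little-endian UTF-16 and reinterprets each 2-byte unit as a signed short, so values above 32767 come out negative and characters outside the Basic Multilingual Plane yield two units (a surrogate pair).

```python
import struct


def signed_utf16_units(ch):
    """Return the UTF-16 code units of one character as signed 16-bit ints.

    Hypothetical helper for illustration only (Python 3).
    """
    raw = ch.encode('utf-16-le')      # 2 bytes per code unit, no BOM
    count = len(raw) // 2             # 1 unit for BMP chars, 2 for others
    # 'h' is a signed 16-bit integer; '<' selects little-endian byte order.
    return list(struct.unpack('<%dh' % count, raw))


print(signed_utf16_units('\u0628'))      # Arabic letter ba'
print(signed_utf16_units('\ufffd'))      # code unit above 32767
print(signed_utf16_units('\U0001F600'))  # outside the BMP: two units
```

Each returned number is what would follow `\u` in the RTF output.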

There is also this question about outputting RTF from Java, and this one about doing so in C#.

However, what I can't figure out is how to output the unicode code point as a "16-bit signed decimal integer with the Unicode UTF-16 code unit number" from Python. I've tried this:

for char in unicode_string:
    print '\\' + 'u' + ord(char) + '?',

but the output only renders as gibberish when opened in a word processor; the problem appears to be that it's not the UTF-16 code number. But not sure how to get that; though one can encode in utf-16, how does one get the code number?

Incidentally PyRTF does not support unicode (it's listed as a "todo"), and while pyrtf-NG is supposed to do so, that project does not appear to be maintained and has little documentation, so I am wary of using it in a quasi-production system.

My mistake. There are two bugs in the above code: as Wobble points out below, the string has to be a unicode string, not an already encoded one, and the above code produces a result with spaces between characters. The correct code is this:

convertstring=""
for char in unicode(<my_encoded_string>,'utf-8'):
    convertstring = convertstring + '\\' + 'u' + str(ord(char)) + '?'

This works fine, at least with OpenOffice. I am leaving this here as a reference for others (one mistake further corrected after discussion below).

Recommended answer

Based on the information in your latest edit, I think this function will work properly, but see the improved version below.

def rtf_encode(unistr):
    return ''.join([c if ord(c) < 128 else u'\\u' + unicode(ord(c)) + u'?' for c in unistr])

>>> test_unicode = u'\xa92012'
>>> print test_unicode
©2012
>>> test_utf8 = test_unicode.encode('utf-8')
>>> print test_utf8
©2012
>>> print rtf_encode(test_utf8.decode('utf-8'))
\u169?2012

Here's another version that's broken down a little to be easier to understand. I also made it consistent in returning an ASCII string rather than keeping Unicode and flubbing it at the join. It also incorporates a fix based on the comments.

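def rtf_encode_char(unichar):
    code = ord(unichar)
    if code < 128:
        # Plain ASCII passes through unchanged.
        return str(unichar)
    # RTF expects a signed 16-bit value, so code units above 32767 wrap negative.
    return '\\u' + str(code if code <= 32767 else code-65536) + '?'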

def rtf_encode(unistr):
    return ''.join(rtf_encode_char(c) for c in unistr)
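
For readers on Python 3, the following is a hedged sketch of the same idea, not a drop-in replacement for the answer's code. The names `rtf_escape` and `minimal_rtf` are invented for this example. It adds two things the functions above do not cover: escaping of the RTF special characters `\`, `{` and `}`, and splitting characters outside the Basic Multilingual Plane into two UTF-16 code units. The `\uc1` control word in the header states that one fallback character (the `?`) follows each `\u` escape. Line breaks are not handled here; RTF needs `\par` for those.

```python
def rtf_escape(text):
    """Escape a Python 3 str for use in the body of an RTF document.

    Illustrative sketch; the function name is an assumption of this example.
    """
    out = []
    for ch in text:
        if ch in '\\{}':
            # RTF control characters must be backslash-escaped.
            out.append('\\' + ch)
        elif ord(ch) < 128:
            out.append(ch)
        else:
            # One or two UTF-16 code units, each written as a signed
            # 16-bit decimal followed by the '?' fallback character.
            raw = ch.encode('utf-16-le')
            for i in range(0, len(raw), 2):
                unit = int.from_bytes(raw[i:i + 2], 'little', signed=True)
                out.append('\\u%d?' % unit)
    return ''.join(out)


def minimal_rtf(text):
    """Wrap escaped text in a bare-bones RTF document (hypothetical helper)."""
    return '{\\rtf1\\ansi\\uc1 ' + rtf_escape(text) + '}'


print(rtf_escape('\xa92012'))  # \u169?2012
```

Because the result is pure ASCII, it can be written to a file with `open(path, 'w', encoding='ascii')` without further encoding concerns.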
