Euro sign issue when reading an RTF file with Python

Problem description

I need to generate a document in RTF using Python and pyRTF. Everything works: I have no problem with accented letters, and it even accepts the euro sign without errors, but instead of €, I get this sign: ¤. I encode the strings this way:

x.encode("iso-8859-15")

I googled a lot but was not able to solve this issue. What do I have to do to get the euro sign?

Solution

The RTF standard uses UTF-16, but shaped to fit the RTF command sequence format: each 16-bit code unit is written as \uN?, where N is a signed 16-bit decimal integer and ? is a replacement character. This is documented at http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding. Unfortunately, pyRTF doesn't do any encoding for you; handling this has been on the project's TODO list, but evidently they never got to it before abandoning the library.
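
To see where the ¤ may come from and what the escape should look like, here is a small Python 3 sketch. It assumes the program displaying the file interprets the byte as Latin-1 or Windows-1252, which the question does not confirm; the variable names are illustrative only.

```python
euro = "\u20ac"

# ISO-8859-15 stores the euro sign as the single byte 0xA4.
raw = euro.encode("iso-8859-15")

# Assumption: the RTF reader treats that byte as Latin-1 or Windows-1252,
# where 0xA4 is the generic currency sign (U+00A4) rather than the euro.
as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# The RTF Unicode escape writes the code point as a signed 16-bit decimal
# number, followed by a one-character fallback for older readers.
codepoint = ord(euro)
signed = codepoint if codepoint < 32768 else codepoint - 65536
rtf_escape = "\\u%d?" % signed

print(raw, as_latin1, rtf_escape)
```

The escape form is plain ASCII, so it does not depend on which code page the reader assumes.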

This is based on code I used in a project recently. I've now released this as rtfunicode on PyPI, with support for Python 2 and 3. The Python 2 version:

import codecs
import re

_charescape = re.compile(u'([\x00-\x1f\\\\{}\x80-\uffff])')
def _replace(match):
    codepoint = ord(match.group(1))
    # Convert codepoint into a signed integer, insert into escape sequence
    return '\\u%s?' % (codepoint if codepoint < 32768 else codepoint - 65536)    


def rtfunicode_encode(text, errors):
    # Encode to RTF \uDDDDD? signed 16 integers and replacement char
    return _charescape.sub(_replace, text).encode('ascii')


class Codec(codecs.Codec):
    def encode(self, input, errors='strict'):
        return rtfunicode_encode(input, errors), len(input)


class IncrementalEncoder(codecs.IncrementalEncoder):
    def encode(self, input, final=False):
        return rtfunicode_encode(input, self.errors)


class StreamWriter(Codec, codecs.StreamWriter):
    pass


def rtfunicode(name):
    if name == 'rtfunicode':
        return codecs.CodecInfo(
            name='rtfunicode',
            encode=Codec().encode,
            decode=Codec().decode,
            incrementalencoder=IncrementalEncoder,
            streamwriter=StreamWriter,
        )

codecs.register(rtfunicode)
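
The listing above targets Python 2. A rough Python 3 equivalent might look like the following. This is a sketch written for illustration, not the code shipped in the rtfunicode package; the function names `_search` and `rtfunicode_decode` are made up here, and only characters up to U+FFFF are handled.

```python
import codecs
import re

# Control characters, backslash, braces, and everything from U+0080 to U+FFFF.
_charescape = re.compile(r'([\x00-\x1f\\{}\x80-\uffff])')


def _replace(match):
    codepoint = ord(match.group(1))
    # Convert the code point into a signed 16-bit integer.
    return '\\u%d?' % (codepoint if codepoint < 32768 else codepoint - 65536)


def rtfunicode_encode(text, errors='strict'):
    # Codec encoders return (encoded bytes, number of characters consumed).
    return _charescape.sub(_replace, text).encode('ascii'), len(text)


def rtfunicode_decode(data, errors='strict'):
    # Decoding is outside the scope of this sketch.
    raise NotImplementedError('rtfunicode decoding is not implemented')


def _search(name):
    # Hypothetical search function; returns None for any other codec name.
    if name == 'rtfunicode':
        return codecs.CodecInfo(
            name='rtfunicode',
            encode=rtfunicode_encode,
            decode=rtfunicode_decode,
        )
    return None


codecs.register(_search)

print('\u20ac'.encode('rtfunicode'))  # b'\\u8364?'
```

In Python 3 the encoder has to return bytes, which is why the result is passed through `.encode('ascii')` after the substitution.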

Instead of encoding to "iso-8859-15", you can then encode to "rtfunicode":

>>> u'\u20AC'.encode('rtfunicode') # EURO currency symbol
'\\u8364?'

Encode any text you insert into your RTF document this way.
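
As an illustration of where the escaped text ends up, the Python 3 sketch below assembles a very small RTF document by hand. It does not use pyRTF; the header string, the font name and the helper names `rtf_escape` and `build_rtf` are assumptions chosen for the example, and the escaping is limited to characters up to U+FFFF.

```python
import re

_charescape = re.compile(r'([\x00-\x1f\\{}\x80-\uffff])')


def rtf_escape(text):
    """Escape text as RTF \\uN? sequences (hypothetical helper)."""
    def repl(match):
        codepoint = ord(match.group(1))
        return '\\u%d?' % (codepoint if codepoint < 32768 else codepoint - 65536)
    return _charescape.sub(repl, text)


def build_rtf(paragraphs):
    """Wrap escaped paragraphs in a minimal, hand-written RTF header."""
    body = '\\par\n'.join(rtf_escape(p) for p in paragraphs)
    return '{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Arial;}}\\f0 ' + body + '\\par}'


document = build_rtf(['Price: 10 \u20ac', 'Caf\u00e9 {menu}'])
print(document)
```

Because every non-ASCII character and every brace or backslash in the text is escaped, `document.encode('ascii')` can be written straight to a `.rtf` file.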

Note that the code above only supports UCS-2 Unicode (\uxxxx, 2 bytes), not UCS-4 (\Uxxxxxxxx, 4 bytes); rtfunicode 1.1 supports the latter by simply encoding each UTF-16 surrogate pair as two \uDDDDD? signed integers.
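
The surrogate-pair approach can be sketched in Python 3 as follows. This is one possible way to do it, not necessarily how rtfunicode 1.1 is implemented; the function name `rtf_escape_utf16` is hypothetical.

```python
import struct


def rtf_escape_utf16(text):
    """Escape text as \\uN? sequences, one per UTF-16 code unit."""
    out = []
    for ch in text:
        # Printable ASCII passes through, except RTF's special characters.
        if '\x20' <= ch <= '\x7f' and ch not in '\\{}':
            out.append(ch)
            continue
        # Characters above U+FFFF become two code units (a surrogate pair).
        data = ch.encode('utf-16-le')
        # 'h' unpacks each 2-byte unit directly as a signed 16-bit integer.
        units = struct.unpack('<%dh' % (len(data) // 2), data)
        out.extend('\\u%d?' % unit for unit in units)
    return ''.join(out)


print(rtf_escape_utf16('\u20ac'))      # \u8364?
print(rtf_escape_utf16('\U0001F600'))  # \u-10179?\u-8704?
```

U+1F600 is stored in UTF-16 as the surrogates 0xD83D and 0xDE00, which read as signed 16-bit values are -10179 and -8704.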
