Detect Multibyte and Chinese Characters in RTF markup

Question

I'm trying to parse an RTF-formatted message. (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out.)

Take the RTF code for the string a基bমূcΟιd pasted straight into Wordpad:

{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}

It's difficult to make out if you've not had much to do with RTF, so here's the bit I'm looking at:

\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9

Notice that 基 (U+57FA) is \'8a\'ee, while মূ, which is actually two characters (\u2478? and \u2498?), is \u2478?\u2498?, which is fine. But Οι, which is two separate characters Ο and ι, is \'cf\'e9.

Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'bb\'f9, or two characters, such as Ο and ι = \'cf\'e9?

I was thinking that maybe \lang was the answer, but that isn't the case at all, because \lang does not change after it is first set. I am already accounting for the different code pages implied by the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two Unicode references next to each other as a double-byte character or not.

How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?

Solution

\'xx escapes represent bytes and should be interpreted using the encoding given by the current font's \fcharset (or potentially \cchs), falling back to \ansicpg if neither is present.
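As a rough sketch of that lookup order, the snippet below maps a handful of \fcharset values to Python codec names. The table `FCHARSET_TO_CODEC` and the helper `codec_for` are hypothetical names introduced here for illustration, the table is deliberately incomplete, and it assumes \cchs uses the same numbering as \fcharset.

```python
# Hypothetical lookup table: a few RTF \fcharset values and the Python
# codec assumed to match each one. It is not exhaustive.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI (Western)
    128: "cp932",   # Japanese, Shift-JIS
    129: "cp949",   # Korean
    134: "cp936",   # Simplified Chinese
    136: "cp950",   # Traditional Chinese
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Cyrillic
    238: "cp1250",  # Central European
}


def codec_for(fcharset=None, cchs=None, ansicpg=1252):
    # Try the font's \fcharset first, then \cchs (assumed to share the same
    # numbering), and fall back to the document-wide \ansicpg code page.
    for value in (fcharset, cchs):
        if value in FCHARSET_TO_CODEC:
            return FCHARSET_TO_CODEC[value]
    return f"cp{ansicpg}"


print(codec_for(161))  # font f3 in the sample document
print(codec_for(128))  # font f1 in the sample document
print(codec_for(1))    # font f2: charset 1 is "default", so \ansicpg applies
```

Charset 1 carries no encoding information of its own, which is why it is left out of the table and ends up on the \ansicpg fallback.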

You need to know that encoding intimately to decide whether a single \'xx sequence represents a character on its own or is only part of a multi-byte character. Typically you would consume each stretch of text as a unit and then convert that byte string into a Unicode string using whatever library or OS interface you have available. That avoids having to write byte-by-byte parsers for every code page supported by RTF.
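A minimal sketch of the "consume the run as a unit" approach, using Python's built-in codecs as the conversion library. The helper name `decode_hex_run` is made up for this example, and the codec is assumed to have been worked out from the font table already.

```python
import re

# Matches one \'xx escape and captures its two hex digits.
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_hex_run(run, codec):
    # Collect every \'xx escape in the run into a single byte string, then
    # let the codec decide where the character boundaries fall.
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(run))
    return raw.decode(codec, errors="replace")


greek = decode_hex_run(r"\'cf\'e9", "cp1253")
japanese = decode_hex_run(r"\'8a\'ee", "cp932")
print(greek, len(greek))        # two bytes, two characters
print(japanese, len(japanese))  # two bytes, one character
```

The same two-escape shape yields two characters under cp1253 and one under cp932, so the character count only becomes known after decoding with the right code page.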

\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (? is the fallback character for when the receiver can't cope with the Unicode.)
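The sketch below decodes a run of \u escapes under a few stated assumptions: the number is a decimal UTF-16 code unit that may be written as a negative 16-bit value, each escape is followed by `uc` fallback characters (1 in the sample document, per \uc1), and each fallback is a single literal character rather than a \'xx escape. `decode_u_run` is a hypothetical helper, not part of any RTF library.

```python
import re

# Matches \uN where N is a (possibly negative) decimal number, plus the
# optional single space that may delimit a control word.
U_ESCAPE = re.compile(r"\\u(-?\d+) ?")


def decode_u_run(run, uc=1):
    units = []
    pos = 0
    while True:
        m = U_ESCAPE.match(run, pos)
        if m is None:
            break
        # Negative values are 16-bit signed; mask them back to 0..65535.
        units.append(int(m.group(1)) & 0xFFFF)
        # Skip the fallback characters that follow the escape.
        pos = m.end() + uc
    # Reassemble the code units so surrogate pairs combine correctly.
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le", errors="replace")


bengali = decode_u_run(r"\u2478?\u2498?")
print(bengali, [hex(ord(c)) for c in bengali])
```

Decoding through UTF-16 rather than calling `chr` on each number means that a character outside the Basic Multilingual Plane, written as two consecutive escapes, comes out as one code point.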

So:

  • The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).

  • The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).

  • The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
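To illustrate the Shift-JIS point in the list above, here is a small sketch of what a hand-written byte-by-byte check would look like for cp932 alone. It assumes the usual cp932 lead-byte ranges (0x81 to 0x9F and 0xE0 to 0xFC); the function names are invented for the example, and in practice handing the whole run to a codec is less error-prone.

```python
def is_cp932_lead_byte(b):
    # Bytes in these ranges start a two-byte character in cp932. Other
    # top-bit-set bytes, such as 0xA1-0xDF (half-width katakana), stand alone.
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def split_cp932(raw):
    # Split a cp932 byte string into one bytes object per character.
    chars = []
    i = 0
    while i < len(raw):
        width = 2 if is_cp932_lead_byte(raw[i]) and i + 1 < len(raw) else 1
        chars.append(raw[i:i + width])
        i += width
    return chars


print(split_cp932(bytes([0x8A, 0xEE])))  # one two-byte character
print(split_cp932(bytes([0xB1, 0x41])))  # two single-byte characters
```

A single-byte code page such as cp1253 needs no check like this at all, which is why the Greek bytes \'cf\'e9 are always two characters.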
