Detect Multibyte and Chinese Characters in RTF markup
Problem description
I'm trying to parse an RTF-formatted message (I need to keep the formatting tags, so I can't use the trick where you just paste into a RichTextBox and get the .PlainText out).
Take the RTF code for the string a基bমূcΟιd pasted straight into WordPad:
{\rtf1\ansi\ansicpg1252\deff0\deflang2057{\fonttbl{\f0\fnil\fcharset0 Calibri;}{\f1\fswiss\fcharset128 MS PGothic;}{\f2\fnil\fcharset1 Shonar Bangla;}{\f3\fswiss\fcharset161{\*\fname Arial;}Arial Greek;}}
{\*\generator Msftedit 5.41.21.2510;}\viewkind4\uc1\pard\sa200\sl276\slmult1\lang9\f0\fs22 a\f1\fs24\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9\f0\fs22 d\par
}
It's difficult to make out if you've not had much to do with RTF, so here's the bit I'm looking at:
\'8a\'ee\f0\fs22 b\f2\fs24\u2478?\u2498?\f0\fs22 c\f3\fs24\'cf\'e9
Notice that 基 (U+57FA) is \'8a\'ee, while মূ, which is actually two characters, ম (\u2478?) and ূ (\u2498?), is \u2478?\u2498?, which is fine. But Οι, which is two separate characters, Ο and ι, is \'cf\'e9.
Is there a way to determine whether I'm looking at something that should be one character, such as 基 = \'8a\'ee, or two characters, such as Ο and ι = \'cf\'e9?
I was thinking that maybe \lang was it, but that isn't the case at all, because \lang does not change from when it's first set. I am already accounting for the different code pages implied by the different charset values in the fonts, but that doesn't seem to tell me whether I should treat two escaped values next to each other as a double-byte character or not.
How can I tell if the character I'm looking at should be double-byte (or multi-byte) or single byte?
Recommended answer
\'xx escapes represent bytes and should be interpreted using the encoding given by the font's \fcharset (or potentially \cchs), falling back to \ansicpg if neither is present.
You need to know that encoding intimately to be able to decide whether a single \'xx sequence represents a character on its own or is only part of a multi-byte character. Typically, you would consume each stretch of text as a unit and then convert that byte string into a Unicode string using whatever library or OS interface you have available; this avoids having to write byte-by-byte parsers for every code page supported by RTF.
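As a rough illustration of the "decode each run as a unit" approach, here is a minimal Python sketch. The function name decode_hex_run and the CHARSET_TO_CODEC table are hypothetical names introduced for this example; the table covers only the charsets that appear in the question, and mapping charset 0 straight to cp1252 is an assumption based on the \ansicpg1252 in the sample document.

```python
import re

# Hypothetical, partial mapping from \fcharset values to Python codec names.
# Only the charsets seen in the sample RTF are listed; charset 0 (ANSI) is
# assumed to mean the document's \ansicpg, which is 1252 in the sample.
CHARSET_TO_CODEC = {
    0: "cp1252",
    128: "cp932",   # Japanese, roughly Shift-JIS
    161: "cp1253",  # Greek
}

def decode_hex_run(run, fcharset):
    """Decode a run of consecutive \\'xx escapes as one byte string."""
    hex_pairs = re.findall(r"\\'([0-9a-fA-F]{2})", run)
    raw = bytes(int(h, 16) for h in hex_pairs)
    # The codec decides how many bytes make up each character.
    return raw.decode(CHARSET_TO_CODEC[fcharset])

print(decode_hex_run(r"\'8a\'ee", 128))  # one character under cp932
print(decode_hex_run(r"\'cf\'e9", 161))  # two characters under cp1253
```

The same two-escape pattern yields one character or two depending only on which codec the current font selects, which is why the run has to be collected first and decoded afterwards.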
\uxxxx? escapes represent UTF-16 code units. This is much simpler, but Word[pad] only produces this form of encoding as a last resort, because it's not compatible with earlier RTF versions. (The ? is the fallback character for when the receiver can't cope with Unicode.)
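A small Python sketch of how the \u escapes might be turned back into text. It assumes the fragment contains nothing but \u escapes and their fallback characters, and it simply ignores the fallbacks rather than tracking \uc properly; decode_u_escapes is a hypothetical helper name. RTF writes each code unit as a signed 16-bit decimal, so negative values are masked back into the 0..65535 range before decoding.

```python
import re
import struct

def decode_u_escapes(fragment):
    """Collect the \\uN escapes in a fragment and decode them as UTF-16."""
    units = []
    for match in re.finditer(r"\\u(-?\d+)", fragment):
        # N is a signed 16-bit decimal; mask it to an unsigned code unit.
        units.append(int(match.group(1)) & 0xFFFF)
    # Pack the code units as little-endian 16-bit values and let the
    # UTF-16 decoder join any surrogate pairs.
    packed = struct.pack("<%dH" % len(units), *units)
    return packed.decode("utf-16-le")

text = decode_u_escapes(r"\u2478?\u2498?")
print(text, len(text))
```

Decoding the collected units together, instead of converting each number on its own, lets a surrogate pair written as two \u escapes come out as a single character.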
So:
- The two characters Οι are represented as two byte-escapes because the font associated with that stretch of text is using a Greek single-byte encoding (charset 161 = cp1253).
- The one character 基 is represented as two byte-escapes because the font associated with that stretch of text is using a Japanese multibyte encoding (charset 128 = cp932 ≈ Shift-JIS). In Shift-JIS the leading \'8a byte signals a further byte to come, as do various others in the top-bit-set range (but not all of them).
- The two characters মূ are represented as Unicode code unit escapes, because there's no other option: there isn't any RTF-compatible code page that contains Bengali characters. (Code page 57003 for ISCII came much later.)
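If you do want to see where the character boundaries fall in a cp932 run, the lead-byte rule can be sketched by hand as below. The ranges used (0x81 to 0x9F and 0xE0 to 0xFC) are the commonly documented cp932 lead-byte ranges; the helper names are hypothetical, and in practice letting the codec decode the whole run is simpler and safer.

```python
def is_cp932_lead_byte(byte):
    """True if this byte starts a two-byte character in cp932."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC

def split_cp932(raw):
    """Split a cp932 byte string into a list of decoded characters."""
    chars = []
    i = 0
    while i < len(raw):
        # A lead byte consumes the following byte as well.
        two_bytes = is_cp932_lead_byte(raw[i]) and i + 1 < len(raw)
        width = 2 if two_bytes else 1
        chars.append(raw[i:i + width].decode("cp932"))
        i += width
    return chars

# 0x8A is a lead byte, so 0x8A 0xEE is one character;
# 0xB1 has its top bit set but is a single-byte half-width katakana.
print(split_cp932(b"a\xb1\x8a\xee"))
```

This shows why "top bit set" alone is not enough to detect a double-byte character: 0xB1 stands on its own, while 0x8A always pulls in the next byte.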