How to best deal with Windows' 16-bit wchar_t ugliness?
Question
I'm writing a wrapper layer to be used with MinGW which provides the application with a virtual UTF-8 environment. Functions which deal with filenames are wrappers which convert from UTF-8 and call the corresponding "_w" functions, and so on. The big problem I've run into is that Windows' wchar_t is 16-bit.
For filesystem operations, it's not a big deal. I can just convert back and forth between UTF-8 and UTF-16, and everything will work. But the standard C multibyte/wide character conversion API does not allow multi-wchar_t characters.
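To make the "multi-wchar_t character" problem concrete, here is a minimal sketch of the UTF-16 surrogate arithmetic. The function names utf16_encode and utf16_combine are hypothetical and not part of any wrapper or library; uint16_t stands in for Windows' 16-bit wchar_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 (hypothetical helper).
   Returns the number of 16-bit units written to out (1 or 2),
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                               /* out of range or a surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                  /* BMP: a single unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate: bottom 10 bits */
    return 2;
}

/* Combine a surrogate pair back into a code point (hypothetical helper). */
uint32_t utf16_combine(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

Any code point above U+FFFF needs two units here, which is exactly the case a one-character-per-call conversion function cannot report with a single 16-bit wchar_t.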
Possible solutions:
- Provide a CESU-8 environment instead of UTF-8. I really don't like this one.
- Take the easy way out and only support the BMP. Treat UTF-8 sequences of length 4 as invalid.
- Extend the wrapper to replace mingw's wchar_t with typedef int32_t wchar_t; and deal with WCHAR and wchar_t being different. This is a pain, but it may be ideal for porting apps that expect a clean POSIX-type environment and don't use wchar_t for any Windows-API purposes.
- The following hack:
  - mbrtowc outputs a wchar_t corresponding to the high surrogate after reading the first 3 bytes of a 4-byte UTF-8 character, and keeps the remaining state in the mbstate_t object. Upon receiving the next byte, it combines it with the saved state to output the low surrogate. If the last byte ends up being invalid, it returns -1 (with EILSEQ) and a lone surrogate ends up in the output stream (bad...).
  - wcrtomb outputs the first 2 bytes of UTF-8 when it processes the high surrogate, and saves the remaining state in its mbstate_t object. When it subsequently processes the low surrogate, it combines that with the saved state to output the last 2 bytes of UTF-8. If a valid low surrogate is not received, it returns -1 (with EILSEQ) and an incomplete UTF-8 sequence ends up in the output stream (bad...).
The plus side of this hack is that it works as long as input is valid, and allows access to any UTF-8 character and thus any possible filename/argument/etc. text the application might need to work with.
The cons are that it's not strictly conformant to ISO C (a wchar_t string is not allowed to be stateful) and that it delays detection of malformed characters until incorrect partial output has already been written.
I'm looking for feedback on the different options, and especially my proposed hack: whether it's reasonable, whether the cons are likely to cause severe errors, and whether there are any other cons I haven't yet considered which might keep the scheme from working entirely. I'd also be happy to hear any other possible solutions I haven't thought of.
Recommended answer
I'd do something like #4, but don't generate any output until you're sure the input is valid.
- mbrtowc should decode the entire character. If it's outside the BMP, then output the high surrogate and store the low surrogate in the mbstate_t.
- wcrtomb should store high surrogates in the mbstate_t, then output all 4 UTF-8 bytes if the character is valid.
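The two points above might be sketched roughly as follows. This is an illustration under stated assumptions, not the wrapper's actual code: the names wc16, u8state, u8_decode, u8_mbrtowc16 and u8_wc16rtomb are hypothetical stand-ins for wchar_t, mbstate_t, mbrtowc and wcrtomb. The stored low surrogate is delivered with a (size_t)-3 return, a convention borrowed from C11's mbrtoc16. As a simplification, a truncated sequence is reported with (size_t)-2 without buffering the partial bytes, so the caller must present them again.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t wc16;                     /* stand-in for 16-bit wchar_t */
typedef struct { wc16 pending; } u8state;  /* stand-in for mbstate_t; 0 = initial */

/* Decode one whole UTF-8 sequence from s[0..n) into *cp.
   Returns its length, 0 if truncated, -1 if malformed. */
static int u8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    static const uint32_t min[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    int len, i;
    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; *cp = s[0] & 0x07; }
    else return -1;
    for (i = 1; i < len; i++) {
        if ((size_t)i >= n) return 0;             /* need more bytes */
        if ((s[i] & 0xC0) != 0x80) return -1;     /* bad continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min[len] || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return -1;                                /* overlong, too big, surrogate */
    return len;
}

/* mbrtowc-like: nothing is written to *pwc until the whole character is valid. */
size_t u8_mbrtowc16(wc16 *pwc, const char *s, size_t n, u8state *ps)
{
    uint32_t cp;
    int len;
    assert(pwc != NULL && s != NULL && ps != NULL);
    if (ps->pending) {                  /* deliver stored low surrogate, consume nothing */
        *pwc = ps->pending;
        ps->pending = 0;
        return (size_t)-3;
    }
    len = u8_decode((const unsigned char *)s, n, &cp);
    if (len < 0) { errno = EILSEQ; return (size_t)-1; }
    if (len == 0) return (size_t)-2;
    if (cp < 0x10000) {
        *pwc = (wc16)cp;
        return cp ? (size_t)len : 0;    /* 0 for the null character, as mbrtowc does */
    }
    cp -= 0x10000;
    *pwc = (wc16)(0xD800 | (cp >> 10));           /* output the high surrogate */
    ps->pending = (wc16)(0xDC00 | (cp & 0x3FF));  /* store the low surrogate */
    return (size_t)len;
}

/* wcrtomb-like: a high surrogate is only stored (returns 0); all bytes are
   written once the character is known to be valid. s must hold 4 bytes. */
size_t u8_wc16rtomb(char *s, wc16 wc, u8state *ps)
{
    uint32_t cp = wc;
    unsigned char *o = (unsigned char *)s;
    assert(s != NULL && ps != NULL);
    if (ps->pending) {                            /* a high surrogate is waiting */
        wc16 hi = ps->pending;
        ps->pending = 0;
        if (wc < 0xDC00 || wc > 0xDFFF) { errno = EILSEQ; return (size_t)-1; }
        cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(wc - 0xDC00));
    } else if (wc >= 0xD800 && wc <= 0xDBFF) {
        ps->pending = wc;                         /* store it, write nothing yet */
        return 0;
    } else if (wc >= 0xDC00 && wc <= 0xDFFF) {
        errno = EILSEQ;                           /* low surrogate with no high */
        return (size_t)-1;
    }
    if (cp < 0x80)    { o[0] = cp; return 1; }
    if (cp < 0x800)   { o[0] = 0xC0 | (cp >> 6); o[1] = 0x80 | (cp & 0x3F); return 2; }
    if (cp < 0x10000) { o[0] = 0xE0 | (cp >> 12); o[1] = 0x80 | ((cp >> 6) & 0x3F);
                        o[2] = 0x80 | (cp & 0x3F); return 3; }
    o[0] = 0xF0 | (cp >> 18);         o[1] = 0x80 | ((cp >> 12) & 0x3F);
    o[2] = 0x80 | ((cp >> 6) & 0x3F); o[3] = 0x80 | (cp & 0x3F);
    return 4;
}
```

Each direction is meant to use its own u8state object. Because errors are detected before anything is written, neither a lone surrogate nor a partial UTF-8 sequence reaches the output, although the wide string is still stateful in the sense the question describes.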