How to convert from UTF-16 to UTF-32 on Linux with the std library?
Question
On MSVC, converting UTF-16 to UTF-32 is easy with C++11's codecvt_utf16 locale facet. But in GCC (gcc (Debian 4.7.2-5) 4.7.2) this new feature seemingly hasn't been implemented yet. Is there a way to perform such a conversion on Linux without iconv (preferably using the conversion tools of the std library)?
Recommended answer
Decoding UTF-16 into UTF-32 is extremely easy.
You may want to detect at compile time which standard library version you're using, and fall back to your own conversion routine if you detect a broken one (without the functions you need).
Inputs:
- a pointer to the source UTF-16 data (char16_t *, ushort *, or, for convenience, UTF16 *);
- its size;
- a pointer to the UTF-32 data (char32_t *, uint *, or, for convenience, UTF32 *).
The code looks like:
void convert_utf16_to_utf32(const UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        if (!is_surrogate(uc)) {
            /* BMP code unit: maps to the same code point */
            *output++ = uc;
        } else {
            if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input)) {
                *output++ = surrogate_to_utf32(uc, *input++);
            } else {
                /* ERROR: unpaired surrogate; handling is up to the caller */
            }
        }
    }
}
Error handling is left to you: you might insert a U+FFFD¹ into the stream and keep going, or just bail out. The auxiliary functions are trivial:
/* 0xd800..0xdfff, checked with a single unsigned comparison */
int is_surrogate(UTF16 uc) { return (uc - 0xd800u) < 2048u; }
/* 0xd800..0xdbff */
int is_high_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xd800; }
/* 0xdc00..0xdfff */
int is_low_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xdc00; }
UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) {
    return (high << 10) + low - 0x35fdc00;
}
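The constant 0x35fdc00 in surrogate_to_utf32 folds the three steps of the usual surrogate decoding into a single subtraction. A small sketch in C that spells those steps out and compares the two forms; the function names and the uint16_t/uint32_t typedefs are illustrative assumptions, not part of the original answer:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

typedef uint16_t UTF16;  /* assumed typedefs; the answer leaves them open */
typedef uint32_t UTF32;

/* The folded constant: both surrogate bases minus the 0x10000 offset. */
static_assert(((0xD800u << 10) + 0xDC00u - 0x10000u) == 0x35FDC00u,
              "folded surrogate constant");

/* Longhand form: strip the surrogate bases, join the two 10-bit halves,
   then add the 0x10000 offset of the supplementary planes. */
UTF32 decode_pair_longhand(UTF16 high, UTF16 low) {
    return (((UTF32)high - 0xD800u) << 10) + ((UTF32)low - 0xDC00u) + 0x10000u;
}

/* Folded form, equivalent to the one-liner in the answer. */
UTF32 decode_pair_folded(UTF16 high, UTF16 low) {
    return ((UTF32)high << 10) + (UTF32)low - 0x35FDC00u;
}

/* Compares both forms over every valid surrogate pair.
   Returns 1 if they always agree, 0 otherwise. */
int folded_constant_is_consistent(void) {
    for (UTF32 high = 0xD800u; high <= 0xDBFFu; ++high)
        for (UTF32 low = 0xDC00u; low <= 0xDFFFu; ++low)
            if (decode_pair_longhand((UTF16)high, (UTF16)low) !=
                decode_pair_folded((UTF16)high, (UTF16)low))
                return 0;
    return 1;
}
```

For instance, the pair 0xD83D, 0xDE00 gives (0x3D << 10) + 0x200 + 0x10000 = 0x1F600 in the longhand form, and the folded form yields the same value.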
¹ Cf. Unicode:
- §3.9 Unicode Encoding Forms (Best Practices for Using U+FFFD)
- §5.22 Best Practice for U+FFFD Substitution
² Also consider that the !is_surrogate(uc) branch is by far the most common (as is the non-error path in the second if); you might want to optimize for that with __builtin_expect or similar.
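Putting the pieces together, here is a minimal self-contained sketch in C of the whole routine with both footnotes applied: unpaired surrogates become U+FFFD, and the common paths are marked with __builtin_expect (a GCC/Clang builtin; the fallback macros do nothing on other compilers). The uint16_t/uint32_t typedefs, the LIKELY/UNLIKELY macros, the name utf16_to_utf32_lossy, and the returned count are assumptions added for illustration, not part of the original answer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UTF16;  /* assumed typedefs */
typedef uint32_t UTF32;

#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Converts input_size UTF-16 code units to UTF-32, writing U+FFFD for each
   unpaired surrogate. output needs room for input_size code points in the
   worst case. Returns the number of code points written. */
size_t utf16_to_utf32_lossy(const UTF16 *input, size_t input_size,
                            UTF32 *output)
{
    assert(input != NULL && output != NULL);

    const UTF16 *const end = input + input_size;
    UTF32 *out = output;

    while (input < end) {
        const UTF16 uc = *input++;
        if (LIKELY((uc - 0xD800u) >= 2048u)) {
            /* not a surrogate: the code unit is the code point */
            *out++ = uc;
        } else if (LIKELY((uc & 0xFC00u) == 0xD800u && input < end &&
                          (*input & 0xFC00u) == 0xDC00u)) {
            /* high surrogate followed by a low surrogate */
            *out++ = ((UTF32)uc << 10) + *input++ - 0x35FDC00u;
        } else {
            /* unpaired surrogate: substitute U+FFFD and keep going */
            *out++ = 0xFFFDu;
        }
    }
    return (size_t)(out - output);
}
```

Whether the hints help at all depends on the compiler and the data, so it is worth measuring before keeping them. A caller that prefers to bail out could return an error from the last branch instead of substituting.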