Any good solutions for C++ string code points and code units?

Question

In Java, a String has these methods:

length()/charAt(), codePointCount()/codePointAt()

C++11 has std::string a = u8"很烫烫的一锅汤";

but a.size() is the length of the underlying char array (the number of UTF-8 code units), so it cannot be used to index Unicode characters.

Are there any solutions for Unicode in C++ strings?

Recommended answer

I generally convert the UTF-8 string to a wide string (UTF-32 or UCS-2, depending on the platform) before doing character operations. C++ does give us functions to do that, but they are not very user-friendly, so I have written some nicer conversion functions here:

#include <codecvt>   // std::codecvt_utf8
#include <iostream>
#include <locale>    // std::wstring_convert
#include <stdexcept>
#include <string>

// These should convert to and from whatever the system wide character
// encoding is for the platform (UTF-32 on Linux, UCS-2 on Windows).

// Wide string -> UTF-8
std::string ws_to_utf8(std::wstring const& s)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::string utf8 = cnv.to_bytes(s);
    // converted() reports how many input elements were consumed
    if(cnv.converted() < s.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

// UTF-8 -> wide string
std::wstring utf8_to_ws(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>, wchar_t> cnv;
    std::wstring s = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return s;
}

int main()
{
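    // Compiles as C++11/14/17. From C++20, u8"" literals are char8_t
    // arrays and no longer initialise a std::string directly.
    std::string s = u8"很烫烫的一锅汤";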

    auto w = utf8_to_ws(s); // convert to wide (UTF-32/UCS-2)

    // now we can use code-point indexes on the wide string

    std::cout << s << " is " << w.size() << " characters long" << '\n';
}

Output:

很烫烫的一锅汤 is 7 characters long
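
If only a count or a byte offset is needed, the conversion can be skipped: in UTF-8 every byte that is not a continuation byte (bit pattern 10xxxxxx) starts a new code point. The following is a minimal sketch, not part of the original answer. It assumes the input is already valid UTF-8 and performs no validation, and the names is_utf8_lead, utf8_code_point_count and utf8_byte_offset are made up for illustration (the first counting function loosely mirrors Java's codePointCount()).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True if the byte begins a code point, i.e. it is not a
// 10xxxxxx continuation byte.
inline bool is_utf8_lead(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

// Number of code points in a string that is assumed to be valid UTF-8.
std::size_t utf8_code_point_count(std::string const& utf8)
{
    std::size_t count = 0;
    for(unsigned char c : utf8)
        if(is_utf8_lead(c))
            ++count;
    return count;
}

// Byte offset at which code point number `index` (0-based) starts.
// Precondition (checked only in debug builds): index < code point count.
std::size_t utf8_byte_offset(std::string const& utf8, std::size_t index)
{
    assert(index < utf8_code_point_count(utf8));
    std::size_t seen = 0;
    for(std::size_t i = 0; i < utf8.size(); ++i)
    {
        if(is_utf8_lead(static_cast<unsigned char>(utf8[i])))
        {
            if(seen == index)
                return i;
            ++seen;
        }
    }
    return utf8.size(); // only reached if the precondition is violated
}
```

For the string used above, s.size() is 21 because each of these characters takes three bytes in UTF-8, while utf8_code_point_count(s) is 7. Finding an offset this way is a linear scan, so for repeated random access the conversion to a wide string is still the better fit.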

If you want to convert to and from UTF-32 regardless of platform, you can use the following (not so well tested) conversion routines:

std::string utf32_to_utf8(std::u32string const& utf32)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::string utf8 = cnv.to_bytes(utf32);
    if(cnv.converted() < utf32.size())
        throw std::runtime_error("incomplete conversion");
    return utf8;
}

std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> cnv;
    std::u32string utf32 = cnv.from_bytes(utf8);
    if(cnv.converted() < utf8.size())
        throw std::runtime_error("incomplete conversion");
    return utf32;
}

NOTE: As of C++17, std::wstring_convert (along with the <codecvt> facets such as std::codecvt_utf8) is deprecated.

However, I still prefer it to a third-party library: it is portable, it avoids external dependencies, it won't be removed until a replacement is provided, and in any case it will be easy to replace the implementations of these functions without having to change all the code that uses them.
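
To illustrate that last point, here is a hedged sketch of what a replacement body for utf8_to_utf32() might look like without std::wstring_convert. It is not from the original answer and has not been tested beyond simple cases. It keeps the same signature so that calling code would not need to change, and it is meant to be used instead of the version above, not alongside it. The error messages and the choice to throw std::runtime_error on malformed input are assumptions carried over from the style of the routines above.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hand-written UTF-8 -> UTF-32 decoder. Rejects malformed sequences,
// overlong encodings, surrogates and values above U+10FFFF.
std::u32string utf8_to_utf32(std::string const& utf8)
{
    std::u32string out;
    std::size_t i = 0;
    while(i < utf8.size())
    {
        unsigned char const lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp = 0;        // code point being assembled
        std::size_t len = 0;    // total bytes in this sequence
        char32_t min_cp = 0;    // smallest value allowed for this length

        if(lead < 0x80)                { cp = lead;         len = 1; min_cp = 0; }
        else if((lead & 0xE0) == 0xC0) { cp = lead & 0x1Fu; len = 2; min_cp = 0x80u; }
        else if((lead & 0xF0) == 0xE0) { cp = lead & 0x0Fu; len = 3; min_cp = 0x800u; }
        else if((lead & 0xF8) == 0xF0) { cp = lead & 0x07u; len = 4; min_cp = 0x10000u; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if(utf8.size() - i < len)
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte contributes its low six bits.
        for(std::size_t k = 1; k < len; ++k)
        {
            unsigned char const c = static_cast<unsigned char>(utf8[i + k]);
            if((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3Fu);
        }

        if(cp < min_cp || cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
            throw std::runtime_error("invalid code point in UTF-8 input");

        out.push_back(cp);
        i += len;
    }
    assert(i == utf8.size()); // every byte was consumed exactly once
    return out;
}
```

With this in place, utf8_to_utf32(s).size() gives the number of code points, and indexing the result with [] gives the code point at that position, much like Java's codePointCount() and codePointAt().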
