C++ substring multi byte characters

Views: 141 Posted: 2016/10/28 1:13:13 Tags: c++ character-encoding wstring

Problem description

I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, these characters are counted as 2 characters. In my opinion I should be using a wstring instead, because it will store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.

Is there a good solution for converting a string to a wstring that merges the special characters into 1 element instead of 2?

Thanks

Solution

There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which recognizes the multibyte characters and converts each of them into a single element.
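As a rough illustration of what such a "true conversion" involves, here is a minimal sketch that decodes UTF-8 into one char32_t per code point. The function name decodeUtf8 is hypothetical and not part of the answer; char32_t is chosen instead of wchar_t only because wchar_t is 16 bits wide on Windows. The sketch assumes the input is UTF-8, substitutes U+FFFD for malformed bytes, and does not reject overlong encodings or surrogates, so treat it as a starting point rather than a validating decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original answer): decode UTF-8 into one
// char32_t per code point. Malformed or truncated sequences become U+FFFD.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 1;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }         // ASCII
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else {                          // stray continuation or invalid lead byte
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        if (len > in.size() - i) {      // sequence truncated at end of input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) {
            out.push_back(0xFFFD);
            ++i;                        // resynchronize on the next byte
            continue;
        }
        out.push_back(cp);
        i += len;
        assert(i <= in.size());         // never advance past the end
    }
    return out;
}
```

Once the text is in a std::u32string, substr counts whole characters, which is what the question was after.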

For occasional use or shorter sequences, it's possible to write your own functions for advancing over the n bytes that make up each character. For UTF-8, I use the following:

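#include <stddef.h>     //  size_t
#include <iterator>     //  std::iterator_traits and the iterator tags

//      Byte and byteCountTable are not shown in this answer. Assumed
//      here: Byte is an unsigned 8-bit type, and byteCountTable maps
//      each possible lead byte to the length in bytes of the UTF-8
//      sequence it starts. The table must be defined elsewhere.
typedef unsigned char   Byte ;
extern size_t const     byteCountTable[ 256 ] ;

//      Number of bytes in the character whose first byte is ch.
inline size_t
size(
    Byte                ch )
{
    return byteCountTable[ ch ] ;
}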

//      Random access iterators: skip size bytes in one step.
template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

//      All other iterator categories: step forward one byte at a time.
template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

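//      Advance past one character. end is only used for the emptiness
//      check, so this assumes well-formed input: a truncated final
//      character would step past end.
template< typename InputIterator >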
InputIterator
succ(
    InputIterator       begin,
    InputIterator       end )
{
    if ( begin != end ) {
        //  Dispatch on the iterator category; the tag overloads take
        //  (iterator, byte count, tag).
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}

//      Number of characters (not bytes) in [begin, end).
template< typename InputIterator >
size_t
characterCount(
    InputIterator       begin,
    InputIterator       end )
{
    size_t              result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
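To connect this back to the substring question, here is a self-contained sketch of a character-based substring built on the same "advance by the length of the lead byte" idea. The names utf8SeqLength, utf8Advance and utf8Substr are hypothetical; utf8SeqLength is only a stand-in for the byteCountTable lookup, which the answer does not show. It assumes the string holds UTF-8 and counts code points, so combining sequences still count as several characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the answer's byteCountTable: number of bytes in
// the UTF-8 sequence introduced by lead byte ch. Continuation bytes and
// invalid lead bytes count as length 1 so that iteration always advances.
inline std::size_t utf8SeqLength(unsigned char ch)
{
    if (ch < 0x80)           return 1;  // ASCII
    if ((ch & 0xE0) == 0xC0) return 2;
    if ((ch & 0xF0) == 0xE0) return 3;
    if ((ch & 0xF8) == 0xF0) return 4;
    return 1;                           // stray continuation or invalid byte
}

// Starting at byte offset from, skip up to n characters and return the
// resulting byte offset (s.size() if the string ends first).
inline std::size_t utf8Advance(const std::string& s, std::size_t from, std::size_t n)
{
    std::size_t i = from;
    while (n != 0 && i < s.size()) {
        std::size_t len = utf8SeqLength(static_cast<unsigned char>(s[i]));
        i = (len > s.size() - i) ? s.size() : i + len;  // clamp a truncated tail
        --n;
    }
    assert(i <= s.size());
    return i;
}

// Substring where pos and count are measured in characters, not bytes.
inline std::string utf8Substr(const std::string& s, std::size_t pos, std::size_t count)
{
    std::size_t first = utf8Advance(s, 0, pos);
    std::size_t last  = utf8Advance(s, first, count);
    return s.substr(first, last - first);
}
```

For the six-byte string "na\xC3\xAFve" (naïve), utf8Substr(s, 1, 2) returns the three bytes of "aï" instead of cutting the ï in half. Each call scans from the start of the string, which is fine for occasional use but is the reason a fixed-width conversion is the better choice for heavy use.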
