C++ substring multi byte characters


Problem description


I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is not valid because, of course, each of these characters is counted as 2 characters. In my opinion I should be using a wstring instead, because it will store each of these characters as one element instead of several.

So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.

Is there a good solution for converting a string to a wstring that merges the special characters into 1 element instead of 2?

Thanks

Solution

There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert each of them into a single element.
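As a rough illustration of what such a "true conversion function" might look like, here is a minimal sketch that decodes UTF-8 into a std::u32string, one element per code point. The name decodeUtf8 is hypothetical, and char32_t is used instead of wchar_t only because wchar_t is 16 bits wide on some platforms. The sketch assumes the input is meant to be UTF-8; truncated or malformed sequences become U+FFFD, and it does not reject overlong encodings or surrogate values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the answer): decode UTF-8 bytes into
// one char32_t per code point. Malformed input yields U+FFFD.
std::u32string decodeUtf8(const std::string& in)
{
    const char32_t replacement = 0xFFFD;
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;   // stays 0 if 'lead' is not a valid lead byte
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

        // The whole sequence must fit in the input, and every trailing
        // byte must have the form 10xxxxxx.
        bool ok = len != 0 && i + len <= in.size();
        for (std::size_t k = 1; ok && k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;
            else cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) {
            assert(len >= 1 && len <= 4);
            out.push_back(cp);
            i += len;
        } else {
            out.push_back(replacement);
            ++i;   // skip one byte and try to resynchronize
        }
    }
    return out;
}
```

Once the text is in this form, substr on the std::u32string counts code points rather than bytes, at the cost of four bytes of storage per element.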

For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes. For UTF-8, I use the following:

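#include <stddef.h>   // size_t
#include <iterator>   // std::iterator_traits, iterator tags

//  Byte and byteCountTable are not shown in this answer; presumably
//  Byte is an unsigned 8-bit type and byteCountTable maps a lead byte
//  to the length in bytes of its UTF-8 sequence.
inline size_t
size(
    Byte                ch )
{
    return byteCountTable[ ch ] ;
}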

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    size_t              size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator       begin,
    InputIterator       end )
{
    if ( begin != end ) {
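        //  Dispatch on the iterator category; note that a truncated
        //  final sequence would advance past end.
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;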
    }
    return begin ;
}

template< typename InputIterator >
size_t
characterCount(
    InputIterator       begin,
    InputIterator       end )
{
    size_t              result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
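As a usage sketch of the same idea applied to the original substring problem: the answer does not show Byte or byteCountTable, so the example below substitutes a hypothetical utf8Length function that derives the sequence length from the lead byte. The names nextChar and utf8Substr are likewise illustrative and not part of the answer. It assumes the string holds valid UTF-8 and clamps at the end of the string so a truncated final sequence cannot run past it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: stands in for the answer's byteCountTable lookup.
// Returns the sequence length implied by a lead byte, and 1 for
// anything else so that iteration always makes progress.
inline std::size_t utf8Length(unsigned char ch)
{
    if ((ch & 0xE0) == 0xC0) return 2;
    if ((ch & 0xF0) == 0xE0) return 3;
    if ((ch & 0xF8) == 0xF0) return 4;
    return 1;
}

// Index of the character following the one at idx, clamped to s.size().
inline std::size_t nextChar(const std::string& s, std::size_t idx)
{
    if (idx >= s.size()) return s.size();
    const std::size_t next =
        idx + utf8Length(static_cast<unsigned char>(s[idx]));
    return next < s.size() ? next : s.size();
}

// Hypothetical helper: like std::string::substr, but pos and count are
// measured in characters instead of bytes.
std::string utf8Substr(const std::string& s, std::size_t pos, std::size_t count)
{
    std::size_t first = 0;
    while (pos != 0 && first < s.size()) {
        first = nextChar(s, first);
        --pos;
    }
    std::size_t last = first;
    while (count != 0 && last < s.size()) {
        last = nextChar(s, last);
        --count;
    }
    assert(first <= last && last <= s.size());
    return s.substr(first, last - first);
}
```

Each call walks the string from the beginning, so this suits occasional use on short strings; for heavy use, converting once to a single-element encoding is likely the better trade-off.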
