C++ substring multi byte characters
Problem description
I have a std::string that contains some characters that span multiple bytes.
When I take a substring of this string, the output is not valid, because of course these characters are counted as 2 characters. In my opinion I should be using a wstring instead, because it will store each of these characters as one element instead of several.
So I decided to copy the string into a wstring, but of course this does not make sense, because the characters remain split over 2 elements. This only makes it worse.
Is there a good solution for converting a string to a wstring that merges the special characters into 1 element instead of 2?
Thanks
There are really only two possible solutions. If you're doing this a lot, over large distances, you'd be better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy, which would convert each individual char into the target type, but a true conversion function, which would recognize the multibyte characters and convert them into a single element.
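As a rough illustration of the "true conversion function" described above, here is a minimal sketch that decodes UTF-8 into a std::u32string, one element per code point. The name decodeUtf8 is hypothetical and not part of the answer. The sketch assumes the input is UTF-8 and uses char32_t (C++11) rather than wchar_t; it checks lead and continuation bytes but does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from the original answer): decodes a UTF-8
// std::string into a std::u32string holding one element per code point.
// Assumption: the input is meant to be UTF-8. Overlong forms and
// surrogates are not rejected in this sketch.
std::u32string decodeUtf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        // The lead byte tells us how many bytes the sequence occupies
        // and carries the most significant payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (len > in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Once the text is held this way, substr on the resulting std::u32string counts whole code points rather than bytes.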
For occasional use or shorter sequences, it's possible to write your own functions for advancing n bytes, where n is the length in bytes of the current character. For UTF-8, I use the following:
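#include <stddef.h>  // size_t
#include <iterator>  // std::iterator_traits and the iterator tags

// Byte and byteCountTable are defined elsewhere (not shown here): the
// table maps a lead byte to the length in bytes of its UTF-8 sequence.
inline size_t
size(
    Byte ch )
{
    return byteCountTable[ ch ] ;
}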
template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::random_access_iterator_tag )
{
    return begin + size ;
}

template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    size_t size,
    std::input_iterator_tag )
{
    while ( size != 0 ) {
        ++ begin ;
        -- size ;
    }
    return begin ;
}
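// Advances past one character. Note that end is only used to detect an
// empty range; a truncated final sequence is not checked for here.
template< typename InputIterator >
InputIterator
succ(
    InputIterator begin,
    InputIterator end )
{
    if ( begin != end ) {
        begin = succ( begin, size( *begin ),
                      typename std::iterator_traits< InputIterator >::iterator_category() ) ;
    }
    return begin ;
}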
template< typename InputIterator >
size_t
characterCount(
    InputIterator begin,
    InputIterator end )
{
    size_t result = 0 ;
    while ( begin != end ) {
        ++ result ;
        begin = succ( begin, end ) ;
    }
    return result ;
}
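
To tie this back to the substring problem in the question, the same idea can be sketched as a self-contained helper that takes a substring counted in characters instead of bytes. The names utf8Length, advanceChars and utf8Substr are hypothetical; utf8Length merely stands in for the byteCountTable lookup, whose contents the answer does not show. The sketch assumes UTF-8 input and clamps at the end of the string instead of reading past a truncated sequence.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the byteCountTable lookup: the number of
// bytes in a UTF-8 sequence, derived from its lead byte. A stray
// continuation byte or invalid lead byte is treated as a one-byte
// character so that scanning always makes progress.
inline std::size_t utf8Length(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical helper: returns the byte offset reached after skipping
// `count` characters starting at byte offset `from`, clamped to the
// end of the string.
inline std::size_t advanceChars(const std::string& s,
                                std::size_t from,
                                std::size_t count)
{
    std::size_t pos = from;
    while (count != 0 && pos < s.size()) {
        const std::size_t len = utf8Length(static_cast<unsigned char>(s[pos]));
        pos = (len > s.size() - pos) ? s.size() : pos + len;
        --count;
    }
    return pos;
}

// Like std::string::substr, but `first` and `count` are measured in
// characters (UTF-8 sequences) rather than bytes.
inline std::string utf8Substr(const std::string& s,
                              std::size_t first,
                              std::size_t count)
{
    const std::size_t b = advanceChars(s, 0, first);
    const std::size_t e = advanceChars(s, b, count);
    return s.substr(b, e - b);
}
```

Because every cut lands on a lead byte, the result never contains half of a multibyte character, which is what goes wrong with a plain byte-indexed substr.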