substr with characters instead of bytes


Problem description

Suppose I have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"

When I call substr(30, 40) it returns " #Person Tätigkeitsdarstellung", beginning with a space. I guess it's counting bytes instead of characters.

Normally the size of the string is 110, but s.length() or s.size() returns 113 because of the 3 special characters.

Is there a way to avoid this leading space in the return value?

Thanks for your help!

Solution

In UTF-8, the code point (character) ä consists of two code units (a code unit is 1 byte in UTF-8). C++ has no support for treating strings as sequences of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
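To make the difference between code units and code points concrete, here is a minimal sketch that counts code points by skipping UTF-8 continuation bytes. The names count_code_points and check_count_code_points are made up for this illustration, and the code assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by skipping
// continuation bytes (bytes of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;  // ASCII byte or lead byte of a sequence
    }
    return n;
}

// Small self-check: "ä" (U+00E4) is the two bytes C3 A4 in UTF-8.
void check_count_code_points() {
    const std::string a_umlaut = "\xC3\xA4";
    assert(a_umlaut.size() == 2);              // code units (bytes)
    assert(count_code_points(a_umlaut) == 1);  // code points
}
```

The byte escapes are used instead of a literal "ä" so the result does not depend on how the source file is encoded.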

A simple approach is to use std::wstring. It uses a character type (wchar_t) that is at least as wide as the widest character set supported by the system. So if the system supports an encoding wide enough to represent any (non-composite) Unicode character with a single code unit, the string methods behave as you would expect. Currently UTF-32 is wide enough, and it is supported by (most?) Unix-like OSes.

Note that Windows only supports UTF-16 and not UTF-32 for wide strings. So if you choose the wstring approach, port your program to Windows, and a user of your program enters Unicode characters that need more than 2 bytes (that is, more than one UTF-16 code unit), then the presumption of one code unit per code point no longer holds.
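The UTF-16 caveat can be shown without a Windows machine. This sketch uses std::u16string and std::u32string instead of std::wstring (my substitution, so that the code unit width is fixed on every platform); the function name check_code_unit_counts is made up for the example.

```cpp
#include <cassert>
#include <string>

// Compares code unit counts for a character inside and outside the
// Basic Multilingual Plane (BMP).
void check_code_unit_counts() {
    // U+00E4 (ä) lies in the BMP: one code unit in UTF-16 and in UTF-32.
    assert(std::u16string(u"\u00E4").size() == 1);
    assert(std::u32string(U"\u00E4").size() == 1);

    // U+1F600 lies outside the BMP: UTF-16 needs a surrogate pair.
    assert(std::u16string(u"\U0001F600").size() == 2);
    assert(std::u32string(U"\U0001F600").size() == 1);
}
```

A 16-bit wchar_t behaves like the char16_t case here, so size() and substr() on such a wstring can still split a character in two.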

The wstring approach also doesn't take control or composite (combining) characters into consideration.
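As an illustration of the combining character problem, ä can also be written as a plain a followed by U+0308 (COMBINING DIAERESIS). The sketch below uses std::u32string to stand in for a 32-bit wstring; check_combining_sequence is a made-up name.

```cpp
#include <cassert>
#include <string>

// Even with one code unit per code point, one visible character can
// span several code points.
void check_combining_sequence() {
    const std::u32string precomposed = U"\u00E4";   // ä as a single code point
    const std::u32string decomposed  = U"a\u0308";  // a + COMBINING DIAERESIS
    assert(precomposed.size() == 1);
    assert(decomposed.size() == 2);      // two code points, one visible character
    assert(precomposed != decomposed);   // they compare equal only after normalization
}
```

Normalizing or counting user-perceived characters is outside what the standard library offers; that is a job for a Unicode library.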

Here's a little test snippet that takes a std::string containing the multi-byte UTF-8 character ä and converts it to a wstring:

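// Needs <codecvt>, <locale>, <string>, <iostream> and `using namespace std;`.
string foo("ä"); // read however you want; this literal must be stored as UTF-8
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo); // decode UTF-8 into wide characters
cout << foo.size() << endl;  // 2 on my system (UTF-8 code units)
cout << wfoo.size() << endl; // 1 on my system (wide characters)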

Unfortunately, libstdc++ had not implemented <codecvt> (introduced in C++11) as of GCC 4.8, at least; it was added later, in GCC 5, and C++17 has since deprecated the header. If you can't require libc++, then similar functionality is probably in Boost.Locale.

Alternatively, if you wish to keep your code portable to systems that don't support UTF-32, you can keep using std::string and use an external library for iterating, counting and so on. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
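If pulling in a library is not an option, a code-point-aware substr on a UTF-8 std::string can be sketched by hand. This is only an illustration, not the API of either library above: utf8_advance and utf8_substr are hypothetical names, the input is assumed to be valid UTF-8, and positions count code points, not combined characters. The test string is a shortened piece of the string from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: starting at byte offset `start`, moves forward by
// `n` code points and returns the resulting byte offset (at most s.size()).
std::size_t utf8_advance(const std::string& s, std::size_t start, std::size_t n) {
    std::size_t i = start;
    while (i < s.size() && n > 0) {
        ++i;  // step past the lead (or ASCII) byte
        while (i < s.size() &&
               (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) {
            ++i;  // step past continuation bytes
        }
        --n;
    }
    return i;
}

// Like std::string::substr, but `pos` and `count` are measured in code
// points. Unlike substr, an out-of-range `pos` yields an empty string
// instead of throwing.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t count) {
    const std::size_t begin = utf8_advance(s, 0, pos);
    const std::size_t end = utf8_advance(s, begin, count);
    return s.substr(begin, end - begin);
}

void check_utf8_substr() {
    // "PTäPO PVä #Person" with ä written as the UTF-8 bytes C3 A4.
    const std::string s = "PT\xC3\xA4PO PV\xC3\xA4 #Person";
    assert(utf8_substr(s, 10, 7) == "#Person");  // code point indices
    assert(s.substr(12, 7) == "#Person");        // same text needs byte offset 12
}
```

Each call walks the string from the start, so this is linear in the position; for heavy use, an iterator-based library is the better fit.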
