substr with characters instead of bytes
Suppose I have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"
When I do a substring(30,40) it returns " #Person Tätigkeitsdarstellung", beginning with a space.
I guess it's counting bytes instead of characters.
Normally the size of the string is 110, but when I do a s.length() or s.size() it returns 113 because of the 3 special characters.
I was wondering if there is a way to avoid this empty space at the beginning of the return value.
Thanks for your help!
In UTF-8, the code point (character) ä consists of two code units (a UTF-8 code unit is 1 byte). C++ does not have support for treating strings as sequences of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
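To make the byte versus code point distinction concrete, here is a minimal sketch that counts code points in a UTF-8 std::string by skipping continuation bytes. The function name utf8_length is made up for illustration and is not part of the standard library; the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard library function.
// Counts Unicode code points in a UTF-8 encoded string by counting every
// byte that is NOT a continuation byte (continuation bytes are 10xxxxxx).
// Assumes valid UTF-8; malformed input is not detected.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the two-byte sequence "\xC3\xA4" (ä in UTF-8), size() reports 2 while utf8_length reports 1.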
A simple approach is to use std::wstring. wstring uses a character type (wchar_t) which is at least as wide as the widest character set supported by the system. Therefore, if the system supports a wide enough encoding to represent any (non-composite) Unicode character with a single code unit, then string methods will behave as you would expect. Currently UTF-32 is wide enough and is supported by (most?) Unix-like OSes.
A thing to note is that on Windows wchar_t is 16 bits wide and holds UTF-16, not UTF-32, so if you choose the wstring approach and port your program to Windows, and a user of your program tries to use Unicode characters that need more than 2 bytes (that is, more than one 16-bit code unit), then the presumption of one code unit per code point does not hold.
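The UTF-16 caveat can be shown without depending on the platform's wchar_t by using std::u16string and std::u32string, which are not mentioned in the original answer and are used here only for illustration. The helper utf16_length is a hypothetical name; it is a sketch that assumes well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not a standard library function.
// Counts code points in a UTF-16 string by skipping low (trailing)
// surrogates, which only ever appear as the second half of a pair.
// Assumes well-formed UTF-16.
std::size_t utf16_length(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

With the string u"a\U0001F600" (U+1F600 lies outside the Basic Multilingual Plane), size() reports 3 code units while utf16_length reports 2 code points; the same text as a std::u32string has size() 2.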
The wstring approach also doesn't take control or composite characters into consideration.
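As an illustration of the composite character problem, ä can be written either as the single code point U+00E4 or as the letter a followed by U+0308 (combining diaeresis). The sketch below uses std::u32string so that one element is always one code point; the function names are made up for this example.

```cpp
#include <string>

// "ä" as one precomposed code point (U+00E4).
std::u32string precomposed_a_umlaut() {
    return U"\u00E4";
}

// "ä" as base letter 'a' followed by U+0308 COMBINING DIAERESIS.
// It renders as one character but is stored as two code points.
std::u32string decomposed_a_umlaut() {
    return U"a\u0308";
}
```

Both strings display as the same character, yet the first has size() 1 and the second has size() 2, and they do not compare equal. Handling this properly requires Unicode normalization, which the standard library does not provide.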
Here's a little test program which takes a std::string containing the multi-byte UTF-8 character ä and converts it to a wstring:
// needs <codecvt>, <locale>, <string>, <iostream> and using namespace std
string foo("ä"); // read however you want; must be UTF-8 encoded
wstring_convert<codecvt_utf8<wchar_t>> converter;
wstring wfoo = converter.from_bytes(foo);
cout << foo.size() << endl; // 2 on my system
cout << wfoo.size() << endl; // 1 on my system
Unfortunately, libstdc++ hasn't implemented <codecvt>, which was introduced in C++11, as of gcc-4.8 at least. If you can't require libc++, then similar functionality is probably in Boost.Locale.
Alternatively, if you wish to keep your code portable to systems that don't support UTF-32, you can keep using std::string and use an external library for iterating, counting and such. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
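To show what such a library does under the hood, here is a hand-rolled sketch of a substr that takes its start and count in code points rather than bytes. It is not the API of utfcpp or ICU; utf8_offset and utf8_substr are hypothetical names, and the code assumes valid UTF-8 and ignores combining characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: byte offset at which code point number `index`
// starts, or s.size() if the string has fewer code points.
std::size_t utf8_offset(const std::string& s, std::size_t index) {
    for (std::size_t pos = 0; pos < s.size(); ++pos) {
        // Lead bytes are all bytes that are not continuation bytes (10xxxxxx).
        bool is_lead = (static_cast<unsigned char>(s[pos]) & 0xC0) != 0x80;
        if (is_lead) {
            if (index == 0) {
                return pos;
            }
            --index;
        }
    }
    return s.size();
}

// Hypothetical helper: like std::string::substr, but `start` and `count`
// are measured in code points. Out-of-range values are clamped to the end.
std::string utf8_substr(const std::string& s, std::size_t start, std::size_t count) {
    std::string tail = s.substr(utf8_offset(s, start));
    return tail.substr(0, utf8_offset(tail, count));
}
```

With s = "PTäPO PVä #Person" stored as UTF-8, utf8_substr(s, 10, 7) yields "#Person", whereas s.substr(10, 7) starts in the middle of the second ä because it counts bytes. A real library is still preferable for production use, since it also validates the input.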