substr with characters instead of bytes

Views: 287 Published: 2016/10/30 4:21:06 Tags: c++ string substring special-characters substr


Question

Suppose I have a string s = "101870002PTäPO PVä #Person Tätigkeitsdarstellung 001100001&0111010101101870100092001000010"

When I call s.substr(30, 40), it returns " #Person Tätigkeitsdarstellung", beginning with a space. I guess it's counting bytes instead of characters.

Normally the size of the string is 110, but s.length() or s.size() returns 113 because of the 3 special characters.

I was wondering if there is a way to avoid this empty space at the beginning of the return value.

Thanks for your help!

Solution

In UTF-8, the code point (character) ä consists of two code units, each of which is 1 byte. C++ has no support for treating strings as a sequence of code points. Therefore, as far as the standard library is concerned, std::string("ä").size() is 2.
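
To make the byte versus code point distinction concrete, here is a minimal sketch that counts code points in a UTF-8 std::string by skipping continuation bytes (bytes of the form 10xxxxxx). The name utf8_length is made up for illustration, and the code assumes the input is valid UTF-8; it does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the standard library.
// Counts UTF-8 code points by counting every byte that is not a
// continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // ASCII byte or lead byte: a new code point starts here
        }
    }
    return count;
}
```

For the two bytes 0xC3 0xA4 that encode ä, size() reports 2 while this helper reports 1.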

A simple approach is to use std::wstring. wstring uses a character type (wchar_t) that is at least as wide as the widest character set supported by the system. Therefore, if the system supports an encoding wide enough to represent any (non-composite) Unicode character with a single code unit, the string methods will behave as you would expect. Currently UTF-32 is wide enough and is supported by (most?) Unix-like operating systems.

One thing to note is that Windows only supports UTF-16 and not UTF-32 for wchar_t, so if you choose the wstring approach, port your program to Windows, and a user enters Unicode characters that need more than 2 bytes (that is, more than one UTF-16 code unit), then the presumption of one code unit per code point no longer holds.
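
The UTF-16 caveat can be illustrated without Windows by using the fixed-width string types from C++11. This is only a sketch: the variable names and the utf16_length helper are invented for the example, and U+1F600 is just one arbitrary code point outside the Basic Multilingual Plane.

```cpp
#include <cstddef>
#include <string>

// U+1F600 does not fit in a single 16-bit code unit.
const std::u16string utf16_text = u"\U0001F600";  // stored as a surrogate pair
const std::u32string utf32_text = U"\U0001F600";  // stored as one code unit

// Hypothetical helper: counts code points in well-formed UTF-16 by
// skipping low surrogates (0xDC00..0xDFFF), the second half of each pair.
std::size_t utf16_length(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

Here utf16_text.size() is 2 even though the string holds one code point, whereas utf32_text.size() is 1.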

The wstring approach also doesn't take control or composite (combining) characters into consideration.
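
As a small illustration of the composite character problem, the sketch below stores ä in two ways using UTF-32, where every code point is exactly one code unit. The variable names are made up for the example.

```cpp
#include <string>

// Precomposed form: the single code point U+00E4.
const std::u32string precomposed = U"\u00E4";

// Decomposed form: 'a' followed by U+0308 (combining diaeresis).
// It is displayed as one character but consists of two code points.
const std::u32string decomposed = U"a\u0308";
```

Even with one code unit per code point, decomposed.size() is 2, so size() and substr() still do not operate on what a user would call a character.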

Here's a little test code that converts a std::string containing the multi-byte UTF-8 character ä to a wstring:

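#include <codecvt>   // std::codecvt_utf8
#include <iostream>
#include <locale>    // std::wstring_convert
#include <string>
int main() {
    std::string foo("ä"); // read however you want; assumes a UTF-8 encoded source file
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    std::wstring wfoo = converter.from_bytes(foo); // decode UTF-8 bytes into wide characters
    std::cout << foo.size() << '\n' << wfoo.size() << '\n'; // 2, then 1 on my system
}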

Unfortunately, libstdc++ had not implemented <codecvt>, which was introduced in C++11, as of gcc-4.8 at least (GCC 5 and later do ship it). If you can't require libc++, then similar functionality is probably available in Boost.Locale.

Alternatively, if you wish to keep your code portable to systems that don't support UTF-32, you can keep using std::string and use an external library for iterating, counting, and so on. Here's one: http://utfcpp.sourceforge.net/ and another: http://site.icu-project.org/. I believe this is the recommended approach.
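
To show what such a library does conceptually, here is a hand-rolled sketch of a substr that counts code points instead of bytes while still storing the text in a std::string. It is not taken from utfcpp or ICU; the name utf8_substr and its signature are assumptions for illustration, and it expects valid UTF-8 and ignores combining characters.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not from utfcpp or ICU. Assumes valid UTF-8.
// Returns up to `count` code points, starting at code point index `pos`.
std::string utf8_substr(const std::string& s, std::size_t pos, std::size_t count) {
    // True for bytes that begin a code point (anything except 10xxxxxx).
    auto starts_code_point = [](char c) {
        return (static_cast<unsigned char>(c) & 0xC0) != 0x80;
    };

    std::size_t begin = s.size();  // byte offset of code point `pos`
    std::size_t end = s.size();    // byte offset just past the last wanted code point
    std::size_t index = 0;         // index of the code point starting at byte i
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (!starts_code_point(s[i])) continue;
        if (index == pos) begin = i;
        if (index >= pos && index - pos == count) {
            end = i;
            break;
        }
        ++index;
    }
    return s.substr(begin, end - begin);
}
```

With the UTF-8 bytes of "Tätigkeit", utf8_substr(s, 1, 3) yields "äti", whereas s.substr(1, 3) yields the two bytes of ä plus "t". A real library would additionally validate the input, which this sketch deliberately leaves out.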
