How will std::u8string be different from std::string?

Question

If I have a string:

std::string s = u8"你好";

and in C++20,

std::u8string s = u8"你好";

how will std::u8string be different from std::string?

Recommended answer

Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is: what is the difference between using char8_t-based strings and char-based strings?

It really comes down to this: type-based encoding.

Any char-based string (char*, char[], string, etc) may be encoded in UTF-8. But then again, it may not. You could develop your code under an assumption that every char* equivalent will be UTF-8 encoded. And you could write a u8 in front of every string literal and/or otherwise ensure they're properly encoded. But:


  1. Other people's code may not agree. You therefore cannot use any library that might return char* strings that are not UTF-8 encoded.

  2. You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding, whatever that is. It may be UTF-8 on some compilers and something else on others.

  3. You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8 encoded. This has to be something the user assumes or has read in your documentation, rather than something they see in code.

Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interop with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";

Now yes, there is no guarantee of any of these things. Any particular char16_t string could have any values in it, even values that are illegal in UTF-16. But char16_t represents a type for which the default assumption is a specific encoding. Given that, if you present a string of this type that isn't UTF-16 encoded, it would not be unreasonable to consider this a mistake or perfidy on the user's part: a contract violation.
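
To make the "no guarantee" caveat concrete, here is a small sketch in which a char16_t string holds a value that is not valid UTF-16. The function has_unpaired_surrogate and the array lone_high_surrogate are hypothetical names for this illustration. The check covers only surrogate pairing, which is the one way a sequence of 16-bit units can fail to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string_view>

// Returns true if the sequence contains a surrogate code unit that is not
// part of a proper high/low pair, i.e. the sequence is not well-formed UTF-16.
constexpr bool has_unpaired_surrogate(std::u16string_view s) noexcept {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            // A high surrogate must be followed immediately by a low surrogate.
            if (i + 1 >= s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return true;
            }
            ++i;  // skip the low surrogate that completes the pair
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            return true;  // low surrogate with no preceding high surrogate
        }
    }
    return false;
}

// The type system accepts this, even though a lone high surrogate is not UTF-16.
inline constexpr char16_t lone_high_surrogate[] = {0xD800};
static_assert(has_unpaired_surrogate(std::u16string_view(lone_high_surrogate, 1)));
```

The type records the intended encoding, and a check like this is what actually enforces it at a trust boundary.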

We can see how C++ has been impacted by lacking similar, type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16/32, path's constructor takes char16/32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to employ filesystem::u8path, which is a separate function that returns a path, constructed from a UTF-8-encoded string.

What's worse is that if you try to pass a UTF-8 encoded char-based string to path's constructor... it compiles fine. Despite being at best non-portable, it may just appear to work.

char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same power that the other UTF encodings get. In C++20, filesystem::path gains overloads for char8_t-based strings, and u8path is deprecated.

And, as an added bonus, char8_t doesn't have special aliasing language around it. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.
