How will std::u8string be different from std::string?
Question
If I have a string:
std::string s = u8"你好";
and, in C++20,
std::u8string s = u8"你好";
how will std::u8string be different from std::string?
Recommended answer
Since the difference between u8string and string is that one is templated on char8_t and the other on char, the real question is what difference it makes to use char8_t-based strings rather than char-based strings.
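The relationship between the two types can be checked at compile time. The following is a minimal sketch, not part of the original answer; the char8_t checks are guarded by the feature-test macros so that the snippet also compiles before C++20.

```cpp
#include <string>
#include <type_traits>

// std::string is the basic_string template instantiated on char.
static_assert(std::is_same_v<std::string, std::basic_string<char>>);

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
// C++20: std::u8string is the same template instantiated on char8_t.
static_assert(std::is_same_v<std::u8string, std::basic_string<char8_t>>);

// char8_t is a distinct type, not an alias for char or unsigned char.
static_assert(!std::is_same_v<char8_t, char>);
static_assert(!std::is_same_v<char8_t, unsigned char>);

// A u8 literal is an array of const char8_t in C++20 ...
static_assert(std::is_same_v<decltype(u8"x"), const char8_t (&)[2]>);

// ... so it can no longer initialise a std::string, only a std::u8string.
static_assert(!std::is_constructible_v<std::string, const char8_t*>);
static_assert(std::is_constructible_v<std::u8string, const char8_t*>);
#endif
```

One consequence is that the first line of the question, std::string s = u8"你好";, compiles in C++17 but is rejected in C++20.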
It really comes down to this: type-based encoding.
Any char-based string (char*, char[], string, etc.) may be encoded in UTF-8. But then again, it may not. You could develop your code under the assumption that every char* equivalent will be UTF-8 encoded. And you could write a u8 in front of every string literal and/or otherwise ensure they're properly encoded. But:
- Other people's code may not agree. So you cannot use any library that might return char*s that are not UTF-8 encoded.
- You might accidentally violate your own precepts. After all, char not_utf8[] = "你好"; is conditionally supported C++. The encoding of that char[] will be the compiler's narrow encoding, whatever that is. It may be UTF-8 on some compilers and something else on others.
- You can't tell other people's code (or even other people on your team) that this is what you're doing. That is, your API cannot declare that a particular char* is UTF-8 encoded. This has to be something the user assumes or has read in your documentation, rather than something they see in code.
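To illustrate the last point, here is a small sketch of an API that states its encoding in its signature. The names utf8_view and count_code_points are hypothetical, and the pre-C++20 branch exists only so the snippet compiles with older standards; in that branch the encoding is once again a matter of convention.

```cpp
#include <cstddef>
#include <string_view>

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
using utf8_view = std::u8string_view;  // C++20: the encoding is part of the type
#else
using utf8_view = std::string_view;    // fallback: UTF-8 by convention only
#endif

// Hypothetical helper: counts code points by skipping UTF-8 continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed input.
constexpr std::size_t count_code_points(utf8_view text) {
    std::size_t count = 0;
    for (auto unit : text) {
        if ((static_cast<unsigned char>(unit) & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// \u4F60\u597D spells the two characters from the question; each one
// occupies three UTF-8 code units but is a single code point.
static_assert(count_code_points(u8"\u4F60\u597D") == 2);
static_assert(count_code_points(u8"abc") == 3);
```

In C++20 mode, calling count_code_points("abc") with a plain narrow literal does not compile, so a caller cannot hand over a string of unknown encoding by accident.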
Note that none of these problems exist for users of UTF-16 or UTF-32. If you use a char16_t-based string, all of these problems go away. If other people's code returns a char16_t string, you know what they're doing. If they return something else, then you know that those things probably aren't UTF-16. Your UTF-16-based code can interoperate with theirs. If you write an API that returns a char16_t-based string, everyone using your code can see from the type of the string what encoding it is. And this is guaranteed to be a compile error: char16_t not_utf16[] = "你好";
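The same guarantees can be checked at compile time. This is a minimal sketch; the variable name greeting is illustrative, and the ill-formed declaration is left commented out so that the snippet compiles.

```cpp
#include <string>
#include <string_view>
#include <type_traits>

// u"..." yields char16_t code units, so the literal is UTF-16 by construction.
// \u4F60\u597D spells the two characters from the question.
constexpr std::u16string_view greeting = u"\u4F60\u597D";
static_assert(greeting.size() == 2);  // two code units, both in the BMP

// A narrow literal cannot initialise char16_t storage:
// char16_t not_utf16[] = "\u4F60\u597D";  // ill-formed if uncommented

// The same holds for the string classes: only char16_t data is accepted.
static_assert(!std::is_constructible_v<std::u16string, const char*>);
static_assert(std::is_constructible_v<std::u16string, const char16_t*>);
```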
Now yes, there is no guarantee of any of these things. Any particular char16_t string could have any values in it, even ones that are illegal in UTF-16. But char16_t represents a type for which the default assumption is a specific encoding. Given that, if you present a string of this type that isn't UTF-16 encoded, it would not be unreasonable to consider this a mistake or perfidy by the user: a contract violation.
We can see how C++ has been affected by the lack of similar, type-based facilities for UTF-8. Consider filesystem::path. It can take strings in any Unicode encoding. For UTF-16/32, path's constructor takes char16_t- or char32_t-based strings. But you cannot pass a UTF-8 string to path's constructor; the char-based constructor assumes that the encoding is the implementation-defined narrow encoding, not UTF-8. So instead, you have to use filesystem::u8path, a separate function that returns a path constructed from a UTF-8-encoded string.
What's worse is that if you try to pass a UTF-8-encoded char-based string to path's constructor... it compiles fine. Despite being non-portable at best, it may just appear to work.
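The trap can be made visible with a few compile-time checks. This is a sketch under the C++17 rules described above; path_from_utf8_bytes is a hypothetical helper name, and u8path produces a deprecation warning when compiled as C++20.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

// UTF-16 and UTF-32 strings are accepted directly: the type carries the encoding.
static_assert(std::is_constructible_v<fs::path, std::u16string>);
static_assert(std::is_constructible_v<fs::path, std::u32string>);

// A char-based string is accepted as well, but its bytes are interpreted in
// the native narrow encoding. Passing UTF-8 here compiles without complaint.
static_assert(std::is_constructible_v<fs::path, std::string>);

// Hypothetical helper: the only way to say "these chars are UTF-8" is to
// route them through a separately named function.
inline fs::path path_from_utf8_bytes(const std::string& utf8) {
    return fs::u8path(utf8);
}
```

Nothing in the signature of path_from_utf8_bytes stops a caller from passing a string in some other encoding; the requirement lives only in the function's name and documentation.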
char8_t, and all of its accoutrements like u8string, exist to give UTF-8 users the same power that the other UTF encodings get. In C++20, filesystem::path will get overloads for char8_t-based strings, and u8path will become obsolete.
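A sketch of what this looks like in practice follows. The helper name path_from_utf8 is hypothetical, and the #else branch is only a fallback so that the snippet still compiles before C++20.

```cpp
#include <filesystem>
#include <string>
#include <type_traits>

namespace fs = std::filesystem;

#if defined(__cpp_char8_t) && defined(__cpp_lib_char8_t)
// C++20: char8_t-based sources are accepted directly and treated as UTF-8.
static_assert(std::is_constructible_v<fs::path, std::u8string>);
static_assert(std::is_constructible_v<fs::path, const char8_t*>);

// Hypothetical helper: no u8path needed, the parameter type says "UTF-8".
inline fs::path path_from_utf8(const std::u8string& utf8) {
    return fs::path(utf8);
}
#else
// Pre-C++20 fallback: u8path plus a convention about the contents.
inline fs::path path_from_utf8(const std::string& utf8) {
    return fs::u8path(utf8);
}
#endif
```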
And, as an added bonus, char8_t doesn't have special aliasing language around it. So an API that takes char8_t-based strings is certainly an API that takes a character array, rather than an arbitrary byte array.