Situation of char8_t / UTF8 chars pre-C++17 and poor-man-ing it?

Question

I've been reading links as this question and of course this question on preparing for the upcoming "utf8" char type char8_t and their corresponding string type in C++20, and can say, up to a point, that it's about time. Also that it's a mess.

Feel free to correct me where I'm wrong:

  • C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14) .
  • Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.
  • C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.
  • C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.
  • Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.
  • Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.

Given all of the above, I have some questions regarding why are we in this weird status and if it'll ever get better. Historically Unicode support has been one of the lowest points of C++.

Similarly, am wondering how useful is a poor-man's-emulation of the whole concept (disclaimer: am the maintainer of cxxomfort, I already backport lots of things. Work needs: latest MSVC target at the office is MSVC 2012).

  • Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?
  • Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".
  • Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?
  • If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?
  • What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?

re: char8_t as unsigned char, this is more or less what I'm looking at in terms of pseudocode:

// this is here basically only for type-distinctiveness
class char8_t {
  unsigned char value;

  public:
  // deliberately not explicit: unsigned char converts implicitly
  constexpr char8_t (unsigned char ch = 0x00) noexcept;
  operator unsigned char () const noexcept;
  // implement all operators to mirror operations on unsigned char

  // public adapter jic
  friend unsigned char to_char (char8_t);
};

// note we're *not* using our new char-type here
namespace std {
  typedef std::basic_string<unsigned char> u8string;
}

// unsure if these two would actually be needed
// (couldn't make a compelling case so far,
// even testing with Windows's broken conhost)

namespace std {
  basic_istream<char8_t> u8cin;
  basic_ostream<char8_t> u8cout;
}

// we work up operator<<, operator>> and string conversion from there
// adding utf8-validity checks where needed

std::ostream& operator<< (std::ostream&, std::u8string const&);
std::istream& operator>> (std::istream&, std::u8string&);

// likely a macro; we'll see
#define u8c(ch) static_cast<char8_t>(ch)
// char8_t ch = u8c('x');

// very likely not a macro pre-C++20; can't skip utf-8 validity check on [2]?
u8string u8s (char8_t const* str); // [1], likely trivial
u8string u8s (char const* str);    // [2], non-trivial
// C++20 and up
#define u8s(str) u8##str // or something; not sure

// end result:

// no, I can't even think how would one spell this:
u8string text = u8s("H€łlo Ẅørλd");
// this wouldn't work without refactoring u8string into a full specialization, 
// to add the required constructor, but doing so is a PITA because 
// the basic_string interface is YAIM (yet another infamous mess):
u8string text = u8"H€łlo Ẅørλd";

I've tagged this C++ as a general, but this is more about (the value of) implementation for Standards pre-C++20. More importantly, I'm not looking for "perfect" solutions or justifications; given the context, poor-man's is more than good enough.

Recommended answer

I'm the author of the P0482 and P1423 char8_t papers.

Also that it's a mess.

I completely agree. SG16 is working to improve all things Unicode and text related, but we're having to start near ground level, so it is going to take a while.

If you haven't seen it yet, the repository linked below provides some utilities for writing code that will work in C++17 and C++20.

C++, any standards, have no means to specify that the source code has a given text encoding (something like Python's # encoding:... metadata), nor what Standards can it be compiled into (like say #!/bin/env g++ -std=c++14).

This is correct, but not without precedent. IBM's xlC compiler supports a #pragma filetag directive that behaves similarly to Python's encoding declaration. I started on a paper exploring this space and had hoped to submit it for the Prague meeting, but did not complete it in time. I expect to submit it for the Varna meeting (in June).

Up until C++11, there was also no way to specify that any given string literal would have a given encoding - the compiler was free to reparse a UTF8 string literal into say UTF16 or even EBCDIC if it so desired.

Correct, and this technically remained true for char16_t and char32_t string literals until C++20 and the adoption of P1041. Note though that there is no reparsing going on. In translation phase 1, the source code contents are converted to the compiler's internal encoding and then in translation phase 5, character and string literals are converted to the encoding of the appropriate execution character set.
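As a minimal sketch of what the u8 prefix does guarantee, the following checks the code units of a u8 literal at compile time and at run time. The names euro and code_unit are made up for this illustration, and the character is spelled as a universal-character-name so the result does not depend on how the source file itself is encoded.

```cpp
#include <cassert>
#include <cstddef>

// U+20AC (EURO SIGN) written as \u20AC, so the encoding of this source file
// plays no role. The element type is char before C++20 and char8_t from
// C++20 on; the stored bytes are UTF-8 in both cases.
constexpr auto& euro = u8"\u20AC";

// Three UTF-8 code units plus the terminating null.
static_assert(sizeof(euro) == 4, "u8 literals are UTF-8 encoded");

// Hypothetical helper: reads code unit i as an unsigned value.
// The cast works for char and for char8_t alike.
inline unsigned code_unit(std::size_t i) {
    assert(i < sizeof(euro));
    return static_cast<unsigned char>(euro[i]);
}
```

An ordinary "\u20AC" literal carries no such guarantee: its bytes depend on the execution character set the compiler was configured with.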

C++11 introduces u16"text" and u32"text" and associated char types to produce UTF16 and UTF32-encoded text, but does not provide string or stream facilities to work with them, so they're basically useless.

Correct. P1629 is one of the more significant changes we're hoping to complete for C++23. The goal is to provide text encoders, decoders, and transcoders that facilitate working with text at the code unit and code point levels. We would also provide support for enumerating grapheme clusters.

C++11 also introduces u8"text" for producing an UTF8-encoded string... but does not even introduce either a proper UTF8 char type or string type (that's what char8_t is intended to be in C++20?), so it's even uselesser than the above.

Correct. The goal for C++20 was to 1) enable differentiating "text" and u8"text" in the type system, 2) enable separating locale dependent and UTF-8 text (with enforcement from the type system), 3) ensure use of an unsigned type for UTF-8 code units, and 4) avoid the char type aliasing penalty. That was all we had time to get done for C++20 (standardization is not a rapid process).
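Goals 1 and 2 can be illustrated with an overload set. This is only a sketch: text_kind, kind, and has_native_char8_t are hypothetical names, and the standard __cpp_char8_t feature-test macro is used so the same file compiles both before and after C++20.

```cpp
#include <cassert>

// Hypothetical tag describing which kind of text an overload received.
enum class text_kind { narrow, utf8 };

// Ordinary narrow text: the encoding depends on the execution character set.
inline text_kind kind(const char* s) {
    assert(s != nullptr);
    return text_kind::narrow;
}

#if defined(__cpp_char8_t)
// Only available when the compiler provides the native char8_t type.
inline text_kind kind(const char8_t* s) {
    assert(s != nullptr);
    return text_kind::utf8;
}
constexpr bool has_native_char8_t = true;
#else
constexpr bool has_native_char8_t = false;
#endif
```

With char8_t available, kind(u8"x") selects the UTF-8 overload; without it, the u8 literal is indistinguishable from "x" and lands in the narrow overload, which is the pre-C++20 limitation described above.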

Because of all this, when char8_t is finally introduced, it kills lots of code that was intended to be valid and so far some of the remediations sought include disabling char8_t behaviour altogether.

Correct, char8_t was proposed as a breaking change; something not to be taken lightly. In this case, it was deemed acceptable because 1) code searches found little use of u8 character and string literals, 2) the options for addressing backward compatibility concerns as discussed in P1423 were considered adequate, and 3) a non-breaking proposal would have added long term baggage to the language for little gain.

Even then, there's no readily available tooling (as in: not the same crap tier interface as <random>) to check, transform (within the same string) or convert (copying across string types) text encodings in C++. Even codecvt seems to have been dropped.

Correct. We'll be working to improve this situation, but it will take time. codecvt has not been dropped (yet); the <codecvt> header and various UTF converters were deprecated in C++17. std::codecvt suffers from performance and usability issues, so is not considered something we can continue to build on. We believe P1629 is a superior direction.
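Until such facilities exist, the "check" part is typically hand-rolled. The following is a minimal sketch of a UTF-8 well-formedness test; is_valid_utf8 is a hypothetical name, not a standard or P1629 interface, and the function only validates (it does not transcode or normalize).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if [data, data + size) is well-formed UTF-8.
// Rejects stray continuation bytes, truncated sequences, overlong forms,
// surrogates, and code points above U+10FFFF.
inline bool is_valid_utf8(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::size_t i = 0;
    while (i < size) {
        const unsigned char lead = data[i];
        std::size_t len = 0;
        unsigned long min_cp = 0;  // smallest code point allowed at this length
        unsigned long cp = 0;
        if (lead < 0x80)                { len = 1; min_cp = 0x0;     cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; min_cp = 0x80;    cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; min_cp = 0x800;   cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; min_cp = 0x10000; cp = lead & 0x07; }
        else return false;                   // continuation byte or invalid lead
        if (size - i < len) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = data[i + k];
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                 // outside Unicode
        i += len;
    }
    return true;
}

// Convenience overload for bytes held in a std::string.
inline bool is_valid_utf8(const std::string& s) {
    return is_valid_utf8(reinterpret_cast<const unsigned char*>(s.data()), s.size());
}
```

A check like this is what the question's u8s(char const*) overload would need to run before labelling arbitrary char data as UTF-8.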

Why did C++ not add char8_t at the proper time when u8"text" was introduced or otherwise delay introduction of u8?

I asked one of the C++ committee members who was involved in that original effort. He told me that he asked the people working on Unicode at the time if a new type should be added and the response was, "eh, we don't need it".

Alternatively, why wasn't another, non-breaking prefix like c8"text" introduced with char8_t in C++20 instead of introducing a wide-scope breaking change? I thought TPTB hated breaking changes, even more something that literally breaks the simplest possible case: cout<< prefix"hello world".

A different prefix was considered and at one point I briefly favored that approach. However, as mentioned earlier, that would have left us with two ways of spelling UTF-8 literals and related historical baggage. In the long run, it was felt that a breaking change, so long as we had reasonable means to mitigate the breakage, offered more benefits.

With regard to that simple test case, take a minute to think about what that code should do. Then go read this: "What is the printf() formatting character for char8_t *?"
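For code that simply wants the bytes written out unchanged, one common stopgap is to view the code units as plain char before streaming. The sketch below assumes that is the desired behaviour; as_char_ptr and demo are hypothetical names and are not taken from P1423 or any library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: views one-byte code units as plain char so they can
// be passed to char-based APIs. Reading any object through char is
// permitted, so the cast itself is well defined.
template <typename CharT>
const char* as_char_ptr(const CharT* s) {
    static_assert(sizeof(CharT) == 1, "expects a one-byte code unit type");
    assert(s != nullptr);
    return reinterpret_cast<const char*>(s);
}

// Streams a u8 literal to a narrow stream; compiles whether the literal's
// element type is char (pre-C++20) or char8_t (C++20).
inline std::string demo() {
    std::ostringstream out;
    out << as_char_ptr(u8"hello world");
    return out.str();
}
```

Note what this does not do: the stream still treats the bytes as text in its own narrow encoding, so non-ASCII content only displays correctly when the destination actually expects UTF-8.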

Is char8_t intended to functionally be (closer to) an alias of unsigned char or of char?

char8_t is intentionally and explicitly not an alias (because that has negative performance implications) but is specified to have the same underlying representation as unsigned char. The reason for unsigned char over char is to avoid expressions like u8'\x80' < 0 ever evaluating to true (which may or may not be the case with char today).
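The signedness point can be checked directly. This is a small sketch with a hypothetical helper name, high_unit_is_negative; the char8_t assertions are only compiled when the __cpp_char8_t feature-test macro is defined.

```cpp
#include <cassert>
#include <type_traits>

// True if the code unit value 0x80, stored as Unit, compares below zero.
// The value is widened to int first so the comparison is done on ints.
template <typename Unit>
constexpr bool high_unit_is_negative() {
    return static_cast<int>(static_cast<Unit>(0x80)) < 0;
}

// unsigned char never goes negative.
static_assert(!high_unit_is_negative<unsigned char>(), "");

// Plain char follows the platform: negative exactly where char is signed
// (assuming the usual two's complement conversion before C++20).
static_assert(high_unit_is_negative<char>() == std::is_signed<char>::value, "");

#if defined(__cpp_char8_t)
// char8_t: unsigned, yet a distinct type rather than an alias.
static_assert(std::is_unsigned<char8_t>::value, "");
static_assert(!std::is_same<char8_t, unsigned char>::value, "");
static_assert(!high_unit_is_negative<char8_t>(), "");
#endif
```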

If the former, is working up the way to eg.: typedef std::basic_string<unsigned char> u8string a viable emulation strategy? Are there backport / reference implementations available one can look into before writing my own?

I won't comment on whether this approach is a good idea or not, but it has been done before. For example, EASTL has such a typedef. (That project also provides a definition of char8_t if the native type isn't available.)

What's the closest we have in C++17-or-below to marking text as (intended to be) UTF-8 *for storage only*?

I don't think there is one right answer to this question. I've seen projects use unsigned char or provide a char8_t like type via a class.
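As one possible shape for the "char8_t like type" option, a scoped enumeration with unsigned char as its underlying type gives a distinct one-byte type without writing a full class. Everything here (the poor_man namespace, char8, to_uchar, store_utf8) is hypothetical, and the storage helper assumes its input is already UTF-8 rather than checking it.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace poor_man {

// Hypothetical stand-in for char8_t: same size as unsigned char, but a
// separate type, so overloads and templates can tell it apart.
enum class char8 : unsigned char {};

static_assert(sizeof(char8) == 1, "");
static_assert(!std::is_same<char8, unsigned char>::value, "");

// Explicit adapter back to the raw code unit value.
inline unsigned char to_uchar(char8 c) {
    return static_cast<unsigned char>(c);
}

// Copies bytes that the caller asserts are UTF-8 into char8 storage.
// No validation happens here; this only attaches the type-level label.
inline std::vector<char8> store_utf8(const char* bytes, std::size_t size) {
    assert(bytes != nullptr || size == 0);
    std::vector<char8> out;
    out.reserve(size);
    for (std::size_t i = 0; i < size; ++i) {
        out.push_back(static_cast<char8>(static_cast<unsigned char>(bytes[i])));
    }
    return out;
}

}  // namespace poor_man
```

A std::vector is used for storage here because std::basic_string over a user-defined character type additionally needs a matching character traits class, which is extra work this sketch leaves out.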

With regard to your pseudocode, some tweaks to the code in the previously mentioned char8_t-remediation repository to provide unsigned char types instead of char should enable code like the following to work. See the definitions of the _as_char user-defined literals and U8 macro.

typedef std::basic_string<unsigned char> u8string;
u8string u8s(U8("text"));
