What is the printf() formatting character for char8_t *?


Question

And for char8_t?

I assume there is a C++20 decision on this somewhere, but I could not find it. There is also P1428, but that paper does not mention anything about the printf() family versus char8_t * or char8_t.

The advice to use std::cout might be an answer. Unfortunately, that no longer compiles.

// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout <<  u8"A2";
std::cout <<  char8_t ('A');

As for C 2.x and char8_t, please start from here.

Update

I have done some more tests with a single element from a u8 sequence, and that indeed does not work. Passing char8_t * to printf("%s") does work, but passing char8_t to printf("%c") is an accident waiting to happen.

Please see https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd. The problem is that, as per the current status quo, char8_t is not implemented, while char8_t * is. Let me repeat: there is no implemented type to hold a single element from a char8_t * sequence.

If you want a single u8 glyph, you need to code it as a u8 string:

char8_t const * single_glyph = u8"ア";

At present, the surest way to print the above seems to be:

// works with warnings
std::printf("%s", single_glyph ) ;

To start reading on this subject, these two papers are probably required:

  1. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
  2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.

My primary development environment is Visual Studio 2019, with both MSVC and Clang 8.0.1 as delivered with VS, using /std:c++latest. The dev machine runs Windows 10 [Version 10.0.18362.476].

Recommended answer

I'm the author of the P0482 and P1423 char8_t proposals for C++ and the N2231 proposal for C (which has not yet been accepted).

Let's think about what the following should do:

printf("Hello %s\n", u8"Jöel");
std::cout << "Hello " << u8"Jöel" << "\n";

Actually, let's take a further step back. What encoding is expected on the receiving side of standard output? There are a few possibilities. If standard output is connected to a console/terminal, then the expected encoding is the one that the console/terminal is configured for. On a Windows system in the United States, this is likely to be CP437. On a UNIX/Linux system, this is likely UTF-8. On a z/OS system in the United States, this is likely EBCDIC code page 037. If standard output has been redirected, then the expected encoding is likely locale dependent. On a Windows system in the United States, that would mean the Active Code Page (ACP), likely Windows 1252. On UNIX/Linux and z/OS, it would likely be the same as the console/terminal (Windows is the odd system here in that it has different defaults for console encoding and locale encoding).

Back to that example code. What is the expected or desired behavior for that UTF-8 encoded ö character (U+00F6, {LATIN SMALL LETTER O WITH DIAERESIS}, encoded as 0xC3 0xB6)? For Windows writing to the console, the encoded sequence would need to be transcoded to 0x94 for the character to display properly, while for Windows where locale dependent output is expected, it would need to be transcoded to 0xF6. For UNIX/Linux, the sequence should probably be passed through. For z/OS, it may need to be transcoded to 0xCC. But on all of these systems, these defaults are configurable (e.g., via the LANG environment variable).

Assuming that transcoding to a run-time determined encoding is the desired behavior, how should transcoding errors be handled? For example, what should happen if the target encoding lacks a representation for ö? What if an ill-formed UTF-8 sequence is present? Should printf stop and report an error? Should std::cout throw an exception? Or should an implementation-defined character such as U+FFFD {REPLACEMENT CHARACTER} or ? be substituted?

What should happen if std::cout is imbued with a std::codecvt facet? Presumably that facet will expect incoming text to be in a particular encoding. Should UTF-8 text be transcoded to one of the execution character set, the locale dependent encoding, or the console/terminal encoding before being presented to the facet? If so, which one? Should the implementation have to be aware of whether the stream is connected to a console/terminal? What if the programmer wants to override the default and, for example, always write UTF-8?

These are rather difficult questions that we don't have good answers for. std::u8out has been suggested as a way to explicitly opt in to UTF-8, but it doesn't solve the problems of expected standard output encoding, issues with codecvt facets, and other iostreams problems like implicit locale dependent formatting.

Personally, in order to provide good Unicode support going forward, I think we're going to have to invest in a replacement for iostreams that 1) provides byte output with text support layered on top, 2) is encoding aware (in the text layer), 3) is locale independent (but with explicit opt-in support for locale dependent formatting like that provided by std::format), and 4) is more performant than iostreams.

SG16 would like to hear your thoughts and suggestions. See https://github.com/sg16-unicode/sg16 for contact information.
