Flow of raw bytes of string literal into/out of the Windows (non-wide) execution character set at compile/runtime, & ANSI code pages vs. UTF-8


Problem Description

I would like confirmation regarding my understanding of raw string literals and the (non-wide) execution character set on Windows.

Relevant paragraphs for which I desire specific confirmation are in BOLD. But first, some background.


BACKGROUND

(relevant questions are in the paragraphs below in bold)

As a result of the helpful discussion beneath @TheUndeadFish's answer to this question that I posted yesterday, I have attempted to understand the rules determining the character set and encoding used as the execution character set in MSVC on Windows (in the C++ specification sense of execution character set; see @DietmarKühl's posting).

I suspect that some might consider it a waste of time to even bother trying to understand the ANSI-related behavior of char * (i.e., non-wide) strings for non-ASCII characters in MSVC.

For example, consider @IInspectable's comment here:

You cannot throw a UTF-8 encoded string at the ANSI version of a Windows API and hope for anything sane to happen.

Please note that in my current i18n project on a Windows MFC-based application, I will be removing all calls to the non-wide (i.e., ANSI) versions of the API, and I expect the compiler to generate execution wide-character set strings, NOT (non-wide) execution character set strings, internally.

However, I want to understand the existing code, which already has some internationalization that uses the ANSI API functions. Even if some consider the behavior of the ANSI API on non-ASCII strings to be insane, I want to understand it.

I think that, like others, I have found it difficult to locate clear documentation about the non-wide execution character set on Windows.

In particular, because the (non-wide) execution character set is defined by the C++ standard to be a sequence of char (as opposed to wchar_t), UTF-16 cannot be used internally to store characters in the non-wide execution character set. In this day and age, it makes sense that the Unicode character set, encoded via UTF-8 (a char-based encoding), would therefore be used as the character set and encoding of the execution character set. To my understanding, this is the case on Linux. However, sadly, this is not the case on Windows, even in MSVC 2013.

This leads to the first of my two questions.


Question #1: Please confirm that I'm correct in the following paragraph.

With this background, here's my question. In MSVC, including VS 2013, it seems that the execution character set is one of the (many possible) ANSI character sets, using one of the (many possible) code pages corresponding to that particular ANSI character set to define the encoding - rather than the Unicode character set with UTF-8 encoding. (Note that I am asking about the NON-WIDE execution character set.) Is this correct?


BACKGROUND, CONTINUED (assuming I'm correct in Question #1)

If I understand things correctly, then the above bolded paragraph is arguably a large part of the cause of the "insanity" of using the ANSI API on Windows.

Specifically, consider the "sane" case - in which Unicode and UTF-8 are used as the execution character set.

In this case, it does not matter what machine the code is compiled on, or when, and it does not matter what machine the code runs on, or when. The actual raw bytes of a string literal will always be internally represented in the Unicode character set with UTF-8 as the encoding, and the runtime system will always treat such strings, semantically, as UTF-8.

No such luck in the "insane" case (if I understand correctly), in which ANSI character sets and code page encodings are used as the execution character set. In this case (the Windows world), the runtime behavior may be affected by the machine that the code is compiled on, in comparison with the machine the code runs on.


Here, then, is Question #2: Again, please confirm that I'm correct in the following paragraph.

With this continued background in mind, I suspect that: Specifically, with MSVC, the execution character set and its encoding depends in some not-so-easy-to-understand way on the locale selected by the compiler on the machine the compiler is running on, at the time of compilation. This will determine the raw bytes for character literals that are 'burned into' the executable. And, at run-time, the MSVC C runtime library may be using a different execution character set and encoding to interpret the raw bytes of character literals that were burned into the executable. Am I correct?

(I may add examples into this question at some point.)


FINAL COMMENTS

Fundamentally, if I understand correctly, the above bolded paragraph explains the "insanity" of using the ANSI API on Windows. Due to the possible difference between the ANSI character set and encoding chosen by the compiler and the ANSI character set and encoding chosen by the C runtime, non-ASCII characters in string literals may not appear as expected in a running MSVC program when the ANSI API is used in the program.

(Note that the ANSI "insanity" really only applies to string literals, because according to the C++ standard the actual source code must be written in a subset of ASCII (and source code comments are discarded by the compiler).)

The description above is the best current understanding I have of the ANSI API on Windows in regards to string literals. I would like confirmation that my explanation is well-formed and that my understanding is correct.

Solution

This is a very long story, and I have trouble finding a single clear question in it. However, I think I can resolve a number of the misunderstandings that led to it.

First off, "ANSI" is a synonym for the (narrow) execution character set. UTF-16 is the execution wide-character set.

The compiler will NOT choose for you. If you use narrow char strings, they are ANSI as far as the compiler (runtime) is aware.

Yes, the particular "ANSI" character encoding can matter. If you compile an L"ä" literal on your PC, and your source code is in CP1252, then that ä character is compiled to a UTF-16 ä. However, the same byte could be another non-ASCII character in other encodings, which would result in a different UTF-16 character.

Note however that MSVC is perfectly capable of compiling both UTF-8 and UTF-16 source code, as long as the file starts with a U+FEFF BOM (byte order mark). This makes the whole theoretical problem pretty much a non-issue.

[edit] "Specifically, with MSVC, the execution character set and its encoding depends..."

No, MSVC has nothing to do with the execution character set, really. The meaning of char(0xE4) is determined by the OS. To see this, check the MinGW compiler. Executables produced by MinGW behave the same as those of MSVC, as both target the same OS.
