How are u8-literals supposed to work?

Question

I'm having trouble understanding the semantics of u8-literals, or rather, understanding the result on g++ 4.8.1.

This is my expectation:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);

This is the result on g++ 4.8.1:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() == 3);




  • The source file is ISO-8859(-1)
  • We use the following compiler directives: -m64 -std=c++11 -pthread -O3 -fpic

In my world, regardless of the encoding of the source file, the resulting utf8 string should be longer than 3.

Or have I totally misunderstood the semantics of u8 and the use case it targets? Please enlighten me.

Update

If I explicitly tell the compiler what encoding the source file is in, as many suggested, I get the expected behavior for u8 literals. But regular literals also get encoded as UTF-8.

That is:

const std::string utf8 = u8"åäö"; // or some other extended ASCII characters
assert( utf8.size() > 3);
assert( utf8 == "åäö");
    




  • Compiler directives: g++ -m64 -std=c++11 -pthread -O3 -finput-charset=ISO8859-1
  • Tried some other charsets defined by iconv, e.g. ISO_8859-1, and so on...

I'm even more confused now than before...

Recommended answer

The u8 prefix really just means "when compiling this code, generate a UTF-8 string from this literal". It says nothing about how the compiler should interpret the literal as it appears in the source file.

So you have several factors at play:


1. Which encoding is the source file written in? (In your case, apparently ISO-8859.) According to this encoding, the string literal is "åäö" (3 bytes, containing the values 0xe5, 0xe4, 0xf6).
2. Which encoding does the compiler assume when reading the source file? (I suspect that GCC defaults to UTF-8, but I could be wrong.)
3. Which encoding does the compiler use for the generated string in the object file? You specify this to be UTF-8 via the u8 prefix.

Most likely, #2 is where this goes wrong. If the compiler interprets the source file as ISO-8859, it will read the three characters, convert them to UTF-8, and write those out, giving you a 6-byte string (each of these characters encodes to 2 bytes in UTF-8).

However, if it assumes the source file to be UTF-8, then it won't need to do a conversion at all: it reads 3 bytes, which it assumes are UTF-8 (even though they're invalid garbage values for UTF-8), and since you asked for the output string to be UTF-8 as well, it just outputs those same 3 bytes.

You can tell GCC which source encoding to assume with -finput-charset, or you can encode the source as UTF-8, or you can use \uXXXX escape sequences in the string literal (\u00E5 instead of å, for example).
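To illustrate the last option, here is a minimal sketch that spells "åäö" with \uXXXX escapes, so the source file stays pure ASCII and the result no longer depends on how the compiler reads the file. The names kUtf8Length, utf8_text and check_escaped_literal are made up for this example. The cast is only there so the same code also builds as C++20, where u8 literals have type const char8_t[].

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte length of the UTF-8 encoding of "åäö". sizeof counts the
// terminating NUL, hence the "- 1".
constexpr std::size_t kUtf8Length = sizeof(u8"\u00E5\u00E4\u00F6") - 1;

static_assert(kUtf8Length == 6,
              "each of the three characters takes 2 bytes in UTF-8");

// Hypothetical helper: returns the literal's bytes as a std::string.
inline std::string utf8_text() {
    return std::string(
        reinterpret_cast<const char*>(u8"\u00E5\u00E4\u00F6"), kUtf8Length);
}

// Usage: the first character, U+00E5, is stored as the bytes C3 A5.
inline void check_escaped_literal() {
    const std::string text = utf8_text();
    assert(text.size() == 6);
    assert(static_cast<unsigned char>(text[0]) == 0xC3);
    assert(static_cast<unsigned char>(text[1]) == 0xA5);
}
```

With this spelling, the assertion from the question (size greater than 3) should hold whatever -finput-charset is set to, because the literal contains no non-ASCII source bytes.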

To clarify a bit: when you specify a string literal with the u8 prefix in your source code, you are telling the compiler "regardless of which encoding you used when reading the source text, please convert it to UTF-8 when writing it out to the object file". You are saying nothing about how the source text should be interpreted. That is up to the compiler to decide (perhaps based on which flags you passed to it, perhaps based on the process's environment, or perhaps just using a hardcoded default).

If the string in your source text contains the bytes 0xe5, 0xe4, 0xf6, and you tell the compiler that "the source text is encoded as ISO-8859", then the compiler will recognize that the string consists of the characters "åäö". It will see the u8 prefix and convert these characters to UTF-8, writing the byte sequence 0xc3, 0xa5, 0xc3, 0xa4, 0xc3, 0xb6 to the object file. In this case, you end up with a valid UTF-8 encoded text string containing the UTF-8 representation of the characters "åäö".
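The conversion described above can be sketched by hand. This is not GCC's code, only an illustration of the byte arithmetic. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point (U+0000 to U+00FF). The names latin1_to_utf8 and check_latin1_conversion are hypothetical.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8.
inline std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            // ASCII range: identical single byte in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: two bytes, 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}

// Usage: the ISO-8859-1 bytes for "åäö" become six UTF-8 bytes.
inline void check_latin1_conversion() {
    const std::string utf8 = latin1_to_utf8("\xE5\xE4\xF6");
    assert(utf8.size() == 6);
    assert(utf8 == "\xC3\xA5\xC3\xA4\xC3\xB6");
}
```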

However, if the string in your source text contains the same bytes, and you make the compiler believe that the source text is encoded as UTF-8, then there are two things the compiler may do (depending on the implementation):


• It might try to parse the bytes as UTF-8, in which case it will recognize that "this is not a valid UTF-8 sequence" and issue an error. This is what Clang does.
• Alternatively, it might say "OK, I have 3 bytes here, and I am told to assume that they form a valid UTF-8 string. I'll hold on to them and see what happens". Then, when it is supposed to write the string to the object file, it goes "OK, I have these 3 bytes from before, which are marked as being UTF-8. The u8 prefix here means that I am supposed to write this string as UTF-8. Cool, no need to do a conversion then. I'll just write these 3 bytes and I'm done". This is what GCC does.
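To see why the first option ends in an error, here is a simplified structural UTF-8 check. It is a sketch, not what Clang actually runs: it only verifies that each lead byte is followed by the right number of continuation bytes, and it does not reject overlong encodings or surrogates. The names looks_like_utf8 and check_validation are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: simplified structural UTF-8 check.
inline bool looks_like_utf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuation = 0;
        if (lead < 0x80)                continuation = 0;  // ASCII
        else if ((lead & 0xE0) == 0xC0) continuation = 1;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) continuation = 2;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) continuation = 3;  // 11110xxx
        else return false;  // stray continuation byte or invalid lead

        // Reject a sequence that is cut off by the end of the input.
        if (continuation > bytes.size() - i - 1) return false;

        // Every continuation byte must look like 10xxxxxx.
        for (std::size_t k = 1; k <= continuation; ++k) {
            const unsigned char next =
                static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80) return false;
        }
        i += continuation + 1;
    }
    return true;
}

// Usage: the raw ISO-8859-1 bytes fail, the converted bytes pass.
inline void check_validation() {
    assert(!looks_like_utf8("\xE5\xE4\xF6"));
    assert(looks_like_utf8("\xC3\xA5\xC3\xA4\xC3\xB6"));
}
```

The byte 0xe5 looks like the start of a 3-byte sequence, but 0xe4 is not a continuation byte, so the check fails on the very first character.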

Both are valid. The C++ language doesn't state that the compiler is required to check the validity of the string literals you pass to it.

But in both cases, note that the u8 prefix has nothing to do with your problem. It just tells the compiler to convert from "whatever encoding the string had when you read it" to UTF-8. But even before this conversion, the string was already garbled, because the bytes corresponded to ISO-8859 character data, but the compiler believed them to be UTF-8 (because you didn't tell it otherwise).

The problem you are seeing is simply that the compiler didn't know which encoding to use when reading the string literal from your source file.

The other thing you are noticing is that a "traditional" string literal, with no prefix, is going to be encoded with whatever encoding the compiler likes. The u8 prefix (and the corresponding UTF-16 and UTF-32 prefixes) was introduced precisely to allow you to specify which encoding you want the compiler to write the output in. The plain prefix-less literals do not specify an encoding at all, leaving it up to the compiler to decide on one.
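As a small illustration of the prefixed forms, the sketch below counts code units for the same three characters under the u8, u (UTF-16) and U (UTF-32) prefixes. The characters are written as \uXXXX escapes so the source encoding plays no role. The helper code_units is a hypothetical name. The unprefixed literal is deliberately left out of the checks, since its length depends on the execution character set the compiler picks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating NUL.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// UTF-8: each of these characters needs two 1-byte code units.
static_assert(code_units(u8"\u00E5\u00E4\u00F6") == 6, "UTF-8");
// UTF-16: one 16-bit code unit per character.
static_assert(code_units(u"\u00E5\u00E4\u00F6") == 3, "UTF-16");
// UTF-32: one 32-bit code unit per character.
static_assert(code_units(U"\u00E5\u00E4\u00F6") == 3, "UTF-32");

// Usage: also callable at run time on any literal.
inline void check_code_units() {
    assert(code_units("abc") == 3);
    assert(code_units(u8"\u00E5") == 2);
}
```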
