How is const std::wstring encoded and how to change to UTF-16

Question

I created this minimal working C++ example to compare the bytes (by their hex representation) in a std::string and a std::wstring when defining a string with German non-ASCII characters in either type.

#include <iostream>
#include <iomanip>
#include <string>

int main(int, char**) {
    std::wstring wstr = L"äöüß";
    std::string str = "äöüß";

    for ( unsigned char c : str ) {
        std::cout << std::setw(2) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    for ( wchar_t c : wstr ) {
        std::cout << std::setw(4) << std::setfill('0') << std::hex << static_cast<unsigned short>(c) << ' ';
    }
    std::cout << std::endl;

    return 0;
}

The output of this snippet is

c3 a4 c3 b6 c3 bc c3 9f 
00c3 00a4 00c3 00b6 00c3 00bc 00c3 0178

I ran this on a PC running Windows 10 64-bit Pro, compiling with MSVC 2019 Community Edition, version 16.8.1, using CMake as the build system with the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.0.0)
project(wstring VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

add_executable(wstring main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I read that std::strings are based on the char type, which is a single byte. The output of my snippet indicates that str (the std::string variable) is UTF-8 encoded. I also read that Microsoft compilers use 2-byte wchar_ts to make up std::wstrings (instead of the 4-byte wchar_ts used by e.g. GNU gcc), so I would expect wstr (the std::wstring variable) to be (some kind of) UTF-16 encoded. But I cannot figure out why the "ß" (Latin sharp s) is encoded as 0x00c30178; I had expected 0x00df instead. Could somebody please tell me:

  • Why is this happening?
  • How can I end up with UTF-16 encoded std::wstrings (Big Endian would be fine, and I do not mind a BOM)? Do I need to tell the compiler somehow?
  • What kind of encoding is this?

Changed the title, as it did not fit the questions properly (and UTF-8 and UTF-16 are actually different encodings, so I knew the answer myself already...)

Forgot to mention: I use the amd64 target of the mentioned compiler.

After adding the /utf-8 flag, as pointed out in the comments by dxiv (see his linked SO post), I get the desired output

c3 a4 c3 b6 c3 bc c3 9f
00e4 00f6 00fc 00df

which looks like UTF-16-BE (no BOM) to me. As I had issues with the correct order of CMake commands, this is my current CMakeLists.txt file. It is important to put the add_compile_options command before the add_executable command (I added the NOTICE message for convenience):

cmake_minimum_required(VERSION 3.0.0)
project(enctest VERSION 0.1.0)

set(CMAKE_CXX_STANDARD 17)

include(CTest)
enable_testing()

if (MSVC)
  message(NOTICE "compiling with MSVC")
  add_compile_options(/utf-8)
endif()

add_executable(enctest main.cpp)

set(CPACK_PROJECT_NAME ${PROJECT_NAME})
set(CPACK_PROJECT_VERSION ${PROJECT_VERSION})
include(CPack)

I find the if-endif way more readable than the generator-expression syntax, but writing add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>") instead would work as well.

Note: For Qt projects there is a nice switch for the .pro file (see this Qt forum post)

win32 {
    QMAKE_CXXFLAGS += /utf-8
}

The first part of my question is still open: what encoding is 0x00c30178 for "ß" (Latin sharp s)?

Answer

As clarified in the comments, the source .cpp file is UTF-8 encoded. Without a BOM, and without an explicit /source-charset:utf-8 switch, the Visual C++ compiler assumes by default that the source file is saved in the active codepage encoding. From the Set Source Character Set documentation:

By default, Visual Studio detects a byte-order mark to determine if the source file is in an encoded Unicode format, for example, UTF-16 or UTF-8. If no byte-order mark is found, it assumes the source file is encoded using the current user code page, unless you specify a character set name or code page by using the /source-charset option.

The UTF-8 encoding of äöüß is C3 A4 C3 B6 C3 BC C3 9F, and therefore the line:

    std::wstring wstr = L"äöüß";

is seen by the compiler as:

    std::wstring wstr = L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F";

Assuming the active codepage to be the usual Windows-1252, the (extended) characters map as:

    win-1252    char    unicode

      \xC3       Ã       U+00C3
      \xA4       ¤       U+00A4
      \xB6       ¶       U+00B6
      \xBC       ¼       U+00BC
      \x9F       Ÿ       U+0178

Therefore L"\xC3\xA4\xC3\xB6\xC3\xBC\xC3\x9F" gets translated to:

    std::wstring wstr = L"\u00C3\u00A4\u00C3\u00B6\u00C3\u00BC\u00C3\u0178";

To avoid such (mis)translation, Visual C++ needs to be told that the source file is encoded as UTF-8 by passing an explicit /source-charset:utf-8 (or /utf-8) compiler switch. For CMake-based projects, this can be done using add_compile_options, as shown in "Possible to force CMake/MSVC to use UTF-8 encoding for source files without a BOM? C4819".
