WChars, Encodings, Standards and Portability


Question


The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, "Do I understand the C standard correctly and is this the right way to go about things?"

I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:

Portability and serialization are orthogonal concepts.

Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. "Portable" means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning.

When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:

  • wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about "encodings"; in fact, it is entirely agnostic to any text or encoding properties. It only says "your entry point is main(int, char**); you get a type wchar_t which can hold all your system's characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa."

  • iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.

The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.
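As a sketch of that serialization side, here is a minimal iconv decoding helper. It assumes a POSIX-style system whose iconv knows the "UTF-32LE" encoding name (glibc and GNU libiconv both do); the function name is mine, not part of any API discussed here.

```cpp
#include <iconv.h>

#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Decode a UTF-8 byte string into UTF-32 code points via iconv.
// On a big-endian target you would request "UTF-32BE" instead.
std::vector<uint32_t> utf8ToCodepoints(const char* utf8)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    assert(cd != (iconv_t)-1);

    size_t inLeft = std::strlen(utf8);
    char* in = const_cast<char*>(utf8);          // iconv's historical char** API

    std::vector<uint32_t> out(inLeft);           // <= one code point per input byte
    char* outPtr = reinterpret_cast<char*>(out.data());
    size_t outLeft = out.size() * sizeof(uint32_t);

    size_t rc = iconv(cd, &in, &inLeft, &outPtr, &outLeft);
    iconv_close(cd);
    assert(rc != (size_t)-1);                    // input was valid UTF-8

    out.resize(out.size() - outLeft / sizeof(uint32_t));
    return out;
}
```

Note that iconv's interface is byte-oriented, so the output buffer is addressed as raw bytes even though it holds uint32_t code points; a full implementation would also loop on E2BIG and report EILSEQ for invalid input instead of asserting.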

So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:

                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+
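The CRT side of that diagram, the mbstowcs/wcstombs arrows, can be sketched with the restartable standard functions. The widen/narrow names are my own; a real program would call setlocale(LC_CTYPE, "") first so the ambient multibyte encoding is the environment's rather than the "C" default, and would check for the (size_t)-1 error return, both omitted here for brevity.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Multibyte (locale-encoded) bytes -> wchar_t string.
std::wstring widen(const char* mb)
{
    std::mbstate_t st{};
    const char* p = mb;
    std::size_t n = std::mbsrtowcs(nullptr, &p, 0, &st);  // measure only
    std::wstring w(n, L'\0');
    p = mb;
    st = std::mbstate_t{};
    std::mbsrtowcs(&w[0], &p, n, &st);
    return w;
}

// wchar_t string -> multibyte (locale-encoded) bytes.
std::string narrow(const std::wstring& w)
{
    std::mbstate_t st{};
    const wchar_t* p = w.c_str();
    std::size_t n = std::wcsrtombs(nullptr, &p, 0, &st);  // measure only
    std::string s(n, '\0');
    p = w.c_str();
    st = std::mbstate_t{};
    std::wcsrtombs(&s[0], &p, n, &st);
    return s;
}
```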

Practically, that means that I'd write two boiler-plate wrappers for my program entry point, e.g. for C++:

// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc

int wmain(const std::vector<std::wstring> args); // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
  setlocale(LC_CTYPE, "");
  int argc;
  wchar_t ** argv = CommandLineToArgvW(GetCommandLineW(), &argc);
  std::vector<std::wstring> args(argv, argv + argc);
  LocalFree(argv);  // CommandLineToArgvW allocates with LocalAlloc
  return wmain(args);
}
#else
extern "C" int main(int argc, char * argv[])
{
  setlocale(LC_CTYPE, "");
  return wmain(parse(argc, argv));
}
#endif
// Serialization utilities

#include <cstdint>
#include <string>

#include <iconv.h>

typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)

Updates

Following many very nice comments I'd like to add a few observations:

  • If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.

  • Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).

  • File systems: File systems don't seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of "how many characters" comprise a given file name, as there is no notion of "character" in the first place. Caveat emptor.
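As an illustration of the NTFS caveat above, a minimal well-formedness check for a char16_t sequence (e.g. a filename read back from the system) might look like this; the function name is an assumption of mine.

```cpp
#include <cstddef>
#include <string>

// True iff the sequence is well-formed UTF-16: every high surrogate
// (0xD800-0xDBFF) is immediately followed by a low surrogate
// (0xDC00-0xDFFF), and no low surrogate appears on its own.
bool is_valid_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {         // high surrogate: needs a partner
            if (i + 1 >= s.size()) return false;
            char16_t d = s[++i];
            if (d < 0xDC00 || d > 0xDFFF) return false;
        } else if (c >= 0xDC00 && c <= 0xDFFF) {  // naked low surrogate
            return false;
        }
    }
    return true;
}
```

A name that fails this check is still a legal NTFS filename, which is exactly why a blanket char16_t-to-UTF-8 conversion of a directory listing can fail.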

Solution

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++

No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

int main(int argc, char** argv)

you have already lost Unicode support for command line arguments. You have to write

int wmain(int argc, wchar_t** argv)

instead, or use the GetCommandLineW function, neither of which is specified in the C standard.

More specifically,

  • any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
  • Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
  • Encoding agnosticity usually just doesn't work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (but not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know which encoding you are working with, or use a wrapper library that abstracts it away.
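One hypothetical way to turn that encoding-awareness advice into code, a sketch of mine rather than a pattern from the answer itself, is to record the encoding in the type system, so a byte string can never be passed where a different encoding is expected:

```cpp
#include <cstddef>
#include <string>

enum class Enc { UTF8, Latin1 };

// A byte string whose encoding is part of its type.
template <Enc E>
struct Tagged {
    std::string bytes;
};

using Utf8String   = Tagged<Enc::UTF8>;
using Latin1String = Tagged<Enc::Latin1>;

// Demands UTF-8; passing a Latin1String here is a compile-time error.
std::size_t byteLength(const Utf8String& s) { return s.bytes.size(); }
```

Libraries like ICU take the alternative route: convert everything to one known internal form (UTF-16, in ICU's case) at the program boundary.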

I think I have to conclude that it's completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as "writing Greek characters to the console" or "supporting any filename allowed by the system in a correct manner", and such tasks are only the first tiny steps towards true Unicode support.
