How can my program switch from ASCII to Unicode?

Question

I want to write a program in C++ that should work on Unix and Windows. This program should be able to work in both Unicode and non-Unicode environments. Its behavior should depend only on the environment settings.

One of the nice features that I want to have is the ability to manipulate file names read from directories. These can be Unicode... or not.

What is the easiest way to achieve that?

Answer

I want to write a program in C++ that should work on Unix and Windows.

First, make sure you understand the difference between how Unix supports Unicode and how Windows supports Unicode.

In the pre-Unicode days, both platforms were similar in that each locale had its own preferred character encodings. Strings were arrays of char. One char = one character, except in a few East Asian locales that used double-byte encodings (which were awkward to handle due to being non-self-synchronizing).

But they approached Unicode in two different ways.

Windows NT adopted Unicode in the early days when Unicode was intended to be a fixed-width 16-bit character encoding. Microsoft wrote an entirely new version of the Windows API using 16-bit characters (wchar_t) instead of 8-bit char. For backwards-compatibility, they kept the old "ANSI" API around and defined a ton of macros so you could call either the "ANSI" or "Unicode" version depending on whether UNICODE was defined (the C runtime's <tchar.h> macros use _UNICODE for the same purpose).
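To make the macro scheme concrete, here is a minimal, portable sketch of the pattern. The names TitleLengthA, TitleLengthW, TitleLength and TEXT_LITERAL are hypothetical stand-ins invented for illustration; the real Windows headers apply the same idea to pairs such as CreateFileA and CreateFileW.

```cpp
#include <cstddef>
#include <string>

// Hypothetical "ANSI" and "Unicode" variants of one function.
inline std::size_t TitleLengthA(const char* s) {
    return std::char_traits<char>::length(s);
}
inline std::size_t TitleLengthW(const wchar_t* s) {
    return std::char_traits<wchar_t>::length(s);
}

// The header selects one variant at compile time, based on a macro.
#ifdef UNICODE
  #define TitleLength TitleLengthW
  #define TEXT_LITERAL(s) L##s   // wide string literal
#else
  #define TitleLength TitleLengthA
  #define TEXT_LITERAL(s) s      // narrow string literal
#endif
```

Code written as TitleLength(TEXT_LITERAL("hello")) compiles against either variant, which is exactly why the character type ends up depending on a build flag.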

In the Unix world (specifically, Plan 9 from Bell Labs), developers decided it would be easier to expand Unix's existing East Asian multi-byte character support to handle 3-byte characters, and created the encoding now known as UTF-8. In recent years, Unix-like systems have been making UTF-8 the default encoding for most locales.
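As an illustration of how UTF-8 extends the multi-byte idea, here is a small sketch of an encoder for a single code point. The function name encode_utf8 is my own; note that current UTF-8 uses up to 4 bytes per code point, not only 3.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF,
// since those are not valid Unicode scalar values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                      // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {          // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

ASCII text passes through unchanged, which is why existing char-based Unix APIs could carry UTF-8 without modification.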

Windows theoretically could expand their ANSI support to include UTF-8, but for a long time they did not, because of hard-coded assumptions about the maximum size of a character. (Windows 10 version 1903 and later let an application opt in to UTF-8 as its ANSI code page; older versions do not.) So, on Windows, you're generally stuck with an OS API and a C++ runtime library that are built around UTF-16 rather than UTF-8.

The upshot of this is that:

  • UTF-8 is the easiest encoding to work with on Unix.
  • UTF-16 is the easiest encoding to work with on Windows.

This creates just as much complication for cross-platform code as it sounds. It's easier if you just pick one Unicode encoding and stick to it.

Which encoding should that be?

See the related question "UTF-8 or UTF-16 or UTF-32 or UCS-2".

In summary:

  • UTF-8 lets you keep the assumption of 8-bit code units.
  • UTF-32 lets you keep the assumption of fixed-width characters.
  • UTF-16 sucks, but it's still around because of Windows and Java.
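The trade-off can be seen by counting code units for the same three characters in each encoding form. This is only a sketch: the helper code_units and the constant names are hypothetical, and it needs C++11 or later for the u8, u and U literal prefixes.

```cpp
#include <cstddef>

// Number of code units in a string literal, excluding the terminating null.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// The same text in three encoding forms:
// U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji outside the BMP).
constexpr std::size_t utf8_units  = code_units(u8"\u00E9\u20AC\U0001F600");
constexpr std::size_t utf16_units = code_units(u"\u00E9\u20AC\U0001F600");
constexpr std::size_t utf32_units = code_units(U"\u00E9\u20AC\U0001F600");
```

Only the UTF-32 count equals the number of code points; UTF-8 and UTF-16 are both variable-width, UTF-16 merely less visibly so.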

wchar_t

wchar_t is the standard C++ "wide character" type. But its encoding is not standardized: it's UTF-16 on Windows and UTF-32 on Unix, except on those platforms that use locale-dependent wchar_t encodings as a legacy from East Asian programming.

If you want to use UTF-32, use a uint32_t or equivalent typedef to store characters. Or use wchar_t if __STDC_ISO_10646__ is defined and wchar_t is at least 32 bits wide.
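One way to express that choice is a conditional typedef. This is a sketch, not a definitive recipe: the alias name utf32_char is hypothetical, and the WCHAR_MAX test is my own way of checking that wchar_t is wide enough.

```cpp
#include <climits>
#include <cstdint>
#include <cwchar>

// Use wchar_t only when the implementation promises ISO 10646 values
// and the type can hold every code point up to U+10FFFF.
#if defined(__STDC_ISO_10646__) && (WCHAR_MAX >= 0x10FFFF)
typedef wchar_t utf32_char;
#else
typedef std::uint32_t utf32_char;   // fallback, e.g. where wchar_t is 16 bits
#endif

// A code point needs 21 bits; fail the build if the alias is too narrow.
static_assert(sizeof(utf32_char) * CHAR_BIT >= 21,
              "utf32_char cannot hold every Unicode code point");
```

On Windows, where wchar_t is 16 bits, the fallback branch is taken and utf32_char becomes std::uint32_t.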

The C++11 standard added char16_t and char32_t, which clear up much of the confusion on how to represent UTF-16 and UTF-32.
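With those types, a conversion between the two forms can be written without guessing at the width of wchar_t. Below is a minimal sketch of encoding one code point as UTF-16; the function name encode_utf16 is hypothetical, and the input is assumed to be a valid scalar value.

```cpp
#include <string>

// Encode one code point as UTF-16 code units (C++11 or later).
// Assumes cp is a valid Unicode scalar value: not a surrogate, and <= U+10FFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Supplementary planes: a high and a low surrogate.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}
```

The surrogate branch is the part that code written for the original fixed-width 16-bit design tends to forget.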

TCHAR

TCHAR is a Windows typedef for wchar_t (assumed to be UTF-16) when UNICODE is defined (the C runtime's <tchar.h> keys off _UNICODE instead) and char (assumed to be "ANSI") otherwise. It was designed to deal with the overloaded Windows API mentioned above.

In my opinion, TCHAR sucks. It combines the disadvantages of having platform-dependent char with the disadvantages of platform-dependent wchar_t. Avoid it.

The most important consideration

Character encodings are about information interchange. That's what the "II" stands for in ASCII. Your program doesn't exist in a vacuum. You have to read and write files, which are more likely to be encoded in UTF-8 than in UTF-16.

On the other hand, you may be working with libraries that use UTF-16 (or more rarely, UTF-32) characters. This is especially true on Windows.

My recommendation is to use the encoding form that minimizes the amount of conversion you have to do.

This program should be able to use both: the Unicode and non Unicode environments

It would be much better to have your program work entirely in Unicode internally and only deal with legacy encodings when reading legacy data (or when writing it, but only if explicitly asked to).
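As a sketch of converting at the boundary, the function below turns legacy input into UTF-8 once, so the rest of the program only ever sees Unicode. ISO-8859-1 (Latin-1) is an assumption chosen for illustration because its byte values map directly to code points; other legacy encodings need a lookup table or a conversion library. The name latin1_to_utf8 is hypothetical.

```cpp
#include <string>

// Convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Each Latin-1 byte value equals its Unicode code point, so bytes below 0x80
// are copied as-is and the rest become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Calling this once when legacy data enters the program keeps every internal string in a single known encoding.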
