Wide character (Unicode) and multi-byte character


Problem description

Hello everyone,

Wide characters and multi-byte characters are two popular encoding schemes on
Windows, and wide characters use the Unicode encoding scheme. But I get
confused every time another term, code page, comes up alongside them.

I am even more confused because sometimes a code page parameter is needed for
a wide character conversion and sometimes it is not. Here are two examples.

A code page is used in WideCharToMultiByte when dealing with Unicode characters:

int WideCharToMultiByte (
UINT CodePage,
DWORD dwFlags,
LPCWSTR lpWideCharStr,
int cchWideChar,
LPSTR lpMultiByteStr,
int cbMultiByte,
LPCSTR lpDefaultChar,
LPBOOL lpUsedDefaultChar );

A code page is not used in wcstombs when dealing with Unicode characters:

size_t wcstombs (
char* mbstr,
const wchar_t* wcstr,
size_t count );

My question is: what is a code page (it seems my current understanding is not
correct)? Does a code page have anything to do with multi-byte characters, or
is it only related to wide characters? Could anyone explain the meaning of,
and the relationship between, code pages, wide characters and multi-byte
characters?

Thanks in advance,
George

Solution

About code page:
http://www.mihai-nita.net/article.php?artID=20060806a

"code page is not used in wcstombs when dealing with Unicode characters"

wcstombs is a "dumbed-down" version of WideCharToMultiByte.
It takes its code page from the current C locale (the default system code
page, also called the ANSI code page, once setlocale(LC_ALL, "") has been
called), and the user has less control over the various conversion options
(dwFlags).

In fact, on Windows wcstombs is implemented in terms of WideCharToMultiByte.
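
To make the difference concrete, here is a minimal sketch in C. The helper names (use_user_default_locale, convert_with_locale, convert_with_code_page) are made up for illustration. The first two helpers are portable and show that wcstombs has no code page argument, so the target encoding comes from the C locale. The last one is compiled only on Windows and passes the code page explicitly to WideCharToMultiByte.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Switch the C locale from the startup "C" locale to the user's default.
 * Returns nonzero on success. Call it before convert_with_locale if the
 * text may contain non-ASCII characters. */
int use_user_default_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Convert with wcstombs. There is no code page argument: the current C
 * locale decides the target encoding. Returns the number of bytes written
 * (excluding the terminator), or (size_t)-1 if a character cannot be
 * represented in that locale. */
size_t convert_with_locale(const wchar_t *wide, char *out, size_t out_size)
{
    assert(wide != NULL && out != NULL && out_size > 0);

    size_t n = wcstombs(out, wide, out_size - 1);
    if (n == (size_t)-1) {
        out[0] = '\0';
        return (size_t)-1;
    }
    out[n] = '\0';  /* wcstombs does not terminate when the buffer is full */
    return n;
}

#ifdef _WIN32
/* Convert with WideCharToMultiByte. The caller picks the code page, for
 * example CP_ACP (system ANSI code page), CP_UTF8 or 1252. Returns the
 * number of bytes written including the terminator, or 0 on failure. */
int convert_with_code_page(unsigned int code_page, const wchar_t *wide,
                           char *out, int out_size)
{
    /* dwFlags = 0 and NULL default-character arguments keep the call valid
     * for every code page, including CP_UTF8. */
    return WideCharToMultiByte(code_page, 0, wide, -1, out, out_size,
                               NULL, NULL);
}
#endif
```

Plain ASCII text converts the same way in every locale. The two routes only start to differ for characters outside ASCII, where the result of wcstombs depends on which locale is active.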
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email


Thanks Mihai!

It is a very good article and I read through it twice. It answers and
clarifies most of my questions. I would still like your help confirming two
things.

1. Unicode, ANSI, UTF-8 and UTF-16 are character sets or code pages, that is,
mappings between characters and numbers. Is that correct?

2. What is the encoding approach? Does it have a name? I only see B or Q in
the samples in the article to represent the encoding approach.

I think the encoding approach is another level of mapping, between the code
page character numbers and the storage bytes. Is my understanding correct?

Regards,
George


"1. Unicode, ANSI, UTF-8 and UTF-16 are character sets or code pages, that is,
mappings between characters and numbers. Is that correct?"

Unicode = code page

UTF-8, UTF-16, UTF-32 = Character Encoding Forms
http://www.unicode.org/glossary/#cha..._encoding_form

ANSI = in the Windows lingo ANSI is a misnomer, meaning "the default system
code page." See http://www.mihai-nita.net/article.php?artID=glossary

The Unicode lingo is a bit more complicated (you also have a "Character
Encoding Scheme", etc.), but you probably don't need the whole enchilada
to get a grasp of the basics.

"2. What is the encoding approach? Does it have a name? I only see B or Q in
the samples in the article to represent the encoding approach."

B = BASE64, Q = Quoted-Printable
http://www.faqs.org/rfcs/rfc2047.html
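
For reference, an RFC 2047 encoded-word has the shape =?charset?encoding?encoded-text?=, where the encoding is B or Q. The following is a minimal sketch in C of the B (Base64) variant. The function names base64_encode and make_b_encoded_word are made up for illustration, and the sketch does not enforce the 75-character limit that RFC 2047 places on an encoded-word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

static const char B64_ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Base64-encode len bytes from in. The out buffer must hold at least
 * 4 * ((len + 2) / 3) + 1 chars. Returns the encoded length. */
size_t base64_encode(const unsigned char *in, size_t len, char *out)
{
    assert(in != NULL && out != NULL);

    size_t o = 0;
    for (size_t i = 0; i < len; i += 3) {
        /* Pack up to three input bytes into one 24-bit group. */
        unsigned long v = (unsigned long)in[i] << 16;
        if (i + 1 < len) v |= (unsigned long)in[i + 1] << 8;
        if (i + 2 < len) v |= (unsigned long)in[i + 2];

        /* Emit four 6-bit symbols, padding with '=' where input ran out. */
        out[o++] = B64_ALPHABET[(v >> 18) & 0x3F];
        out[o++] = B64_ALPHABET[(v >> 12) & 0x3F];
        out[o++] = (i + 1 < len) ? B64_ALPHABET[(v >> 6) & 0x3F] : '=';
        out[o++] = (i + 2 < len) ? B64_ALPHABET[v & 0x3F] : '=';
    }
    out[o] = '\0';
    return o;
}

/* Build "=?charset?B?base64?=" from bytes that are already encoded in
 * charset. Returns 0 on success, -1 if the input is too long for this
 * sketch or out is too small. */
int make_b_encoded_word(const char *charset, const unsigned char *bytes,
                        size_t len, char *out, size_t out_size)
{
    char b64[128];

    if (4 * ((len + 2) / 3) + 1 > sizeof b64)
        return -1;
    base64_encode(bytes, len, b64);

    int n = snprintf(out, out_size, "=?%s?B?%s?=", charset, b64);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

For example, the two UTF-8 bytes C3 A9 (the letter é) with charset "UTF-8" produce the encoded-word =?UTF-8?B?w6k=?=.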

"I think the encoding approach is another level of mapping, between the code
page character numbers and the storage bytes. Is my understanding correct?"

Yes. It is also called "byte serialization".
For normal text in a computer (let's say in code page 1252, Western
European), the mapping from code value to byte is 1:1, direct storage, so the
encoding part is not obvious. The code for 'a' is 0x61 and it is stored
as the byte 0x61.
This is why many programmers don't "grok" this extra level.
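
The extra level becomes visible as soon as one code value is serialized in more than one way. Below is a minimal sketch in C, with the hypothetical helper names encode_utf8 and encode_utf16le, that turns a single Unicode code point into bytes. For 'a' (0x61) UTF-8 gives the single byte 61, just as code page 1252 does. For é (U+00E9), code page 1252 stores the byte E9, UTF-8 stores C3 A9, and UTF-16 little-endian stores E9 00.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serialize one code point as UTF-8. Returns the number of bytes written
 * (1 to 4), or 0 if cp is a surrogate or above U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* ASCII: code value == byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Serialize one code point as UTF-16, little-endian byte order. Returns
 * 2 or 4 bytes written, or 0 if cp is not a valid code point. */
size_t encode_utf16le(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    if (cp < 0x10000) {                   /* one 16-bit unit */
        out[0] = (unsigned char)(cp & 0xFF);
        out[1] = (unsigned char)(cp >> 8);
        return 2;
    }
    cp -= 0x10000;                        /* surrogate pair */
    uint32_t hi = 0xD800 | (cp >> 10);
    uint32_t lo = 0xDC00 | (cp & 0x3FF);
    out[0] = (unsigned char)(hi & 0xFF);
    out[1] = (unsigned char)(hi >> 8);
    out[2] = (unsigned char)(lo & 0xFF);
    out[3] = (unsigned char)(lo >> 8);
    return 4;
}
```

The code value stays the same in every case. Only the byte serialization changes, which is the mapping level that the 1:1 case of code page 1252 hides.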
--
Mihai Nita [Microsoft MVP, Windows - SDK]
http://www.mihai-nita.net
------------------------------------------
Replace _year_ with _ to get the real email

