Is it possible to extract the bits from a Unicode LE/BE character (VS 2015, C++, MFC)

Problem description

I am using VS 2015, MFC, and C++. In my application I have some Unicode characters, and I am not sure whether they are LE or BE. From my searching, I found LE characters up to 11,00000, so after that only BE applies. So I guess the Unicode characters I already have are LE. So LE stands for 2 bytes, is that right? The 3rd byte is '\n'.

So if it is LE, is it possible to extract the bits from this little-endian character using some function or in any other way? Also, is it possible to convert those bits back to the same Unicode character?

Another doubt: how can I determine programmatically whether a character is little-endian or big-endian?

Regards,

Satheesh

Recommended answer

I have no idea what you mean by "i found LE characters up to 11,00000", but your understanding is flawed.

"Unicode" by itself is an abstract thing. There are just over a million possible Unicode code points (U+0000 through U+10FFFF), of which only a fraction are assigned to characters. In order to be used in software, a code point has to be represented in a Unicode "encoding". All of the standard encodings (UTF-8, UTF-16, UTF-32) can represent every Unicode character.

Usually, the most convenient encoding is UTF-8, which stores each character as a sequence of one to four bytes. ASCII characters take a single byte; other Unicode characters take two, three, or four bytes.
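To make the byte counts concrete, here is a minimal sketch of how many bytes UTF-8 needs for a given code point. The function name `utf8_length` is made up for this illustration and is not part of MFC or the standard library; the ranges themselves come from the UTF-8 definition.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of bytes UTF-8 uses for one code point.
// Returns 0 for values that cannot be encoded: the surrogate range
// U+D800..U+DFFF and anything above U+10FFFF.
std::size_t utf8_length(std::uint32_t code_point)
{
    if (code_point >= 0xD800 && code_point <= 0xDFFF) return 0;
    if (code_point <= 0x7F)     return 1;  // ASCII
    if (code_point <= 0x7FF)    return 2;
    if (code_point <= 0xFFFF)   return 3;  // rest of the Basic Multilingual Plane
    if (code_point <= 0x10FFFF) return 4;  // supplementary planes
    return 0;
}
```

For example, `utf8_length(0x1234)` gives 3, so U+1234 occupies three bytes in a UTF-8 file but two bytes in a UTF-16 file.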

Windows has used a 16-bit Unicode encoding from the beginning (UTF-16 today), where characters are encoded into two-byte units; characters outside the Basic Multilingual Plane take two such units (a surrogate pair). When you write two-byte units to a file, you have to decide in which order to write the bytes, just as you do with ANY 16-bit value. Windows systems use little-endian, so the U+1234 code point would be written to the file as 34 12. Some systems use big-endian, so the U+1234 code point would be written to the file as 12 34. Both of those byte sequences refer to the same Unicode character, U+1234. In virtually every case, a given program always deals with one ordering or the other.
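As a rough sketch of what "extracting the bits" and converting them back could look like for a single 16-bit unit, the following uses only standard C++. All function names here are invented for the illustration, and the code handles one UTF-16 code unit at a time, so a character stored as a surrogate pair would need two calls.

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <string>

// Split one UTF-16 code unit into two bytes, low byte first (little-endian).
std::array<std::uint8_t, 2> to_utf16le_bytes(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit & 0xFF),
              static_cast<std::uint8_t>(unit >> 8) }};
}

// Split one UTF-16 code unit into two bytes, high byte first (big-endian).
std::array<std::uint8_t, 2> to_utf16be_bytes(char16_t unit)
{
    return {{ static_cast<std::uint8_t>(unit >> 8),
              static_cast<std::uint8_t>(unit & 0xFF) }};
}

// Rebuild the code unit from two bytes stored in little-endian order.
char16_t from_utf16le_bytes(std::uint8_t first, std::uint8_t second)
{
    return static_cast<char16_t>(first | (second << 8));
}

// Rebuild the code unit from two bytes stored in big-endian order.
char16_t from_utf16be_bytes(std::uint8_t first, std::uint8_t second)
{
    return static_cast<char16_t>((first << 8) | second);
}

// The 16 bits of the code unit as text, most significant bit first.
std::string bits_of(char16_t unit)
{
    return std::bitset<16>(unit).to_string();
}

// Inverse of bits_of: parse a string of '0'/'1' characters back into a unit.
char16_t from_bits(const std::string& bits)
{
    return static_cast<char16_t>(std::bitset<16>(bits).to_ulong());
}
```

With these helpers, `to_utf16le_bytes(0x1234)` yields the bytes 34 12, `to_utf16be_bytes(0x1234)` yields 12 34, and `from_bits(bits_of(0x1234))` returns 0x1234 again. The bit pattern of the value is the same in both cases; only the order of the bytes in the file differs.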

Given a two-byte sequence, you cannot tell whether it is UTF-16LE or UTF-16BE, or UTF-8 for that matter. That's why Unicode files are supposed to start with a "byte order mark" (BOM). This is the Unicode character U+FEFF. So, if the first two bytes of the file are FF FE, you have a UTF-16 little-endian file. If the first two bytes of the file are FE FF, you have a UTF-16 big-endian file. If the first three bytes are EF BB BF, then you have a UTF-8 file.
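The byte order mark check described above can be sketched in a few lines. The `BomKind` enum and `detect_bom` function are hypothetical names chosen for this example; the sketch only covers the three marks mentioned in the answer and assumes the caller has already read the first bytes of the file into a buffer.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this example.
enum class BomKind { None, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer for a byte order mark.
// Limitation: a UTF-32LE file starts with FF FE 00 00, which this
// sketch would report as Utf16LE.
BomKind detect_bom(const std::uint8_t* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BomKind::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BomKind::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BomKind::Utf16BE;
    return BomKind::None;  // no mark: the encoding must be known some other way
}
```

A result of `BomKind::None` does not mean the data is not Unicode; it only means the file carries no mark, in which case the program has to rely on a convention or on information from outside the file.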

