To which character encoding (Unicode version) set does a char object correspond?




What Unicode character encoding does a char object correspond to in:

  • C#

  • Java

  • JavaScript (I know there is not actually a char type but I am assuming that the String type is still implemented as an array of Unicode characters)

In general, is there a common convention among programming languages to use a specific character encoding?

Update

  1. I have tried to clarify my question. The changes I made are discussed in the comments below.
  2. Re: "What problem are you trying to solve?", I am interested in code generation from language independent expressions, and the particular encoding of the file is relevant.

Solution

I'm not sure that I am answering your question, but let me make a few remarks that hopefully shed some light.

At the core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.

The process of turning a stream of numbers into a text is one of semantics, and it is usually left to the consumer to assign the relevant semantics to a data stream.

Warning: I will now use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is the assignment of meaning to a number. The semantic interpretation of a number is also called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can be conveniently represented by a single 8-bit byte (with the top bit always 0). There are many encodings that assign characters to 256 numbers, thus all using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
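As a minimal sketch of this first meaning of "encoding", here it is in Java (one of the languages the question asks about); the numeric values are the ASCII assignments quoted above:

```java
public class AsciiDemo {
    public static void main(String[] args) {
        // In ASCII (and in Unicode, which extends it), 32 means "space"
        // and 65 means "capital A" -- the encoding is just this mapping.
        System.out.println((char) 32 == ' ');  // true
        System.out.println((char) 65 == 'A');  // true
        // ASCII assigns meaning to only 128 numbers, so every ASCII
        // character fits in 7 bits (the top bit of a byte is always 0).
        System.out.println('A' < 128);         // true
    }
}
```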

Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
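A short Java sketch of these points: a codepoint is just an integer, codepoints above the 16-bit range are valid, and the first 128 assignments coincide with ASCII. (`0x10FFFF` is the actual Unicode maximum, slightly below the theoretical 2^21.)

```java
public class CodepointDemo {
    public static void main(String[] args) {
        // U+1F600 is a valid codepoint but does not fit in 16 bits,
        // so a 32-bit int is used to hold it.
        int codepoint = 0x1F600;
        System.out.println(Character.isValidCodePoint(codepoint));   // true
        // The first 128 Unicode codepoints are the same as ASCII:
        System.out.println((int) 'A');                               // 65
        // The highest assigned codepoint value:
        System.out.println(Character.MAX_CODE_POINT == 0x10FFFF);    // true
    }
}
```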

Now we have a second meaning of "encoding": I can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called "Unicode Transformation Format", or "UTF"), we now have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded into a sequence of Unicode codepoints.
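The code-unit/codepoint distinction can be seen directly in Java, whose `String` stores UTF-16 code units. Using U+1D11E (musical symbol G clef) as an example codepoint outside the 16-bit range:

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        // Build a one-codepoint string from U+1D11E.
        String s = new String(new int[] { 0x1D11E }, 0, 1);
        // length() counts 16-bit UTF-16 code units, not codepoints:
        System.out.println(s.length());                       // 2
        System.out.println(s.codePointCount(0, s.length()));  // 1
        // Each individual code unit (a surrogate) is meaningless on its own:
        System.out.println(Character.isSurrogate(s.charAt(0))); // true
    }
}
```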

Thus, from a programming perspective, if you want to modify text (not bytes), then you should store your text as a sequence of Unicode codepoints. Practically that means that you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while in C# and Java it is always 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char could store a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will always have to perform decoding.
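A small Java sketch of that decoding step, going from transformed UTF-8 bytes back to codepoints (the byte values are the standard UTF-8 encoding of U+00E9, "é"):

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        // "é" (U+00E9) takes two bytes in UTF-8: neither byte alone
        // is a meaningful character.
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA9 };
        System.out.println(utf8.length);                      // 2
        // Decoding recovers the single original codepoint:
        String s = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(s.codePointCount(0, s.length()));  // 1
        System.out.println(s.codePointAt(0) == 0xE9);         // true
    }
}
```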

Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF-8 and UTF-16 strings (but at a price), but if you want to spare yourself this extra indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.

