UTF8, codepoints, and their representation in Erlang and Elixir


Question

Going through Elixir's handling of Unicode:

iex> String.codepoints("abc§")
["a", "b", "c", "§"]

Very good, and byte_size/1 of this is not 4 but 5, because the last char takes 2 bytes. I get that.

The ? operator (or is it a macro? I can't find the answer) tells me that

iex(69)> ?§
167

Great; so then I look into the UTF-8 encoding table and see the value c2 a7 as the hex encoding for the char. That means the two bytes (as witnessed by byte_size/1) are c2 (194 in decimal) and a7 (167 in decimal). That 167 is the result I got when evaluating ?§ earlier. What I don't understand, exactly, is why that number is a "code point", as per the description of the ? operator. When I try to work backwards and evaluate the binary, I get what I want:

iex(72)> <<0xc2, 0xa7>>
"§"

And to make me go completely bananas, this is what I get in the Erlang shell:

24> <<167>>.
<<"§">>
25> <<"\x{a7}">>.
<<"§">>
26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>>
27> <<"\x{c2a7}">>.    
<<"§">>

Elixir, meanwhile, is only happy with the code above. What is it that I don't understand? Why is Erlang perfectly happy with a single byte, given that Elixir insists that the char takes 2 bytes, and the Unicode table seems to agree?

Recommended answer

The codepoint is what identifies the Unicode character. The codepoint for § is 167 (0xA7). A codepoint can be represented in bytes in different ways, depending on your encoding of choice.
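As a minimal sketch of that distinction (the variable names are illustrative only, and the snippet assumes a UTF-8 source file, which is Elixir's default), the same codepoint can be serialized under several encodings using the standard bitstring modifiers:

```elixir
# The codepoint of § is the integer 167 (0xA7), independent of any encoding.
codepoint = ?§

# The same codepoint written out under three encodings.
utf8 = <<codepoint::utf8>>     # two bytes: 0xC2 0xA7
utf16 = <<codepoint::utf16>>   # two bytes: 0x00 0xA7 (big endian by default)
latin1 = <<codepoint>>         # one byte:  0xA7

IO.inspect(codepoint)
IO.inspect(utf8, binaries: :as_binaries)
IO.inspect(utf16, binaries: :as_binaries)
IO.inspect(latin1, binaries: :as_binaries)
```

The integer never changes; only the bytes used to store it do.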

The confusion here comes from the fact that the codepoint 167 (0xA7) is represented by the bytes 0xC2 0xA7 when encoded as UTF-8. The trailing 0xA7 byte happens to have the same value as the codepoint, but it is not the codepoint itself.
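To see where 0xC2 0xA7 comes from, here is a rough sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx) done by hand. The module Utf8Sketch and the function encode_two_byte/1 are hypothetical names made up for this illustration, and the function deliberately handles only codepoints in the range 0x80..0x7FF:

```elixir
defmodule Utf8Sketch do
  # Hypothetical helper: encodes only codepoints that need two UTF-8 bytes.
  def encode_two_byte(cp) when cp in 0x80..0x7FF do
    # View the codepoint as 11 payload bits: the top 5 and the low 6.
    <<hi::5, lo::6>> = <<cp::11>>
    # Two-byte UTF-8 layout: 110xxxxx 10xxxxxx
    <<0b110::3, hi::5, 0b10::2, lo::6>>
  end
end

bytes = Utf8Sketch.encode_two_byte(0xA7)
IO.inspect(bytes, binaries: :as_binaries)
IO.inspect(bytes == <<0xA7::utf8>>)
```

For 0xA7 the 11 payload bits are 00010 100111, so the first byte becomes 11000010 (0xC2) and the second 10100111 (0xA7). The second byte equals the codepoint only by coincidence of the bit pattern.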

When you add Erlang to the conversation, you have to remember that Erlang's default encoding was/is latin1 (there is an effort to migrate to UTF-8, but I am not sure if it has made it to the shell; someone please correct me).
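For a byte-level view of latin1 versus UTF-8, the following sketch transcodes between the two with :unicode.characters_to_binary/3 from Erlang's standard library. The variable names are illustrative only:

```elixir
# In latin1 (ISO-8859-1) every codepoint 0..255 is stored as the single byte
# with the same value, so § is just the byte 0xA7.
latin1 = <<0xA7>>

# A lone 0xA7 byte is not valid UTF-8, which is what Elixir strings expect.
IO.inspect(String.valid?(latin1))

# Transcode latin1 -> UTF-8 with Erlang's :unicode module.
utf8 = :unicode.characters_to_binary(latin1, :latin1, :utf8)
IO.inspect(utf8, binaries: :as_binaries)

# And back again: UTF-8 -> latin1 yields the single byte.
roundtrip = :unicode.characters_to_binary(utf8, :utf8, :latin1)
IO.inspect(roundtrip, binaries: :as_binaries)
```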

In latin1, the codepoint for § (0xA7) is also represented by the single byte 0xA7. So, explaining your results directly:

24> <<167>>.
<<"§">> %% this is encoded in latin1

25> <<"\x{a7}">>.
<<"§">> %% still latin1

26> <<"\x{c2}\x{a7}">>.
<<"§"/utf8>> %% this is encoded in utf8, as the /utf8 modifier says

27> <<"\x{c2a7}">>.
<<"§">>  %% this is latin1

The last one is quite interesting and potentially confusing. In Erlang binaries, an integer segment is 8 bits wide by default, so if you pass an integer with a value greater than 255, it is truncated to its low byte. The last example is therefore effectively doing <<49831>> (0xC2A7), which when truncated becomes <<167>>, which is again equivalent to <<"§">> in latin1.
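The truncation can be reproduced in Elixir as well, since its binaries follow the same rule. This is a small sketch with illustrative variable names; the integer is bound to a variable first so the segment is built at runtime:

```elixir
n = 0xC2A7                     # 49831, too large for one byte

# The default segment is an 8-bit integer, so only the low byte is kept.
truncated = <<n>>
IO.inspect(truncated, binaries: :as_binaries)

# The same result written explicitly as "n modulo 256".
IO.inspect(truncated == <<rem(n, 256)>>)

# Asking for 16 bits keeps both bytes, which happen to be the UTF-8 bytes of §.
IO.inspect(<<n::16>> == "§")
```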
