Parsing ASCII characters with Erlang


Question

I'm confused about what parsing needs to be done, and at which end (client or server).

When I send an umlaut 'Ö' to my ejabberd server,
it is received by ejabberd as the two bytes <<195,150>>.

After this, I send it to my client as a push notification (silently, via GCM/APNS). There, the client rebuilds the text by UTF-8 decoding each number one by one (this is wrong).

That is, 195 is first decoded to the gibberish character �, and so on.

This reconstruction needs to know whether two bytes should be taken together, or three or more. This varies with the language of the letters (German here, for example).

How would the client identify which language it is going to reconstruct (that is, the number of bytes to decode in one go)?

To add more:

lists:flatten(mochijson2:encode({struct,[{registration_ids,[Reg_id]},{data ,[{message,Message},{type,Type},{enum,ENUM},{groupid,Groupid},{groupname,Groupname},{sender,Sender_list},{receiver,Content_list}]},{time_to_live,2419200}]})).

The generated JSON is:

"{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"

Here I had given "hi" as the message, and mochijson gave me the ASCII values [104,105].

The groupname field was given the value "Groupname",
and the ASCII codes are also correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101.

But when I use http://www.unit-conversion.info/texttools/ascii/ it decodes as Ǎo��me and not "Groupname".

So, who should do the parsing, and how should it be handled?

My reconstructed message is all gibberish when the ASCII is reconstructed.

Thanks

Recommended answer

There are several things to worry about here, and they have to do with both the desired encoding and the data structure. In Erlang, text is handled in one of the following ways:

  1. lists of bytes ([0..255, ...])
    • This is what you get if you listen to a socket and the data is returned as a list.
    • The VM assumes no encoding. They're bytes and mean little more.
    • The VM can, however, interpret these as strings (say in io:format("~s~n", [List])). When that happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
  2. lists of Unicode codepoints
    • You may get these from files that are read as unicode and returned as a list.
    • You can use them in output when you have a formatter such as io:format("~ts~n", [List]), where ~ts is like ~s but treats the data as unicode.
    • These lists represent the codepoints you see in the Unicode standard, without any encoding (they are not UTF-x).
    • This can work in conjunction with latin-1 lists of characters, because Unicode codepoints and latin-1 characters have the same numbers up to 255.
  3. binaries
    • This is what you get if you listen to or read from anything in binary format.
    • The VM can be told to assume many things:
      1. They are sequences of bytes (0..255) without specific meaning (<<Bin/binary>>)
      2. They are UTF-8 encoded sequences (the utf8 segment type, e.g. <<"text"/utf8>>)
      3. They are UTF-16 encoded sequences (the utf16 segment type, e.g. <<"text"/utf16>>)
      4. They are UTF-32 encoded sequences (the utf32 segment type, e.g. <<"text"/utf32>>)
    • io:format("~s~n", [Bin]) will still assume any sequence is a latin-1 sequence; io:format("~ts~n", [Bin]) will assume UTF-8 only.
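In short:

  • lists of bytes
  • lists of latin-1 characters
  • lists of Unicode codepoints
  • binaries of bytes
  • UTF-8 binaries
  • UTF-16 binaries
  • UTF-32 binaries
  • lists mixing many of these, for output with quick concatenation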
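
As a rough illustration of the representations listed above, the sketch below builds the same one-character text, "Ö" (codepoint 214), in several forms and shows what ~s and ~ts make of the UTF-8 binary. The module and function names (text_forms, forms/0) are made up for this example; only the unicode, io_lib and lists calls come from OTP, and maps need Erlang 17 or later.

```erlang
-module(text_forms).
-export([forms/0]).

%% Return the text "Ö" (Unicode codepoint 214, U+00D6) in several of the
%% representations discussed above, keyed by a descriptive atom.
forms() ->
    Codepoints = [214],
    Utf8 = unicode:characters_to_binary(Codepoints, unicode, utf8),
    #{codepoints     => Codepoints,                  % [214]
      utf8_binary    => Utf8,                        % <<195,150>>
      utf8_byte_list => binary_to_list(Utf8),        % [195,150]
      latin1_binary  => unicode:characters_to_binary(Codepoints, unicode, latin1),
      utf16_binary   => unicode:characters_to_binary(Codepoints, unicode, utf16),
      utf32_binary   => unicode:characters_to_binary(Codepoints, unicode, utf32),
      %% ~s reads the binary byte by byte, as latin-1 characters.
      fmt_s          => lists:flatten(io_lib:format("~s", [Utf8])),
      %% ~ts decodes the binary as UTF-8 and yields codepoints.
      fmt_ts         => lists:flatten(io_lib:format("~ts", [Utf8]))}.
```

The two bytes 195 and 150 from the question are simply the UTF-8 encoding of codepoint 214; read with ~s they stay two separate latin-1 characters, read with ~ts they become one.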

Also note: older Erlang releases read source files as latin-1 only. R16 added the option of declaring a source file as UTF-8, and since 17.0 UTF-8 is the default. A source file can state its encoding explicitly with this header:

    %% -*- coding: utf-8 -*-
    

The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it contains. Furthermore, JSON libraries in Erlang tend to assume that a binary is a string and that a list is a JSON array.

This means that if you want your output to be correct, you must use UTF-8 encoded binaries for every string you put into JSON.
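
A minimal sketch of what that could look like for the payload in the question, assuming the values are already UTF-8 encoded binaries. The module and function names (push_payload, build/2) are hypothetical, and the payload is cut down to two fields; the term only uses the {struct, ...} shape already shown in the question.

```erlang
-module(push_payload).
-export([build/2]).

%% Build the term handed to the JSON encoder, with every string value
%% given as a binary instead of a list.
%% Assumption: RegId and Message are already UTF-8 encoded binaries.
build(RegId, Message) when is_binary(RegId), is_binary(Message) ->
    {struct, [{registration_ids, [RegId]},
              {data, [{message, Message}]},
              {time_to_live, 2419200}]}.
```

In Erlang the string "hi" is the list [104,105], which is why the encoder in the question produced a JSON array of numbers. Passing <<"hi">> instead lets a library that treats binaries as strings emit "hi". The resulting term would then go to mochijson2:encode/1 as in the question; since encoders of this kind may return a mix of lists and binaries, iolist_to_binary/1 is probably a safer final step than lists:flatten/1.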

If what you have is:

  • A list of bytes that represents a UTF-8 encoded string: use list_to_binary(List) to get the proper binary representation
  • A list of codepoints: use unicode:characters_to_binary(List, unicode, utf8) to get a UTF-8 encoded binary
  • A binary representing a latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
  • A binary in any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8), that is, with utf16 or utf32 as the input encoding
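
The four cases above can be sketched as small helper functions. The module and function names are invented for this example; the conversions themselves are the list_to_binary/1 and unicode:characters_to_binary/3 calls named in the list, with utf16 meaning big-endian UTF-16 as it does by default in OTP.

```erlang
-module(to_utf8).
-export([from_byte_list/1, from_codepoints/1, from_latin1/1, from_utf16/1]).

%% A list of bytes that is already UTF-8 encoded, e.g. [195,150].
from_byte_list(Bytes) when is_list(Bytes) ->
    list_to_binary(Bytes).

%% A list of Unicode codepoints, e.g. [214].
from_codepoints(Codepoints) when is_list(Codepoints) ->
    unicode:characters_to_binary(Codepoints, unicode, utf8).

%% A binary holding latin-1 text, e.g. <<214>>.
from_latin1(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, latin1, utf8).

%% A binary holding big-endian UTF-16 text, e.g. <<0,214>>.
from_utf16(Bin) when is_binary(Bin) ->
    unicode:characters_to_binary(Bin, utf16, utf8).
```

All four inputs in the comments are the letter "Ö", and all four functions return <<195,150>> for them. Picking the wrong function is what produces gibberish: running from_codepoints/1 on [195,150] would encode each byte again and give four bytes instead of two.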

Take that UTF-8 binary and send it to the JSON library. If the JSON library is correct and the client parses it properly, then the result should be right.
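
On the question of how many bytes to decode in one go: UTF-8 is self-describing, so the receiver does not need to know the language. The first byte of each character says how long the sequence is, and a UTF-8 decoder handles that when given the whole byte string rather than one byte at a time. The sketch below shows the idea in Erlang purely for illustration; the real client would use its own platform's UTF-8 decoder, and the module and function names here are hypothetical.

```erlang
-module(utf8_decode).
-export([decode/1, seq_len/1]).

%% Decode a whole UTF-8 binary into a list of Unicode codepoints.
%% Returns {incomplete, Decoded, Rest} or {error, Decoded, Rest} when the
%% input is not complete, valid UTF-8.
decode(Bin) when is_binary(Bin) ->
    unicode:characters_to_list(Bin, utf8).

%% Number of bytes in the UTF-8 sequence that starts with lead byte B.
seq_len(B) when B >= 16#00, B =< 16#7F -> 1;   % 0xxxxxxx: plain ASCII
seq_len(B) when B >= 16#C2, B =< 16#DF -> 2;   % 110xxxxx
seq_len(B) when B >= 16#E0, B =< 16#EF -> 3;   % 1110xxxx
seq_len(B) when B >= 16#F0, B =< 16#F4 -> 4;   % 11110xxx
seq_len(_) -> invalid.                         % continuation or illegal byte
```

Decoding <<195,150>> as a whole gives [214], the codepoint of "Ö". Decoding the single byte <<195>> cannot succeed, because 195 announces a two-byte sequence; that is the gibberish character seen on the client.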
