Parsing ASCII characters with Erlang


Problem Description


Confused about what parsing needs to be done, and at which end (client/server).

When I send an Umlaut 'Ö' to my ejabberd,
it is received by ejabberd as <<195,150>>.

Following this, I send it to my client as a push notification (silently, via GCM/APNS). From there, the client rebuilds the text by UTF-8-decoding each numeral one by one (this is wrong).

i.e. 195 is first decoded to the gibberish character � and so on.

This reconstruction needs to identify whether two bytes are to be consumed together, or three or more. This varies with the language of the letters (German here, for example).

How would the client identify which language it is going to reconstruct (i.e. the number of bytes to decode in one go)?

To add more,

lists:flatten(mochijson2:encode({struct, [{registration_ids, [Reg_id]}, {data, [{message, Message}, {type, Type}, {enum, ENUM}, {groupid, Groupid}, {groupname, Groupname}, {sender, Sender_list}, {receiver, Content_list}]}, {time_to_live, 2419200}]})).

This produced the JSON:

"{\"registration_ids\":[\"APA91bGLjnkhqZlqFEp7mTo9p1vu9s92_A0UIzlUHnhl4xdFTaZ_0HpD5SISB4jNRPi2D7_c8D_mbhUT_k-T2Bo_i_G3Jt1kIqbgQKrFwB3gp1jeGatrOMsfG4gAJSEkClZFFIJEEyow\"],\"data\":{\"message\":[104,105],\"type\":[71,82,79,85,80],\"enum\":2001,\"groupid\":[71,73,68],\"groupname\":[71,114,111,117,112,78,97,109,101],\"sender\":[49,64,100,101,118,108,97,98,47,115,100,115],\"receiver\":[97,115,97,115]},\"time_to_live\":2419200}"

where I had given "hi" as the message, and mochijson gave me the ASCII values [104,105].

The groupname field was given the value "Groupname";
the ASCII values are also correct after JSON creation, i.e. 71,114,111,117,112,78,97,109,101.

However, when I use http://www.unit-conversion.info/texttools/ascii/

it decodes as Ǎo��me and not "Groupname".

So, who should do the parsing? How should this be handled?

My reconstructed message is all gibberish when the ASCII is reconstructed.

Thanks

Solution

The things to worry about here are manifold, and have to do with both the desired encoding and the data structure. In Erlang, text is handled in one of the following ways:

  1. lists of bytes ([0..255, ...])
    • This is what you get if you listen to a socket and the data is returned as a list.
    • The VM assumes no encoding. They're bytes and mean little more.
    • The VM can however interpret these as strings (say in io:format("~s~n", [List])). When that happens (with the ~s flag specifically), the VM assumes the encoding is latin-1 (ISO-8859-1).
  2. lists of Unicode codepoints ([0..1114111, ...]).
    • You may get those from files that are read as unicode and as a list.
    • You can use them in output when you have a formatter such as io:format("~ts~n", [List]) where ~ts is like ~s but as unicode.
    • Those lists represent the codepoints you see in the Unicode standard, without any encoding (they are not UTF-x).
    • This can work in conjunction with latin-1 lists of characters because the Unicode codepoints and latin1 characters have the same sequence numbers below 255.
  3. Binaries (<<0..255, ...>>)
    • This is what you get if you listen to a socket, or read from anything, in binary format.
    • The VM can be told to assume many things:

      1. They are sequences of bytes (0..255) without specific meaning (<<Bin/binary>>)
      2. They are utf-8 encoded sequences (<<Bin/utf-8>>)
      3. They are utf-16 encoded sequences (<<Bin/utf-16>>)
      4. They are utf-32 encoded sequences (<<Bin/utf-32>>)

    • io:format("~s~n", [Bin]) will still assume any sequence is a latin-1 sequence; io:format("~ts~n", [Bin]) will assume UTF-8 only.
  4. A mixed list of both unicode lists and utf-encoded binaries (known as iodata()), used exclusively for output.
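A minimal shell sketch of the representations above, using the Umlaut 'Ö' from the question (Unicode codepoint 214, UTF-8 bytes 195,150); each pattern match would crash if the representations disagreed:

```erlang
%% The same character 'Ö' in each representation; paste into an `erl` shell.
Bytes      = [195, 150],                 % 1. list of bytes (here: its UTF-8 encoding)
Codepoints = [214],                      % 2. list of Unicode codepoints
Utf8Bin    = <<195, 150>>,               % 3. binary, UTF-8 encoded
Utf8Bin    = <<214/utf8>>,               %    same binary, built from the codepoint
Utf8Bin    = unicode:characters_to_binary(Codepoints, unicode, utf8),
Codepoints = unicode:characters_to_list(Utf8Bin, utf8).
```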

So in a gist:

  • lists of bytes
  • lists of latin-1 characters
  • lists of Unicode codepoints
  • binary of bytes
  • utf-8 binary
  • utf-16 binary
  • utf-32 binary
  • lists of many of these for output that is quickly concatenated

Also to note: until version 17.0, all Erlang source files were latin-1 only. 17.0 added an option to have the compiler read your source file as unicode by adding this header:

%% -*- coding: utf-8 -*-

The next factor is that JSON, by specification, assumes UTF-8 as the encoding for everything it contains. Furthermore, JSON libraries in Erlang will tend to assume that a binary is a string, and that lists are JSON arrays.

This means that if you want your output to be adequate, you must use UTF-8 encoded binaries to represent any JSON.
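This is exactly the asker's problem: a sketch (assuming mochijson2, the library used in the question) showing how the same value encodes as an array of numbers or as a string, depending on whether it is passed as a list or a binary:

```erlang
%% A list of integers (an Erlang "string") is treated as a JSON array:
lists:flatten(mochijson2:encode({struct, [{groupname, "Groupname"}]})).
%% -> "{\"groupname\":[71,114,111,117,112,78,97,109,101]}"

%% A binary is treated as a JSON string:
lists:flatten(mochijson2:encode({struct, [{groupname, <<"Groupname">>}]})).
%% -> "{\"groupname\":\"Groupname\"}"
```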

If what you have is:

  • A list of bytes that represent a utf-encoded string, then list_to_binary(List) to get the proper binary representation
  • A list of codepoints, then use unicode:characters_to_binary(List, unicode, utf8) to get a utf-8 encoded binary
  • A binary representing a latin-1 string: unicode:characters_to_binary(Bin, latin1, utf8)
  • A binary of any other UTF encoding: unicode:characters_to_binary(Bin, utf16 | utf32, utf8)
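Applied to the Umlaut example from the question ('Ö' arriving over the socket as the bytes 195,150), each of these conversion paths yields the same UTF-8 binary:

```erlang
%% 'Ö' is Unicode codepoint 214 (also its latin-1 code); UTF-8 encodes it as <<195,150>>.
Utf8 = list_to_binary([195, 150]),                           % bytes that are already UTF-8
Utf8 = unicode:characters_to_binary([214], unicode, utf8),   % from a codepoint list
Utf8 = unicode:characters_to_binary(<<214>>, latin1, utf8),  % from a latin-1 binary
<<195, 150>> = Utf8.
```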

Take that UTF-8 binary, and send it to the JSON library. If the JSON library is correct and the client parses it properly, then it should be right.
