Special Characters in Google Calculator


Question

In a previous question I was told that Google passes UTF-8 encoded responses to queries. This solved a problem with non-breaking spaces (0xA0) being garbled after being passed by curl to my terminal. It was solved by piping the curl output to iconv and converting to UTF-8. However, even with this solution in place, I am still getting some strange output.

Consider the following conversion of 2 m to feet:

http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet

This is the output I'm seeing in my browser and elsewhere:

{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches)",error: "",icc: false}

The expected output is:

{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6 47/64 inches)",error: "",icc: false}

I could just do a text replace using regular expressions or some other solution, but I would like to know what's happening here. Any insight?

I am running Mac OS X Mountain Lion 10.8.2

Answer

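Google Calculator, as accessed via curl, returns a JSON-like object literal rather than strict JSON: the keys are unquoted and the strings use \xHH escapes, which are valid in JavaScript but not in the JSON grammar (JSON only allows \uXXXX). The escaped characters are HTML markup and entities, so if the output were being sent to a browser (or anything else that parses HTML) instead of standard output, only a good, lenient JSON decoder would be necessary.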
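
One caveat about the escape notation is easy to check. The sketch below feeds a string containing \x3c to Python's standard json module, which follows the JSON grammar strictly, and compares it with the equivalent \u003c form. The two sample strings and the helper name parses_as_json are made up for illustration.

```python
import json

def parses_as_json(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Hypothetical samples: the same value written with \xHH and with \uXXXX escapes.
hex_escaped = r'{"rhs": "6\x3csup\x3e47"}'
unicode_escaped = r'{"rhs": "6\u003csup\u003e47"}'

print(parses_as_json(hex_escaped))       # False: \x is not a JSON escape
print(parses_as_json(unicode_escaped))   # True
```

A strict parser therefore needs the \xHH escapes rewritten (or a lenient decoder) before it will accept the response.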

Let's see what we can do from the command line to parse the JSON.

echo -en $(curl -s 'http://www.google.com/ig/calculator?hl=en&q=4^22') > ~/temp.html

This gets us valid HTML which we can view via a browser, but we need to reduce everything to something that can display via standard output.

echo -en "$(curl -s --connect-timeout 10 "http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet")" | sed -e 's/<sup>/ &/g' -e :a -e 's/<[^>]*>//g;/</N;//ba' | perl -MHTML::Entities -ne 'print decode_entities($_)' | iconv -f ISO-8859-1 -t UTF-8

For the echo command, the -e interprets escapes such as \x3e, \x3c, and \x26 (>, <, and & respectively), while the -n suppresses the newline that echo would normally add.
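
For readers who want to see the escape decoding on its own, here is a minimal Python sketch of the same step applied to the rhs fragment from the question. The function name decode_hex_escapes is an assumption for illustration; it only handles two-digit \xHH escapes, not everything echo -e understands.

```python
import re

def decode_hex_escapes(text):
    """Replace each two-digit \\xHH escape with the character it encodes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda match: chr(int(match.group(1), 16)),
        text,
    )

# Fragment of the rhs value shown in the question.
raw = r"6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches"
print(decode_hex_escapes(raw))
# 6<sup>47</sup>&#8260;<sub>64</sub> inches
```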

The pipe to sed adds a space before every <sup> (superscript) tag and then removes all HTML tags.
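
The effect of that sed stage can be sketched in Python as follows. This is an approximation under the assumption that the input is already unescaped and fits in one string; strip_tags is a hypothetical helper name, and the multi-line looping that the sed script does with N and ba is not needed here.

```python
import re

def strip_tags(html_text):
    """Put a space before each <sup> tag, then delete every HTML tag."""
    spaced = html_text.replace("<sup>", " <sup>")
    return re.sub(r"<[^>]*>", "", spaced)

print(strip_tags("6<sup>47</sup>&#8260;<sub>64</sub> inches"))
# 6 47&#8260;64 inches
```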

The pipe to perl then decodes all the HTML entities, such as &#8260; to ⁄ (fraction slash). See http://en.wikipedia.org/wiki/Html_special_characters#Character_entity_references_in_HTML
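
As a rough Python counterpart to decode_entities, the standard library's html.unescape handles numeric character references like the one in this response. The input string is the tag-stripped fragment from the question, used here as an assumed intermediate value.

```python
import html

# &#8260; is the decimal reference for U+2044 FRACTION SLASH.
decoded = html.unescape("6 47&#8260;64 inches")

print(decoded)                 # 6 47⁄64 inches
print(hex(ord(decoded[4])))    # 0x2044
```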

The pipe to iconv converts the ISO-8859-1 output to the expected UTF-8. This is done last since the perl line can produce UTF-8 entities that will need to be properly converted.
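
What the iconv stage does to a non-breaking space can be shown with a short sketch. The byte string below is an assumed sample (the text "2 meters" with 0xA0 in place of the space), not captured output, and latin1_to_utf8 is a hypothetical helper name.

```python
def latin1_to_utf8(raw_bytes):
    """Reinterpret ISO-8859-1 bytes as text and re-encode them as UTF-8."""
    return raw_bytes.decode("iso-8859-1").encode("utf-8")

# 0xA0 is the non-breaking space in ISO-8859-1.
nbsp_latin1 = b"2\xa0meters"
print(latin1_to_utf8(nbsp_latin1))   # b'2\xc2\xa0meters'
```

In UTF-8 the non-breaking space becomes the two bytes 0xC2 0xA0, which is why a lone 0xA0 byte shows up as garbage on a UTF-8 terminal.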

This is still going to have issues with distinguishing between fractions and exponents (47/64 where 47 is wrapped in superscript tags and 64 is wrapped in subscript tags, and 10^13 where 13 is wrapped in superscript tags).
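
One way to keep the two cases apart is to rewrite the markup before stripping tags: first the sup/sub pair joined by a fraction slash, then any remaining superscript. The sketch below assumes unescaped input; flatten_markup is a hypothetical name, and the second sample string is an assumed shape for an exponent result, not captured output.

```python
import re

# A fraction is <sup>a</sup>, a fraction slash (entity or literal), then <sub>b</sub>.
FRACTION = re.compile(r"<sup>(\d+)</sup>(?:&#8260;|\u2044)<sub>(\d+)</sub>")
# Any <sup>n</sup> left over after that is treated as an exponent.
EXPONENT = re.compile(r"<sup>(\d+)</sup>")

def flatten_markup(text):
    """Rewrite fractions as ' a/b' and remaining superscripts as '^n'."""
    text = FRACTION.sub(r" \1/\2", text)   # must run before the exponent rule
    return EXPONENT.sub(r"^\1", text)

print(flatten_markup("6<sup>47</sup>&#8260;<sub>64</sub> inches"))
# 6 47/64 inches
print(flatten_markup("1.7592186 &#215; 10<sup>13</sup>"))
# 1.7592186 &#215; 10^13
```

The order matters: running the exponent rule first would turn the numerator of every fraction into an exponent.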

We could get super silly and make a really long sed line to parse all the special characters (the following is in AppleScript so you can see just how ridiculous the syntax gets):

set jsonResponse to do shell script "curl " & queryURL & " | sed -e 's/[†]/,/g' -e 's/\\\\x26#215;/*/g' -e 's/\\\\x26#188;/ 1\\/4/g' -e 's/\\\\x26#189;/ 1\\/2/g' -e 's/\\\\x26#190;/ 3\\/4/g' -e 's/\\\\x26#8539;/ 1\\/8/g' -e 's/\\\\x26#8540;/ 3\\/8/g' -e 's/\\\\x26#8541;/ 5\\/8/g' -e 's/\\\\x26#8542;/ 7\\/8/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e\\\\x26#8260;\\\\x3csub\\\\x3e\\([0-9]*\\)\\\\x3c\\/sub\\\\x3e/ \\1\\/\\2/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e/^\\1/' -e 's/( /(/g'"

The † (dagger) character is 160 in decimal within the MacRoman set (Macintosh encoding). In hexadecimal this is 0xA0 (\xA0), the same byte value that represents the non-breaking space (U+00A0) in ISO-8859-1, which is what Google is passing. So in AppleScript, in order to replace the non-breaking space, we have to use the † (dagger) due to the Macintosh encoding.
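
The byte-level coincidence can be checked with Python's built-in codecs. This is only an illustration of how the single byte 0xA0 reads under the two encodings; it says nothing about how AppleScript itself passes text to the shell.

```python
byte = b"\xa0"

# The same byte is a dagger in MacRoman and a non-breaking space in ISO-8859-1.
as_mac_roman = byte.decode("mac_roman")
as_latin1 = byte.decode("iso-8859-1")

print(as_mac_roman)                 # †
print(as_latin1 == "\u00a0")        # True
```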

There are also several special fraction symbols that the sed line deals with: http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html#fractions

The moral of the story is that when dealing with JSON, just use a good JSON parser.
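
To make that concrete, here is a minimal sketch that parses the response from the question with Python's json module after two small repairs: quoting the bare keys and rewriting \xHH as \u00HH. The function names are hypothetical, and the key-quoting regex is naive (it would misfire on a string value containing something like ", word:"), so treat it as a sketch for this response shape rather than a general parser.

```python
import html
import json
import re

def clean(value):
    """Flatten fraction/exponent markup, drop other tags, decode entities."""
    value = re.sub(r"<sup>(\d+)</sup>&#8260;<sub>(\d+)</sub>", r" \1/\2", value)
    value = re.sub(r"<sup>(\d+)</sup>", r"^\1", value)
    value = re.sub(r"<[^>]*>", "", value)
    return html.unescape(value)

def parse_calculator_response(raw):
    """Turn the loosely formatted response into a dict of plain-text values."""
    # Quote the bare keys (lhs, rhs, error, icc) so the text becomes valid JSON.
    quoted = re.sub(r"([{,])\s*([A-Za-z_]\w*)\s*:", r'\1"\2":', raw)
    # Rewrite \xHH escapes as the \u00HH form that JSON allows.
    quoted = re.sub(r"\\x([0-9A-Fa-f]{2})", r"\\u00\1", quoted)
    data = json.loads(quoted)
    return {key: clean(value) if isinstance(value, str) else value
            for key, value in data.items()}

# The response text quoted in the question.
sample = (r'{lhs: "2 meters",rhs: "6.56167979 feet (6 feet '
          r'6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches)",'
          r'error: "",icc: false}')
print(parse_calculator_response(sample))
# {'lhs': '2 meters', 'rhs': '6.56167979 feet (6 feet 6 47/64 inches)', 'error': '', 'icc': False}
```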

A sub-moral is: don't use AppleScript to deal with JSON.
