Special Characters in Google Calculator
In a previous question I was told that Google passes UTF-8 encoded responses to queries. This solved a problem with non-breaking spaces (0xA0) being garbled after being passed by curl to my terminal: I piped the curl output to iconv and converted it to UTF-8. However, even with this solution in place, I am still getting some strange output.
Consider the following conversion of 2 m to feet:
http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet
This is the output I'm seeing in my browser and elsewhere:
{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches)",error: "",icc: false}
The expected output is:
{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6 47/64 inches)",error: "",icc: false}
I could just do a text replace using regular expressions or some other solution, but I would like to know what's happening here. Any insight?
I am running Mac OS X Mountain Lion 10.8.2
Google Calculator, as accessed via curl, returns a JSON-like response. The \xHH notation Google uses is the JavaScript hex escape for string literals; strict JSON only defines \uHHHH, so not every JSON decoder will accept the response as is. If the output were being sent to a browser (or anything else that parses HTML) instead of standard output, only a good JSON decoder would be necessary.
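To make the \xHH notation concrete, here is a minimal Python sketch that decodes those escapes in the rhs fragment from the question. The function name decode_hex_escapes is made up for this illustration; it only handles two-digit \xHH escapes and nothing else.

```python
import re


def decode_hex_escapes(text: str) -> str:
    """Replace each \\xHH escape with the character it names."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )


# Fragment copied from the response shown in the question.
raw = r"6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches"
print(decode_hex_escapes(raw))
# prints: 6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches
```

What comes out is HTML markup plus an HTML entity, which is why a browser shows the fraction correctly and a terminal does not.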
Let's see what we can do from the command line to parse the JSON.
echo -en $(curl -s 'http://www.google.com/ig/calculator?hl=en&q=4^22') > ~/temp.html
This gets us valid HTML which we can view via a browser, but we need to reduce everything to something that can display via standard output.
echo -en "$(curl -s --connect-timeout 10 "http://www.google.com/ig/calculator?hl=en&q=2%20m%20in%20feet")" | sed -e 's/<sup>/ &/g' -e :a -e 's/<[^>]*>//g;/</N;//ba' | perl -MHTML::Entities -ne 'print decode_entities($_)' | iconv -f ISO-8859-1 -t UTF-8
For the echo command, the -e interprets escapes such as \x3c, \x3e, and \x26 (<, >, and & respectively), while the -n suppresses the newline that echo would normally add.
The pipe to sed adds a space before all <sup> (superscript) tags and then removes all HTML tags.
The pipe to perl then decodes all the HTML entities, such as &#8260; to ⁄ (fraction slash). http://en.wikipedia.org/wiki/Html_special_characters#Character_entity_references_in_HTML
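For comparison, the sed and perl stages can be sketched in Python with only the standard library. This is an approximation of the pipeline above, not a drop-in replacement: strip_markup is a hypothetical name, and the input is assumed to have its \xHH escapes already decoded.

```python
import html
import re


def strip_markup(fragment: str) -> str:
    """Mimic the sed and perl stages on an already unescaped string."""
    spaced = fragment.replace("<sup>", " <sup>")  # space before superscripts
    no_tags = re.sub(r"<[^>]*>", "", spaced)      # drop every HTML tag
    return html.unescape(no_tags)                 # &#8260; becomes U+2044


fragment = "6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches"
print(strip_markup(fragment))
# prints: 6 feet 6 47⁄64 inches  (the slash is U+2044 FRACTION SLASH)
```

Note that the result contains the fraction slash character, not an ASCII "/", so it still needs a terminal and font that can display U+2044.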
The pipe to iconv converts the ISO-8859-1 output to the expected UTF-8. This is done last since the perl line can produce UTF-8 entities that will need to be properly converted.
This is still going to have issues with distinguishing between fractions and exponents (47/64 where 47 is wrapped in superscript tags and 64 is wrapped in subscript tags, and 10^13 where 13 is wrapped in superscript tags).
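One way to tell the two cases apart is to match the full sup/fraction-slash/sub pattern first and only then treat any remaining superscript as an exponent. The sketch below assumes the \xHH escapes have already been decoded and that numerators, denominators and exponents are plain digits; plain_math and the 10<sup>13</sup> input are illustrative, not taken from an actual response.

```python
import re

FRACTION = re.compile(r"<sup>(\d+)</sup>&#8260;<sub>(\d+)</sub>")
EXPONENT = re.compile(r"<sup>(\d+)</sup>")


def plain_math(fragment: str) -> str:
    # Fractions first, so their <sup> is not mistaken for an exponent.
    fragment = FRACTION.sub(r" \1/\2", fragment)
    return EXPONENT.sub(r"^\1", fragment)


print(plain_math("6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches"))
# prints: 6 feet 6 47/64 inches
print(plain_math("10<sup>13</sup>"))
# prints: 10^13
```

The order of the two substitutions is the whole trick: run them the other way round and the fraction comes out as 6^47 followed by a stray subscript.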
We could get super silly and make a really long sed line to parse all the special characters (the following is in AppleScript so you can see just how ridiculous the syntax gets):
set jsonResponse to do shell script "curl " & queryURL & " | sed -e 's/[†]/,/g' -e 's/\\\\x26#215;/*/g' -e 's/\\\\x26#188;/ 1\\/4/g' -e 's/\\\\x26#189;/ 1\\/2/g' -e 's/\\\\x26#190;/ 3\\/4/g' -e 's/\\\\x26#8539;/ 1\\/8/g' -e 's/\\\\x26#8540;/ 3\\/8/g' -e 's/\\\\x26#8541;/ 5\\/8/g' -e 's/\\\\x26#8542;/ 7\\/8/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e\\\\x26#8260;\\\\x3csub\\\\x3e\\([0-9]*\\)\\\\x3c\\/sub\\\\x3e/ \\1\\/\\2/g' -e 's/\\\\x3csup\\\\x3e\\([0-9]*\\)\\\\x3c\\/sup\\\\x3e/^\\1/' -e 's/( /(/g'"
The † (dagger) character is 160 in decimal within the MacRoman set (Macintosh encoding). In hexadecimal this is 0xA0 or \xA0, which is the byte for the Non-Breaking Space (U+00A0) in ISO-8859-1 (UTF-8 would encode it as the two bytes 0xC2 0xA0), and that byte is what Google is passing. So in AppleScript, in order to replace the Non-Breaking Space, we have to use the † (dagger) due to the Macintosh encoding.
- http://en.wikipedia.org/wiki/Mac_Roman#Codepage_layout
- http://en.wikipedia.org/wiki/UTF-8
- http://en.wikipedia.org/wiki/C1_Controls_and_Latin-1_Supplement
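The byte values behind the dagger trick can be checked quickly with Python's built-in codecs. This only shows how the byte 0xA0 is interpreted under each encoding; it says nothing about what AppleScript does internally.

```python
dagger = "\u2020"  # † DAGGER
nbsp = "\u00a0"    # NO-BREAK SPACE

print(dagger.encode("mac_roman"))             # b'\xa0'
print(nbsp.encode("latin-1"))                 # b'\xa0'
print(nbsp.encode("utf-8"))                   # b'\xc2\xa0'
print(b"\xa0".decode("mac_roman") == dagger)  # True
```

So the same single byte reads as a dagger under MacRoman and as a non-breaking space under ISO-8859-1.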
There are also several special fraction symbols that the sed line deals with: http://tlt.its.psu.edu/suggestions/international/bylanguage/mathchart.html#fractions
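The same fraction handling is easier to read as a lookup table. The sketch below covers exactly the entity numbers that appear in the sed line (188 to 190 and 8539 to 8542); the function name and the "2&#189; cups" input are made up for the example.

```python
import re

# Entity numbers handled by the sed line, mapped to plain-text fractions.
VULGAR_FRACTIONS = {
    188: "1/4", 189: "1/2", 190: "3/4",
    8539: "1/8", 8540: "3/8", 8541: "5/8", 8542: "7/8",
}


def expand_fraction_entities(text: str) -> str:
    def repl(match: re.Match) -> str:
        code = int(match.group(1))
        if code in VULGAR_FRACTIONS:
            return " " + VULGAR_FRACTIONS[code]
        return match.group(0)  # leave unknown entities untouched

    return re.sub(r"&#(\d+);", repl, text)


print(expand_fraction_entities("2&#189; cups"))
# prints: 2 1/2 cups
```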
The moral of the story is that when dealing with JSON, just use a good JSON parser.
A sub-moral is: don't use AppleScript to deal with JSON.
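As a rough illustration of the parser route: the response shown in the question is not strict JSON (the keys are unquoted and \xHH is not a JSON escape), so a strict parser needs a little help first. The sketch below is one possible workaround using Python's json module. parse_calculator_response is a hypothetical helper, and it assumes the keys are simple words and that no string value contains text that looks like a bare key.

```python
import json
import re


def parse_calculator_response(raw: str) -> dict:
    """Coerce the JSON-like reply into strict JSON, then parse it."""
    # Quote bare keys such as lhs: and rhs: (assumes simple word keys).
    quoted = re.sub(r'([{,])\s*([A-Za-z_]\w*)\s*:', r'\1"\2":', raw)
    # JSON has no \xHH escape; rewrite each one as \u00HH.
    strict = re.sub(r"\\x([0-9a-fA-F]{2})", r"\\u00\1", quoted)
    return json.loads(strict)


# Response copied from the question.
raw = r'{lhs: "2 meters",rhs: "6.56167979 feet (6 feet 6\x3csup\x3e47\x3c/sup\x3e\x26#8260;\x3csub\x3e64\x3c/sub\x3e inches)",error: "",icc: false}'
reply = parse_calculator_response(raw)
print(reply["lhs"])
# prints: 2 meters
print(reply["rhs"])
# prints: 6.56167979 feet (6 feet 6<sup>47</sup>&#8260;<sub>64</sub> inches)
```

The parsed rhs value is still HTML, so the tag and entity handling discussed above applies to it afterwards.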