Parsing JSON with UNIX tools
Question
I'm trying to parse JSON returned from a curl request, like so:
curl 'http://twitter.com/users/username.json' |
sed -e 's/[{}]/''/g' |
awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]}'
The above splits the JSON into fields, for example:
% ...
"geo_enabled":false
"friends_count":245
"profile_text_color":"000000"
"status":"in_reply_to_screen_name":null
"source":"web"
"truncated":false
"text":"My status"
"favorited":false
% ...
How do I print a specific field (denoted by the -v k=text)?
Recommended answer
To quickly extract the values for a particular key, I personally like to use "grep -o", which only returns the regex's match. For example, to get the "text" field from tweets, something like:
grep -Po '"text":.*?[^\\]",' tweets.json
This regex is more robust than you might think; for example, it deals fine with strings having embedded commas and escaped quotes inside them. I think with a little more work you could make one that is actually guaranteed to extract the value, if it's atomic. (If it has nesting, then a regex can't do it of course.)
And to further clean (albeit keeping the string's original escaping) you can use something like: | perl -pe 's/"text"://; s/^"//; s/",$//'. (I did this for this analysis.)
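As a way to experiment with the pattern and the clean-up steps above, here is a minimal Python sketch. It is not part of the original answer: the function names and sample lines are made up for illustration, and STRICT_PATTERN is an assumed example of what "a little more work" might look like, not the author's regex. Both patterns assume compact JSON with no whitespace after the colon.

```python
import re

# Same pattern as the grep -Po call above: lazy match up to a quote+comma
# that is not directly preceded by a backslash.
ORIGINAL_PATTERN = re.compile(r'"text":.*?[^\\]",')

# Hypothetical stricter variant (an assumption, not from the answer):
# a JSON string body is any run of non-quote, non-backslash characters
# or backslash-escaped characters.
STRICT_PATTERN = re.compile(r'"text":"((?:[^"\\]|\\.)*)"')


def extract_original(line):
    """Apply the original pattern, then the same clean-up as the perl one-liner."""
    match = ORIGINAL_PATTERN.search(line)
    if match is None:
        return None
    value = match.group(0)
    value = re.sub(r'"text":', '', value, count=1)  # s/"text"://
    value = re.sub(r'^"', '', value)                # s/^"//
    value = re.sub(r'",$', '', value)               # s/",$//
    return value


def extract_strict(line):
    """Apply the stricter variant; escapes are kept, as with the original."""
    match = STRICT_PATTERN.search(line)
    return match.group(1) if match else None


# Embedded comma and escaped quotes: both patterns agree.
tweet = r'{"id":1,"text":"He said \"hi\", then left","favorited":false}'
print(extract_original(tweet))
print(extract_strict(tweet))

# A value ending in an escaped backslash: the original pattern runs past
# the end of the value, the stricter one stops at the closing quote.
tricky = r'{"text":"C:\\","source":"web","id":2}'
print(extract_original(tricky))
print(extract_strict(tricky))
```

Neither pattern handles nested values, and both would also match a "text" key inside a nested object, so this remains a quick-look tool rather than a parser.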
To all the haters who insist you should use a real JSON parser -- yes, that is essential for correctness, but
- To do a really quick analysis, like counting values to check on data cleaning bugs or get a general feel for the data, banging out something on the command line is faster. Opening an editor to write a script is distracting.
- grep -o is orders of magnitude faster than the Python standard json library, at least when doing this for tweets (which are ~2 KB each). I'm not sure if this is just because json is slow (I should compare to yajl sometime); but in principle, a regex should be faster since it's finite state and much more optimizable, instead of a parser that has to support recursion, and in this case, spends lots of CPU building trees for structures you don't care about. (If someone wrote a finite state transducer that did proper (depth-limited) JSON parsing, that would be fantastic! In the meantime we have "grep -o".)
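To make the trade-off in the two points above concrete, here is a rough Python sketch of the same "count the values of one key" task done both ways. The sample records and function names are invented for illustration, and the regex is an assumed pattern for compact, one-record-per-line JSON with string values; no timing claim is made here.

```python
import json
import re
from collections import Counter

# Made-up sample data: one compact JSON object per line.
SAMPLE_LINES = [
    '{"id":1,"source":"web","text":"first"}',
    '{"id":2,"source":"mobile","text":"second"}',
    '{"id":3,"source":"web","text":"third"}',
]


def count_with_regex(lines, key):
    """Count string values of `key` without parsing the whole record."""
    # Assumed pattern: matches "key":"..." and allows backslash escapes.
    pattern = re.compile('"' + re.escape(key) + r'":"((?:[^"\\]|\\.)*)"')
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            # The value is kept exactly as written, escapes included.
            counts[match.group(1)] += 1
    return counts


def count_with_json(lines, key):
    """Count values of top-level `key` using a real parser."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)  # builds the full structure for each record
        if key in record:
            counts[record[key]] += 1
    return counts


print(count_with_regex(SAMPLE_LINES, "source"))
print(count_with_json(SAMPLE_LINES, "source"))
```

On simple records like these the two agree. They can differ when a value contains escapes (the regex keeps them undecoded) or when the key also appears inside a nested object (the regex does not know about nesting).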
To write maintainable code, I always use a real parsing library. I haven't tried jsawk, but if it works well, that would address point #1.
One last, wackier, solution: I wrote a script that uses Python json and extracts the keys you want, into tab-separated columns; then I pipe through a wrapper around awk that allows named access to columns. In here: the json2tsv and tsvawk scripts. So for this example it would be:
json2tsv id text < tweets.json | tsvawk '{print "tweet " $id " is: " $text}'
This approach doesn't address #2, is more inefficient than a single Python script, and it's a little brittle: it forces normalization of newlines and tabs in string values, to play nice with awk's field/record-delimited view of the world. But it does let you stay on the command line, with more correctness than grep -o.
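For readers who want a feel for the normalization step described above, here is a small Python sketch of the idea. It is not the author's json2tsv script: the function name to_tsv_row, the choice to replace tabs and newlines with spaces, and the handling of missing keys are all assumptions made for illustration.

```python
import json


def to_tsv_row(json_line, keys):
    """Turn one JSON record into a tab-separated row of the requested keys."""
    record = json.loads(json_line)
    cells = []
    for key in keys:
        value = record.get(key, "")  # assumption: missing keys become empty cells
        # Strings are used as-is; other values are re-serialized as JSON.
        text = value if isinstance(value, str) else json.dumps(value)
        # Tabs and newlines would break awk's field/record splitting,
        # so replace them with spaces (an assumed normalization).
        for ch in ("\t", "\n", "\r"):
            text = text.replace(ch, " ")
        cells.append(text)
    return "\t".join(cells)


# Made-up sample record whose text contains a newline and a tab.
sample = '{"id":7,"text":"line one\\nline\\ttwo","favorited":false}'
print(to_tsv_row(sample, ["id", "text"]))
```

The normalization is lossy: once a newline in a tweet has been replaced, the original text cannot be recovered from the TSV, which is the brittleness the answer mentions.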