How do I parse JSON in Pig?


Question

I have a lot of gzip'd log files in S3 that contain 3 types of log lines: b, c, and i. Types i and c are both single-level JSON:

{"this":"that","test":"4"}

Type b is deeply nested JSON. I came across a gist that talks about compiling a jar to make this work. Since my Java skills are less than stellar, I didn't really know what to do from there.

{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}

Since the fields in types i and c are not always in the same order, specifying everything in the generate regex is difficult. Is handling JSON (in a gzip'd file) possible with Pig? I am using whichever version of Pig comes built into an Amazon Elastic MapReduce instance.
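To illustrate why a JSON parser avoids the ordering problem that a regex runs into, here is a minimal Python sketch (not Pig). The second input line is a hypothetical reordering of the sample line above; the variable names are made up for illustration.

```python
import json

# The sample line from the question, and a hypothetical reordering of it.
line_a = '{"this":"that","test":"4"}'
line_b = '{"test":"4","this":"that"}'

# Parsing yields key/value mappings, so field order in the raw text is irrelevant.
record_a = json.loads(line_a)
record_b = json.loads(line_b)

# Both lines produce the same record, and fields are looked up by name.
same = record_a == record_b
print(same, record_b["this"])
```

A positional regex would need one pattern per ordering, whereas a lookup by key works on either line.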

This boils down to two questions: 1) Can I parse JSON with Pig (and if so, how)? 2) If I can parse JSON (from a gzip'd log file), can I parse nested JSON objects?

Recommended answer

After a lot of workarounds and working through things, I was able to get this done. I did a write-up on my blog about how to do it. It is available here: http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/
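The write-up itself is not reproduced here. As a rough illustration of the logic involved, and not necessarily the approach the blog post takes, the following Python sketch reads gzip'd JSON lines and flattens nested objects into single-level records with dotted keys. The function names `flatten` and `parse_gzip_json_lines` are hypothetical, and the in-memory buffer stands in for a real log file. Code along these lines would still have to be wired into Pig separately, for example as a streaming script or UDF, which is not shown.

```python
import gzip
import io
import json


def flatten(value, prefix=""):
    """Flatten nested dicts into one single-level dict with dotted keys."""
    flat = {}
    for key, item in value.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(item, dict):
            # Recurse into nested objects, carrying the key path along.
            flat.update(flatten(item, path))
        else:
            flat[path] = item
    return flat


def parse_gzip_json_lines(stream):
    """Yield one flattened dict per non-empty JSON line in a gzip'd stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield flatten(json.loads(line))


# Build a small gzip'd log in memory from the two sample lines in the question.
sample = (
    '{"this":"that","test":"4"}\n'
    '{"this":{"foo":"bar","baz":{"test":"me"},"total":"5"}}\n'
)
buffer = io.BytesIO(gzip.compress(sample.encode("utf-8")))

records = list(parse_gzip_json_lines(buffer))
for record in records:
    print(record)
```

Single-level lines (types i and c) pass through unchanged, while the nested type b line becomes a record with keys such as `this.baz.test`.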
