Loading Raw JSON into Pig


Question

I have a file where each line is a JSON object (it is actually a dump of Stack Overflow). I would like to load it into Apache Pig as easily as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here is an example entry:

{ 
"_id" : { "$oid" : "506492073401d91fa7fdffbe" }, 
"Body" : "....", 
"ViewCount" : 7351, 
"LastEditorDisplayName" : "Rich B", 
"Title" : ".....", 
"LastEditorUserId" : 140328, 
"LastActivityDate" : { "$date" : 1314819738077 }, 
"LastEditDate" : { "$date" : 1313882544213 }, 
"AnswerCount" : 12, "CommentCount" : 19, 
"AcceptedAnswerId" : 7, 
"Score" : 83, 
"PostTypeId" : "question", 
"OwnerUserId" : 8, 
"Tags" : [ "c#", "winforms" ], 
"CreationDate" : { "$date" : 1217540572667 }, 
"FavoriteCount" : 13, "Id" : 4, 
"ForumName" : "stackoverflow.com" 
}

Is there a way to load a file in which each line is one such object into Pig without specifying the schema by hand? Or is there a way to generate a schema automatically from the (possibly nested) keys observed across all objects? If I do need to specify the schema by hand, what would the schema string look like?

Thanks!

Recommended answer

The quick and easy way: use Twitter's elephant-bird project. It includes a loader called com.twitter.elephantbird.pig.load.JsonLoader. When used directly like so,

A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
B = FOREACH A GENERATE json#'fieldName' AS field_name;

nested elements won't be loaded. If you need them, you can easily fix that by passing the -nestedLoad option to the loader:

A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');

Including elephant-bird is easy: pull the project "elephant-bird" with organization "com.twitter.elephantbird" using Maven (or an equivalent dependency manager), then issue the usual register command in Pig:

register 'lib/elephantbird.jar';
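
The question also asks about discovering the (possibly nested) keys observed across all objects. Because the map-based loader above needs field names for lookups such as json#'fieldName', it can help to list those keys first. The following is a minimal sketch in Python, independent of Pig and elephant-bird, assuming one JSON object per line. The function name collect_key_paths and the dotted-path notation are hypothetical choices made for illustration, and the second sample line is made up.

```python
import json


def collect_key_paths(lines):
    """Map each dotted key path to the set of Python type names seen for it."""
    paths = {}

    def walk(value, prefix):
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key)
        else:
            # Leaves (including arrays) are recorded by their Python type name.
            paths.setdefault(prefix, set()).add(type(value).__name__)

    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            walk(json.loads(line), "")
    return paths


# First line reuses fields from the example entry; the second is invented.
sample = [
    '{"Id": 4, "Tags": ["c#", "winforms"], "_id": {"$oid": "506492073401d91fa7fdffbe"}}',
    '{"Id": 6, "Score": 83, "CreationDate": {"$date": 1217540572667}}',
]

for path, types in sorted(collect_key_paths(sample).items()):
    print(path, sorted(types))
```

To run it over a real dump, pass an open file handle instead of the sample list, since iterating over a file yields one line at a time. The resulting paths only tell you which keys exist and what kinds of values they hold; turning them into a Pig schema string or a set of field projections is still a manual step.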
