Loading Raw JSON into Pig
Question
I have a file where each line is a JSON object (actually, it's a dump of Stack Overflow). I would like to load this into Apache Pig as easily as possible, but I am having trouble figuring out how to tell Pig what the input format is. Here's an example of an entry:
{
"_id" : { "$oid" : "506492073401d91fa7fdffbe" },
"Body" : "....",
"ViewCount" : 7351,
"LastEditorDisplayName" : "Rich B",
"Title" : ".....",
"LastEditorUserId" : 140328,
"LastActivityDate" : { "$date" : 1314819738077 },
"LastEditDate" : { "$date" : 1313882544213 },
"AnswerCount" : 12, "CommentCount" : 19,
"AcceptedAnswerId" : 7,
"Score" : 83,
"PostTypeId" : "question",
"OwnerUserId" : 8,
"Tags" : [ "c#", "winforms" ],
"CreationDate" : { "$date" : 1217540572667 },
"FavoriteCount" : 13, "Id" : 4,
"ForumName" : "stackoverflow.com"
}
Is there a way I can load a file where each line is one of the above into Pig without having to specify the schema by hand? Or perhaps a way to automatically generate a schema based on the (possibly nested) keys observed in all objects? If I do need to specify the schema by hand, what would the schema string look like?
Thanks!
Recommended answer
The quick and easy way: use Twitter's elephant-bird project. It includes a loader called com.twitter.elephantbird.pig.load.JsonLoader. When used directly like so,
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader() as (json:map[]);
B = FOREACH A GENERATE json#'fieldName' AS field_name;
nested elements won't be loaded. However, you can easily fix that (if desired) by changing it to,
A = LOAD '/path/to/data.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
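As a rough illustration of what nested loading buys you, the sketch below pulls the "Tags" array out of the sample entry. It is untested and assumes that, with '-nestedLoad', the loader returns a JSON array as a bag of one-field tuples; the relation and alias names (posts, tagged, title, tag) are made up for the example.

```pig
-- Assumes the elephant-bird jar(s) have already been registered.
posts = LOAD '/path/to/data.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
        AS (json:map[]);

-- "Tags" is a JSON array in the sample entry. The cast assumes the loader
-- hands it back as a bag of one-field tuples; FLATTEN then emits one
-- (title, tag) row per tag.
tagged = FOREACH posts GENERATE
    (chararray) json#'Title' AS title,
    FLATTEN((bag{tuple(chararray)}) json#'Tags') AS tag;

DUMP tagged;
```

If the cast does not match what your elephant-bird version produces, running DESCRIBE and DUMP on a handful of rows is a quick way to see the actual shape.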
Including elephant-bird is easy: pull in the project "elephant-bird" with organization "com.twitter.elephantbird" using Maven (or an equivalent dependency manager), then issue the usual register command in Pig:
register 'lib/elephantbird.jar';
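Putting the pieces together, a minimal end-to-end script might look like the sketch below. The field names come from the sample entry in the question; the relation names (posts, questions, popular) and the score threshold are arbitrary choices for illustration. Depending on the elephant-bird version, the loader may need more than one jar on the classpath (for example a JSON parsing library), so treat the single REGISTER line as a placeholder.

```pig
-- Jar path taken from the answer; adjust to wherever your build puts the jar(s).
REGISTER 'lib/elephantbird.jar';

-- Each input line becomes a single map keyed by the JSON field names.
posts = LOAD '/path/to/data.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader()
        AS (json:map[]);

-- Map values are untyped, so cast each field to the type you expect.
questions = FOREACH posts GENERATE
    (chararray) json#'Title'     AS title,
    (int)       json#'ViewCount' AS view_count,
    (int)       json#'Score'     AS score;

-- Arbitrary threshold, only to show the typed fields in use.
popular = FILTER questions BY score > 50;
DUMP popular;
```

This way no schema string has to be written up front: the only declared type is the map itself, and individual fields are typed at the point of use.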