Loading JSON with varying schema into Pig
Question
I ran into an issue loading a set of JSON documents into Pig. I have many JSON documents that all vary in the fields they contain. The fields I need are present in most documents, and where a field is missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12, straight from the Apache git repository) to be sure this hasn't been solved yet.
What I have is a JSON document like this:
{"foo":1,"bar":2,"baz":3}
When I load it into Pig like this:
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected result:
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However, when the fields are listed in a different order in the schema:
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
It should be:
(3,2,1)
Apparently the field names in the schema definition have nothing to do with the field names in the JSON: the values are assigned by position, not by name.
What I need is to load specific fields from a JSON file (with embedded documents!) into Pig.
How can I solve this?
Recommended answer
I think this is a known issue even in the latest version of Pig, so there isn't an easy way around it other than using a more capable JSON loader.
Use the Elephant Bird JsonLoader instead, which will behave the way you expect; in other words, it respects field ordering.
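As a rough sketch of what that could look like (not tested against the asker's setup): to my knowledge the Elephant Bird loader reads each JSON document into a Pig map, so fields are looked up by name and a key that is absent from a document comes back as null. The jar paths below are placeholders, and the exact set of jars to register depends on the Elephant Bird version you use. The `-nestedLoad` option is assumed here so that embedded documents are loaded as nested maps.

```pig
-- Placeholder paths: point these at the Elephant Bird and json-simple jars
-- available in your environment (names and versions are assumptions).
REGISTER '/path/to/elephant-bird-core.jar';
REGISTER '/path/to/elephant-bird-pig.jar';
REGISTER '/path/to/elephant-bird-hadoop-compat.jar';
REGISTER '/path/to/json-simple.jar';

-- Each input line becomes a single map; '-nestedLoad' asks the loader to
-- keep embedded documents as nested maps instead of flattening them to strings.
Raw = LOAD 'test.json'
      USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
      AS (json:map[]);

-- Fields are picked by name, so their order in the file does not matter.
-- A key that is missing from a given document yields null.
Json2 = FOREACH Raw GENERATE
            (int) json#'baz' AS baz,
            (int) json#'bar' AS bar,
            (int) json#'foo' AS foo;

DESCRIBE Json2;
DUMP Json2;
```

Because the lookup is by key, the order in which you list the fields in the FOREACH determines the column order of the output, independent of how the keys appear in each document.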