Loading json with varying schema into PIG
I ran into an issue loading a set of json documents into PIG. I have a lot of json documents that vary in the fields they contain. The fields I need are present in most documents, and where they are missing I would like to get a null value.
I just downloaded and compiled the latest Pig version (0.12, straight from the apache git repository) to check whether this had already been fixed; it has not.
What I have is a json document like this:
{"foo":1,"bar":2,"baz":3}
When I load this into PIG using this
Json1 = LOAD 'test.json' USING JsonLoader('foo:int,bar:int,baz:int');
DESCRIBE Json1;
DUMP Json1;
I get the expected results
Json1: {foo: int,bar: int,baz: int}
(1,2,3)
However, when the fields are listed in a different order in the schema:
Json2 = LOAD 'test.json' USING JsonLoader('baz:int,bar:int,foo:int');
DESCRIBE Json2;
DUMP Json2;
I get an undesired result:
Json2: {baz: int,bar: int,foo: int}
(1,2,3)
That should have been
(3,2,1)
Apparently the field names in the schema definition have nothing to do with the field names in the json; the values are assigned by position.
What I need is to load specific fields from a json file (with embedded documents!) into PIG.
How do I resolve this?
I think this is a known issue even in the latest version of Pig, so there isn't an easy way around it other than to use a more capable JsonLoader.
Use the Elephant Bird JsonLoader instead, which will behave the way you expect. In other words, fields are looked up by name, so the order in which you list them does not matter.
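A minimal sketch of that approach, assuming the Elephant Bird jars (and the json-simple jar they depend on) are available locally. The jar paths below are placeholders, not real file names, and the relation names Json3 and Fields are made up for illustration. As far as I know, Elephant Bird's com.twitter.elephantbird.pig.load.JsonLoader loads each json document as a single map, so fields are fetched by key and a key that is missing from a document comes back as null:

```pig
-- Placeholder paths (assumption): point these at the jars of your Elephant Bird version.
REGISTER '/path/to/elephant-bird-core.jar';
REGISTER '/path/to/elephant-bird-pig.jar';
REGISTER '/path/to/elephant-bird-hadoop-compat.jar';
REGISTER '/path/to/json-simple.jar';

-- Each input line becomes one tuple holding a single map.
-- '-nestedLoad' asks the loader to parse embedded documents as nested maps as well.
Json3 = LOAD 'test.json'
        USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
        AS (json:map[]);

-- Pick fields by name, in whatever order is needed; absent keys yield null.
Fields = FOREACH Json3 GENERATE
             (int) json#'baz' AS baz,
             (int) json#'bar' AS bar,
             (int) json#'foo' AS foo;

DESCRIBE Fields;
DUMP Fields;
```

With the sample document above, this should return the values in the order requested in the FOREACH rather than in the order they appear in the file. A field inside an embedded document can be reached by chaining lookups, for example json#'outer'#'inner', where outer and inner are hypothetical key names.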