How to load a file with a JSON array per line in Pig Latin


Question

An existing script creates text files with an array of JSON objects per line, e.g.,

[{"foo":1,"bar":2},{"foo":3,"bar":4}]
[{"foo":5,"bar":6},{"foo":7,"bar":8},{"foo":9,"bar":0}]
…

I would like to load this data in Pig, exploding the arrays and processing each individual object.

I have looked at using the JsonLoader in Twitter’s Elephant Bird to no avail. It doesn’t complain about the JSON, but I get "Successfully read 0 records" when running the following:

register '/tmp/elephant-bird/core/target/elephant-bird-core-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/hadoop-compat/target/elephant-bird-hadoop-compat-4.3-SNAPSHOT.jar';
register '/tmp/elephant-bird/pig/target/elephant-bird-pig-4.3-SNAPSHOT.jar';
register '/usr/local/lib/json-simple-1.1.1.jar';

a = load '/path/to/file.json' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true');
dump a;

I have also tried loading the file as normal, treating each line as a single chararray column, and then parsing that column as JSON, but I can't find a pre-existing UDF that does the trick.

Any ideas?

Recommended answer

Like Donald said, you should use a UDF here. At Xplenty, we wrote JsonStringToBag to complement Elephant Bird's JsonStringToMap.
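To illustrate the UDF approach in general terms, here is a minimal sketch of a Python (Jython) UDF that turns one line holding a JSON array into a bag of tuples. It is not Xplenty's JsonStringToBag, whose implementation is not shown here. The function name `json_array_to_bag`, the fixed `foo`/`bar` schema, and the choice to return null for unparseable lines are all assumptions made for this example. It also assumes that the Jython runtime bundled with your Pig version provides the `json` module.

```python
import json

# Pig's Jython engine injects an `outputSchema` decorator when it registers
# the script. Outside Pig the name is undefined, so fall back to a no-op
# decorator (an assumption made so the file can also be run standalone).
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


# Hypothetical schema: it assumes every object carries integer `foo` and `bar`.
@outputSchema("objects:{t:(foo:int, bar:int)}")
def json_array_to_bag(line):
    """Parse one line holding a JSON array of objects into a list of (foo, bar) tuples."""
    if line is None:
        return None
    try:
        items = json.loads(line)
    except ValueError:
        # Malformed line: return null instead of failing the whole job.
        return None
    if not isinstance(items, list):
        return None
    # Pig treats a Python list of tuples as a bag; missing keys become null.
    return [(obj.get("foo"), obj.get("bar")) for obj in items if isinstance(obj, dict)]
```

Assuming the file is saved as json_udfs.py (a hypothetical name), it could be registered with `register 'json_udfs.py' using jython as json_udfs;`. Each line would then be loaded as a single chararray, for example with TextLoader, and exploded with `foreach a generate flatten(json_udfs.json_array_to_bag(line));`.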
