JSON array field handling with the Elephant-Bird UDF in Pig
Question
A quick question about JSON handling in Pig.
I tried the JsonLoader from Elephant-Bird to load and process JSON data like the following:
{
"SV":1,
"AD":[
{
"ID":"46931606",
"C1":"46",
"C2":"469",
"ST":"46931",
"PO":1
},
{
"ID":"46721489",
"C1":"46",
"C2":"467",
"ST":"46721",
"PO":5
}
]
}
The loader works well for simple fields, but not for array fields. How can I access the elements of the array (the "AD" field above), either with this UDF or in some other way? Please advise.
Recommended answer
Use the -nestedLoad parameter, like this:
a = load 'input' using com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
Then use the following code:
b = FOREACH a GENERATE (json#'AD') as AD:bag{t:Tuple(m:map[])};
Your JSON array then becomes a bag. You can flatten it to get one tuple per array element:
c = FOREACH b GENERATE FLATTEN(AD);
d = FOREACH c GENERATE AD::m#'ID' AS ID, AD::m#'C1' AS C1, AD::m#'C2' AS C2, AD::m#'ST' AS ST, AD::m#'PO' AS PO;
At this point, each row is a tuple whose schema is (ID:bytearray, C)
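To make the data shapes easier to follow, here is a minimal plain-Python sketch (not Pig, and not part of Elephant-Bird) that mimics what relations a through d hold for the sample record from the question. The variable names mirror the Pig aliases. The way the bag is modelled, as a list of one-field tuples, is an assumption made for illustration only.

```python
import json

# Sample record from the question.
record = json.loads("""
{
  "SV": 1,
  "AD": [
    {"ID": "46931606", "C1": "46", "C2": "469", "ST": "46931", "PO": 1},
    {"ID": "46721489", "C1": "46", "C2": "467", "ST": "46721", "PO": 5}
  ]
}
""")

# a: with nested loading, each input row is a single map (json:map[]).
a = [record]

# b: json#'AD' cast to bag{t:tuple(m:map[])}, modelled here as a
#    list of one-field tuples, each holding one map.
b = [[(m,) for m in row["AD"]] for row in a]

# c: FLATTEN(AD) emits one output row per tuple in the bag.
c = [t for bag in b for t in bag]

# d: project the individual keys out of each map (m#'ID', m#'C1', ...).
keys = ["ID", "C1", "C2", "ST", "PO"]
d = [tuple(t[0][k] for k in keys) for t in c]

for row in d:
    print(row)
```

In Pig itself, values pulled out of an untyped map come back as bytearray unless you cast them, so the Python types shown here are only an approximation.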