从Pig Latin中的复杂JSON对象中仅加载一些值 [英] Load only a few values from complex JSON object in Pig Latin
问题描述
我有一个复杂的JSON文件,如下所示: http://pastebin.com/4UfadbqS >
我只想使用Pig Latin从这些JSON对象中加载几个值.我尝试这样做:
mydata = LOAD 'data.json'
USING JsonLoader('id:chararray, created_at:chararray,
user: {(language:chararray)}’);
STORE mydata
INTO 'output';
但是似乎Pig Latin只是从JSON中获取前三个值并保存它们(它不将列名识别为键).有没有办法做到这一点?或者我应该只列出Pig中来自JSON的所有值,然后对它们进行过滤?
上述方法几乎没有问题
1. JsonLoader将始终期望您输入的完整模式,但是您只提供了三个字段.
2. JsonLoader始终希望整个输入为单行,但是您的输入为多行.
3. JsonLoader将不支持嵌套模式,但是您的输入包含嵌套模式.
要解决所有上述问题,请使用第三方库elephant-bird jar
.
从此链接下载( elephant-bird-pig-4.1.jar和Elephant-bird-hadoop-compat-4.1.jar )jar文件
http://www.java2s.com/Code/Jar/e/elephant.htm 并尝试以下方法
我复制了您的整个输入,并将其格式化为一行,如下所示.
input.json
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT @BulleyBufton: @MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_stt
atus":{"filter_level":"low","contributors":null,"text":"@MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 httpp
://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. @ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT,myMap#'user' AS User;
DUMP B;
输出:
(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])
在elephantbird library
中,所有值均为stored as key/value pair(ie MAP datatype)
,因此可以很容易地从加载的数据中提取所需的字段.
在上面的脚本中,我根据您的需要提取了'id','created_at' and 'user'
的值.
假设您要从用户"数据(ex: 'friends_count' and 'followers_count'
)中提取某些字段,在这种情况下,您需要投影'user'
字段并提取所需的数据.下面的示例代码.
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE 'user' AS User;
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;
输出:
(455,311)
I have a complex JSON file that looks like this: http://pastebin.com/4UfadbqS
I would like to load only several values from these JSON objects using Pig Latin. I tried doing that like this:
mydata = LOAD 'data.json'
USING JsonLoader('id:chararray, created_at:chararray,
user: {(language:chararray)}’);
STORE mydata
INTO 'output';
But it seems that Pig Latin is just taking the first 3 values from the JSON and saving them (it does not recognize the column name as a key). Is there a way to achieve this? OR should I just list ALL the values from JSON in a Pig and filter them after that?
There are few problems in the above approach
1. JsonLoader will always expect the full schema of your input but you gave only three fields.
2. JsonLoader will always expect the entire input as a single line but your input is multiline.
3. JsonLoader will not support nested schema but your input contains nested schema.
To solve all the above problems you have use the thirdparty library elephant-bird jar
.
Download the (elephant-bird-pig-4.1.jar and elephant-bird-hadoop-compat-4.1.jar) jar file from this link
http://www.java2s.com/Code/Jar/e/elephant.htm and try the below approach
I copied your entire input and formatted as a single line as below.
input.json
{"filter_level":"medium","retweeted":false,"in_reply_to_screen_name":null,"possibly_sensitive":false,"truncated":false,"lang":"en","in_reply_to_status_id_str":null,"id":488927960280211456,"in_reply_to_user_id_str":null,"in_reply_to_status_id":null,"created_at":"Tue Jul 15 06:08:04 +0000 2014","favorite_count":0,"place":null,"coordinates":null,"text":"RT @BulleyBufton: @MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 http\u2026","contributors":null,"retweeted_stt
atus":{"filter_level":"low","contributors":null,"text":"@MinaANDMaya PLEASE RT /VOTE BULLEY. Last day to help me win my old rescue @HilbraesDogs £5k https://t.co/Y8g47fLYY1 httpp
://t.co/DDco9wVXtP","geo":null,"retweeted":false,"in_reply_to_screen_name":"MinaANDMaya","possibly_sensitive":false,"truncated":false,"lang":"en","entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[93,116],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","indices":[117,139],"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[0,12],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[75,88],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"in_reply_to_status_id_str":null,"id":488926827394904064,"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","in_reply_to_user_id_str":"132204038","favorited":false,"in_reply_to_status_id":null,"retweet_count":6,"created_at":"Tue Jul 15 06:03:34 +0000 2014","in_reply_to_user_id":132204038,"favorite_count":3,"id_str":"488926827394904064","place":null,"user":{"location":"CHICAGO , USA","default_profile":false,"statuses_count":8868,"profile_background_tile":true,"lang":"en","profile_link_color":"AD54E8","profile_banner_url":"https://pbs.twimg.com/profile_banners/225136520/1403608773","id":225136520,"following":null,"favourites_count":5082,"protected":false,"profile_text_color":"3D1957","verified":false,"description":"I'm Bulley, I'm proof that there is always hope.\r\nI was in rescue kennels in UK for 9yrs. @ada_bscakes took me in.\r\nWe've moved to America to start a new life.","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"BULLEY","profile_background_color":"0A0A0A","created_at":"Fri Dec 10 19:55:17 +0000 2010","default_profile_image":false,"followers_count":3421,"profile_image_url_https":"https://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","geo_enabled":true,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/378800000166829702/isbjd7O4.jpeg","follow_request_sent":null,"url":null,"utc_offset":-39600,"time_zone":"International Date Line West","notifications":null,"profile_use_background_image":true,"friends_count":3702,"profile_sidebar_fill_color":"7AC3EE","screen_name":"BulleyBufton","id_str":"225136520","profile_image_url":"http://pbs.twimg.com/profile_images/486614595457789952/gtcLac9w_normal.jpeg","listed_count":29,"is_translator":false},"coordinates":null},"geo":null,"entities":{"trends":[],"symbols":[],"urls":[{"expanded_url":"https://www.animalfriendsquote.co.uk/fb-worldcup/","indices":[111,134],"display_url":"animalfriendsquote.co.uk/fb-worldcup/","url":"https://t.co/Y8g47fLYY1"}],"hashtags":[],"media":[{"sizes":{"thumb":{"w":150,"resize":"crop","h":150},"small":{"w":340,"resize":"fit","h":455},"large":{"w":706,"resize":"fit","h":946},"medium":{"w":600,"resize":"fit","h":803}},"id":488926730481332224,"media_url_https":"https://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","media_url":"http://pbs.twimg.com/media/BskERVuIcAAJZGu.jpg","expanded_url":"http://twitter.com/BulleyBufton/status/488926827394904064/photo/1","source_status_id_str":"488926827394904064","indices":[139,140],"source_status_id":488926827394904064,"id_str":"488926730481332224","type":"photo","display_url":"pic.twitter.com/DDco9wVXtP","url":"http://t.co/DDco9wVXtP"}],"user_mentions":[{"id":225136520,"name":"BULLEY","indices":[3,16],"screen_name":"BulleyBufton","id_str":"225136520"},{"id":132204038,"name":"Mina*Bad Yoga Kitty*","indices":[18,30],"screen_name":"MinaANDMaya","id_str":"132204038"},{"id":2308374684,"name":"Julianna Kaminski","indices":[93,106],"screen_name":"HilbraesDogs","id_str":"2308374684"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android<\/a>","favorited":false,"in_reply_to_user_id":null,"retweet_count":0,"id_str":"488927960280211456","user":{"location":"","default_profile":false,"statuses_count":1370,"profile_background_tile":true,"lang":"zh-tw","profile_link_color":"038544","profile_banner_url":"https://pbs.twimg.com/profile_banners/2272804116/1404662156","id":2272804116,"following":null,"favourites_count":2000,"protected":false,"profile_text_color":"333333","verified":false,"description":"No More Sorrow","contributors_enabled":false,"profile_sidebar_border_color":"000000","name":"Winnie","profile_background_color":"14DBBA","created_at":"Thu Jan 02 10:13:01 +0000 2014","default_profile_image":false,"followers_count":311,"profile_image_url_https":"https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","geo_enabled":false,"profile_background_image_url":"http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","profile_background_image_url_https":"https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg","follow_request_sent":null,"url":null,"utc_offset":null,"time_zone":null,"notifications":null,"profile_use_background_image":true,"friends_count":455,"profile_sidebar_fill_color":"DDEEF6","screen_name":"winnie341881","id_str":"2272804116","profile_image_url":"http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg","listed_count":0,"is_translator":false}}
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE myMap#'id' AS ID,myMap#'created_at' AS createdAT,myMap#'user' AS User;
DUMP B;
Output:
(488927960280211456,Tue Jul 15 06:08:04 +0000 2014,[location#,default_profile#false,profile_background_tile#true,statuses_count#1370,lang#zh-tw,profile_link_color#038544,profile_banner_url#https://pbs.twimg.com/profile_banners/2272804116/1404662156,id#2272804116,following#,protected#false,favourites_count#2000,profile_text_color#333333,contributors_enabled#false,description#No More Sorrow,verified#false,name#Winnie,profile_sidebar_border_color#000000,profile_background_color#14DBBA,created_at#Thu Jan 02 10:13:01 +0000 2014,default_profile_image#false,followers_count#311,geo_enabled#false,profile_image_url_https#https://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,profile_background_image_url#http://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,profile_background_image_url_https#https://pbs.twimg.com/profile_background_images/431815421189029888/YrRNpUfd.jpeg,follow_request_sent#,url#,utc_offset#,time_zone#,notifications#,friends_count#455,profile_use_background_image#true,profile_sidebar_fill_color#DDEEF6,screen_name#winnie341881,id_str#2272804116,profile_image_url#http://pbs.twimg.com/profile_images/478106512083017728/4ao_8JjE_normal.jpeg,is_translator#false,listed_count#0])
In elephantbird library
all the values will be stored as key/value pair(ie MAP datatype)
, so it will be easy to extract the required fields from the loaded data.
In the above pigscript i have extracted the value of 'id','created_at' and 'user'
as per your need.
Suppose you want to extract some fields from 'user' data( ex: 'friends_count' and 'followers_count'
), in that case you need to project the 'user'
field and extract the required data. sample code below.
PigScript:
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';
A = LOAD 'input.json ' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS myMap;
B = FOREACH A GENERATE 'user' AS User;
C = FOREACH B GENERATE User#'friends_count', User#'followers_count';
DUMP C;
Output:
(455,311)
这篇关于从Pig Latin中的复杂JSON对象中仅加载一些值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!